JavaRush /Java Blog /Random EN /Cool SQL optimizations that do not depend on the cost mod...

Cool SQL optimizations that do not depend on the cost model. Part 1

Published in the Random EN group
Five simple optimizations that can be implemented based only on metadata (i.e. constraints) and the query itself Cool SQL optimizations that do not depend on the cost model.  Part 1 - 1We offer you an adaptation of Lukas Eder's article, designed for those who have a general understanding of databases and SQL, as well as some practical experience with DBMS. Cost optimization is actually a standard way to optimize SQL queries in modern databases. This is why it is so difficult to manually write a complex algorithm in 3GL (third generation programming languages) whose performance would exceed the dynamically calculated execution plan generated by a modern optimizer. Today we will not discuss cost optimization, that is, optimization based on the cost model of the database. We'll look at much simpler optimizations. Those that can be implemented based only on metadata (i.e. restrictions) and the request itself. Usually their implementation for a database is not a Newton binomial, since, in this case, any optimization will lead to a better execution plan, regardless of the presence of indexes, data volumes and skewness of data distribution. "Not a Newton binomial" is not in the sense of how easy it is to implement the optimization, but whether it should be done. These optimizations eliminate unnecessary, extra work [for the database] ( as opposed to unnecessary, required work, which I already wrote about ).

What are these optimizations used for?

Most of them are used for:
  • bug fixes in queries;
  • allowing views to be reused without the database actually executing the view logic.
In the first case, one could say: “So what, just go ahead and fix this stupid SQL query.” But let the one who has never made a mistake throw a stone at me first. The second case is especially interesting: it gives us the ability to create complex libraries of views and table functions that can be reused across multiple layers.

Databases used

In this article we will compare 10 SQL optimizations in the five most widely used DBMSs ( according to database rankings ):
  • Oracle 12.2;
  • MySQL 8.0.2;
  • SQL Server 2014;
  • PostgreSQL 9.6;
  • DB2 LUW 10.5.
Another rating almost echoes it. As usual, in this article I will be querying the Sakila database .
Cool SQL optimizations that do not depend on the cost model.  Part 1 - 2
Here is a list of these ten types of optimizations:
  1. transitive closure;
  2. impossible predicates and unnecessary table calls;
  3. eliminating JOIN;
  4. elimination of "meaningless" predicates;
  5. projections in EXISTS subqueries;
  6. merging of predicates;
  7. provably empty sets;
  8. constraints CHECK;
  9. unnecessary reflexive connections;
  10. Pushdown predicates
Today we will discuss pp. 1-3, in the second part - 4 and 5, and in part 3 - 6-10.

1. Transitive closure

Let's start with something simpler: transitive closure . This is a trivial concept that applies to many mathematical operations, such as the equality operator. It can be formulated in this case as follows: if A = B and B = C, then A = C.

Not difficult, right? But this has some interesting implications for SQL optimizers. Let's look at an example. Let's extract all movies with ACTOR_ID = 1:

SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE a.actor_id = 1;
The result is as follows:

FIRST_NAME      LAST_NAME  FILM_ID
PENELOPE        GUINESS    1
PENELOPE        GUINESS    23
PENELOPE        GUINESS    25
PENELOPE        GUINESS    106
PENELOPE        GUINESS    140
PENELOPE        GUINESS    166
...
Let's now take a look at the plan for executing this query in the case of the Oracle DBMS:

--------------------------------------------------------------
| Id  | Operation                    | Name          | Rows  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT             |               |       |
|   1 |  NESTED LOOPS                |               |    19 |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR         |     1 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR      |     1 |
|*  4 |   INDEX RANGE SCAN           | PK_FILM_ACTOR |    19 |
--------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("A"."ACTOR_ID"=1)
   4 - access("FA"."ACTOR_ID"=1)
The section on predicates is especially interesting here. The predicate ACTOR_ID = 1, due to transitive closure, applies to both the ACTOR table and the FILM_ACTOR table. If:
A.ACTOR_ID = 1 (из предиката WHERE) и…
• A.ACTOR_ID = FA.ACTOR_ID (из предиката ON)
  То:
• FA.ACTOR_ID = 1
For more complex queries, this produces some very nice results. In particular, the accuracy of cardinality estimates increases significantly, since it becomes possible to select estimates based on a specific constant value of the predicate, and not, for example, the average number of films by actors, as in the following query (returning the same result):

SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE first_name = 'PENELOPE'
AND last_name = 'GUINESS'
His plan:

----------------------------------------------------------------------------
| Id  | Operation                            | Name                | Rows  |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                     |       |
|   1 |  NESTED LOOPS                        |                     |     2 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR               |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME |     3 |
|*  4 |   INDEX RANGE SCAN                   | PK_FILM_ACTOR       |    27 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("A"."FIRST_NAME"='PENELOPE')
   3 - access("A"."LAST_NAME"='GUINESS')
   4 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
As you can see, the number of rows in the FILM_ACTOR table is overestimated, while the NESTED LOOP is underestimated. Here are a couple of interesting values:

SELECT count(*) FROM film_actor WHERE actor_id = 1;
SELECT avg(c) FROM (
  SELECT count(*) c FROM film_actor GROUP BY actor_id
);
Result:

19
27.315
This is where the estimates come from. If the database knows that we are talking about ACTOR_ID = 1, then it can collect statistics on the number of films for this particular actor . If it doesn’t (since the standard statistics collection mechanism does not correlate FIRST_NAME/LAST_NAME with ACTOR_ID), then we will get the average number of films for all actors . A simple, unimportant error in this particular case, but in a complex query it can propagate further, accumulate and lead further into the query (higher up in the plan) to an incorrect JOIN choice. So whenever you can, design your joins and simple predicates to take advantage of transitive closure. What other databases support this feature?

DB2

Yes!

Explain Plan                                               
-----------------------------------------------------------
ID | Operation              |                 Rows | Cost  
 1 | RETURN                 |                      |   13  
 2 |  NLJOIN                |              27 of 1 |   13  
 3 |   FETCH ACTOR          |     1 of 1 (100.00%) |    6  
 4 |    IXSCAN PK_ACTOR     |   1 of 200 (   .50%) |    0  
 5 |   IXSCAN PK_FILM_ACTOR | 27 of 5462 (   .49%) |    6  
Predicate Information                                      
 4 - START (Q2.ACTOR_ID = 1)                               
      STOP (Q2.ACTOR_ID = 1)                               
 5 - START (1 = Q1.ACTOR_ID)                               
      STOP (1 = Q1.ACTOR_ID)
By the way, if you like cool execution plans like this, check out Markus Winand's script .

MySQL

Unfortunately, MySQL execution plans are not well suited for this kind of analysis. The predicate itself is missing from the output information:

ID  SELECT TYPE  TABLE  TYPE   REF    ROWS
------------------------------------------
1   SIMPLE       a      const  const  1 
1   SIMPLE       fa     ref    const  19
But the fact that const is specified twice in the REF column shows that both tables are searching for a constant value. At the same time, the query plan with FIRST_NAME/LAST_NAME looks like this:

ID  SELECT TYPE  TABLE  TYPE   REF         ROWS
-----------------------------------------------
1   SIMPLE       a      ref    const       3 
1   SIMPLE       fa     ref    a.actor_id  27
And as you can see, the REF now references the column from the JOIN predicate. The cardinality score is almost the same as in Oracle. So yes, MySQL supports transitive closures too.

PostgreSQL

Yes!

QUERY PLAN                                                                          
------------------------------------------------------------------------------------
Nested Loop  (cost=4.49..40.24 rows=27 width=15)                                    
  ->  Seq Scan on actor a  (cost=0.00..4.50 rows=1 width=17)                        
        Filter: (actor_id = 1)                                                      
  ->  Bitmap Heap Scan on film_actor fa  (cost=4.49..35.47 rows=27 width=4)         
        Recheck Cond: (actor_id = 1)                                                
        ->  Bitmap Index Scan on film_actor_pkey  (cost=0.00..4.48 rows=27 width=0) 
              Index Cond: (actor_id = 1)

SQL Server

Yes!

  |--Nested Loops(Inner Join)
       |--Nested Loops(Inner Join)
       |    |--Index Seek (SEEK:([a].[actor_id]=(1)))
       |    |--RID Lookup
       |--Index Seek (SEEK:([fa].[actor_id]=(1)))

Summary

All of our databases support transitive closure.
Database Transitive closure
DB2 LUW 10.5 Yes
MySQL 8.0.2 Yes
Oracle 12.2.0.1 Yes
PostgreSQL 9.6 Yes
SQL Server 2014 Yes
However, wait for #6 in the next part of the article. There are complex cases of transitive closure that not all databases can handle.

2. Impossible predicates and unnecessary table calls

This is a completely stupid optimization, but why not? If users write impossible predicates, then why bother executing them? Here are some examples:

-- "Очевидный"
SELECT * FROM actor WHERE 1 = 0
-- "Хитрый"
SELECT * FROM actor WHERE NULL = NULL
The first query will obviously never return any results, but the same statement is true for the second. After all, although NULL IS NULL is always TRUE, the result of the calculation NULL = NULL is NULL, which, according to three-valued logic , is equivalent to FALSE. This is pretty self-explanatory, so let's jump straight into finding out which databases perform this optimization.

DB2

Yes!

Explain Plan                       
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0
As you can see, access to the ACTOR table is completely excluded from the plan. It contains only the GENROW operation, which generates zero rows. Perfect.

MySQL

Yes!

ID  SELECT TYPE  TABLE   EXTRAS
-----------------------------------------
1   SIMPLE         Impossible WHERE
This time, MySQL was kind enough to inform us about the impossible WHERE clause. Thank you! This makes analysis much easier, especially compared to other databases.

Oracle

Yes!

---------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |      0 |
|*  1 |  FILTER            |       |      1 |        |      0 |
|   2 |   TABLE ACCESS FULL| ACTOR |      0 |    200 |      0 |
---------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NULL IS NOT NULL)
We see that the plan still mentions access to the ACTOR table, and the expected number of rows is still 200, but there is also a filtering operation (FILTER) with Id=1, where there will never be TRUE. Due to Oracle's dislike of the standard SQL Boolean data type , Oracle displays NULL IS NOT NULL in the plan, instead of just FALSE. Oh well... But seriously, watch that predicate. I've had occasion to debug execution plans with 1000-line subtrees and extremely high cost values, only to discover after the fact that the entire subtree was being "cut off" by the NULL IS NOT NULL filter. A little discouraging, I tell you.

PostgreSQL

Yes!

QUERY PLAN                                 
-------------------------------------------
Result  (cost=0.00..0.00 rows=0 width=228) 
  One-Time Filter: false
Already better. No annoying ACTOR table calls and a neat little FALSE predicate.

SQL Server?

Yes!

  |--Constant Scan
SQL Server calls this a " constant scan", which is a scan where nothing happens - similar to DB2. All our databases can exclude impossible predicates:
Database Impossible predicates Unnecessary table accesses
DB2 LUW 10.5 Yes Yes
MySQL 8.0.2 Yes Yes
Oracle 12.2.0.1 Yes Yes
PostgreSQL 9.6 Yes Yes
SQL Server 2014 Yes Yes

3. Eliminate JOINs

In the previous section, we observed unnecessary table accesses in single-table queries. But what happens if the JOIN does not require one of several table accesses? I already wrote about eliminating JOIN in a previous post from my blog . The SQL engine is able to determine, based on the type of query and the presence of primary and foreign keys, whether a particular JOIN is actually needed in a given query, or whether eliminating it will not affect the semantics of the query. In all of the next three examples, JOIN is not needed. An inner ...-to-one join can be eliminated by having a NOT NULL foreign key. Instead:

SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id
The database can do the following:

SELECT first_name, last_name
FROM customer c
An INNER JOIN of type "...-to-one" can be replaced if there is a nullable foreign key. The above query works if the foreign key is subject to a NOT NULL constraint. If not, for example, as in this request:

SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id
then JOIN can still be eliminated, but you will have to add the NOT NULL predicate, like this:

SELECT title
FROM film
WHERE original_language_id IS NOT NULL
An OUTER JOIN of the "...-to-one" type can be removed if there is a unique key. Instead of this:

SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id
The database, again, can do the following:

SELECT first_name, last_name
FROM customer c
... even if there is no foreign key for CUSTOMER.ADDRESS_ID. The unique outer connection (DISTINCT OUTER JOIN) of the "...-to-many" type can be removed. Instead of this:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
The database can do the following:

SELECT DISTINCT first_name, last_name
FROM actor a
All these examples were studied in detail in the previous article, so I will not repeat myself, but will just summarize everything that various databases can eliminate:
Database INNER JOIN: ...-to-one (can be NULL): ...-to-one OUTER JOIN: ...-to-one OUTER JOIN DISTINCT: ...-to-many
DB2 LUW 10.5 Yes Yes Yes Yes
MySQL 8.0.2 No No No No
Oracle 12.2.0.1 Yes Yes Yes No
PostgreSQL 9.6 No No Yes No
SQL Server 2014 Yes No Yes Yes
Unfortunately, not all databases can resolve all types of connections. DB2 and SQL Server are the undisputed leaders here! To be continued
Comments
TO VIEW ALL COMMENTS OR TO MAKE A COMMENT,
GO TO FULL VERSION