JavaRush /Java Blog /Random EN /Cool SQL optimizations that don't depend on the cost mode...

Cool SQL optimizations that don't depend on the cost model. Part 1

Published in the Random EN group
Five simple optimizations that can be implemented based on metadata (i.e. constraints) and the query itself. Cool SQL optimizations that don't depend on the cost model.  Part 1 - 1Here is an adaptation of Lukas Eder's article for those who have a basic understanding of databases and SQL, as well as a little practical experience with DBMS. Cost optimization is the de facto standard way to optimize SQL queries in modern databases. That is why it is so difficult to manually write a complex algorithm in 3GL (third generation programming languages), whose performance would exceed the dynamically calculated execution plan generated by a modern optimizer. Today we will not discuss cost optimization, that is, optimization based on the database cost model. We'll look at much simpler optimizations. Those that can be implemented on the basis of metadata alone (i.e., constraints) and the request itself. Usually their implementation for the database is not Newton's binomial, since, in this case, any optimization will lead to a better execution plan, regardless of the presence of indexes, data volumes and data distribution asymmetry. "Not Newton's binomial" is not in the sense of ease of implementation of the optimization, but in whether it should be done. These optimizations eliminate [for the database] unnecessary, extra work (as opposed to the unnecessary, obligatory work I wrote about earlier ).

What are these optimizations used for?

Most of them are used for:
  • fixing errors in requests;
  • Ensuring that views are reused without actually executing the view logic by the database.
In the first case, one could say, "So what, just go ahead and fix that stupid SQL query." But let the first one to throw a stone at me is the one who has never made a mistake. The second case is particularly interesting: it gives us the ability to create complex libraries of views and table functions that can be reused across multiple layers.

Databases Used

In this article, we will be comparing 10 SQL optimizations in the five most widely used RDBMSs ( according to Database Rankings ):
  • Oracle 12.2;
  • MySQL 8.0.2;
  • SQL Server 2014;
  • PostgreSQL 9.6;
  • DB2 LUW 10.5.
Another rating almost echoes him. As usual, in this article, I will be querying the Sakila database .
Cool SQL optimizations that don't depend on the cost model.  Part 1 - 2
Here is a list of these ten types of optimizations:
  1. transitive closure;
  2. impossible predicates and unnecessary table calls;
  3. elimination of JOIN;
  4. elimination of "meaningless" predicates;
  5. projections in EXISTS subqueries;
  6. merging predicates;
  7. provably empty sets;
  8. CHECK restrictions;
  9. unnecessary reflective connections;
  10. Pushdown Predicates
Today we will discuss pp. 1-3, in the second part - 4 and 5, and in part 3 - 6-10.

1. Transitive closure

Let's start with something simpler: a transitive closure . This is a trivial concept applicable to many mathematical operations, such as the equality operator. It can be formulated in this case as follows: if A = B and B = C, then A = C.

Easy, right? But this has some interesting implications for SQL optimizers. Consider an example. Retrieve all movies with ACTOR_ID = 1:
SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE a.actor_id = 1;
The result is the following:
FIRST_NAME      LAST_NAME  FILM_ID
PENELOPE        GUINESS    1
PENELOPE        GUINESS    23
PENELOPE        GUINESS    25
PENELOPE        GUINESS    106
PENELOPE        GUINESS    140
PENELOPE        GUINESS    166
...
Let's now take a look at the execution plan for this query in the case of the Oracle DBMS:
--------------------------------------------------------------
| Id  | Operation                    | Name          | Rows  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT             |               |       |
|   1 |  NESTED LOOPS                |               |    19 |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR         |     1 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR      |     1 |
|*  4 |   INDEX RANGE SCAN           | PK_FILM_ACTOR |    19 |
--------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("A"."ACTOR_ID"=1)
   4 - access("FA"."ACTOR_ID"=1)
Particularly interesting here is the section of predicates. The predicate ACTOR_ID = 1, due to transitive closure, applies to both the ACTOR table and the FILM_ACTOR table. If:
A.ACTOR_ID = 1 (из предиката WHERE) и…
• A.ACTOR_ID = FA.ACTOR_ID (из предиката ON)
  То:
• FA.ACTOR_ID = 1
In the case of more complex queries, this leads to some very nice results. In particular, the accuracy of the cardinality estimates is significantly improved, since it becomes possible to select estimates based on a specific constant value of the predicate, and not, for example, the average number of films by actors, as in the following query (returning the same result):
SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE first_name = 'PENELOPE'
AND last_name = 'GUINESS'
His plan:
----------------------------------------------------------------------------
| Id  | Operation                            | Name                | Rows  |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                     |       |
|   1 |  NESTED LOOPS                        |                     |     2 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR               |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME |     3 |
|*  4 |   INDEX RANGE SCAN                   | PK_FILM_ACTOR       |    27 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("A"."FIRST_NAME"='PENELOPE')
   3 - access("A"."LAST_NAME"='GUINESS')
   4 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
As you can see, the number of rows in the FILM_ACTOR table is overestimated, while the nested loops (NESTED LOOP) are underestimated. Here are a couple of interesting values:
SELECT count(*) FROM film_actor WHERE actor_id = 1;
SELECT avg(c) FROM (
  SELECT count(*) c FROM film_actor GROUP BY actor_id
);
Result:
19
27.315
This is where the ratings come from. If the database knows that this is about ACTOR_ID = 1, then it can collect statistics on the number of films for this particular actor . If it doesn't (because the standard statistics collection mechanism doesn't correlate FIRST_NAME/LAST_NAME with ACTOR_ID), then we'll get the average number of films for all actors . A simple, insignificant error in this particular case, but in a complex query it can propagate further, accumulate and lead further in the query (higher in the plan) to an incorrect JOIN choice. So whenever you can, design your joins and simple predicates to take advantage of transitive closure. What other databases support this feature?

DB2

Yes!
Explain Plan
-----------------------------------------------------------
ID | Operation              |                 Rows | Cost
 1 | RETURN                 |                      |   13
 2 |  NLJOIN                |              27 of 1 |   13
 3 |   FETCH ACTOR          |     1 of 1 (100.00%) |    6
 4 |    IXSCAN PK_ACTOR     |   1 of 200 (   .50%) |    0
 5 |   IXSCAN PK_FILM_ACTOR | 27 of 5462 (   .49%) |    6
Predicate Information
 4 - START (Q2.ACTOR_ID = 1)
      STOP (Q2.ACTOR_ID = 1)
 5 - START (1 = Q1.ACTOR_ID)
      STOP (1 = Q1.ACTOR_ID)
By the way, if you like cool execution plans like this, check out Markus Winand's script .

MySQL

Unfortunately, MySQL's execution plans are not well suited for this kind of analysis. The output does not contain the predicate itself:
ID  SELECT TYPE  TABLE  TYPE   REF    ROWS
------------------------------------------
1   SIMPLE       a      const  const  1
1   SIMPLE       fa     ref    const  19
But the fact that const is listed twice in the REF column shows that both tables are looking for a constant value. At the same time, the query plan with FIRST_NAME / LAST_NAME looks like this:
ID  SELECT TYPE  TABLE  TYPE   REF         ROWS
-----------------------------------------------
1   SIMPLE       a      ref    const       3
1   SIMPLE       fa     ref    a.actor_id  27
And as you can see, the REF now has a column reference from the JOIN predicate. The cardinality score is almost the same as in Oracle. So yes, MySQL supports transitive closure too.

PostgreSQL

Yes!
QUERY PLAN
------------------------------------------------------------------------------------
Nested Loop  (cost=4.49..40.24 rows=27 width=15)
  ->  Seq Scan on actor a  (cost=0.00..4.50 rows=1 width=17)
        Filter: (actor_id = 1)
  ->  Bitmap Heap Scan on film_actor fa  (cost=4.49..35.47 rows=27 width=4)
        Recheck Cond: (actor_id = 1)
        ->  Bitmap Index Scan on film_actor_pkey  (cost=0.00..4.48 rows=27 width=0)
              Index Cond: (actor_id = 1)

SQL Server

Yes!
|--Nested Loops(Inner Join)
     |--Nested Loops(Inner Join)
     |    |--Index Seek (SEEK:([a].[actor_id]=(1)))
     |    |--RID Lookup
     |--Index Seek (SEEK:([fa].[actor_id]=(1)))

Summary

All of our databases support transitive closure.
Database transitive closure
DB2 LUW 10.5 Yes
MySQL 8.0.2 Yes
Oracle 12.2.0.1 Yes
PostgreSQL 9.6 Yes
SQL Server 2014 Yes
However, wait for #6 in the next part of the article. There are tricky cases of transitive closure that not all databases handle.

2. Impossible predicates and unnecessary table calls

This is a completely stupid optimization, but why not? If users write impossible predicates, why even execute them? Here are some examples:
-- "Очевидный"
SELECT * FROM actor WHERE 1 = 0
-- "Хитрый"
SELECT * FROM actor WHERE NULL = NULL
The first query will obviously never return any results, but the same is true for the second. After all, although NULL IS NULL is always TRUE, the result of the calculation NULL = NULL is NULL, which, according to three-valued logic , is equivalent to FALSE. This doesn't require much explanation, so let's jump straight into finding out which databases perform this optimization.

DB2

Yes!
Explain Plan
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0
As you can see, access to the ACTOR table is completely excluded from the plan. It contains only the GENROW operation, which generates zero rows. Perfect.

MySQL

Yes!
ID  SELECT TYPE  TABLE   EXTRAS
-----------------------------------------
1   SIMPLE         Impossible WHERE
This time, MySQL was kind enough to let us know about the impossible WHERE clause. Thank you! This greatly facilitates analysis, especially compared to other databases.

Oracle

Yes!
---------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |      0 |
|*  1 |  FILTER            |       |      1 |        |      0 |
|   2 |   TABLE ACCESS FULL| ACTOR |      0 |    200 |      0 |
---------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NULL IS NOT NULL)
We see that the plan still refers to the ACTOR table, and the expected number of rows is still 200, but there is also a filtering operation (FILTER) with Id=1, where it will never be TRUE. Due to Oracle's dislike of the standard SQL boolean data type , Oracle maps NULL IS NOT NULL in the plan, instead of just FALSE. Well... But seriously, watch out for this predicate. I've debugged execution plans with 1000 row subtrees and extremely high cost values, only to find out after the fact that the entire subtree was "cut off" by the NULL IS NOT NULL filter. A little discouraging, I tell you.

PostgreSQL

Yes!
QUERY PLAN
-------------------------------------------
Result  (cost=0.00..0.00 rows=0 width=228)
  One-Time Filter: false
Already better. No annoying reference to the ACTOR table and a neat little FALSE predicate.

SQL Server?

Yes!
|--Constant Scan
SQL Server calls this a " constant lookup", which means a lookup where nothing happens - similar to DB2. All our databases are able to exclude impossible predicates:
Database Impossible Predicates Unnecessary table references
DB2 LUW 10.5 Yes Yes
MySQL 8.0.2 Yes Yes
Oracle 12.2.0.1 Yes Yes
PostgreSQL 9.6 Yes Yes
SQL Server 2014 Yes Yes

3. Eliminate JOIN

In the previous section, we saw unnecessary table accesses in single-table queries. But what happens if the JOIN doesn't require one of several table accesses? I already wrote about eliminating JOINs in a previous post from my blog . The SQL engine is able to determine, based on the type of query and the presence of primary and foreign keys, whether a particular JOIN is really needed in a given query, or whether removing it will not affect the semantics of the query. In all of the following three examples, JOIN is not needed. A ...-to-one inner join can be eliminated by having a NOT NULL foreign key Instead of this:
SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id
The database can do the following:
SELECT first_name, last_name
FROM customer c
An INNER JOIN of the "...-to-one" type can be replaced with a nullable foreign key. The above query works if the foreign key has a NOT NULL constraint. If not, for example, as in this query:
SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id
then JOIN can still be eliminated, but you have to add a NOT NULL predicate, like this:
SELECT title
FROM film
WHERE original_language_id IS NOT NULL
An OUTER JOIN of the "...-to-one" type can be removed if there is a unique key. Instead of this:
SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id
The database, again, can do the following:
SELECT first_name, last_name
FROM customer c
... even if there is no foreign key by CUSTOMER.ADDRESS_ID. The unique outer join (DISTINCT OUTER JOIN) of type "...-to-many" can be removed. Instead of this:
SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
The database can do the following:
SELECT DISTINCT first_name, last_name
FROM actor a
All of these examples have been covered in detail in the previous article, so I won't repeat myself, but will just summarize everything that different databases can eliminate:
Database INNER JOIN: ...-to-one (may be NULL): ...-to-one OUTER JOIN: ...-to-one OUTER JOIN DISTINCT: ...-to-many
DB2 LUW 10.5 Yes Yes Yes Yes
MySQL 8.0.2 No No No No
Oracle 12.2.0.1 Yes Yes Yes No
PostgreSQL 9.6 No No Yes No
SQL Server 2014 Yes No Yes Yes
Unfortunately, not all databases can resolve all kinds of joins. DB2 and SQL Server are the undisputed leaders here! To be continued
Comments
TO VIEW ALL COMMENTS OR TO MAKE A COMMENT,
GO TO FULL VERSION