JavaRush /Java Blog /Random EN /Cool SQL optimizations that don't depend on the cost mode...

Level 41

8 August 2023
232 views
0 comments

Cool SQL optimizations that don't depend on the cost model. Part 3

Cool SQL optimizations that don't depend on the cost model. Part 1 Cool SQL optimizations that don't depend on the cost model. Part 2

6. Merging predicates

This is an interesting possibility that I once stumbled upon, mistakenly assuming that my DBMS was capable of this. Consider the following query:

SELECT *
FROM actor
WHERE actor_id IN (2, 3, 4)
AND actor_id IN (1, 2, 3);

Obviously, the two predicates intersect and can be merged together. You can expect the database to translate the above query into the following:

SELECT *
FROM actor
WHERE actor_id IN (2, 3);

Looks pretty obvious, right? This is a more complicated case of transitive closure. Another case of it is the merging of two ranges. When making a request:

SELECT *
FROM film
WHERE film_id BETWEEN 1 AND 100
AND film_id BETWEEN 99 AND 200

we hope the database will rewrite the query like this:

SELECT *
FROM film
WHERE film_id BETWEEN 99 AND 100

The cardinality of the second option would be 2 rows, but in the first case the database might not understand that the ranges could be combined and would choose a full table scan when it should have used an index. What databases are capable of these optimizations?

DB2

Merge predicates IN Yes

Explain Plan
--------------------------------------------------
ID | Operation         |               Rows | Cost
 1 | RETURN            |                    |   11
 2 |  FETCH ACTOR      |   2 of 2 (100.00%) |   11
 3 |   IXSCAN PK_ACTOR | 2 of 200 (  1.00%) |    0

Predicate Information
3 - SARG Q3.ACTOR_ID IN (2, 3)

Range Predicate Merging Yes (but don't let the plan fool you!)

Explain Plan
--------------------------------------------------
ID | Operation        |                Rows | Cost
 1 | RETURN           |                     |   13
 2 |  FETCH FILM      |    2 of 2 (100.00%) |   13
 3 |   IXSCAN PK_FILM | 2 of 1000 (   .20%) |    6

Predicate Information
3 - START (99 <= Q1.FILM_ID)
      STOP (Q1.FILM_ID <= 100)
      SARG (Q1.FILM_ID <= 200)
      SARG (1 <= Q1.FILM_ID)

As you can see, the predicate was not fully optimized. The filter (SARG), which checks for a hit between the lower and upper bounds of the combined range, is in place, but the START and STOP operations are more important, indicating fast index access. In addition, the cardinality is also the same as it should be. If you want to be sure, run the query with the following impossible predicate

SELECT *
FROM film
WHERE film_id BETWEEN 1 AND 2
AND film_id BETWEEN 199 AND 200;

and you will get the correct plan:

Explain Plan
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information
 2 - RESID (1 = 0)

MySQL

Merging IN predicates Again, unfortunately, MySQL doesn't display predicate information well. The plans for both requests are the same:

ID  TABLE  TYPE   KEY      ROWS  FILTERED  EXTRA
------------------------------------------------------
1   actor  range  PRIMARY  2     100.00    Using where

Two times one cardinality, two times "Using where" without any hint of what is actually going on inside "where" . But based on the cardinality, we can conclude that the transformation was performed correctly. You can look at it from the other side, for which we will execute the query:

SELECT * FROM actor
WHERE actor_id IN (3, 4, 5)
AND actor_id IN (1, 2, 3);

Which should be converted to the following:

SELECT * FROM actor
WHERE actor_id = 3;

And indeed, this is what happens:

ID  TABLE  TYPE   KEY      ROWS  FILTERED  EXTRA
------------------------------------------------------
1   actor  const  PRIMARY  1     100.00

Notice that TYPE=range has changed to TYPE=const. So, we can conclude that yes, MySQL performs this optimization. Merging Range Predicates Again, the query plan yields nothing:

ID  TABLE  TYPE   KEY      ROWS  FILTERED  EXTRA
------------------------------------------------------
1   film   range  PRIMARY  2     100.00    Using where

But it is possible to confirm the optimization is performed with the following "impossible" predicate:

SELECT *
FROM film
WHERE film_id BETWEEN 1 AND 2
AND film_id BETWEEN 199 AND 200

in which case the plan changes to this:

ID  TABLE  EXTRA
-----------------------------------------
1          no matching row in const table

So, again good news regarding MySQL.

Oracle

Merge predicates IN Yes

----------------------------------------------------------
| Id  | Operation                    | Name     | E-Rows |
----------------------------------------------------------
|   0 | SELECT STATEMENT             |          |        |
|   1 |  INLIST ITERATOR             |          |        |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |      2 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |      2 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(("ACTOR_ID"=2 OR "ACTOR_ID"=3))

The applied predicate only includes the values 2 and 3 , so the conversion worked correctly. Merging range predicates Again, yes:

----------------------------------------------------------------
| Id  | Operation                           | Name    | E-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT                    |         |        |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| FILM    |      2 |
|*  2 |   INDEX RANGE SCAN                  | PK_FILM |      2 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("FILM_ID">=99 AND "FILM_ID"<=100)

PostgreSQL

Merging IN predicates Alas, no, there is no optimization!

QUERY PLAN
-----------------------------------------------------------------------------------------------
Seq Scan on actor  (cost=0.00..5.50 rows=1 width=25)
  Filter: ((actor_id = ANY ('{2,3,4}'::integer[])) AND (actor_id = ANY ('{1,2,3}'::integer[])))

Both predicates are still present in the execution plan, and the cardinality estimate is wrong, it should be 2 , not 1 . If we manually transform the query, we would get the following query plan:

QUERY PLAN
-----------------------------------------------------
Seq Scan on actor  (cost=0.00..4.50 rows=2 width=25)
  Filter: (actor_id = ANY ('{2,3}'::integer[]))

In particular, we see an incorrect plan in the case when two predicates do not intersect and an "impossible" predicate is formed:

SELECT *
FROM actor
WHERE actor_id IN (2, 3, 4)
AND actor_id IN (7, 8, 9)
Опять неправильный план:
QUERY PLAN
-----------------------------------------------------------------------------------------------
Seq Scan on actor  (cost=0.00..5.50 rows=1 width=25)
  Filter: ((actor_id = ANY ('{2,3,4}'::integer[])) AND (actor_id = ANY ('{7,8,9}'::integer[])))

Bummer! Merging range predicates Looks no better:

QUERY PLAN
--------------------------------------------------------------------------------------------
Index Scan using film_pkey on film  (cost=0.28..8.30 rows=1 width=386)
  Index Cond: ((film_id >= 1) AND (film_id <= 100) AND (film_id >= 99) AND (film_id <= 200))

It's hard to say if it worked or not. In the end, we got the right plan with reasonable cardinality, so that everything can work, just like on DB2. But what happens, again, if you create an "impossible" predicate?

SELECT *
FROM film
WHERE film_id BETWEEN 1 AND 2
AND film_id BETWEEN 199 AND 200;

The plan got worse:

QUERY PLAN
-------------------------------------------------------------------------------------------
Index Scan using film_pkey on film  (cost=0.28..8.42 rows=5 width=386)
  Index Cond: ((film_id >= 1) AND (film_id >= 2) AND (film_id >= 199) AND (film_id >= 200))

Cardinality has gone up instead of down. And, in the end, such a request should not be executed at all. Fat minus PostgreSQL.

SQL Server

Merge predicates IN Yes, everything works:

|--Nested Loops(Inner Join)
     |--Index Seek(SEEK:([actor_id]=(2) OR [actor_id]=(3)))
     |--RID Lookup(OBJECT:([actor]))

Merging range predicates Again, similar to the DB2 case:

|--Nested Loops(Inner Join)
     |--Index Seek(SEEK:([film_id] >= (1) AND [film_id] <= (100)), WHERE:([film_id]>=(99) AND [film_id]<=(200)))
     |--RID Lookup(OBJECT:([film]))

Unfortunately, note the difference between SEEK and WHERE . I would like to see the [99, 100] range in SEEK as in DB2, since SEEK is fast due to index access in O(log N) time , while WHERE access time grows linearly, on the order of O(N) . Bummer! It seems to me that this is a programming error, because an impossible predicate leads to a much more valid one:

|--Constant Scan

Summary

Keep in mind that there are many predicates that merge correctly in some databases and not in others. When in doubt, be sure to check the execution plan!

Database	Merge IN	Merging Ranges
DB2 LUW 10.5	Yes	Yes
MySQL 8.0.2	Yes	Yes
Oracle 12.2.0.1	Yes	Yes
PostgreSQL 9.6	No	No
SQL Server 2014	Yes	No

7. Provably empty sets

This is a particularly cool feature. We have already seen impossible predicates and unnecessary table calls above . What if we do it again, but now with a JOIN ? Will JOIN elimination work here ? Let's try the following queries: IS NULL predicate on column with NOT NULL constraint The predicate in the WHERE clause cannot be TRUE because the FILM_ID column has a NOT NULL constraint .

SELECT first_name, last_name
FROM actor a
JOIN (
  SELECT *
  FROM film_actor
  WHERE film_id IS NULL
) fa ON a.actor_id = fa.actor_id;

The FA view will not return any columns because the NOT NULL constraint on the FA.FILM_ID column makes it provably empty. And since INNER JOIN with an empty table also returns no rows, there is no need to access the ACTOR table , so the above query should be rewritten like this:

SELECT NULL AS first_name, NULL AS last_name
WHERE 1 = 0;

Intersection of nullable columns with columns with NOT NULL constraint In principle, this is equivalent to the previous example, only with slightly more convoluted syntax:

SELECT *
FROM actor a
JOIN (
  SELECT actor_id, film_id
  FROM film_actor
  INTERSECT
  SELECT NULL, NULL
  FROM dual
) fa ON a.actor_id = fa.actor_id;

Due to the NOT NULL constraints on both the FA.ACTOR_ID and FA.FILM_ID columns , their intersection with the (NULL, NULL) tuple will return no results, so the view is provably empty, and hence the inner join can be eliminated. And again, with the EXISTS subquery Finally, repeat the above query, but this time with a semi join instead of an inner join . First with the impossible predicate:

SELECT *
FROM actor a
WHERE a.actor_id IN (
  SELECT actor_id
  FROM film_actor
  WHERE actor_id IS NULL
);

... and then again with the intersection.

SELECT *
FROM actor a
WHERE a.actor_id IN (
  SELECT actor_id
  FROM film_actor
  INTERSECT
  SELECT NULL
  FROM sysibm.dual
)

Forward. Let's see which databases can perform these optimizations.

DB2

Joining a provably empty set (predicate IS NULL ):

Explain Plan
-----------------------------------
*ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information
 2 - RESID (1 = 0)

Joining a provably empty set (INTERSECT) :

Explain Plan
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information
 2 - RESID (1 = 0)

Semijoin of a provably empty set (predicate IS NULL ):

Explain Plan
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information
 2 - RESID (1 = 0)

Semijoin of a provably empty set (INTERSECT) :

Explain Plan
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information
 2 - RESID (1 = 0)

Wow, cool! Looks like a race winner!

MySQL

Joining a provably empty set (predicate IS NULL ):

ID  TABLE   EXTRA
----------------------------
1           Impossible WHERE

Cool, I didn't expect it! Joining a provably empty set (INTERSECT) : Alas, MySQL does not support INTERSECT . Semijoin of a provably empty set (predicate IS NULL ):

ID  TABLE   EXTRA
----------------------------
1           Impossible WHERE

Provably Empty Set Semi-Join (INTERSECT) : Alas, MySQL does not support INTERSECT . But still, MySQL shows a great result!

Oracle

Joining a provably empty set (predicate IS NULL ):

---------------------------------------------------------------------------
| Id  | Operation              | Name          | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |               |      1 |        |      0 |
|*  1 |  FILTER                |               |      1 |        |      0 |
|*  2 |   HASH JOIN            |               |      0 |   5462 |      0 |
|   3 |    TABLE ACCESS FULL   | ACTOR         |      0 |    200 |      0 |
|   4 |    INDEX FAST FULL SCAN| PK_FILM_ACTOR |      0 |   5462 |      0 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(NULL IS NOT NULL)
   2 - access("A"."ACTOR_ID"="FILM_ACTOR"."ACTOR_ID")

Again, a very strange execution plan in Oracle, but the NULL IS NOT NULL filter is in place, and it is in front of all other operations, which are thus not executed. Joining a provably empty set (INTERSECT) :

---------------------------------------------------------------------------------
| Id  | Operation                    | Name          | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |               |      1 |        |      0 |
|   1 |  NESTED LOOPS                |               |      1 |      1 |      0 |
|   2 |   NESTED LOOPS               |               |      1 |      1 |      0 |
|   3 |    VIEW                      |               |      1 |      1 |      0 |
|   4 |     INTERSECTION             |               |      1 |        |      0 |
|   5 |      SORT UNIQUE             |               |      1 |   5462 |   5463 |
|   6 |       INDEX FAST FULL SCAN   | PK_FILM_ACTOR |      1 |   5462 |   5463 |
|   7 |      FAST DUAL               |               |      1 |      1 |      1 |
|*  8 |    INDEX UNIQUE SCAN         | PK_ACTOR      |      0 |      1 |      0 |
|   9 |   TABLE ACCESS BY INDEX ROWID| ACTOR         |      0 |      1 |      0 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")

It's funny. As you can see, this execution plan scans the entire primary key of the FILM_ACTOR table . This can save us from accessing the ACTOR table and the primary key index, since the derived table is processed first (in which there is not a single row), but operations with Id=5 and 6 still should not be here. Bummer! Semijoin of a provably empty set (predicate IS NULL ): And this again works correctly:

-------------------------------------------------------------------------------------
| Id  | Operation              | Name                    | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                         |      1 |        |      0 |
|*  1 |  FILTER                |                         |      1 |        |      0 |
|*  2 |   HASH JOIN SEMI       |                         |      0 |    200 |      0 |
|   3 |    TABLE ACCESS FULL   | ACTOR                   |      0 |    200 |      0 |
|   4 |    INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |      0 |   5462 |      0 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(NULL IS NOT NULL)
   2 - access("A"."ACTOR_ID"="ACTOR_ID")

... with the same weird execution plan containing a non-executable subtree. Semijoin of a provably empty set (INTERSECT) : Again, no optimization:

-------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                    | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                         |      1 |        |      0 |
|   1 |  NESTED LOOPS                |                         |      1 |      1 |      0 |
|   2 |   NESTED LOOPS               |                         |      1 |      1 |      0 |
|   3 |    VIEW                      | VW_NSO_1                |      1 |      1 |      0 |
|   4 |     INTERSECTION             |                         |      1 |        |      0 |
|   5 |      SORT UNIQUE             |                         |      1 |   5462 |    200 |
|   6 |       INDEX FAST FULL SCAN   | IDX_FK_FILM_ACTOR_ACTOR |      1 |   5462 |   5463 |
|   7 |      FAST DUAL               |                         |      1 |      1 |      1 |
|*  8 |    INDEX UNIQUE SCAN         | PK_ACTOR                |      0 |      1 |      0 |
|   9 |   TABLE ACCESS BY INDEX ROWID| ACTOR                   |      0 |      1 |      0 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access("A"."ACTOR_ID"="ACTOR_ID")

Not too good results!

PostgreSQL

Much to our dismay, PostgreSQL does not perform well in this experiment! Joining a provably empty set (predicate IS NULL ): Nope:

QUERY PLAN
--------------------------------------------------------------------------------------------
Hash Join  (cost=8.31..13.07 rows=1 width=13)
  Hash Cond: (a.actor_id = film_actor.actor_id)
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)
  ->  Hash  (cost=8.30..8.30 rows=1 width=2)
        ->  Index Scan using idx_fk_film_id on film_actor  (cost=0.28..8.30 rows=1 width=2)
              Index Cond: (film_id IS NULL)

Joining a provably empty set (INTERSECT) : Even worse:

QUERY PLAN
---------------------------------------------------------------------------------------------------
Hash Join  (cost=166.60..171.36 rows=1 width=29)
  Hash Cond: (a.actor_id = fa.actor_id)
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=25)
  ->  Hash  (cost=166.59..166.59 rows=1 width=4)
        ->  Subquery Scan on fa  (cost=0.00..166.59 rows=1 width=4)
              ->  HashSetOp Intersect  (cost=0.00..166.58 rows=1 width=8)
                    ->  Append  (cost=0.00..139.26 rows=5463 width=8)
                          ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=8)
                                ->  Result  (cost=0.00..0.01 rows=1 width=4)
                          ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..139.24 rows=5462 width=8)
                                ->  Seq Scan on film_actor  (cost=0.00..84.62 rows=5462 width=4)

Semi join of provably empty set (predicate IS NULL ) : Same as with inner join:

QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Semi Join  (cost=6.06..10.60 rows=1 width=25)
  Hash Cond: (a.actor_id = film_actor.actor_id)
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=25)
  ->  Hash  (cost=6.05..6.05 rows=1 width=2)
        ->  Index Only Scan using film_actor_pkey on film_actor  (cost=0.28..6.05 rows=1 width=2)
              Index Cond: (actor_id IS NULL)

Semijoin of a provably empty set (INTERSECT) : As expected:

QUERY PLAN
--------------------------------------------------------------------------------------------------
Hash Semi Join  (cost=152.94..157.48 rows=1 width=25)
  Hash Cond: (a.actor_id = "ANY_subquery".actor_id)
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=25)
  ->  Hash  (cost=152.93..152.93 rows=1 width=2)
        ->  Subquery Scan on "ANY_subquery"  (cost=0.00..152.93 rows=1 width=2)
              ->  HashSetOp Intersect  (cost=0.00..152.92 rows=1 width=6)
                    ->  Append  (cost=0.00..139.26 rows=5463 width=6)
                          ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=6)
                                ->  Result  (cost=0.00..0.01 rows=1 width=2)
                          ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..139.24 rows=5462 width=6)
                                ->  Seq Scan on film_actor  (cost=0.00..84.62 rows=5462 width=2)

SQL Server

SQL Server is here in all its glory, as is DB2! Joining a provably empty set (predicate IS NULL ) :

|--Constant Scan

Joining a provably empty set (INTERSECT) :

|--Constant Scan

Semijoin of a provably empty set (predicate IS NULL ) :

|--Constant Scan

Semijoin of a provably empty set (INTERSECT) :

|--Constant Scan

Summary

Database	JOIN/NULL	JOIN/INTERSECT	SEMI JOIN / NULL	SEMI JOIN/INTERSECT
DB2 LUW 10.5	Yes	Yes	Yes	Yes
MySQL 8.0.2	Yes	Not supported	Yes	Not supported
Oracle 12.2.0.1	Yes	No	Yes	No
PostgreSQL 9.6	No	No	No	No
SQL Server 2014	Yes	Yes	Yes	Yes

Note that this can be done in many other ways. Feel free to comment and suggest your own "provably empty sets" to test how these databases optimize them.