Cool SQL optimizations that do not depend on the cost model. Part 1
4. Elimination of "meaningless" predicates
Equally meaningless are predicates that are (almost) always true. As you can imagine, if you are asking:
SELECT * FROM actor WHERE 1 = 1;
...then the databases will not actually execute it, but will simply ignore it.
I once answered a question about this on Stack Overflow and that is why I decided to write this article. I'll leave testing this as an exercise to the reader, but what happens if the predicate is a little less "meaningless"? For example:
SELECT * FROM film WHERE release_year = release_year;
Do you really need to compare the value to itself for each row? No, there is no value for which this predicate would be
FALSE , right? But we still need to check it. Although the predicate cannot be equal to
FALSE , it may well be equal to
NULL everywhere , again due to three-valued logic.
The RELEASE_YEAR column is nullable, and if any of the rows have
RELEASE_YEAR IS NULL , then
NULL = NULL results in
NULL and the row must be eliminated. So the request becomes the following:
SELECT * FROM film WHERE release_year IS NOT NULL;
Which databases do this?
DB2
Yes!
Explain Plan
-------------------------------------------------
ID | Operation | Rows | Cost
1 | RETURN | | 49
2 | TBSCAN FILM | 1000 of 1000 (100.00%) | 49
Predicate Information
2 - SARG Q1.RELEASE_YEAR IS NOT NULL
MySQL
It's a shame, but MySQL, again, doesn't map predicates into execution plans, so figuring out whether MySQL implements this particular optimization is a little tricky. You can perform a performance assessment and see if any large-scale comparisons are being made. Or you can add an index:
CREATE INDEX i_release_year ON film (release_year);
And get plans for the following requests in return:
SELECT * FROM film WHERE release_year = release_year;
SELECT * FROM film WHERE release_year IS NOT NULL;
If the optimization works, then the plans of both queries should be approximately the same. But in this case this is not the case:
ID TABLE POSSIBLE_KEYS ROWS FILTERED EXTRA
------------------------------------------------------
1 film 1000 10.00 Using where
ID TABLE POSSIBLE_KEYS ROWS FILTERED EXTRA
------------------------------------------------------
1 film i_release_year 1000 100.00 Using where
As you can see, our two queries differ significantly in the values of the
POSSIBLE_KEYS and
FILTERED columns . So I'd venture a reasonable guess that MySQL doesn't optimize this.
Oracle
Yes!
----------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |
----------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | |
|* 1 | TABLE ACCESS FULL| FILM | 1 | 1000 |
----------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("RELEASE_YEAR" IS NOT NULL)
PostgreSQL
Unfortunately no!
QUERY PLAN
--------------------------------------------------------------
Seq Scan on film (cost=0.00..67.50 rows=5 width=386)
Filter: ((release_year)::integer = (release_year)::integer)
Plans and costs vary. Namely, look at the evaluation of cardinality, which is absolutely no good, while this predicate:
SELECT * FROM film WHERE release_year IS NOT NULL;
gives much better results:
QUERY PLAN
---------------------------------------------------------
Seq Scan on film (cost=0.00..65.00 rows=1000 width=386)
Filter: (release_year IS NOT NULL)
Bummer!
SQL Server
Oddly enough, SQL Server doesn't seem to do this either:
|--Table Scan(OBJECT:([film]), WHERE:([release_year]=[release_year]))
However, based on the appearance of the plan, the cardinality assessment is correct, as is the cost. But in my experience with SQL Server, I would say that in this case, no optimization occurs, since SQL Server would display the actually executed predicate in the plan (to see why, take a look at the
CHECK constraint examples below). What about "meaningless" predicates on
NOT NULL columns? The above conversion was only necessary because
RELEASE_YEAR can be undefined. What happens if you run the same meaningless query on, for example, the
FILM_ID column ?
SELECT * FROM film WHERE film_id = film_id
Does it now correspond to no predicate at all? Or at least that's how it should be. But is it?
DB2
Yes!
Explain Plan
-------------------------------------------------
ID | Operation | Rows | Cost
1 | RETURN | | 49
2 | TBSCAN FILM | 1000 of 1000 (100.00%) | 49
No predicates are applied at all and we select all movies.
MySQL
Yes! (Again, educated guess)
ID TABLE POSSIBLE_KEYS ROWS FILTERED EXTRA
------------------------------------------------------
1 film 1000 100.00
Notice that the
EXTRA column is now empty, as if we don't have a
WHERE clause at all!
Oracle
Yes!
----------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |
----------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | |
| 1 | TABLE ACCESS FULL| FILM | 1 | 1000 |
----------------------------------------------------
Again, no predicates apply.
PostgreSQL
Wow, no again!
QUERY PLAN
------------------------------------------------------
Seq Scan on film (cost=0.00..67.50 rows=5 width=386)
Filter: (film_id = film_id)
The filter is applied and the cardinality score is still 5. Bummer!
SQL Server
And here again no!
|--Table Scan(OBJECT:([film]), WHERE:([film_id]=[film_id]))
Summary
It seems like a simple optimization, but it is not used in all DBMSs; in particular, oddly enough, it is not used in SQL Server!
Database |
Meaningless but necessary predicates (NULL semantics) |
Meaningless and unnecessary predicates (non-NULL semantics) |
DB2 LUW 10.5 |
Yes |
Yes |
MySQL 8.0.2 |
No |
Yes |
Oracle 12.2.0.1 |
Yes |
Yes |
PostgreSQL 9.6 |
No |
No |
SQL Server 2014 |
No |
No |
5. Projections in EXISTS subqueries
Interestingly, I get asked about them all the time at my master class, where I defend the point of view that
SELECT * usually does not lead to any good. The question is: is it possible to use
SELECT * in an
EXISTS subquery ? For example, if we need to find actors who played in films...
SELECT first_name, last_name
FROM actor a
WHERE EXISTS (
SELECT * -- Is this OK?
FROM film_actor fa
WHERE a.actor_id = fa.actor_id
)
And the answer is... yes. Can. The asterisk does not affect the request. How can you be sure of this? Consider the following query:
-- DB2
SELECT 1 / 0 FROM sysibm.dual
-- Oracle
SELECT 1 / 0 FROM dual
-- PostgreSQL, SQL Server
SELECT 1 / 0
-- MySQL
SELECT pow(-1, 0.5);
All of these databases report a division by zero error. Note an interesting fact: in MySQL, when we divide by zero, we get
NULL instead of an error, so we have to do something else that is not allowed. Now, what happens if we execute, instead of the above, the following queries?
-- DB2
SELECT CASE WHEN EXISTS (
SELECT 1 / 0 FROM sysibm.dual
) THEN 1 ELSE 0 END
FROM sysibm.dual
-- Oracle
SELECT CASE WHEN EXISTS (
SELECT 1 / 0 FROM dual
) THEN 1 ELSE 0 END
FROM dual
-- PostgreSQL
SELECT EXISTS (SELECT 1 / 0)
-- SQL Server
SELECT CASE WHEN EXISTS (
SELECT 1 / 0
) THEN 1 ELSE 0 END
-- MySQL
SELECT EXISTS (SELECT pow(-1, 0.5));
Now none of the databases return an error. They all return
TRUE or
1 . This means that none of our databases actually evaluate the projection (that is, the
SELECT clause ) of the
EXISTS subquery . SQL Server, for example, shows the following plan:
|--Constant Scan(VALUES:((CASE WHEN (1) THEN (1) ELSE (0) END)))
As you can see, the
CASE expression has been converted to a constant and the subquery has been eliminated. Other databases store the subquery in the plan and don't mention anything about the projection, so let's take another look at the original query plan in Oracle:
SELECT first_name, last_name
FROM actor a
WHERE EXISTS (
SELECT *
FROM film_actor fa
WHERE a.actor_id = fa.actor_id
)
The above query plan looks like this:
------------------------------------------------------------------
| Id | Operation | Name | E-Rows |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
|* 1 | HASH JOIN SEMI | | 200 |
| 2 | TABLE ACCESS FULL | ACTOR | 200 |
| 3 | INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR | 5462 |
------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=1) LAST_NAME, FIRST_NAME
2 - (rowset=256) A.ACTOR_ID, FIRST_NAME, LAST_NAME
3 - FA.ACTOR_ID
We observe information about the projection at
Id=3 . In fact, we don't even access the
FILM_ACTOR table because we don't need to.
The EXISTS predicate can be performed using a foreign key index on a single
ACTOR_ID column - all that is needed for this query - even though we wrote
SELECT * .
Summary
Luckily, all of our databases remove projection from
EXISTS subqueries :
Database |
Projection EXISTS |
DB2 LUW 10.5 |
Yes |
MySQL 8.0.2 |
Yes |
Oracle 12.2.0.1 |
Yes |
PostgreSQL 9.6 |
Yes |
SQL Server 2014 |
Yes |
Stay tuned for
Part 3 , where we'll discuss other cool SQL optimizations.
GO TO FULL VERSION