Five simple optimizations that can be implemented based only on metadata (i.e. constraints) and the query itself
We offer you an adaptation of Lukas Eder's article, designed for those who have a general understanding of databases and SQL, as well as some practical experience with DBMS.
Cost optimization is actually a standard way to optimize SQL queries in modern databases. This is why it is so difficult to manually write a complex algorithm in
3GL (third generation programming languages) whose performance would exceed the dynamically calculated execution plan generated by a modern optimizer. Today we will not discuss cost optimization, that is, optimization based on the cost model of the database. We'll look at much simpler optimizations. Those that can be implemented based only on metadata (i.e. restrictions) and the request itself. Usually their implementation for a database is not a Newton binomial, since, in this case, any optimization will lead to a better execution plan, regardless of the presence of indexes, data volumes and skewness of data distribution. "Not a Newton binomial" is not in the sense of how easy it is to implement the optimization, but whether it should be done. These optimizations eliminate unnecessary, extra work [for the database] (
as opposed to unnecessary, required work, which I already wrote about ).
What are these optimizations used for?
Most of them are used for:
- bug fixes in queries;
- allowing views to be reused without the database actually executing the view logic.
In the first case, one could say: “So what, just go ahead and fix this stupid SQL query.” But let the one who has never made a mistake throw a stone at me first. The second case is especially interesting: it gives us the ability to create complex libraries of views and table functions that can be reused across multiple layers.
Databases used
In this article we will compare 10 SQL optimizations in the five most widely used DBMSs (
according to database rankings ):
- Oracle 12.2;
- MySQL 8.0.2;
- SQL Server 2014;
- PostgreSQL 9.6;
- DB2 LUW 10.5.
Another
rating almost echoes it. As usual, in this article I will be querying the Sakila database
.
Here is a list of these ten types of optimizations:
- transitive closure;
- impossible predicates and unnecessary table calls;
- eliminating JOIN;
- elimination of "meaningless" predicates;
- projections in EXISTS subqueries;
- merging of predicates;
- provably empty sets;
- constraints CHECK;
- unnecessary reflexive connections;
- Pushdown predicates
Today we will discuss pp. 1-3, in the second part - 4 and 5, and in part 3 - 6-10.
1. Transitive closure
Let's start with something simpler: transitive closure . This is a trivial concept that applies to many mathematical operations, such as the equality operator. It can be formulated in this case as follows: if A = B and B = C, then A = C.
Not difficult, right? But this has some interesting implications for SQL optimizers. Let's look at an example. Let's extract all movies with ACTOR_ID = 1:
SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE a.actor_id = 1;
The result is as follows:
FIRST_NAME LAST_NAME FILM_ID
PENELOPE GUINESS 1
PENELOPE GUINESS 23
PENELOPE GUINESS 25
PENELOPE GUINESS 106
PENELOPE GUINESS 140
PENELOPE GUINESS 166
...
Let's now take a look at the plan for executing this query in the case of the Oracle DBMS:
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | NESTED LOOPS | | 19 |
| 2 | TABLE ACCESS BY INDEX ROWID| ACTOR | 1 |
|* 3 | INDEX UNIQUE SCAN | PK_ACTOR | 1 |
|* 4 | INDEX RANGE SCAN | PK_FILM_ACTOR | 19 |
--------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("A"."ACTOR_ID"=1)
4 - access("FA"."ACTOR_ID"=1)
The section on predicates is especially interesting here. The predicate ACTOR_ID = 1, due to transitive closure, applies to both the ACTOR table and the FILM_ACTOR table. If:
• A.ACTOR_ID = 1 (из предиката WHERE) и…
• A.ACTOR_ID = FA.ACTOR_ID (из предиката ON)
То:
• FA.ACTOR_ID = 1
For more complex queries, this produces some very nice results. In particular, the accuracy of cardinality estimates increases significantly, since it becomes possible to select estimates based on a specific constant value of the predicate, and not, for example, the average number of films by actors, as in the following query (returning the same result):
SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE first_name = 'PENELOPE'
AND last_name = 'GUINESS'
His plan:
----------------------------------------------------------------------------
| Id | Operation | Name | Rows |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | NESTED LOOPS | | 2 |
|* 2 | TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR | 1 |
|* 3 | INDEX RANGE SCAN | IDX_ACTOR_LAST_NAME | 3 |
|* 4 | INDEX RANGE SCAN | PK_FILM_ACTOR | 27 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("A"."FIRST_NAME"='PENELOPE')
3 - access("A"."LAST_NAME"='GUINESS')
4 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
As you can see, the number of rows in the FILM_ACTOR table is overestimated, while the NESTED LOOP is underestimated. Here are a couple of interesting values:
SELECT count(*) FROM film_actor WHERE actor_id = 1;
SELECT avg(c) FROM (
SELECT count(*) c FROM film_actor GROUP BY actor_id
);
Result:
19
27.315
This is where the estimates come from. If the database knows that we are talking about ACTOR_ID = 1, then it can collect statistics on the number of films for
this particular actor . If
it doesn’t (since the standard statistics collection mechanism does not correlate FIRST_NAME/LAST_NAME with ACTOR_ID), then we will get the average number of films for all
actors . A simple, unimportant error in this particular case, but in a complex query it can propagate further, accumulate and lead further into the query (higher up in the plan) to an incorrect JOIN choice. So whenever you can, design your joins and simple predicates to take advantage of transitive closure. What other databases support this feature?
DB2
Yes!
Explain Plan
-----------------------------------------------------------
ID | Operation | Rows | Cost
1 | RETURN | | 13
2 | NLJOIN | 27 of 1 | 13
3 | FETCH ACTOR | 1 of 1 (100.00%) | 6
4 | IXSCAN PK_ACTOR | 1 of 200 ( .50%) | 0
5 | IXSCAN PK_FILM_ACTOR | 27 of 5462 ( .49%) | 6
Predicate Information
4 - START (Q2.ACTOR_ID = 1)
STOP (Q2.ACTOR_ID = 1)
5 - START (1 = Q1.ACTOR_ID)
STOP (1 = Q1.ACTOR_ID)
By the way, if you like cool execution plans like this, check out
Markus Winand's script .
MySQL
Unfortunately, MySQL execution plans are not well suited for this kind of analysis. The predicate itself is missing from the output information:
ID SELECT TYPE TABLE TYPE REF ROWS
------------------------------------------
1 SIMPLE a const const 1
1 SIMPLE fa ref const 19
But the fact that const is specified twice in the REF column shows that both tables are searching for a constant value. At the same time, the query plan with FIRST_NAME/LAST_NAME looks like this:
ID SELECT TYPE TABLE TYPE REF ROWS
-----------------------------------------------
1 SIMPLE a ref const 3
1 SIMPLE fa ref a.actor_id 27
And as you can see, the REF now references the column from the JOIN predicate. The cardinality score is almost the same as in Oracle. So yes, MySQL supports transitive closures too.
PostgreSQL
Yes!
QUERY PLAN
------------------------------------------------------------------------------------
Nested Loop (cost=4.49..40.24 rows=27 width=15)
-> Seq Scan on actor a (cost=0.00..4.50 rows=1 width=17)
Filter: (actor_id = 1)
-> Bitmap Heap Scan on film_actor fa (cost=4.49..35.47 rows=27 width=4)
Recheck Cond: (actor_id = 1)
-> Bitmap Index Scan on film_actor_pkey (cost=0.00..4.48 rows=27 width=0)
Index Cond: (actor_id = 1)
SQL Server
Yes!
|--Nested Loops(Inner Join)
|--Nested Loops(Inner Join)
| |--Index Seek (SEEK:([a].[actor_id]=(1)))
| |--RID Lookup
|--Index Seek (SEEK:([fa].[actor_id]=(1)))
Summary
All of our databases support transitive closure.
Database |
Transitive closure |
DB2 LUW 10.5 |
Yes |
MySQL 8.0.2 |
Yes |
Oracle 12.2.0.1 |
Yes |
PostgreSQL 9.6 |
Yes |
SQL Server 2014 |
Yes |
However, wait for #6 in the next part of the article. There are complex cases of transitive closure that not all databases can handle.
2. Impossible predicates and unnecessary table calls
This is a completely stupid optimization, but why not? If users write impossible predicates, then why bother executing them? Here are some examples:
-- "Очевидный"
SELECT * FROM actor WHERE 1 = 0
-- "Хитрый"
SELECT * FROM actor WHERE NULL = NULL
The first query will obviously never return any results, but the same statement is true for the second. After all, although NULL IS NULL is always TRUE, the result of the calculation NULL = NULL is NULL, which, according to
three-valued logic , is equivalent to FALSE. This is pretty self-explanatory, so let's jump straight into finding out which databases perform this optimization.
DB2
Yes!
Explain Plan
-----------------------------------
ID | Operation | Rows | Cost
1 | RETURN | | 0
2 | TBSCAN GENROW | 0 of 0 | 0
As you can see, access to the ACTOR table is completely excluded from the plan. It contains only the GENROW operation, which generates zero rows. Perfect.
MySQL
Yes!
ID SELECT TYPE TABLE EXTRAS
-----------------------------------------
1 SIMPLE Impossible WHERE
This time, MySQL was kind enough to inform us about the impossible WHERE clause. Thank you! This makes analysis much easier, especially compared to other databases.
Oracle
Yes!
---------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 0 |
|* 1 | FILTER | | 1 | | 0 |
| 2 | TABLE ACCESS FULL| ACTOR | 0 | 200 | 0 |
---------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(NULL IS NOT NULL)
We see that the plan still mentions access to the ACTOR table, and the expected number of rows is still 200, but there is also a filtering operation (FILTER) with Id=1, where there will never be TRUE. Due to Oracle's dislike of
the standard SQL Boolean data type , Oracle displays NULL IS NOT NULL in the plan, instead of just FALSE. Oh well... But seriously, watch that predicate. I've had occasion to debug execution plans with 1000-line subtrees and extremely high cost values, only to discover after the fact that the entire subtree was being "cut off" by the NULL IS NOT NULL filter. A little discouraging, I tell you.
PostgreSQL
Yes!
QUERY PLAN
-------------------------------------------
Result (cost=0.00..0.00 rows=0 width=228)
One-Time Filter: false
Already better. No annoying ACTOR table calls and a neat little FALSE predicate.
SQL Server?
Yes!
|--Constant Scan
SQL Server calls this a "
constant scan", which is a scan where nothing happens - similar to DB2. All our databases can exclude impossible predicates:
Database |
Impossible predicates |
Unnecessary table accesses |
DB2 LUW 10.5 |
Yes |
Yes |
MySQL 8.0.2 |
Yes |
Yes |
Oracle 12.2.0.1 |
Yes |
Yes |
PostgreSQL 9.6 |
Yes |
Yes |
SQL Server 2014 |
Yes |
Yes |
3. Eliminate JOINs
In the previous section, we observed unnecessary table accesses in single-table queries. But what happens if the JOIN does not require one of several table accesses?
I already wrote about eliminating JOIN in a previous post from my blog . The SQL engine is able to determine, based on the type of query and the presence of primary and foreign keys, whether a particular JOIN is actually needed in a given query, or whether eliminating it will not affect the semantics of the query. In all of the next three examples, JOIN is not needed. An inner ...-to-one join can be eliminated by having a NOT NULL foreign key. Instead:
SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id
The database can do the following:
SELECT first_name, last_name
FROM customer c
An INNER JOIN of type "...-to-one" can be replaced if there is a nullable foreign key. The above query works if the foreign key is subject to a NOT NULL constraint. If not, for example, as in this request:
SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id
then JOIN can still be eliminated, but you will have to add the NOT NULL predicate, like this:
SELECT title
FROM film
WHERE original_language_id IS NOT NULL
An OUTER JOIN of the "...-to-one" type can be removed if there is a unique key. Instead of this:
SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id
The database, again, can do the following:
SELECT first_name, last_name
FROM customer c
... even if there is no foreign key for CUSTOMER.ADDRESS_ID. The unique outer connection (DISTINCT OUTER JOIN) of the "...-to-many" type can be removed. Instead of this:
SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
The database can do the following:
SELECT DISTINCT first_name, last_name
FROM actor a
All these examples were studied in detail in the previous article, so I will not repeat myself, but will just summarize everything that various databases can eliminate:
Database |
INNER JOIN: ...-to-one |
(can be NULL): ...-to-one |
OUTER JOIN: ...-to-one |
OUTER JOIN DISTINCT: ...-to-many |
DB2 LUW 10.5 |
Yes |
Yes |
Yes |
Yes |
MySQL 8.0.2 |
No |
No |
No |
No |
Oracle 12.2.0.1 |
Yes |
Yes |
Yes |
No |
PostgreSQL 9.6 |
No |
No |
Yes |
No |
SQL Server 2014 |
Yes |
No |
Yes |
Yes |
Unfortunately, not all databases can resolve all types of connections. DB2 and SQL Server are the undisputed leaders here!
To be continued
GO TO FULL VERSION