
SQL performance issues due to "unnecessary but required work"

Level of knowledge required to understand the article: a general understanding of databases and SQL, some practical experience with DBMS.
Probably the most important thing you can learn in order to write efficient SQL queries is indexing. In second place, however, very close behind, is the knowledge that many SQL clients make the database do a lot of "unnecessary but required work". Repeat after me:
Unnecessary but required work
What is “unnecessary but required work”? As Captain Obvious tells us, it is:

Unnecessary

Suppose our client application needs the following data:

[Image: a UI listing film titles and their ratings]
Nothing unusual. We are working with a movie database (such as the Sakila database) and want to show users the title and rating of every film. The following query would give us the result we need:
SELECT title, rating
FROM film
However, our application (or our ORM) instead executes this query:
SELECT *
FROM film
What do we get as a result? Take a guess. We get a lot of useless information:

[Image: a wide result set full of columns nobody asked for]
On the right you can even see some complex JSON being loaded:
  • from disk
  • into the cache
  • over the wire
  • into client memory
  • and finally thrown away [as unneeded]
Yes, we throw away most of this information: all the work done to fetch it turns out to be completely useless. Is that true? It is.

Mandatory

And now the worst part. Although optimizers have become very sophisticated, this work is mandatory for the database: it has no way of knowing that the client application doesn't need 95% of this data. And that is just the simplest example. Imagine joining several tables...

So what, you say, databases are fast? Let me point out a few things you probably haven't thought about. Of course, the execution time of an individual query hardly affects anything. Fine, it ran one and a half times slower, but we'll get through it, right? For convenience's sake? Sometimes that's true. But if you always sacrifice performance for convenience, these little things start to add up. We will no longer be talking about the performance of individual queries, but about the throughput of the whole system, and then response times degrade in ways that are not so easy to fix. That's when you lose scalability.

Let's take a look at the execution plans, in this case in the Oracle DBMS:
--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 |   166K|
|   1 |  TABLE ACCESS FULL| FILM |  1000 |   166K|
--------------------------------------------------
compared with:
--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 | 20000 |
|   1 |  TABLE ACCESS FULL| FILM |  1000 | 20000 |
--------------------------------------------------
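(By the way, you can reproduce such plans yourself in Oracle with EXPLAIN PLAN; a sketch — the exact output format depends on your client and on the optimizer statistics:)

EXPLAIN PLAN FOR
SELECT title, rating
FROM film;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);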
Running SELECT * instead of SELECT title, rating makes the database use roughly 8 times more memory (166K vs. 20000 bytes in the plans above). Nothing unexpected, right? We knew this would happen. Yet we still agree to it for many of our queries in which we simply don't need all that data. We create unnecessary but required work for the database, and it keeps piling up and piling up. We use 8 times more memory than needed (the multiplier will vary, of course). Meanwhile, at all the other stages (disk I/O, network transfer, client memory consumption) the problems are exactly the same, but I'll skip those and look instead at...

Using Indexes

Most databases have long since embraced the concept of covering indexes. A covering index is not a special index type in itself, but it may turn out to be a "special index" for a particular query, either "by accident" or by design. Consider the following query:
SELECT *
FROM actor
WHERE last_name LIKE 'A%'
There is nothing unexpected about its execution. It is a simple query: an index range scan, a table access, and we're done:
-------------------------------------------------------------------
| Id  | Operation                   | Name                | Rows  |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                     |     8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR               |     8 |
|*  2 |   INDEX RANGE SCAN          | IDX_ACTOR_LAST_NAME |     8 |
-------------------------------------------------------------------
A good plan, isn't it? Well, not if all we actually needed was this:

[Image: the UI only needs actors' first and last names]
Obviously, we are wasting memory again, etc. Let's consider this query as an alternative:
SELECT first_name, last_name
FROM actor
WHERE last_name LIKE 'A%'
Its plan is this:
----------------------------------------------------
| Id  | Operation        | Name            | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT |                 |     8 |
|*  1 |  INDEX RANGE SCAN| IDX_ACTOR_NAMES |     8 |
----------------------------------------------------
We were able to completely eliminate the table access, thanks to an index that satisfies all the needs of our query... a covering index. Does it matter? Absolutely! This approach can speed up some queries by an order of magnitude (or slow them down by an order of magnitude when the index stops being covering after some change). Covering indexes cannot always be used: indexes come at a cost, and you shouldn't add too many of them. But in this case, everything is obvious.
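For reference, a covering index like the one the hint below selects could be defined as follows (a sketch: the name mirrors IDX_ACTOR_NAMES from the plan above; the actual Sakila schema may already include it or differ):

-- LAST_NAME drives the range scan; FIRST_NAME makes the index covering
CREATE INDEX idx_actor_names ON actor (last_name, first_name);

Let's evaluate the performance: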
SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Worst-case query: memory overhead PLUS a table access
      SELECT *
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Improved query, but still with a table access
      SELECT /*+INDEX(actor(last_name))*/
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Optimal query: a covering index
      SELECT /*+INDEX(actor(last_name, first_name))*/
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 3 : ' || (SYSTIMESTAMP - v_ts));
END;
/

As a result we get:

Statement 1 : +000000000 00:00:02.479000000
Statement 2 : +000000000 00:00:02.261000000
Statement 3 : +000000000 00:00:01.857000000

Note that the actor table has only 4 columns, so the performance difference between statements 1 and 2 is not that big, but it is still noticeable. Note also that I used Oracle optimizer hints so that the optimizer would pick one specific index or the other for each query. Statement 3 is the undisputed winner of our race: its performance is much better, and we are talking about an extremely simple query.

Again, when we write SELECT *, we create unnecessary but required work for the database that it cannot optimize away. It won't pick the covering index on its own, because that index has slightly higher overhead than the LAST_NAME index it chose, and, among other things, it would still have to access the table to fetch, say, a LAST_UPDATE column that is useless to us. But the deeper we analyze SELECT *, the worse things get. Let's talk about...

SQL transformations

Optimizers work so well because they transform SQL queries (I talked about how this works in my recent talk at Voxxed Days in Zurich). For example, there is an extremely powerful transformation called JOIN elimination. Consider the following helper view, which we created so that we wouldn't have to join all these tables manually every time:
CREATE VIEW v_customer AS
SELECT
  c.first_name, c.last_name,
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)
This view simply performs all the "...-to-one" joins between the CUSTOMER table and the various tables holding the parts of a customer's address. Thank you, normalization. Imagine that, after working with this view for a while, we got used to it and forgot about the tables underlying it. And now we execute the following query:
SELECT *
FROM v_customer
As a result, we get a very impressive plan:
----------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |   599 | 47920 |   14 |
|*  1 |  HASH JOIN           |          |   599 | 47920 |   14 |
|   2 |   TABLE ACCESS FULL  | COUNTRY  |   109 |  1526 |    2 |
|*  3 |   HASH JOIN          |          |   599 | 39534 |   11 |
|   4 |    TABLE ACCESS FULL | CITY     |   600 | 10800 |    3 |
|*  5 |    HASH JOIN         |          |   599 | 28752 |    8 |
|   6 |     TABLE ACCESS FULL| CUSTOMER |   599 | 11381 |    4 |
|   7 |     TABLE ACCESS FULL| ADDRESS  |   603 | 17487 |    3 |
----------------------------------------------------------------
Well, of course. The database performs all these joins and full table scans because that is what we told it to do: fetch all this data. Now, again, imagine that all we really needed was this:

[Image: the UI only needs customers' first and last names]

Seriously, right? Now you are starting to see what I am talking about. But imagine that we have learned from past mistakes and execute this more optimal query instead:
SELECT first_name, last_name
FROM v_customer
Now let's check what happened!
------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |   599 | 16173 |     4 |
|   1 |  NESTED LOOPS      |             |   599 | 16173 |     4 |
|   2 |   TABLE ACCESS FULL| CUSTOMER    |   599 | 11381 |     4 |
|*  3 |   INDEX UNIQUE SCAN| SYS_C007120 |     1 |     8 |     0 |
------------------------------------------------------------------
A drastic change for the better in terms of execution. The joins were eliminated, because the optimizer can now see that they are useless, and if it can see that (and you haven't made the work mandatory by selecting *), it can simply skip all that work.

Why is that so in this case? The foreign key CUSTOMER.ADDRESS_ID referencing the primary key ADDRESS.ADDRESS_ID guarantees exactly one matching row in the latter, which means the JOIN is a "...-to-one" join that neither increases nor decreases the number of rows. And since we aren't selecting any columns from those joined tables anyway, there is no point in loading their rows at all: removing the JOIN cannot affect the result of the query.

Databases do this all the time.
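Note that for an inner JOIN this elimination is only provably safe when the constraints are declared: besides the foreign key, the ADDRESS_ID column typically also has to be NOT NULL, otherwise dropping the join could change the row count. A sketch of the relevant DDL (constraint names are illustrative; the actual Sakila scripts may differ):

-- the metadata that lets the optimizer prove the join is redundant
ALTER TABLE customer MODIFY address_id NOT NULL;
ALTER TABLE customer
  ADD CONSTRAINT fk_customer_address
  FOREIGN KEY (address_id) REFERENCES address (address_id);

You can run the following query on almost any database: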
-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0 FROM dual
) THEN 1 ELSE 0 END
FROM dual

-- Saner SQL dialects, e.g. PostgreSQL
SELECT EXISTS (SELECT 1 / 0)
In this case, you might expect an arithmetic exception to be thrown, as when executing the following query:
SELECT 1 / 0 FROM dual

which indeed yields:

ORA-01476: divisor is equal to zero

But for the EXISTS queries above, this does not happen. The optimizer (or even the parser) can prove that no element of the select list inside the EXISTS (SELECT ..) predicate will change the result of the query, so there is no need to execute it. Like this!

Meanwhile...

One of the most annoying aspects of ORMs is that they make it so easy to write SELECT * queries. In fact, in HQL / JPQL, that is essentially the default: we can omit the SELECT clause altogether, because we are going to fetch the whole entity, right? For example:
FROM v_customer
Vlad Mihalcea, for example, a well-known Hibernate expert and advocate, recommends using explicit, tailored queries almost every time you are sure you don't intend to persist any changes after fetching. ORMs greatly simplify the problem of persisting object graphs. Note the word: persistence. The tasks of actually modifying object graphs and saving those changes are inextricably linked. But if you are not going to do that, why fetch the entity at all? Why not write a tailored query (see the sketch below)?

Let's be clear: from a performance standpoint, writing a query precisely tailored to your specific use case is obviously better than any other option. You may not care, since your data set is small and it doesn't matter. Great. But when you eventually need scalability, redesigning your application to use queries instead of imperative entity-graph traversal will be quite a challenge. And you'll have plenty to do without that.
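As an illustration (a sketch: the Customer entity and the com.example.CustomerView DTO class are hypothetical), compare fetching full managed entities in JPQL:

SELECT c FROM Customer c

with a DTO projection that fetches only the two columns we actually display:

SELECT new com.example.CustomerView(c.firstName, c.lastName)
FROM Customer c

The constructor expression requires a matching CustomerView(String, String) constructor, and the resulting objects are plain values rather than managed entities, so there is no dirty-checking overhead either.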

Counting rows to find out if something is present

One of the worst ways to waste resources is running COUNT(*) queries just to check whether something exists in the database. For example, we need to find out whether a given user has any orders at all. And we run the query:
SELECT count(*)
FROM orders
WHERE user_id = :user_id
Elementary. If COUNT = 0, there are no orders; otherwise, there are. The performance won't be too bad, since we probably have an index on the ORDERS.USER_ID column. But what do you think the performance of the query above will be, compared to the following alternative:
-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
) THEN 1 ELSE 0 END
FROM dual

-- Saner SQL dialects, e.g. PostgreSQL
SELECT EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
)
It doesn't take a rocket scientist to figure out that a true existence predicate stops scanning for additional rows as soon as it finds the first one. So if the result is "no orders", the speed will be comparable. If, however, the result is "yes, there are orders", and the exact count is not needed, the answer comes back much faster. After all, we are not interested in the exact number; yet we told the database to compute it (unnecessary work), and the database doesn't know that we are ignoring any result greater than 1 (required work). Of course, it would be much worse if we called list.size() on a JPA-backed collection to achieve the same result. I already wrote about this on my blog earlier, where I benchmarked both options on...
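For completeness: the same "stop at the first row" behavior can also be expressed without EXISTS, using the standard row-limiting clause (a sketch; syntax support varies by DBMS and version — Oracle, for instance, only added FETCH FIRST in 12c):

-- an empty result means "no orders"; at most one row is ever read
SELECT 1
FROM orders
WHERE user_id = :user_id
FETCH FIRST 1 ROW ONLY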

Conclusion

This article states the obvious. Don't force the database to do unnecessary but required work. It is unnecessary because, given the requirements, you know that some specific piece of work does not need to be done. Yet you tell the database to do it anyway. It is required because the database has no way of verifying that this work is unnecessary: that information is available only to the client, not to the server. So the database has to do it.

The article focused on SELECT *, mainly because it is such a convenient example. But this applies not only to databases. It applies to all distributed algorithms in which a client tells a server to do unnecessary but required work. How many N+1 problems are there in your average AngularJS application, where the UI loops over the result of service A, calling service B once per item, instead of packing all the calls to B into a single call? It is a very common occurrence.

The solution is always the same: the more information you give to the entity executing your commands, the faster it can (in theory) execute them. Write optimal queries. Always. Your entire system will thank you.

Original article