
SQL performance issues arising from "unnecessary but required work"

Published in the Random EN group
The level of knowledge required to understand this article: a general understanding of databases and SQL, plus a little practical experience with any DBMS.
Probably the most important thing you can learn in order to write efficient SQL queries is indexing. A very close second, however, is knowing that many SQL clients make the database do a lot of "unnecessary but required work". Repeat after me:
Unnecessary but required work
What is "unnecessary but required work"? As Captain Obvious tells us, it is:

Unnecessary

Suppose our client application needs the following data:
[Figure: a UI listing film titles with their ratings]
Nothing unusual. We are working with a movie database (e.g., the Sakila database) and want to display the title and rating of every film to our users. The following query would give us the result we need:
SELECT title, rating
FROM film
However, our application (or our ORM) does this instead:
SELECT *
FROM film
What do we get as a result? Guess. We get a lot of useless information:
[Figure: the result of SELECT *, with many columns nobody asked for]
On the right, you can even see some complex JSON being loaded:
  • from disk
  • into the cache
  • over the wire
  • into the client's memory
  • and then finally thrown away
Yes, we throw away most of this information. All the work done to extract it turns out to be completely wasted. Really? Really.

Required

And now the worst part: although optimizers have become very capable, this work is required of the database. The database has no way of knowing that the client application doesn't actually need 95% of this data. And that is just the simplest example; imagine joining several tables...

So what, you say, databases are fast? Let me enlighten you about a few things you probably haven't thought about. Of course, the execution time of an individual query doesn't really change much. Fine, it ran one and a half times slower, we'll survive, right? For convenience's sake? Sometimes that is true. But if you always sacrifice performance for convenience, these little things start to add up. It will no longer be about latency (the execution speed of individual queries) but about throughput (the overall responsiveness of the system), and that is when serious problems begin that are not so easy to solve. That is when you lose scalability.

Let's take a look at the execution plans, in this case on the Oracle DBMS:
--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 |   166K|
|   1 |  TABLE ACCESS FULL| FILM |  1000 |   166K|
--------------------------------------------------
compared with:
--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 | 20000 |
|   1 |  TABLE ACCESS FULL| FILM |  1000 | 20000 |
--------------------------------------------------
Executing SELECT * instead of SELECT title, rating uses eight times more memory in the database. Nothing unexpected, right? We knew that. Yet we still accept this in many of our queries where we simply do not need all that data. We create unnecessary but required work for the database, and it keeps piling up and piling up. We use eight times more memory than necessary (the multiplier will vary, of course). Meanwhile, at all the other stages (disk I/O, transferring data over the network, memory consumption on the client), the problems are exactly the same, but I will skip them here and look instead at...
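The overhead is easy to reproduce outside Oracle too. Here is a small, self-contained sketch using Python's built-in sqlite3 module; the film table is mocked up for illustration (only title and rating match the Sakila schema, and the description column stands in for the bulky data nobody asked for):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE film (
        film_id     INTEGER PRIMARY KEY,
        title       TEXT,
        description TEXT,   -- the bulky column nobody asked for
        rating      TEXT
    )
""")
conn.executemany(
    "INSERT INTO film (title, description, rating) VALUES (?, ?, ?)",
    [(f"FILM {i}", "A long plot description... " * 40, "PG") for i in range(1000)],
)

def payload_bytes(query):
    # Rough client-side measure: total size of all fetched values.
    return sum(len(str(v)) for row in conn.execute(query) for v in row)

wide   = payload_bytes("SELECT * FROM film")
narrow = payload_bytes("SELECT title, rating FROM film")
print(wide, narrow)  # the narrow projection moves a small fraction of the bytes
```

The exact ratio depends entirely on the schema, but the point stands: every byte of that difference was read, cached, transferred, and then discarded.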

Using Indexes

Most databases have long since figured out the concept of covering indexes. A covering index is not a special kind of index in itself, but it can turn out to be a "special" index for a particular query, either by accident or by design. Consider the following query:
SELECT *
FROM actor
WHERE last_name LIKE 'A%'
There is nothing unexpected about its execution. It is a simple query. An index range scan, then the table access, and we're done:
-------------------------------------------------------------------
| Id  | Operation                   | Name                | Rows  |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                     |     8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR               |     8 |
|*  2 |   INDEX RANGE SCAN          | IDX_ACTOR_LAST_NAME |     8 |
-------------------------------------------------------------------
A good plan, isn't it? Well, it would be if we actually needed all this data. But we don't:
[Figure: the UI needs only the actors' first and last names]
Obviously, we are wasting memory and so on. Let's consider this query as an alternative:
SELECT first_name, last_name
FROM actor
WHERE last_name LIKE 'A%'
Its plan is:
----------------------------------------------------
| Id  | Operation        | Name            | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT |                 |     8 |
|*  1 |  INDEX RANGE SCAN| IDX_ACTOR_NAMES |     8 |
----------------------------------------------------
We were able to eliminate the table access entirely, thanks to an index that satisfies all the needs of our query... a covering index. Does that matter? Very much so! This approach can speed some queries up by an order of magnitude (or slow them down by an order of magnitude when an index stops being covering after some change). Covering indexes cannot always be used: indexes have a cost, and you shouldn't add too many of them. But in this case, everything is clear. Let's measure the performance:
SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Worst-case query: memory overhead PLUS a table access
      SELECT *
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Better query, but still with a table access
      SELECT /*+INDEX(actor(last_name))*/
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Optimal query: covering index
      SELECT /*+INDEX(actor(last_name, first_name))*/
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 3 : ' || (SYSTIMESTAMP - v_ts));
END;
/

As a result, we get:

Statement 1 : +000000000 00:00:02.479000000
Statement 2 : +000000000 00:00:02.261000000
Statement 3 : +000000000 00:00:01.857000000

Note that the actor table has only four columns, so the performance difference between statements 1 and 2 is not that big, but it is still significant. Note also that I used Oracle optimizer hints to make the optimizer pick one particular index for each query. Statement 3 is the undisputed winner of our race. Its performance is much better, and we are talking about an extremely simple query. Again, when we write SELECT *, we create unnecessary but required work for the database, which it cannot optimize away. It won't pick the covering index, because that index has slightly more overhead than the LAST_NAME index it did choose, and, among other things, it would still have to access the table to fetch, for example, a LAST_UPDATE column that nobody asked for. But the deeper we dig into SELECT *, the worse things get. Let's talk about...
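The same covering-index effect is easy to observe in other engines via their plan output. A minimal sketch with Python's sqlite3 follows; the table and index names mirror the plans above, and a range predicate stands in for LIKE 'A%' (SQLite's default LIKE collation would otherwise keep the index from being used, which would muddy the point):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (
        actor_id    INTEGER PRIMARY KEY,
        first_name  TEXT,
        last_name   TEXT,
        last_update TEXT
    );
    CREATE INDEX idx_actor_names ON actor (last_name, first_name);
""")
conn.executemany("INSERT INTO actor (first_name, last_name) VALUES (?, ?)",
                 [("PENELOPE", "AKROYD"), ("DEBBIE", "AKROYD"), ("KIRSTEN", "ALLEN")])

def plan(query):
    # EXPLAIN QUERY PLAN rows carry the human-readable step in the last column
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

# SELECT * also needs LAST_UPDATE, so the index cannot cover the query:
star_plan = plan("SELECT * FROM actor "
                 "WHERE last_name >= 'A' AND last_name < 'B'")
# The two-column projection is answered entirely from the index:
names_plan = plan("SELECT first_name, last_name FROM actor "
                  "WHERE last_name >= 'A' AND last_name < 'B'")

print(star_plan)   # expect: SEARCH ... USING INDEX idx_actor_names ...
print(names_plan)  # expect: SEARCH ... USING COVERING INDEX idx_actor_names ...
```

The words "COVERING INDEX" in the second plan are SQLite's way of saying the table access was eliminated, just as the Oracle plan above dropped the TABLE ACCESS BY INDEX ROWID step.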

SQL transformations

Optimizers work so well because they transform SQL queries (I explained how this works in my recent talk at the Voxxed Days conference in Zurich). For example, there is an extremely powerful transformation called JOIN elimination. Consider the following helper view, which we created to avoid joining all these tables manually every time:
CREATE VIEW v_customer AS
SELECT
  c.first_name, c.last_name,
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)
This view simply performs all the "...-to-one" joins between the CUSTOMER table and the various tables holding parts of customers' addresses. Thank you, normalization. Imagine that, after working with this view for a while, we got used to it and forgot about the underlying tables. Now we execute the following query:
SELECT *
FROM v_customer
As a result, we get a very impressive plan:
----------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |   599 | 47920 |   14 |
|*  1 |  HASH JOIN           |          |   599 | 47920 |   14 |
|   2 |   TABLE ACCESS FULL  | COUNTRY  |   109 |  1526 |    2 |
|*  3 |   HASH JOIN          |          |   599 | 39534 |   11 |
|   4 |    TABLE ACCESS FULL | CITY     |   600 | 10800 |    3 |
|*  5 |    HASH JOIN         |          |   599 | 28752 |    8 |
|   6 |     TABLE ACCESS FULL| CUSTOMER |   599 | 11381 |    4 |
|   7 |     TABLE ACCESS FULL| ADDRESS  |   603 | 17487 |    3 |
----------------------------------------------------------------
Well, of course. The database performs all of these joins and full table scans because that is what we told it to do: fetch all that data. Now, once again, imagine that all we really needed was this:
[Figure: the UI needs only the customers' first and last names]
Seriously, right? Now you begin to see what I'm talking about. But imagine that we have learned something from past mistakes, and we execute this more optimal query instead:
SELECT first_name, last_name
FROM v_customer
Now let's check what happened!
------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |   599 | 16173 |     4 |
|   1 |  NESTED LOOPS      |             |   599 | 16173 |     4 |
|   2 |   TABLE ACCESS FULL| CUSTOMER    |   599 | 11381 |     4 |
|*  3 |   INDEX UNIQUE SCAN| SYS_C007120 |     1 |     8 |     0 |
------------------------------------------------------------------
A striking change for the better in terms of execution. The joins were eliminated, because the optimizer can now prove they are useless, and if it can prove that (and you didn't make the work mandatory by selecting *), it can simply skip all of it. Why is that so in this case?

The foreign key CUSTOMER.ADDRESS_ID references the primary key ADDRESS.ADDRESS_ID and thus guarantees exactly one matching row, which means the JOIN is a "...-to-one" join that neither increases nor decreases the number of rows. And since we don't select any columns from the joined tables anyway, there is no point in loading them at all: eliminating the JOIN provably does not affect the result of the query.

Databases do this all the time. You can run the following query on almost any database:
-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0 FROM dual
) THEN 1 ELSE 0 END
FROM dual

-- Saner SQL dialects, e.g. PostgreSQL
SELECT EXISTS (SELECT 1 / 0)
In this case, you might expect an arithmetic exception to be thrown, as in the following query:
SELECT 1 / 0 FROM dual

We get:

ORA-01476: divisor is equal to zero

But that doesn't happen. The optimizer (or even the parser) can prove that no element of the select list inside the EXISTS (SELECT ..) predicate will change the result of the query, so there is no need to execute it. How about that!
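Not every engine implements every transformation, though, so it is worth checking what yours actually does. Here is a hedged sketch using Python's sqlite3 and an abbreviated version of the v_customer schema above: it lists which tables a query against the view really touches. SQLite flattens the view but, unlike Oracle in the example above, does not eliminate inner joins, so it touches all four tables even for the narrow projection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country  (country_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE city     (city_id INTEGER PRIMARY KEY, city TEXT,
                           country_id INTEGER REFERENCES country);
    CREATE TABLE address  (address_id INTEGER PRIMARY KEY, address TEXT,
                           city_id INTEGER REFERENCES city);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                           first_name TEXT, last_name TEXT,
                           address_id INTEGER REFERENCES address);
    CREATE VIEW v_customer AS
    SELECT first_name, last_name, address, city, country
    FROM customer
    JOIN address USING (address_id)
    JOIN city    USING (city_id)
    JOIN country USING (country_id);
""")

def tables_touched(query):
    # Collect the tables mentioned anywhere in the EXPLAIN QUERY PLAN output.
    detail = " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + query))
    return {t for t in ("customer", "address", "city", "country") if t in detail}

star_tables   = tables_touched("SELECT * FROM v_customer")
narrow_tables = tables_touched("SELECT first_name, last_name FROM v_customer")
print(star_tables)
print(narrow_tables)
```

Running the same comparison (SELECT * versus the narrow projection) against your own DBMS's plan output is a quick way to find out whether it performs JOIN elimination for your views.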

Meanwhile...

One of the most annoying things about ORMs is that they make SELECT * queries so easy to write. In fact, in HQL/JPQL, for example, that is the default: we can omit the SELECT clause altogether, because we are going to fetch the whole entity anyway, right? For example:
FROM v_customer
Vlad Mihalcea, for example, a Hibernate expert and advocate, recommends using explicit queries almost always when you are sure you don't want to persist any modifications after fetching. ORMs greatly simplify the persistence of object graphs. Note: persistence. The tasks of actually modifying object graphs and persisting those changes are inextricably linked. But if you are not going to do that, why fetch the entity at all? Why not write an explicit query? Let's be clear: from a performance point of view, writing a query tailored precisely to your specific use case is obviously better than any other option. Maybe you don't care, because your data set is small and it doesn't matter. Great. But when you eventually need scalability, redesigning your application to use queries instead of imperative traversal of the entity graph will be very painful. And you will have plenty else to do.

Counting rows just to check for existence

One of the worst ways to waste resources is running COUNT(*) queries just to check whether something exists in the database. For example, we need to find out whether a given user has any orders at all. So we run the query:
SELECT count(*)
FROM orders
WHERE user_id = :user_id
Elementary. If COUNT = 0, there are no orders; otherwise there are. The performance won't even be that bad, since we probably have an index on the ORDERS.USER_ID column. But how do you think the performance of the query above compares to the following one:
-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
) THEN 1 ELSE 0 END
FROM dual

-- Saner SQL dialects, e.g. PostgreSQL
SELECT EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
)
It doesn't take a genius to figure out that a true existence predicate stops looking for additional rows as soon as the first one is found. So if the result is "no orders", the speeds will be comparable. If, however, the result is "yes, there are orders", and the exact count is not needed, the answer will come back much faster. After all, we are not interested in the exact number. Nevertheless, we have told the database to compute it (unnecessary work), and the database has no way of knowing that we ignore every result greater than 1 (required work). Of course, it would be much worse still to call list.size() on a JPA-backed collection to achieve the same result. I have written about this on my blog before, including a comparative benchmark of both options.
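The short-circuiting is easy to see even in an embedded engine. A small sketch with Python's sqlite3: one user with 200,000 orders, table and column names mirroring the example above. The EXISTS query can stop at the first matching index entry, while COUNT(*) has to touch all of them:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
conn.executemany("INSERT INTO orders (user_id) VALUES (?)",
                 [(42,) for _ in range(200_000)])

count_q  = "SELECT count(*) FROM orders WHERE user_id = ?"
exists_q = "SELECT EXISTS (SELECT * FROM orders WHERE user_id = ?)"

# Warm both statements once, so we time execution rather than compilation.
conn.execute(count_q, (42,)).fetchone()
conn.execute(exists_q, (42,)).fetchone()

t0 = time.perf_counter()
n = conn.execute(count_q, (42,)).fetchone()[0]
t_count = time.perf_counter() - t0

t0 = time.perf_counter()
has_orders = conn.execute(exists_q, (42,)).fetchone()[0]
t_exists = time.perf_counter() - t0

print(n, has_orders)  # 200000 1
print(t_exists < t_count)  # EXISTS stops after the first row
```

Both queries answer the question "does this user have orders?", but only one of them does the amount of work the question actually requires.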

Conclusion

This article states the obvious: don't make the database do unnecessary but required work.

It is unnecessary because, given the requirements, you know that some specific part of the work does not need to be done. Yet you tell the database to do it anyway. It is required because the database has no way of proving that this work is unnecessary. That information is available only to the client and not to the server. So the database has to do the work.

The article focused on SELECT *, mostly because it is such a convenient example. But this is not just about databases. It applies to all distributed algorithms in which a client tells a server to do unnecessary but required work. How many N+1 problems are there in your average AngularJS application, where the UI loops over the result of service A, calling service B many times, instead of packing all the B calls into a single one? It is a very common occurrence.

The solution is always the same: the more information you give to the entity executing your commands, the faster it can (in principle) execute them. Write better queries. Every time. Your entire system will thank you for it.

Original article