JavaRush /Java Blog /Random EN /Detailed analysis of the HashMap class

Главный архитектор программного обеспечения at Tribunal de Justiça da Paraíba

8 August 2023
415 views
1 comment

Detailed analysis of the HashMap class

Before moving on to a detailed discussion of the class, let's focus on the basic concepts associated with hash tables. This article will not discuss methods for working with hash mapping. Only the operations of insertion, search and deletion will be discussed clearly and in detail. I think it won’t be difficult to read the description of the methods available for HashMap from the same Schildt. Perhaps in the future I will write another article devoted to such methods, but for now this is in question. Compared to Java 7, the HashMap class in Java 8 has undergone significant changes (+1000 lines of code). You can read about the implementation in Java 7 here (but no longer relevant): habr A hash table is a data structure that implements the associative array interface (abstract key-value model or entry), which provides very fast insertion and search: regardless of the number elements insertion and search (and sometimes deletion) are performed in a time close to a constant - O(1). Essentially, this is a regular array, where the location of the element depends on the value of the element itself. The relationship between the value of an element and its position in the hash table is specified by a hash function. A hash function takes an input piece of data, which we call a key , and as an output it produces an integer known as a hash value (or hash code) . The hash value then binds our key to a specific hash table index. For the basic operations: insertion, lookup and deletion, we use the same hash function, so these operations are performed quite quickly. For this reason, it is important that the hash function behaves consistently and outputs the same index for the same input data. It is worth noting that the resulting hash code can be a huge numeric value, and the original array is conditionally designed for only 16 elements. Why not create an array with a billion elements in order to add only ten? Therefore, we must somehow transform this hash code into values from 0 to 15 (if the array size is 16). And for this, additional transformations are used. So we generate an index to minimize the size of the array. For example, in HashMap before Java 8, this additional method was used to find the desired cell:


static int indexFor(int h, int length) {
        return h & (length-1);
}

As input, it took the hash code obtained as a result of the work hashCode()and the length of the internal array (number of cells). And it returned the result “hash code” -> bitwise “AND” -> (array length – 1). The class HashMapinherits from the class AbstractMapand implements the following interfaces: Map, Cloneable, Serializable. The .method is responsible for the hash function in Java hashCode(). The default implementation hashCode()returns a value called the identity hash code . Even if a class overrides hashCode(), you can always get the object's ID hash by calling System.identityHashCode(Object o). The default implementation hashCode()in OpenJDK has nothing to do with the memory address, as is sometimes believed. More details here: habr In HashMap , the hash table is implemented based on an array (or, more precisely, dynamic, since the table can expand) of singly linked lists. Essentially, we obtain the hash code of the key as a result of the method hashCode(), which is then modified (as we will consider later), and internally, using an additional method, the resulting values are distributed to the required cells. Array elements (cells) are also called buckets , which are used to store individual nodes. Each of the buckets is a collection (list or tree). A node is an object of a nested class Node(or TreeNodein a tree structure). In fact, inside the array cell lies LinkedListonly a singly linked list, or a red-black tree, which underlies the implementation of another class - TreeMap.

Detailed analysis of the HashMap class - 1

The option with a red-black tree does not arise so often (how, what and where - below), and this structure is quite difficult to understand, so the emphasis will be on a node of the Node type. Node is a nested class inside HashMapwhich has the following fields:

Detailed analysis of the HashMap class - 2

final int hash— the hash of the current element, which we obtain as a result of hashing the key;
final K key— the key of the current element. This is where what you specify as the first object in the method is written to put();
V value— the value of the current element. And what you specify as the second object in the method is written here put();
Node < K, V> next— link to the next node within the same basket. The list is connected, so it needs a link not to the next node, if there is one.

Now let's look at the fields of the HashMap class itself:

transient Node < K, V> [] table– the hash table itself, implemented on the basis of an array, for storing key-value pairs in the form of nodes. This is where our Nodes are stored;
transient int size— number of key-value pairs;
int threshold— the maximum number of elements, upon reaching which the size of the hash table doubles. Calculated using the formula (capacity * loadFactor);
final float loadFactor— this parameter determines at what degree of load the current hash table needs to create a new hash table, i.e. as soon as the hash table is 75% full, a new hash table will be created and the current elements will be moved into it (a costly operation, since all elements must be rehashed);
transient Set< Map.Entry< K,V>> entrySet- contains a cached entrySet(), with which we can iterate HashMap.

And constants:

static final int DEFAULT_INITIAL_CAPACITY= 1 << 4— default hash table capacity (16);
static final int MAXIMUM_CAPACITY = 1 << 30— the maximum possible capacity of the hash table (approximately 1 billion);
static final float DEFAULT_LOAD_FACTOR = 0.75f— default load factor;
static final int TREEIFY_THRESHOLD = 8is the “threshold” of the number of elements in one bucket, upon reaching which the internal linked list will be converted into a tree structure (red-black tree).
static final int UNTREEIFY_THRESHOLD = 6— if the number of elements in one basket decreases to 6, then a reverse transition from a tree to a linked list will occur;
static final int MIN_TREEIFY_CAPACITY = 64— the minimum capacity (number of buckets) of a hash table at which a transition to a tree structure is possible. Those. If there are at least 64 buckets in the hash table and there are 8 or more elements in one bucket, then a transition to a tree structure will occur.

Class constructors:

public HashMap()— creates a hash display by default: by volume (capacity) = 16and with load factor (load factor) = 0.75;
public HashMap(Map< ? extends K, ? extends V> m)— creates a hash mapping initialized by the elements of another given mapping with the initial capacity that is enough to accommodate the elements of another mapping;
public HashMap(int initialCapacity)— creates a hash map with a given initial capacity. For HashMap to work correctly and correctly, the size of the internal array must be a power of two (i.e. 16, 64, 128, etc.);
public HashMap(int initialCapacity, float loadFactor)— creates a hash map with the specified parameters: initial capacity and load factor.

A class implements an interface Mapand extends a class AbstractMapwithout adding its own methods. A hash map does not guarantee the order of its elements. Therefore, the order in which elements are entered into the hash map is not necessarily the order in which they are retrieved by the iterator. Adding Objects Adding a key-value pair is done using the put(). Let's look at the steps involved when adding an object:

The hash value of the key of the entered object is calculated. The key hash is calculated by a method static final int hash(Object key)that already accesses the hashCode()key method we know. To do this, either an overridden method hashCode()or its default implementation is used. Additional transformations in the method hash():
```
static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```
Почему бы просто не вычислить с помощью hashCode()? Это сделано из-за того, что hashCode() можно реализовать так, что только нижние биты int'a будут заполнены. Например, для Integer, Float – если мы в HashMap кладем маленькие значения, то у них и биты хеш-codeов будут заполнены только нижние. В таком случае ключи в HashMap будут иметь тенденцию скапливаться в нижних ячейках, а верхние будут оставаться пустыми, что не очень эффективно. На то, в Howой бакет попадёт новая запись, влияют только младшие биты хеша. Поэтому и придумали различными манипуляциями подмешивать старшие биты хеша в младшие, чтобы улучшить распределение по бакетам (чтобы старшие биты родного хеша an object начали вносить коррективы в то, в Howой бакет попадёт an object) и, How следствие, производительность. Потому и придумана дополнительная функция hash внутри HashMap.
Вычисляем индекс бакета (ячейки массива), в который будет добавлен наш элемент:
```
i = (n - 1) & hash
```
где n – длина массива.
Создается an object Node.
Теперь, зная индекс в массиве, мы получаем список (цепочку) элементов, привязанных к этой ячейке. Если в бакете пусто, тогда просто размещаем в нем элемент. Иначе хэш и ключ нового element поочередно сравниваются с хешами и ключами элементов из списка и, при совпадении этих параметров, meaning element перезаписывается. Если совпадений не найдено, элемент добавляется в конец списка.

Теперь к очень подробному примеру.
1. Создадим an object класса HashMap:
```
HashMap < String, Integer> map = new HashMap<>();
```
2. С помощью метода put() добавим в хеш-отображение первую пару «ключ-meaning»:
```
map.put("KING", 100);
```
  В этот момент внутри вызывается метод putVal().
3. С помощью хеш-функции, роль которой играет метод hash, вычисляется хеш-code ключа, внутри которого предварительно вычисляется хеш-code с помощью метода hashCode() (в данном случае класса String), в результате чего мы получаем его «предварительное meaning» – 2306967. Может проверить в IDEA с помощью
```
System.out.println("KING".hashCode());
```
  Полученный хеш-code модифицируется по формуле: (хеш-code) ^ (хеш-code>>> 16), и в результате получаем окончательный хеш-code – 2306996.
4. Проверяем таблицу на «пустоту»:
```
if ((tab = table) == null || (n = tab.length) == 0)
```
  где [] tab — сама хеш-table: ссылаем tab и table (напомню, что это поле класса HashMap, которое хранит массив для наших элементов) на один и тот же массив, а int n – дополнительная переменная-счетчик.
  
  Так How проверка вернёт true (потому что массив для таблицы не был создан с помощью оператора new в конструкторе), будет вызван метод resize(), который собственно и создаст таблицу размером на 16 элементов. Да-да, конструкторы класса ниHowой таблицы не создают. Вместо этого всегда происходит вызов метода resize() при первом добавлении element. Длина созданной таблицы (считай длина массива) будет записана в переменную n – n = (tab = resize()).length, которая в дальнейшем используется для вычисления бакета.
5. At the same time, we calculate the index of the bucket where our object will be placed and check whether there are already elements in it. We calculate:
```
i = (n - 1) & hash
i = (16 - 1) & 2306996
i = 4
```
  check:
```
if ((p = tab[i = (n - 1) & hash]) == null)
```
6. Since as a result of the check we get true (there are no elements in the bucket), a Node object with the following fields will be generated:
```
{
int hash = 2306996 — сгенерированный хеш-code
String key = {"KING"} — ключ
Integer value = 100 — meaning
Node next = null — link на следующий узел
}
```
  Our generated Node object will be added to the bucket under index [4]:
```
tab[i] = newNode(hash, key, value, null);
tab[4] = newNode(2306996, “KING”, 100, null);
```
  newNode()is a method that simply returns an object of the Node class.
7. After adding, a check will be made to see if the current number of elements exceeds the parameter threshold:
```
if (++size > threshold)
    resize();
```
  If exceeds, then a method will be called resize()to increase the size of the hash table.
  
  At this point, the method putVal()(and , respectively put()) will complete its work.
  
  The result can be graphically represented as follows:
  
  In general, this is what adding a Node (key-value pair) to a hash table looks like if the desired bucket is empty . Now let's try to add an element that will lead to a collision (when there is more than one element in one bucket).

A little about collisions The situation when different keys end up in the same bucket (even with different hashes) is called a collision or collision. Even if the hash table is larger than the data set and a good hash function has been chosen, this does not guarantee that collisions will not occur. And the hash value is limited to the range of int values (about 4 billion). The resulting new value also needs to be written somewhere, and for this you need to determine where exactly it will be written. This is called conflict resolution. There are two approaches:

external chaining or chaining method (implemented in HashMap) - i.e. the cell actually contains a list (chain). And the list may already contain several values (not necessarily with the same hash code).
linear probing or open addressing method (implemented in IdentityHashMap) - consists of searching for the first empty cell after the one pointed to by the hash function;

You can read about collisions here: click

Using the method, put()we will add another key-value pair to the hash mapping:
```
map.put("BLAKE", 10);
```
We calculate the “preliminary hash” – 63281361. We modify it and as a result we get the final hash code – 63281940.
Since the first check for “emptiness” will now return false (there is no need to create a table), we immediately calculate the index of the bucket where our object will be placed:
```
i = (n - 1) & hash
i = (16 - 1) & 63281940
i = 4
```
The bucket at the specified index is checked for the presence of elements in it, and since the condition if ((p = tab[i = (n - 1) & hash]) == null)is not met in this case (there is already an element in the bucket), then we go to the block else.
First of all, we compare the element being added with the first element of the linked list inside the bucket:
```
(p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
```
При проверке сначала сравниваются хеши ключей. Если этот «участок» (p.hash == hash) возвращает false, тогда остальная часть условия игнорируется (&&), так How an objectы гарантированно разные. Иначе затем сравниваются ключи по ссылке (==) и в случае неequalsства, ключи сравниваются уже посредством метода equals(). Сравнение осуществляется в таком порядке во благо производительности. Если все три выражения возвращают true, это означает, что ключи равны и новый элемент будет записан в дополнительную переменную, чтобы в дальнейшем с её помощью перезаписать meaning у ключа:
```
if (e != null) { // existing mapping for key
      V oldValue = e.value;
      if (!onlyIfAbsent || oldValue == null)
      e.value = value;
      afterNodeAccess(e);
       return oldValue;
 }
```
В результате сравнения ключей мы получаем false уже на первом этапе (разные хеши).
Игнорируем condition (p instanceof TreeNode), так How структура в бакете не является древовидной на данном этапе.
Далее переходим в цикл for, где в пределах одного бакета проверяем у элементов указатель на следующий элемент next, и если он equals null (значит элемент в списке последний и единственный), добавляем новый элемент Node в конец списка:
```
if ((e = p.next) == null){
	p.next = newNode(hash, key, value, null)
... };
```
Вы можете спросить, а где же проверка на equalsство ключей? Мы же не можем просто взять и добавить новый элемент. Так вот она уже была заранее осуществлена в пункте (5). Благодаря этому, теперь мы можем просто проверить указатель этого element, и так How он сейчас equals null, можно сделать вывод о том, что в списке только один элемент. И так How мы уже проверяли их ключи, мы можем безболезненно добавлять новый элемент.

Если же при первой итерации указатель не equals null, это говорит о том, что в списке How минимум два element, поэтому в таком случае мы переходим к следующему условия и сравниваем ключ добавляемого element со всеми ключами элементов в списке (способом, описанным в пятом пункте).
```
if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
```
Если при сравнении ключей будет найдено совпадение, новый элемент будет записан в дополнительную переменную, чтобы в дальнейшем с её помощью перезаписать meaning у ключа.

После добавления второго element наш an object HashMap графически выглядит так:

В итоге, добавление элементов при коллизии (в пределах одного бакета) выглядит следующим образом:
- проверяем с помощью методов hashCode() и equals(), что оба ключа одинаковы.
- если ключи одинаковы, заменить текущее meaning новым.
- иначе связать новый и старый an objectы с помощью структуры данных "связный список", указав ссылку на следующий an object в текущем и сохранить оба под нужным индексом; либо осуществить переход к древовидной структуре
После каждой итерации (добавления нового element) в цикле for увеличивается счетчик, который отвечает за количество элементов в бакете:
```
for (int binCount = 0; ; ++binCount)
```
До тех пор, пока их количество не станет равно or больше 7:
```
binCount >= TREEIFY_THRESHOLD – 1
```
В таком случае произойдет вызов метода treeifyBin()treeify()для непосредственного построения древовидной структуры. Однако, если количество бакетов в текущей хеш-таблице меньше 64:
```
if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
```
Вместо перехода к древовидной структуре будет вызван метод resize() для увеличения размера хеш-таблицы с перераспределением элементов. treeify() в свою очередь связный список из TreeNode конвертирует в красно-черное дерево. Метод resize() перебирает все элементы текущего хранorща, пересчитывает их индексы (с учетом нового размера) и перераспределяет элементы по новому массиву.

Если кратко, не вдаваясь в подробности структуры красно-черного дерева, то происходит следующее:
1. Первый элемент списка записывается How корень всего дерева (чёрный цвет):
```
if (root == null) {
    x.parent = null;
    x.red = false;
    root = x;
}
```
2. Для остальных элементов:
  
  распределяем их налево or направо в зависимости от значения хешей:
```
if ((ph = p.hash) > h)
    dir = -1;
else if (ph < h)
    dir = 1;
```
  Все левые потомки должны быть меньше своего корневого узла (or равны ему), в то время How все правые потомки должны быть больше.
3. Если у двух элементов совпали хеши и их нельзя сравнить иным образом (не реализуют интерфейс Comparable), прерываем построение дерева и вызываем метод tieBreakOrder(), который в свою очередь использует нативный метод System.identityHashCode() для вычисления глобального уникального хеш-codeа.
  
  Подробнее здесь: link на статью
4. Проверяем элементы дерева (an objectы TreeNode) до тех пор, пока не будет найден дочерний (левый or правый) нулевой элемент.
```
if ((p = (dir <= 0) ? p.left : p.right) == null)
```
5. Добавляем дочерний узел (левый or правый в зависимости от dir):
```
x.parent = xp;
if (dir <= 0)
    xp.left = x;
else
    xp.right = x;
```
6. Так How при добавлении нового element может нарушиться баланс дерева, вызываем метод для перебалансировки:
```
root = balanceInsertion(root, x);
```
  Про балансировку КЧД можно почитать здесь: хабр
7. После балансировки корневой элемент может измениться. Исправляем это вызовом метода, который гарантирует, что переданный ему корень будет первым узлом:
```
moveRootToFront(tab, root)
```
  Как строится и самобалансируется красно-черное дерево можно наглядно посмотреть здесь: визуализация

That's all, in principle, and using an example, let's assume that we want to add the following names as keys: KING, MARY, JOSE, ANNA, FRED, TONY, ALEX, PEPE. And let’s say that we have at least 64 buckets in the hash table, and all these elements have accumulated in one bucket. Structurally, this bucket will look like this (elements are sorted by hash code): Type of CCHD:

Detailed analysis of the HashMap class - 6

View inside the bucket:

Detailed analysis of the HashMap class - 7

Retrieving elements (retrieving a value by key) Regarding the adding operation, it is quite simple. The algorithm (when there is a linked list in the bucket) can be written like this:

Calculate the hash code of the key.
Calculate the bucket index.
Go through the index and compare the key of the first element with the existing value. If they are equal, return the value, otherwise check for the next element, if it exists.
If the next Node object is null, return null.
If the next Node object is not null, go to it and repeat the first three steps until the element is found or the next Node object is null.

Using the method, get()we get the value for the “KING” key:


map.get("KING");

Inside, the method is called getNode(int hash, Object key), to which the key itself (“KING”) and its hash (2306996) are passed, which is pre-calculated in the same way as during the operation put().

We check:
1. does a hash table even exist:(tab = table) != null
  
  Let me remind you that when creating a HashMap, an array for the table is not created in the constructor; this happens later in the method resize(), which is always called when adding the first element to the hash table. Therefore, if no elements have been added to the HashMap, there is simply no internal array to store the elements.
2. if the previous expression returns true, you need to make sure that the length of the internal array is greater than 0:(n = tab.length) > 0;
3. if the previous expression also returns true, go to the bucket at the index (in our case 4), which was previously calculated, and check it for the presence of elements:
```
(first = tab[(n - 1) & hash]) != null)
```
4. We compare the key we are looking for with the key of the first element in the list inside the bucket, since in most buckets there will be (not everywhere we have collisions) only one element (our case). As in the case of the method put(), the hashes are compared, and if they match, the keys are compared by reference, and only then by equals().
```
if (first.hash == hash && // always check first node
    ((k = first.key) == key || (key != null && key.equals(k))))
```
  Since in our case, the key “KING” will precede the key “BLAKE” (within a linked list, new elements are added to the end, and KING was added first), we will stop at this point and return the first object of type Node to the get() method, which “snatch” from him a field with the value (100):
```
return (e = getNode(hash(key), key)) == null ? null : e.value;
```
If there is more than one element inside the bucket, then:
1. if the bucket is a linked list, we go through the list through each of the elements in a loop do – whileuntil a match is found:
```
do {
    if (e.hash == hash &&
        ((k = e.key) == key || (key != null && key.equals(k))))
        return e;
} while ((e = e.next) != null);
```
2. if the bucket is a tree structure, then the method is additionally called getTreeNode(), which in turn uses the method to find the required key find(). We carry out a tree search - the hashes are compared and the left or right root node to search is determined. If the keys are equal (by reference or by equals), return this node. If the left or right child nodes are null, we additionally compare the keys using compareTo (if the keys implement the interface Comparable), otherwise we perform a recursive search through the tree (right or left subtree) until a match is found.

Removing objects from a HashMap Since space in the article is running out, I will briefly describe how deletion occurs by key. The algorithm is very similar:

go to the desired bucket (again, it is pre-calculated);
if there is only one object in the bucket (we check its null pointer) we compare hashes, links and equals(if suddenly the hashes are not equal). Found a match? Great, this is our key - delete it (=null) and return the value of this key.
if there is more than one element in the bucket, we check each element in a loop until we find the element or reach the end of the list.
if the element was not found, we return null.

In the situation with a tree, there is a rather tricky implementation, which it is better not to know about and sleep soundly (the description of the method even says that the implementation is more complicated than in a regular deletion operation in a red-black tree). Moreover, when deleted, the number of nodes in one bucket can return to 6, and then the tree will be rebuilt back into a linked list. If you are not a developer with many years of experience, it is not at all necessary to know and understand this (and simply not necessary).

Jesse Haniel

Главный архитектор программного обеспечения at Tribunal de Justiça da Paraíba

Джесси - инженер-программист, увлеченный поиском решений, которые помогают людям жить лучше. Он работает с технологиями с 2008 год ... [Read full bio]

Comments (1)

TO VIEW ALL COMMENTS OR TO MAKE A COMMENT,
GO TO FULL VERSION

Toren Level 22

11 January 2024

Thorough material! Thanks. For anyone studying inner works of HashMap. A while ago I wrote a simple HashMap analysis app using Swing. Check it out: https://github.com/Tor3n/HashMapAnalize