Oh, those strings...

The java.lang.String class is perhaps one of the most used classes in Java. And very often it is used carelessly, which gives rise to many problems, above all with performance. In this article I want to talk about strings, the subtleties of using them, the sources of problems, and so on.

Here's what we'll talk about:

String internals
String literals
String comparison
String addition
Substring selection and copy constructor
Modifying a string

Let's start with the basics.

String internals

The java.lang.String class contains three fields:

/**
 * NOTE: This is just a partial API
 */
public final class String{

    private final char value[];
    private final int offset;
    private final int count;

}

In fact there are other fields too, such as the cached hash code, but they don't matter right now. These are the main ones.

So, a string is built on an array of characters (char). In memory the characters are stored in the Unicode UTF-16 encoding. Starting with Java 5.0, support was added for Unicode versions above 2 and, accordingly, for characters with codes above 0xFFFF. Such a character takes not one char but two (a surrogate pair).

Although these characters are supported, the catch is that you are unlikely to see them displayed. I found the block of musical symbols (starting at U+1D100) and tried to display the treble clef (the symbol with code U+1D120). I translated the code into two char values, as required: '\uD834' and '\uDD20'. The decoder does not complain about them and honestly recognizes them as a single character. But I could not find a font that contains this symbol, so it shows up as a square. And by the looks of it, that will stay so for a long time. So for now, support for Unicode 4 can only be regarded as groundwork for the future.

Let's move on. Please pay close attention to the second and third fields, offset and count. It would seem that the array alone fully defines the string, provided ALL of its characters are used. If such fields exist, then not all characters of the array have to be used. That is exactly the case, and we will come back to it in the part about substring selection and the copy constructor.
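Returning to the characters above 0xFFFF: the difference between the number of char values and the number of actual characters can be checked with a few standard methods. This is only a minimal sketch; the class name SurrogatePairDemo and the helper fromCodePoint are made up for the illustration, while Character.toChars, codePointCount and Character.isSurrogatePair are the standard API available since Java 5.

```java
// Hypothetical demo class (not from the original article).
final class SurrogatePairDemo {

    /** Builds a string from one code point; the result occupies one or two chars. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+1D120 is the musical symbol used in the text above.
        String clef = fromCodePoint(0x1D120);

        System.out.println(clef.length());                          // 2 (char values)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 (character)
        System.out.println(Integer.toHexString(clef.charAt(0)));    // d834
        System.out.println(Integer.toHexString(clef.charAt(1)));    // dd20
        System.out.println(
                Character.isSurrogatePair(clef.charAt(0), clef.charAt(1))); // true
    }
}
```

So length() counts char values rather than characters, which is worth remembering whenever such symbols can appear in the data.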

String literals

What is a string literal? It is a string enclosed in double quotes, such as "abc". Such expressions appear in code all the time. A literal can contain Unicode escape sequences, for example \u0410, which stands for the Russian letter 'A'. However, it CANNOT contain the sequences \u000A and \u000D, which correspond to the LF and CR characters. The reason is that Unicode escapes are processed at the very earliest stage of compilation, so these sequences are replaced by real LF and CR characters (as if someone had simply pressed Enter in the editor). To put these characters into a string, use the \n and \r sequences instead.

String literals are stored in the string pool. I mentioned the pool in the article Object Comparison: Practice, but I will repeat it here. The Java Virtual Machine maintains a pool of strings. It contains all string literals declared in the code. When literals are equal (in terms of equals), the same object from the pool is used. This saves a good deal of memory and in some cases improves performance.

The point is that a string can be placed into the pool explicitly with the String.intern() method. This method returns the string from the pool that is equal to the one on which it was called. If there is no such string, the one on which the method was called is put into the pool, and a reference to it is returned. So, with careful use of the pool, strings can be compared not by value through equals but by reference. That is faster: a reference comparison is a single check, whereas equals may have to compare the strings character by character. This is how, for example, the java.util.Locale class is implemented: it deals with lots of small, mostly two-character strings such as country and language codes. See also: Object Comparison: Practice, the String.intern method.

Very often I see constructions of the following form in various literature:

public static final String SOME_STRING = new String("abc");

To be more precise, it is new String("abc") that I object to. This construction is simply wrong. In Java a string literal, "abc", is ALREADY an object of class String. So calling the constructor on it only creates a COPY of the string. Since the string literal is already stored in the pool and is not going anywhere, the NEW object is nothing but a waste of memory. With a clear conscience this construction can be rewritten like this:

public static final String SOME_STRING = "abc";

As far as the rest of the code is concerned the result is exactly the same, only somewhat more efficient.
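The pool behaviour described in this section is easy to check by comparing references. The following is a small sketch; the class StringPoolDemo and the helper sameObject are hypothetical names introduced only for the illustration.

```java
// Hypothetical demo class (not from the original article).
final class StringPoolDemo {

    /** Compares references, not contents. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "abc";
        String another = "abc";           // equal literal: the same pooled object
        String copy = new String("abc");  // explicit copy: a new object outside the pool

        System.out.println(sameObject(literal, another));       // true
        System.out.println(sameObject(literal, copy));          // false
        System.out.println(literal.equals(copy));               // true
        System.out.println(sameObject(literal, copy.intern())); // true
    }
}
```

The second line of output is the wasted object discussed above: equal in content, but a separate allocation.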

String comparison

Actually, I have already written everything on this subject in the article Object Comparison: Practice, and there is nothing to add. To sum up what was said there: strings must be compared by value, using the equals method. They can be compared by reference too, but carefully, and only if you know exactly what you are doing. The String.intern method helps with that.

The only point I would like to mention is comparison with literals. I often see constructs like str.equals("abc"). There is a small pitfall here: before this comparison, str should be checked against null so as not to get a NullPointerException. That is, the correct construction would be str != null && str.equals("abc"). It can be simplified, though. It is enough to write "abc".equals(str). No null check is needed in this case.
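The two null-safe forms of the comparison behave identically, which can be shown with a short sketch. The class NullSafeEqualsDemo and both method names are assumptions made up for this example.

```java
// Hypothetical demo class (not from the original article).
final class NullSafeEqualsDemo {

    /** Literal first: equals(null) simply returns false, no exception. */
    static boolean isAbc(String str) {
        return "abc".equals(str);
    }

    /** The longer equivalent with an explicit null check. */
    static boolean isAbcVerbose(String str) {
        return str != null && str.equals("abc");
    }

    public static void main(String[] args) {
        System.out.println(isAbc(null));         // false
        System.out.println(isAbcVerbose(null));  // false
        System.out.println(isAbc("abc"));        // true
        System.out.println(isAbc("abd"));        // false
    }
}
```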

String addition

Strings are the only objects for which the addition operator is defined on references. At least that was so before Java 5.0, which introduced autoboxing and unboxing, but that is beside the point now. A general description of how the concatenation operator works can be found in the article on references. I want to go one level deeper. Imagine, imagine... Just like in the song about the grasshopper. :) So, imagine that we need to add two strings, or rather append one to another:

String str1 = "abc";
str1 += "def";

How does the addition work? Since a String object is immutable, the result of the addition is a new object. First, enough memory is allocated to hold the contents of both strings. The contents of the first string are copied into this memory, then the contents of the second. Next, the variable str1 is assigned a reference to the new string, and the old string is discarded. Let's make the task harder. Say we have a file with four lines:

abc
def
ghi
jkl

We need to read these lines and collect them into a single string. We proceed in the same way.

BufferedReader br = new BufferedReader(new FileReader("... filename ..."));
String result = "";
while(true){
    String line = br.readLine();
    if (line == null) break;
    result += line;
}

So far everything looks good and logical. Let's see what happens at the lower level.

First pass of the loop. result="", line="abc". Memory is allocated for 3 characters, and the contents of line, "abc", are copied there. The result variable is assigned a reference to the new string, and the old one is discarded.

Second pass of the loop. result="abc", line="def". Memory is allocated for 6 characters, the contents of result, "abc", are copied there, then the contents of line, "def". The result variable is assigned a reference to the new string, and the old one is discarded.

Third pass of the loop. result="abcdef", line="ghi". Memory is allocated for 9 characters, the contents of result, "abcdef", are copied there, then the contents of line, "ghi". The result variable is assigned a reference to the new string, and the old one is discarded.

Fourth pass of the loop. result="abcdefghi", line="jkl". Memory is allocated for 12 characters, the contents of result, "abcdefghi", are copied there, then the contents of line, "jkl". The result variable is assigned a reference to the new string, and the old one is discarded.

Fifth pass of the loop. result="abcdefghijkl", line=null. The loop ends.

So, the three characters "abc" were copied in memory 4 times, "def" 3 times, "ghi" 2 times and "jkl" once. Scary? Not really? Now imagine a file of about 1000 lines, each 80 characters long. Only 80 KB. Got the picture? What happens in this case? As you can easily work out, the first line will be copied in memory 1000 times, the second 999 times, and so on. With an average length of 80 characters, ((1000 + 1) * 1000 / 2) * 80 = ... drumroll ... 40,040,000 characters will pass through memory, and at two bytes per char that is about 80 MB (!!!). What is the outcome of this loop? Reading an 80 KB file caused 80 MB of memory to be allocated. No more and no less than 1000 times the useful volume.

What conclusion follows from this? A very simple one. Never, remember, NEVER use direct string concatenation, especially in loops. Even in a toString method, if it is called often enough, it makes sense to use StringBuffer instead of concatenation. In fact, the compiler usually does this itself when optimizing: it performs direct additions through StringBuffer. However, in cases like the one I showed, the compiler cannot perform this optimization, which leads to the very sad consequences described above. Unfortunately, such constructions are all too common. That is why I thought it necessary to dwell on this.

Personal experience

I can't help recalling one episode from my own practice. One of the programmers who worked with me once complained that his code was very slow. It read a fairly large file in HTML format and then performed some manipulations on it. Indeed, everything worked at a snail's pace. I took a look at the source and found that it... used string concatenation. He had 200-250 lines in each file, and when reading a file of about 200 KB, more than 40 MB passed through memory! In the end I rewrote the code a bit, replacing the operations on strings with operations on StringBuffer. To be honest, when I ran the rewritten code, I thought it had simply crashed somewhere. Processing took a fraction of a second. The speed increased by a factor of 300-800.
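The arithmetic above and the buffer-based fix can both be sketched in a few lines. This is an illustration under stated assumptions, not the code from the episode: the class ConcatCostDemo and its two methods are hypothetical names, the input comes from a StringReader instead of a real file, and StringBuilder (the unsynchronized counterpart of StringBuffer, available since Java 5) is used in place of the StringBuffer the text mentions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class (not from the original article).
final class ConcatCostDemo {

    /** Characters copied when `lines` strings of `length` chars are appended with +=. */
    static long charsCopiedByConcat(int lines, int length) {
        long total = 0;
        long resultLength = 0;
        for (int i = 0; i < lines; i++) {
            resultLength += length; // the new string holds everything read so far
            total += resultLength;  // and all of it is copied into fresh memory
        }
        return total;
    }

    /** Collects all lines into one string through a single growing buffer. */
    static String joinLines(Reader source) {
        try {
            BufferedReader br = new BufferedReader(source);
            StringBuilder result = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                result.append(line); // appends in place, no new String per pass
            }
            return result.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(charsCopiedByConcat(4, 3));     // 30
        System.out.println(charsCopiedByConcat(1000, 80)); // 40040000
        System.out.println(joinLines(new StringReader("abc\ndef\nghi\njkl\n"))); // abcdefghijkl
    }
}
```

The buffer still copies its contents when it has to grow, but it grows in large steps, so the total amount copied stays proportional to the size of the input rather than to its square.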

Substring selection and copy constructor

Imagine that we have a string from which we need to cut a substring. The question is not how to do it, that much is obvious. The question is what actually happens.

String str = "abcdefghijklmnopqrstuvwxyz";
str = str.substring(5,10);

Seemingly trivial code. The first thought is this: the substring "fghij" is selected, a reference to the new string is assigned to the variable str, and the old object is discarded. Right? Almost. The fact is that, to gain speed, substring selection uses the SAME ARRAY as the original string. In other words, we do not get an object in which the value array (see the section on string internals) has length 5 and contains the characters 'f', 'g', 'h', 'i' and 'j', with count=5 and offset=0. No, the array length is still 26, with count=5 and offset=5. And when the old string is discarded, the array is NOT discarded; it stays in memory because the new string still references it. It will remain in memory until the new string is discarded as well. This is a far from obvious point that can lead to memory problems.

The question arises: how can this be avoided? The answer is the String(String) copy constructor. In this constructor, memory is explicitly allocated for the new string, and the contents of the original are copied into it. So if we rewrite the code like this:

String str = "abcdefghijklmnopqrstuvwxyz";
str = new String(str.substring(5,10));

..., then the value array of the str object will indeed have length 5, with count=5 and offset=0. And this is the only case where using the copy constructor for a string is justified.
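The visible part of this behaviour, the resulting value and the fact that the copy is a separate object, can be checked with a short sketch. The class SubstringCopyDemo and the helper cut are hypothetical names. The internal fields cannot be inspected through the public API, and as far as I know newer JDKs (from Java 7 update 6 on) already copy the characters in substring, so the array sharing itself applies only to the older implementation described in this article.

```java
// Hypothetical demo class (not from the original article).
final class SubstringCopyDemo {

    /** Cuts a substring and detaches it from the source via the copy constructor. */
    static String cut(String source, int begin, int end) {
        return new String(source.substring(begin, end));
    }

    public static void main(String[] args) {
        String str = "abcdefghijklmnopqrstuvwxyz";
        String sub = str.substring(5, 10); // indexes 5..9, end index is exclusive
        String copy = new String(sub);     // explicit copy, as in the text above

        System.out.println(sub);              // fghij
        System.out.println(copy.equals(sub)); // true
        System.out.println(copy == sub);      // false
        System.out.println(cut(str, 5, 10));  // fghij
    }
}
```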

Modifying a string

This topic is only loosely related to strings as such. I just want to show that a string is immutable only up to a point. So, the code.

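package tests;

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

/**
 * This application demonstrates how to modify a java.lang.String object
 * through the reflection API.
 *
 * @version 1.0
 * @author Eugene Matyushkin
 */
public class StringReverseTest {

    /**
     * final static string that should be modified.
     */
    public static final String testString = "abcde";

    public static void main(String[] args) {
        try{
            System.out.println("Initial static final string:  "+testString);
            Field[] fields = testString.getClass().getDeclaredFields();
            Field value = null;
            // NOTE: the original listing breaks off inside the next line; everything
            // from here on is a reconstruction based on the description that follows.
            for(int i=0; i<fields.length; i++){
                if (fields[i].getType().equals(char[].class)){ // the character array
                    value = fields[i];
                    break;
                }
            }
            if (value == null) return; // no char[] field in this String implementation
            value.setAccessible(true); // switch off the access check for the private field
            char[] chars = (char[])value.get(testString);
            for(int i=0; i<chars.length/2; i++){ // reverse the array in place
                char tmp = chars[i];
                chars[i] = chars[chars.length-1-i];
                chars[chars.length-1-i] = tmp;
            }
            // testString is printed on its own: a "..."+testString expression would be
            // folded into a separate constant at compile time
            System.out.print("Reversed static final string: ");
            System.out.println(testString);
        }catch(Exception ex){
            ex.printStackTrace();
        }
    }
}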

What's going on here? First I look for a field of type char[]. I could search by name as well. However, the name may change, whereas the type, I strongly doubt it. Next, I call setAccessible(true) on the field I found. This is the key point: I switch off the access check for the field (otherwise I simply could not change the value, because the field is private). At this point the security manager may slap my wrist, since it checks whether such an action is permitted (through a call to checkPermission(new ReflectPermission("suppressAccessChecks"))). If it is permitted (and for ordinary applications it is by default), I get access to the private field. The rest, as they say, is a matter of technique. As a result, I get the output:

Initial static final string:  abcde
Reversed static final string: edcba

Q.E.D. So in real applications I advise you to set up the security policy more carefully. Otherwise it may turn out that objects you consider guaranteed to be immutable are not.

* * *

I guess that's all I wanted to say about strings for now. Thank you for your attention!

Link to the original source: Oh, those strings...
