Joysi
Level 41

Byte-by-byte work with files

Published in the Random EN group
Let's get started

At level 18, the first byte-by-byte file reading tasks appear: read a file, then find the minimum/maximum byte, or output its contents in sorted order, and so on.
People here are smart: they know about collections and that collections can sort and insert things. Collections are a powerful mechanism, and many never used them at all before JavaRush. Studying them is commendable, of course; trying to apply them where they don't belong is another matter. So, let's take a problem that is not among the course tasks (to avoid spoilers) but is very similar to them:
  • Read the file name from the console.
  • Read all bytes from the file.
  • Ignoring duplicates, sort them by byte value in descending order.
  • Display the result.
  • Close the I/O stream.
Example input file bytes: 44 83 44. Example output: 83 44. We additionally introduce the variables startTime and finishTime to record the program's execution time. For the measurements I used an i3-3GHz / 8Gb RAM / HDD WD Blue 1Tb / Win7-64 / jdk-8u73-windows-x64 machine (the example programs in variants 1-2 are taken from the info.javarush forum, slightly modified only so the sort order matches the task — that is, they are REAL!!).

Let's solve it head-on:

// Variant 1. Load everything into a collection and sort it with Collections.sort
import java.io.*;
import java.util.*;

public class Solution {
    public static void main(String[] args) throws Exception {
        FileInputStream inputStream = new FileInputStream(new BufferedReader(new InputStreamReader(System.in)).readLine());
        long startTime = System.currentTimeMillis();

        ArrayList<Integer> listData = new ArrayList<Integer>();
        while (inputStream.available() > 0) listData.add(inputStream.read());
        inputStream.close();

        // Drop duplicates via a HashSet, then sort ascending
        ArrayList<Integer> result = new ArrayList<Integer>(new HashSet<Integer>(listData));
        Collections.sort(result);

        // Print from the end of the list to get descending order
        while (!result.isEmpty()) {
            System.out.print(result.get(result.size() - 1) + " ");
            result.remove(result.size() - 1);
        }

        long finishTime = System.currentTimeMillis();
        System.out.println("\nruntime=" + (finishTime - startTime) + "ms.");
    }
}
It solves everything just fine! The test (if there were one) would pass with a bang. But in real life few files contain only the line "Mom washed the frame." Let's feed our program a 46MB file (not that big by today's standards). What's this? The program runs for 220 seconds. An evening attempt to feed it a 1Gb file (roughly the size of a mediocre-quality MPEG4 movie) failed: in the morning the program was still reading, and I had to leave for work. What's the problem? Probably the ArrayList<Integer> holding a billion elements. Each element takes at least 16 bytes (object header: 8 bytes + int field: 4 bytes + alignment to a multiple of 8: 4 bytes). In total we voluntarily shove 16Gb of data into 8Gb of RAM. We can do better. Let's dive deeper into the collections. And hurray, we find exactly what we need.
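The 16Gb estimate above can be double-checked with simple arithmetic (a sketch of mine, not part of the original programs; the 16-byte-per-Integer figure assumes a typical 64-bit JVM and ignores the ArrayList's own array of references):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long elements = 1_000_000_000L;   // one Integer object per byte of a 1Gb file
        long bytesPerInteger = 8 + 4 + 4; // header + int field + alignment padding
        long totalGb = elements * bytesPerInteger / 1_000_000_000L;
        System.out.println(totalGb + "Gb of objects vs 8Gb of RAM");
    }
}
```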

Meet TreeSet

This set:
  • does not store two identical elements (which means we keep at most 256 elements in memory instead of a billion!)
  • keeps its elements ordered automatically as we manipulate them (it sorts itself — the height of perfection!)
We get:
// Variant 2. Load everything into a TreeSet, which sorts itself (an epic win!)
import java.io.*;
import java.util.*;

public class Solution {
    public static void main(String[] args) throws Exception {
        FileInputStream inputStream = new FileInputStream(new BufferedReader(new InputStreamReader(System.in)).readLine());

        long startTime = System.currentTimeMillis();

        // TreeSet both deduplicates and keeps its elements sorted
        SortedSet<Integer> list = new TreeSet<Integer>();
        while (inputStream.available() > 0) list.add(inputStream.read());
        inputStream.close();

        // Take elements from the largest down
        while (!list.isEmpty()) {
            System.out.print(list.last() + " ");
            list.remove(list.last());
        }

        long finishTime = System.currentTimeMillis();
        System.out.println("\nruntime=" + (finishTime - startTime) + "ms.");
    }
}
The output: the 46MB file in 176 seconds; the 1Gb file in 3 hours 5 minutes. Progress is obvious: we managed to "wait out" the result, and the 46MB file is processed noticeably faster. Onward. Let's try giving up collections (for some, this will be excruciatingly painful). We'll use plain arrays (so primitive). Note one important thing: the count of each byte value encountered fits into an array of length 256. So we will simply increment the array element corresponding to each byte we read.

Array - byte by byte

// Variant 3. Read the file byte by byte, counting occurrences in an array
import java.io.*;

public class Solution {
    public static void main(String[] args) throws Exception {
        FileInputStream inputStream = new FileInputStream(new BufferedReader(new InputStreamReader(System.in)).readLine());

        long[] arrBytes = new long[256];
        long startTime = System.currentTimeMillis();

        while (inputStream.available() > 0) arrBytes[inputStream.read()]++;
        inputStream.close();

        // Print the byte values that occurred, in descending order
        for (int i = 255; i >= 0; i--)
            if (arrBytes[i] > 0) System.out.print(i + " ");

        long finishTime = System.currentTimeMillis();
        System.out.println("\nruntime=" + (finishTime - startTime) + "ms.");
    }
}
The output: the 46MB file in 158 seconds; the 1Gb file in 2 hours 55 minutes. An improvement again, but a small one. And we did everything with simple tools — no driving nails with a microscope. Now a lyrical digression. Recall the structure of a computer. RAM (DRAM), where the program usually executes and variables are stored, has a high access speed but a small size. Memory on a hard drive or flash drive (HDD or flash), where files are usually stored, has, on the contrary, a low access speed but a large size. So when we read a 1Gb file byte by byte (that is, we access the HDD a billion times), we spend most of our time working with the slow device — carrying sand grain by grain from the bed of a KamAZ dump truck into the sandbox. Let's try to improve this further.

Let's dump out the ENTIRE KAMAZ truck with sand at once!

// Variant 4. Read the whole file into memory at once
import java.io.*;

public class Solution {
    public static void main(String[] args) throws Exception {
        FileInputStream inputStream = new FileInputStream(new BufferedReader(new InputStreamReader(System.in)).readLine());

        long[] arrBytes = new long[256];
        long startTime = System.currentTimeMillis();

        byte[] fileImage = new byte[inputStream.available()];
        long fileSize = fileImage.length;
        inputStream.read(fileImage);
        inputStream.close();

        // Count occurrences; & 0b11111111 maps the signed byte to 0..255
        for (int i = 0; i < fileSize; i++) arrBytes[fileImage[i] & 0b11111111]++;

        // Print the byte values that occurred, in descending order
        for (int i = 255; i >= 0; i--)
            if (arrBytes[i] > 0) System.out.print(i + " ");

        long finishTime = System.currentTimeMillis();
        System.out.println("\nruntime=" + (finishTime - startTime) + "ms.");
    }
}
A small but, again, important digression. Note:
  1. an arrBytes index must lie within 0..255,
  2. fileImage is a byte array whose elements hold values -128..127.
Therefore, to count bytes we use the construction arrBytes[fileImage[i] & 0b11111111]++, which masks off the sign extension and yields a value in the range 0..255. And now the results: the 46MB file in 0.13 seconds (less than a second); the 1Gb file in 9 seconds. We did it! We are incredibly cool: we sped up from 3 hours to 9 seconds. Now you can sit back in your chair and drink some tea. And here is another experiment: let's try a 32Gb file (say, an HD movie). The result is a crackling sound from the working HDD and the program crashing Windows. The KamAZ dumped its whole load and broke the sandbox! What do we do? Recall one more fact: files in the OS are usually stored in portions (clusters) of 2-64Kb (depending on the file system type, settings, etc.). So we will read in portions too, for example 64,000 bytes at a time. Let's try unloading the KamAZ with an excavator, in reasonably large scoops:

Using a buffer

// Variant 5. Read the file in chunks through a buffer
import java.io.*;

public class Solution {
    public static void main(String[] args) throws Exception {
        FileInputStream inputStream = new FileInputStream(new BufferedReader(new InputStreamReader(System.in)).readLine());

        long[] arrBytes = new long[256];
        long startTime = System.currentTimeMillis();

        int  bufferSize = 64000;
        byte[] buffer   = new byte[64000];

        while (inputStream.available() > 0) {
            // Shrink the last read to the number of bytes actually left
            if (inputStream.available() < 64000) bufferSize = inputStream.available();
            inputStream.read(buffer, 0, bufferSize);
            // Count occurrences in the chunk we just read
            for (int i = 0; i < bufferSize; i++) arrBytes[buffer[i] & 0b11111111]++;
        }
        inputStream.close();

        // Print the byte values that occurred, in descending order
        for (int i = 255; i >= 0; i--)
            if (arrBytes[i] > 0) System.out.print(i + " ");

        long finishTime = System.currentTimeMillis();
        System.out.println("\nruntime=" + (finishTime - startTime) + "ms.");
    }
}
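Variants 4 and 5 both depend on masking the signed byte before using it as an index. A tiny sketch (mine, not from the original listings) shows why the & is needed:

```java
public class ByteMasking {
    public static void main(String[] args) {
        byte b = (byte) 200;          // Java bytes are signed: 200 is stored as -56
        System.out.println(b);        // prints -56
        System.out.println(b & 0xFF); // prints 200 - a valid index into arrBytes
        // Without the mask, arrBytes[b] would throw ArrayIndexOutOfBoundsException
        // for any byte value above 127.
    }
}
```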
As a result we got: the 46MB file in 0.08 seconds (less than a second); the 1Gb file in 0.9 seconds (less than a second); the 32Gb file in 31 seconds. Note that for the 1Gb file we improved the running time from several hours to fractions of a second!!! On this modest note we finish experimenting and improving the initial code. We made progress on many fronts: we are pleased with the new memory-consumption and running-time figures, and we no longer drag in collections from the standard library that are useless here. P.S. Someone will say the example is far-fetched, etc. But there are plenty of similar tasks: analyzing a huge volume of elements that have a finite number of states. For example, images (an RGB pixel is usually stored in 24 bits; with our approach, long[] arrRGB = new long[256*256*256] would take only 128MB of memory), sound (the amplitude is usually digitized to 16 or 24 bits), or readings from discrete sensors, etc.
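To illustrate the point about images, here is a hedged sketch of the same counting-array idea applied to 24-bit RGB pixels. The pixel data and the index-packing scheme are made up for illustration; a real program would read pixels from an image file:

```java
public class RgbHistogram {
    public static void main(String[] args) {
        // One counter per possible 24-bit color: 256^3 longs = 128Mb,
        // regardless of how large the image is.
        long[] counts = new long[256 * 256 * 256];

        // Hypothetical pixels (R, G, B) instead of a real image file.
        int[][] pixels = { {255, 0, 0}, {255, 0, 0}, {0, 128, 255} };
        for (int[] p : pixels) {
            int index = (p[0] << 16) | (p[1] << 8) | p[2]; // pack R,G,B into one index
            counts[index]++;
        }

        System.out.println(counts[255 << 16]); // prints 2: two pure-red pixels
    }
}
```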