
ASCII text encoding (Windows 1251, CP866, KOI8-R) and Unicode (UTF 8, 16, 32) - how to fix the problem with krakozyabry

Today we will talk about where krakozyabry come from on websites and in programs, what text encodings exist, and which ones should be used. Let's take a closer look at the history of their development, starting with basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern encodings of the Unicode Consortium, UTF-16 and UTF-8. To some this information may seem redundant, but you would be surprised how many questions I get specifically about krakozyabry (an unreadable set of characters). Now I will be able to refer everyone to the text of this article, so that they can find their own mistakes. Well, get ready to absorb the information, and try to follow the course of the story.

ASCII - the basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and over that time they managed to change quite a lot. Historically it all started with EBCDIC, which could encode Latin letters, Arabic numerals, punctuation marks and control characters. But the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, usually pronounced "aski"). It describes the first 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks. These 128 characters also include service characters like brackets, bars, asterisks and so on. It is these 128 characters from the original version of ASCII that became the standard, and in any other encoding you will definitely meet them, standing in that same order.

But one byte of information can encode not 128 but as many as 256 different values (two to the power of eight equals 256), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, besides the 128 basic characters, it was also possible to encode characters of a national alphabet (for example, Russian).

Here it is probably worth saying a little more about the number systems used in these descriptions. Firstly, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", for those who studied it at an institute or at school). One byte consists of eight bits, each of which represents a power of two, starting from two to the zero and going up to two to the seventh. So there can be only 256 possible combinations of zeros and ones in such a construction. Converting a number from binary to decimal is quite simple: you just add up all the powers of two over which there are ones. For example, take the byte 1110 1001: that is 1 (two to the power of zero) plus 8 (two to the third), plus 32 (two to the fifth), plus 64 (two to the sixth), plus 128 (two to the seventh), for a total of 233 in the decimal number system. As you can see, everything is very simple.

But if you take a closer look at a table of ASCII characters, you will see that they are represented in hexadecimal notation. For example, the asterisk corresponds in ASCII to the hexadecimal number 2A. You probably know that besides Arabic numerals, the hexadecimal number system also uses the Latin letters from A (meaning ten) to F (meaning fifteen). Converting a binary number to hexadecimal is done with the following simple method: each byte of information is divided into two halves of four bits. Each half-byte can encode only sixteen values in binary code (two to the fourth power), so it can easily be written as a single hexadecimal digit. Note that in the left half of the byte the powers are counted again starting from zero, not continuing from the right half. As a result, our byte 1110 1001 is written as the hexadecimal number E9: the left half 1110 gives E (fourteen), and the right half 1001 gives 9. I hope that the course of my reasoning and the solution of this puzzle turned out to be clear to you. Well, now let's continue talking, in fact, about text encodings.
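Since we are on JavaRush, here is a minimal Java sketch of the same arithmetic (the byte value is simply the example from above):

public class BinaryToHexDemo {
    public static void main(String[] args) {
        // The byte from the example above: 1110 1001
        int value = Integer.parseInt("11101001", 2);
        System.out.println(value); // 233 in decimal

        // Split the byte into two half-bytes (nibbles) of four bits each
        int highNibble = value >> 4;   // 1110 -> 14 -> E
        int lowNibble  = value & 0x0F; // 1001 ->  9 -> 9
        System.out.println(Integer.toHexString(highNibble)); // e
        System.out.println(Integer.toHexString(lowNibble));  // 9

        // Or convert the whole byte at once
        System.out.println(Integer.toHexString(value).toUpperCase()); // E9
    }
}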

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8). Initially it contained only 128 characters: the Latin alphabet, Arabic numerals and a few other things. But in the extended versions it became possible to use all 256 values that can be encoded in one byte of information, i.e. it became possible to add the letters of your own language to ASCII.

Here it is worth digressing once more to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector shapes (representations) of all kinds of characters (they live in the font files installed on your computer), and a code that allows you to pull out of this set of vector shapes (the font file) exactly the character that needs to be inserted in the right place. It is clear that the fonts are responsible for the vector shapes themselves, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text. The program that displays this text on the screen (a text editor, a browser, etc.), while parsing the code, reads the encoding of the next character and looks up the corresponding vector shape in the font file connected to display this text document. Everything is simple and banal.

This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must exist in the font used, and the character must be encodable in one byte in an extended ASCII encoding. That is why a whole bunch of such variants exists: for encoding the characters of the Russian language alone there are several varieties of extended ASCII.

For example, CP866 appeared first, which made it possible to use the characters of the Russian alphabet, and it was an extended version of ASCII. That is, its first half completely coincided with the basic version of ASCII (128 Latin characters, numbers and other stuff), while the second half of the CP866 table allowed another 128 characters to be encoded (Russian letters and all sorts of pseudographics). In the CP866 table, the row labels of the second half start with 8, because the codes from 0 to 7F belong to the basic ASCII part.
Thus, the Cyrillic letter "М" in CP866 has the code 8C (it sits at the intersection of row 8 and column C in the hexadecimal table), which can be written in one byte of information; and if there is a suitable font with Russian characters, this letter will be displayed in the text without problems.

Where did so much pseudographics in CP866 come from? The thing is that this encoding for Russian text was developed back in those distant years when graphical operating systems were not as widespread as they are now. And in DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the design of texts, so CP866 and all its peers from the category of extended versions of ASCII abound in it.

CP866 was distributed by IBM, but besides it, a number of other encodings were developed for Russian characters; for example, KOI8-R can be attributed to the same type (extended ASCII). Its principle of operation remains the same as that of CP866 described a little earlier: each character of the text is encoded with one single byte. The first half of the KOI8-R table fully corresponds to basic ASCII. Among the peculiarities of the KOI8-R encoding is that the Cyrillic letters in its table are not in alphabetical order, as was done in CP866: in KOI8-R, the Russian letters are located in the same cells of the table as the Latin letters consonant with them from the first half of the table. This was done for the convenience of switching from Russian to Latin characters by discarding just one bit (two to the seventh power, or 128).
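A small Java sketch to illustrate the one-byte nature of these encodings (it assumes your JDK build ships the extended charsets, where CP866 is registered under the name "IBM866"):

import java.nio.charset.Charset;

public class Cp866Demo {
    public static void main(String[] args) {
        // "IBM866" is the JDK name for the CP866 code page
        Charset cp866 = Charset.forName("IBM866");

        byte[] bytes = "М".getBytes(cp866);     // Cyrillic capital letter М
        System.out.println(bytes.length);       // 1 - one byte per character
        System.out.printf("%02X%n", bytes[0]);  // 8C - the code from the table

        // Decoding the same byte with the same charset gives the letter back
        System.out.println(new String(bytes, cp866)); // М
    }
}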

Windows 1251 - a modern variation of extended ASCII, and why krakozyabry appear

The further development of text encodings was driven by the fact that graphical operating systems were gaining popularity, and the need for pseudographics in them eventually disappeared. As a result, a whole group of encodings arose which, in essence, were still extended versions of ASCII (one character of text is encoded with exactly one byte of information), but without the use of pseudographic characters. They belong to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name Cyrillic was also used for the variant with Russian language support. An example is Windows 1251. It compared favorably with the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (except the accent mark), as well as by the symbols used in the Slavic languages close to Russian (Ukrainian, Belarusian, etc.).

Because of such an abundance of Russian-language encodings, font manufacturers and software manufacturers constantly had headaches, while we, dear readers, often got those very notorious krakozyabry whenever there was confusion about which version was used in a text. Very often they came out when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables that, in fact, could not solve the problem at its root; to avoid the notorious krakozyabry, users often resorted to transliteration in Latin letters for correspondence instead of using Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the gibberish that popped up instead of Russian text was the result of using the wrong encoding for this language: one that did not match the encoding in which the text message was originally encoded. For example, if you try to display characters encoded with CP866 using the Windows 1251 code table, you will get exactly this kind of gibberish completely replacing the text.

A similar situation very often arises when creating and configuring websites, forums or blogs, when text with Russian characters is mistakenly saved in an encoding different from the one used on the site by default, or in the wrong text editor, which adds garbage invisible to the naked eye into the code.

In the end, many people got tired of this situation with the multitude of encodings and the ever-appearing krakozyabry, and the prerequisites emerged for creating a new universal variation that would replace all the existing ones and solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.
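That mismatch is easy to reproduce in a couple of lines of Java (again assuming a JDK with the extended charsets IBM866 and windows-1251 available):

import java.nio.charset.Charset;

public class KrakozyabryDemo {
    public static void main(String[] args) {
        Charset cp866 = Charset.forName("IBM866");
        Charset win1251 = Charset.forName("windows-1251");

        String text = "Привет"; // "Hello" in Russian
        byte[] encoded = text.getBytes(cp866); // written in CP866...

        System.out.println(new String(encoded, win1251)); // ...read as Windows 1251: krakozyabry
        System.out.println(new String(encoded, cp866));   // read correctly: Привет
    }
}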

Unicode - the universal encodings UTF-8, 16 and 32

The thousands of characters of the Southeast Asian language group could not possibly be described in the single byte of information allotted for encoding characters in the extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the cooperation of many leaders of the IT industry (those who produce software, who encode hardware, who create fonts), who were interested in the emergence of a universal text encoding.

The first variation published under the auspices of the Unicode Consortium was UTF-32. The number in the name of the encoding is the number of bits used to encode one character. 32 bits is the 4 bytes of information needed to encode one single character in the new universal UTF encoding. As a result, the same text file encoded in an extended version of ASCII and in UTF-32 will in the latter case have a size (weight) four times larger. That is bad, but now we have the ability to encode a number of characters equal to two to the thirty-second power (billions of characters, which covers any really necessary value with a huge margin).

But many countries with languages of the European group did not need to use such a huge number of characters at all; when using UTF-32, they got a fourfold increase in the weight of text documents for nothing, and as a result, an increase in the volume of Internet traffic and the amount of stored data. That is a lot, and nobody could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared, which turned out to be so successful that it was adopted as the base space for all the characters we use. It uses two bytes to encode one character. Let's see what this thing looks like. In the Windows operating system, you can go along the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table with the vector shapes of all the fonts installed in your system will open. If you select the Unicode character set in the "Advanced view" options, you can see, for each font individually, the entire range of characters included in it. By clicking on any of them, by the way, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits.

How many characters can be encoded in UTF-16 with 16 bits? 65536 (two to the power of sixteen), and it was this number that was adopted as the base space in Unicode. In addition, there are ways to encode about a million more characters with it (using pairs of 16-bit units), but the extended space is limited to just over a million characters of text.

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, say, programs only in English, because for them, after the transition from the extended version of ASCII to UTF-16, the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16). It was precisely to satisfy everyone and everything that the Unicode Consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in the name, it really has a variable length: every character of text can be encoded into a sequence of one to four bytes (the original design allowed up to six, but in practice only the range from one to four bytes is used, since four bytes already cover the entire Unicode code space).
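Here is a minimal Java sketch comparing the weight of the same Latin string in the three encodings (the UTF-16BE variant is used to keep the byte order mark out of the count, and "UTF-32" is an extended JDK charset):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfWeightDemo {
    public static void main(String[] args) {
        String latin = "Hello";
        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);    // 5  - one byte each
        System.out.println(latin.getBytes(StandardCharsets.UTF_16BE).length); // 10 - two bytes each
        System.out.println(latin.getBytes(Charset.forName("UTF-32")).length); // 20 - four bytes each
    }
}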
All Latin characters in UTF-8 are encoded in one byte, just like in good old ASCII. Remarkably, if only Latin characters are encoded, even programs that do not understand Unicode will still read what is encoded in UTF-8. That is, the basic part of ASCII simply passed into this brainchild of the Unicode Consortium. Cyrillic characters in UTF-8 are encoded in two bytes, and, for example, Georgian characters in three bytes.

Thanks to Unicode, fonts now have a single code space, and their manufacturers can only fill it with the vector shapes of text characters according to their strengths and capabilities. In the "Character Map" mentioned above, you can see that different fonts support different numbers of characters; some Unicode-rich fonts can be very heavy. But now they differ not in having been created for different encodings, but in how completely the font manufacturer has filled the single code space with this or that vector shape.
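The variable length is easy to verify in Java; the sample letters below are just arbitrary picks from each alphabet:

import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    public static void main(String[] args) {
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length); // 1 - Latin
        System.out.println("Я".getBytes(StandardCharsets.UTF_8).length); // 2 - Cyrillic
        System.out.println("ღ".getBytes(StandardCharsets.UTF_8).length); // 3 - Georgian
    }
}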

Krakozyabry instead of Russian letters - how to fix it

Let's now see how krakozyabry appear instead of text, or, in other words, how the correct encoding is chosen for Russian text. Actually, it is set in the program in which you create or edit this very text, or code that includes text fragments.

For editing and creating text files, I personally use what is, in my opinion, a very good HTML and PHP editor: Notepad++. It can highlight the syntax of a good hundred programming and markup languages, and it can also be extended with plugins. In the top menu of Notepad++ there is an "Encoding" item, where you can convert an existing variant to the one used on your site by default. In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, you should choose the option UTF-8 without BOM to avoid the appearance of krakozyabry.

What is this BOM prefix? The thing is that when the UTF-16 encoding was being designed, it was decided to allow writing a character's code both in direct byte order (for example, 0A15) and in reverse (150A). And so that programs could understand in which order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented: a couple of extra bytes added to the very beginning of a document. UTF-8 has a fixed byte order and needs no BOM at all, so adding a signature to it (those notorious three extra bytes EF BB BF at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we must always choose the option without BOM (without signature). This way you protect yourself in advance from crawling krakozyabry.

Remarkably, some programs in Windows cannot do this (cannot save text in UTF-8 without BOM), for example, the same notorious Windows Notepad. It saves the document in UTF-8, but still adds the signature (three extra bytes) to the beginning of it, and these bytes are always the same. On servers, this little thing can turn into a problem: krakozyabry come out. Therefore, never use the plain Windows Notepad to edit the documents of your site if you do not want krakozyabry to appear. I consider the already mentioned Notepad++ editor to be the best and simplest option, one that has practically no drawbacks.

In Notepad++, when you select an encoding, you will also have the option to convert text to the UCS-2 encoding (essentially the fixed two-byte predecessor of UTF-16), which is inherently very close to the Unicode standard. Notepad++ can also encode text in ANSI, which, in relation to the Russian language, means the Windows 1251 already described a little above. Where does this information come from? It is written in the registry of your Windows operating system: which encoding to choose in the case of ANSI and which in the case of OEM (for the Russian language it will be CP866). If you set a different default language on your computer, these encodings will be replaced with the corresponding ANSI or OEM encodings for that language.
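If a file with a signature has already sneaked onto your server, you can strip it yourself; here is a minimal Java sketch of such a helper (the class and method names are made up for this example):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomStripper {
    // The UTF-8 signature: EF BB BF
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    public static String readWithoutBom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        int offset = startsWithBom(bytes) ? UTF8_BOM.length : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    private static boolean startsWithBom(byte[] bytes) {
        return bytes.length >= 3
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2];
    }

    public static void main(String[] args) throws IOException {
        // Simulate a file saved by Windows Notepad: signature + UTF-8 text
        Path file = Files.createTempFile("bom-demo", ".txt");
        byte[] text = "Привет".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[UTF8_BOM.length + text.length];
        System.arraycopy(UTF8_BOM, 0, withBom, 0, UTF8_BOM.length);
        System.arraycopy(text, 0, withBom, UTF8_BOM.length, text.length);
        Files.write(file, withBom);

        System.out.println(readWithoutBom(file)); // Привет - signature stripped
    }
}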
After you save a document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see the name of the encoding in the lower right corner of the editor. To avoid krakozyabry, in addition to the actions described above, it is also useful to write information about the encoding into the source code header of all pages of the site, so that there is no confusion on the server or the local host. In all hypertext markup languages except HTML, a special xml declaration is used to specify the text encoding:
<?xml version="1.0" encoding="windows-1251"?>
Before parsing the code, the browser thus knows which version is being used and how exactly the character codes of that language should be interpreted. Remarkably, if you save the document in the default Unicode encodings, this xml declaration can be omitted (the encoding will be assumed to be UTF-8 if there is no BOM, or UTF-16 if there is one). In the case of an HTML document, the Meta element is used to specify the encoding; it is written between the opening and closing Head tags:
<head>
...
<meta charset="utf-8">
...
</head>
This notation differs quite a bit from the one adopted in the HTML 4.01 standard, but it fully complies with the HTML5 specification and will be correctly understood by any browser currently in use. In theory, the Meta element with the encoding of the HTML document is better placed as high as possible in the document header, so that by the moment the text meets its first character outside the basic ASCII range (which will always be read correctly in any variation), the browser already has the information on how to interpret the codes of those characters.