XML Basics for a Java Programmer - Part 1 of 3

XML is a popular and flexible format, and every programmer should understand it. Many technologies, including modern ones, actively use it.

Introduction

Hello, dear readers. This is the first article in a series of three. The goal of the series is to introduce every reader to XML and to give, if not a complete understanding, then at least a solid push towards one by explaining the main points. The whole series is entered in one nomination, “Attention to detail”. It is split into three articles to fit within the character limit for posts and to break a large amount of material into smaller portions that are easier to absorb. This first article is devoted to XML itself and to one of the ways of creating a schema for XML files: DTD.

A short preface for those who are not yet familiar with XML: there is no need to be scared. XML is not very complicated, and any programmer should understand it, because it is a flexible and popular file format for storing all kinds of information. XML is used in Ant, Maven and Spring. Now that you have gathered your strength and motivation, let's start. I will try to present the material as simply as possible, covering only the most important points and not going into the weeds.

XML

XML is easiest to explain with an example.

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <name>IT-Heaven</name>
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee>
                    <name>Maksim</name>
                    <job>Middle Software Developer</job>
                </employee>
                <employee>
                    <name>Ivan</name>
                    <job>Junior Software Developer</job>
                </employee>
                <employee>
                    <name>Franklin</name>
                    <job>Junior Software Developer</job>
                </employee>
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee>
                    <name>Herald</name>
                    <job>Middle Software Developer</job>
                </employee>
                <employee>
                    <name>Adam</name>
                    <job>Middle Software Developer</job>
                </employee>
                <employee>
                    <name>Leroy</name>
                    <job>Junior Software Developer</job>
                </employee>
            </employees>
        </office>
    </offices>
</company>

HTML and XML have similar syntax because they share a common parent, SGML. However, HTML has only a fixed set of tags defined by a standard, while in XML you can create your own tags and attributes and store data in whatever way suits you. Because the tags are ordinary descriptive words, an XML file like this one can be read by anyone who knows English. This example can be depicted as a tree.

[Image: the example above drawn as a tree]

The root of the tree is company. It is also the root element, from which all other elements descend. Every XML file can have only one root element. It must come after the XML declaration (the first line in the example) and contain all other elements.

A little about the declaration: it identifies the document as XML. In XML 1.0 it is technically optional, but it is recommended, and if present it must be the very first thing in the file. It has three pseudo-attributes (special predefined attributes): version (1.0 here, matching the standard), encoding, and standalone (whether the document is self-contained: with yes, a validating parser reports an error if the document relies on external declarations, for example for entities or default attribute values; if standalone is omitted, no is assumed).

Elements are entities that store data using other elements and attributes. Attributes are additional information about an element, specified in its opening tag. To put it in OOP terms: we have a car, and each car has characteristics (color, capacity, brand and so on), which are its attributes, and there are entities inside the car (doors, windows, engine, steering wheel), which are other elements. You can store properties either as child elements or as attributes, as you prefer. After all, XML is an extremely flexible format for storing information about anything.

With these explanations in mind, look at the example above again and everything should fall into place. It describes a simple company structure: a company has a name and offices, and the offices have employees. The employees and offices elements are wrapper elements: they collect elements of the same type into one set for ease of processing. The floor and room attributes deserve special attention. They are attributes of the office (its floor and room number), in other words its properties. If we had an image element, we could pass its dimensions in the same way. You may notice that company does not have a name attribute but does have a name element.
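The car analogy can be sketched in XML roughly as follows. This is only an illustration: all element and attribute names here are made up, and the declaration spells out the optional standalone pseudo-attribute to show where it goes (after version and encoding).

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Hypothetical names, used only to illustrate the analogy -->
<car color="red" capacity="5" brand="ExampleBrand">
    <engine />
    <steeringWheel />
    <doors>
        <door />
        <door />
    </doors>
</car>
```

The characteristics of the car are written as attributes, while the parts inside it are child elements. The same data could just as well be stored with color, capacity and brand as child elements.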
You can describe structures however you want. Nobody obliges you to put all the properties of an element into attributes: you can use child elements and write the data inside them. The reverse works too. For example, we can record the name and position of our employees as attributes:

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <name>IT-Heaven</name>
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="Maksim" job="Middle Software Developer">

                </employee>
                <employee name="Ivan" job="Junior Software Developer">

                </employee>
                <employee name="Franklin" job="Junior Software Developer">

                </employee>
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer">

                </employee>
                <employee name="Adam" job="Middle Software Developer">

                </employee>
                <employee name="Leroy" job="Junior Software Developer">

                </employee>
            </employees>
        </office>
    </offices>
</company>

As you can see, the name and position of each employee are now attributes. Notice also that there is nothing inside the employee tags: every employee element is empty. In that case you can write employee as an empty element, closing it immediately after the attributes. All it takes is a slash:

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <name>IT-Heaven</name>
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="Maksim" job="Middle Software Developer" />
                <employee name="Ivan" job="Junior Software Developer" />
                <employee name="Franklin" job="Junior Software Developer" />
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer" />
                <employee name="Adam" job="Middle Software Developer" />
                <employee name="Leroy" job="Junior Software Developer" />
            </employees>
        </office>
    </offices>
</company>

As you can see, by closing the empty elements we kept all of the information intact and shortened the markup considerably, making it more concise and readable. To add a comment (text that is skipped when the file is parsed) in XML, use the following syntax:

<!-- Ivan recently resigned and only has to work out one more week. Don't forget to remove him from the list afterwards. -->
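As a rough illustration of where comments may go, here is a small sketch (the document itself is made up for this purpose). Comments can sit before, inside and after the root element, but not before the XML declaration and not inside a tag, and the text of a comment must not contain two hyphens in a row:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment may appear before the root element... -->
<employees>
    <!-- ...and between child elements. -->
    <employee name="Ivan" job="Junior Software Developer" />
    <!--
        A comment may span several lines,
        but it must not contain two hyphens in a row.
    -->
</employees>
<!-- A comment after the root element is allowed too. -->
```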

The last construct is CDATA, which stands for “character data”. A CDATA section lets you write text that will not be interpreted as XML markup. This is useful when an element in the XML file has to store XML markup as its data. Example:

<?xml version="1.0" encoding="UTF-8" ?>
<bean>
    <information>
        <![CDATA[<name>Ivan</name><age>26</age>]]>
    </information>
</bean>
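For comparison, the same text can be stored without CDATA by escaping the angle brackets by hand. This sketch reuses the bean and information elements from the example above and relies only on the predefined &lt; and &gt; entities:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<bean>
    <information>
        &lt;name&gt;Ivan&lt;/name&gt;&lt;age&gt;26&lt;/age&gt;
    </information>
</bean>
```

A parser hands the application the same characters in both cases, so the choice is mostly about readability: CDATA is easier on the eyes when there is a lot of markup to store. One restriction to keep in mind is that a CDATA section cannot contain the sequence ]]> because that is what ends it.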

The beauty of XML is that you can extend it however you like: use your own elements and attributes and structure them as you wish. You can store data in both attributes and elements (as the earlier examples showed). But if elements and attributes can be invented on the fly, what happens when you work on a project where another programmer decides to move the name element into an attribute, while all of your program logic expects name to be an element? How can you define rules for which elements exist, which attributes they have, and so on, so that XML files can be validated and the rules become a standard that nobody on the project can break? Special tools exist for writing down the rules of your own XML markup. The best known are DTD and XML Schema. This article covers only the first.

DTD

DTD (Document Type Definition) exists to describe document types. It is becoming obsolete and is gradually being abandoned in the XML world, but many XML files still use it, so it is useful to understand. DTD is a technology for validating XML documents. A DTD declares the rules for a document type: its elements, which elements may appear inside each element, the attributes and whether they are required, how many times elements may repeat, and entities. As with XML, a DTD is easiest to explain with an example.

<!-- Declaration of the allowed elements -->
<!ELEMENT employee EMPTY>
<!ELEMENT employees (employee+)>
<!ELEMENT office (employees)>
<!ELEMENT offices (office+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (name, offices)>

<!-- Adding attributes for the employee and office elements -->
<!ATTLIST employee
        name CDATA #REQUIRED
        job  CDATA #REQUIRED
>

<!ATTLIST office
        floor CDATA #REQUIRED
        room  CDATA #REQUIRED
>

<!-- Adding entities -->
<!ENTITY M "Maksim">
<!ENTITY I "Ivan">
<!ENTITY F "Franklin">

This is a simple example. In it we declared the entire hierarchy from the XML example: employee, employees, office, offices, name, company. DTD files are built from three main constructs that can describe any XML file: ELEMENT (describes elements), ATTLIST (describes the attributes of elements) and ENTITY (substitutes text for abbreviated forms).

ELEMENT

Used to describe an element. The elements that may appear inside the described element are listed in parentheses. You can use quantifiers to specify how many are allowed (they resemble regular-expression quantifiers):

+ means one or more
* means zero or more
? means zero or one

If no quantifier is given, exactly one element is expected. If we needed one of a group of elements, we could write it like this:

<!ELEMENT company (name | offices)>

Then only one of the elements could be used: name or offices. If both appeared inside company, validation would fail. You can also see the word EMPTY in the employee declaration: it means the element must be empty. There is also ANY, which allows any content. #PCDATA means text data.

ATTLIST

Used to add attributes to elements. ATTLIST is followed by the name of the element and then a list of “attribute name - attribute type” pairs, each of which can end with #IMPLIED (optional) or #REQUIRED (required). CDATA means text data. There are other types as well, such as ID, IDREF, NMTOKEN and enumerations of allowed values.

ENTITY

ENTITY is used to declare abbreviations and the text that will be substituted for them. In XML we can then write, instead of the full text, just the name of the entity with & before it and ; after it. A familiar example: to distinguish markup from plain characters, the left angle bracket is usually escaped as &lt; and the parser then treats it as the < character, not as the start of a tag.

As you can see, everything is quite simple: you declare elements, state which elements each of them may contain, add attributes to them and, if you wish, add entities to shorten some entries.

This raises a question: how do we apply our rules to an XML file? So far we have only declared the rules; we have not used them in XML. There are two ways to use them:

1. Embedding - write the DTD rules inside the XML file itself: put the name of the root element after the DOCTYPE keyword and enclose the DTD declarations in square brackets.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE company [
        <!-- Declaration of the allowed elements -->
        <!ELEMENT employee EMPTY>
        <!ELEMENT employees (employee+)>
        <!ELEMENT office (employees)>
        <!ELEMENT offices (office+)>
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT company (name, offices)>

        <!-- Adding attributes for the employee and office elements -->
        <!ATTLIST employee
        name CDATA #REQUIRED
        job  CDATA #REQUIRED
        >

        <!ATTLIST office
        floor CDATA #REQUIRED
        room  CDATA #REQUIRED
        >

        <!-- Adding entities -->
        <!ENTITY M "Maksim">
        <!ENTITY I "Ivan">
        <!ENTITY F "Franklin">
]>

<company>
    <name>IT-Heaven</name>
    <!-- Ivan recently resigned and only has to work out one more week. Don't forget to remove him from the list afterwards. -->
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="&M;" job="Middle Software Developer" />
                <employee name="&I;" job="Junior Software Developer" />
                <employee name="&F;" job="Junior Software Developer" />
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer" />
                <employee name="Adam" job="Middle Software Developer" />
                <employee name="Leroy" job="Junior Software Developer" />
            </employees>
        </office>
    </offices>
</company>

2. Import - write all the rules in a separate DTD file, then use the same DOCTYPE construct in the XML file, but instead of the square brackets write SYSTEM followed by an absolute or relative path to the DTD file.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE company SYSTEM "dtd_example1.dtd">

<company>
    <name>IT-Heaven</name>
    <!-- Ivan recently resigned and only has to work out one more week. Don't forget to remove him from the list afterwards. -->
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="&M;" job="Middle Software Developer" />
                <employee name="&I;" job="Junior Software Developer" />
                <employee name="&F;" job="Junior Software Developer" />
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer" />
                <employee name="Adam" job="Middle Software Developer" />
                <employee name="Leroy" job="Junior Software Developer" />
            </employees>
        </office>
    </offices>
</company>
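The two approaches can also be combined: a DOCTYPE may name an external DTD with SYSTEM and still carry extra declarations in square brackets. A minimal sketch, assuming the dtd_example1.dtd file from above sits next to the XML file. The H entity is made up for illustration and does not exist in that file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE company SYSTEM "dtd_example1.dtd" [
        <!-- Extra entity, declared only for this document -->
        <!ENTITY H "Herald">
]>

<company>
    <name>IT-Heaven</name>
    <offices>
        <office floor="1" room="2">
            <employees>
                <employee name="&H;" job="Middle Software Developer" />
            </employees>
        </office>
    </offices>
</company>
```

The declarations in square brackets are read before the external file, so for entities and attribute lists they take precedence over declarations with the same name in the external DTD.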

You can also use the PUBLIC keyword instead of SYSTEM, but it is unlikely to be useful to you. If you are interested, you can read about it (and about SYSTEM too) in detail here: link. Now we cannot use elements that are not declared in the DTD, and the whole XML document is subject to our rules. Try putting this code into a separate file with an .xml extension in IntelliJ IDEA, then add a new element or remove a declaration from the DTD, and watch the IDE flag the error. However, DTD has its disadvantages:

It has its own syntax, different from XML syntax.
A DTD has no data type checking: every value is treated as a string.
A DTD has no support for namespaces.
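The second point is easy to see in a small sketch. It reuses the office element with its floor and room attributes from the examples above. The kind attribute is hypothetical and is there only to show an enumeration, which is the closest a DTD comes to restricting values:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office [
        <!ELEMENT office EMPTY>
        <!ATTLIST office
        floor CDATA           #REQUIRED
        room  CDATA           #REQUIRED
        kind  (open | closed) "open"
        >
]>
<!-- Valid against the DTD, although neither value is a sensible number -->
<office floor="minus one" room="3.5" />
```

Because floor and room are declared as CDATA, any text is accepted. An enumeration such as kind limits an attribute to a fixed list of names, but there is no way to say “a positive integer”.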

About the separate syntax: you have to understand two syntaxes at once, XML and DTD. They are different, and this can be confusing. It also makes errors harder to track down in huge XML files paired with equally large DTD schemas. If something does not work, you have to check a large amount of text in two different syntaxes. It's like reading two books at the same time, one in Russian and one in English: if you know one of the languages less well, the text as a whole becomes just as hard to understand.

About data type checking: attributes in DTDs do have different types, but at their core they are all string representations of something, lists or references. You cannot require a value to be a number, let alone a positive or negative one. And you can forget about object types entirely.

The last problem will be discussed in the next article, which is devoted to namespaces and XML schemas, since discussing it here would be pointless.

Thank you all for your attention. I have done a lot of work and continue to do it in order to finish the whole series on time. All that remains for the second article is to finish working through XML schemas and find clearer words to explain them. Half of it is already done, so you can expect it soon. The last article will be devoted entirely to working with XML files from Java. Good luck to everyone and success in programming :)

Next article: [Competition] XML Basics for a Java Programmer - Part 2 of 3
