XML Basics for a Java Programmer - Part 1 of 3

XML is a very popular and flexible format. Every programmer should understand it; it is simply a must-have. Many technologies in active use today, including modern ones, rely on it.

Introduction

Hello, dear readers. I want to say right away that this is only the first article in a series of three. The main goal of the series is to introduce each reader to XML and, if not to give a complete explanation, then at least to give a good push toward understanding it by covering the main points. The whole series is entered in a single nomination, "Attention to Details"; it is split into three articles to fit within the character limit for posts and to break a large amount of material into smaller portions that are easier to digest. This first article focuses on XML itself and on one of the ways to define a schema for XML files: DTD.

To begin with, a short preface for those who are not yet familiar with XML at all: there is no need to be scared. XML is not very complex, and any programmer should understand it, since it is a flexible and popular format for storing all sorts of information. XML is used in Ant, Maven, and Spring. Now that you have gathered strength and motivation, let's start. I will try to lay out all the material as simply as possible.

XML

XML is easiest to explain with an example.

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <name>IT-Heaven</name>
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee>
                    <name>Maksim</name>
                    <job>Middle Software Developer</job>
                </employee>
                <employee>
                    <name>Ivan</name>
                    <job>Junior Software Developer</job>
                </employee>
                <employee>
                    <name>Franklin</name>
                    <job>Junior Software Developer</job>
                </employee>
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee>
                    <name>Herald</name>
                    <job>Middle Software Developer</job>
                </employee>
                <employee>
                    <name>Adam</name>
                    <job>Middle Software Developer</job>
                </employee>
                <employee>
                    <name>Leroy</name>
                    <job>Junior Software Developer</job>
                </employee>
            </employees>
        </office>
    </offices>
</company>

HTML and XML are similar in syntax because they share a common parent, SGML. However, HTML has only a fixed set of tags defined by its standard, while in XML you can create your own tags and attributes and generally store data the way you want. In fact, an XML file with well-chosen tag names can be read by anyone who knows English. You can depict this example as a tree.

The root of the tree is company. It is the root element, from which all other elements descend. Each XML file can have only one root element. It must come after the XML declaration (the first line in the example) and contain all other elements.

A little about the declaration: in XML 1.0 it is optional but strongly recommended, and it identifies the document as XML. It has three pseudo-attributes (special predefined attributes): version (required whenever the declaration is present, normally 1.0), encoding (the character encoding of the file), and standalone (yes declares that the document does not rely on external markup declarations, and a validating parser may report an error if it actually does; the default is no).

Elements are entities that store data using other elements and attributes. Attributes are additional information about an element, specified in its opening tag. To borrow an analogy from OOP: we have a car; each car has characteristics (color, capacity, brand, and so on), which are attributes, and there are things inside the car (doors, windows, engine, steering wheel), which are other elements. You can store properties as separate elements or as attributes, whichever you prefer. After all, XML is an extremely flexible format for storing information.

With these explanations in hand, it is enough to walk through the example above for everything to fall into place. In the example we described a simple company structure: there is a company that has a name and offices, and there are employees in the offices. The employees and offices elements are wrapper elements: they collect elements of the same type into one set so that they are convenient to process. floor and room deserve special attention. These are attributes of office (its floor and room number), in other words, its properties. If we had an “image” element, we could pass its dimensions the same way.

You may notice that company does not have a name attribute but does have a name element. That is simply because you can describe structures however you want. Nobody obliges you to put all the properties of an element in attributes; you can use elements alone and write the data inside them. Conversely, we can record the name and position of our employees as attributes:

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <name>IT-Heaven</name>
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="Maksim" job="Middle Software Developer">

                </employee>
                <employee name="Ivan" job="Junior Software Developer">

                </employee>
                <employee name="Franklin" job="Junior Software Developer">

                </employee>
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer">

                </employee>
                <employee name="Adam" job="Middle Software Developer">

                </employee>
                <employee name="Leroy" job="Junior Software Developer">

                </employee>
            </employees>
        </office>
    </offices>
</company>

As you can see, the name and position of each employee are now attributes. You can also see that there is nothing inside the employee element: all employee elements are empty. In that case you can make employee an empty element by closing it immediately after the attribute declarations. This is done quite simply, with a slash:

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <name>IT-Heaven</name>
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="Maksim" job="Middle Software Developer" />
                <employee name="Ivan" job="Junior Software Developer" />
                <employee name="Franklin" job="Junior Software Developer" />
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer" />
                <employee name="Adam" job="Middle Software Developer" />
                <employee name="Leroy" job="Junior Software Developer" />
            </employees>
        </office>
    </offices>
</company>
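For a Java reader, the difference between the element form and the attribute form shows up mainly in how the value is read. The following is only a minimal sketch using the DOM parser that ships with the JDK (javax.xml.parsers); the class name EmployeeFormsDemo and its helper methods are made up for illustration and are not part of the original examples.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: compares the element form and the attribute form.
final class EmployeeFormsDemo {

    // Parses an XML string and returns its root element.
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    // The name is stored as a child element, so we read its text content.
    static String nameFromChildElement() {
        Element employee = root(
                "<employee><name>Ivan</name><job>Junior Software Developer</job></employee>");
        return employee.getElementsByTagName("name").item(0).getTextContent();
    }

    // The name is stored as an attribute, so we read it from the element itself.
    static String nameFromAttribute() {
        Element employee = root(
                "<employee name=\"Ivan\" job=\"Junior Software Developer\" />");
        return employee.getAttribute("name");
    }

    // An explicitly closed element without content and a self-closed one
    // both end up with no child nodes.
    static boolean bothFormsAreEmpty() {
        Element longForm = root("<employee name=\"Ivan\"></employee>");
        Element shortForm = root("<employee name=\"Ivan\" />");
        return !longForm.hasChildNodes() && !shortForm.hasChildNodes();
    }

    public static void main(String[] args) {
        System.out.println(nameFromChildElement());
        System.out.println(nameFromAttribute());
        System.out.println(bothFormsAreEmpty());
    }
}
```

Note that an element with only line breaks between its tags is not quite the same thing for a parser: the whitespace becomes a text node, so the self-closed form is the safer way to write an empty element.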

By closing the empty elements this way, we preserved all the information and made the document considerably shorter and more readable. To add a comment (text that the parser does not treat as data) to XML, use the following syntax:

<!-- Ivan recently resigned and only has one more week left to work. Don't forget to remove him from the list afterwards. -->

The last construct is CDATA, which stands for "character data". It lets you write text that will not be interpreted as XML markup. This is useful when an element in the XML file has to store XML markup as its content. Example:

<?xml version="1.0" encoding="UTF-8" ?>
<bean>
    <information>
        <![CDATA[<name>Ivan</name><age>26</age>]]>
    </information>
</bean>
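To check that the parser really treats the CDATA content as plain text, here is a small sketch with the JDK DOM API. The class name CdataDemo and its methods are hypothetical; the XML string is the example above, inlined.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows that CDATA content is text, not markup.
final class CdataDemo {

    // The document from the example above, inlined as a string.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
          + "<bean>\n"
          + "    <information>\n"
          + "        <![CDATA[<name>Ivan</name><age>26</age>]]>\n"
          + "    </information>\n"
          + "</bean>\n";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Text content of <information>, with the surrounding whitespace trimmed.
    static String informationText() {
        return parse(XML).getElementsByTagName("information")
                .item(0).getTextContent().trim();
    }

    // Number of <name> elements the parser actually created.
    static int nameElementCount() {
        return parse(XML).getElementsByTagName("name").getLength();
    }

    public static void main(String[] args) {
        System.out.println(informationText());   // the markup comes back as a plain string
        System.out.println(nameElementCount());  // no real <name> element exists
    }
}
```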

The beauty of XML is that you can extend it however you want: use your own elements and attributes and structure the data as you see fit. You can use both attributes and elements to store data (as shown in the example earlier). However, this freedom has a cost. You can invent elements and attributes on the go, but what if you work on a project where another programmer decides to move the name element into an attribute, while all of your program logic expects name to be an element? How do you create your own rules for which elements exist, which attributes they have, and so on, so that you can validate XML files and be sure that the rules become a standard in your project that nobody violates? There are special tools for writing down the rules of your own XML markup. The best known are DTD and XML Schema. This article focuses only on the first.

DTD

DTD (Document Type Definition) is designed to describe document types. DTD is considered dated today and has largely given way to XML Schema, but many XML files still use it, so it is useful to understand. DTD is a technology for validating XML documents. A DTD declares the rules for a document type: its elements, which elements may appear inside an element, attributes and whether they are required, how many times elements may repeat, and entities. As with XML, a DTD is easiest to explain with an example.

<!-- Declaration of the allowed elements -->
<!ELEMENT employee EMPTY>
<!ELEMENT employees (employee+)>
<!ELEMENT office (employees)>
<!ELEMENT offices (office+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (name, offices)>

<!-- Adding attributes for the employee and office elements -->
<!ATTLIST employee
        name CDATA #REQUIRED
        job  CDATA #REQUIRED
>

<!ATTLIST office
        floor CDATA #REQUIRED
        room  CDATA #REQUIRED
>

<!-- Adding entities -->
<!ENTITY M "Maksim">
<!ENTITY I "Ivan">
<!ENTITY F "Franklin">

That is our simple example. In it we declared the entire hierarchy from the XML example: employee, employees, office, offices, name, company. DTD files use three main constructs to describe XML files: ELEMENT (for describing elements), ATTLIST (for describing the attributes of elements), and ENTITY (for substituting text with abbreviated forms).

ELEMENT

Used to describe an element. The elements allowed inside the described element are listed in parentheses. You can use quantifiers to specify how many times an element may occur (they are similar to regular expression quantifiers):

+ means one or more
* means zero or more
? means zero or one

If no quantifier is given, exactly one element is expected. If we needed one element out of a group, we could write it like this:

<!ELEMENT company (name | offices)>

Then exactly one of the elements would be allowed, name or offices; if both appeared inside company, validation would fail. You can also see the word EMPTY in the employee declaration: it means the element must be empty. There is also ANY, which allows any content, and #PCDATA, which means text data.

ATTLIST

Used to add attributes to elements. ATTLIST is followed by the name of the element and then by a list of "attribute name, attribute type" pairs, each of which can be followed by #IMPLIED (optional) or #REQUIRED (required). CDATA means text data. There are other types, but they are all string-based.

ENTITY

ENTITY declares abbreviations and the text that will be substituted for them. In XML we can then write, instead of the full text, just the name of the entity with & in front of it and ; after it. For example, to distinguish markup from plain characters in HTML, the left angle bracket is often escaped as &lt;. Then the parser sees not markup but simply the < symbol.

As you can see, it is pretty simple: declare the elements, state which elements the declared elements can contain, add attributes to those elements, and optionally add entities to shorten some entries. At this point you should ask: how do we apply our rules to our XML file? So far we have only declared the rules, not used them. There are two ways to use them in XML:

1. Injection: writing the DTD rules inside the XML file itself. Write the name of the root element after the DOCTYPE keyword and enclose the DTD declarations in square brackets.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE company [
        <!-- Declaration of the allowed elements -->
        <!ELEMENT employee EMPTY>
        <!ELEMENT employees (employee+)>
        <!ELEMENT office (employees)>
        <!ELEMENT offices (office+)>
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT company (name, offices)>

        <!-- Adding attributes for the employee and office elements -->
        <!ATTLIST employee
        name CDATA #REQUIRED
        job  CDATA #REQUIRED
        >

        <!ATTLIST office
        floor CDATA #REQUIRED
        room  CDATA #REQUIRED
        >

        <!-- Adding entities -->
        <!ENTITY M "Maksim">
        <!ENTITY I "Ivan">
        <!ENTITY F "Franklin">
]>

<company>
    <name>IT-Heaven</name>
    <!-- Ivan recently resigned and only has one more week left to work. Don't forget to remove him from the list afterwards. -->
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="&M;" job="Middle Software Developer" />
                <employee name="&I;" job="Junior Software Developer" />
                <employee name="&F;" job="Junior Software Developer" />
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer" />
                <employee name="Adam" job="Middle Software Developer" />
                <employee name="Leroy" job="Junior Software Developer" />
            </employees>
        </office>
    </offices>
</company>
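An IDE is not the only way to see these rules being enforced. Below is a minimal sketch of validating a document against its internal DTD with the JDK DOM parser; it assumes that setValidating(true) on DocumentBuilderFactory is all that is needed to switch on DTD validation, and the class name InternalDtdDemo, the shortened documents, and the validationErrors helper are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: validates documents against an internal DTD.
final class InternalDtdDemo {

    // The DTD from the article, wrapped in a DOCTYPE declaration.
    static final String DOCTYPE =
            "<!DOCTYPE company [\n"
          + "  <!ELEMENT employee EMPTY>\n"
          + "  <!ELEMENT employees (employee+)>\n"
          + "  <!ELEMENT office (employees)>\n"
          + "  <!ELEMENT offices (office+)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT company (name, offices)>\n"
          + "  <!ATTLIST employee name CDATA #REQUIRED job CDATA #REQUIRED>\n"
          + "  <!ATTLIST office floor CDATA #REQUIRED room CDATA #REQUIRED>\n"
          + "  <!ENTITY M \"Maksim\">\n"
          + "]>\n";

    // Shortened version of the example: one office, one employee.
    static final String VALID = DOCTYPE
          + "<company><name>IT-Heaven</name><offices>"
          + "<office floor=\"1\" room=\"1\"><employees>"
          + "<employee name=\"&M;\" job=\"Middle Software Developer\" />"
          + "</employees></office></offices></company>";

    // Same structure, but <employees> is empty, which violates (employee+).
    static final String INVALID = DOCTYPE
          + "<company><name>IT-Heaven</name><offices>"
          + "<office floor=\"1\" room=\"1\"><employees></employees>"
          + "</office></offices></company>";

    // Returns the validation error messages reported by the parser.
    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Document is not well-formed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid document:   " + validationErrors(VALID));
        System.out.println("invalid document: " + validationErrors(INVALID));
    }
}
```

Collecting the messages in a list instead of throwing on the first one is a design choice made for the demo: it lets you see every rule that a document breaks in a single run.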

2. Import: we write all our rules in a separate DTD file and then use the same DOCTYPE construct in the XML file, but instead of the square brackets we write SYSTEM followed by an absolute path to the DTD file or a path relative to the location of the current file.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE company SYSTEM "dtd_example1.dtd">

<company>
    <name>IT-Heaven</name>
    <!-- Ivan recently resigned and only has one more week left to work. Don't forget to remove him from the list afterwards. -->
    <offices>
        <office floor="1" room="1">
            <employees>
                <employee name="&M;" job="Middle Software Developer" />
                <employee name="&I;" job="Junior Software Developer" />
                <employee name="&F;" job="Junior Software Developer" />
            </employees>
        </office>
        <office floor="1" room="2">
            <employees>
                <employee name="Herald" job="Middle Software Developer" />
                <employee name="Adam" job="Middle Software Developer" />
                <employee name="Leroy" job="Junior Software Developer" />
            </employees>
        </office>
    </offices>
</company>
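The imported variant can be tried from Java as well. The sketch below is written under one loud assumption: to stay self-contained it does not read dtd_example1.dtd from disk but hands the parser the DTD text through an EntityResolver; in a real project the parser would normally load the file itself. The class name ExternalDtdDemo and its methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: validates against an external DTD and expands entities.
final class ExternalDtdDemo {

    // Stand-in for the contents of dtd_example1.dtd (assumption: kept in memory).
    static final String DTD_TEXT =
            "<!ELEMENT employee EMPTY>\n"
          + "<!ELEMENT employees (employee+)>\n"
          + "<!ELEMENT office (employees)>\n"
          + "<!ELEMENT offices (office+)>\n"
          + "<!ELEMENT name (#PCDATA)>\n"
          + "<!ELEMENT company (name, offices)>\n"
          + "<!ATTLIST employee name CDATA #REQUIRED job CDATA #REQUIRED>\n"
          + "<!ATTLIST office floor CDATA #REQUIRED room CDATA #REQUIRED>\n"
          + "<!ENTITY M \"Maksim\">\n";

    // Shortened version of the example document.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE company SYSTEM \"dtd_example1.dtd\">\n"
          + "<company><name>IT-Heaven</name><offices>"
          + "<office floor=\"1\" room=\"1\"><employees>"
          + "<employee name=\"&M;\" job=\"Middle Software Developer\" />"
          + "</employees></office></offices></company>";

    static Document parseValidated() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Serve the DTD from memory whenever the parser asks for our file.
            builder.setEntityResolver((publicId, systemId) ->
                    systemId != null && systemId.endsWith("dtd_example1.dtd")
                            ? new InputSource(new StringReader(DTD_TEXT))
                            : null);
            // Treat every validation error as a failure.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            return builder.parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("Document failed to parse or validate", e);
        }
    }

    // The parser replaces &M; with the entity text while reading the attribute.
    static String firstEmployeeName() {
        Element employee = (Element) parseValidated()
                .getElementsByTagName("employee").item(0);
        return employee.getAttribute("name");
    }

    public static void main(String[] args) {
        System.out.println(firstEmployeeName());
    }
}
```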

It is also possible to use the PUBLIC keyword instead of SYSTEM, but it is unlikely to be useful to you. If you are interested, you can read about it (and about SYSTEM too) in detail here: link. Now we cannot use other elements without declaring them in the DTD, and the whole XML document is subject to our rules. You can try writing this code in IntelliJ IDEA in a separate .xml file, then add some new elements or remove an element from our DTD and see how the IDE points out the error. However, DTDs have their downsides:

DTD has its own syntax, different from XML syntax.
DTDs have no data type checking: every value is just a string.
DTDs have no support for namespaces.
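The second point in the list can be seen in a small experiment. This is a sketch under assumptions: it uses a reduced DTD that declares only the office element (not the full DTD from the article), and the class name DtdTypeCheckDemo and the errorCount helper are invented for the illustration.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: a DTD checks presence of attributes, not their type.
final class DtdTypeCheckDemo {

    // Reduced DTD: a single empty <office> element with two required attributes.
    static final String DOCTYPE =
            "<!DOCTYPE office [\n"
          + "  <!ELEMENT office EMPTY>\n"
          + "  <!ATTLIST office floor CDATA #REQUIRED room CDATA #REQUIRED>\n"
          + "]>\n";

    // Values that are clearly not floor or room numbers.
    static final String NON_NUMERIC = DOCTYPE + "<office floor=\"first\" room=\"-7.5\" />";

    // The room attribute is missing altogether.
    static final String MISSING_ROOM = DOCTYPE + "<office floor=\"1\" />";

    // Counts the validation errors reported for the given document.
    static int errorCount(String xml) {
        AtomicInteger count = new AtomicInteger();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { count.incrementAndGet(); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Document is not well-formed", e);
        }
        return count.get();
    }

    public static void main(String[] args) {
        // The DTD is satisfied as long as the attributes are present.
        System.out.println("non-numeric values: " + errorCount(NON_NUMERIC) + " error(s)");
        System.out.println("missing attribute:  " + errorCount(MISSING_ROOM) + " error(s)");
    }
}
```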

About the syntax problem: you have to understand two syntaxes at once, XML and DTD. They are different, and this can be confusing. It also makes it harder to track down errors in huge XML files paired with equally large DTD schemas. If something doesn't work, you have to check a huge amount of text in two different syntaxes. It's like reading two books at the same time, one in Russian and one in English: if you know one of the languages less well, its text will be that much harder to understand.

About the type checking problem: attributes in a DTD do have different types, but all of them are, in essence, string representations of something, lists, or references. You cannot require a value to be a number, let alone a positive or negative one. And you can forget about object types altogether.

The last problem will be discussed in the next article, which is devoted to namespaces and XML Schema, since it is pointless to discuss it here.

Thank you all for your attention. I have put a lot of work into this and continue to do so in order to complete the whole series on time. In fact, I just need to finish understanding XML Schema and come up with clearer words to explain it in order to finish the second article. Half of it is already done, so you can expect it soon. The last article will be devoted entirely to working with XML files from Java. Good luck to everyone and success in programming :)

Next article: [Contest] XML Basics for Java Programmer - Part 2 of 3
