Сергей
Level 40
Москва

Parsing html with jsoup library

Published in the Random EN group
So we want to get specific information from a site. Let's look at how to do this step by step.

First we need to get a Document. This is a representation of our html page. Jsoup has several ways to turn a site into a Document.

Connect to server

Document document = Jsoup.connect("https://hh.ru/").get();
Jsoup itself connects to the site. This method is the simplest, but it is only suitable for testing: there are more convenient and flexible http clients. Also keep in mind that, no matter which http client you use, you should add a User-Agent header to the request, with a value such as Chrome/81.0.4044.138. Servers use this header to identify the client software and device you connected from. Without it, the server may treat you as a bot and block you.

From a file

File file = new File("hh-test.html");
Document document = Jsoup.parse(file, "UTF-8", "https://hh.ru/");
This is the main way to get a Document object. The last argument, "https://hh.ru/", is the base URI. Jsoup uses it to turn the relative links present on the site into absolute ones, so it should be a full URL including the scheme.

From a string

String html =
                "<html>                                                                       " +
                "    <head>                                                                   " +
                "        <title href=\"hh.ru/vacancy?home\">                                  " +
                "            Работа в Москве, поиск персонала и публикация вакансий - hh.ru   " +
                "        </title>                                                             " +
                "    </head>                                                                  " +
                "    <body>                                                                   " +
                "        <div class=\"header main\">                                          " +
                "            <h1>Работа найдется для каждого</h1>                             " +
                "            <div>Поиск вакансий</div>                                        " +
                "        </div>                                                               " +
                "        <div class=\"content\">                                              " +
                "            <div>Вакансии дня</div>                                          " +
                "            <div id=\"123\">Компании дня</div>                               " +
                "            <div>Работа из дома</div>                                        " +
                "        </div>                                                               " +
                "    </body>                                                                  " +
                "</html>                                                                      ";

Document document = Jsoup.parse(html, "https://hh.ru/");
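Such a string does not have to be a literal. It can also come from a separate http client, as suggested in the section on connecting to the server. Below is a minimal sketch, assuming Java 11+ and its built-in java.net.http client. The class name HtmlFetcher and its methods are hypothetical, and the User-Agent value is the one mentioned in this article.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HtmlFetcher {
    // User-Agent value from the article; real browsers send a longer string.
    static final String USER_AGENT = "Chrome/81.0.4044.138";

    // Builds a GET request that carries the User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    // Sends the request and returns the page as a string (needs network access).
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Show the headers that would be sent; no network is used here.
        System.out.println(buildRequest("https://hh.ru/").headers().map());
        // Pass a URL as the first argument to actually download the page.
        if (args.length > 0) {
            System.out.println(fetchHtml(args[0]).length());
        }
    }
}
```

The string returned by fetchHtml can then be handed to Jsoup.parse together with a base URI, in the same way as the literal string above.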
Next, I will demonstrate the library on the html string above, which represents a simplified site.

Getting a tag

The main task when parsing is to get the desired tag. We will do this using the select method. Note that it always returns a list of tags. If no tags are found, the list will be empty. The method takes a CSS selector as its argument and searches for the tags that match it. I will dwell on selectors in more detail, because all the work comes down to writing the correct selector. Typically we need to compose it so that it returns a single tag.
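To make the point about the returned list concrete, here is a small sketch. The html fragment and the class name SelectResultExample are made up for illustration. It counts the matches of a selector and shows that a selector with no matches gives an empty list rather than null.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectResultExample {
    // Returns how many tags in the given html match the CSS selector.
    static int countMatches(String html, String cssSelector) {
        Document document = Jsoup.parse(html);
        Elements found = document.select(cssSelector);
        return found.size();
    }

    public static void main(String[] args) {
        // Hypothetical fragment, not taken from a real site.
        String html = "<div class=\"content\"><div>One</div><div>Two</div></div>";

        System.out.println(countMatches(html, "div.content > div")); // 2
        System.out.println(countMatches(html, "h1"));                // 0

        // An unmatched selector still returns a list, just an empty one.
        Elements missing = Jsoup.parse(html).select("h1");
        System.out.println(missing.isEmpty()); // true
    }
}
```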

Elements h1 = document.select("h1");
System.out.println(h1);
Get the h1 tags. Output:

<h1>Работа найдется для каждого</h1>

Elements titleElem = document.select("head > title");
Get the title tags. The > sign selects title tags that are direct children of the head tag.

Elements divs = document.select("body > div");
Get the div tags that are direct children of body.

Elements firstDiv = document.select("body > div:nth-child(1)");
Get the div that is the first child of body. Retrieving a tag by its position is a fragile approach, because its position on the site may change. It is better to identify the tag by stable properties, namely its class and id attributes.

Elements contentElem = document.select("body > div.content");
Get the div tag with the "content" class that is a direct child of body.

Elements idElem = document.select("#123");
Get tags with id "123"

Elements divHeader = document.select("body > div.header.main :not(h1)");
Get every tag nested inside the div with the classes "header" and "main" (itself a direct child of body), except h1 tags. Output:

<div>
 Поиск вакансий
</div>
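The earlier remark that position-based selectors are fragile can be sketched as follows. Both html fragments and the class name StableSelectorExample are hypothetical. The second fragment imitates a redesign in which a banner is added at the top of the page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StableSelectorExample {
    // Returns the combined text of all tags matched by the selector.
    static String textOf(String html, String cssSelector) {
        Document document = Jsoup.parse(html);
        return document.select(cssSelector).text();
    }

    public static void main(String[] args) {
        String before = "<body><div class=\"header\">Header</div>"
                + "<div class=\"content\">Content</div></body>";
        // Hypothetical redesign: a banner now comes first.
        String after = "<body><div class=\"banner\">Banner</div>"
                + "<div class=\"header\">Header</div>"
                + "<div class=\"content\">Content</div></body>";

        // The positional selector silently starts returning a different tag.
        System.out.println(textOf(before, "body > div:nth-child(1)")); // Header
        System.out.println(textOf(after, "body > div:nth-child(1)"));  // Banner

        // The class selector keeps returning the same tag.
        System.out.println(textOf(before, "body > div.header")); // Header
        System.out.println(textOf(after, "body > div.header"));  // Header
    }
}
```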
Elements methods

Once we have received an Elements list, we can extract data from it. Let me remind you that usually the selector searches for one tag, i.e. the Elements list should have size 1.
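One way to enforce the "exactly one tag" expectation is a small helper that fails loudly when the selector matches nothing or too much. This is only a sketch: selectSingle is a hypothetical helper, not part of Jsoup, and the html fragment is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SingleTagExample {
    // Hypothetical helper: returns the only tag matched by the selector,
    // or throws if the selector matched zero or several tags.
    static Element selectSingle(Document document, String cssSelector) {
        Elements elements = document.select(cssSelector);
        if (elements.size() != 1) {
            throw new IllegalStateException(
                    "Expected 1 tag for '" + cssSelector + "' but found " + elements.size());
        }
        return elements.get(0);
    }

    public static void main(String[] args) {
        // Made-up fragment; the base URI lets Jsoup resolve the relative link.
        String html = "<div class=\"content\">"
                + "<a href=\"/vacancy\">Vacancies</a><div>Other</div></div>";
        Document document = Jsoup.parse(html, "https://hh.ru/");

        Element link = selectSingle(document, "div.content > a");
        System.out.println(link.text());         // Vacancies
        System.out.println(link.attr("href"));   // /vacancy
        System.out.println(link.absUrl("href")); // https://hh.ru/vacancy
        System.out.println(link.outerHtml());
    }
}
```

A selector such as "div" would match two tags in this fragment, so selectSingle would throw instead of quietly returning the first one.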

  elements.size();
number of tags found

elements.get(0);
get the first tag from the list of found ones

elements.text();
text inside the found tags (combined, if several tags were found)

elements.attr("href");
"href" attribute value

elements.outerHtml();
string representation of the tag

If you need to quickly get a selector for an element, open the developer panel (F12) in the browser, right-click the element and choose "Inspect", then right-click the highlighted tag and choose "Copy" and then "Copy selector". Such a selector will not be optimal, but it is quite suitable for quick results.

Conclusion

These are the basics of working with the Jsoup library, but they are quite enough to parse sites. To work confidently, all you need is practice writing selectors on real sites.

P.S. This library is used to solve a large task at level 38.