Easy HTML parsing with jsoup

Sooner or later you may need to pull information from a website or an HTML document into your application, and the jsoup library makes that task much simpler. According to the wiki, jsoup is an open-source Java library designed for parsing, extracting, and manipulating data stored in HTML documents.

Quick start

The library can be downloaded as a jar file and added to a project, or pulled in through Maven or Gradle. A link to the official website is at the end of the article; the current version of the library is listed there. This example uses Maven. Let's add the dependency:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.11.3</version>
</dependency>

Usage

First of all, you need an instance of the Document class (org.jsoup.nodes.Document) created from the source you want to parse. The source can be either a local file or a URL. In this article we will use the website yandex.ru and try to get its current news feed:

Document doc = Jsoup.connect("https://yandex.ru/")
                .userAgent("Chrome/4.0.249.0 Safari/532.5")
                .referrer("http://www.google.com")
                .get();
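
For the other kind of source mentioned above, a local file, or for markup that is already in memory, a minimal sketch might look like the following. The class name, the sample markup, and the temporary file are assumptions made for illustration; only Jsoup.parse(String) and Jsoup.parse(File, String) are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LocalSourceSketch {

    // Hypothetical sample markup, used only for illustration.
    static final String SAMPLE_HTML =
            "<html><head><title>Sample page</title></head>"
            + "<body><p>Hello, jsoup</p></body></html>";

    // Parse markup that is already in memory; no network or disk access is involved.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse a file on disk; the charset is stated explicitly.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fromString(SAMPLE_HTML).title());

        // Write the sample to a temporary file so the example does not rely on an existing one.
        Path tmp = Files.createTempFile("jsoup-sample", ".html");
        try {
            Files.write(tmp, SAMPLE_HTML.getBytes(StandardCharsets.UTF_8));
            System.out.println(fromFile(tmp.toFile()).select("p").text());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Parsing from a string is also convenient for experimenting with selectors, because it does not depend on the network or on the current state of a live site.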

User Agent is an identifier that is sent to the site being visited. Many sites treat it as a key criterion in their antispam filters. Referrer contains the URL of the request source. The get() method throws a checked IOException, so we can either wrap the call in a try/catch block or pass the exception on with throws.

We have now received the source code of the page. If necessary, jsoup can repair damaged markup itself, for example by closing tags that were left open. Now all we have to do is narrow the search to a separate block. The select() method is versatile: it lets you search for elements by tag, attribute, class, and other parameters. Almost all modern browsers can quickly show the source code of a selected element. With a few clicks we find the source code of the element we need and get a div block with a specific class, which we will use for the selection.

[Screenshot: Easy HTML parsing using jsoup - 1]
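
To make the selector options and the markup repair more concrete, here is a small sketch that runs entirely on an in-memory string. The markup, the class name, and the helper count() are hypothetical; the paragraph tags are deliberately left unclosed to show that jsoup still produces two separate p elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorSketch {

    // Hypothetical markup; note that neither <p> tag is closed.
    static final String BROKEN_HTML =
            "<div id=\"feed\" class=\"news active\"><p>First<p>Second</div>"
            + "<a href=\"/about\">About</a>";

    // Count how many elements match a CSS selector in the sample markup.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(BROKEN_HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println("by tag p: " + count("p"));
        System.out.println("by class div.news: " + count("div.news"));
        System.out.println("by id and class div#feed.active: " + count("div#feed.active"));
        System.out.println("by attribute a[href]: " + count("a[href]"));
        System.out.println("no match span: " + count("span"));
    }
}
```

The same selector syntax applies to a document fetched with Jsoup.connect(), so a selector can be tried out on a saved fragment first.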

Let's use the Elements class (org.jsoup.select.Elements) to select all elements from our chosen block.

Elements listNews = doc.select("div#tabnews_newsc.content-tabs__items.content-tabs__items_active_true");
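
When a selector matches nothing, select() returns an empty Elements collection rather than null, so it can be worth checking before going further. A minimal sketch, assuming hypothetical markup and a hypothetical helper firstText():

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

import java.util.Optional;

class EmptySelectionSketch {

    // Hypothetical markup, standing in for a fetched page.
    static final String SAMPLE_HTML =
            "<div id=\"news\" class=\"items active\">"
            + "<a href=\"#\">Alpha</a> <a href=\"#\">Beta</a>"
            + "</div>";

    // Return the text of the first match, or an empty Optional if nothing matched.
    static Optional<String> firstText(String html, String cssQuery) {
        Elements found = Jsoup.parse(html).select(cssQuery);
        if (found.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(found.first().text());
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE_HTML, "div#news.items a").orElse("(no match)"));
        System.out.println(firstText(SAMPLE_HTML, "div#missing a").orElse("(no match)"));
    }
}
```

A check like this gives an early signal when a site changes its markup and the old selector stops matching.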

Now we have something like this:

[Screenshot: Easy HTML parsing using jsoup - 2]

Now all we have to do is use a small loop to iterate through all the elements:

// Print the text of every link inside the selected block
for (Element element : listNews.select("a")) {
    System.out.println(element.text());
}
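
Besides the text, the same loop can read attributes such as the link target. The sketch below is an illustration under assumptions: the markup, the base URL https://example.com/, and the helper describeLinks() are hypothetical, while text() and absUrl() are jsoup methods of Element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class LinkExtractionSketch {

    // Hypothetical markup and base URL, standing in for a fetched page.
    static final String SAMPLE_HTML =
            "<div id=\"news\">"
            + "<a href=\"/news/1\">Alpha</a>"
            + "<a href=\"https://example.org/2\">Beta</a>"
            + "</div>";
    static final String BASE_URI = "https://example.com/";

    // Collect "text -> absolute URL" for every link inside the news block.
    static List<String> describeLinks() {
        // The base URI lets jsoup resolve relative href values.
        Document doc = Jsoup.parse(SAMPLE_HTML, BASE_URI);
        List<String> lines = new ArrayList<>();
        for (Element link : doc.select("div#news a")) {
            lines.add(link.text() + " -> " + link.absUrl("href"));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeLinks()) {
            System.out.println(line);
        }
    }
}
```

For a document obtained with Jsoup.connect(), the base URI is taken from the requested URL, so absUrl() can resolve relative links without extra setup.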

The text() method discards the markup and returns only the combined text of the element and all its nested elements. The result of the execution will be as follows:

[Screenshot: Easy HTML parsing using jsoup - 3]

It is easy to see that the number of lines returned does not match what the page actually displays. This is where the pitfalls lie. If you look at the markup, you will notice that the latest news items are rotated by an animation at a fixed time interval. Some of these pitfalls are solved by additional selection and, of course, by testing. It may turn out that the first five elements contain the information we need, while the sixth contains only an empty line generated by a script. It also happens that blocks have no identifiers at all; in that case you can use the get(int index) method to point directly at the position of the element you need.

System.out.println(listNews.select("a").get(2).text());
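
Both pitfalls described above, empty entries and access by position, can be handled defensively. The following is a sketch under assumptions: the markup and the helpers nonEmptyTexts() and textAt() are hypothetical, while hasText(), text(), size(), and get() are existing jsoup or list methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class DefensiveAccessSketch {

    // Hypothetical markup; the third link is empty, like a placeholder filled in by a script.
    static final String SAMPLE_HTML =
            "<div><a href=\"#\">One</a><a href=\"#\">Two</a><a href=\"#\"></a></div>";

    static Elements links() {
        return Jsoup.parse(SAMPLE_HTML).select("a");
    }

    // Keep only the links that actually contain text.
    static List<String> nonEmptyTexts(Elements links) {
        List<String> texts = new ArrayList<>();
        for (Element link : links) {
            if (link.hasText()) {
                texts.add(link.text());
            }
        }
        return texts;
    }

    // Read the element at a position without risking an IndexOutOfBoundsException.
    static Optional<String> textAt(Elements links, int index) {
        if (index < 0 || index >= links.size()) {
            return Optional.empty();
        }
        return Optional.of(links.get(index).text());
    }

    public static void main(String[] args) {
        Elements links = links();
        System.out.println(nonEmptyTexts(links));
        System.out.println(textAt(links, 1).orElse("(out of range)"));
        System.out.println(textAt(links, 5).orElse("(out of range)"));
    }
}
```

Access by index is fragile when the site reorders its blocks, so a bounds check keeps a markup change from turning into an exception.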

Conclusion

This example shows only a small part of what jsoup is capable of. Keep in mind that sites are updated often and their markup structure changes, so when scraping you need to be ready to adapt. You can find more information and the current version on the official website, jsoup.org; you can read more about the classes and methods at o7planning.org. I'll also leave a link to my GitHub: at the time of writing it hosts several Telegram bots that use jsoup to fetch and display information.