Sooner or later you may need to pull information from a website or an HTML document into your application, and the jsoup library greatly simplifies that task. According to the wiki, jsoup is an open-source Java library designed for parsing, extracting, and manipulating data stored in HTML documents.
Quick start
The library can be downloaded as a jar file and added to the project, or pulled in through Maven or Gradle. A link to the official website is at the end of the article; the current version of the library is listed there. This example uses Maven. Add the dependency:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.11.3</version>
</dependency>
Usage
First, obtain an instance of the Document class (org.jsoup.nodes.Document), specifying the source to parse. The source can be either a local file or a URL. As an example, this article uses the website yandex.ru and tries to fetch its current news feed:
Document doc = Jsoup.connect("https://yandex.ru/")
.userAgent("Chrome/4.0.249.0 Safari/532.5")
.referrer("http://www.google.com")
.get();
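For reference, the same call can live inside a small helper class. The sketch below is only an illustration under stated assumptions: the class name PageLoader and its three methods are hypothetical, and UTF-8 is an assumed encoding for the local file. It shows the URL case with the checked IOException from get() caught in a try/catch block, plus the local-file and in-memory sources mentioned earlier.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper: three ways to obtain a Document.
class PageLoader {

    // Fetches a page over HTTP. The checked IOException thrown by get()
    // is rethrown as an unchecked exception, so callers need no throws clause.
    static Document fromUrl(String url) {
        try {
            return Jsoup.connect(url)
                    .userAgent("Chrome/4.0.249.0 Safari/532.5")
                    .referrer("http://www.google.com")
                    .get();
        } catch (IOException e) {
            throw new UncheckedIOException("Could not load " + url, e);
        }
    }

    // Parses a local file. The charset is an assumption; adjust it to the file.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    // Parses markup that is already in memory.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }
}
```

The in-memory variant is handy for tests, because it needs neither a network connection nor a file on disk.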
The User Agent is an identifier sent to the site being visited; many sites treat it as one of the main criteria in their anti-spam filters. The Referrer contains the URL of the page the request came from. The get() method throws a checked IOException, so either wrap the call in a try/catch block or propagate the exception with throws.
At this point we have the source of the page. If necessary, jsoup repairs damaged elements on its own. What remains is to narrow the search to a single block. The select() method is flexible: it finds elements by tag, attribute, class, and other criteria. Almost all modern browsers can quickly show the source of a selected element, so with a few simple steps we locate the element we need and find the div block with the class that we will use for the selection. The Elements class (org.jsoup.select.Elements) holds all the elements matched in that block:
Elements listNews = doc.select("div#tabnews_newsc.content-tabs__items.content-tabs__items_active_true");
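To make the selector syntax more concrete, here is a minimal offline sketch. The markup, the class name SelectorDemo, and the id and class names in the selectors are all made up for illustration and have nothing to do with the real page. The comments give the match counts for this sample markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectorDemo {

    // Hypothetical markup standing in for a real page.
    static final String HTML =
            "<div id=\"news\" class=\"items active\">"
          + "<a class=\"headline\" href=\"/a\">First</a>"
          + "<a class=\"headline\" href=\"/b\">Second</a>"
          + "<a name=\"anchor\">No link</a>"
          + "</div>"
          + "<div class=\"items\"><a href=\"/c\">Other block</a></div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();

        Elements byTag = doc.select("a");              // every <a> element: 4
        Elements byAttribute = doc.select("a[href]");  // <a> elements with an href attribute: 3
        Elements byClass = doc.select("a.headline");   // <a> elements with class "headline": 2
        // <a> elements nested inside the div with id "news" and both classes: 3
        Elements inBlock = doc.select("div#news.items.active a");

        System.out.println(byTag.size() + " " + byAttribute.size() + " "
                + byClass.size() + " " + inBlock.size());
    }
}
```

Note the difference between "div#news.items" (one element that has both the id and the class) and "div#news a" (links nested anywhere inside that element); the space means "descendant".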
Now all that remains is a small loop that iterates over the links inside the selected block:
for (Element element : listNews.select("a"))
System.out.println(element.text());
The text() method discards the markup and returns only the combined text of the element and everything nested inside it. In the output it is easy to notice that the number of lines received does not match what is actually displayed on the page. This is where the pitfalls lie. If you look at the markup source, you will see that the latest news items are rotated with an animation at a fixed time interval. Some of these pitfalls are resolved by additional selection and, of course, by testing. It may turn out that the first five elements contain the information we need, while the sixth contains only an empty line generated by a script. It also happens that blocks have no identifiers at all; in that case you can address an element directly by its position with the get(int index) method:
System.out.println(listNews.select("a").get(2).text());
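One possible way to guard against empty entries and out-of-range positions is sketched below. It works on an HTML string so that it runs offline; the class name HeadlineExtractor and both of its methods are hypothetical, and returning null for a missing position is just one design choice among several.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper illustrating text(), blank filtering, and get(int index).
class HeadlineExtractor {

    // Returns the text of every link, skipping links whose text is blank.
    static List<String> headlines(String html) {
        Elements links = Jsoup.parse(html).select("a");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            String text = link.text().trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    // Returns the text of the link at the given position,
    // or null when there is no link at that position.
    static String headlineAt(String html, int index) {
        Elements links = Jsoup.parse(html).select("a");
        if (index < 0 || index >= links.size()) {
            return null;
        }
        return links.get(index).text();
    }
}
```

Without the bounds check, get(int index) throws an IndexOutOfBoundsException when the page contains fewer elements than expected, which is easy to run into on a page whose content changes over time.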