Sdu

3 examples of how to parse an HTML file in Java using Jsoup

Published in the Random EN group
HTML is the core of the web: every page you see, whether it is generated dynamically with JavaScript, JSP, PHP, ASP, or another web technology, is based on HTML. Your browser parses the HTML and renders it in a form that is convenient for you. But what if you need to parse an HTML document from a Java program to find a particular element, tag, or attribute, or to check whether a specific element exists? If you have been a Java programmer for several years, you have most likely parsed XML with a DOM or SAX parser. Ironically, though, there are times when you need to parse an HTML document from a plain Java application that uses no servlets or other Java web technologies, and the core JDK does not include a general-purpose HTML parsing library. That is why many Java programmers end up asking Google how to get the value of an HTML tag in Java.

When I ran into this problem, I was sure the solution would be an open-source library that implemented the functionality I needed, but I did not expect it to be as pleasant and feature-rich as Jsoup. It not only supports reading and parsing HTML files, attributes, and CSS classes in a jQuery-like style, but also lets you modify them. With Jsoup you can read, query, and modify almost any part of an HTML document. In this article, we will parse an HTML file and find the names and attributes of its tags. We will also look at examples of loading and parsing HTML from a string, a file, and a URL, such as the Google home page.

What is Jsoup

Jsoup is an open-source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like techniques. Jsoup implements the WHATWG HTML5 specification and parses HTML into the same DOM as modern browsers such as Chrome and Firefox. Here are some of the useful features of the Jsoup library:
  • Jsoup can scrape and parse HTML from a URL, file or string.
  • Jsoup can find and extract data using DOM traversal or CSS selectors.
  • Jsoup allows you to manipulate HTML elements, attributes and text.
  • Jsoup can clean user-submitted content against a whitelist of allowed tags and attributes to prevent XSS attacks.
  • Jsoup can also output tidy HTML.
Jsoup is designed to work with the various kinds of HTML that exist in the real world, from properly validated markup to incomplete, invalid tag soup. One of the main advantages of Jsoup is its reliability.
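The selector and manipulation features listed above can be sketched roughly as follows. This is only an illustration: the class name JsoupFeaturesSketch, the SAMPLE markup, and the helper methods are hypothetical, and the code assumes the Jsoup jar is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFeaturesSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String SAMPLE =
            "<div id=\"links\">"
            + "<a href=\"/java\">Java</a>"
            + "<a href=\"/scala\">Scala</a>"
            + "</div>";

    // Finds every link inside div#links with a CSS selector and collects its href value.
    static List<String> hrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("div#links a[href]")) {
            result.add(link.attr("href"));
        }
        return result;
    }

    // Changes the text of the first link, adds a rel attribute, and returns the link's HTML.
    // Returns an empty string if the markup contains no link at all.
    static String rewriteFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("a").first();
        if (first == null) {
            return "";
        }
        first.text("Kotlin").attr("rel", "nofollow");
        return first.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(hrefs(SAMPLE));
        System.out.println(rewriteFirstLink(SAMPLE));
    }
}
```

Both helpers parse the markup again on each call, which keeps them independent of each other; in a real program you would usually parse once and reuse the Document.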

Parsing HTML in Java using Jsoup

In this tutorial, we will look at three examples of parsing and traversing an HTML document in Java using Jsoup. In the first example, we will parse HTML supplied as a Java string literal. In the second example, we will download an HTML document from the Internet, and in the third example, we will load our own sample file, login.html, from the local file system. This file is a sample HTML document that consists of a "title" tag and a "div" tag in the "body" section that contains an HTML form. The form has fields for entering a username and password, as well as submit and reset buttons. This is "correct" HTML that can pass a validity check, meaning that all tags are properly closed and all attributes are properly quoted. This is what our HTML file looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" /><br>
                Password : <input id="password" type="password" /><br>
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>
With Jsoup, parsing HTML is very easy: all you need to do is call the static method Jsoup.parse() and pass it your HTML string. Jsoup provides several overloads of parse() for reading HTML from a string, a string with a base URI, a file, a URL, and an InputStream. You can also specify the character encoding so that an HTML file that is not in UTF-8 is read correctly. The method parse(String html) parses the input HTML into a new Document object. In Jsoup, the Document class extends Element, which in turn extends Node; TextNode is also a subclass of Node. As long as you pass a non-null string to the method, you are guaranteed a successful, sensible parse that yields a Document containing (at least) "head" and "body" elements. Once you have a Document, you can get the data you need by calling the appropriate methods of Document and its parent classes Element and Node.
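As a rough sketch of how some of these parse() overloads behave, the following uses the string, string-plus-base-URI, and InputStream variants. The class ParseOverloadsSketch, its helper methods, and the example.com base URI are assumptions made for this illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ParseOverloadsSketch {

    // parse(String): even a bare fragment yields a document with head and body.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    // parse(String, String): the base URI lets relative links resolve to absolute URLs.
    static String absoluteHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    // parse(InputStream, String, String): decodes the bytes with the given charset name.
    static String titleFromStream(byte[] bytes, String charsetName) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, charsetName, "").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasHeadAndBody("Hello"));
        System.out.println(absoluteHref("<a href=\"/docs\">Docs</a>", "http://example.com/"));

        Document doc = Jsoup.parse("<p>Hi</p>");
        // These assignments compile because Document extends Element, which extends Node.
        Element asElement = doc;
        Node asNode = doc;
        System.out.println(asElement.tagName() + " " + asNode.nodeName());
    }
}
```

Passing the charset name explicitly matters for files such as login.html, which declares ISO-8859-1 rather than UTF-8.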

Java program for parsing HTML document

Here is our complete program for parsing an HTML string, an HTML page downloaded from the Internet, and a local HTML file. To run it, you can use an IDE (Eclipse or any other) or the command line; in either case the Jsoup jar must be on the classpath. In Eclipse this is very easy: copy the code, create a new Java project, right-click the "src" folder, and paste. Eclipse will create the proper package and a source file with the appropriate name, so it is a lot less work. If you already have a Java project, this is just one step. The program below illustrates three examples of parsing and traversing HTML. In the first example, we parse a string containing HTML directly; in the second, an HTML page downloaded from a URL; and in the third, we load and parse an HTML document from the local file system.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
* Java program to parse/read HTML documents from a string, a URL and a file
* using the Jsoup library. Jsoup is an open-source library which allows Java
* developers to parse HTML, extract elements and manipulate data using DOM,
* CSS and jQuery-like methods.
*
* @author Javin Paul
*/
public class HTMLParser{

    public static void main(String[] args) {

        // Example 1 - parse an HTML string using the Jsoup library
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Example 2 - read an HTML page from a URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            // Print inside the try block so a failed download never reports the title from example 1
            System.out.println("Jsoup Can read HTML page from URL, title : " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Example 3 - parse an HTML file from the local file system
        // Jsoup.parse("login.html", "ISO-8859-1") would be wrong here: that overload
        // treats the first argument as HTML markup and the second as a base URI.
        try {
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            Element div = htmlFile.getElementById("login");
            // Read the class attribute of the div; guard against a file without that element
            String cssClass = (div != null) ? div.className() : "";

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + htmlFile.title());
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            // Reached when login.html is missing or cannot be read
            e.printStackTrace();
        }
    }
}
Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
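The program above reads only the title and the class of the div. As a minimal sketch of finding the names and attributes of the remaining tags, the following lists the form's input fields with a CSS selector. The class LoginFormSketch and its helpers are hypothetical, and the login.html markup is inlined in trimmed form so that the example does not depend on a file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LoginFormSketch {

    // A trimmed copy of the login.html markup, inlined so the sketch needs no file.
    static final String LOGIN_HTML =
            "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\">"
            + "<form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "<input id=\"submit\" type=\"submit\" />"
            + "<input id=\"reset\" type=\"reset\" />"
            + "</form></div></body></html>";

    // Maps each input's id to its type attribute, in document order.
    static Map<String, String> inputTypes(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> types = new LinkedHashMap<>();
        for (Element input : doc.select("div#login form input")) {
            types.put(input.id(), input.attr("type"));
        }
        return types;
    }

    // Reads the form's action attribute; returns an empty string if no form is found.
    static String formAction(String html) {
        Element form = Jsoup.parse(html).select("form").first();
        return form == null ? "" : form.attr("action");
    }

    public static void main(String[] args) {
        System.out.println(inputTypes(LOGIN_HTML));
        System.out.println(formAction(LOGIN_HTML));
    }
}
```

A LinkedHashMap is used so that the fields come back in the order in which they appear in the form.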
The Jsoup HTML parser will make every effort to produce a "clean" parse of the HTML you provide, whether it is well-formed or not. It can handle the following cases:
  • unclosed tags. For example, <p>Java <p>Scala is parsed as <p>Java</p> <p>Scala</p>
  • implied tags. For example, a bare <td>Java is Great</td> will be wrapped in <table><tr><td>
  • reliable document structure. The html element always contains a head and a body, and only appropriate elements are placed within the head
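A small sketch of this lenient behavior is shown below. The class LenientParsingSketch and its helpers are hypothetical. For the implied-tag case it uses a related example, a table row written without a tbody, on the assumption that Jsoup follows the HTML5 parsing rules and inserts the missing tbody element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParsingSketch {

    // Counts paragraphs directly under body; two unclosed <p> tags should
    // be parsed as two sibling paragraphs rather than one nested inside the other.
    static int topLevelParagraphs(String html) {
        return Jsoup.parse(html).select("body > p").size();
    }

    // Counts cells reachable through table > tbody > tr > td, which only
    // matches if the parser supplied the tbody that the markup left out.
    static int cellsUnderImpliedTbody(String html) {
        return Jsoup.parse(html).select("table > tbody > tr > td").size();
    }

    // Checks that the parsed document has exactly one head and one body under html.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("html > head").size() == 1
                && doc.select("html > body").size() == 1;
    }

    public static void main(String[] args) {
        System.out.println(topLevelParagraphs("<p>Java <p>Scala"));
        System.out.println(cellsUnderImpliedTbody("<table><tr><td>Java is Great</td></tr></table>"));
        System.out.println(hasHeadAndBody("<p>Just a fragment"));
    }
}
```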
Jsoup is an excellent and reliable open-source library that makes it very simple to read HTML documents, body fragments, and HTML strings, and to parse HTML content directly from the web.