3 examples of how to parse an HTML file in Java using Jsoup

HTML is the core of the web: all the web pages you see, whether they are dynamically generated by JavaScript, JSP, PHP, ASP or other web technologies, are based on HTML. Your browser parses the HTML and renders it in a form that is convenient for you. But what if you need to parse an HTML document from a Java program and find some element, tag, or attribute in it, or check whether a specific element exists? If you have been a Java programmer for several years, I'm sure you have done XML parsing with parsers like DOM or SAX. Ironically, though, there are times when you need to parse an HTML document from a plain Java application that uses no Servlets or other Java web technologies, and the core JDK does not ship a convenient parser for real-world HTML. That's why, when it comes to parsing an HTML file, many Java programmers ask Google how to get the value of an HTML tag in Java. When I came across this problem, I was sure that the solution would be an open-source library that implements the functionality I needed, but I did not know that it would be as wonderful and feature-rich as Jsoup. It not only supports reading and parsing HTML files, attributes, and jQuery-style CSS selectors, but also allows you to modify them. With Jsoup you can do almost anything with an HTML document. In this article, we will parse an HTML file and find tag names and attributes. We will also look at examples of downloading and parsing HTML from a file and from any URL, such as the Google home page.

What is Jsoup?

Jsoup is an open-source Java library for working with real HTML. It provides a very convenient API for retrieving and manipulating data using the best DOM, CSS, and jQuery-like techniques. Jsoup implements the WHATWG HTML5 specification, and parses HTML into the same DOM as modern browsers like Chrome and Firefox do. Here are some of the useful features of the Jsoup library:

Jsoup can scrape and parse HTML from a URL, file, or string.
Jsoup can find and extract data using DOM traversal or CSS selectors.
Jsoup allows you to manipulate HTML elements, attributes, and text.
Jsoup sanitizes user-submitted content against a safelist (formerly called a whitelist) to prevent XSS attacks.
Jsoup outputs tidy HTML.

Jsoup is designed to work with the various kinds of HTML that exist in the real world, ranging from properly validated HTML to an incomplete, unvalidated set of tags. One of the main advantages of Jsoup is its reliability.
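The features listed above can be sketched in a few lines. This is a minimal illustration, not a complete tour of the API: the class name JsoupFeatureTour, the helper methods, and the sample markup are all made up for this example, and the sanitizing step assumes Jsoup 1.14.2 or newer, where the class is called Safelist (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupFeatureTour {

    // Hypothetical sample markup, used only for this illustration.
    static final String SAMPLE =
            "<div id=\"news\"><a class=\"story\" href=\"/java\">Java news</a>"
            + "<a href=\"/scala\">Scala news</a></div>";

    /** Extracts the text of the first element matching a CSS selector. */
    static String textOf(String html, String cssQuery) {
        Element match = Jsoup.parse(html).selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    /** Adds a rel attribute to every link and returns the modified body HTML. */
    static String markLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    /** Removes markup that is not on Jsoup's basic safelist, such as script tags. */
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(textOf(SAMPLE, "a.story"));
        System.out.println(markLinks(SAMPLE));
        System.out.println(sanitize("<p>Hi <script>alert('x')</script><b>there</b></p>"));
    }
}
```

Each helper parses its input afresh, which keeps the sketch simple; in real code you would normally parse once and reuse the Document.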

Parsing HTML in Java using Jsoup

In this tutorial, we'll see three different examples of parsing and traversing an HTML document in Java using Jsoup. In the first example, we will parse HTML held in a Java String literal. In the second example, we will download an HTML document from the Internet, and in the third example, we will load our own sample file, login.html, from the local file system. This file is a sample HTML document that consists of a "title" tag and a "div" tag in the "body" section that contains an HTML form. The form contains fields for entering a username and password, as well as reset and submit buttons. This is "correct" HTML that can pass a validity check, i.e. all tags and attributes are properly closed. Here's what our HTML file looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" /><br>
                Password : <input id="password" type="password" /><br>
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>

It's very easy to parse HTML with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML string to it. Jsoup provides several overloaded parse() methods for reading HTML from a string, a file, a URL, or an InputStream, optionally together with a base URI for resolving relative links. For files and streams you can also specify the character encoding, so that an HTML file that is not in UTF-8 is read correctly. The method parse(String html) parses the input HTML into a new Document. In Jsoup, the class Document extends Element, which in turn extends Node; the class TextNode also extends Node. As long as you pass a non-null string to the method, you are guaranteed a successful, sensible parse, with a Document containing (at least) a head and a body element. Once you have a Document, you can get the data you want by calling the appropriate methods on Document and its parent classes Element and Node.
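As a rough sketch of those overloads and of the class hierarchy, the snippet below uses three of the parse() variants. The class name ParseOverloads, the helper names, and the example.com base URI are assumptions made for this illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ParseOverloads {

    /** parse(String): even a bare piece of text yields a document with head and body. */
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    /** parse(String, String): the base URI lets Jsoup resolve relative links. */
    static String absoluteHref(String html, String baseUri) {
        Element link = Jsoup.parse(html, baseUri).selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    /** parse(InputStream, String, String): decode the bytes with an explicit charset. */
    static String titleFromBytes(byte[] bytes, String charsetName) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, charsetName, "").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = fromString("Hello");
        Node asNode = doc; // compiles because Document extends Element, which extends Node
        System.out.println(asNode.nodeName());
        System.out.println(doc.head() != null);
        System.out.println(doc.body().childNode(0) instanceof TextNode);

        System.out.println(absoluteHref("<a href='/docs'>Docs</a>", "https://example.com/guide/"));

        byte[] latin1 = "<title>Caf\u00e9</title>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(titleFromBytes(latin1, "ISO-8859-1"));
    }
}
```

Passing the charset explicitly matters for the last call: the same bytes decoded as UTF-8 would not give back the accented character.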

Java program to parse HTML document

Here is our complete program to parse an HTML string, an HTML file downloaded from the Internet, and a local HTML file. You can run it from an IDE (Eclipse or any other) or from the command line; either way, the Jsoup JAR must be on the classpath. In Eclipse it is very easy: copy this code, create a new Java project, right-click the "src" folder, and paste. Eclipse takes care of creating the proper package and a source file with the appropriate name, so there is much less work. If you already have a Java project, it is just one step. The program below shows three examples of parsing and traversing HTML: the first parses a string containing HTML directly, the second parses a page downloaded from a URL, and the third loads and parses an HTML document from the local file system.

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
* Java program to parse/read HTML documents using the Jsoup library.
* Jsoup is an open-source library which allows Java developers to parse HTML
* and to extract elements and manipulate data using DOM, CSS and
* jQuery-like methods.
*
* @author Javin Paul
*/
public class HTMLParser {

    public static void main(String[] args) {

        // Jsoup Example 1 - Parsing an HTML String
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Jsoup Example 2 - Reading an HTML page from a URL (needs network access)
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            // Print inside the try block so a failed download never reports a stale title
            System.out.println("Jsoup Can read HTML page from URL, title : " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Jsoup Example 3 - Parsing an HTML file in Java
        // Wrong: Jsoup.parse("login.html", "ISO-8859-1") would treat "login.html"
        // as the HTML text itself and "ISO-8859-1" as the base URI.
        // Right: pass a File together with the charset the file is encoded in.
        try {
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            Element div = htmlFile.getElementById("login");
            // className() returns the value of the element's class attribute;
            // getElementById() returns null when no such element exists
            String cssClass = (div != null) ? div.className() : "";

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + htmlFile.title());
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
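Since the goal of the article is to find tags and attributes, here is a further sketch that walks over the inputs of the login form with a CSS selector. To stay self-contained it inlines a shortened copy of login.html as a string instead of reading the file; the class name LoginFormInspector and its helper methods are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class LoginFormInspector {

    // Shortened inline copy of login.html, so the sketch does not need the file.
    static final String LOGIN_HTML =
            "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\"><form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "<input id=\"submit\" type=\"submit\" />"
            + "<input id=\"reset\" type=\"reset\" />"
            + "</form></div></body></html>";

    /** Maps the id of every input inside a form to its type, in document order. */
    static Map<String, String> inputTypesById(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Element input : Jsoup.parse(html).select("form input")) {
            result.put(input.id(), input.attr("type"));
        }
        return result;
    }

    /** Returns the action attribute of the form inside the login div, or "" if absent. */
    static String formAction(String html) {
        Element form = Jsoup.parse(html).selectFirst("div#login form");
        return form == null ? "" : form.attr("action");
    }

    public static void main(String[] args) {
        System.out.println(inputTypesById(LOGIN_HTML));
        System.out.println(formAction(LOGIN_HTML));
    }
}
```

A LinkedHashMap is used so that the inputs come back in the same order in which they appear in the document.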

The jsoup HTML parser will do its best to produce a "clean" parse of the HTML you provide, whether it is well-formed or not. It can handle the following errors:

unclosed tags. For example, <p>Java <p>Scala is parsed to <p>Java</p> <p>Scala</p>
implicit tags. For example, a naked <td>Java is Great</td> is wrapped into a <table><tr><td>
reliable document structure (the html element contains a head and a body, and only appropriate elements appear within the head)
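The lenient behaviour described above can be checked with a small sketch. The class name LenientParsing and the helper count() are made up for this example, and the implied-tag check uses a table written without a tbody element, a case picked here because it is easy to verify with a selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParsing {

    /** Counts the elements matching a CSS selector after Jsoup has repaired the input. */
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Unclosed tags: the second <p> implicitly closes the first one.
        System.out.println(count("<p>Java <p>Scala", "p"));

        // Implied tags: the parser inserts the <tbody> that the markup left out.
        System.out.println(count("<table><tr><td>Java is Great</td></tr></table>", "tbody"));

        // Document structure: head and body are created even if the input has neither.
        Document doc = Jsoup.parse("<title>Only a title</title>Some text");
        System.out.println(doc.head().select("title").size());
        System.out.println(doc.body().text());
    }
}
```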

Jsoup is an excellent and reliable open-source library that makes it very easy to read HTML documents, body fragments, and HTML strings, and to parse web content directly from a URL.
