Sdu

3 Examples of How to Parse an HTML File in Java Using Jsoup

HTML is the core of the web: every page you see, whether generated dynamically with JavaScript, JSP, PHP, ASP or any other web technology, is based on HTML. Your browser parses that HTML and renders it in a form that is convenient for you. But what if you need to parse an HTML document from a Java program and find an element, tag or attribute in it, or check whether a particular element exists? If you have been a Java programmer for a few years, I am sure you have parsed XML with parsers such as DOM or SAX. Ironically, though, there are cases where you need to parse an HTML document from a core Java application that uses neither Servlets nor any other Java web technology. To make matters worse, the core JDK does not ship a convenient, general-purpose HTML parsing library either. That is why, when it comes to parsing an HTML file, many Java programmers end up asking Google how to get the value of an HTML tag in Java.
When I ran into this problem, I was sure the solution would be an open source library implementing the functionality I needed, but I did not expect it to be as wonderful and feature-rich as Jsoup. It not only supports reading and parsing HTML files, attributes and CSS classes in jQuery style, but also lets you modify them. With Jsoup you can do practically anything you want with an HTML document. In this article we will parse an HTML file and find the names and attributes of its tags. We will also look at examples of loading and parsing HTML from a file and from any URL, such as the Google home page.

What Is Jsoup

Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers such as Chrome and Firefox do. Here are some of the useful features of the Jsoup library:
  • Jsoup can scrape and parse HTML from a URL, a file or a string.
  • Jsoup can find and extract data using DOM traversal or CSS selectors.
  • Jsoup lets you manipulate HTML elements, attributes and text.
  • Jsoup cleans user-submitted content against a safelist (whitelist) of permitted tags to prevent XSS attacks.
  • Jsoup also outputs tidy HTML.
Jsoup is designed to deal with all the varieties of HTML found in the real world, from pristine, validating markup to invalid tag soup. One of its main advantages is its robustness.
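The features listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the original program: the class name JsoupFeaturesSketch, the helper methods and the sample markup are invented for the example, and Safelist assumes Jsoup 1.14.2 or later (older releases expose the same functionality as Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

/** Hypothetical sketch of the Jsoup features listed above. */
public class JsoupFeaturesSketch {

    // Sample markup invented for this illustration.
    public static final String SAMPLE =
            "<html><body><div id=\"books\">"
            + "<a class=\"link\" href=\"/java\">Java</a>"
            + "<a class=\"link\" href=\"/scala\">Scala</a>"
            + "</div></body></html>";

    /** Extracts data with a CSS selector: the text of every a.link inside #books. */
    public static String linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        return String.join(",", doc.select("#books a.link").eachText());
    }

    /** Changes an attribute and the text of the first link, then reads them back. */
    public static String renameFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst("a.link");
        first.attr("href", "/kotlin").text("Kotlin");
        return first.attr("href") + " " + first.text();
    }

    /** Cleans untrusted input, keeping only the tags allowed by the basic safelist. */
    public static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(SAMPLE));
        System.out.println(renameFirstLink(SAMPLE));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Each helper parses its input afresh, so the three features can be tried independently of one another.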

Parsing HTML in Java Using Jsoup

In this tutorial we will see three different examples of parsing and traversing an HTML document in Java using Jsoup. In the first example, we parse an HTML string whose markup is given as a Java String literal. In the second example, we download the HTML document from the web, and in the third, we load our own sample file, login.html, for parsing. This file is a sample HTML document consisting of a "title" tag and, in the "body" section, a "div" tag that contains an HTML form. The form has input fields for a username and a password, plus reset and submit buttons. It is "proper" HTML that can pass a validity check, that is, all tags and attributes are properly closed. Here is what our HTML file looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" /><br>
                Password : <input id="password" type="password" /><br>
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>
Parsing HTML with Jsoup is very easy: all you need to do is call the static method Jsoup.parse() and pass it your HTML string. Jsoup provides several overloaded parse() methods for reading HTML from a String, a File, a URL or an InputStream, optionally together with a base URI. You can also specify the character encoding so that the HTML file is read correctly if it is not in UTF-8. The parse(String html) method parses the input HTML into a new Document object. In Jsoup, the Document class extends Element, which in turn extends Node; TextNode also derives from Node. As long as you pass a non-null string to the method, you are guaranteed a successful, sensible parse, with a Document containing (at least) the "head" and "body" elements. Once you have a Document, you can get the data you want by calling the appropriate methods of Document and its parent classes Element and Node.
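The points made above about parse() and the class hierarchy can be checked with a short sketch. It is an illustration only: the class name ParseBasicsSketch, its helper methods and the example.com base URI are assumptions introduced for this example and do not appear in the original program.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/** Hypothetical sketch of Jsoup.parse() basics. */
public class ParseBasicsSketch {

    /** Any non-null string, even an empty one, yields a Document with head and body. */
    public static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    /** In parse(String, String) the second argument is a base URI for resolving relative links. */
    public static String absoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.selectFirst("a").absUrl("href");
    }

    /**
     * Document is an Element and Element is a Node, so these assignments compile.
     * Assumes the body starts with plain text, so its first child node is a TextNode.
     */
    public static boolean firstBodyChildIsText(String html) {
        Document doc = Jsoup.parse(html);
        Element asElement = doc;   // Document extends Element
        Node asNode = asElement;   // Element extends Node
        Node firstChild = doc.body().childNode(0);
        return asNode == doc && firstChild instanceof TextNode;
    }

    public static void main(String[] args) {
        System.out.println(hasHeadAndBody(""));
        System.out.println(absoluteLink("<a href=\"/docs\">Docs</a>", "https://example.com/"));
        System.out.println(firstBodyChildIsText("Hello"));
    }
}
```

Passing a base URI matters mainly when the markup contains relative links, as in the absoluteLink() helper here.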

Java Program to Parse an HTML Document

Here is our complete program for parsing an HTML string, an HTML page downloaded from the web and a local HTML file. To run it, you can use an IDE (Eclipse or any other) or the command line; either way, the Jsoup JAR must be on the classpath. In Eclipse it is very easy: copy this code, create a new Java project, right-click the "src" folder and paste the copied code. Eclipse takes care of creating the right package and a source file with the right name, so it is much less work. If you already have a Java project, it is just one step. The program below demonstrates three examples of parsing and traversing an HTML document. In the first example we directly parse a String containing HTML, in the second an HTML page downloaded from a URL, and in the third we load and parse an HTML document from the local file system.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents from a String, a URL and a File
 * using the Jsoup library. Jsoup is an open source library which allows Java
 * developers to parse HTML files, extract elements and manipulate data using
 * DOM, CSS and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {

        // Jsoup Example 1 - Parsing an HTML String
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Jsoup Example 2 - Reading an HTML page from a URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            // Print inside the try block so a failed download never reports a stale title
            System.out.println("Jsoup Can read HTML page from URL, title : " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Jsoup Example 3 - Parsing an HTML file in Java
        // Wrong: Jsoup.parse("login.html", "ISO-8859-1") treats the first argument
        // as HTML markup and the second as a base URI, not as a file name and charset.
        try {
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1"); // right
            Element div = htmlFile.getElementById("login");
            String cssClass = div.className(); // getting class from HTML element

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + htmlFile.title());
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            // The file is missing or unreadable; there is nothing to parse
            e.printStackTrace();
        }
    }
}
Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
Jsoup's HTML parser makes every attempt to produce a clean parse of the HTML you provide, whether it is well-formed or not. It can handle the following problems:
  • Unclosed tags. For example, <p>Java <p>Scala becomes <p>Java</p> <p>Scala</p>
  • Implicit tags. For example, a naked <td>Java is Great</td> is wrapped in <table><tr><td>
  • Reliable document structure: the html element always contains a head and a body, and only appropriate elements appear within the head
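The first and last points of this list are easy to try out. The sketch below is a hedged illustration: the class name LenientParsingSketch and its helpers are made up for this example, and it only checks properties of the parsed tree rather than the exact serialized markup, which can vary with Jsoup's output settings.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/** Hypothetical sketch of how Jsoup repairs sloppy HTML. */
public class LenientParsingSketch {

    /** Unclosed <p> tags are closed implicitly, so each one becomes its own paragraph. */
    public static String paragraphTexts(String html) {
        return String.join("|", Jsoup.parse(html).select("p").eachText());
    }

    /** A fragment without html/head/body still ends up in a full document structure. */
    public static String titleAndBodyText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title() + "|" + doc.body().text();
    }

    public static void main(String[] args) {
        System.out.println(paragraphTexts("<p>Java <p>Scala"));
        System.out.println(titleAndBodyText("<title>Demo</title>Hello"));
        // Print the repaired body markup to inspect what the parser produced
        System.out.println(Jsoup.parse("<p>Java <p>Scala").body().html());
    }
}
```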
Jsoup is an excellent, robust open source library that makes reading HTML documents, body fragments and HTML strings, as well as parsing HTML content directly from the web, extremely simple.