3 examples of how to parse an HTML file in Java using Jsoup

HTML is the core of the web: every page you see on the Internet, whether it is generated dynamically with JavaScript, JSP, PHP, ASP, or another web technology, is based on HTML. Your browser parses HTML and renders it in a form that is convenient for you. But what if you need to parse an HTML document from a Java program and find certain elements, tags, or attributes, or check whether a particular element exists? If you have been a Java programmer for a few years, I am sure you have parsed XML with parsers such as DOM or SAX. Ironically, though, there are times when you need to parse an HTML document from a core Java application that does not involve Servlets or other Java web technologies. On top of that, the core JDK does not include a convenient library for parsing real-world HTML. That is why, when it comes to parsing an HTML file, many Java programmers ask Google how to get the value of an HTML tag in Java.

When I ran into this need, I was sure the solution would be an open-source library that implemented the required functionality, but I did not expect it to be as good and feature-rich as Jsoup. It not only supports reading and parsing HTML documents, attributes, and CSS classes in jQuery style, but also lets you modify them. With Jsoup you can do almost anything you like with an HTML document. In this article we will parse HTML and find tag values and attributes. We will also see examples of loading and parsing HTML from a file and from a URL, such as Google's home page.

What is Jsoup

Jsoup is an open-source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers such as Chrome and Firefox. Here are some useful features of the Jsoup library:
  • Jsoup can scrape and parse HTML from a URL, a file, or a string.
  • Jsoup can find and extract data using DOM traversal or CSS selectors.
  • Jsoup lets you manipulate HTML elements, attributes, and text.
  • Jsoup cleans user-submitted content against a whitelist of safe markup to prevent XSS attacks.
  • Jsoup can also output "tidy" HTML.
Jsoup is designed to handle every variety of HTML found in the real world, from clean, validating markup to tag soup with unclosed and invalid tags. One of Jsoup's main strengths is its robustness.
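As a rough illustration of the selector and manipulation features listed above, here is a minimal sketch. The sample markup, the class name JsoupFeaturesSketch, and its method names are made up for this example and are not part of the original article; it assumes the Jsoup JAR is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupFeaturesSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML = "<html><head><title>Languages</title></head><body>"
            + "<ul id=\"langs\">"
            + "<li class=\"jvm\"><a href=\"/java\">Java</a></li>"
            + "<li class=\"jvm\"><a href=\"/scala\">Scala</a></li>"
            + "<li><a href=\"/python\">Python</a></li>"
            + "</ul></body></html>";

    /** Finds links with a CSS selector and returns their combined text. */
    static String jvmLanguages() {
        Document doc = Jsoup.parse(HTML);
        Elements links = doc.select("li.jvm > a");
        // Elements.text() joins the text of all matched elements with spaces.
        return links.text();
    }

    /** Changes an attribute and the text of one element, then returns its markup. */
    static String renameFirstLink() {
        Document doc = Jsoup.parse(HTML);
        Element first = doc.select("#langs a").first();
        first.attr("href", "/kotlin").text("Kotlin");
        return first.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println("JVM languages : " + jvmLanguages());
        System.out.println("Modified link : " + renameFirstLink());
    }
}
```

The selector syntax is the same one you would use in CSS or jQuery, which is why the lookup fits in a single line.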

Parsing HTML in Java using Jsoup

In this tutorial we will see three different examples of parsing and traversing an HTML document in Java using Jsoup. In the first example we parse an HTML string whose markup is given as a Java string literal. In the second we download an HTML document from the Internet, and in the third we load a sample HTML file, login.html, for parsing. That file is a sample HTML document consisting of a "title" tag and, in the "body" section, a "div" tag that contains an HTML form. The form has input fields for the username and password, plus reset and submit buttons. It is "proper" HTML that can pass a validity check, meaning all tags and attributes are properly closed. Here is our HTML file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" /><br>
                Password : <input id="password" type="password" /><br>
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>
With Jsoup it is very easy to parse HTML: you only need to call the static method Jsoup.parse() and pass it your HTML string. Jsoup provides several overloaded parse() methods to read HTML from a String (optionally with a base URI), a File, a URL, or an InputStream. You can also specify the character encoding so that an HTML file that is not in UTF-8 is read correctly. The parse(String html) method parses the input HTML into a new Document object. In Jsoup, the Document class extends Element, which in turn extends Node; TextNode also extends Node. As long as you pass a non-null string to the method, you are guaranteed a successful, sensible parse, with a Document that contains at least a "head" and a "body" element. Once you have a Document, you can get the data you want by calling the appropriate methods of Document and its parent classes Element and Node.
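To make the parse() overloads and the class hierarchy more concrete, here is a small sketch. The markup, the base URI http://example.com/, and the class and method names are assumptions chosen for illustration only; it again assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ParseOverloadsSketch {

    // Hypothetical fragment with a relative link.
    static final String HTML = "<p>Read the <a href=\"/docs/intro.html\">intro</a></p>";

    /** parse(String html, String baseUri): the base URI lets Jsoup resolve relative links. */
    static String absoluteLink() {
        Document doc = Jsoup.parse(HTML, "http://example.com/");
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    /** Even a bare fragment is parsed into a Document with a head and a body. */
    static boolean hasHeadAndBody() {
        Document doc = Jsoup.parse(HTML);
        return doc.head() != null && doc.body() != null;
    }

    /** Document is an Element, an Element is a Node, and text is held in TextNode objects. */
    static boolean hierarchyHolds() {
        Document doc = Jsoup.parse(HTML);
        Node asNode = doc; // compiles because Document extends Element, which extends Node
        Node firstChildOfP = doc.select("p").first().childNode(0);
        return asNode instanceof Element && firstChildOfP instanceof TextNode;
    }

    public static void main(String[] args) {
        System.out.println("Absolute link  : " + absoluteLink());
        System.out.println("Head and body  : " + hasHeadAndBody());
        System.out.println("Hierarchy holds: " + hierarchyHolds());
    }
}
```

Passing a base URI matters mostly when you parse markup that came from a web page and want to follow its links afterwards.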

Java program for parsing an HTML document

Here is our complete program for parsing an HTML string, an HTML page downloaded from the Internet, and a local HTML file. To run it you can use an IDE (Eclipse or another) or the command line, with the Jsoup JAR on the classpath. In Eclipse this is very easy: copy this code, create a new Java project, right-click the "src" folder, and paste. Eclipse takes care of creating the proper package and a source file with the matching name, so there is less work for you. If you already have a Java project, it is a single step. The program below demonstrates three examples of parsing and traversing HTML. In the first we parse a string containing HTML directly, in the second an HTML page downloaded from a URL, and in the third we load and parse an HTML document from the local file system.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents from a String, a URL and a File
 * using the Jsoup library. Jsoup is an open source library which allows Java
 * developers to parse HTML files and extract elements, manipulate data and
 * change style using DOM, CSS and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {

        // Jsoup Example 1 - Parsing an HTML String
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Jsoup Example 2 - Reading an HTML page from a URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            // Print inside the try block so a failed download never reports a stale title
            System.out.println("Jsoup Can read HTML page from URL, title : " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Jsoup Example 3 - Parsing an HTML file in Java
        // Wrong: Jsoup.parse("login.html", "ISO-8859-1") would treat "login.html"
        // as the HTML markup itself and "ISO-8859-1" as the base URI.
        try {
            // Right: pass a File together with the character set of the file
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            Element div = htmlFile.getElementById("login");
            String cssClass = div.className(); // getting the class from the HTML element

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + htmlFile.title());
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            // Reached when login.html is missing or unreadable
            e.printStackTrace();
        }
    }
}
Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
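The program reads only the title and the class of the div, but the same Document can be walked further. The sketch below lists the input fields of the login form using a CSS selector. To stay self-contained it parses an in-memory copy of the login.html markup instead of the file; the class name LoginFormSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LoginFormSketch {

    // In-memory copy of the login.html markup, so the sketch needs no file on disk.
    static final String LOGIN_HTML = "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\">"
            + "<form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "<input id=\"submit\" type=\"submit\" />"
            + "<input id=\"reset\" type=\"reset\" />"
            + "</form></div></body></html>";

    /** Returns one line per input element: tag name, id and type attribute. */
    static List<String> describeInputs() {
        Document doc = Jsoup.parse(LOGIN_HTML);
        List<String> result = new ArrayList<>();
        for (Element input : doc.select("#login form input")) {
            result.add(input.tagName() + "#" + input.id() + " type=" + input.attr("type"));
        }
        return result;
    }

    /** Reads the action attribute of the form. */
    static String formAction() {
        return Jsoup.parse(LOGIN_HTML).select("form").attr("action");
    }

    public static void main(String[] args) {
        System.out.println("form action : " + formAction());
        for (String line : describeInputs()) {
            System.out.println(line);
        }
    }
}
```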
The Jsoup HTML parser makes every attempt to produce a clean parse from the HTML you provide, whether it is well-formed or not. It can handle the following problems:
  • unclosed tags. For example, <p>Java <p>Scala is parsed as <p>Java</p> <p>Scala</p>
  • implicit tags. For example, the Jsoup documentation describes a bare <td>Java is Great</td> as being wrapped in <table><tr><td>
  • Jsoup reliably creates the document structure (an html element containing a head and a body, with only appropriate elements inside the head)
Jsoup is an excellent and reliable open-source library that makes it easy to read HTML documents, body fragments, and HTML strings, and to parse HTML content directly from the web.
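A quick way to see this leniency is to feed Jsoup deliberately sloppy input. The following sketch exercises only the unclosed-tag and document-structure cases; the class name TagSoupSketch and the input strings are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TagSoupSketch {

    // Two paragraphs, neither of them closed.
    static final String UNCLOSED = "<p>Java <p>Scala";

    /** Counts the paragraph elements Jsoup builds from the unclosed markup. */
    static int paragraphCount() {
        Document doc = Jsoup.parse(UNCLOSED);
        return doc.select("p").size();
    }

    /** Returns the text of the second paragraph. */
    static String secondParagraph() {
        Document doc = Jsoup.parse(UNCLOSED);
        return doc.select("p").get(1).text();
    }

    /** Plain text without any tags still ends up inside a head/body structure. */
    static boolean structureIsCreated() {
        Document doc = Jsoup.parse("Just some text");
        return doc.head() != null && doc.body().text().equals("Just some text");
    }

    public static void main(String[] args) {
        System.out.println("paragraphs        : " + paragraphCount());
        System.out.println("second paragraph  : " + secondParagraph());
        System.out.println("structure created : " + structureIsCreated());
    }
}
```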