HTML is the core of the web. Every page you see on the internet is ultimately HTML, whether it is generated dynamically by JavaScript, JSP, PHP, ASP, or any other web technology. Your browser parses that HTML and renders it for you. But what do you do when you need to parse an HTML document from a Java program to find an element, a tag, or an attribute, or to check whether a particular element exists? If you have been programming in Java for a few years, you have almost certainly parsed XML with a DOM or SAX parser, but there is a good chance you have never parsed HTML. Even so, there are cases where you need to parse HTML from a core Java application, one that does not involve Servlets or other Java web technologies. To make matters worse, the core JDK does not ship with an HTML parsing library. That is why, when it comes to parsing an HTML file, many Java programmers turn to Google to find out how to get the value of an HTML tag in Java.
When I ran into this need, I was sure there would be an open-source library that implemented exactly the functionality I wanted, but I did not expect it to be as pleasant and feature-rich as Jsoup. It not only reads and parses HTML documents, it also lets you extract any element, its attributes, and its CSS classes with jQuery-style selectors, and modify them as well. With Jsoup you can do almost anything you want with an HTML document. In this article we will parse an HTML file and read the value of its title and heading tags. We will also see examples of downloading and parsing HTML from a file and from a URL, including Google's home page.
What is Jsoup
Jsoup is an open-source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers such as Chrome and Firefox. Here are some of the useful features of the Jsoup library:
- Jsoup can scrape and parse HTML from a URL, a file, or a string.
- Jsoup can find and extract data using DOM traversal or CSS selectors.
- Jsoup lets you manipulate HTML elements, attributes, and text.
- Jsoup can clean user-submitted content against a safe whitelist to prevent XSS attacks.
- Jsoup can also output tidy HTML.
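The features listed above can be sketched in a few lines. This is only an illustrative sketch, not code from the original article: the class name JsoupFeaturesSketch and its helper methods are hypothetical, and it assumes a Jsoup version of 1.14.2 or later, where the whitelist class is named Safelist (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, used only to illustrate the feature list.
class JsoupFeaturesSketch {

    // Extract data with a CSS selector: text of the first matching element.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    // Manipulate the document: add a CSS class to every link that has an href.
    static String markLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").addClass("external");
        return doc.body().html();
    }

    // Clean untrusted markup against a safe list (assumes Jsoup 1.14.2+).
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<div><p class='lead'>Hello</p><a href='/x'>Go</a></div>";
        System.out.println(firstText(html, "p.lead")); // prints "Hello"
        System.out.println(markLinks(html));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

The helpers re-parse the input on every call to keep each feature isolated; in real code you would normally parse once and reuse the Document.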
Parsing HTML in Java using Jsoup
In this tutorial we will see three different examples of parsing and traversing an HTML document in Java using Jsoup. In the first example, we parse an HTML string whose content is written as a Java string literal. In the second example, we download an HTML document from the internet, and in the third example we load our own sample file, login.html, for parsing. This file is a sample HTML document that contains a "title" tag and a "div" tag in the "body" section, which in turn contains an HTML form. The form has input fields for a username and a password, plus reset and submit buttons for further actions. It is proper HTML in the sense that its tags are properly nested and closed. Here is what our sample HTML file looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Login Page</title>
</head>
<body>
<div id="login" class="simple" >
<form action="login.do">
Username : <input id="username" type="text" /><br>
Password : <input id="password" type="password" /><br>
<input id="submit" type="submit" />
<input id="reset" type="reset" />
</form>
</div>
</body>
</html>
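As a quick illustration of how this sample can be queried, the sketch below lists the form's input fields with a CSS selector. It is an assumption-laden sketch rather than part of the original program: the class name LoginFormSketch and its methods are hypothetical, and the markup is inlined as a string (a shortened copy of login.html) so that the example runs without the file.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the HTML below is a shortened inline copy of login.html.
class LoginFormSketch {

    static final String LOGIN_HTML =
            "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\">"
            + "<form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "<input id=\"submit\" type=\"submit\" />"
            + "<input id=\"reset\" type=\"reset\" />"
            + "</form></div></body></html>";

    // Collect "id:type" for every input inside the login form, in document order.
    static List<String> inputIds(String html) {
        Document doc = Jsoup.parse(html);
        List<String> ids = new ArrayList<>();
        for (Element input : doc.select("div#login form input")) {
            ids.add(input.id() + ":" + input.attr("type"));
        }
        return ids;
    }

    // Read the action attribute of the first form in the document.
    static String formAction(String html) {
        return Jsoup.parse(html).select("form").attr("action");
    }

    public static void main(String[] args) {
        System.out.println(inputIds(LOGIN_HTML));
        System.out.println(formAction(LOGIN_HTML)); // prints "login.do"
    }
}
```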
Parsing HTML with Jsoup is very simple: all you need to do is call the static method Jsoup.parse() and pass your HTML string to it. Jsoup provides several overloaded parse() methods for reading HTML from a String, from a File, with a base URI, from a URL, or from an InputStream. You can also specify the character encoding so that an HTML file is read correctly when it is not in UTF-8. The parse(String html) method parses the input HTML into a new Document. In Jsoup, the Document class extends Element, which in turn extends Node; TextNode also extends Node. As long as you pass in a non-null string, you are guaranteed a successful, sensible parse, with a Document containing (at least) a head and a body element. Once you have a Document, you can get the data you want by calling the appropriate methods of Document and of its parent classes Element and Node.
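The overloads and the class hierarchy described above can be tried out with a small sketch. The class ParseOverloadsSketch, its method names, and the example.com URL are hypothetical and chosen only for illustration; the Jsoup calls used are parse(String), parse(String, String) with a base URI, and parse(InputStream, String, String) with an explicit charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical helper class illustrating a few parse() overloads.
class ParseOverloadsSketch {

    // parse(String html, String baseUri): the base URI lets relative links be resolved.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    // parse(InputStream in, String charsetName, String baseUri): explicit encoding.
    static String titleFromBytes(byte[] bytes, String charsetName) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, charsetName, "").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Any non-null input yields a Document with a head and a body.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("just some text");
        Element asElement = doc; // a Document is an Element
        Node asNode = asElement; // and an Element is a Node
        System.out.println(asNode.nodeName());
        System.out.println(hasHeadAndBody("just some text")); // prints "true"

        byte[] latin1 = "<title>Caf\u00e9</title>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(titleFromBytes(latin1, "ISO-8859-1"));
        System.out.println(resolveFirstLink("<a href='/docs/index.html'>Docs</a>",
                "https://example.com/app/"));
    }
}
```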
Java program to parse an HTML document
Here is our complete program to parse an HTML string, an HTML file downloaded from the internet, and an HTML file from the local file system. To run it, you can use an IDE (Eclipse or any other) or the command line; either way the Jsoup JAR must be on the classpath. In Eclipse it is very easy: copy the code, create a new Java project, right-click on the "src" folder, and paste. Eclipse takes care of creating the proper package and a source file with the matching name, so there is very little work to do. If you already have a Java project, it is just one step. The program below demonstrates three examples of parsing and traversing an HTML document. In the first example we directly parse a string containing HTML, in the second we parse an HTML page downloaded from a URL, and in the third we load and parse an HTML document from the local file system.
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents using the Jsoup library.
 * Jsoup is an open source library which allows Java developers to parse HTML
 * files and extract elements, manipulate data, and change style using DOM, CSS
 * and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {
        // Jsoup Example 1 - Parsing an HTML String
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";
        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();
        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Jsoup Example 2 - Reading an HTML page from a URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
            System.out.println("Jsoup Can read HTML page from URL, title : " + title);
        } catch (IOException e) {
            // Network failure: report it and carry on with the next example
            e.printStackTrace();
        }

        // Jsoup Example 3 - Parsing an HTML file
        // Wrong: parse(String, String) treats the first argument as HTML markup
        // and the second as a base URI, not as a file name and a charset:
        // Document htmlFile = Jsoup.parse("login.html", "ISO-8859-1");
        try {
            // Right: pass a File together with the charset used by the document
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            title = htmlFile.title();
            Element div = htmlFile.getElementById("login");
            String cssClass = div.className(); // getting the class from the HTML element
            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + title);
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            // The file is missing or unreadable
            e.printStackTrace();
        }
    }
}
Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
The Jsoup HTML parser makes every attempt to create a clean parse from the HTML you provide, regardless of whether that HTML is well-formed or not. It handles the following mistakes:
- Unclosed tags. For example, <p>Java <p>Scala becomes <p>Java</p> <p>Scala</p>
- Implicit tags. For example, a naked <td>Java is Great</td> is wrapped in <table><tr><td>
- Jsoup reliably creates the document structure (an html element containing a head and a body, with only appropriate elements inside the head)
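A small sketch can make this leniency concrete. The class LenientParsingSketch and its methods are hypothetical names used only for illustration. Note that the implied-tag check here uses a different case from the one in the list above: it relies on the HTML5 rule that a table row written directly inside a table gets an implied tbody parent.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class illustrating how Jsoup repairs sloppy markup.
class LenientParsingSketch {

    // Unclosed <p> tags are closed automatically, so each one becomes its own element.
    static int countParagraphs(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // A <tr> placed directly in a <table> is given an implied <tbody> parent.
    static boolean hasImpliedTbody(String html) {
        return !Jsoup.parse(html).select("table > tbody > tr").isEmpty();
    }

    // Even without <html>, <head>, or <body> tags, the title ends up in the head.
    static boolean titleLandsInHead(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head().select("title").size() == 1
                && doc.body().select("title").isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(countParagraphs("<p>Java <p>Scala")); // prints "2"
        System.out.println(hasImpliedTbody("<table><tr><td>Java is Great</td></tr></table>"));
        System.out.println(titleLandsInHead("<title>Fixed</title><p>Body text"));
        // Print the repaired markup to see what Jsoup actually built
        System.out.println(Jsoup.parse("<p>Java <p>Scala").body().html());
    }
}
```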