A typical crawler works in the following steps:
1. Parse the root web page ("mit.edu") and get all links from that page. To access each URL and parse the HTML, we use JSoup, a convenient and simple Java library similar to Python's BeautifulSoup (a short standalone sketch follows this list).
2. Using the URLs retrieved in step 1, fetch and parse those pages in turn.
3. While doing the above, we need to keep track of the pages that have already been processed, so that each web page is processed only once. This is why we need a database.
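As a quick illustration of step 1, here is a minimal standalone JSoup sketch that fetches a page and prints its links (the class name JsoupDemo is just for this example; it is not part of the crawler built below):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the root page
        Document doc = Jsoup.connect("http://www.mit.edu/").get();
        // Select every anchor element that has an href attribute
        for (Element link : doc.select("a[href]")) {
            // "abs:href" resolves relative links against the page's base URL
            System.out.println(link.attr("abs:href"));
        }
    }
}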
Building the crawler in Java
1. Download the JSoup core library from http://jsoup.org/download and the MySQL connector JAR from http://dev.mysql.com/downloads/connector/j/.
2. Create a project named Crawler in Eclipse and add the JSoup and MySQL connector JARs to the Java build path.
3. Create a class named DB, which handles the database operations:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {

    public Connection conn = null;

    public DB() {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/Crawler";
            conn = DriverManager.getConnection(url, "root", "admin213");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    // Run a SELECT query and return the result set
    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    // Run a statement that does not return rows (INSERT, TRUNCATE, ...)
    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        // Close the connection when this object is garbage-collected
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}
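The DB class assumes a MySQL database named Crawler with a Record table already exists; the TRUNCATE and INSERT statements in Main below will fail otherwise. Here is a minimal one-time setup sketch. The exact schema is an assumption inferred from those statements (a URL column, plus an auto-increment key since the INSERT requests generated keys):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class Setup {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        // Connect to the server itself; no database is selected yet
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/", "root", "admin213");
        Statement sta = conn.createStatement();
        sta.execute("CREATE DATABASE IF NOT EXISTS Crawler");
        // Assumed schema: an auto-increment key plus the visited URL
        sta.execute("CREATE TABLE IF NOT EXISTS Crawler.Record ("
                + "RecordID INT NOT NULL AUTO_INCREMENT PRIMARY KEY, "
                + "URL TEXT NOT NULL)");
        conn.close();
    }
}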
4. Create a class named Main, which will be our crawler:
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        // Clear the record of visited URLs, then start from the root page
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.mit.edu");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // Check if the given URL is already in the database
        String sql = "select * from Record where URL = '" + URL + "'";
        ResultSet rs = db.runSql(sql);
        if (rs.next()) {
            return; // already processed
        }

        // Store the URL in the database to avoid parsing it again
        sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
        PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
        stmt.setString(1, URL);
        stmt.execute();

        // Get useful information from the page being processed
        Document doc = Jsoup.connect(URL).get();
        if (doc.text().contains("research")) {
            System.out.println(URL);
        }

        // Get all links and recursively call processPage on each of them
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            if (link.attr("href").contains("mit.edu")) {
                processPage(link.attr("abs:href"));
            }
        }
    }
}
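One design note: the existence check in processPage builds its SELECT by string concatenation, so a URL containing a single quote would break the query. A safer variant, sketched below as a hypothetical helper method for Main, uses a parameterized query just like the INSERT already does:

// Hypothetical helper (not in the original): returns true if the URL is
// already recorded. The parameter placeholder means URLs containing quotes
// cannot break the SQL.
public static boolean isVisited(String url) throws SQLException {
    PreparedStatement check = db.conn.prepareStatement(
            "SELECT 1 FROM Record WHERE URL = ?");
    check.setString(1, url);
    ResultSet rs = check.executeQuery();
    return rs.next();
}

With this helper, processPage could simply begin with: if (isVisited(URL)) return;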