Web Crawling with Spring Batch
I have a crawl function that also checks whether each page's content contains any of the given parameters. If it does, I want to write the match to the database. How can I use the following code as the reader in a Spring Batch job?
    public void crawl(String baseUrl, String url, String postgresParam) {
        if (!urls.contains(url) && url.startsWith(baseUrl)) {
            // System.out.println(">> count: " + count + " [" + url + "]");
            urls.add(url);
            try {
                Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
                Document htmlDocument = connection.get();
                Elements linksOnPage = htmlDocument.select("a[href]");
                bodyContent = htmlDocument.body().text();
                String title = htmlDocument.title();
                searchParameters(url, title);
                // count++;
                for (Element link : linksOnPage) {
                    crawl(baseUrl, link.absUrl("href"), postgresParam);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void searchParameters(String URL, String title) {
        for (String param : postgresParamArray) {
            if (bodyContent.toLowerCase().contains(param.toLowerCase())) {
                System.out.println(">>>>>> Found: " + " [" + param + "]" + " [" + URL + "]" + " [" + title + "]");
            }
        }
    }
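
One possible direction, offered as a sketch rather than a confirmed solution: a Spring Batch reader hands back one item per `read()` call and returns `null` when there is nothing left, so the recursive `crawl` could be turned into an iterative crawl over a queue of pending URLs. The code below assumes that shape. To stay self-contained it declares a local `ItemReader` interface that only mirrors `org.springframework.batch.item.ItemReader`, and it takes the page fetch as a function instead of calling Jsoup directly. `CrawledPage`, `CrawlItemReader`, `ParamMatcher`, and the `fetcher` parameter are hypothetical names introduced for illustration, not part of the original code or of Spring Batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Local stand-in with the same shape as Spring Batch's ItemReader<T>.
// In a real job, implement org.springframework.batch.item.ItemReader instead.
interface ItemReader<T> {
    T read() throws Exception;
}

// Hypothetical value object: what one crawled page contributes to the job.
final class CrawledPage {
    final String url;
    final String title;
    final String body;
    final List<String> links;

    CrawledPage(String url, String title, String body, List<String> links) {
        this.url = url;
        this.title = title;
        this.body = body;
        this.links = links;
    }
}

// Iterative version of crawl(): each read() fetches one unseen URL.
final class CrawlItemReader implements ItemReader<CrawledPage> {
    private final String baseUrl;
    // Assumption: in the real job this would wrap Jsoup.connect(url)...get().
    private final Function<String, CrawledPage> fetcher;
    private final Deque<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    CrawlItemReader(String baseUrl, String startUrl, Function<String, CrawledPage> fetcher) {
        this.baseUrl = baseUrl;
        this.fetcher = fetcher;
        enqueue(startUrl);
    }

    // Same guard as the original: only unseen URLs under baseUrl are queued.
    private void enqueue(String url) {
        if (url.startsWith(baseUrl) && seen.add(url)) {
            frontier.add(url);
        }
    }

    @Override
    public CrawledPage read() {
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            CrawledPage page;
            try {
                page = fetcher.apply(url);
            } catch (RuntimeException e) {
                continue; // skip pages that fail to load, as the original catch block does
            }
            if (page == null) {
                continue;
            }
            for (String link : page.links) {
                enqueue(link);
            }
            return page;
        }
        return null; // null tells the step that the input is exhausted
    }
}

// Counterpart of searchParameters(): returns the matches instead of printing them.
final class ParamMatcher {
    static List<String> matches(CrawledPage page, List<String> params) {
        String body = page.body.toLowerCase();
        List<String> found = new ArrayList<>();
        for (String param : params) {
            if (body.contains(param.toLowerCase())) {
                found.add(param);
            }
        }
        return found;
    }
}
```

In an actual Spring Batch step, the matching would probably live in an `ItemProcessor` that returns `null` for pages without a match (Spring Batch filters such items out), with an `ItemWriter` doing the database insert. Note that the queue and the seen-set here are in memory only, so this sketch would not survive a job restart without extra work.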
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
