Scraping each link from sitemap.xml
I'm new to Apify.
I would like to scrape each link listed in a sitemap.xml.
More specifically, I have the following situation:
My sitemap URL: https://www.mywebsite.com/sitemap.xml
The links in my sitemap look like: https://www.mywebsite.com/product_id/product
e.g. https://www.mywebsite.com/534372/acer_laptop
Is there a way to extract the following elements from each link: title, product_image_url, price?
I tried Web Scraper and Legacy PhantomJS Crawler, but I think I'm missing something because I can't get the elements I need.
Solution 1:[1]
For increased performance, try one of the following:
- make sure you disable these options in the advanced settings:
  - Download media files
  - Download CSS files
- look into using Cheerio Scraper instead of Web Scraper or Puppeteer Scraper, if you aren't already: https://docs.apify.com/scraping/cheerio-scraper
- request a custom optimized solution on the Apify Marketplace: https://apify.com/marketplace
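As a rough illustration of why a browser is not strictly needed for the sitemap itself: a sitemap is plain XML, so once its text has been downloaded, the links can be pulled out of the `<loc>` elements directly. The sketch below assumes the XML is already available as a string; `extractSitemapUrls` is a hypothetical helper (not part of Apify or Cheerio), the second URL in the sample is made up, and the regular expression does not handle CDATA sections or nested sitemap indexes.

```javascript
// Minimal sketch: collect the URL inside every <loc> element of a sitemap.
// extractSitemapUrls is a hypothetical helper, not an Apify or Cheerio API.
function extractSitemapUrls(xml) {
  const urls = [];
  // Lazily match the text between <loc> and </loc>, ignoring surrounding whitespace.
  const locPattern = /<loc>\s*(.*?)\s*<\/loc>/gis;
  for (const match of xml.matchAll(locPattern)) {
    // XML escapes "&" as "&amp;"; undo that so the URL can be requested as-is.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

// Sample input: the first URL comes from the question, the second is invented.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mywebsite.com/534372/acer_laptop</loc></url>
  <url><loc> https://www.mywebsite.com/100001/some_product </loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleXml));
```

The resulting array could then serve as the list of start URLs for whichever scraper you choose.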
Solution 2:[2]
Consider writing a function using Puppeteer. Open the sitemap in your browser and look for the class name of the tag that holds each link. This function could be a good start. I'm going to try it myself and see if it works.
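const puppeteer = require("puppeteer");

async function scrap() {
  const browser = await puppeteer.launch({
    headless: true,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });
  const page = await browser.newPage();
  await page.goto(`https://yourpage.it/sitemap.xml`);
  const data = await page.evaluate(() => {
    // querySelectorAll returns a NodeList, which has no innerHTML of its own,
    // so read innerHTML from each matched element instead.
    // The selector depends on how the browser renders the XML; adjust it to what you see.
    const links = Array.from(
      document.querySelectorAll(".html-tag > span"),
      (el) => el.innerHTML
    );
    return { links };
  });
  await page.close();
  await browser.close();
  return data;
}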
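Once the links are collected, looping through them might look like the sketch below. It assumes product pages follow the `https://www.mywebsite.com/product_id/product` pattern from the question, with a numeric ID as the first path segment; `parseProductUrl` is a hypothetical helper and the `/contact` link is a made-up example of a non-product page.

```javascript
// Hypothetical helper: split a product URL of the form
// https://www.mywebsite.com/<product_id>/<product> into its parts.
function parseProductUrl(link) {
  let url;
  try {
    url = new URL(link);
  } catch {
    return null; // not an absolute URL
  }
  // Drop empty segments produced by leading or trailing slashes.
  const segments = url.pathname.split("/").filter(Boolean);
  // Assumption: product pages have exactly two path segments and a numeric ID.
  if (segments.length !== 2 || !/^\d+$/.test(segments[0])) return null;
  return { url: url.href, productId: segments[0], slug: segments[1] };
}

const links = [
  "https://www.mywebsite.com/534372/acer_laptop",
  "https://www.mywebsite.com/contact", // made-up non-product page
];

// Keep only the links that look like product pages.
const products = links.map(parseProductUrl).filter(Boolean);
console.log(products);
```

Each remaining entry is a page you would then open to read the title, product_image_url, and price.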
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Vasek Tobey Vlcek |
| Solution 2 | Vincenzo |
