How to Ignore URL references in Scrapy
I'm using Scrapy to scrape a website that contains a menu with a lot of sublevel menus. The problem is that I'm extracting multiple URLs that correspond to the same item/subitem in the website. I'm extracting them as if they were different items because the URLs contain a "ref=" section. For example:
https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_1
https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_2
https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_3
https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_4
All of these URLs point to the same subsubitem_ABC on the website. Instead, I would like to extract only one URL for subsubitem_ABC:
https://thestore/category1/subitem/subsubitem_ABC
This way, my intention is to reduce the crawler's running time and avoid duplicate URLs for the same item, subitem, or subsubitem.
So far I have these rules:
rules = [
    Rule(
        LinkExtractor(
            restrict_xpaths=['my_xpath"]//a'],
        ),
        follow=True,
        callback='parse_categories',
    )
]
Is there something I can add to the Rule/LinkExtractor to avoid the references in the URLs?
Solution 1:[1]
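If you only want to scrape a single URL such as "https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_1", you can filter the links with a regular expression instead of relying on XPath alone. LinkExtractor accepts a pattern through its allow argument, for example allow=r'https://thestore/category1/subitem/subsubitem_ABC/ref(.*?)1'. Note that this keeps only the links whose URL matches the pattern; the "ref=" part stays in the URL. Hope this helps.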
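A different approach, not part of the answer above, is to rewrite each link before it is scheduled so that every "ref=" variant collapses into one URL. The sketch below uses LinkExtractor's process_value argument, which receives each extracted attribute value and may return a modified one. The names strip_ref and build_rules and the '//nav' XPath are hypothetical, and the code assumes the site serves the same page when the "/ref=..." segment is dropped. Once the URLs are identical, Scrapy's default duplicate filter should skip the repeats.

```python
import re

# Matches a "/ref=..." path segment up to the next "/", "?" or "#".
_REF_SEGMENT = re.compile(r'/ref=[^/?#]*')


def strip_ref(url):
    """Return the URL with any "/ref=..." path segment removed."""
    return _REF_SEGMENT.sub('', url)


def build_rules():
    """Build CrawlSpider rules; requires Scrapy to be installed."""
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule

    return [
        Rule(
            LinkExtractor(
                restrict_xpaths=['//nav'],  # hypothetical; use your own XPath
                process_value=strip_ref,    # normalise each link before it is queued
            ),
            follow=True,
            callback='parse_categories',
        )
    ]
```

In the spider, the class attribute would then be set with rules = build_rules(). The helper can be tried on its own first, for example by passing the four URLs from the question through strip_ref and checking that they all come out the same.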
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | studymakesmebetter |
