How to ignore URL references in Scrapy

I'm using Scrapy to scrape a website that contains a menu with a lot of sublevel menus. The problem is that I'm extracting multiple URLs that correspond to the same item/subitem in the website. I'm extracting them as if they were different items because the URLs contain a "ref=" section. For example:

https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_1
https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_2
https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_3
https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_4

All these URLs correspond to the same subsubitem_ABC on the website. Instead, I would like to extract only one URL for subsubitem_ABC:

https://thestore/category1/subitem/subsubitem_ABC

My intention is to reduce the crawler's running time and avoid duplicate URLs for the same item, subitem, or subsubitem.

So far I have these rules:

rules = [
    Rule(
        LinkExtractor(
            restrict_xpaths=['my_xpath"]//a',],
        ),
        follow=True,
        callback='parse_categories'
    )
]

Is there something I can add to the Rule/LinkExtractor to avoid the references in the URLs?



Solution 1:[1]

If you only want to scrape "https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_1", you can use a regular expression rather than an XPath, passed through the LinkExtractor's allow argument. It could be allow = r'https://thestore/category1/subitem/subsubitem_ABC/ref(.*?)1'. Hope this helps.
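
Another option, if the goal is to collapse every "ref=" variant into one URL rather than allow a single variant, is to rewrite each extracted link before Scrapy schedules it. The sketch below is a minimal illustration, assuming the reference always appears as a "/ref=..." path segment as in the URLs from the question. The helper name strip_ref and the regular expression are hypothetical, not part of Scrapy.

```python
import re

# Assumption: the tracking reference is a path segment of the form "/ref=...",
# as in the example URLs from the question. Adjust the pattern if the site differs.
_REF_SEGMENT = re.compile(r"/ref=[^/?#]*")


def strip_ref(url):
    """Return the URL with any "/ref=..." path segment removed (hypothetical helper)."""
    return _REF_SEGMENT.sub("", url)


if __name__ == "__main__":
    urls = [
        "https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_1",
        "https://thestore/category1/subitem/subsubitem_ABC/ref=asd_asd_2",
    ]
    # Both variants collapse to the same canonical URL.
    print({strip_ref(u) for u in urls})
```

A function like this can be passed to the link extractor as LinkExtractor(restrict_xpaths=[...], process_value=strip_ref). Since all variants then produce the same URL, Scrapy's default duplicate request filter should skip the repeats, though this has not been checked against the actual site.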

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 studymakesmebetter