Generate Python regex at runtime to match numbers from 'n' to infinity

I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page should be parsed or a link followed.

I am implementing a resume feature for my spider, so it can continue crawling from the last visited page. For this, I fetch the last followed link from a database when the spider is launched.

My site's URLs look like http://foobar.com/page1.html, so the rule's regex to follow every such link would usually be something like /page\d+\.html.

But how can I write a regex that matches, for example, page 15 and higher? And since I don't know the starting point in advance, how can I generate this regex at runtime?



Solution 1:[1]

Why not capture the page number as a group, then check whether it qualifies:

>>> import re
>>> m = re.match(r"/page(\d+)\.html", "/page18.html")
>>> if m:
...     ID = int(m.group(1))
...
>>> ID > 15
True

Or, more specifically, what you requested:

>>> def genRegex(n):
...     return ''.join('[' + "0123456789"[int(d):] + ']' for d in str(n))
...
>>> genRegex(123)
'[123456789][23456789][3456789]'
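One caveat worth noting: the character-class pattern above only matches same-length numbers whose every digit is at least the corresponding digit of n, so genRegex(123) does not match 130 even though 130 > 123. A sketch of a generator that is exact for "n or greater" (the function name ge_regex and its structure are my own, not from the original answer):

```python
import re

def ge_regex(n):
    """Build a regex matching any non-zero-padded integer >= n."""
    s = str(n)
    length = len(s)
    parts = [re.escape(s)]  # n itself
    # For each position, keep the prefix, bump that digit up,
    # and allow any digits after it.
    for i in range(length - 1, -1, -1):
        d = int(s[i])
        if d < 9:
            parts.append(s[:i] + "[%d-9]" % (d + 1) + r"\d" * (length - i - 1))
    parts.append(r"\d{%d,}" % (length + 1))  # any number with more digits
    return "(?:%s)" % "|".join(parts)

pattern = re.compile(ge_regex(123))
print(bool(pattern.fullmatch("130")))  # True: 130 >= 123
print(bool(pattern.fullmatch("122")))  # False
```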

Solution 2:[2]

Extending Kabie's answer a little:

def genregex(n):
    nstr = str(n)
    same_digit = ''.join('[' + "0123456789"[int(d):] + ']' for d in nstr)
    return r"\d{%d,}|%s" % (len(nstr) + 1, same_digit)
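As a quick sanity check (the example threshold 15 is my own), the generated pattern can be exercised with re.fullmatch:

```python
import re

def genregex(n):
    nstr = str(n)
    same_digit = ''.join('[' + "0123456789"[int(d):] + ']' for d in nstr)
    return r"\d{%d,}|%s" % (len(nstr) + 1, same_digit)

print(genregex(15))                                # \d{3,}|[123456789][56789]
print(bool(re.fullmatch(genregex(15), "15")))      # True
print(bool(re.fullmatch(genregex(15), "100")))     # True
print(bool(re.fullmatch(genregex(15), "14")))      # False
```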

It's easy to modify this to handle leading zeros if they occur in your site's URLs. But this seems like the wrong approach.

You have a few other options in scrapy. You're probably using SgmlLinkExtractor, in which case the easiest thing is to pass your own function as the process_value keyword argument to do your custom filtering.

You can customize CrawlSpider quite a lot, but if it doesn't fit your task, you should check out BaseSpider.
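Here is a minimal sketch of the process_value approach mentioned above. The extractor calls process_value on each extracted URL; returning None drops the link. The accept_from name, the last_page variable, and the commented Rule line are my own illustrative assumptions:

```python
import re

last_page = 15  # assumption: loaded from your database at spider start

def accept_from(url):
    """Return the url only if its page number is >= last_page, else None."""
    m = re.search(r"/page(\d+)\.html", url)
    if m and int(m.group(1)) >= last_page:
        return url   # keep this link
    return None      # returning None drops the link

# In the spider's rules, roughly:
# Rule(SgmlLinkExtractor(allow=r"/page\d+\.html", process_value=accept_from))
```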

Solution 3:[3]

>>> import re
>>> import random
>>> n = random.randint(100, 1000000)
>>> n
435220
>>> len(str(n))
6
>>> r'\d' * len(str(n))
'\\d\\d\\d\\d\\d\\d'
>>> reg = r'\d{%d}' % len(str(n))
>>> m = re.search(reg, str(n))
>>> m.group(0)
'435220'

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Neuron
Solution 2: Shane Evans
Solution 3: