Generate Python regex at runtime to match numbers from 'n' to infinity
I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page should be parsed or a link followed.
I am implementing a resume feature for my spider, so it can continue crawling from the last visited page. For this, I get the last followed link from a database when the spider is launched.
My site's URLs look like http://foobar.com/page1.html, so the rule's regex to follow every such link would usually be something like /page\d+\.html.
But how can I write a regex that matches, for example, page 15 and above? Also, since I don't know the starting point in advance, how can I generate this regex at runtime?
Solution 1:[1]
Why not group the page number, then check whether it qualifies:
>>> import re
>>> m = re.match(r"/page(\d+)\.html", "/page18.html")
>>> if m:
...     ID = int(m.groups()[0])
...
>>> ID > 15
True
Or more specifically what you requested:
>>> def genRegex(n):
...     return ''.join('[' + "0123456789"[int(d):] + ']' for d in str(n))
...
>>> genRegex(123)
'[123456789][23456789][3456789]'
Solution 2:[2]
Extending Kabie's answer a little:
def genregex(n):
    nstr = str(n)
    same_digit = ''.join('[' + "0123456789"[int(d):] + ']' for d in nstr)
    return "\d{%d,}|%s" % (len(nstr) + 1, same_digit)
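For example (a usage sketch, not part of the original answer, assuming the question's /pageN.html URL scheme), the alternation needs a non-capturing group when it is embedded into the page rule:
>>> import re
>>> genregex(15)
'\\d{3,}|[123456789][56789]'
>>> page_rule = re.compile(r"/page(?:%s)\.html" % genregex(15))
>>> bool(page_rule.match("/page18.html"))
True
>>> bool(page_rule.match("/page150.html"))
True
>>> bool(page_rule.match("/page9.html"))
False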
It's easy to modify this to handle leading zeros if they occur in your site's URLs. But this seems like the wrong approach.
You have a few other options in Scrapy. You're probably using SgmlLinkExtractor, in which case the easiest thing is to pass your own function as the process_value keyword argument to do your custom filtering (see the sketch below).
You can customize CrawlSpider quite a lot, but if it doesn't fit your task, you should check out BaseSpider.
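A minimal sketch of that process_value approach (not from the original answer, and assuming the older SgmlLinkExtractor API mentioned above; the spider name, start_urls, and the hard-coded LAST_PAGE are hypothetical placeholders, since in practice the resume point would be loaded from the database at startup):
import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

LAST_PAGE = 15  # hypothetical resume point

def skip_visited_pages(value):
    # process_value receives each extracted attribute value (possibly a
    # relative URL); returning None tells the extractor to drop the link.
    m = re.search(r"page(\d+)\.html", value)
    if m and int(m.group(1)) < LAST_PAGE:
        return None
    return value

class FoobarSpider(CrawlSpider):
    name = "foobar"
    start_urls = ["http://foobar.com/page1.html"]
    rules = (
        Rule(SgmlLinkExtractor(allow=r"/page\d+\.html",
                               process_value=skip_visited_pages),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        pass  # extraction logic would go here

The extractor still follows every /pageN.html link it keeps, so pages after the resume point are crawled normally; only links below LAST_PAGE are dropped before they are scheduled.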
Solution 3:[3]
>>> import re
>>> import random
>>> n = random.randint(100, 1000000)
>>> n
435220
>>> len(str(n))
6
>>> '\d' * len(str(n))
'\\d\\d\\d\\d\\d\\d'
>>> reg = '\d{%d}' % len(str(n))
>>> m = re.search(reg, str(n))
>>> m.group(0)
'435220'
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Neuron |
| Solution 2 | Shane Evans |
| Solution 3 | |
