Python regular expressions matching question
#!/usr/bin/python3
import requests
import re
def get_html(url):
    r = requests.get(url)
    return r.text
url = "https://foxnews.com/"
html = get_html(url)
pattern = re.compile(r'\b(?:https?:\/\/)?(?:.*?\.)\b(?:png|jpg|jpeg|gif|svg)')
matches = re.findall(pattern, html)
for match in matches:
    print(match)
But it keeps matching things like this:
" data-src="//a57.foxnews.com/hp.foxnews.com/images/2022/05/320/180/adc987321731b78d41a3b061e67ea84c.png
href="https://www.foxnews.com/entertainment/saturday-night-live-medieval-supreme-courts-leaked-draft-opinion-abortion-roe-v-wade"
And so on. How can I make this match only image links and filter everything else out? Would lookarounds help? I'm stuck here.
Solution 1:[1]
You can use
pattern = re.compile(r'\b(?:https?://)?[^\s<>"\']*\.(?:png|jpg|jpeg|gif|svg)\b')
matches = re.findall(pattern, html)
Details:
\b - word boundary
(?:https?://)? - an optional sequence of http:// or https://
[^\s<>"\']* - zero or more chars other than whitespace, ", ', < and >
\. - a dot
(?:png|jpg|jpeg|gif|svg) - one of the extensions listed
\b - word boundary
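As a quick check of the pattern above, here is a minimal sketch that runs it on a small HTML fragment instead of a live page. The fragment is hypothetical: it is modelled on the snippets quoted in the question, and the example.com URL is invented for illustration.

```python
import re

# Hypothetical HTML fragment, modelled on the snippets quoted in the question.
html = (
    '<img data-src="//a57.foxnews.com/hp.foxnews.com/images/2022/05/320/180/'
    'adc987321731b78d41a3b061e67ea84c.png">\n'
    '<a href="https://www.foxnews.com/entertainment/some-article">story</a>\n'
    '<img src="https://example.com/static/logo.svg">\n'
)

# Pattern from Solution 1, unchanged.
pattern = re.compile(r'\b(?:https?://)?[^\s<>"\']*\.(?:png|jpg|jpeg|gif|svg)\b')

# The pattern has no capturing groups, so findall returns the full matches.
matches = pattern.findall(html)
for match in matches:
    print(match)
```

Because `[^\s<>"\']*` cannot cross quotes, whitespace or angle brackets, the attribute names and the article link are no longer picked up. Note that the leading `//` of the protocol-relative URL is not part of the match: `\b` can only match next to a word character, so the match starts at `a57`.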
Solution 2:[2]
I have collated a set of regular expressions that are available on patterns-finder. For finding URLs, I have been using the following regex:
((ftp|https?):\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|www\\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|https?:\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9]+\\.[^\\s]{2,}|www\\.[a-zA-Z0-9]+\\.[^\\s]{2,})
Note that it also accepts the ftp protocol. With a slight modification, you can append this list of file extensions, (gif|png|jpg|jpeg|webp|svg|psd|bmp|tif), which covers more of the common image types. The final regex looks like this:
\b((ftp|https?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})[^\s]*\.(gif|png|jpg|jpeg|webp|svg|psd|bmp|tif)\b
It will match the following:
https://cdn.analytics.com/wp-content/uploads/2022/04/test.jpg
http://cdn.analytics.com/wp-content/uploads/2022/04/file.jpg
http://cdn.analytics.com/wp-content/uploads/2022/04/.jpg
ftp://www.test.com/wp-content/uploads/2022/04/.jpg
It also handles file names that consist only of the extension, such as ".jpg" or ".png". However, it will not match the following:
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test.html
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/png
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
| Solution 2 | Rachid Benouini |
