'Issues with extracting URLs from text
I am trying to find a regular expression to extract any valid URLs (not only http[s]) using a regular expression. Unfortunately, each one outputs weird things. The best results I achieved using this regex:
\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
But I can mark at least the following issues:
- http://208.206.41.61/email/[email protected]&=refdoc=3D(01-128) is extracted as http://208.206.41.61/email/[email protected]&=
- http://www.onlinefilefolder.com',AJAXTHRESHOLD should be extracted without AJAXTHRESHOLD
- CSS / HTML styling is extracted, for example
xmlns:x="urn:schemas-microsoft-com:xslt, ze:12px;color:#666, font-size:12px;coloretc
How can I improve this regex to make sure only valid URLs are extracted? I am not only extracting it from the HTML, but also from a plain text. Therefore, using only beautifulsoup is impossible for my use case.
Solution 1:[1]
No regex is perfect, but this one might help you:
(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])
Flag to enable: insensitive, global, multiline (igm)
Source: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
