'Issues with extracting URLs from text

I am trying to find a regular expression to extract any valid URLs (not only http[s]) using a regular expression. Unfortunately, each one outputs weird things. The best results I achieved using this regex:

\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

But I can mark at least the following issues:

http://208.206.41.61/email/[email protected]&=refdoc=3D(01-128) is extracted as http://208.206.41.61/email/[email protected]&=
http://www.onlinefilefolder.com',AJAXTHRESHOLD should be extracted without AJAXTHRESHOLD
CSS / HTML styling is extracted, for example xmlns:x="urn:schemas-microsoft-com:xslt, ze:12px;color:#666, font-size:12px;color etc

How can I improve this regex to make sure only valid URLs are extracted? I am not only extracting it from the HTML, but also from a plain text. Therefore, using only beautifulsoup is impossible for my use case.

Solution 1:^[1]

No regex is perfect, but this one might help you:

(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])

Flag to enable: insensitive, global, multiline (igm)

Source: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'Issues with extracting URLs from text

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]