'Issues with extracting URLs from text

I am trying to find a regular expression to extract any valid URLs (not only http[s]) using a regular expression. Unfortunately, each one outputs weird things. The best results I achieved using this regex:

\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

But I can mark at least the following issues:

How can I improve this regex to make sure only valid URLs are extracted? I am not only extracting it from the HTML, but also from a plain text. Therefore, using only beautifulsoup is impossible for my use case.



Solution 1:[1]

No regex is perfect, but this one might help you:

(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])

Flag to enable: insensitive, global, multiline (igm)

Source: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1