Find a URL with RegEx without finding the value after it
I saw a few posts on this, but they were for PHP. I need JavaScript (actually ActionScript, which, like JavaScript, is based on ECMAScript), so my question is how to capture only up to a comma, period, question mark or exclamation point.
This is what I have so far:
instructionText.replace(/(https?:\/\/\w.*[\w])/gi, "<a href='$1' target='_blank'>$1</a>");
But when I use the text "Visit http://www.google.com. Hello world", it also captures the "Hello world" part.
The result of the capture group above is "http://www.google.com. Hello world". Obviously I don't want anything after the URL. They should be simple URLs.
Mainly, I just want to add a check for any of these ".,!?" or a space character and end the capture group. It doesn't have to be perfect.
BTW, I'm not sure if you have something to test your RegEx with first, but if not, you can use RegExr.
Solution 1:[1]
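The problem is that you are capturing .* followed by a \w, which means any amount of anything followed by a word character. Because .* is greedy, it runs to the end of the line and then backs up only as far as the last word character.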
/(https?:\/\/\w.*[\w])/
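You need to make your wildcard capture ungreedy (lazy) and tell it where the URL has to end...
/(https?:\/\/\w.*?\w)(?=[.,!?]?(?:\s|$))/
The lazy .*? captures as few characters as possible, and the lookahead forces the match to end on a \w that is followed by optional punctuation and then whitespace or the end of the text. Without the lookahead, the lazy version would stop at the very first \w it finds and match only "http://ww".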
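A minimal sketch comparing greedy and lazy matching on the question's sample text. The variable names are illustrative, and the third pattern adds a lookahead for trailing punctuation, which is one possible way to bound a lazy match rather than the only one.

```javascript
// Sample text from the question.
const text = "Visit http://www.google.com. Hello world";

// Greedy: .* runs to the end of the line, then backtracks to the last \w.
const greedy = /https?:\/\/\w.*\w/i;

// Lazy only: .*? stops as soon as the following \w can match.
const lazy = /https?:\/\/\w.*?\w/i;

// Lazy plus a lookahead: the match must end where optional punctuation
// and then whitespace (or the end of the text) follows.
const lazyBounded = /https?:\/\/\w.*?\w(?=[.,!?]?(?:\s|$))/i;

console.log(text.match(greedy)[0]);      // http://www.google.com. Hello world
console.log(text.match(lazy)[0]);        // http://ww
console.log(text.match(lazyBounded)[0]); // http://www.google.com
```

The lookahead is zero-width, so the trailing period stays in the surrounding text instead of ending up inside the link.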
EDIT: More info
Additionally, your regex is very simple and, unfortunately, capturing URLs is quite complex, because there are so many variations of what is valid and what is not. You will need to draw a clear line that defines what you consider a good match for a URL in your context.
If you wanted to ensure valid top level domains for example, you would have to include something like this...
/https?:\/\/\w.*?\.(com|org|co\.uk| ... etc ... )/
Which becomes obsolete as soon as a new top level domain is registered.
If you want to match anything starting with a protocol, and up to the next space, something like this should do...
/[a-zA-Z]+:\/\/\S+/
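Because \S+ runs up to the next whitespace, a period or comma that directly follows the URL is included in the match. A rough sketch of one way to handle that, by stripping trailing punctuation afterwards; the helper name findUrls and the second sample URL are made up for illustration.

```javascript
// Matches a scheme, "://", then everything up to the next whitespace.
const urlLike = /[a-zA-Z]+:\/\/\S+/g;

// Hypothetical helper: find URL-like runs, then drop any trailing . , ! or ?
function findUrls(text) {
  const matches = text.match(urlLike) || [];
  return matches.map((m) => m.replace(/[.,!?]+$/, ""));
}

console.log(findUrls("Visit http://www.google.com. Or ftp://example.org/file.txt!"));
// → ["http://www.google.com", "ftp://example.org/file.txt"]
```

Stripping after the fact keeps the main pattern simple, at the cost of also trimming a URL that legitimately ends in one of those characters.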
Good luck!
Solution 2:[2]
In your regex you're looking for as many characters as possible (.* is greedy), where the last character is a \w character. Try this (a quick edit to your regex). It should work on domains with or without the www. prefix, and on domains with a two- or three-letter TLD, as long as a non-word character such as punctuation or a space follows the URL.
https?\:\/\/(www\.)?\w*?\.\w{2,3}(?=[\W])
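A quick check of how this pattern behaves, written as a JavaScript regex literal. The g and i flags and the variable names are assumptions, and the second pattern is a hypothetical variant that also accepts the end of the text after the TLD.

```javascript
// Solution 2's pattern as a regex literal (flags are an assumption).
const tldPattern = /https?:\/\/(www\.)?\w*?\.\w{2,3}(?=\W)/gi;

// Variant that also allows the URL to be the last thing in the text
// (an illustrative addition, not part of the answer).
const tldPatternOrEnd = /https?:\/\/(www\.)?\w*?\.\w{2,3}(?=\W|$)/gi;

console.log("Visit http://www.google.com. Hello world".match(tldPattern));
// → ["http://www.google.com"]

console.log("Visit http://google.com".match(tldPattern));
// → null, because no non-word character follows the TLD

console.log("Visit http://google.com".match(tldPatternOrEnd));
// → ["http://google.com"]
```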
Solution 3:[3]
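https?:\/\/((www\.)?\w*?(\.\w{2,7})+)(?=\.|,|\?|!|\s)
I guess the lookahead (?=\.|,|\?|!|\s) is the part you were looking for? It ends the match just before a period, comma, question mark, exclamation point or whitespace, without including that character. (The pattern is written here for a regex literal; double the backslashes if you build it from a string.)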
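To tie this back to the question's replace call, here is a minimal sketch that uses Solution 3's pattern as a regex literal with single backslashes. The function name linkify and the use of $& (the whole match) in the replacement are assumptions made for illustration.

```javascript
// Hypothetical helper wrapping the question's replace call around
// Solution 3's pattern (regex-literal form, single backslashes).
function linkify(text) {
  const url = /https?:\/\/((www\.)?\w*?(\.\w{2,7})+)(?=\.|,|\?|!|\s)/gi;
  // $& inserts the full match; the lookahead keeps the punctuation outside it.
  return text.replace(url, "<a href='$&' target='_blank'>$&</a>");
}

console.log(linkify("Visit http://www.google.com. Hello world"));
// Visit <a href='http://www.google.com' target='_blank'>http://www.google.com</a>. Hello world
```

Like Solution 2, this lookahead needs some character after the URL, so a URL at the very end of the text is left alone unless $ is added as another alternative.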
Solution 4:[4]
Thanks to @MikeM's answer, I was able to adapt it to replace links and email addresses (only if no link already exists), omitting trailing punctuation. Here it is in PHP for reference, in case others need it:
/**
 * Replace URLs and emails with HTML links
 *
 * This function wraps all URLs and email addresses in HTML links, ONLY if they are not already inside a link,
 * excluding trailing punctuation (an email or URL followed by a period, comma, etc.).
 *
 * @param string $content Text that may contain bare URLs or email addresses.
 *
 * @return string
 * @since 1.0.0
 */
function replace_links( $content ) {
    // Existing <a>...</a> elements are matched first and discarded via (*SKIP)(*FAIL), so only bare URLs get wrapped.
    $content = preg_replace( '"<a[^>]+>.+?</a>(*SKIP)(*FAIL)|(https?:\/\/\S+?)(?=[.,!?]?(\s|$))"', '<a href="$0">$0</a>', $content );
    // Same approach for email addresses, linked with mailto:.
    $content = preg_replace( '"<a[^>]+>.+?</a>(*SKIP)(*FAIL)|(\S+@\S+\.\S+?)(?=[.,!?]?(\s|$))"', '<a href="mailto:$0">$0</a>', $content );
    return $content;
}
Check gist for latest: https://gist.github.com/tripflex/0cc930c2afe5f4c73f2aed61cedf95d0
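JavaScript regular expressions have no (*SKIP)(*FAIL) verbs, so the PHP above does not port directly. A rough sketch of a similar idea for the URL case only, assuming a replace callback is acceptable: existing anchors are matched by the first alternative and returned unchanged, and only the second alternative captures a bare URL. The function name replaceLinks and the sample input are hypothetical.

```javascript
// Hypothetical JavaScript counterpart of the URL half of replace_links().
function replaceLinks(content) {
  // Alternative 1 consumes an existing <a>...</a> element.
  // Alternative 2 captures a bare URL, stopping before trailing punctuation.
  const pattern = /<a[^>]+>.+?<\/a>|(https?:\/\/\S+?)(?=[.,!?]?(?:\s|$))/g;
  return content.replace(pattern, (match, url) =>
    // url is undefined when alternative 1 matched, so the anchor is kept as-is.
    url === undefined ? match : `<a href="${url}">${url}</a>`
  );
}

console.log(replaceLinks('See http://example.com/a, and <a href="http://x.org">http://x.org</a>.'));
// See <a href="http://example.com/a">http://example.com/a</a>, and <a href="http://x.org">http://x.org</a>.
```

The email pattern from the PHP version could be added as a third alternative with its own capture group, following the same shape.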
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | AMIC MING |
| Solution 4 | sMyles |
