Recognize URLs and anchor tags in Scala
I'm trying to port a regex from JavaScript to Scala, but it's not behaving as I expected. My understanding is that it should port over pretty much one for one. The overall goal is to recognize URLs, with or without a protocol, and to detect whether they sit inside an anchor tag. Currently I've got this for the regex:
private val entireCapture = "entireCapture"
private val aHrefOpen = "aHrefOpen"
private val urlBody = "urlBody"
private val aHrefClose = "aHrefClose"
private val A_HREF_OPEN_CAPTURING_GROUP = "(<a [^>]*)?" //This should get <a>
private val A_HREF_CLOSE_CAPTURING_GROUP = "([^>]*<\\/a>)?"// This should get </a>
private val URL_PROTOCOL = "(?:(?:https?):\\/\\/|www\\d{0,3}[.])"
private val URL_DOMAIN = "(?:\\([A-Z0-9+&@#\\/%=~_|$?!;:,.\\-]*\\)|[^\\s()<>])*"
private val URL_PATH = "(?:\\([A-Z0-9+&@#\\/%=~_|$?!;:,.\\-]*\\)|[^\\s()<>])"
private val URL_BODY = s"($URL_PROTOCOL$URL_DOMAIN$URL_PATH)" //This should get the actual url
private val URL_MATCHER =
  (s"$A_HREF_OPEN_CAPTURING_GROUP$URL_BODY$A_HREF_CLOSE_CAPTURING_GROUP").r(
    entireCapture,
    aHrefOpen,
    urlBody,
    aHrefClose
  )
Then when I run
class LinkUtil {
  def matchURL(text: String): String = {
    URL_MATCHER
      .replaceAllIn(
        text,
        x => rewriteMatch(x.group(entireCapture), x.group(aHrefOpen), x.group(urlBody), x.group(aHrefClose))
      )
  }
  def rewriteMatch(entireCapture: String, aHrefOpen: String, urlBody: String, aHrefClose: String): String = {
    if (aHrefOpen.nonEmpty || aHrefClose.nonEmpty) {
      entireCapture
    }
    else {
      val linkUrl = if (urlBody.matches("""^https?:""")) { urlBody }
      else { ("http://" + urlBody) }
      linkUrl
    }
  }
}
With various URLs (I'm using https://google.com, www.google.com, and <a href="https://google.com">my link</a>) it keeps crashing because of the aHrefClose group: it always fails with an ArrayIndexOutOfBoundsException on 4. I don't understand why it's not working as I expected; at worst I thought that a group without a match would be null rather than not existing at all. What am I doing wrong here? Did I set up the regex incorrectly, or is it something else?
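A likely explanation, assuming the standard scala.util.matching.Regex: the names passed to .r(...) are assigned to the capturing groups in order, starting at group 1. The whole match is group 0 and does not take a name, so entireCapture ends up naming the first group, and the fourth name, aHrefClose, points at a group 4 that the three-group pattern does not have. Separately, an optional group that did not take part in the match comes back as null, so calling nonEmpty on it would throw, and String.matches has to match the entire string, so the "^https?:" check never succeeds for a full URL. The sketch below works around all three by addressing groups by index; the object name LinkSketch and the shortened constant names are hypothetical, and the patterns are copied unchanged from the question.

```scala
import scala.util.matching.Regex

// Hypothetical name; a sketch, not a drop-in replacement for LinkUtil.
object LinkSketch {
  // Patterns copied from the question; only the comments are new.
  private val AHrefOpen  = "(<a [^>]*)?"              // capturing group 1
  private val Protocol   = "(?:(?:https?):\\/\\/|www\\d{0,3}[.])"
  private val Domain     = "(?:\\([A-Z0-9+&@#\\/%=~_|$?!;:,.\\-]*\\)|[^\\s()<>])*"
  private val Path       = "(?:\\([A-Z0-9+&@#\\/%=~_|$?!;:,.\\-]*\\)|[^\\s()<>])"
  private val UrlBody    = s"($Protocol$Domain$Path)" // capturing group 2
  private val AHrefClose = "([^>]*<\\/a>)?"           // capturing group 3

  // No names are passed to .r, so groups are addressed by index only.
  private val UrlMatcher: Regex = s"$AHrefOpen$UrlBody$AHrefClose".r

  def matchURL(text: String): String =
    UrlMatcher.replaceAllIn(text, m => {
      // An optional group that did not participate is null, so wrap it in Option.
      val open  = Option(m.group(1))
      val close = Option(m.group(3))
      val url   = m.group(2) // mandatory group: set whenever the pattern matches
      val out =
        if (open.isDefined || close.isDefined) m.matched // whole match (group 0), left as is
        else if (url.startsWith("http://") || url.startsWith("https://")) url
        else "http://" + url
      // Escape '$' and '\' so the replacement is not read as a group reference.
      Regex.quoteReplacement(out)
    })
}
```

With the three inputs from the question, https://google.com comes back unchanged, www.google.com becomes http://www.google.com, and the anchor-tag input is returned as is. Note that for the anchor-tag input the URL group also swallows the closing double quote of the href attribute, which is harmless here only because matches inside an anchor are left untouched.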
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
