Select text strings with multiple formatting tags within
Context:
VB.NET application using Html Agility Pack to handle an HTML document.
Issue:
In an HTML document, I'd like to prefix every string that starts with # and ends with a space with a URL, whatever formatting tags are used within it. So #sth would become a link to http://www.anything.tld/sth
For instance:
Before:
<p>#string1 blablabla</p>
<p><strong>#stri</strong>ng2 bliblibli</p>
After:
<p><a href="http://www.anything.tld/string1">#string1</a> blablabla</p>
<p><a href="http://www.anything.tld/string2"><strong>#stri</strong>ng2</a> bliblibli</p>
I guess I can achieve this with Html Agility Pack, but how do I select the entire text string without its formatting?
Or should I use a simple regex replace routine?
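For reference, Html Agility Pack exposes the text of a node without its markup through InnerText. The following is only a minimal sketch, assuming the HtmlAgilityPack package is referenced; the module name and the sample markup (modeled on the second example above) are illustrative, not part of the application.

```vb
Imports System
Imports HtmlAgilityPack

Module InnerTextSketch
    Sub Main()
        ' Sample markup modeled on the question's second example (assumption).
        Dim html As String = "<p><strong>#stri</strong>ng2 bliblibli</p>"
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        ' SelectNodes returns Nothing when no node matches the XPath.
        Dim paragraphs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//p")
        If paragraphs IsNot Nothing Then
            For Each p As HtmlNode In paragraphs
                ' InnerText concatenates the text nodes and leaves the tags out.
                Console.WriteLine(p.InnerText)
            Next
        End If
    End Sub
End Module
```

InnerText gives the string to match against, but not the position of the tags, so wrapping the original markup in a link still needs a separate step.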
Solution 1:[1]
Here's my solution. I'm sure it would make some experienced developers bleed from every hole, but it actually works. The HTML code is in strCorpusHtmlContent.
' Requires Imports System.Text.RegularExpressions at the top of the file.
' strUrnPrefixHashtag holds the link prefix.
Dim strRegexPatternHashtag As String = "#([\s]*)(\w*)"
Dim matchsHashtag As MatchCollection = Regex.Matches(strCorpusHtmlContent, strRegexPatternHashtag)
' Process the matches from last to first so the indexes of earlier matches stay valid.
For i As Integer = matchsHashtag.Count - 1 To 0 Step -1
    Dim matchHashtag As Match = matchsHashtag(i)
    ' Link target: the hashtag text without any tags
    Dim strHashtagValueToFormat As String = matchHashtag.Value
    ' 0-based index of the first character after the hashtag
    Dim intEndPosition As Integer = matchHashtag.Index + matchHashtag.Length
    ' Test if the hashtag is followed by a tag
    If intEndPosition < strCorpusHtmlContent.Length AndAlso strCorpusHtmlContent.Chars(intEndPosition) = "<"c Then
        Dim blnInATag As Boolean = False
        ' Scan forward until the first whitespace outside a tag (or the end of the document)
        Do While intEndPosition < strCorpusHtmlContent.Length
            Dim nextChar As Char = strCorpusHtmlContent.Chars(intEndPosition)
            If nextChar = "<"c Then
                blnInATag = True
            ElseIf nextChar = ">"c Then
                blnInATag = False
            ElseIf Not blnInATag Then
                If Char.IsWhiteSpace(nextChar) Then Exit Do
                strHashtagValueToFormat &= nextChar
            End If
            intEndPosition += 1
        Loop
    End If
    ' Markup to wrap: the hashtag together with its formatting tags
    Dim strHashtagToFormat As String = strCorpusHtmlContent.Substring(matchHashtag.Index, intEndPosition - matchHashtag.Index)
    Dim strHashtagFormatted As String = "<a href=" & Chr(34) & strUrnPrefixHashtag & strHashtagValueToFormat & Chr(34) & ">" & strHashtagToFormat & "</a>"
    ' Replace by position so that other occurrences of the same text are left alone
    strCorpusHtmlContent = strCorpusHtmlContent.Remove(matchHashtag.Index, strHashtagToFormat.Length).Insert(matchHashtag.Index, strHashtagFormatted)
Next
Before:
<p>#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltiformat</span> to convert</p>
After:
<p><a href="web:keyword:#hashtag_multiformat">#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltiformat</span></a> to convert</p>
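The link target in this example is the hashtag with its formatting tags removed. That step can also be sketched on its own with a regular expression. StripTags is a hypothetical helper name, not part of the code above, and the pattern <[^>]+> assumes that no attribute value contains a > character.

```vb
Imports System
Imports System.Text.RegularExpressions

Module HashtagSketch
    ' Hypothetical helper: removes every <...> tag from an HTML fragment.
    Function StripTags(fragment As String) As String
        Return Regex.Replace(fragment, "<[^>]+>", String.Empty)
    End Function

    Sub Main()
        ' The formatted hashtag from the example, without the trailing text.
        Dim fragment As String = "#has<strong>hta</strong><em>g_m</em>u<span style=""text-decoration: underline;"">ltiformat</span>"
        ' Prints: #hashtag_multiformat
        Console.WriteLine(StripTags(fragment))
    End Sub
End Module
```

This only recovers the text. Finding where the hashtag ends inside the markup still needs a scan such as the loop in the solution.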
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | 8oris |
