Select text strings with multiple formatting tags within

Context:

A VB.NET application that uses HtmlAgilityPack to process an HTML document.

Issue:

In an HTML document, I'd like to prefix every string that starts with # and ends with a space with a URL, whatever formatting tags are used inside it. So #sth would become http://www.anything.tld/sth

For instance:

Before:

<p>#string1 blablabla</p>
<p><strong>#stri</strong>ng2 bliblibli</p>

After:

<p><a href="http://www.anything.tld/string1">#string1</a> blablabla</p>
<p><a href="http://www.anything.tld/string2"><strong>#stri</strong>ng2</a> bliblibli</p>

I guess I can achieve this with HtmlAgilityPack, but how do I select the entire text string without its formatting?

Or should I use a simple regex replace routine?



Solution 1:[1]

Here's my solution. I'm sure it would make some experienced developers bleed from every hole, but it actually works. The HTML code is in strCorpusHtmlContent and the URL prefix is in strUrnPrefixHashtag.

Imports System.Text.RegularExpressions

Dim strRegexPatternHashtag As String = "#([\s]*)(\w*)"
Dim matchsHashtag As MatchCollection = Regex.Matches(strCorpusHtmlContent, strRegexPatternHashtag)

' Walk the matches from last to first so that splicing in a link
' does not shift the indexes of the matches still to be processed
For i As Integer = matchsHashtag.Count - 1 To 0 Step -1
    Dim matchHashtag As Match = matchsHashtag(i)
    Dim strHashtagValueToFormat As String = matchHashtag.Value
    Dim intEndPosition As Integer = matchHashtag.Index + matchHashtag.Length

    ' Test if the hashtag is followed by a tag
    If intEndPosition < strCorpusHtmlContent.Length AndAlso strCorpusHtmlContent(intEndPosition) = "<"c Then
        Dim blnInATag As Boolean = False
        ' Read on until the first whitespace outside a tag, or the end of the content
        Do While intEndPosition < strCorpusHtmlContent.Length
            Dim nextChar As Char = strCorpusHtmlContent(intEndPosition)
            If nextChar = "<"c Then
                blnInATag = True
            ElseIf nextChar = ">"c Then
                blnInATag = False
            ElseIf Not blnInATag Then
                If Char.IsWhiteSpace(nextChar) Then Exit Do
                ' Text outside a tag belongs to the hashtag value
                strHashtagValueToFormat &= nextChar
            End If
            intEndPosition += 1
        Loop
    End If

    ' The hashtag as it appears in the HTML, inner tags included
    Dim strHashtagToFormat As String = strCorpusHtmlContent.Substring(matchHashtag.Index, intEndPosition - matchHashtag.Index)
    Dim strHashtagFormatted As String = "<a href=" & Chr(34) & strUrnPrefixHashtag & strHashtagValueToFormat & Chr(34) & ">" & strHashtagToFormat & "</a>"

    ' Replace this occurrence only, by position rather than by pattern
    strCorpusHtmlContent = strCorpusHtmlContent.Remove(matchHashtag.Index, strHashtagToFormat.Length).Insert(matchHashtag.Index, strHashtagFormatted)
Next

Before:

<p>#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltiformat</span> to convert</p>

After:

<p><a href="web:keyword:#hashtag_multi ">#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltiformat</span></a> to convert</p>
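
For comparison, the same idea can probably be written as a single Regex.Replace pass with a MatchEvaluator, so the string is never modified while matches are still being read. This is only a rough sketch, not the code from the answer: LinkHashtags and WrapBalancedPart are hypothetical names, and it assumes the HTML has no # inside attributes (for example href="#top"). It links only the part of the hashtag whose tags are balanced, so the output stays well-formed. The downside is that a hashtag starting inside a formatting element and continuing outside it, like the second example in the question, is linked only up to the closing tag.

```vb
Imports System
Imports System.Text.RegularExpressions

Module HashtagLinker
    ' Hypothetical helper: wraps every hashtag found in html in a link.
    Function LinkHashtags(html As String, urlPrefix As String) As String
        ' "#" not preceded by "&" or a word character (skips entities such as &#39;),
        ' followed by any run of word characters and tags.
        Return Regex.Replace(html, "(?<![&\w])#(?:\w|<[^>]+>)+",
            Function(m As Match) WrapBalancedPart(m.Value, urlPrefix))
    End Function

    Private Function WrapBalancedPart(raw As String, urlPrefix As String) As String
        Dim depth As Integer = 0
        Dim cut As Integer = 0
        ' Split the match into tags and text runs, and remember the last
        ' position at which every tag opened inside the match is closed again.
        For Each token As Match In Regex.Matches(raw, "<[^>]+>|[^<]+")
            If token.Value.StartsWith("</", StringComparison.Ordinal) Then
                depth -= 1
            ElseIf token.Value.StartsWith("<", StringComparison.Ordinal) Then
                depth += 1
            End If
            ' A closing tag whose opening tag lies before the hashtag ends the link.
            If depth < 0 Then Exit For
            If depth = 0 Then cut = token.Index + token.Length
        Next

        Dim linked As String = raw.Substring(0, cut)
        ' Hashtag text without its formatting tags and without the leading "#".
        Dim name As String = Regex.Replace(linked, "<[^>]+>", "").TrimStart("#"c)
        If name.Length = 0 Then Return raw
        Return "<a href=""" & urlPrefix & name & """>" & linked & "</a>" & raw.Substring(cut)
    End Function
End Module
```

Calling LinkHashtags(strCorpusHtmlContent, strUrnPrefixHashtag) would then replace the whole loop. Unlike the code above, this sketch leaves the # out of the href, which matches the URL format asked for in the question.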

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 8oris