'Return nodes that have siblings that start-with substring of self

I'm trying to come up with a clean Xpath 1.0 expression in Excel's FILTERXML() function to return nodes with the following requirement:

  • The node must have a sibling (following) that starts with the exact same three characters of itself.

The point is to find out if there is some similarity to a certain degree in the data. Imagine the following sample data:

<t>
    <s>ABCDEF</s>
    <s>GHIJKL</s>
    <s>MNOPQR</s>
    <s>GHISTU</s>
    <s>ABVWXY</s>
</t>

From here, I'd like to return GHIJKL since it's first 3 characters, 'GHI', are found at the start of the second-to-last node.

I've been trying to piece together functions like starts-with(), substring() and count(), yet not been able to get it right. My (obviously wrong) attempt:

//s[count(following::*[starts-with(., substring(<placeholder>, 1,3))]>0]

I'm unsure if it's possible at all and as to what to write instead of the placeholder or how to rework the query to tell the expression I'd like to take the three leftmost characters of each node and test if there are any duplicates in the following ones.



Solution 1:[1]

Would following expression work?

//s[substring(., 1, 3) = following::*/substring(., 1, 3)]

Solution 2:[2]

Requirement (clean Xpath 1.0 expression): The node must have a sibling (following) that starts with the exact same three characters of itself.

As far as I understand your post, you want to get only node contents actually having followers that meet this requirement.

Sort of workaround might be to include abbreviation attributes a to each s node and to use the following XPath expression:

    "//*[substring(.,1,3)=following-sibling::*/@a]"

To make this reproducible, I expanded the original xml content and added a numeric suffix to identify results easily (of course this would need a prior substitution I left aside to focus upon the main issue and to keep it short & simple).

Sub GetNodesHavingSimilarFollowers()
'a) Define wellformed xml content
    Dim content As String
    content = _
    "<t>" & _
        "<s a='ABC'>ABCDEF1</s>" & _
        "<s a='GHI'>GHIJKL2</s>" & _
        "<s a='MNO'>MNONot3</s>" & _
        "<s a='GHI'>GHINot4</s>" & _
        "<s a='ABV'>ABVNot5</s>" & _
        "<s a='ABC'>ABCPQR6</s>" & _
        "<s a='ABC'>ABCSTU7</s>" & _
        "<s a='ABC'>ABCNot8</s>" & _
    "</t>"
'~~~~~~~~~~~~~~~~~~~~~~~~~~
'b) Define XPath expression
'~~~~~~~~~~~~~~~~~~~~~~~~~~
    Dim XPth As String
    XPth = "//*[substring(.,1,3)=following-sibling::*/@a]"

'c) Execute FilterXML
    Dim x: x = Application.FilterXML(content, XPth)
    Select Case VarType(x)
        Case vbError
            Debug.Print "Error: ", x
        Case vbString
            Debug.Print x, "1 element only."
        Case Else
            Debug.Print Join(Application.Transpose(x), "|"), UBound(x) & " elements found."
End Select

End Sub

Example output in VB Editor's immediate window

As nodes 3 and 5 are solitaires and nodes 4 and 8 (though showing the same abbreviation as other previous nodes) have no direct follower, there are only four elements left:

ABCDEF1|GHIJKL2|ABCPQR6|ABCSTU7           4 elements found.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Alexandra Dudkina
Solution 2