Extract a string from a div's class attribute with XPath

I have the below HTML:

<div class="sic_cell {symbol : 'GGRM.JK'}">
    <a href="/fundamental/factsheet.html?counter=GGRM.JK">Gudang Garam Tbk.</a>
</div>

I would like to extract "GGRM.JK" from the HTML. I tried this XPath:

//div[contains(@class, "symbol")]

but it returns the element, not the text "GGRM.JK".



Solution 1:[1]

Since it seems you are using Python, try the following:

import lxml.html as lh
data = """[your html above]"""
doc = lh.fromstring(data)

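# Version 1: read the class attribute and take the text between the single quotes
target = doc.xpath('//div[contains(@class, "symbol")]/@class')[0]
print(target.split("'")[1])

# Version 2: read the link's href and take the text after "="
target2 = doc.xpath('//div[contains(@class, "symbol")]/a/@href')[0]
print(target2.split('=')[1])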

In either case, the output should be

GGRM.JK
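
Splitting on "'" or "=" works for this exact markup, but it would pick the wrong piece if the href ever carried more than one query parameter or the class text changed shape. As a hedged sketch of a slightly more defensive post-processing step, using only the standard library: the helper names symbol_from_href and symbol_from_class are made up for illustration, and the second URL is an invented example, not taken from the page in question.

```python
import re
from urllib.parse import urlparse, parse_qs


def symbol_from_href(href):
    """Return the value of the 'counter' query parameter, or None if absent."""
    query = urlparse(href).query
    values = parse_qs(query).get("counter")
    return values[0] if values else None


def symbol_from_class(class_attr):
    """Return the quoted value after 'symbol :' in the class text, or None."""
    match = re.search(r"symbol\s*:\s*'([^']*)'", class_attr)
    return match.group(1) if match else None


# Attribute values as they appear in the question's HTML
print(symbol_from_href("/fundamental/factsheet.html?counter=GGRM.JK"))
print(symbol_from_class("sic_cell {symbol : 'GGRM.JK'}"))

# Hypothetical href with an extra parameter, where split('=')[1] would fail
print(symbol_from_href("/fundamental/factsheet.html?lang=en&counter=GGRM.JK"))
```

The strings passed in here would be the values returned by the two doc.xpath(...) calls above.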

Solution 2:[2]

The shortest way to get the substring you want with XPath alone, without post-processing, is to use the functions substring-after and substring-before.

Here is an example of how to get 'GGRM.JK' from both the class and href attributes.

import lxml.html as lh

htmlText = """<div class="sic_cell {symbol : 'GGRM.JK'}">
    <a href="/fundamental/factsheet.html?counter=GGRM.JK">Gudang Garam Tbk.</a>
</div>"""

htmlDom = lh.fromstring(htmlText)

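# Text after the first "=" in the href; a string function applied to a
# node-set uses only the first matching node
fromHref = htmlDom.xpath('substring-after(//div/a/@href, "=")')
print(fromHref)  # GGRM.JK

# Text between ": '" and the next "'" in the class attribute
fromClass = htmlDom.xpath('substring-before(substring-after(//div/@class, ": \'"), "\'")')
print(fromClass)  # GGRM.JK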
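
To make it clearer what the nested XPath call computes, here is a rough plain-Python mirror of the XPath 1.0 semantics of the two functions. This is only an illustration: substring_after and substring_before are hypothetical helper names, not part of lxml, and the sketch assumes a non-empty marker.

```python
def substring_after(text, marker):
    # XPath 1.0: the part of text after the first occurrence of marker,
    # or the empty string if marker does not occur (marker assumed non-empty)
    _, found, tail = text.partition(marker)
    return tail if found else ""


def substring_before(text, marker):
    # XPath 1.0: the part of text before the first occurrence of marker,
    # or the empty string if marker does not occur (marker assumed non-empty)
    head, found, _ = text.partition(marker)
    return head if found else ""


class_attr = "sic_cell {symbol : 'GGRM.JK'}"

# Inner call keeps everything after ": '"
after = substring_after(class_attr, ": '")
print(after)

# Outer call cuts that result at the next single quote
print(substring_before(after, "'"))
```

One consequence of these semantics is that a missing marker yields an empty string rather than an error, so an empty result from the XPath expression usually means the attribute did not have the expected shape.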

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jack Fleeting
Solution 2