Extract a string from a div class attribute with XPath
I have the HTML below:
<div class="sic_cell {symbol : 'GGRM.JK'}">
<a href="/fundamental/factsheet.html?counter=GGRM.JK">Gudang Garam Tbk.</a>
</div>
I would like to extract "GGRM.JK" from the HTML. The following XPath
//div[contains(@class, "symbol")]
returns the element, not the text "GGRM.JK".
Solution 1:[1]
Since it seems you are using Python, try the following:
import lxml.html as lh
data = """[your html above]"""
doc = lh.fromstring(data)
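# Version 1: read the class attribute, then take the text between the single quotes
target = doc.xpath('//div[contains(@class, "symbol")]/@class')[0]
print(target.split("'")[1])
# Version 2: read the link's href attribute, then take the text after "="
target2 = doc.xpath('//div[contains(@class, "symbol")]/a/@href')[0]
print(target2.split('=')[1])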
In either case, the output should be
GGRM.JK
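The split calls above depend on the exact position of the quotes and the "=" sign. As a slightly more defensive variation on the post-processing step, here is a minimal sketch that uses only the standard library and assumes the two attribute strings have already been pulled out with the XPath expressions above. The helper names symbol_from_class and symbol_from_href are hypothetical, not part of the original answer, and the regular expression assumes the class always has the shape symbol : '...'.

```python
import re
from urllib.parse import urlparse, parse_qs

# Attribute values as they appear in the question's HTML.
class_value = "sic_cell {symbol : 'GGRM.JK'}"
href_value = "/fundamental/factsheet.html?counter=GGRM.JK"


def symbol_from_class(value):
    """Return the quoted text after `symbol :` in a class string, or None."""
    # Assumption: the symbol is wrapped in single quotes after "symbol :".
    match = re.search(r"symbol\s*:\s*'([^']+)'", value)
    return match.group(1) if match else None


def symbol_from_href(value):
    """Return the `counter` query parameter of a URL, or None if absent."""
    params = parse_qs(urlparse(value).query)
    return params.get("counter", [None])[0]


print(symbol_from_class(class_value))
print(symbol_from_href(href_value))
```

Both helpers return None instead of raising IndexError when the expected pattern is missing, which may be easier to handle when looping over many rows.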
Solution 2:[2]
The shortest way to get the substring you want with XPath alone, without post-processing, is to use the functions substring-after and substring-before.
Here is an example of how to get 'GGRM.JK' from both the class and href attributes.
import lxml.html as lh
htmlText = """<div class="sic_cell {symbol : 'GGRM.JK'}">
<a href="/fundamental/factsheet.html?counter=GGRM.JK">Gudang Garam Tbk.</a>
</div>"""
htmlDom = lh.fromstring(htmlText)
fromHref = htmlDom.xpath('substring-after(//div/a/@href, "=")')
print(fromHref)
fromClass = htmlDom.xpath('substring-before(substring-after(//div/@class, ": \'"), "\'")')
print(fromClass)
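On a larger page, //div in a string function is converted using only the first div in document order, so it may help to keep the contains(@class, "symbol") predicate from the question inside the expression. The following is a minimal sketch under that assumption. The variable names are illustrative, and the //span lookup is a deliberately non-matching path used only to show what comes back when nothing is found.

```python
import lxml.html as lh

html_text = """<div class="sic_cell {symbol : 'GGRM.JK'}">
<a href="/fundamental/factsheet.html?counter=GGRM.JK">Gudang Garam Tbk.</a>
</div>"""

dom = lh.fromstring(html_text)

# Restrict the match to the div whose class mentions "symbol",
# then take everything after "=" in the nested link's href.
from_href = dom.xpath(
    'substring-after(//div[contains(@class, "symbol")]/a/@href, "=")'
)
print(from_href)

# XPath string functions return a string, not a list. When no node
# matches, the node-set converts to "" and so does the result.
missing = dom.xpath('substring-after(//span/@href, "=")')
print(repr(missing))
```

Because the result is a plain string, there is no [0] indexing to guard, but an empty string should be treated as "not found".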
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jack Fleeting |
| Solution 2 | |
