Python lxml xpath is selecting more content than expected
I'm trying to save the results of a search on GeneCards. A typical results page looks like this: https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2
In the HTML file, the information I need is within this section:
<table class="table table-striped table-condensed" id="searchResults">
<thead>
<tr>
<th></th>
<th></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Symbol&sortDir=Ascending"
target="_self">Symbol</a>
</th>
<th>Description</th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Category&sortDir=Ascending"
target="_self">Category</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GeneCard#tocEl-2" target="_blank" title="Read more about gene categories"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Gifts&sortDir=Ascending"
target="_self">GIFtS</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GeneCard#GIFtS" target="_blank"
title="Read more about GeneCards Inferred Functionality Scores (GIFtS)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Gcid&sortDir=Ascending"
target="_self">GC id</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GCids" target="_blank" title="Read more about GeneCards identifiers (GC ids)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Score&sortDir=Ascending"
target="_self">Score</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/Search#relevance" target="_blank" title="Read more about search scores"></a></th>
</tr>
</thead>
<tbody>
<tr>
<td class="index-col">1</td>
<td class="gc-expand-collapse expand-collapse-col"><a href="#"></a></td>
<td class="gc-gene-symbol gc-highlight symbol-col">
<a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1&keywords=NONHSAT072848.2" target="_blank"
data-track-event="Result Clicked" data-ga-label="IL1R1-AS1">IL1R1-AS1</a>
</td>
<td class="gc-highlight description-col">IL1R1 Antisense RNA 1</td>
<td class="category-col">RNA Gene</td>
<td class="gifts-col">9</td>
<td class="gc-highlight gcid-col">GC02M102174</td>
<td class="score-col">1.29</td>
</tr>
</tbody>
</table>
Here is my code:
import lxml.html
import requests
NONCODE_IDs = [
"NONHSAT072848.2",
"NONHSAT182278.1",
"NONHSAG077582.1",
"NONHSAG028748.2",
"NONHSAT151221.1",
"NONHSAT151222.1",
"NONHSAG000557.2"
]
# query link example: https://www.genecards.org/Search/Keyword?queryString=MAPK
my_header = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
}
link_base = "https://www.genecards.org/Search/Keyword?queryString="
query_link = link_base + NONCODE_IDs[0]
response = requests.get(query_link, headers=my_header)
html = lxml.html.fromstring(response.content)
table = html.xpath('//table[@id="searchResults"]')[0]
However,
table = html.xpath('//table[@id="searchResults"]')[0]
is selecting more content than expected.
etree.tostring(table) returns everything from the desired line <table class="table table-striped table-condensed" id="searchResults"> to the end of the HTML file, not just the table.
I'm not sure what I did wrong.
Solution 1:[1]
For this particular web page, BeautifulSoup works for me. However, I'm still looking for a general fix using lxml, because I prefer XPath, which BeautifulSoup does not support.
Here is the BeautifulSoup code that extracts the table correctly:
from bs4 import BeautifulSoup
import requests
query_link = "https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"}
response = requests.get(query_link, headers=headers)
html = BeautifulSoup(response.content, "html.parser")
table = html.find_all("table", {"class": "table table-striped table-condensed", "id": "searchResults"})
print(table)
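Although BeautifulSoup has no XPath support, it does accept CSS selectors through select(), which covers many of the same cases. The following is only a minimal sketch: SAMPLE_HTML is a trimmed-down copy of the table from the question (an assumption, since the live page may differ), and extract_symbols is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed-down copy of the table from the question,
# used here instead of fetching the live page.
SAMPLE_HTML = """
<html><body>
<table class="table table-striped table-condensed" id="searchResults">
<thead><tr><th></th><th></th><th>Symbol</th><th>Description</th></tr></thead>
<tbody>
<tr>
<td class="index-col">1</td>
<td class="gc-expand-collapse expand-collapse-col"><a href="#"></a></td>
<td class="gc-gene-symbol gc-highlight symbol-col">
<a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1&amp;keywords=NONHSAT072848.2">IL1R1-AS1</a>
</td>
<td class="gc-highlight description-col">IL1R1 Antisense RNA 1</td>
</tr>
</tbody>
</table>
<p>content after the table</p>
</body></html>
"""


def extract_symbols(html_text):
    """Return the text of every symbol cell in the searchResults table."""
    soup = BeautifulSoup(html_text, "html.parser")
    # "td.symbol-col" matches any <td> whose class list includes symbol-col.
    cells = soup.select("table#searchResults tbody tr td.symbol-col")
    return [cell.get_text(strip=True) for cell in cells]


symbols = extract_symbols(SAMPLE_HTML)
print(symbols)
```

The selector is scoped to the table's id, so content that follows the table in the document is not matched.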
Solution 2:[2]
I'm still not entirely sure why this happens, but it seems that lxml (unlike BeautifulSoup) handles this table differently. Indexing into the table element gives its child elements: table[0] is the <thead> and table[1] is the <tbody>. So to extract them both, try:
table = html.xpath('//table[@id="searchResults"]')[0]
print(lxml.html.tostring(table[0]).decode())
print(lxml.html.tostring(table[1]).decode())
Taken together, the two outputs should match the table shown in your question.
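If only the cell values are needed, one option is to query the rows with XPath relative to the table instead of serializing the whole table element. The following is a minimal sketch under stated assumptions: SAMPLE_HTML is a trimmed-down copy of the table from the question, COLUMN_CLASSES and extract_rows are hypothetical names, and the code has not been run against the live page, whose markup may differ.

```python
import lxml.html

# Assumption: a trimmed-down copy of the table from the question,
# used here instead of fetching the live page.
SAMPLE_HTML = """
<html><body>
<table class="table table-striped table-condensed" id="searchResults">
<thead><tr><th></th><th></th><th>Symbol</th><th>Description</th>
<th>Category</th><th>GIFtS</th><th>GC id</th><th>Score</th></tr></thead>
<tbody>
<tr>
<td class="index-col">1</td>
<td class="gc-expand-collapse expand-collapse-col"><a href="#"></a></td>
<td class="gc-gene-symbol gc-highlight symbol-col">
<a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1&amp;keywords=NONHSAT072848.2">IL1R1-AS1</a>
</td>
<td class="gc-highlight description-col">IL1R1 Antisense RNA 1</td>
<td class="category-col">RNA Gene</td>
<td class="gifts-col">9</td>
<td class="gc-highlight gcid-col">GC02M102174</td>
<td class="score-col">1.29</td>
</tr>
</tbody>
</table>
<p>content after the table</p>
</body></html>
"""

# Maps output keys to the class token on each <td> in the question's HTML.
COLUMN_CLASSES = {
    "symbol": "symbol-col",
    "description": "description-col",
    "category": "category-col",
    "gifts": "gifts-col",
    "gcid": "gcid-col",
    "score": "score-col",
}


def extract_rows(doc):
    """Return one dict of cell texts per <tr> in the searchResults body."""
    rows = []
    for tr in doc.xpath('//table[@id="searchResults"]/tbody/tr'):
        row = {}
        for key, css_class in COLUMN_CLASSES.items():
            # contains() because each <td> carries several class tokens;
            # normalize-space() trims the whitespace around the cell text.
            value = tr.xpath(
                "normalize-space(td[contains(@class, $cls)])", cls=css_class
            )
            row[key] = str(value)
        rows.append(row)
    return rows


doc = lxml.html.fromstring(SAMPLE_HTML)
rows = extract_rows(doc)
print(rows)
```

Because each query starts from a row inside the table's <tbody>, only cells in those rows are returned, whatever else tostring() would include when serializing the table element.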
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Zheng |
| Solution 2 | Jack Fleeting |
