Scrapy - getting HTML without outer tag
I'm scraping a page, using Scrapy. I want the HTML contents of the TD with "text" class:
<tr valign="top">
<td class="text" width="100%">
<b>A bunch of HTML</b>
<ul type="disc">
<li>Some random text</li>
</ul>
</td>
</tr>
This is my Scrapy line:
for body in response.css('td.text'):
    yield {'body': body.extract()}
Which works - except it includes the surrounding td:
[
{"body": "<td class=\"text\" width=\"100%\"> <b>A bunch of HTML</b> <ul type=\"disc\"> <li>Some random text</li> </ul> </td>"}
]
This is what I want:
[
{"body": "<b>A bunch of HTML</b> <ul type=\"disc\"> <li>Some random text</li> </ul>"}
]
Halp? :)
Solution 1:[1]
Try this selector:
response.css('td.text *')
The * matches every element nested inside the td, at any depth, and returns each one as a separate result.
Solution 2:[2]
Well, I found a solution, although I still think there must be a smarter way:
bodies = ''
for body in response.xpath("//td[@class='text']/child::node()"):
    bodies += body.extract()
yield {'body': bodies}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | DharmanBot |
Solution 2 | Benjamin Rasmussen |