How can I preserve line breaks in an HTML table cell when scraping with gocolly?
I'm trying to preserve the formatting in table cells when I extract the contents of a <td> cell.
What happens is that if there are two lines of text (e.g., an address) in the <td>, the code may look like:
<td> address line1<br> address line2</td>
When colly extracts this, I get the following: address line1address line2
with no spacing or line breaks since all the html has been stripped from the text.
How can I work around or fix this so that I receive readable text from the <td>?
Solution 1:[1]
As far as I know, gocolly does not support this kind of formatting, but you can do something like the following using the OutputHTML function of the htmlquery package (which gocolly uses internally):
const htmlPage = `
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Your page title here</title>
</head>
<body>
<p>
AddressLine 1
<br>
AddresLine 2
</p>
</body>
</html>
`
doc, _ := htmlquery.Parse(strings.NewReader(htmlPage))
xmlNode := htmlquery.FindOne(doc, "//p")
result := htmlquery.OutputHTML(xmlNode, false)
The result variable now contains:
AddressLine 1
<br/>
AddresLine 2
You can now split result on the <br/> tag to achieve what you want.
I am also new to Go, so there may be a better way to do it.
Solution 2:[2]
gocolly uses goquery under the hood. Through e.DOM you can call any Selection method, including Html().
func (*Selection) Html
func (s *Selection) Html() (ret string, e error)
Html gets the HTML contents of the first element in the set of matched elements. It includes text and comment nodes.
This is how you can get the HTML content:
c.OnHTML("tr", func(e *colly.HTMLElement) {
	// Get the inner HTML of the first <td> in the row
	h, _ := e.DOM.Find("td").Html()
	fmt.Printf("=> %s \n", h)
	// ...or loop through every <td> in the row
	e.DOM.Find("td").Each(func(_ int, s *goquery.Selection) {
		h, _ := s.Html()
		fmt.Printf("=> %s \n", h)
	})
})
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sinan Ulker |
| Solution 2 | flydev |
