How can I preserve line breaks in an HTML table cell when scraping with gocolly?

I'm trying to preserve the formatting in table cells when I extract the contents of a <td> cell.

If there are two lines of text (e.g., an address) in the <td>, the markup may look like: <td>address line1<br>address line2</td>

When colly extracts this, I get the following: address line1address line2

with no spacing or line breaks, since all the HTML has been stripped from the text.

How can I work around or fix this so that I receive readable text from the <td>?



Solution 1:[1]

As far as I know, gocolly does not support such formatting, but you can do something like the following, using the OutputHTML function of the htmlquery package (which gocolly uses internally):

const htmlPage = `
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Your page title here</title>
  </head>
  <body>
    <p>
    AddressLine 1 
    <br>
    AddresLine 2
    </p>
  </body>
</html>
`

doc, _ := htmlquery.Parse(strings.NewReader(htmlPage))
xmlNode := htmlquery.FindOne(doc, "//p")
result := htmlquery.OutputHTML(xmlNode, false)

The result variable now contains:

 AddressLine 1
   <br/>
 AddresLine 2

You can now split result on the <br/> tag to get the individual lines.

I am also new to Go, though, so there may be a better way to do it.

Solution 2:[2]

gocolly uses goquery under the hood. Through the element's DOM field you can call all Selection methods, including Html().

func (*Selection) Html

func (s *Selection) Html() (ret string, e error)

Html gets the HTML contents of the first element in the set of matched elements. It includes text and comment nodes.

This is how you can get the HTML content:

c.OnHTML("tr", func(e *colly.HTMLElement) {
    // Get the inner HTML of the first <td> in the row
    h, _ := e.DOM.Find("td").Html()
    fmt.Printf("=> %s \n", h)

    // ...or loop through every <td> in the row
    e.DOM.Find("td").Each(func(_ int, s *goquery.Selection) {
        h, _ := s.Html()
        fmt.Printf("=> %s \n", h)
    })
})

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sinan Ulker
Solution 2 flydev