'Getting all the text content from a HTML string in NodeJS

I need to get only the text content from a HTML String with a space or a line break separating the text content of different elements.

For example, the HTML String might be:

<ul>
  <li>First</li>
  <li>Second</li>
</ul>

What I want:

First Second

or

First
Second

I've tried to get the text content by first wrapping the entire string inside a div and then getting the textContent using third party libraries. But, there is no spacing or line breaks between text content of different elements which I specifically require (i.e. I get FirstSecond which is not what I want).

The only solution I am thinking of right now is to make a DOM Tree and then apply recursion to get the nodes that contain text, and then append the text of that element to a string with spaces. Are there any cleaner, neater, and simpler solution than this?



Solution 1:[1]

Okay you can try this example, This may help you

I used JSDom module

https://www.npmjs.com/package/jsdom

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent); 

BTW Helped me enter image description here

This code can help I think :)

Solution 2:[2]

Using the DOM, you could use document.Node.textContent. However, NodeJs doesn't have textContent (since it doesn't have native access to the DOM), therefore you should use external packages. You could install request and cheerio, using npm. cheerio, suggested by Jon Church, is maybe the easiest web scraping tool to use (there are also complexer ones like jsdom) With power of cheerio and request in your hands, you could write

const request = require("request");
const cheerio = require("cheerio");
const fs = require("fs");

//taken from https://stackoverflow.com/a/19709846/10713877
function is_absolute(url)
{
    var r = new RegExp('^(?:[a-z]+:)?//', 'i');
    return r.test(url);
}

function is_local(url)
{
    var r = new RegExp('^(?:file:)?//', 'i');
    return (r.test(url) || !is_absolute(url));
}

function send_request(URL)
    {
        if(is_local(URL))
        {
            if(URL.slice(0,7)==="file://")
                url_tmp = URL.slice(7,URL.length);
            else
                url_tmp = URL;

           //taken from https://stackoverflow.com/a/20665078/10713877
           const $ = cheerio.load(fs.readFileSync(url_tmp));
           //Do something
           console.log($.text())
        }
        else
        {
            var options = {
                url: URL,
                headers: {
                  'User-Agent': 'Your-User-Agent'
                }
              };

            request(options, function(error, response, html) {
                //no error
                if(!error && response.statusCode == 200)
                {
                    console.log("Success");

                    const $ = cheerio.load(html);


                    return Promise.resolve().then(()=> {
                        //Do something
                        console.log($.text())
                    });
                }
                else
                {
                    console.log(`Failure: ${error}`);
                }
            });
        }
    }

Let me explain the code. You pass a URL to send_request function. It checks whether the URL string is a path to your local file, (a relative path, or a path starting with file://). If it is a local file, it proceeds to use cheerio module, otherwise, it has to send a request, to the website, using the request module, then use cheerio module. Regular Expressions are used in is_absolute and is_local. You get the text using text() method provided by cheerio. Under the comments //Do something, you could do whatever you want with the text. There are websites that let you know 'Your-User-Agent', copy-paste your user agent to that field.

Below lines will work

//your local file
send_request("/absolute/path/to/your/local/index.html"); 
send_request("/relative/path/to/your/local/index.html"); 
send_request("file:///absolute/path/to/your/local/index.html"); 
//website
send_request("https://stackoverflow.com/"); 

EDIT: I am on a linux system.

Solution 3:[3]

Convert html to text:

In your terminal: npm install html-to-text

index.js:

const { convert } = require('html-to-text')
var html = `
<ul>
  <li>First</li>
  <li>Second</li>
</ul>
`;
var text = convert(html, { wordwrap: 130 })
// Out:
// First
// Second

Solution 4:[4]

You can try get rid of html tags using regex, for the yours example try the following:

let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`

console.log(str)

let regex = '<\/?!?(li|ul)[^>]*>'

var re = new RegExp(regex, 'g');

str = str.replace(re, '');
console.log(str)

Solution 5:[5]

You can try using npm library htmlparser2. Its will be very simple using this

const htmlparser2 = require('htmlparser2');

const htmlString = ''; //your html string goes here
const parser = new htmlparser2.Parser({
    ontext(text) {
      if (text && text.trim().length > 0) {
        //do as you need, you can concatenate or collect as string array
      }
    }
  });

parser.write(htmlString);
parser.end();

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Ramy Hadid
Solution 4 elvira.genkel
Solution 5 vikash vik