How can I get the textContent of an element from the command line using Node?

I have lots of data which represent character-offset spans, using the W3C annotation spec in JSON-LD. An example (taken from Recogito.js) looks like this:

"selector": [
  {
    "type": "TextQuoteSelector",
    "exact": "death in battle"
  },
  {
    "type": "TextPositionSelector",
    "start": 646,
    "end": 661
  }
]

Those offsets (646, 661) refer to a location in this HTML:

<div id="content" class="plaintext">
  <h1>Homer: The Odyssey</h1>
  <p>
    <strong>Tell me, O muse,</strong> of that ingenious hero who travelled far and wide after he had sacked
    the famous town of Troy. Many cities did he visit, and many were the nations with whose manners and customs
    he was acquainted; moreover he suffered much by sea while trying to save his own life and bring his men safely
    home; but do what he might he could not save his men, for they perished through their own sheer folly in eating
    the cattle of the Sun-god Hyperion; so the god prevented them from ever reaching home. Tell me, too, about all
    these things, O daughter of Jove, from whatsoever source you may know them.
  </p>
  <p>
    <strong>So now all who escaped death in battle</strong> or by shipwreck had got safely home except Ulysses,
    and he, though he was longing to return to his wife and country, was detained by the goddess Calypso, who
    had got him into a large cave and wanted to marry him. But as years went by, there came a time when the gods
    settled that he should go back to Ithaca; even then, however, when he was among his own people, his troubles
    were not yet over; nevertheless all the gods had now begun to pity him except Neptune, who still persecuted
    him without ceasing and would not let him get home.
  </p>
</div>

According to the W3C discussion on annotation, these offsets index into textContent, so I can extract the spans programmatically. In the browser console, I first select the textContent of the #content div, like this:

document.getElementById('content').textContent

Which outputs this:

"Homer: The OdysseyTell me, O muse, of that ingenious hero who travelled far and wide after he had sacked the famous town of Troy. Many cities did he visit, and many were the nations with whose manners and customs he was acquainted; moreover he suffered much by sea while trying to save his own life and bring his men safely home; but do what he might he could not save his men, for they perished through their own sheer folly in eating the cattle of the Sun-god Hyperion; so the god prevented them from ever reaching home. Tell me, too, about all these things, O daughter of Jove, from whatsoever source you may know them.So now all who escaped death in battle or by shipwreck had got safely home except Ulysses, and he, though he was longing to return to his wife and country, was detained by the goddess Calypso, who had got him into a large cave and wanted to marry him. But as years went by, there came a time when the gods settled that he should go back to Ithaca; even then, however, when he was among his own people, his troubles were not yet over; nevertheless all the gods had now begun to pity him except Neptune, who still persecuted him without ceasing and would not let him get home.isRelatedTo"

I can then take a substring of that, given two character indices:

document.getElementById('content').textContent.substring(646, 661)

This gives "death in battle", i.e. the same string as the "exact" value in the selector above.
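
To make that check concrete, here is a minimal sketch of applying a TextPositionSelector to a plain string and comparing the result with the TextQuoteSelector from the same annotation. The helper name `resolveSelectors` and the short sample string are made up for illustration; they are not part of Recogito.js or the W3C spec, and the sample text is a stand-in rather than the real page.

```javascript
// Hypothetical helper: slice `text` using the TextPositionSelector and
// cross-check the slice against the TextQuoteSelector, if one is present.
function resolveSelectors(text, selectors) {
  const position = selectors.find((s) => s.type === 'TextPositionSelector');
  const quote = selectors.find((s) => s.type === 'TextQuoteSelector');
  if (!position) {
    throw new Error('No TextPositionSelector found');
  }
  const found = text.substring(position.start, position.end);
  return {
    found,
    expected: quote ? quote.exact : null,
    matches: quote ? found === quote.exact : null,
  };
}

// Stand-in text (assumption): 'death in battle' starts at offset 4 here.
const sampleText = 'All death in battle or by shipwreck';
const sampleSelectors = [
  { type: 'TextQuoteSelector', exact: 'death in battle' },
  { type: 'TextPositionSelector', start: 4, end: 19 },
];

const result = resolveSelectors(sampleText, sampleSelectors);
console.log(result);
```

If `matches` comes back false for real data, the text being sliced is probably not the same string the offsets were recorded against.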

I'm trying to recreate this in Node, so that I can have a short script that reads the HTML, grabs the textContent, and prints it. However, I can't get it to behave as it does in the browser.

I've tried this:

const fs = require('fs')
const cheerio = require('cheerio')

// Read the HTML file and extract the text of the #content element
const file = fs.readFileSync('index.html', 'utf8')
const $ = cheerio.load(file)
const text = $('#content').text()
console.debug(text.substring(646, 661))

But the text keeps all the indentation from the HTML, and the substring I get is "hter of Jove, f" rather than "death in battle".

So I tried again like this:

const fs = require('fs')
const cheerio = require('cheerio')

// Same as before, but ask the parser to collapse runs of whitespace
const file = fs.readFileSync('index.html', 'utf8')
const $ = cheerio.load(file, { xml: { normalizeWhitespace: true } })
const text = $('#content').text()
console.debug(text)
console.debug(text.substring(646, 661))

In other words, I used the normalizeWhitespace option from the cheerio documentation. This returns something closer, "aped death in b", but it's still off. I thought all the offsets might be off by 5, so I tried adding five to everything:

console.log('This should be "Troy"')
console.log(text.substring(124+5,128+5))
console.log('This should be "death in battle"')
console.log(text.substring(646+5,661+5))
console.log('This should be "Ithaca"')
console.log(text.substring(963+5,970+5))

But while that fixes "death in battle", it breaks the others:

This should be "Troy"
oy. 
This should be "death in battle"
death in battle
This should be "Ithaca"
Ithaca;

So I have a feeling something else is going on with the whitespace parsing here.
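
One way to check that suspicion is to measure, for each annotation, how far the quoted text actually sits from its recorded start offset. The sketch below does this on made-up stand-in strings (not the real page). The function name `measureDrift` is hypothetical, and it assumes each quoted string occurs only once, since `indexOf` returns the first match.

```javascript
// Hypothetical diagnostic: locate each quoted string in `text` and report
// the difference between where it was found and the recorded start offset.
function measureDrift(text, annotations) {
  return annotations.map(({ exact, start }) => {
    const actual = text.indexOf(exact); // first occurrence only
    return { exact, start, actual, drift: actual === -1 ? null : actual - start };
  });
}

// Stand-in data (assumption): offsets were recorded against `compact`,
// but the script reads `spaced`, which has an extra space at each boundary.
const compact = 'TitleTroy fell.Many escaped death.';
const spaced = ' Title Troy fell. Many escaped death. ';

const annotations = [
  { exact: 'Troy', start: compact.indexOf('Troy') },
  { exact: 'death', start: compact.indexOf('death') },
];

const drifts = measureDrift(spaced, annotations);
console.log(drifts);
```

In this toy case the drift grows with each extra whitespace boundary that precedes the quote, which is consistent with a single constant correction fixing one span and not another.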

My question is: should I be using something other than cheerio to do this? Or passing it another option? Ultimately I just want to be able to output the text as it will be seen by document.getElementById('content').textContent.substring(646, 661).

I know that Node and the browser behave differently, but is there something I could use that has the same behavior as textContent?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
