Puppeteer document.querySelectorAll only returning undefined in a loop "TypeError: Cannot read properties of undefined (reading 'innerHTML')"
I am trying to programmatically access data in tables from the page https://rarity.tools/upcoming/ in JavaScript. Since the site loads its content through JavaScript, I've been using Puppeteer. The site has multiple tables (4 total) and I would like to be able to reference each table and check how many rows it has.
I originally tried to use nth-of-type, but it seems the site I'm trying to retrieve data from doesn't structure its page in a way that would allow me to use nth-of-type or nth-child (please see: Accessing nth table and counting rows with pupeteer in javascript error: "failed to find element matching selector "table:nth-of-type(2) > tr"").
Instead, I'm trying to create a for loop to set the innerHTML of each table to its own variable and then analyze the HTML string based on the index. The following returns the correct value if I hardcode the number:
let table_html = await page.evaluate(
  () => document.querySelectorAll('table')[2].innerHTML
)
console.log(table_html)
However, as soon as I set it to a loop:
for (let j = 0; j < numTables; j++) {
  let table_html = await page.evaluate(
    (j) => document.querySelectorAll('table')[j].innerHTML
  )
  console.log(table_html)
}
I receive the error:
Error: Evaluation failed: TypeError: Cannot read properties of undefined (reading 'innerHTML')
    at puppeteer_evaluation_script:1:46
    at ExecutionContext._evaluateInternal (C:\Users\kylel\Desktop\NFTSorter_IsolatedJS\node_modules\puppeteer\lib\cjs\puppeteer\common\ExecutionContext.js:221:19)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async ExecutionContext.evaluate (C:\Users\kylel\Desktop\NFTSorter_IsolatedJS\node_modules\puppeteer\lib\cjs\puppeteer\common\ExecutionContext.js:110:16)
    at async fetch (C:\Users\kylel\Desktop\NFTSorter_IsolatedJS\app.js:35:30)
All code:
const puppeteer = require('puppeteer');

let fetch = async () => {
  try {
    // Puppeteer initialization
    const browser = await puppeteer.launch({ headless: true, defaultViewport: null });
    const [page] = await browser.pages();
    await page.goto('https://rarity.tools/upcoming/');
    await page.waitForTimeout(2500) // Timeout so page can load actual content
    const numTables = await page.$$eval('table', el => el.length) - 1;
    for (let j = 0; j < numTables; j++) {
      let table_html = await page.evaluate(
        (j) => document.querySelectorAll('table')[j].innerHTML
      )
      console.log(table_html)
    }
  }
  catch (error) {
    console.log(error)
  }
}

fetch();
How do I fix this for loop to allow me to run the document.querySelectorAll('table') for each table?
Additionally, if anyone has any insights about ways I could achieve what I'm looking to do (programmatically access the data from these tables based on a variable amount of tables using puppeteer) it'd be much appreciated! And any recommendations of what tools to use to analyze HTML in string form if I end up utilizing the method described here?
Thank you very much!
Solution 1:[1]
This code exhibits a classic Puppeteer gotcha:
let table_html = await page.evaluate(
(j) => document.querySelectorAll('table')[j].innerHTML
)
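You need to pass j as an argument to evaluate, as described in How can I pass variable into an evaluate function?. Puppeteer serializes the callback and runs it in the browser page, so it cannot see variables from the surrounding Node.js scope. Because no argument is supplied, the callback's own j parameter is undefined when it runs, so document.querySelectorAll('table')[j] is undefined and reading innerHTML from it throws.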
let table_html = await page.evaluate(
  (j) => document.querySelectorAll('table')[j].innerHTML, j
  //                                                      ^
)
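To see why the extra argument matters, here is a minimal sketch that runs in plain Node without a browser. The helper fakeEvaluate and the global fakeTables are hypothetical stand-ins, not Puppeteer APIs; they only imitate the relevant behavior, namely that the callback is rebuilt from its source text in another scope and receives copies of whatever arguments were passed.

```javascript
// Stand-in for the page's tables; in the browser this would be the DOM.
globalThis.fakeTables = ["<tr>A</tr>", "<tr>B</tr>", "<tr>C</tr>"];

// Hypothetical helper (not a Puppeteer API) that imitates evaluate:
// the callback is turned into source text and rebuilt in another scope,
// so it keeps its parameters but loses its closure.
function fakeEvaluate(fn, ...args) {
  const rebuilt = new Function(`return (${fn.toString()})`)();
  // Arguments are copied by value (a JSON round trip in this sketch).
  return rebuilt(...JSON.parse(JSON.stringify(args)));
}

const j = 1;

// No argument supplied: the callback's own j parameter is undefined.
const withoutArg = fakeEvaluate((j) => globalThis.fakeTables[j]);

// Argument supplied after the callback: j arrives as 1.
const withArg = fakeEvaluate((j) => globalThis.fakeTables[j], j);

console.log(withoutArg); // undefined
console.log(withArg);    // <tr>B</tr>
```

In the real loop the first case goes one step further and reads .innerHTML from that undefined value, which is what produces the TypeError.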
That said, I suggest using $$eval and map instead of a counter for loop to avoid having to worry about the index. Also, the waitForTimeout seems like an unnecessary race condition. Using an event-driven approach with waitForSelector as the docs recommend seems faster and more reliable.
const puppeteer = require("puppeteer"); // ^13.5.1

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const url = "https://rarity.tools/upcoming/";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("table");
  const tableLengths = await page.$$eval("table", els =>
    els.map(el => el.querySelectorAll("tr").length)
  );
  console.log(tableLengths);
  const tableHTML = await page.$$eval("table", els =>
    els.map(el => el.innerHTML)
  );
  console.log(tableHTML.map(e => e.slice(0, 50)));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
[ 224, 100, 21, 5 ]
[
'<tr class=""><th colspan="4" class="text-xl text-c',
'<tr class=""><th colspan="4" class="text-xl text-c',
'<tr class=""><th colspan="4" class="text-xl text-c',
'<tr class="hidden"><th colspan="4" class="text-xl '
]
Regarding "And any recommendations of what tools to use to analyze HTML in string form if I end up utilizing the method described here?", I'm not sure what sort of analysis you're planning on doing on the HTML exactly, but generally speaking, don't deal with HTML in string form. Use the DOM, XPath and CSS selectors provided by Puppeteer or the browser console for 99% of use cases.
For example, if you want average prices per table, select the <td>s corresponding to the price column, loop over the values, then convert the cells to numbers and take the average. Doing the same thing with string manipulation or a regex is like using scissors to mow the lawn when there's a lawn mower sitting right there.
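The averaging step might look roughly like the sketch below. It assumes the price cells' text has already been collected per table (for instance with page.$$eval and a selector for the price column, which depends on the site's markup and is not shown here). The sample strings, the "ETH" suffix, and the helper names parsePrice and averagePrice are made up for illustration.

```javascript
// Hypothetical cell texts, one inner array per table, shaped like the
// result of mapping each table's price cells to their textContent.
const priceCellsPerTable = [
  ["0.5 ETH", "1.5 ETH", "TBA"],
  ["0.25 ETH"],
  [],
];

// Parse the leading number from a cell; return null when there is none.
function parsePrice(text) {
  const value = Number.parseFloat(text.trim());
  return Number.isNaN(value) ? null : value;
}

// Average the parseable prices; null when a table has none.
function averagePrice(cells) {
  const prices = cells.map(parsePrice).filter((p) => p !== null);
  if (prices.length === 0) return null;
  return prices.reduce((sum, p) => sum + p, 0) / prices.length;
}

const averages = priceCellsPerTable.map(averagePrice);
console.log(averages); // [ 1, 0.25, null ]
```

Cells without a leading number (such as "TBA" here) are skipped rather than counted as zero, so they do not drag the average down.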
Better yet, avoid scraping entirely if you can. There's an open endpoint https://collections.rarity.tools/upcoming2, so you can retrieve the data with wget or curl instantly, assuming it has what you're looking for.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
