(while reading XRef): Error: Invalid XRef stream header

Hi, I am trying to read a PDF in Node.js. When I try to read this particular PDF, I get the following error:

(while reading XRef): Error: Invalid XRef stream header
Error: Error: Invalid XRef stream header
    at error (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:195:9)
    at XRef_readXRef [as readXRef] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:5692:9)
    at XRef_parse [as parse] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:5280:28)
    at PDFDocument_setup [as setup] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:4622:17)
    at PDFDocument_parse [as parse] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:4506:12)
    at LocalPdfManager_ensure [as ensure] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:32515:24)
    at LocalPdfManager.BasePdfManager_ensureModel [as ensureModel] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:32451:19)
    at Object.eval [as onResolve] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:27142:22)
    at Object.runHandlers (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:864:35)
    at listOnTimeout (internal/timers.js:557:17)
Error: Invalid XRef stream header
error: { parserError: 'Error: Invalid XRef stream header' }

Here is my code:

import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("./GeM-Bidding-3342395.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});

However, when I parse the same PDF with online parsers, it is parsed successfully. If this approach cannot work, please suggest how else I can extract the data, using an API or something similar.



Solution 1:[1]

From the console on any OS (Linux, Mac, Windows), the easiest way to parse a PDF is the pdftotext utility, which is available from either Xpdf or Poppler (the Windows binaries are generally 64-bit).

To export, say, the first two pages to the console, use:

pdftotext -nopgbrk -f 1 -l 2 GeM-Bidding-3342395.pdf -

To save the text to a file, use a filename in place of the trailing -, or pipe the output to another command.
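Since the question is about Node.js, here is a minimal sketch of calling pdftotext from a script. It assumes pdftotext is installed and on the PATH. The names buildPdftotextArgs and extractText, and the option names firstPage, lastPage, and layout, are made up for illustration and are not part of any library.

```javascript
// CommonJS sketch; in an ES module project, import execFile from "node:child_process" instead.
const { execFile } = require("node:child_process");

// Build the pdftotext argument list.
// The option names (firstPage, lastPage, layout) are hypothetical, chosen for this example.
function buildPdftotextArgs(pdfPath, { firstPage, lastPage, layout = false } = {}) {
  const args = ["-nopgbrk"]; // no page-break characters between pages
  if (firstPage !== undefined) args.push("-f", String(firstPage));
  if (lastPage !== undefined) args.push("-l", String(lastPage));
  if (layout) args.push("-layout");
  // A trailing "-" makes pdftotext write the text to stdout.
  args.push(pdfPath, "-");
  return args;
}

// Run pdftotext and resolve with the extracted text.
// Rejects if pdftotext is missing or exits with an error.
function extractText(pdfPath, options) {
  return new Promise((resolve, reject) => {
    execFile(
      "pdftotext",
      buildPdftotextArgs(pdfPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // raise the stdout limit for large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}

// Example usage (not run here):
// extractText("./GeM-Bidding-3342395.pdf", { firstPage: 1, lastPage: 2 })
//   .then((text) => console.log(text))
//   .catch((err) => console.error("error:", err));
```

execFile is used rather than exec so the file name is passed as an argument and never interpreted by a shell.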

The order of the output can vary depending on the options, so the pdftotext command above, unmodified, produces something like this:

[screenshot of the output]

However, if I add -layout in the Poppler version, it looks more like this:

[screenshot of the output]

The Xpdf version has other options as well, such as -table, -simple, and -simple2, so pick the one best suited to your needs.

[screenshot of the output]
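Once the text is extracted with a layout-preserving option, it can be post-processed in Node.js. The sketch below splits each line into cells. It assumes columns are separated by runs of two or more spaces, which may not hold for every document. splitLayoutRows is a hypothetical helper name, and the sample input is made up.

```javascript
// Split layout-preserved text into rows of cells.
// Assumption: two or more consecutive whitespace characters mark a column boundary.
function splitLayoutRows(text) {
  return text
    .split(/\r?\n/)                       // one entry per line
    .map((line) => line.trim())           // drop leading/trailing padding
    .filter((line) => line.length > 0)    // skip blank lines
    .map((line) => line.split(/\s{2,}/)); // cells within a line
}

// Example with made-up input:
// splitLayoutRows("Name   Value\n\n  Item  Qty   Price \n")
// gives [["Name", "Value"], ["Item", "Qty", "Price"]]
```

Single spaces inside a cell are kept, so a label such as "Bid Number" stays in one cell.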

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1