Skip SAXParser error in parsing very big XML [duplicate]

I'm trying to use Java and SAXParser to get information from the WikiData dump file (120 GB, bzipped).

This is the code:

XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
parser.setContentHandler(this);
parser.setErrorHandler(this);
parser.setProperty("http://xml.org/sax/properties/lexical-handler", this);
// Decompress the bzip2 dump on the fly; try-with-resources closes both streams.
try (FileInputStream in = new FileInputStream(xin);
     BZip2CompressorInputStream bin = new BZip2CompressorInputStream(in)) {
    parser.parse(new InputSource(bin));
}

At some point, after more than 770,000 WikiData pages have been parsed correctly, I get this error:

[main] ERROR (AbstractWikipediaXmlDumpParser.java:119) - SAXParseException at Q843131 org.xml.sax.SAXParseException; lineNumber: 14861047; columnNumber: 1959; Invalid byte 2 of 4-byte UTF-8 sequence.

This is probably an error in the XML file, but I do not know how to solve it, since it's almost impossible to open a bzipped 120 GB file and fix a single character.

Is there a way to tell SAXParser to ignore errors? Since I have the title of the page that triggers the error (Q843131), I think the program could skip it, couldn't it?

I also searched for a solution on the web, but most of the answers suggest editing the file (impossible, since it is 120 GB bzipped) or using a checker (xmlstarlet, for example, considers the XML valid).
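
One direction that might be worth trying, sketched here under assumptions rather than as a confirmed fix: decode the bytes before they reach the parser, using a CharsetDecoder set to replace malformed input with U+FFFD instead of failing. The parser then receives characters through a Reader and does not run its own UTF-8 byte decoding. The class name LenientUtf8Parse, the helpers lenientUtf8Reader, parseText and sampleWithBadByte, and the tiny in-memory sample are all hypothetical and only illustrate the idea; the sketch uses the JDK's default SAX parser rather than the Xerces class named above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LenientUtf8Parse {

    // Hypothetical helper: wraps a byte stream in a Reader whose decoder
    // substitutes U+FFFD for malformed UTF-8 instead of raising an error.
    static Reader lenientUtf8Reader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new InputStreamReader(in, decoder);
    }

    // Hypothetical helper: parses the stream with SAX and returns all
    // character data that the content handler receives.
    static String parseText(InputStream in) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // Passing a Reader means the parser consumes already-decoded characters.
        reader.parse(new InputSource(lenientUtf8Reader(in)));
        return text.toString();
    }

    // Hypothetical sample: a small document containing 0xF0 (the lead byte of a
    // 4-byte sequence) followed by 0x28, which is not a valid continuation byte.
    static byte[] sampleWithBadByte() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = "<page>ok".getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</page>".getBytes(StandardCharsets.US_ASCII);
        out.write(head, 0, head.length);
        out.write(0xF0);
        out.write(0x28);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String text = parseText(new ByteArrayInputStream(sampleWithBadByte()));
        System.out.println(text);
    }
}
```

In the setting above, the BZip2CompressorInputStream would be passed to lenientUtf8Reader in place of the in-memory sample. This does not skip the affected page; it only substitutes a replacement character for the bytes that cannot be decoded, and the parse could still fail if the damaged bytes overlap markup.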



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
