PDFBox digit garble

I ran into a problem when using PDFBox to extract text. My PDF contains embedded Type 3 fonts, and the digits drawn with them are not extracted correctly. Can someone give me some guidance? Thank you.

My PDFBox version is 2.0.22.

The correct output is [USD-001]; the wrong output is [USD- ].

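import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static String readPDF(File file) throws IOException {
    RandomAccessBufferedFileInputStream rbi = null;
    PDDocument pdDocument = null;
    String text = "";
    try {
        rbi = new RandomAccessBufferedFileInputStream(file);
        PDFParser parser = new PDFParser(rbi);
        parser.setLenient(false); // strict parsing: fail on malformed files
        parser.parse();
        pdDocument = parser.getPDDocument();
        PDFTextStripper textStripper = new PDFTextStripper();
        text = textStripper.getText(pdDocument);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Guard against a failed open, and release the document as well as the stream.
        if (pdDocument != null) {
            pdDocument.close();
        }
        if (rbi != null) {
            rbi.close();
        }
    }
    return text;
}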

I also tried rendering the PDF to an image with PDFBox, and everything looked fine. I just want to get the content as normal text.

[Screenshot: PDFDebugger output]

The PDF file: http://tmp.link/f/6249a07f6e47f



Solution 1:[1]

Several aspects of this file make text extraction difficult.

First of all, the font itself obstructs text extraction. In its ToUnicode stream we find the mappings:

1 begincodespacerange
<00> <ff> endcodespacerange
2 beginbfchar
<22> <0000> <23> <0000> endbfchar

That is, the two character codes of interest are both mapped to U+0000, not to U+0030 ('0') and U+0031 ('1') as they should have been.
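To make the mapping concrete, here is a minimal sketch in plain Java that reads the two bfchar pairs quoted above and compares them with what a cooperative font would presumably have declared. The class `BfCharSketch` and its methods are hypothetical helpers written for this illustration; PDFBox's real CMap parser works differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; PDFBox's real CMap parser works differently.
final class BfCharSketch {
    // One <code> <unicode> pair inside a beginbfchar ... endbfchar section.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Maps each character code to the UTF-16BE text its bfchar entry declares. */
    static Map<Integer, String> parseBfChar(String body) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(body);
        while (m.find()) {
            int code = Integer.parseInt(m.group(1), 16);
            mapping.put(code, decodeUtf16Be(m.group(2)));
        }
        return mapping;
    }

    /** Decodes a hex string such as "0030" as UTF-16BE. */
    static String decodeUtf16Be(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Mapping copied from the file's ToUnicode stream.
        Map<Integer, String> actual = parseBfChar("<22> <0000> <23> <0000>");
        // What a cooperative font would have declared (assumption, for comparison).
        Map<Integer, String> expected = parseBfChar("<22> <0030> <23> <0031>");
        for (Integer code : actual.keySet()) {
            System.out.printf("<%02x>: file says U+%04X, should be U+%04X%n",
                    code, (int) actual.get(code).charAt(0),
                    (int) expected.get(code).charAt(0));
        }
    }
}
```

Running `main` reports U+0000 for both codes in the file's mapping, so a text extractor that trusts ToUnicode has no digit to output.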

The Encoding does not help either:

<</Type/Encoding/Differences[ 0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g121/g122]>>

The glyph names /g121 and /g122 don't have a standardized meaning either.

PDFBox relies on these two font properties (ToUnicode and Encoding) for text extraction and, therefore, fails here.
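The two lookups can be modelled roughly as follows. This is a simplified sketch, not PDFBox's actual code: the class `GlyphLookupSketch`, the decision to treat U+0000 as unusable, and the two-entry glyph list are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the two lookups described above; not PDFBox's actual code.
final class GlyphLookupSketch {
    /** Returns usable text for a code, or null when neither source yields any. */
    static String resolve(int code,
                          Map<Integer, String> toUnicode,
                          Map<Integer, String> differences,
                          Map<String, String> glyphList) {
        // 1. ToUnicode mapping; U+0000 is treated as unusable (sketch assumption).
        String text = toUnicode.get(code);
        if (text != null && !text.isEmpty() && text.charAt(0) != 0) {
            return text;
        }
        // 2. Glyph name from the Encoding, looked up in a list of standard names.
        String glyphName = differences.get(code);
        return glyphName == null ? null : glyphList.get(glyphName);
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x22, String.valueOf((char) 0));
        toUnicode.put(0x23, String.valueOf((char) 0));

        Map<Integer, String> differences = new HashMap<>();
        differences.put(0x22, "g121");
        differences.put(0x23, "g122");

        // Tiny excerpt of standard glyph names; the real list is much longer.
        Map<String, String> glyphList = new HashMap<>();
        glyphList.put("zero", "0");
        glyphList.put("one", "1");

        System.out.println(resolve(0x22, toUnicode, differences, glyphList));
        System.out.println(resolve(0x23, toUnicode, differences, glyphList));
    }
}
```

Both calls return `null` under these assumptions: the ToUnicode entries are U+0000, and the names g121 and g122 are not in the glyph list. Had the Encoding used standard names such as zero and one, the second lookup would have recovered the digits.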

Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.

The file does contain such entries. Unfortunately, though, they are erroneous, like this one for the digit '0':

/P <</MCID 23>>/Span <</ActualText<FEFF0030>>>BDC 

The BDC instruction only expects a single name and a single dictionary. The above sequence of name, dictionary, name, and dictionary, therefore, is invalid.
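As a rough illustration of both points, the sketch below decodes the ActualText hex string from the snippet and checks the shape of the operands in front of BDC. The class `ActualTextSketch`, its `Kind` enum and both methods are hypothetical and do not mirror any library's parser. The decoder only handles strings that start with the FEFF byte order mark, as the one in this file does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers for illustration; they do not mirror any library's parser.
final class ActualTextSketch {
    enum Kind { NAME, DICT }

    /** Decodes a PDF hex string that starts with the FEFF byte order mark. */
    static String decodeUtf16HexString(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        // The UTF_16 charset consumes the byte order mark when decoding.
        return new String(bytes, StandardCharsets.UTF_16);
    }

    /** BDC takes a tag name followed by one property operand. */
    static boolean isValidBdcOperands(List<Kind> operands) {
        if (operands.size() != 2 || operands.get(0) != Kind.NAME) {
            return false;
        }
        // The properties are an inline dictionary; the PDF specification also
        // allows a name that refers to a property list resource.
        Kind properties = operands.get(1);
        return properties == Kind.DICT || properties == Kind.NAME;
    }

    public static void main(String[] args) {
        // Operands as they appear in the file: /P <<...>> /Span <<...>> BDC
        List<Kind> fromFile = Arrays.asList(Kind.NAME, Kind.DICT, Kind.NAME, Kind.DICT);
        System.out.println(isValidBdcOperands(fromFile));
        System.out.println(decodeUtf16HexString("FEFF0030"));
    }
}
```

The operand check fails for the four-operand sequence, while the hex string itself decodes to the digit 0. The ActualText value is therefore correct, but it sits in an instruction that a strict parser has no reason to accept.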

Because of that, Adobe Acrobat also used not to extract the actual text here. Only recently, probably as of the early 2022 releases, did Acrobat start extracting a '0' here.


Actually, one known "trick" to prevent one's PDFs from being text-extracted by regular text extraction programs is to add incorrect ToUnicode and Encoding information but correct ActualText entries.

So it's possible that the error in your file is actually an application of this trick, maybe even by design, with the erroneous ActualText as a twist to lead text extractors with some ActualText support astray while still allowing copy & paste from Adobe Acrobat.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mkl