Extracting exact table data from PDF

I am trying to extract each row of a table from a PDF file that I created earlier.

The problem is that empty cells, which I expected to be stored as 'null', are skipped entirely; they are not even read as space characters.

[Image: extracted from PDF]

I extract the content from my PDF via this method:

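    // Package names as in PDFBox 2.x
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public final ArrayList<String> extractLines(final File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper strip = new PDFTextStripper();
            String txt = strip.getText(doc);
            // Split on \n or \r\n, since the line separator depends on the platform
            String[] arr = txt.split("\\r?\\n");
            return new ArrayList<>(Arrays.asList(arr));
        }
    }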

Is it even possible to extract the data with whitespaces?

If so, with PDFBox? Or a different method?

EDIT:

I cannot get traprange to work. A simple test:

    File e = new File("C:/Users/Test/Downloads/a.pdf");
    List<Table> t = new PDFTableExtractor().setSource(e).extract();
    System.out.println(t.get(0).toString());

Only gives me:

[Image: output of the test]

Could it have to do with the form of my table?

My table:

[Image: the table]



Solution 1:[1]

I came up with my own solution.

Since my data is a 2D ArrayList, each inner list holds one row of the table.

I record the position of the non-empty cell in each row (only one cell per row is non-empty at any time).

I save these positions in a metadata field of the PDF and load that field later to get the positions back.
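The idea above could be sketched roughly as follows. This is not the original code: the class `CellPositions`, its method names, and the comma-separated encoding are all assumptions made for illustration. The sketch only builds and parses the position string; with PDFBox, one place such a string might be stored is a custom document-information entry (`PDDocumentInformation.setCustomMetadataValue` and `getCustomMetadataValue`).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper: remembers which column of each row holds the non-empty cell. */
final class CellPositions {

    /** Encodes, per row, the index of the first non-empty cell (-1 if the row is empty). */
    static String encode(List<List<String>> rows) {
        StringJoiner joiner = new StringJoiner(",");
        for (List<String> row : rows) {
            int index = -1;
            for (int i = 0; i < row.size(); i++) {
                String cell = row.get(i);
                if (cell != null && !cell.isEmpty()) {
                    index = i;
                    break;
                }
            }
            joiner.add(Integer.toString(index));
        }
        return joiner.toString();
    }

    /** Rebuilds full rows from the extracted values (in row order) and the encoded positions. */
    static List<List<String>> decode(String encoded, List<String> values, int columnCount) {
        List<List<String>> rows = new ArrayList<>();
        if (encoded.isEmpty()) {
            return rows;
        }
        int next = 0;
        for (String token : encoded.split(",")) {
            int index = Integer.parseInt(token);
            // Start with a row of empty cells, then fill in the one known value
            List<String> row = new ArrayList<>(Collections.nCopies(columnCount, ""));
            if (index >= 0) {
                row.set(index, values.get(next++));
            }
            rows.add(row);
        }
        return rows;
    }
}
```

When writing the PDF, the result of `encode` would be stored alongside the table; when reading, the stored string and the values returned by the text extraction would be passed to `decode`. This assumes the text extraction returns exactly one value per non-empty row, in the same order as the rows.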

Solution 2:[2]

This task needs a custom algorithm. Please check this solution for a custom PDFTableStripper.

Another great solution has been implemented by Tho and can be found at traprange. It can extract the null data of a particular cell.
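To give a rough idea of what such a custom algorithm might do, here is a minimal sketch that places text fragments into columns by their x coordinate, so that a column with no text becomes an empty string instead of disappearing. It is not the code of PDFTableStripper or traprange; the class `ColumnAssigner`, the `Fragment` type, and the column boundaries are hypothetical. With PDFBox, the coordinates could come from `TextPosition.getX()` inside an overridden `PDFTextStripper.writeString(String, List<TextPosition>)`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: assigns text fragments of one table row to columns by x coordinate. */
final class ColumnAssigner {

    /** A piece of text and the x coordinate where it starts (hypothetical type). */
    static final class Fragment {
        final float x;
        final String text;

        Fragment(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Builds one row. columnStarts must be sorted ascending; column i covers
     * the range from columnStarts[i] up to (not including) columnStarts[i + 1].
     * Fragments left of the first boundary are placed in column 0.
     */
    static List<String> toRow(List<Fragment> fragments, float[] columnStarts) {
        List<String> row = new ArrayList<>(Collections.nCopies(columnStarts.length, ""));
        if (columnStarts.length == 0) {
            return row;
        }
        for (Fragment f : fragments) {
            // Find the last column whose left boundary is at or before the fragment
            int col = 0;
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i;
                }
            }
            String current = row.get(col);
            row.set(col, current.isEmpty() ? f.text : current + " " + f.text);
        }
        return row;
    }
}
```

The column boundaries have to be known in advance here, for example because the table was generated with fixed column positions. Detecting them automatically is the harder part that the libraries mentioned above take care of.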

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dahlin
Solution 2 Abdul Alim Shakir