C# iText7: looping through text in a PDF document

Currently we convert a bunch of PDFs to XLSX and then use VBA to scrape them for the data we need. Every PDF converter I've tried converts the documents differently, which is annoying to deal with, so I had the bright idea to convert them myself in C#.

Using iText7, I can grab all the text and store it in a string using the code below, but that's not very useful on its own, as I need to be able to loop through it and grab what I need.

using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static string pdfTextExtract(string path)
{
    var pageText = new StringBuilder();
    using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(path)))
    {
        var pageCount = pdfDocument.GetNumberOfPages();

        // iText page numbers are 1-based.
        for (int i = 1; i <= pageCount; i++)
        {
            // Use a fresh strategy per page so text from earlier pages is not repeated.
            var strategy = new LocationTextExtractionStrategy();
            var page = pdfDocument.GetPage(i);

            // GetTextFromPage runs its own PdfCanvasProcessor, so none is created here.
            // AppendLine keeps the last line of one page from running into the next page.
            pageText.AppendLine(PdfTextExtractor.GetTextFromPage(page, strategy));
        }
    }

    return pageText.ToString();
}

Hopefully someone can help me figure out how to loop through the PDF line by line rather than grabbing the whole page, or how to loop through the string cleanly to grab names and figures.
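
To make the second option concrete, here is a minimal sketch of looping through the extracted string line by line. It uses only the .NET standard library, so it works on whatever string pdfTextExtract returns. The class name PdfLineParser, the method ParseLines, and above all the line layout (a name, some whitespace, then a figure at the end of the line, such as "Widget A   1,234.56") are assumptions made for illustration; the real documents will almost certainly need a different pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PdfLineParser
{
    // Hypothetical layout: a name, whitespace, then a figure ending the line,
    // e.g. "Widget A   1,234.56". Adjust the pattern to the real documents.
    private static readonly Regex NameAndFigure =
        new Regex(@"^(?<name>.+?)\s+(?<figure>-?\d[\d,]*(?:\.\d+)?)$");

    public static List<KeyValuePair<string, decimal>> ParseLines(string text)
    {
        var results = new List<KeyValuePair<string, decimal>>();

        // Split on Windows and Unix line endings and skip empty lines.
        var lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var rawLine in lines)
        {
            var match = NameAndFigure.Match(rawLine.Trim());
            if (!match.Success)
            {
                continue; // Not a "name figure" line, e.g. a heading.
            }

            // Invariant culture: "," is the thousands separator, "." the decimal point.
            decimal figure;
            if (decimal.TryParse(match.Groups["figure"].Value, NumberStyles.Number,
                                 CultureInfo.InvariantCulture, out figure))
            {
                results.Add(new KeyValuePair<string, decimal>(
                    match.Groups["name"].Value, figure));
            }
        }

        return results;
    }
}
```

Calling PdfLineParser.ParseLines(pdfTextExtract(path)) would then give a list of name/figure pairs to iterate over with foreach. Lines that do not fit the assumed pattern are skipped rather than reported, which is probably too forgiving for production use but keeps the sketch short.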



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow