C# iText7: looping through text in a PDF document
Currently we convert a bunch of PDFs to XLSX and then use VBA to scrape them for the data we need. Every PDF converter I've tried lays out the converted documents differently, which is a pain to deal with. So I had the bright idea of extracting the text myself in C#.
Using iText7 I can grab all the text and store it in a string with the code below, but that isn't very useful on its own, as I need to be able to loop through it and pick out what I need.
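    using System.Text;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    public static string pdfTextExtract(string path)
    {
        var pageText = new StringBuilder();
        using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(path)))
        {
            var pageNumbers = pdfDocument.GetNumberOfPages();
            // GetNumberOfPdfObjects() counts PDF objects, not text lines, so it is no help here.
            //var lineNumbers = pdfDocument.GetNumberOfPdfObjects();
            for (int i = 1; i <= pageNumbers; i++)
            {
                // Use a fresh strategy per page so earlier pages' text is not repeated.
                var strategy = new LocationTextExtractionStrategy();
                var page = pdfDocument.GetPage(i);
                // GetTextFromPage runs its own PdfCanvasProcessor, so a separate
                // processor and Reset() call are not needed.
                pageText.Append(PdfTextExtractor.GetTextFromPage(page, strategy));
                // Keep the last line of this page apart from the first line of the next.
                pageText.Append('\n');
            }
        }
        return pageText.ToString();
    }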
Hopefully someone can help me figure out how to loop through the PDF line by line rather than grabbing a whole page at a time, or how to loop through the resulting string cleanly to grab names and figures.
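For the second option, looping through the extracted string, a minimal sketch might look like the following. It assumes the extracted text separates lines with newline characters (which is what `LocationTextExtractionStrategy` typically produces) and that the lines of interest look like a name followed by a trailing number, for example `John Smith 1,234.50`. The class name `ExtractedTextParser`, its method names, and that line layout are all hypothetical; the regular expression would need adjusting to the real documents.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iText: works on plain text that has already been extracted.
public static class ExtractedTextParser
{
    // Assumed line layout: a label, whitespace, then a trailing number such as "1,234.50".
    private static readonly Regex NameAndFigure =
        new Regex(@"^(?<name>.+?)\s+(?<figure>-?\d[\d,]*(?:\.\d+)?)$");

    // Splits the extracted text into trimmed, non-empty lines.
    public static IEnumerable<string> GetLines(string text)
    {
        var rawLines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var raw in rawLines)
        {
            var line = raw.Trim();
            if (line.Length > 0)
            {
                yield return line;
            }
        }
    }

    // Returns (name, figure) pairs for every line that matches the assumed layout.
    public static List<KeyValuePair<string, decimal>> ParseNamesAndFigures(string text)
    {
        var results = new List<KeyValuePair<string, decimal>>();
        foreach (var line in GetLines(text))
        {
            var match = NameAndFigure.Match(line);
            if (!match.Success)
            {
                continue; // Not a name/figure line; skip it.
            }

            decimal figure;
            // InvariantCulture: "," is the thousands separator and "." the decimal point.
            if (decimal.TryParse(match.Groups["figure"].Value, NumberStyles.Number,
                                 CultureInfo.InvariantCulture, out figure))
            {
                results.Add(new KeyValuePair<string, decimal>(match.Groups["name"].Value, figure));
            }
        }
        return results;
    }
}
```

It could be called as `ExtractedTextParser.ParseNamesAndFigures(pdfTextExtract(path))`. Any line that happens to end in a number, such as a "Page 1 of 3" footer, would also match, so some filtering on the name part is likely to be needed in practice.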
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
