PdfTextExtractor.GetTextFromPage() returns empty string

I'm trying to extract the text from the PDF at the URL below with the following code (using iText7 7.2.2):

var source = (string)GetHttpResult("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf", new CookieContainer());
var bytes = Encoding.UTF8.GetBytes(source);
var stream = new MemoryStream(bytes);
var reader = new PdfReader(stream);
var doc = new PdfDocument(reader);
var pages = doc.GetNumberOfPages();
var text = PdfTextExtractor.GetTextFromPage(doc.GetPage(1));

Loading the PDF in my browser (Edge 100.0) works fine.

GetHttpResult() is a simple HttpClient wrapper that sets a custom CookieContainer and a custom UserAgent, then calls ReadAsStringAsync(). Nothing fancy.

source has the correct PDF content, starting with "%PDF-1.7".

pages has the correct number of pages, which is 2.

But, whatever I try, text is always empty.

Defining an explicit TextExtractionStrategy, trying different Encodings, extracting from all pages in a loop: nothing makes a difference. text is always empty, and no Exception is thrown anywhere.

I suspect I'm not reading this PDF the way it's "meant" to be read, but what is the correct way then, given that source has the correct content, the number of pages is correct, and no Exception is thrown anywhere?

Thanks.



Solution 1:[1]

That's it! Thanks to mkl and KJ!

First, I downloaded the PDF as a byte array instead of a string, so I can be sure it is not modified in any way.
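To illustrate why the byte array matters: the code in the question reads the response with ReadAsStringAsync() and converts it back with Encoding.UTF8.GetBytes(). The sketch below is not taken from the thread and does not touch the real PDF; the sample bytes are made up to stand in for binary stream data. It only shows that bytes which are not valid UTF-8 do not survive a string round trip, even though the ASCII header still looks fine. Whether this was the only problem with the original file is not established here.

```csharp
using System;
using System.Linq;
using System.Text;

// Made-up sample: an ASCII "%PDF" header followed by bytes that are not valid UTF-8,
// standing in for compressed stream data inside a PDF.
byte[] original = { 0x25, 0x50, 0x44, 0x46, 0x80, 0xFF, 0xFE, 0x00 };

// Same round trip as in the question: bytes -> string -> bytes.
string asText = Encoding.UTF8.GetString(original);
byte[] roundTripped = Encoding.UTF8.GetBytes(asText);

// Invalid sequences were replaced with U+FFFD, so the data is no longer the same.
Console.WriteLine(original.SequenceEqual(roundTripped));              // False
// The ASCII header survives, which is why the string still "looks" like a PDF.
Console.WriteLine(asText.StartsWith("%PDF", StringComparison.Ordinal)); // True
Console.WriteLine($"{original.Length} bytes in, {roundTripped.Length} bytes out");
```

This would be consistent with source starting with "%PDF-1.7" while the binary parts of the file are already damaged.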

Then, since pdftotext is able to extract the text from this PDF, I searched for a NuGet package able to do the same. I tested almost ten of them, and FreeSpire.PDF finally did it!

Update: Actually, FreeSpire.PDF missed some words, so I kept looking and finally found PdfPig, which is able to extract every single word.

Code using PdfPig:

using System.Collections.Generic;
using System.Net.Http;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}

List<string> words = new();
using (PdfDocument document = PdfDocument.Open(bytes))
{
    foreach (Page page in document.GetPages())
    {
        foreach (Word word in page.GetWords())
        {
            words.Add(word.Text);
        }
    }
}

string text = string.Join(" ", words);

Code using FreeSpire.PDF:

using System.Net.Http;
using Spire.Pdf;
using Spire.Pdf.Exporting.Text;

byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}

string text = string.Empty;
SimpleTextExtractionStrategy strategy = new();
using (PdfDocument doc = new())
{
    doc.LoadFromBytes(bytes);
    foreach (PdfPageBase page in doc.Pages)
    {
        text += page.ExtractText(strategy);
    }
}
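Both snippets block on the download with GetAwaiter().GetResult(). In async code, the download step could instead look roughly like the sketch below. LooksLikePdf and DownloadPdfAsync are hypothetical helper names, not part of any library. The header check only catches responses that are obviously not a PDF (an HTML error page, for example); it does not prove the file is intact.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper: true when the buffer starts with the "%PDF-" signature.
static bool LooksLikePdf(byte[] data)
{
    byte[] signature = Encoding.ASCII.GetBytes("%PDF-");
    return data.Length >= signature.Length
        && data.Take(signature.Length).SequenceEqual(signature);
}

// Hypothetical helper: fetches the raw bytes, with no text decoding in between.
static async Task<byte[]> DownloadPdfAsync(HttpClient client, string url)
{
    byte[] data = await client.GetByteArrayAsync(url);
    if (!LooksLikePdf(data))
    {
        throw new InvalidOperationException("Response does not start with a PDF header.");
    }
    return data;
}
```

A caller would then write something like byte[] bytes = await DownloadPdfAsync(client, url); and pass bytes to PdfDocument.Open(bytes) or doc.LoadFromBytes(bytes) as above.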

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
