How to deal with a large PDF?
I'm trying to extract text from a large PDF using the code below, and it always hits the timeout. The file comes from an Azure blob; it is 7.3 MB and has 140 pages, all of them images.
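import os

# Set the endpoint before importing tika so the client picks it up.
os.environ['TIKA_SERVER_ENDPOINT'] = 'http://0.0.0.0:9998/'

from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFextractInlineImages": "true",  # run OCR against inline images
}

# buffer is the stream downloaded from the Azure blob
data = parser.from_buffer(
    buffer.readall(),
    xmlContent=True,
    requestOptions={
        "headers": headers,
        "timeout": 3600,
    },
)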
Is there a header I'm missing for handling large files?
I'm running tika-server directly from the Docker image with this command:
docker run -d -p 9998:9998 apache/tika:1.28.2-full
Thanks for your time!
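One way to keep each request to the server small, sketched here only as an illustration and not as a confirmed fix for the timeout, is to send the 140 pages in batches instead of as one document. The names page_batches, extract_in_batches and extract_range are hypothetical; extract_range stands in for whatever code builds a smaller PDF from a page range and passes it to parser.from_buffer.

```python
from typing import Callable, Iterator, List, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) page ranges covering 0..total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def extract_in_batches(
    total_pages: int,
    batch_size: int,
    extract_range: Callable[[int, int], str],
) -> List[str]:
    """Call extract_range once per batch and collect the results in page order.

    extract_range is a hypothetical caller-supplied function: it should build
    a smaller PDF from pages [start, stop) and send only that to the server.
    """
    return [
        extract_range(start, stop)
        for start, stop in page_batches(total_pages, batch_size)
    ]
```

With 140 pages and a batch size of 50, extract_range would be called for pages 0-49, 50-99 and 100-139. Whether this avoids the timeout depends on how long OCR takes per page, which would need to be measured against the actual file.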
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
