Extract text from a PDF file in an S3 bucket with Python
I have files in several formats in my AWS S3 bucket (pdf, doc, rtf, odt, png) and I need to extract text from them. I have managed to get the list of objects with their paths. Depending on the file type, I will use different libraries to extract the text. Since there can be thousands of files, I need to extract the text directly from S3 instead of downloading them.
filespath=['https://abc.s3.ap-south-1.amazonaws.com/DocumentOnPATest', 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf', 'https://abc.s3.ap-south-1.amazonaws.com/receipt.png', 'https://abc.s3.ap-south-1.amazonaws.com/sample.rtf', 'https://abc.s3.ap-south-1.amazonaws.com/sample1.odt']
bucketname = 'abc'
I tried the following, but it gives me an error:
for path in filespath:
    ext=pathlib.Path(path).suffix
    if ext=='.pdf':
        pdf_file=PyPDF2.PdfFileReader(path)
        print(pdf_file.extractText())
This is the error I get:
  File "F:\Projects\FileExtractor\fileextracts3.py", line 28, in <module>
    pdf_file=PyPDF2.PdfFileReader(path)
  File "C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1081, in __init__
    fileobj = open(stream, 'rb')
OSError: [Errno 22] Invalid argument: 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf'
Please point me in the right direction. Thank you.
Solution 1:[1]
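PyPDF2 does not support reading from S3 directly. PdfFileReader expects a local file path or a file-like object, and, as the traceback shows, it passes a string argument straight to open(), which cannot open a URL. You'll need to download the files locally first.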
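The download-first approach could be sketched roughly as follows. This is only an illustration under some assumptions: the URLs are virtual-hosted-style, as in the question, so the object key is the URL path; s3_client is a boto3 S3 client with credentials already configured; and key_from_url and download_to_temp are hypothetical helper names, not part of boto3 or PyPDF2.

```python
import os
import tempfile
from urllib.parse import unquote, urlparse


def key_from_url(url):
    """Return the object key from a virtual-hosted-style S3 URL."""
    # Assumption: the bucket is in the host name, so the path is the key.
    return unquote(urlparse(url).path.lstrip('/'))


def download_to_temp(s3_client, bucket, key):
    """Download one object to a temporary file and return the local path."""
    # Keep the extension so the file type can still be detected later.
    suffix = os.path.splitext(key)[1]
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # the client reopens the path itself
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

With a real client, for example boto3.client('s3'), the returned path can be passed to PyPDF2.PdfFileReader and removed with os.remove once the text has been extracted, so only one file at a time occupies local disk.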
Solution 2:[2]
You could try the boto3 solution here, provided by Justin Leto. It reads the S3 object into memory as bytes. The linked answer covers the PDF case; you would still need a way to read or convert the file stream for each of the other file types.
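import boto3

bucket_name = 'abc'                  # bucket from the question
itemname = 'IndustryReport2019.pdf'  # object key, not the full URL

s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, itemname)
fs = obj.get()['Body'].read()        # entire object as bytes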
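To connect the bytes read above to text extraction, a minimal sketch might look like the following. It assumes the PyPDF2 1.x API that appears in the question (PdfFileReader, numPages, getPage, extractText); newer PyPDF2 releases renamed these to PdfReader, reader.pages and extract_text. The names read_object_bytes, pdf_text_from_bytes, EXTRACTORS and extract_text are hypothetical helpers for illustration. Note that the object is still transferred over the network; this only avoids writing it to disk.

```python
import io
import pathlib


def read_object_bytes(s3_client, bucket, key):
    """Return the full content of one S3 object as bytes (no local file)."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response['Body'].read()


def pdf_text_from_bytes(data):
    """Extract text from PDF bytes using the PyPDF2 1.x API."""
    import PyPDF2  # imported lazily so the other helpers work without it

    # BytesIO supports read() and seek(), which PdfFileReader needs.
    reader = PyPDF2.PdfFileReader(io.BytesIO(data))
    pages = (reader.getPage(i).extractText() for i in range(reader.numPages))
    return '\n'.join(pages)


# Hypothetical dispatch table: add one entry per format you support.
EXTRACTORS = {'.pdf': pdf_text_from_bytes}


def extract_text(key, data, extractors=EXTRACTORS):
    """Pick an extractor from the key's extension; None if unsupported."""
    suffix = pathlib.PurePosixPath(key).suffix.lower()
    extractor = extractors.get(suffix)
    return extractor(data) if extractor else None
```

A key without an extension, such as DocumentOnPATest from the question, returns None here, so those objects would need a different way to detect their type.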
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
