Search for two strings in multiple PDFs in an AWS S3 bucket with subdirectories, without downloading them to a local machine
I'm looking to search for two words in multiple PDFs located in an AWS S3 bucket. However, I don't want to download those documents to my local machine; instead, I'd like the search to run directly on the PDFs via their URLs. Note that these PDFs are located in multiple subdirectories within the bucket (a year folder, then a month folder, then a date folder).
Solution 1:[1]
Amazon S3 does not have a "Search" capability. It is a "simple storage service".
You would either need to download those documents to some form of compute platform (e.g. EC2, Lambda, or your own computer) and perform the searches there, or pre-index the documents using a service like Amazon OpenSearch Service and then send queries to the search service.
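To illustrate the first option, here is a minimal sketch of scanning PDFs from compute (not from S3 itself): it streams each object into memory and never writes to local disk, though the bytes still travel over the network to wherever the code runs. It assumes the `boto3` and `pypdf` libraries; the bucket name, prefix, and search terms in the commented usage are hypothetical.

```python
import io


def contains_all_terms(text, terms):
    """Return True only if every search term appears in the text (case-insensitive)."""
    lowered = text.casefold()
    return all(term.casefold() in lowered for term in terms)


def search_bucket_pdfs(bucket, prefix, terms):
    """Yield keys of PDFs under the prefix whose extracted text contains every term.

    boto3 and pypdf are imported lazily so the pure matching helper
    above stays usable without AWS credentials or those packages.
    """
    import boto3
    from pypdf import PdfReader

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    # list_objects_v2 with a prefix walks all nested year/month/date "folders".
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(".pdf"):
                continue
            # Stream the object body into memory; nothing touches local disk.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            reader = PdfReader(io.BytesIO(body))
            text = "".join(pg.extract_text() or "" for pg in reader.pages)
            if contains_all_terms(text, terms):
                yield key


# Hypothetical usage (bucket, prefix, and terms are placeholders):
#   for key in search_bucket_pdfs("my-reports-bucket", "2023/", ["invoice", "overdue"]):
#       print(key)
```

Note that `extract_text()` returns nothing useful for scanned (image-only) PDFs, which is exactly the limitation Solution 2 below discusses.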
Solution 2:[2]
Running a direct scan of PDFs in an S3 bucket to search for text is HARD:
- Some PDFs contain text embedded inside images, which is not readable in text form.
- If you want to process a PDF without saving it, consider using memory-optimized machines, avoid writing the files to the virtual machine's disk, and work with in-memory streams instead.
- To get around text inside images, you would need OCR logic, which is also HARD to execute. You'll probably want to use AWS Textract or Google Vision for OCR. If compliance and security are concerns, you could use Tesseract.
- If you do end up with a reliable OCR solution, I would suggest running a text-extraction job whenever an upload event happens. This will save you a lot of money on whatever OCR service you consume, and it also lets your organization cache the contents of each PDF in text form in more search-friendly services like AWS OpenSearch.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | John Rotenstein |
| Solution 2 | Allan Chua |
