How to search for file contents in an Amazon S3 bucket without downloading the files

I have n files uploaded to Amazon S3 and I need to *search* them for occurrences of a string in their contents. I tried downloading each file from the S3 bucket, converting the input stream to a string, and then searching the content for the word, but with more than five or six files this takes a long time.

Is there any other way to do this? Please help; thanks in advance.



Solution 1:[1]

If your files contain CSV, TSV, JSON, Parquet or ORC, you can take a look at AWS's Athena: https://aws.amazon.com/athena/

From their intro:

Amazon Athena is a fast, cost-effective, interactive query service that makes it easy to analyze petabytes of data in S3 with no data warehouses or clusters to manage.

It is unlikely to help you, though, as it sounds like you have plain text to search through.

Thought I'd mention it as it might help others looking to solve a similar problem.

Solution 2:[2]

Nope!

If you can't infer where the matches are from object metadata (like, the file name), then you're stuck with downloading & searching manually. If you have spare bandwidth, I suggest downloading a few files at a time to speed things up.
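As a rough sketch of "downloading a few files at a time", the Python snippet below fetches objects on a small thread pool and keeps the keys whose body contains the search string. The names `search_objects` and `fetch` are hypothetical: `fetch` stands for whatever call returns an object's bytes (with boto3 this could wrap `s3.get_object(Bucket=..., Key=...)["Body"].read()`), and the whole body is assumed to fit in memory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def search_objects(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],
    needle: str,
    max_workers: int = 8,
) -> List[str]:
    """Return the keys whose content contains `needle`, fetching in parallel."""
    key_list = list(keys)
    encoded = needle.encode("utf-8")

    def contains(key: str) -> bool:
        # fetch() is a caller-supplied function (an assumption of this sketch)
        # that returns the full object body as bytes.
        return encoded in fetch(key)

    # Downloads are I/O-bound, so threads overlap the network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(contains, key_list))

    # pool.map preserves input order, so keys and flags line up.
    return [key for key, hit in zip(key_list, flags) if hit]
```

Every object is still downloaded in full; the only gain is that several downloads overlap, so `max_workers` should be tuned to the bandwidth actually available.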

Solution 3:[3]

In a single word: NO!!

What you can do to improve performance is cache the files locally, so that you don't have to download the same file again and again.

You can probably use the Last-Modified header to check whether the local copy is stale, and download the file again only in that case.
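A minimal sketch of such a cache, in Python, might look like the following. `LocalCache`, `last_modified` and `fetch` are hypothetical names: the two callables stand for a cheap metadata request that returns the object's Last-Modified value as a string and for the full download, respectively.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


class LocalCache:
    """Keeps downloaded objects on disk and re-downloads only when stale."""

    def __init__(
        self,
        root: Path,
        last_modified: Callable[[str], str],
        fetch: Callable[[str], bytes],
    ) -> None:
        self.root = root
        self.last_modified = last_modified  # cheap metadata lookup (assumed)
        self.fetch = fetch                  # full download (assumed)
        self.stamp_file = root / "stamps.json"
        self.stamps: Dict[str, str] = (
            json.loads(self.stamp_file.read_text())
            if self.stamp_file.exists()
            else {}
        )

    def get(self, key: str) -> bytes:
        # Hash the key so that slashes and odd characters are safe as a file name.
        path = self.root / hashlib.sha256(key.encode("utf-8")).hexdigest()
        remote_stamp = self.last_modified(key)
        if self.stamps.get(key) == remote_stamp and path.exists():
            return path.read_bytes()  # local copy is still current

        # Missing or stale: download again and remember the new timestamp.
        body = self.fetch(key)
        path.write_bytes(body)
        self.stamps[key] = remote_stamp
        self.stamp_file.write_text(json.dumps(self.stamps))
        return body
```

The first search still pays for every download; repeated searches only pay for the metadata check plus any objects that changed in the meantime.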

Solution 4:[4]

My suggestion, since you seem to own the files, is to index them yourself, based on content. If there are a lot of "keywords" or metadata associated with each file, you can help yourself by using a lightweight database, where you will perform your queries and get the exact file(s) users are looking for. This will preserve bandwidth and also be much faster, at the cost of maintaining a kind of "indexing" system.
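One way to picture the "lightweight database" is a small inverted index in SQLite, sketched below in Python. The table name `postings`, the functions `build_index` and `find_keys`, and the word-splitting rule are all assumptions made for illustration; the index would be filled once at upload time (or in a one-off pass over the bucket) and then queried instead of S3.

```python
import re
import sqlite3
from typing import Iterable, List, Tuple


def build_index(
    conn: sqlite3.Connection, documents: Iterable[Tuple[str, str]]
) -> None:
    """Record, for every word, which object keys contain it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "word TEXT NOT NULL, object_key TEXT NOT NULL, "
        "PRIMARY KEY (word, object_key))"
    )
    for object_key, text in documents:
        # Lower-case and de-duplicate the words of this document.
        words = set(re.findall(r"\w+", text.lower()))
        conn.executemany(
            "INSERT OR IGNORE INTO postings (word, object_key) VALUES (?, ?)",
            [(word, object_key) for word in words],
        )
    conn.commit()


def find_keys(conn: sqlite3.Connection, word: str) -> List[str]:
    """Return the keys of all indexed objects that contain `word`."""
    rows = conn.execute(
        "SELECT object_key FROM postings WHERE word = ? ORDER BY object_key",
        (word.lower(),),
    )
    return [row[0] for row in rows]
```

For example, after `build_index(conn, [("a.txt", "Hello world")])`, calling `find_keys(conn, "world")` returns the keys to download without touching S3 at all. This sketch only matches whole words; substring search would need a different index.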

Another option (if each file does not contain much metadata) would be to reorganize the files in your buckets, adding prefixes which would "auto-index" them, as follows:

/foo/bar/randomFileContainingFooBar.dat
/foo/zar/anotherRandomFileContainingFooZar.dat

This way you might end up scanning the whole bucket in order to find the set of files you need (this is why I suggested this option only if you have little metadata), but you will only download the matching ones, which is still much better than your original approach.
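To illustrate the prefix idea, the Python sketch below takes the keys returned by a bucket listing and keeps those whose prefix segments contain every requested keyword. The function name `keys_matching` is hypothetical, and treating every path segment before the file name as a keyword is an assumption about the layout shown above.

```python
from typing import Iterable, List


def keys_matching(keys: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Return the keys whose prefix segments include all of `keywords`."""
    wanted = {keyword.lower() for keyword in keywords}
    matches = []
    for key in keys:
        # Everything except the final segment (the file name) is a keyword.
        segments = {s.lower() for s in key.strip("/").split("/")[:-1]}
        if wanted <= segments:
            matches.append(key)
    return matches
```

Listing keys transfers only names, not contents, so even a full scan of the listing is cheap compared with downloading every file; only the keys returned here need to be fetched.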

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Adrian Lynch
Solution 2 phs
Solution 3 Arun P Johny
Solution 4 Viccari