How to improve performance of listing files in a Google bucket?
I implemented a function in Java Spring Boot to list all images from a bucket that contains ~12,000 objects in total, ~4,000 of which are images.
```java
String pathPrefix = "/2021/";
Page<Blob> blobs = storage.list(bucket, Storage.BlobListOption.prefix(pathPrefix));
// Guard against objects that have no content type set.
Predicate<Blob> ONLY_IMAGES =
        b -> b.getContentType() != null && b.getContentType().startsWith("image/");
ArrayList<Blob> images = new ArrayList<>();
for (Blob blob : blobs.iterateAll()) {
    if (ONLY_IMAGES.test(blob)) {
        images.add(blob);
    }
}
```
The runtime of this small filter function is about 48 minutes.
Update
I tested the following two approaches:
- Reducing the amount of requested metadata with:

  ```java
  Storage.BlobListOption.fields(Storage.BlobField.CONTENT_TYPE)
  ```

- Processing the blobs in parallel with:

  ```java
  ArrayList<Blob> imgBlobs = StreamSupport.stream(
          storage.list(bucket, Storage.BlobListOption.prefix(pathPrefix),
                  Storage.BlobListOption.fields(Storage.BlobField.CONTENT_TYPE))
              .iterateAll().spliterator(), true)
      .filter(ONLY_IMAGES)
      .collect(Collectors.toCollection(ArrayList::new));
  ```
Unfortunately, neither approach improved performance.
Does anyone have any ideas on how to improve this?
Thank you.
Solution 1:[1]
The problem is that you are reading each blob's metadata one object at a time, which results in thousands of HTTP requests and a long processing time.
Ideas:
- Limit the `getContentType()` check to blobs whose filenames have image extensions such as `.jpg`.
- Rewrite the code to process blobs in parallel.
- Maintain an external database of the bucket's contents.
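The first idea can be sketched as a pure name-based filter: request only the object name when listing (e.g. `Storage.BlobListOption.fields(Storage.BlobField.NAME)`) and skip the content-type check entirely. The class name and extension set below are illustrative, not part of the original answer:

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class ImageNameFilter {
    // Hypothetical extension set; extend it to match what your bucket holds.
    private static final Set<String> IMAGE_EXTENSIONS =
            Set.of("jpg", "jpeg", "png", "gif", "webp");

    // True when the object name ends with a known image extension (case-insensitive).
    public static boolean looksLikeImage(String objectName) {
        int dot = objectName.lastIndexOf('.');
        if (dot < 0 || dot == objectName.length() - 1) {
            return false;
        }
        String ext = objectName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return IMAGE_EXTENSIONS.contains(ext);
    }

    public static void main(String[] args) {
        // With the real client you would list with
        //   Storage.BlobListOption.fields(Storage.BlobField.NAME)
        // and apply this predicate to blob.getName().
        List<String> names = List.of("/2021/a.JPG", "/2021/b.txt", "/2021/c.png");
        List<String> images = names.stream()
                .filter(ImageNameFilter::looksLikeImage)
                .collect(Collectors.toList());
        System.out.println(images); // [/2021/a.JPG, /2021/c.png]
    }
}
```

This avoids per-object content-type reads entirely, at the cost of trusting filename extensions rather than the stored `Content-Type` metadata.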
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | John Hanley |
