Optimal way to store/index/manage large amounts of image training data for machine learning?
I understand that you generally do not want to store images in a database. What you do instead is store metadata about each image (owner, creation date, size, file format, etc.) along with a link to the image itself (an S3 location or a path on the local filesystem). If you need to recover the image, you look up the path in the database and read the image in from object storage or the local filesystem.
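To make the pattern concrete, here is a minimal sketch of that metadata-plus-pointer layout, using sqlite3 and hypothetical table, column, and bucket names (nothing here comes from a real schema):

```python
import sqlite3

# Hypothetical schema: metadata plus a pointer to the image bytes,
# which live outside the database (an S3 key or a filesystem path).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        id INTEGER PRIMARY KEY,
        owner TEXT,
        created TEXT,
        size_bytes INTEGER,
        format TEXT,
        location TEXT  -- e.g. 's3://bucket/key.jpg' or '/data/key.jpg'
    )""")
conn.execute(
    "INSERT INTO images (owner, created, size_bytes, format, location) "
    "VALUES (?, ?, ?, ?, ?)",
    ("alice", "2020-01-01", 34567, "jpeg", "s3://training-images/0001.jpg"),
)

# To recover an image, look up its location and fetch it from storage
# (e.g. pass the S3 key to boto3's get_object, or open() a local path).
(location,) = conn.execute(
    "SELECT location FROM images WHERE id = 1").fetchone()
print(location)
```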
This design seems aimed at cases where the system only needs to read a few images per request, for example, fetching the handful of images that belong to a user's web page.
My situation is a little bit different. I'm aggregating a large number of labeled images for training data for various machine learning algorithms. For each image, there will be a row in a table that contains information on where the image came from, its size, and the labels associated with that image (i.e. one image might have the labels: ["car", "vehicle", "sedan", "honda", "civic", "blue", "2002"], another might only have ["vehicle", "truck"], another might have ["human", "pedestrian", "woman"]).
My goal is to have the data structured in such a way that I can make training sets of data from this table arbitrarily as I see fit according to different label groupings. So I could say 'gather all images with the label "animal", and group based on the label "dog", "cat", "horse"' (should one of those labels exist). Now, from my flat list of training data, I'll have images grouped into three categories that I can train a CNN classifier from.
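The "gather by one label, group by others" query above can be sketched with a two-table layout (one row per image, one row per image-label pair). Table names, the self-join, and the sample S3 keys are all illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# Hypothetical layout: one row per image, one row per (image, label)
# pair, so arbitrary label queries stay simple SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE images (id INTEGER PRIMARY KEY, location TEXT);
    CREATE TABLE labels (image_id INTEGER, label TEXT);
""")
rows = [
    (1, "s3://b/dog1.jpg", ["animal", "dog"]),
    (2, "s3://b/cat1.jpg", ["animal", "cat"]),
    (3, "s3://b/car1.jpg", ["vehicle", "car"]),
]
for img_id, loc, labels in rows:
    conn.execute("INSERT INTO images VALUES (?, ?)", (img_id, loc))
    conn.executemany("INSERT INTO labels VALUES (?, ?)",
                     [(img_id, lab) for lab in labels])

# All images labeled 'animal', bucketed by which class label each
# image carries (images without any class label drop out of the join).
query = """
    SELECT c.label, i.location
    FROM images i
    JOIN labels a ON a.image_id = i.id AND a.label = 'animal'
    JOIN labels c ON c.image_id = i.id
                 AND c.label IN ('dog', 'cat', 'horse')
"""
groups = {}
for cls, loc in conn.execute(query):
    groups.setdefault(cls, []).append(loc)
print(groups)
```

The result maps each class label to the image locations in that group, which is exactly the flat-list-to-categories step needed before training the classifier.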
The trouble comes from the fact that I can have millions of images, so if I run the above query to get all images with the label "animal", I will need to run the SQL query to find all the images with that label, then I will need to do millions of RPC calls to S3 or the local filesystem to actually get the image data I need. If I actually keep the images stored in the database, the images will come right out of the query itself.
So, as a general question, what is the best way to store and index a large number of images and their metadata for machine learning? On the one hand, we can simply group a large number of images into ZIP files and store the zip files in some object store. This is convenient because I only need a single handle and RPC call to get all of the training data onto whatever server I'm performing the ML training sequence on, but this causes me to lose any granular visibility into my training data. On the other hand, I can store all of my data indexed in some large SQL table, image data included. This gives me maximum visibility into my data, but is cost prohibitive and makes it inconvenient to actually get the images onto a server that needs the image data to perform a training sequence.
Solution 1:[1]
For training I would assume that you just want to read everything sequentially, so I would put everything (including the images) in a SQL database, or into a single file where metadata and image binary alternate line by line. Like
0|meta: Car,blue,Mercedes....
1|001010100010101011111111111111000010101000..............
2|meta: motorcycle,red,yamaha.....
3|010110101010101010110101011010101010101010..............
.
.
.
If you can't read the whole file at once and you want to know on which line your last training run stopped, you should also include an index at the start of each line (0|..., 1|...).
You will also need to convert the words into numbers for the actual training:
0|meta: 0,101,1001....
1|001010100010101011111111111111000010101000..............
2|meta: 2,102,1002.....
3|010110101010101010110101011010101010101010..............
.
.
.
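A minimal sketch of writing and resuming that interleaved file, under the assumption that image bytes are base64-encoded so the file stays line-oriented text (the filename, sample bytes, and metadata strings are invented for illustration):

```python
import base64

# Interleaved layout: one metadata line, then one image line, each
# prefixed with 'N|' so a resumed run can skip to where it stopped.
records = [
    (b"\x00\x01\x02", "car,blue,mercedes"),
    (b"\x03\x04\x05", "motorcycle,red,yamaha"),
]
with open("train.txt", "w") as f:
    line_no = 0
    for img_bytes, meta in records:
        f.write(f"{line_no}|meta: {meta}\n")
        f.write(f"{line_no + 1}|{base64.b64encode(img_bytes).decode()}\n")
        line_no += 2

# Resume reading from a saved line index (here: the second record).
start = 2
resumed = []
with open("train.txt") as f:
    for line in f:
        idx, payload = line.rstrip("\n").split("|", 1)
        if int(idx) >= start:
            resumed.append((int(idx), payload))
```

Skipping by scanning is O(n); for very large files you would instead persist byte offsets and `seek()` directly, but the line index is enough to make resumption deterministic.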
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
