'How to store large scale comparison?

I'm comparing images between thousands of users who can have between 1 and 12 photos. The comparison between two photos returns a score that I need to be stored so I don't make the comparison twice.

What is the best way of storing it?

I thought about storing in a table with one photo per row/column but this can quickly get out of hand



Solution 1:[1]

Multi-indexed pandas if you want to work in memory, like so:

df = pd.DataFrame(index=[['Alice/foo.png'], ['Bob/bar.png']], columns=['user1', 'user2', 'score'], data=[['Alice', 'Bob', 42.0]])
df.index.names = ['photo1', 'photo2']
df

                             user1 user2  score
photo1        photo2
user1/foo.png user2/bar.png  Alice   Bob   42.0

SQLite if you want to work on disk, like so

import sqlite3
# The important part: defining the table
conn = sqlite3.Connection('photos.sqlite')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS photos (photoid INTEGER PRIMARY KEY, photopath TEXT UNIQUE, userid INT)')
c.execute('CREATE TABLE IF NOT EXISTS photoscores (photoid1 INTEGER, photoid2 INTEGER, userid1 INT, userid2 INT, score REAL)')
c.execute('CREATE UNIQUE INDEX photopair on photoscores (photoid1, photoid2)')
conn.commit()

# Example of populating the table
sql = ('INSERT INTO photos(photopath, userid) VALUES ("foo.png", 1)',
       'INSERT INTO photos(photopath, userid) VALUES ("bar.png", 2)')
for statement in sql:
    c.execute(statement)
conn.commit()
sql = 'SELECT photoid FROM PHOTOS'
c.execute(sql)
values = [v[0] for v in enumerate(c.fetchall())]
# This is more complicated than it needs to be, especially since
# values will always be sorted if this code is run, but I'm just
# emphasizing the need to keep the photo ids and user ids aligned
sql = ('INSERT OR REPLACE INTO photoscores VALUES (%s, %s, %s, %s, 42.0)'
       % (tuple(sorted(values)) + ((1, 2) if values == sorted(values) else (2, 1))))
c.execute(sql)
conn.commit()

import pandas as pd
pd.read_sql('SELECT * FROM photoscores', conn)

  photoid1  photoid2  userid1  userid2  score
0        0         1        1        2   42.0

If you use the SQL code, I'd suggest always sorting the pair of photo IDs you compare.

The key thing is that you want something with an underlying hash map that will quickly tell you whether an existing pair of photos has already been compared.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1