Efficient looping and comparing properties of two similar objects

I have a function find() that needs to loop through a lot of objects to identify a similar object by comparing a bunch of properties.

class Target:

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

class Source:

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c


def find(target: Target, source_set: set):
    for s in source_set:
        if s.a == target.a:
            if s.b == target.b:
                if s.c == target.c:
                    print("Found!")


source_set = {
    Source(a=1, b=2, c=3),
    Source(a=4, b=2, c=4)
}

target = Target(a=4, b=2, c=4)

find(target, source_set)

The current function is very slow, as my source_set can contain millions of objects.

The creation of source_set and its Source objects can be adjusted (e.g. their type). The source_set itself is not modified after initialisation.

Each Source object is created from a dict with the same properties. The raw input data for one Source looks like this:

{'a': '1', 'b': '2', 'c': '3'}
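
Note that the raw values here are strings while the example targets use ints, so the values presumably need converting before comparisons can match. A minimal sketch of such a conversion (the make_source helper is hypothetical, assuming all properties should be ints):

def make_source(raw: dict) -> Source:
    # Convert the raw string values to ints so they compare equal to int targets.
    return Source(**{key: int(value) for key, value in raw.items()})

source = make_source({'a': '1', 'b': '2', 'c': '3'})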

The source_set is searched with many targets.

Is there a nice way to be more efficient? I'm hoping not to need to change the data structure.



Solution 1:[1]

Without any external libraries, you can define a __hash__ method on each class:

class Target:

    ...
       
    def __hash__(self):
        return hash(frozenset(self.__dict__.items()))


class Source:

    ...

    def __hash__(self):
        return hash(frozenset(self.__dict__.items()))

Now try:

count = len({hash(target)}.intersection(map(hash, source_set)))
print(count)

# Output
1
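
One caveat on this design: it compares hashes only, and two distinct objects can in principle share a hash, so a rare collision would be counted as a match. Adding an __eq__ check, as in Solution 3, removes that risk.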

Solution 2:[2]

Using Pandas:

# Python env: pip install pandas
# Miniconda env: conda install pandas
import pandas as pd

df = pd.DataFrame([s.__dict__ for s in source_set])
sr = pd.Series(target.__dict__)
print(df)
print(sr)

# Output of source_set
   a  b  c
0  4  2  4
1  1  2  3

# Output of target
a    4
b    2
c    4
dtype: int64

Find same rows:

>>> sr.eq(df).all(axis=1).sum()
1
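
If many targets are checked against the same df, one possible refinement (a sketch, not part of the original answer) is to build a pandas MultiIndex once and test membership against it, which is hash-based rather than a full column comparison per target:

# Build the index once from the DataFrame's columns.
idx = pd.MultiIndex.from_frame(df)

# Each lookup is then a membership test on a tuple of properties.
print((target.a, target.b, target.c) in idx)
# True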

Solution 3:[3]

Since the source_set is created only once but searched with many targets (as stated in your question), it pays off to invest time up front in a data structure for the source_set, because the reward is a time gain for every comparison later on.
Python's set provides the desired functionality: internally it is implemented as a hash table. To make use of the in operator, the elements in the set and the elements compared against it have to be hashable and comparable, i.e. both provide a __hash__ method and at least one of them provides an __eq__ method.

class Target:

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def __hash__(self):
        return hash((self.a, self.b, self.c))

class Source:

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def __eq__(self, other):
        return self.a == other.a and self.b == other.b and self.c == other.c

    def __hash__(self):
        return hash((self.a, self.b, self.c))

Building the set of Source elements is now a time investment, because the __hash__ method is called for each element that is added.

source_set = {
    Source(a=1, b=2, c=3),
    Source(a=4, b=2, c=4),
}

However, the reward is that checking whether a target is in the source_set now takes constant time on average, compared to your current approach of comparing the target to each of the Sources in turn, which takes linear time.

target = Target(a=4, b=2, c=4)
target in source_set
# returns True
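
With these classes in place, the original find function reduces to a single membership test (a sketch based on the classes above):

def find(target: Target, source_set: set) -> bool:
    # hash(target) locates the bucket; Source.__eq__ confirms the match.
    return target in source_set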

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Corralien
Solution 2 Corralien
Solution 3 Alex G