Python: searching for a way to remove duplicates from a list of pandas dataframes?
I have a list of pandas dataframes. Is there a way to remove duplicates from it? Here is some example code:
import pandas as pd
import numpy as np
if __name__ == '__main__':
    data1 = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
    df1 = pd.DataFrame.from_dict(data1, orient='index', columns=['A', 'B', 'C', 'D'])
    data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
    df2 = pd.DataFrame.from_dict(data2)
    df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                       columns=['a', 'b', 'c'])
    data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
                    dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
    df4 = pd.DataFrame(data, columns=['c', 'a'])
    l_input = [df1, df2, df1, df3, df4, df4, df1, df3]
    # l_aim = [df1, df2, df3, df4]
The duplicates in the input list l_input should be removed; l_aim shows the desired result.
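For context: dataframes are unhashable and == compares element-wise, so set() and plain membership tests fail here (exact error messages may vary by pandas version):
try:
    set(l_input)                 # DataFrames are unhashable
except TypeError as e:
    print(e)                     # unhashable type: 'DataFrame'
try:
    df1.copy() in [df1]          # == is element-wise, so bool() is ambiguous
except ValueError as e:
    print(e)                     # The truth value of a DataFrame is ambiguous...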
Solution 1:[1]
An efficient method to find the duplicates in linear time is to compute a hash of each dataframe. You can't do it with the built-in Python hash function, but there is a helper function in pandas: pandas.util.hash_pandas_object.
The function computes a hash per row, so you need to aggregate to a single value. sum could be used, but it might lead to collisions. Here I opted for a concatenation of all hashes. If you have huge dataframes this might consume a lot of memory (in that case, maybe hash the list of hashes).
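For reference, hash_pandas_object returns one unsigned 64-bit hash per row (the index is included by default), which is why the aggregation step is needed; a minimal sketch:
row_hashes = pd.util.hash_pandas_object(df3)
print(row_hashes.dtype)   # uint64
print(len(row_hashes))    # 3 -- one hash per row of df3
total = row_hashes.sum()  # the sum-based aggregation mentioned above (collision-prone)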
Update. The hash of hashes seems to be ideal, see the second option at the end of the answer.
hashes = [pd.util.hash_pandas_object(d).astype(str).str.cat(sep='-')
          for d in l_input]
# flag duplicates by list position
dups = pd.Series(hashes).duplicated()
Output:
0    False
1    False
2     True
3    False
4    False
5     True
6     True
7     True
dtype: bool
To filter the unique dataframes:
out = [d for d, h in zip(l_input, dups) if not h]
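As a quick sanity check, out should now contain the four distinct frames in order of first occurrence:
print(len(out))                                                     # 4
print(all(a.equals(b) for a, b in zip(out, [df1, df2, df3, df4])))  # True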
Variant with a hash of the hashes
I was initially unsure whether computing the hash of the per-row hashes would be collision-safe, but it seems to be, so the second method below should probably be preferred:
def df_hash(df):
    s = pd.util.hash_pandas_object(df)
    return hash(tuple(s))

hashes = [df_hash(d) for d in l_input]
dups = pd.Series(hashes).duplicated()
out = [d for d, h in zip(l_input, dups) if not h]
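The same hash can also drive a single-pass, dict-based dedup that keeps the first frame seen for each hash value (a sketch; in principle two distinct frames could still collide):
unique = {}
for d in l_input:
    unique.setdefault(df_hash(d), d)   # keeps the first frame per hash
out = list(unique.values())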
Solution 2:[2]
Try df.equals():
out = []
while l_input:
    d = l_input.pop()
    if any(d.equals(df) for df in l_input):
        continue
    out.append(d)

print(*out[::-1], sep="\n\n")
Prints:
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

   c  a
0  3  1
1  6  4
2  9  7
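Note that the loop above empties l_input via pop() and compares every pair, so it is O(n^2); a non-destructive sketch of the same idea, assuming order should be preserved:
def dedupe_equals(frames):
    unique = []
    for df in frames:
        # keep df only if no previously kept frame equals it
        if not any(df.equals(u) for u in unique):
            unique.append(df)
    return unique

l_aim = dedupe_equals([df1, df2, df1, df3, df4, df4, df1, df3])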
Solution 3:[3]
If these are literally the same objects (as in your example), then you could use their ids (dicts preserve insertion order, so the original order is kept):
out = list({id(df): df for df in l_input}.values())
If not, you could use equals:
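A minimal sketch of the equals-based variant, keeping the first occurrence of each frame:
out = []
for df in l_input:
    if not any(df.equals(seen) for seen in out):
        out.append(df)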
Output:
[       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d,
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d,
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9,
   c  a
0  3  1
1  6  4
2  9  7]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Andrej Kesely |
| Solution 3 | |
