'How can I remove duplicate parts of a string, within a column and sort the values
I have a dataframe like this (assuming one column):
column
[A,C,B,A]
[HELLO,HELLO,ha]
[test/1, test/1, test2]
The type of the column above is: dtype('O')
I would like to remove the duplicates here, resulting in:
column
[A,C,B] # - A
[HELLO, ha] # removing 1 hello
[test/1, test2] # removing 1 test/1
Then, I would like to sort the data
column
[A,B,C]
[ha, HELLO]
[test2, test/1] # assuming that number comes before /
I am struggling getting this done in a proper way. Hope anyone has nice ideas (would it make sense to transform to small lists?)
Solution 1:[1]
Assuming that you have lists in the column, use a list comprehension.
If you want to maintain order:
df['column_keep_order'] = [list(dict.fromkeys(x)) for x in df['column']]
If you want to sort the items:
df['column_sorted'] = [sorted(set(x)) for x in df['column']]
output:
column column_keep_order column_sorted
0 [A, C, B, A] [A, C, B] [A, B, C]
1 [HELLO, HELLO, ha] [HELLO, ha] [HELLO, ha]
2 [test/1, test/1, test2] [test/1, test2] [test/1, test2]
reproducible input:
df = pd.DataFrame({'column': [['A','C','B','A'],
['HELLO','HELLO','ha'],
['test/1', 'test/1', 'test2']]})
Solution 2:[2]
A set is a datatype which is like a list but it only includes unique values. This means that you can convert your list to a set and then back to a list (if you want a list of those unique values).
You would then write something like this:
list(set(name_of_the_list))
Solution 3:[3]
Nested list inside of a dataframe is not a good habits to use, maybe for some really specific problems. I strongly encourage you to read how to stick to tidy paradigm (I know it's from a R point of view, but it can be easily transcript into python).
Nonetheless, using Numpy functions as unique and sort can do the job for your example, assuming df is your dataframe :
Using a for loop on your dataframe
df = pd.DataFrame({"col" :
[["A","C","B","A"],
["HELLO","HELLO","ha"],
["test/1", "test/1", "test2"]]})
for n, value in df.iterrows():
df.loc[n, "col"] = np.sort(np.unique(value.loc["col"]))
df
Out[77]:
col
0 [A, B, C]
1 [HELLO, ha]
2 [test/1, test2]
Using applymap
df2 = pd.DataFrame({"col" :
[["A","C","B","A"],
["HELLO","HELLO","ha"],
["test/1", "test/1", "test2"]]})
df2 = df2.applymap(lambda x : np.sort(np.unique(x)))
df2
Out[75]:
col
0 [A, B, C]
1 [HELLO, ha]
2 [test/1, test2]
Solution 4:[4]
Leverage set.
If already a list
df['column2']=[list(set(x)) for x in df.column.to_list()]
column column2
0 [A, C, B, A] [A, C, B]
1 [HELLO, HELLO, ha] [HELLO, ha]
2 [test/1, test/1, test2] [test2, test/1]
otherwise
df['column2']=df['column'].str.replace('\]|\[','',regex=True).str.split(',').map(set).map(list)
or
df['column2']=df['column'].str.replace('\]|\[','',regex=True).apply(lambda x:list(set(x.split(','))))
column column2
0 [A, C, B, A] [A, C, B]
1 [HELLO, HELLO, ha] [HELLO, ha]
2 [test/1, test/1, test2] [test2, test/1]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |
| Solution 2 | António Rebelo |
| Solution 3 | Zelemist |
| Solution 4 |
