How do I filter a certain column, removing repeated data?
I want to plot the weights in a histogram, but with each name appearing only once.
df = pd.DataFrame({'Name': ['Bob', 'Simon', 'Bill', 'Mary', 'Mary', 'Bob'],
'Weight': [70, 72, 71, 67, 67, 70]})
Expected output:
Bob 70
Simon 72
Bill 71
Mary 67
Solution 1:[1]
Use drop_duplicates:
out = df.drop_duplicates(['Name', 'Weight'])
print(out)
# Output
Name Weight
0 Bob 70
1 Simon 72
2 Bill 71
3 Mary 67
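Note that deduplicating on both columns keeps a name twice if it ever appears with two different weights. A minimal sketch of the difference, assuming one row per name is wanted; the final Bob weight of 75 is hypothetical data, not from the question:

```python
import pandas as pd

# Hypothetical variation on the question's data: the last Bob row weighs 75, not 70.
df = pd.DataFrame({'Name': ['Bob', 'Simon', 'Bill', 'Mary', 'Mary', 'Bob'],
                   'Weight': [70, 72, 71, 67, 67, 75]})

# Deduplicating on both columns keeps both Bob rows, because the weights differ.
both = df.drop_duplicates(['Name', 'Weight'])

# Deduplicating on Name alone keeps only the first row seen for each name.
by_name = df.drop_duplicates(subset='Name', keep='first')

print(len(both))
print(by_name)
```

With the question's original data the two calls give the same four rows, so the choice only matters when a name can carry conflicting weights.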
Solution 2:[2]
You need a groupby:
df.groupby('Name')['Weight'].mean()
If you want to take just the first data point available for each name:
df.groupby('Name')['Weight'].first()
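The two groupby calls above can be compared side by side. This is a sketch on the question's data; passing sort=False is an optional addition here, used to keep names in their order of first appearance instead of alphabetical order:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'Simon', 'Bill', 'Mary', 'Mary', 'Bob'],
                   'Weight': [70, 72, 71, 67, 67, 70]})

# Average weight per name; group keys are sorted alphabetically by default
# and the result is a float Series indexed by Name.
mean_w = df.groupby('Name')['Weight'].mean()

# First weight recorded for each name, in order of first appearance.
first_w = df.groupby('Name', sort=False)['Weight'].first()

print(mean_w)
print(first_w)
```

Both results are Series indexed by Name, not DataFrames with a Name column.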
Solution 3:[3]
We can use the groupby function with mean as the aggregate function.
The data looks like this:
>>> df = pd.DataFrame({'Name': ['Bob', 'Simon', 'Bill', 'Mary', 'Mary', 'Bob'], 'Weight': [70, 72, 71, 67, 67, 70]})
>>> print(df)
Name Weight
0 Bob 70
1 Simon 72
2 Bill 71
3 Mary 67
4 Mary 67
5 Bob 70
>>> df2 = df.groupby(['Name']).mean()
>>> print(df2)
       Weight
Name
Bill     71.0
Bob      70.0
Mary     67.0
Simon    72.0
Convert the Name index into a regular column and replace the index with a RangeIndex:
>>> df2['Name'] = df2.index
>>> df2 = df2[['Name', 'Weight']]
>>> df2.set_index(pd.RangeIndex(start=0, stop=len(df2), step=1), inplace=True)
>>> print(df2)
    Name  Weight
0   Bill    71.0
1    Bob    70.0
2   Mary    67.0
3  Simon    72.0
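The same result can probably be reached with less code. A sketch of two shorter alternatives to the manual index conversion above, using as_index=False or reset_index (the variable names df2 and df3 are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'Simon', 'Bill', 'Mary', 'Mary', 'Bob'],
                   'Weight': [70, 72, 71, 67, 67, 70]})

# as_index=False keeps Name as an ordinary column and gives a default RangeIndex.
df2 = df.groupby('Name', as_index=False)['Weight'].mean()

# Equivalent: aggregate first, then move the Name index back into a column.
df3 = df.groupby('Name')['Weight'].mean().reset_index()

print(df2)
```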
Solution 4:[4]
Drop the rows that repeat the same Name and Weight:
df = df.drop_duplicates(subset=['Name', 'Weight'])
print(df)
Output:
Name Weight
0 Bob 70
1 Simon 72
2 Bill 71
3 Mary 67
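Since the question's goal is a histogram, the de-duplicated weights can then be binned. This is only a sketch: the choice of numpy.histogram and of 3 bins is an assumption, not part of any answer here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'Simon', 'Bill', 'Mary', 'Mary', 'Bob'],
                   'Weight': [70, 72, 71, 67, 67, 70]})

# One row per person, so nobody is counted twice in the histogram.
unique_people = df.drop_duplicates(subset=['Name', 'Weight'])

# Bin the weights; bins=3 is an arbitrary choice for illustration.
counts, edges = np.histogram(unique_people['Weight'], bins=3)

print(counts)
print(edges)
```

If matplotlib is installed, unique_people['Weight'].plot.hist() should draw the same distribution directly.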
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
| Solution 2 | Code Different |
| Solution 3 | Rajarshi Ghosh |
| Solution 4 | Muhammad Mohsin Khan |
