Counting unique values in a column in pandas dataframe like in Qlik?
If I have a table like this:
import pandas as pd

df = pd.DataFrame({
    'hID': [101, 102, 103, 101, 102, 104, 105, 101],
    'dID': [10, 11, 12, 10, 11, 10, 12, 10],
    'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
    'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})
I can do count(distinct hID) in Qlik to get a count of 5 unique hID values. How do I do that in Python using a pandas dataframe? Or maybe a numpy array? Similarly, if I were to do count(hID) I would get 8 in Qlik. What is the equivalent way to do it in pandas?
Solution 1:[1]
Assuming data is the name of your dataframe and 'race' is the column you are interested in (hID in the question), you can do:
data['race'].value_counts()
This shows each distinct value and the number of times it occurs.
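As a minimal sketch of the above applied to the question's own hID column (the variable names counts, n_distinct and n_rows are just illustrative), value_counts gives the per-value tally, and its length gives the distinct count:

```python
import pandas as pd

# Only the hID column from the question is needed here.
df = pd.DataFrame({'hID': [101, 102, 103, 101, 102, 104, 105, 101]})

# Series mapping each distinct value to its number of occurrences,
# sorted from most to least frequent.
counts = df['hID'].value_counts()

n_distinct = len(counts)      # one entry per distinct value
n_rows = int(counts.sum())    # occurrences add up to the non-null row count

print(counts)
print(n_distinct, n_rows)     # 5 8
```

Note that value_counts drops NaN by default, so the two numbers only cover non-null values unless you pass dropna=False.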
Solution 2:[2]
Or get the number of unique values for each column:
df.nunique()
dID    3
hID    5
mID    3
uID    5
dtype: int64
New in pandas 0.20.0: pd.DataFrame.agg
df.agg(['count', 'size', 'nunique'])
         dID  hID  mID  uID
count      8    8    8    8
size       8    8    8    8
nunique    3    5    3    5
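In the table above, count and size agree only because the example data has no missing values. A small sketch of how the three differ once a NaN is present (the series s is a hypothetical variant of the hID column, not data from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the hID column with one missing value.
s = pd.Series([101, 102, np.nan, 101], name='hID')

n_size = s.size                                # 4: every row, NaN included
n_count = s.count()                            # 3: non-null rows only
n_unique = s.nunique()                         # 2: distinct non-null values
n_unique_with_nan = s.nunique(dropna=False)    # 3: NaN counted as its own value

print(n_size, n_count, n_unique, n_unique_with_nan)
```

So if your column can contain nulls, pick count or size depending on whether those rows should be included.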
You've always been able to do an agg within a groupby. I used stack at the end because I like the presentation better.
df.groupby('mID').agg(['count', 'size', 'nunique']).stack()
             dID  hID  uID
mID
A   count      5    5    5
    size       5    5    5
    nunique    3    5    5
B   count      2    2    2
    size       2    2    2
    nunique    2    2    2
C   count      1    1    1
    size       1    1    1
    nunique    1    1    1
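If you only need the distinct count of one column per group, rather than the full table, a shorter sketch is to select that column from the groupby before calling nunique (distinct_per_group is just an illustrative name):

```python
import pandas as pd

# The two columns from the question that this example needs.
df = pd.DataFrame({
    'hID': [101, 102, 103, 101, 102, 104, 105, 101],
    'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C'],
})

# Number of distinct hID values within each mID group.
distinct_per_group = df.groupby('mID')['hID'].nunique()

print(distinct_per_group)
```

This gives 5 for A, 2 for B and 1 for C, matching the hID column of the nunique rows above.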
Solution 3:[3]
You can use nunique in pandas:
df.hID.nunique()
# 5
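To cover both halves of the question, here is a small sketch pairing each Qlik expression with what I believe is the closest pandas call. The mapping of count(hID) to Series.count assumes you want non-null values only; use len(df) if you want every row.

```python
import pandas as pd

df = pd.DataFrame({'hID': [101, 102, 103, 101, 102, 104, 105, 101]})

# Roughly count(distinct hID): number of distinct non-null values.
distinct_hid = df['hID'].nunique()

# Roughly count(hID): number of non-null values, duplicates included.
total_hid = df['hID'].count()

print(distinct_hid, total_hid)   # 5 8
```

df['hID'] and df.hID select the same column; the bracket form also works when a column name is not a valid Python identifier.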
Solution 4:[4]
To count the distinct values in a column (duplicates counted once):
df['hID'].nunique()
To see how many times each distinct value occurs:
df['hID'].value_counts()
Solution 5:[5]
To count the unique values in a column, say hID of dataframe df, use:
len(df.hID.unique())
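Since the question also mentions numpy arrays, here is a sketch of the same idea with np.unique, plus one caveat: len(unique()) and nunique() can disagree when the column has missing values, because unique() keeps NaN as a value while nunique() drops it by default. The series s below is hypothetical.

```python
import numpy as np
import pandas as pd

# Plain numpy array holding the hID values from the question.
hid = np.array([101, 102, 103, 101, 102, 104, 105, 101])
n_np = len(np.unique(hid))        # 5 distinct values

# Hypothetical series with missing values to show the difference.
s = pd.Series([101, np.nan, 101, np.nan])
n_len_unique = len(s.unique())    # 2: the value 101.0 and NaN
n_nunique = s.nunique()           # 1: NaN is excluded by default

print(n_np, n_len_unique, n_nunique)
```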
Solution 6:[6]
I was looking for something similar and found another approach that may help you.
- If you want to count the number of null values, you could use this function:
def count_nulls(s):
    # size counts every row, count skips nulls, so the difference is the null count
    return s.size - s.count()
- If you want to include NaN values in your unique counts, you need to pass dropna=False to the nunique function.
def unique_nan(s):
    # dropna=False makes NaN count as one more distinct value
    return s.nunique(dropna=False)
- Here is a summary of all the values together using the titanic dataset:
# df is assumed to hold the titanic dataset with 'deck' and 'embark_town' columns
agg_func_custom_count = {
    'embark_town': ['count', 'nunique', 'size', unique_nan, count_nulls, set]
}
df.groupby(['deck']).agg(agg_func_custom_count)
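If you do not have the titanic data at hand, the same helpers can be tried on a tiny frame. This is only a sketch: toy is a made-up stand-in that borrows the column names above, and I left set out of the list to keep the output simple.

```python
import numpy as np
import pandas as pd

def count_nulls(s):
    # All rows minus non-null rows leaves the number of nulls.
    return s.size - s.count()

def unique_nan(s):
    # Distinct values with NaN counted as one of them.
    return s.nunique(dropna=False)

# Hypothetical stand-in for the titanic columns used above.
toy = pd.DataFrame({
    'deck': ['A', 'A', 'A', 'B', 'B'],
    'embark_town': ['Cherbourg', np.nan, 'Cherbourg', 'Southampton', 'Queenstown'],
})

# One row per deck, one column per aggregation.
summary = toy.groupby('deck')['embark_town'].agg(
    ['count', 'nunique', 'size', unique_nan, count_nulls]
)

print(summary)
```

For deck A this reports 2 non-null values out of 3 rows, 1 distinct town (2 if NaN is included) and 1 null; deck B has no nulls, so count and size agree.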
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | oumar |
| Solution 2 | |
| Solution 3 | Psidom |
| Solution 4 | fessyadedic |
| Solution 5 | Das_Geek |
| Solution 6 | GeoP |
