Python Pandas: How can I determine the distribution of my dataset?

This is my dataset, which has two columns: NS and count.

    NS                                                count
0   ns18.dnsdhs.com.                                  1494
1   ns0.relaix.net.                                   1835
2   ns2.techlineindia.com.                            383
3   ns2.microwebsys.com.                              1263
4   ns2.holy-grail-body-transformation-program.com.   1
5   ns2.chavano.com.                                  1
6   ns1.x10host.ml.                                   17
7   ns1.amwebaz.info.                                 48
8   ns2.guacirachocolates.com.br.                     1
9   ns1.clicktodollars.com.                           2

Now I would like to see how many NSs have the same count by plotting it. My guess is that I can use a histogram for that, but I am not sure how. Can anyone help?



Solution 1:[1]

From your comment, I'm guessing your data table is actually much longer, and you want to see the distribution of name server counts (whatever count is here).

I think you should just be able to do this:

df.hist(column="count")

That gives you a histogram of the count column, if that is what you want.
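As a minimal sketch of that call, the snippet below rebuilds only the ten sample rows shown in the question (the real table is presumably much longer) and draws the histogram. The `bins=10` value and the axis labels are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question; the full dataset is assumed to be longer.
df = pd.DataFrame({
    "NS": [
        "ns18.dnsdhs.com.", "ns0.relaix.net.", "ns2.techlineindia.com.",
        "ns2.microwebsys.com.", "ns2.holy-grail-body-transformation-program.com.",
        "ns2.chavano.com.", "ns1.x10host.ml.", "ns1.amwebaz.info.",
        "ns2.guacirachocolates.com.br.", "ns1.clicktodollars.com.",
    ],
    "count": [1494, 1835, 383, 1263, 1, 1, 17, 48, 1, 2],
})

# DataFrame.hist returns a 2-D array of matplotlib Axes, one per plotted column.
axes = df.hist(column="count", bins=10)  # bins=10 is an arbitrary illustrative choice
ax = axes[0, 0]
ax.set_xlabel("count")
ax.set_ylabel("number of name servers")
```

Outside a notebook, calling `matplotlib.pyplot.show()` afterwards displays the figure.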

pandas has decent documentation for all of its functions though, and histograms are described in the DataFrame.hist documentation.

If you actually want to see "how many have the same count", rather than a representation of the distribution, then you'll either need to set the bins kwarg to df["count"].max() - df["count"].min() + 1 (so that each integer value gets its own bin), or do as you said: count the number of times you get each count and then create a bar chart.

Maybe something like:

from collections import Counter

# Tally how many times each count value appears
counts = Counter()
for count in df["count"]:
    counts[count] += 1

print(counts)

An alternative and cleaner approach, which I completely missed and wwii pointed out below, is just to use the standard constructor of Counter:

count_counter = Counter(df['count'])
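The "count, then bar chart" route can also be sketched with pandas alone. This is an assumption-laden illustration rather than part of the original answer: it uses only the ten count values from the question's sample, and swaps `Counter` for `Series.value_counts()`, which produces the same tally as a Series that can be plotted directly.

```python
import pandas as pd

# Count values taken from the sample rows in the question.
df = pd.DataFrame({"count": [1494, 1835, 383, 1263, 1, 1, 17, 48, 1, 2]})

# Number of name servers sharing each count value, ordered by the count value.
same_count = df["count"].value_counts().sort_index()
print(same_count)

# One bar per distinct count value; bar height is how many NSs have that count.
ax = same_count.plot(kind="bar")
ax.set_xlabel("count")
ax.set_ylabel("number of name servers")
```

On a long table with many distinct counts, a bar per value may get crowded, in which case the histogram above is probably the more readable option.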

Solution 2:[2]

To get a description of your distribution you can use:

df['NS'].value_counts().describe()

To plot the distribution:

import matplotlib.pyplot as plt
df['NS'].value_counts().hist(bins = 20)
plt.title('Distribution')
plt.xlabel('x label')
plt.ylabel('y label')
plt.show()
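Note that `value_counts()` on the NS column assumes one row per observation. If the table is already aggregated the way the sample in the question appears to be (one row per NS, with the total in count), a sketch like the following describes the count column directly instead. The data here is only the ten sample rows, and treating the table as pre-aggregated is an assumption.

```python
import pandas as pd

# Sample rows copied from the question; assumed to be one row per name server.
df = pd.DataFrame({
    "NS": [
        "ns18.dnsdhs.com.", "ns0.relaix.net.", "ns2.techlineindia.com.",
        "ns2.microwebsys.com.", "ns2.holy-grail-body-transformation-program.com.",
        "ns2.chavano.com.", "ns1.x10host.ml.", "ns1.amwebaz.info.",
        "ns2.guacirachocolates.com.br.", "ns1.clicktodollars.com.",
    ],
    "count": [1494, 1835, 383, 1263, 1, 1, 17, 48, 1, 2],
})

# Each NS appears exactly once, so counting NS occurrences would only yield ones.
ns_occurrences = df["NS"].value_counts()

# Summary statistics (count, mean, std, min, quartiles, max) of the count column.
summary = df["count"].describe()
print(summary)
```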

Solution 3:[3]

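import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of count
plt.figure(figsize=(6, 6))
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(df['count'], kde=True)
plt.title('Count Distribution')
plt.show()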

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1
Solution 2    Emad Alomari
Solution 3    Usha