How to create a histogram of extremely right-skewed data with different bin sizes

I have a variable that is very right skewed. Most of the observations are "0" while the max value is close to 4000. A histogram plot yields something ridiculous so I want to create bins of different sizes.

I used the following code:


bins = [0, 1, 2, 5.0, 10.0, 20.0, 30.0]
fig = plt.hist(df[df.year==2021].diversification, bins=bins)
plt.xticks([0, 1, 2, 5.0, 10.0, 20.0, 30.0])
plt.show()

But I get the following plot, in which the bars have different widths. Ideally, every bar would have the same width regardless of the interval it covers. Any idea how to implement this?


Solution 1:[1]

You could add a new column indicating the range each value falls into. The pandas function pd.cut() assigns each element to the interval (bin) that contains it.
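As a small illustration of what pd.cut() returns, here is a minimal sketch on made-up values (the series below is hypothetical, not the asker's data). It also shows that values at or beyond the last edge become NaN, and that appending np.inf as a final edge is one way to keep them, which may matter when the maximum is close to 4000.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the binning
values = pd.Series([0, 0, 1, 4, 29, 30, 4000])
bins = [0, 1, 2, 5, 10, 20, 30]

# right=False makes each interval closed on the left: [0, 1), [1, 2), ...
closed = pd.cut(values, bins, right=False)

# 30 and 4000 lie outside every interval above and become NaN.
# Appending np.inf adds an open-ended [30, inf) bin that catches them.
open_ended = pd.cut(values, bins + [np.inf], right=False)

print(closed.value_counts().sort_index())
print(open_ended.value_counts().sort_index())
```

Whether the tail should be dropped or collected in a final open-ended bin depends on what the plot is meant to show.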

Then you can create a bar plot with that range on the x-axis and the counts on the y-axis:

from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

bins = [0, 1, 2, 5, 10, 20, 30]
xs = np.random.geometric(0.2, size=1000) - 1

df = pd.DataFrame({'x': xs, 'year': 2021})
df['range'] = pd.cut(df['x'], bins, right=False)

# df['range'].value_counts().sort_index().plot.bar()  # to plot via pandas
# rename_axis names the index explicitly, so the resulting column name does not depend on the pandas version
df_counts = df['range'].value_counts().sort_index().rename_axis('Range').reset_index(name='Count')
ax = sns.barplot(data=df_counts, x='Range', y='Count', palette='flare')

plt.show()

Figure: sns.barplot using ranges
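If you would rather stay with matplotlib and numpy only, the same idea can be sketched as below: count with np.histogram, then draw the counts as bars at evenly spaced positions. The helper name equal_width_hist and the random data are assumptions made for illustration. Note that np.histogram closes the last bin on the right ([20, 30]), unlike pd.cut with right=False.

```python
import matplotlib.pyplot as plt
import numpy as np


def equal_width_hist(values, bins, ax=None):
    """Draw counts for unequal bins as bars of identical width (hypothetical helper)."""
    # Bins are half-open [a, b), except the last one, which is closed: [a, b]
    counts, edges = np.histogram(values, bins=bins)
    labels = [f"{float(lo):g}-{float(hi):g}" for lo, hi in zip(edges[:-1], edges[1:])]
    if ax is None:
        _, ax = plt.subplots()
    # One bar per bin at positions 0, 1, 2, ... so every bar has the same width
    positions = np.arange(len(counts))
    ax.bar(positions, counts, width=0.9)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)
    ax.set_xlabel("Range")
    ax.set_ylabel("Count")
    return ax, counts


rng = np.random.default_rng(0)
xs = rng.geometric(0.2, size=1000) - 1  # synthetic stand-in for the real column
ax, counts = equal_width_hist(xs, [0, 1, 2, 5, 10, 20, 30])
```

Call plt.show() afterwards to display the figure. With the asker's data, the first argument would presumably be df[df.year==2021].diversification.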

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1