'matplotlib: disregard outliers when plotting

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.

With matplotlib, I plot against the range [0, max_data_value]

How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?

Should I simply take the 95 percentile and have the range [0, 95_percentile] on the x-axis?

Solution 1:^[1]

If you aren't fussed about rejecting outliers as mentioned by Joe and it is purely aesthetic reasons for doing this, you could just set your plot's x axis limits:

plt.xlim(min_x_data_value,max_x_data_value)

Where the values are your desired limits to display.

plt.ylim(min,max) works to set limits on the y axis also.

Solution 2:^[2]

I think using pandas quantile is useful and much more flexible.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

pd_series = pd.Series(np.random.normal(size=300)) 
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))] 

ax1.boxplot(pd_series)
ax1.set_title('Original')

ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')

plt.show()

Solution 3:^[3]

I usually pass the data through the function np.clip, If you have some reasonable estimate of the maximum and minimum value of your data, just use that. If you don't have a reasonable estimate, the histogram of clipped data will show you the size of the tails, and if the outliers are really just outliers the tail should be small.

What I run is something like this:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)

You can compare the results if you change the min and max in the clipping function until you find the right values for your data.

In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.

You could probably write some code that based on this find some good bounds that minimize the sizes of the tails according to some tolerance.

Solution 4:^[4]

In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.

Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.

# xdata = some x data points ...
# ydata = some y data points ...

# Finding limits for y-axis     
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad

Example usage:

fig = plt.figure(figsize=(6, 8))

ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')

ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autscaled')
ax2.set_ylim([ymin, ymax])

plt.show()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Jdog
Solution 2	jaga
Solution 3
Solution 4

'matplotlib: disregard outliers when plotting

Solution 1:[1]

Solution 2:[2]

Solution 3:[3]

Solution 4:[4]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]

Solution 3:^[3]

Solution 4:^[4]