Why is my KDE plot showing up as vertical lines instead of a curve?

I have been trying to make a KDE plot of my data (frequency of chromosome start sites). Although I follow the examples exactly, when I use my own data, or generated data that resembles it, the plot produces only vertical lines instead of the usual smooth curve. I was hoping someone more familiar with scikit-learn's KernelDensity could help me figure out what I am doing wrong.

Here is the code with generated data from the example, where everything runs fine:

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

X = np.concatenate((np.random.normal(0, 1, 14), np.random.normal(5, 1, 6)))[:, np.newaxis]
X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(X) 
log_density = kde.score_samples(X_plot)

fig, ax = plt.subplots()
plt.fill_between(X_plot[:, 0], np.exp(log_density), color="b")
plt.plot(X, np.full_like(X, -0.01), '|k', markeredgewidth=.01)
ax.set_xlim(-5, 10)

Here is the code with data I generated to look like my own. I have 1,000 start sites ranging in value from 10,000 to 824,989. I changed the data, the linspace range and number of points, and the x-axis limits, and now I get vertical lines instead of a curve. I also changed the y-limits because they came out looking strange.

X = np.random.normal(10000, 824989, 1000)[:, np.newaxis]
X_plot = np.linspace(10000, 824989, 100000)[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(X) 
log_density = kde.score_samples(X_plot)

fig, ax = plt.subplots()
plt.fill_between(X_plot[:, 0], np.exp(log_density), color="b")
plt.plot(X, np.full_like(X, -0.01), '|k', markeredgewidth=.01)
ax.set_xlim(10000, 824989)
ax.set_ylim(-0.0001, 0.00061) 

I think it must have something to do with the linspace. I also don't really understand why score_samples() takes the linspace as an argument.



Solution 1:

There are two issues with your code:

  1. The bandwidth used in the kernel density estimation needs to be much larger, because your data has a far greater standard deviation than the example (your data has a standard deviation of 824,989, while the example data has a standard deviation of about 2.5). You would need a bandwidth of roughly 200,000 instead of 1. See, for instance, the section "A rule-of-thumb bandwidth estimator" in the Wikipedia article on kernel density estimation.
  2. np.linspace() generates the grid of points at which the fitted kernel density kde is evaluated; that is why score_samples() takes it as an argument. To visualize the full distribution, the first argument of np.linspace() should be the minimum of the data (not its mean) and the second argument should be the maximum of the data (not its standard deviation).
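To see the first issue concretely, here is a short sketch (my own addition, not part of the original answer; the seeded generator and variable names are my choice). With bandwidth=1.0 each Gaussian kernel is only about one unit wide, while the data span millions of units, so the density underflows to zero almost everywhere and only isolated spikes survive:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(10000, 824989, 1000)[:, np.newaxis]

# bandwidth=1.0 means each kernel covers ~1 unit of a ~5,000,000-unit range
kde = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(X)

x = np.linspace(X.min(), X.max(), 100000)[:, np.newaxis]
density = np.exp(kde.score_samples(x))

# almost every grid point is far from all kernels, so exp() underflows to 0
print((density == 0).mean())
```

Nearly all grid points evaluate to exactly zero, and the few surviving spikes near individual data points are what render as vertical lines.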

I included an example below.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

mu = 10000 # mean
sigma = 824989 # standard deviation

# generate the data
X = np.random.normal(mu, sigma, 1000)[:, np.newaxis]

# estimate the bandwidth with Silverman's rule of thumb
h = 1.06 * np.std(X) * (len(X) ** (-1 / 5))

# estimate the density function
kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(X)

# evaluate the density function
x = np.linspace(np.min(X), np.max(X), 100000)[:, np.newaxis]
log_density = kde.score_samples(x)
density = np.exp(log_density)

# plot the density function
plt.plot(x, density)
plt.show()
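As a sanity check on the fit (my own addition; the seeded generator is an assumption, the original answer does not set one), you can verify numerically that the estimated density integrates to roughly 1 over the data range:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(10000, 824989, 1000)[:, np.newaxis]

h = 1.06 * np.std(X) * (len(X) ** (-1 / 5))  # Silverman's rule of thumb
kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(X)

x = np.linspace(X.min(), X.max(), 100000)[:, np.newaxis]
density = np.exp(kde.score_samples(x))

# trapezoidal-rule integral of the density over [min(X), max(X)];
# a proper density integrates to 1 over the real line, so this should be
# close to (slightly below) 1, since the kernel tails extend past the range
widths = np.diff(x[:, 0])
area = np.sum((density[:-1] + density[1:]) / 2 * widths)
print(round(area, 3))
```

If this comes out near zero instead, the bandwidth is still too small for the scale of the data.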

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
