'Pandas: Sampling from a DataFrame according to a target distribution

I have a Pandas DataFrame containing a dataset D of instances which all have some continuous value x. x is distributed in a certain way, say uniform, could be anything.

I want to draw n samples from D for which x has a target distribution that I can sample or approximate. This comes from a dataset, here I just take normal distribution.

How can I sample instances from D such that the distribution of x in the sample is equal/similar to an arbitrary distribution which I specify?

Right now, I sample a value x, subset D such that it contains all x +- eps and sample from that. But this is quite slow when the datasets get bigger. People must have come up with a better solution. Maybe the solution is already good but could be implemented more efficiently?

I could split x into strata, which would be faster, but is there a solution without this?

My current code, which works fine but is slow (1 min for 30k/100k, but I have 200k/700k or so.)

import numpy as np
import pandas as pd
import numpy.random as rnd
from matplotlib import pyplot as plt
from tqdm import tqdm

n_target = 30000
n_dataset = 100000

x_target_distribution = rnd.normal(size=n_target)
# In reality this would be x_target_distribution = my_dataset["x"].sample(n_target, replace=True)

df = pd.DataFrame({
    'instances': np.arange(n_dataset),
    'x': rnd.uniform(-5, 5, size=n_dataset)
    })

plt.hist(df["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)

def sample_instance_with_x(x, eps=0.2):
    try:
        return df.loc[abs(df["x"] - x) < eps].sample(1)
    except ValueError: # fallback if no instance possible
        return df.sample(1)

df_sampled_ = [sample_instance_with_x(x) for x in tqdm(x_target_distribution)]
df_sampled = pd.concat(df_sampled_)

plt.hist(df_sampled["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)


Solution 1:[1]

Rather than generating new points and finding a closest neighbor in df.x, define the probability that each point should be sampled according to your target distribution. You can use np.random.choice. A million points are sampled from df.x in a second or so for a gaussian target distribution like this:

x = np.sort(df.x)
f_x = np.gradient(x)*np.exp(-x**2/2)
sample_probs = f_x/np.sum(f_x)
samples = np.random.choice(x, p=sample_probs, size=1000000)

sample_probs is the key quantity, as it can be joined back to the dataframe or used as an argument to df.sample, e.g.:

# sample df rows without replacement
df_samples = df["x"].sort_values().sample(
    n=1000, 
    weights=sample_probs, 
    replace=False,
)

The result of plt.hist(samples, bins=100, density=True):

corrected image

Gaussian distributed x, uniform target distribution

Lets see how this method performs when the original samples are drawn from a gaussian distribution and we wish to sample them from a uniform target distribution:

x = np.sort(np.random.normal(size=100000))
f_x = np.gradient(x)*np.ones(len(x))
sample_probs = f_x/np.sum(f_x)
samples = np.random.choice(x, p=sample_probs, size=1000000)

sample to uniform distribution from gaussian distributed points

The tails look jittery at this resolution, but if we increased the bin size they would smooth out.

Method

Approximate probabilities are calculated for samples in x in the form:

prob(x_i) ~ delta_x*rho(x_i)

where rho(x_i) is the density function and np.gradient(x) is used as a differential value. If the differential weight is ignored, f_x will over-represent close points and under-represent sparse points in the resampling. I made this mistake initially, the effect is small is x is uniformly distributed (but generally can be significant):

un-corrected version

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1