Python function that mimics the distribution of my dataset

I have a food order dataset that looks like this, with a few thousand orders over the span of a few months:

Date                 Item Name      Price
2021-10-09 07:10:00  Water Bottle   1.5
2021-10-09 12:30:00  Pizza          12
2021-10-09 17:07:56  Chocolate bar  3

Those orders are time-dependent: usually, nobody eats pizza at midnight, and there will be more 3 PM orders on Sunday than on Monday (because people are at work). I want to extract the daily order distribution for each weekday (Monday through Sunday) from those few thousand orders so I can later generate new orders that fit this distribution. I do not want to fill in the gaps in my dataset.

How can I do so?

I want to create a generate_order_date() function that would generate a random hours:minutes:seconds depending on the day. I can already identify which weekday a date corresponds to. I just need to extract the 7 daily distributions so I can call my function like this:

generate_order_date(day=Monday, nb_orders=1)

[12:30:00]

generate_order_date(day=Friday, nb_orders=5)

[12:30:00, 07:23:32, 13:12:09, 19:15:23, 11:44:59]

Generated timestamps do not have to be in chronological order, just as if I were calling np.random.normal(mu, sigma, 1000).



Solution 1:[1]

Try np.histogram(data) https://numpy.org/doc/stable/reference/generated/numpy.histogram.html

The first element of the returned tuple gives you the counts per bin, which, once normalised, form your distribution. You can visualise it:

plt.plot(np.histogram(data)[0])

data here would be the time of day a particular item was ordered. For this approach, I would suggest rounding your times to 5-minute intervals or more, depending on the frequency. For example, round 12:34pm to 12:30pm and 12:36pm to 12:35pm. Choose a suitable frequency.
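To actually draw new order times from such a histogram, one option is to normalise the counts into probabilities, pick bins with those probabilities, and then add a uniform offset inside each chosen bin. A minimal sketch of that idea, using synthetic data in place of your orders:

```python
import numpy as np
from datetime import datetime, timedelta

# Hypothetical stand-in for one weekday's orders: seconds since midnight.
rng = np.random.default_rng(0)
seconds = rng.normal(12 * 3600, 2 * 3600, size=500).clip(0, 86399)

# Bin into 5-minute intervals (288 bins per day).
counts, edges = np.histogram(seconds, bins=288, range=(0, 86400))

# Normalise counts into sampling probabilities.
probs = counts / counts.sum()

# Draw new order times: pick a bin, then a uniform offset inside it.
bins = rng.choice(len(counts), size=5, p=probs)
new_seconds = edges[bins] + rng.uniform(0, 300, size=5)

base = datetime(2022, 1, 1)
times = [(base + timedelta(seconds=float(s))).strftime("%H:%M:%S")
         for s in new_seconds]
print(times)
```

This is the empirical-distribution route: no smoothing, so it only ever produces times in bins that actually contain orders.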

Another method would be scipy.stats.gaussian_kde. This uses a Gaussian kernel. Below is an implementation I have used previously:

import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def get_kde(df: pd.DataFrame) -> list:
    # Evaluation grid; adjust the range to match your (scaled) data.
    xs = np.round(np.linspace(-1, 1, 3000), 3)
    kde = gaussian_kde(df.values)
    kde_vals = np.round(kde(xs), 3)
    # Keep only grid points with non-negligible density.
    data = [[xs[i], kde_vals[i]] for i in range(len(xs)) if kde_vals[i] > 0.1]
    return data

where df.values is your data. There are plenty more kernels which you can use to get a density estimate. The most suitable one depends on the nature of your data.
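Note that gaussian_kde can also generate new samples directly via its resample() method, which matches the goal of producing new order times. A sketch, again on synthetic data (the KDE is unbounded, so samples are clipped back into a valid day):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical order times for one weekday, in seconds since midnight.
rng = np.random.default_rng(1)
seconds = rng.normal(13 * 3600, 3 * 3600, size=400).clip(0, 86399)

# Fit a Gaussian KDE on the 1-D data and draw 5 new samples.
kde = gaussian_kde(seconds)
samples = kde.resample(5)[0]  # resample returns shape (d, size)

# The Gaussian kernel has infinite support, so clip to [0, 86399].
samples = np.clip(samples, 0, 86399)
print(samples)
```

Converting the sampled seconds to HH:MM:SS strings then works the same way as in the histogram approach.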

Also see https://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation

Solution 2:[2]

Here's a rough sketch of what you could do:

Assumption: the column Date of the DataFrame df contains datetimes. If not, do df.Date = pd.to_datetime(df.Date) first.

First 2 steps:

import pickle
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

# STEP 1: Data preparation

df["Seconds"] = (
    df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
) / 86400
data = {
    day: sdf.Seconds.to_numpy()[:, np.newaxis]
    for day, sdf in df.groupby(df.Date.dt.weekday)
}

# STEP 2: Kernel density estimation

kdes = {
    day: KernelDensity(bandwidth=0.1).fit(X) for day, X in data.items()
}
with open("kdes.pkl", "wb") as file:
    pickle.dump(kdes, file)

  • STEP 1: Build a normalised column Seconds (values between 0 and 1). Then group by weekday (numbered 0, ..., 6) and prepare, for every day of the week, the data for kernel density estimation.

  • STEP 2: Estimate the kernel densities for every day of the week with KernelDensity from Scikit-learn and pickle the results.

Based on these estimates build the desired sample function:

# STEP 3: Sampling

with open("kdes.pkl", "rb") as file:
    kdes = pickle.load(file)

def generate_order_date(day, orders):
    fmt = "%H:%M:%S"
    base = datetime(year=2022, month=1, day=1)
    kde = kdes[day]
    # sample() returns an (orders, 1) array; take the first column, and
    # clip because the KDE can produce values slightly outside [0, 1].
    samples = np.clip(kde.sample(orders)[:, 0], 0, 1)
    return [
        (base + timedelta(seconds=int(s * 86399))).time().strftime(fmt)
        for s in samples
    ]
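Since day here is the weekday number (0 = Monday, ..., 6 = Sunday), an end-to-end check might look like this; the random October timestamps are purely illustrative stand-ins for your dataset, and the pickle round trip is skipped:

```python
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from sklearn.neighbors import KernelDensity

# Synthetic orders: 300 random timestamps in October 2021 (hypothetical data).
rng = np.random.default_rng(0)
dates = pd.to_datetime("2021-10-01") + pd.to_timedelta(
    rng.uniform(0, 30 * 86400, size=300), unit="s"
)
df = pd.DataFrame({"Date": dates})

# Steps 1-2 from the answer: normalise to [0, 1], fit one KDE per weekday.
df["Seconds"] = (
    df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
) / 86400
kdes = {
    day: KernelDensity(bandwidth=0.1).fit(sdf.Seconds.to_numpy()[:, np.newaxis])
    for day, sdf in df.groupby(df.Date.dt.weekday)
}

def generate_order_date(day, orders):
    base = datetime(2022, 1, 1)
    # Clip because the KDE can sample slightly outside [0, 1].
    samples = np.clip(kdes[day].sample(orders)[:, 0], 0, 1)
    return [
        (base + timedelta(seconds=int(s * 86399))).time().strftime("%H:%M:%S")
        for s in samples
    ]

print(generate_order_date(day=0, orders=3))  # 0 = Monday
```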

I won't pretend that this is anywhere near perfect. But maybe you could use it as a starting point.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 snag9677
Solution 2