How to take a sample from a dask dataframe that includes all the products ordered by a certain number of customers?

I tried loading my CSV file using pd.read_csv. It has 33 million records and takes too long to load and to query.

Data loads quickly when using a dask dataframe, but queries still take a long time.

I have data for 200k customers. This is the code I have written for sampling:

df_s = df.sample(frac=300000/33819106, replace=False, random_state=10)

This works fine, but each customer has ordered many products. How can I include all the products of each sampled customer in the sample? How do I sample based on customer ID?



Solution 1:[1]

Load your data into a dataframe and then sample from it. Output to a new .csv that is easier to read from.

import pandas as pd

df = pd.read_csv('customers.csv')  # Load the full file once

df = df.sample(frac=.2)  # 20% of the rows will be sampled

df.to_csv('sample_customers.csv', index=False)  # Smaller .csv; index=False avoids writing the row index as an extra column
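
Note that row-level sampling like the above can split a customer's orders between the sample and the remainder. If the goal is to keep every product for each sampled customer, one possible approach is to sample the distinct customer IDs first and then keep all rows for those IDs. The following is only a minimal pandas sketch under assumptions: the column names `customer_id` and `product`, the helper `sample_by_customer`, and the toy data are all hypothetical and not taken from the question.

```python
import pandas as pd


def sample_by_customer(df, id_col, n_customers, seed=10):
    """Return every row belonging to a random subset of customers.

    Hypothetical helper: id_col is assumed to hold the customer ID.
    """
    # One entry per distinct customer
    unique_ids = pd.Series(df[id_col].unique())
    # Sample customers rather than rows
    chosen = unique_ids.sample(n=n_customers, random_state=seed)
    # Keep all rows whose customer is in the chosen set
    return df[df[id_col].isin(chosen)]


# Tiny made-up dataset standing in for the real orders file
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "product": ["a", "b", "a", "c", "d", "b", "a", "e"],
})

sample = sample_by_customer(orders, "customer_id", n_customers=2)
```

The same idea should carry over to a dask dataframe by computing the unique IDs first (for example with `.unique().compute()`), sampling them in pandas, and then filtering with `isin`, though that variant is not shown or tested here.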

Generally, the format of a question on here is:

  1. Description of problem
  2. Desired outcome
  3. What you've tried
  4. Minimum reproducible example

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 rangeseeker