How to take a sample from a dask dataframe that includes all the products ordered by a certain number of customers?
I tried loading my CSV file using pd.read_csv. It has 33 million records, and both loading and querying take too long.
The data loads quickly when I use a dask dataframe, but queries still take a long time.
I have data for 200k customers. This is the code I have written for sampling:
df_s = df.sample(frac=300000/33819106, replace=False, random_state=10)
This works fine, but each customer has ordered many products. How can I make the sample include all the products of the sampled customers? In other words, how do I sample based on customer ID?
Solution 1:[1]
Load your data into a dataframe and then sample from it. Write the sample out to a new, smaller .csv that is quicker to read.
import pandas as pd
df = pd.read_csv('customers.csv')
df = df.sample(frac=0.2)  # Sample 20% of the rows.
df.to_csv('sample_customers.csv', index=False)  # Write a smaller .csv that is easier to work with.
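Note that sampling rows this way can still split a customer's orders across the sample boundary. One possible way to keep every product for each sampled customer is to sample the distinct customer IDs first and then keep all rows with those IDs. The following is only a minimal sketch in pandas: the column names `customer_id` and `product`, the helper name `sample_by_customer`, and the toy data are assumptions for illustration, not taken from the question.

```python
import pandas as pd


def sample_by_customer(df, n_customers, id_col="customer_id", seed=10):
    """Return every row belonging to a random subset of customers.

    `id_col` is a hypothetical column name; adjust it to the real data.
    """
    # Draw the sample from the distinct customer ids, not from the rows.
    ids = pd.Series(df[id_col].unique())
    chosen = ids.sample(n=min(n_customers, len(ids)), random_state=seed)
    # Keep all rows (that is, all products) of the chosen customers.
    return df[df[id_col].isin(chosen)]


# Tiny made-up dataset standing in for the real order data.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4],
    "product": ["a", "b", "a", "c", "d", "e", "b"],
})

sample = sample_by_customer(orders, n_customers=2)
print(sample)
```

The same idea should carry over to a dask dataframe, since dask also supports `isin` and boolean filtering, but the distinct IDs would probably need to be materialized with `.compute()` before sampling. Treat that part as an untested assumption.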
Generally, the format of a question on here is:
- Description of problem
- Desired outcome
- What you've tried
- Minimal reproducible example
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | rangeseeker |
