Concat sorted Dask DataFrames
I have N Dask DataFrames, each sorted by the ts column (no index). I would like to concatenate all of them into one DataFrame that is still sorted by the ts column.
Note: ts ranges can overlap between DataFrames.
Can someone recommend an efficient way to implement this?
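To make the desired result concrete, here is a minimal in-memory sketch of a k-way merge of already-sorted inputs whose ts ranges overlap. It uses the standard library's heapq.merge on plain lists rather than Dask, so it only illustrates the target ordering, not a distributed solution. The names frame_a, frame_b, and frame_c and their contents are made up for illustration.

```python
import heapq
from operator import itemgetter

# Hypothetical toy inputs: each list is already sorted by ts (first field),
# and the ts ranges overlap between the lists.
frame_a = [(1, "a0"), (4, "a1"), (9, "a2")]
frame_b = [(2, "b0"), (4, "b1"), (7, "b2")]
frame_c = [(3, "c0"), (8, "c1")]

# heapq.merge lazily merges already-sorted iterables into one sorted stream.
# It never re-sorts an input; it only picks the smallest head each step.
merged = list(heapq.merge(frame_a, frame_b, frame_c, key=itemgetter(0)))

ts_values = [ts for ts, _ in merged]
print(ts_values)  # [1, 2, 3, 4, 4, 7, 8, 9]
```

The merge is stable, so rows with equal ts keep the order of the inputs they came from.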
UPDATE:
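import pandas as pd
import dask.dataframe as dd

# storage, P, PRODUCTS, PRODUCT_NAMESPACE, PRODUCT_MESSAGE_TYPE and
# PRODUCT_EXPECTED_CHANNELS are defined elsewhere in my code.
dfs = []
for product in PRODUCTS:
    namespace = PRODUCT_NAMESPACE[product]
    message_type = PRODUCT_MESSAGE_TYPE[product]
    num_expected_channels = PRODUCT_EXPECTED_CHANNELS[product]
    for channel in range(num_expected_channels):
        # Load one (date, channel) partition; each result is sorted by ts.
        df = storage.load(
            namespace,
            partition_filter=(P.date == '2022-02-01') & (P.channel == str(channel)),
        )
        df = df.assign(
            product=product,
            message_type=message_type
        ).astype(
            dict(
                dtype='category',
                product=pd.api.types.CategoricalDtype(PRODUCTS),
                message_type=pd.api.types.CategoricalDtype(['trade', 'quote']),
            )
        ).drop(columns=['channel', 'feed'])
        # ts is already sorted, so set it as the index without a shuffle.
        df = df.set_index('ts', sorted=True, drop=False).persist()
        dfs.append(df)
# Interleave partitions by ts range, then sort within each partition.
df = dd.concat(dfs, interleave_partitions=True)
df = df.map_partitions(lambda pdf: pdf.sort_index())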
Solution 1:[1]
If appending the DataFrames one after another leaves the result sorted, you should look at dask.dataframe.multi.concat (available as dd.concat).
If simple concatenation would result in only a partially sorted DataFrame, you should look into dask.dataframe.DataFrame.merge.
EDIT: Credit to @Michel Delgado, who pointed out that sorting data across partitions without an index would be very memory-consuming. You might want to go through the comments below for more details.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
