Concat sorted Dask DataFrames

I have N Dask DataFrames, each sorted by the ts column (no index). I would like to concatenate them into a single DataFrame that is still sorted by ts.

Note: ts can overlap between DataFrames.

Can someone recommend an efficient way to implement this?

UPDATE:

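import pandas as pd
import dask.dataframe as dd

# storage, P, PRODUCTS and the PRODUCT_* lookups are defined elsewhere in my project.
dfs = []
for product in PRODUCTS:
    namespace = PRODUCT_NAMESPACE[product]
    message_type = PRODUCT_MESSAGE_TYPE[product]

    num_expected_channels = PRODUCT_EXPECTED_CHANNELS[product]
    for channel in range(num_expected_channels):
        # Load one channel of one product for a single date.
        df = storage.load(
            namespace,
            partition_filter=(P.date == '2022-02-01') & (P.channel == str(channel)),
        )

        # Tag each row with its product and message type, then drop unused columns.
        df = df.assign(
            product=product,
            message_type=message_type,
        ).astype(
            dict(
                dtype='category',
                product=pd.api.types.CategoricalDtype(PRODUCTS),
                message_type=pd.api.types.CategoricalDtype(['trade', 'quote']),
            )
        ).drop(columns=['channel', 'feed'])

        # sorted=True tells Dask the data is already ordered by ts, so no shuffle is needed.
        df = df.set_index('ts', sorted=True, drop=False).persist()

        dfs.append(df)

# interleave_partitions=True lets concat combine inputs whose index ranges overlap.
df = dd.concat(dfs, interleave_partitions=True)
# Rows inside each combined partition may be out of order, so sort each partition locally.
df = df.map_partitions(lambda pdf: pdf.sort_index())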


Solution 1:[1]

If you can append the dataframes one after another and the result is still sorted, you should look at dask.dataframe.multi.concat (available as dask.dataframe.concat).

If simple concatenation would leave you with only a partially sorted dataframe, you should look into dask.dataframe.DataFrame.merge (note that merge performs a database-style join on key columns).
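For intuition about what an order-preserving merge of overlapping sorted inputs looks like, here is a plain-Python sketch using heapq.merge from the standard library. It is not a Dask API and the two streams are made-up toy data; it only illustrates the interleaving that the combined, sorted result needs to reproduce.

```python
import heapq

# Toy (ts, source) rows, each stream already sorted by ts; the ranges overlap.
stream_a = [(1, "a"), (3, "a"), (7, "a")]
stream_b = [(2, "b"), (3, "b"), (9, "b")]

# heapq.merge lazily interleaves already-sorted inputs, keeping only one
# pending item per input in memory. Ties keep the order of the inputs.
merged = list(heapq.merge(stream_a, stream_b, key=lambda row: row[0]))

print(merged)
# [(1, 'a'), (2, 'b'), (3, 'a'), (3, 'b'), (7, 'a'), (9, 'b')]
```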

EDIT: Credit to @Michel Delgado who pointed out that sorting data across the partitions without an index would be very memory-consuming. You might want to go through the comments below to see more details.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1