'Fastest way to split pandas dataframe rows into multiple rows

This is a follow up to a previous question: Split pandas dataframe rows into multiple rows

Assuming the following dataframe: from itertools import combinations

df = pd.DataFrame(
    {
        "IDs": [
            ["A", "B"],
            ["A", "B", "C"],
            ["A", "B", "C", "D"],
        ],
        "pos_x": [[1, 2], [1.3, 2.8, 3], [10, 20, 100, 1000]],
        "pos_y": [[3, 4], [1, 5, 3], [2, 0, 0, 4]],
    },
    index=[
        pd.to_datetime("2022-01-01 12:00:00"),
        pd.to_datetime("2022-01-01 12:00:01"),
        pd.to_datetime("2022-01-01 12:00:02"),
    ],
)

                    IDs             pos_x               pos_y
2022-01-01 12:00:00 [A, B]          [1, 2]              [3, 4]
2022-01-01 12:00:01 [A, B, C]       [1.3, 2.8, 3]       [1, 5, 3]
2022-01-01 12:00:02 [A, B, C, D]    [10, 20, 100, 1000] [2, 0, 0, 4]

So now I want to obtain the following DataFrame:

from itertools import combinations
desired_df = pd.DataFrame()
for col in df.columns:
    df[col] = [[pair for pair in combinations(l, 2)] for l in df[col]]
df = df.explode(list(df.columns))

for col in df.columns:
    desired_df[[col+'_1',col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)

                   IDs_1 IDs_2 pos_x_1 pos_x_2  pos_y_1 pos_y_2
2022-01-01 12:00:00 A    B     1.0     2.0      3       4
2022-01-01 12:00:01 A    B     1.3     2.8      1       5
2022-01-01 12:00:01 A    C     1.3     3.0      1       3
2022-01-01 12:00:01 B    C     2.8     3.0      5       3
2022-01-01 12:00:02 A    B     10.0    20.0     2       0
2022-01-01 12:00:02 A    C     10.0    100.0    2       0
2022-01-01 12:00:02 A    D     10.0    1000.0   2       4
2022-01-01 12:00:02 B    C     20.0    100.0    0       0
2022-01-01 12:00:02 B    D     20.0    1000.0   0       4
2022-01-01 12:00:02 C    D     100.0   1000.0   0       4

Since I would like to use it on fairly big amount of data, I'd like to know if there is a faster way to obtain the same results. Note: The 'IDs' column contains at least 2 elements and in practice I have more than just 3 columns.

Here is a benchmark I have used on a google colab:

from itertools import combinations
import pandas as pd
import random
import time

random.seed(42)
n_rows = 2000000
list_ids = ["A", "B", "C", "D", "E",  "F"]
n_ids = [random.randint(2, len(list_ids)) for k in range(n_rows)]
ids = [random.sample(list_ids, n) for n in n_ids]
pos_x =[[random.randint(0, 10)]*n for n in n_ids]
pos_y =[[random.randint(0, 10)]*n for n in n_ids]
df = pd.DataFrame(
    {
        "IDs": ids,
        "pos_x": pos_x,
        "pos_y": pos_y,
    },
)

start = time.time()
for col in df.columns:
    df[col] = [[pair for pair in combinations(l, 2)] for l in df[col]]
df = df.explode(list(df.columns))
print('Intermediary time:', time.time()-start)

desired_df = pd.DataFrame()
for col in df.columns:
    desired_df[[col+'_1', col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)
print('Final time:', time.time()-start)

Intermediary time: 36.28665089607239
Final time: 55.56900978088379

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source

'Fastest way to split pandas dataframe rows into multiple rows

Sources

Related Questions