'Pandas Create Link Pairs from Multiple Rows
I have a df with id as the flow id, dttm as the step modification time, and step as the steps in the flow. It is currently ordered by dttm. There could be any number of steps for a particular id.
Current df:
id dttm step
0 81 2015-05-26 07:56:03 A
1 81 2015-05-26 08:19:07 B
2 81 2015-05-26 08:32:05 C
3 91 2015-05-26 08:07:12 B
4 91 2015-05-26 08:07:12 C
I want to create link data to feed into a Sankey. Therefore I ultimately want a df with three columns: source, target, and value. value is the count of ids that have such a step pair.
Desired df:
source target value
0 A B 1
1 B C 2
I know I can stuff step into one row with either groupby or possibly cat. However, I think that would just create a different starting point without actually advancing a solution. Part of what makes it tough is that the steps depend on dttm to stay ordered so that the steps are appropriately paired. Also, the fact that it has to be dynamic because there could be any number of steps adds to the difficulty.
How should I dynamically "stuff" the step column to arrive at the link data?
Is there a way to join the df to itself to get all of the pairs, and then remove rows that are created during the join but are nonsense?
Thank you for any and all insight!
Solution 1:[1]
I figured out a solution. It is not a one step solution like Scott's, but I think it's a little simpler to understand. also, it runs much, much faster:
Add a sequence to each group of id. then duplicate the df to allow a self-merge:
df['sequence'] = df.groupby('id').cumcount()
df2 = df
Merge the dataframes on id, then only keep the rows that make sense based on the sequence:
result = df.merge(df2, on='id', how='left')
result = result[result['sequence_x']+1==result['sequence_y']]
Group the resulting df by step pairs and use size to get value. Then reset the index and rename the columns:
result = pd.DataFrame(result.groupby(['step_x','step_y']).size().rename('value'))
result = result.reset_index()
result.columns = ['source','target','value']
Solution 2:[2]
Let's try:
(df.groupby('id')['step'].apply(lambda x: pd.DataFrame(list(zip(x, x[1:]))).set_index([0,1]).assign(count=1))
.rename_axis(['id','source','target'])
.sum(level=[1,2]).reset_index())
Output:
source target count
0 A B 1
1 B C 2
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Scott Boston |
