How do you go through data frame values in chunks and combine them?

I have this dataframe:

Metric  ProcId  TimeStamp               Value
CPU     proce_123   Mar-11-2022 11:00:00    1.4453125
CPU     proce_126   Mar-11-2022 11:00:00    0.058320373
CPU     proce_123   Mar-11-2022 11:00:00    0.095274389
CPU     proce_000   Mar-11-2022 11:00:00    0.019654088
CPU     proce_144   Mar-11-2022 11:00:00    0.019841269
CPU     proce_1     Mar-11-2022 11:00:00    0.234741792
CPU     proce_100   Mar-11-2022 11:00:00    5.32945776
CPU     proce_57777 Mar-11-2022 11:00:00    0.25390625
CPU     proce_0000  Mar-11-2022 11:00:00    0.019349845
CPU     proce_123   Mar-11-2022 11:00:00    0.019500781
CPU     proce_123   Mar-11-2022 11:00:00    2.32421875
CPU     proce_123   Mar-11-2022 11:00:00    68.3903656
CPU     proce_123   Mar-11-2022 11:00:00    0.057781201
CPU     proce_123   Mar-11-2022 11:00:00    0.416666627

This is just a sample dataframe; the actual dataframe has thousands of rows. I need to go through the ProcId column of this dataframe in chunks, and for each iteration I need to create a string combining the ProcIds in that chunk.

For example, given a chunk size of 3, the strings need to look like this:

proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
proce_123%22%2C%2proce_126%22%2C%2proce_111%22%29

Please note that after the 3rd item in each chunk we need to add %22%29. After the first and second we need to add %22%2C%2.

I can do something like this to print out the chunks:

ids = df['ProcId'].tolist()

n = 3  # size of chunks
chunks = []  # list of chunks

for i in range(0, len(ids), n):
    chunks.append(ids[i:i + n])

I am not sure how I would combine these 3 items into one string and add the other strings at the end.



Solution 1:[1]

Avoid iterating through your data frame in a for loop; your performance is almost guaranteed to be poorer than if you use some combination of groupby, merge, shift, and other array-oriented numpy or pandas operations.
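As a toy illustration of the gap (my own example, not from the original answer):

import pandas as pd

s = pd.Series(range(1_000_000))

# row-by-row Python loop: every element crosses the Python/C boundary
squares_loop = [v ** 2 for v in s]

# array-oriented operation: a single call executed in compiled code
squares_vec = s ** 2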

Make chunk ids out of your dataframe through integer division over the index (this assumes incremental index values):

chunk_size = 3
df['ChunkId'] = df.index // chunk_size

Add the suffix to each ProcId to create a new column, ProcEnds, then join these within each group:

df['ProcEnds'] = (df.ProcId + '%22%2C%2').where(
    df.index % chunk_size != chunk_size - 1,
    df.ProcId + '%22%29')
# note: Series.where keeps values where cond is True and replaces them with other where cond is False

df['ChunkString'] = df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
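As a quick aside, here is a minimal standalone demo of the two building blocks used above, Series.where and Series.str.cat:

import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# where() keeps values satisfying the condition and substitutes `other` elsewhere
s.where(s != 'b', other='X')   # -> ['a', 'X', 'c']

# str.cat() with no arguments concatenates every string in the Series
s.str.cat()                    # -> 'abc'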

Optionally, drop the ChunkId and ProcEnds columns so that the only additional column in the output is ChunkString:

df = df.drop(columns=['ChunkId', 'ProcEnds'])

df now outputs:

   Metric       ProcId           TimeStamp      Value                                           ChunkString
0     CPU    proce_123 2022-03-11 11:00:00   1.445312     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1     CPU    proce_126 2022-03-11 11:00:00   0.058320     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2     CPU    proce_123 2022-03-11 11:00:00   0.095274     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3     CPU    proce_000 2022-03-11 11:00:00   0.019654       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4     CPU    proce_144 2022-03-11 11:00:00   0.019841       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5     CPU      proce_1 2022-03-11 11:00:00   0.234742       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6     CPU    proce_100 2022-03-11 11:00:00   5.329458  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7     CPU  proce_57777 2022-03-11 11:00:00   0.253906  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8     CPU   proce_0000 2022-03-11 11:00:00   0.019350  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9     CPU    proce_123 2022-03-11 11:00:00   0.019501     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10    CPU    proce_123 2022-03-11 11:00:00   2.324219     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11    CPU    proce_123 2022-03-11 11:00:00  68.390366     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12    CPU    proce_123 2022-03-11 11:00:00   0.057781                    proce_123%22%2C%2proce_123%22%2C%2
13    CPU    proce_123 2022-03-11 11:00:00   0.416667                    proce_123%22%2C%2proce_123%22%2C%2

Update

A Google Colab notebook showing the output with the sample data: https://colab.research.google.com/drive/1f9ZHXE2ATZXD2qWsoATxEWABIBt0tMRN?usp=sharing

Update 2

OP asked:

Quick question: can we group this based on df['Metric']? For example, it would be CPU, Memory. I would need ChunkString based on CPU or Memory?

To apply this transformation within each Metric group, it is easiest to wrap the transformation logic in a function and apply that to the data.

Extra care needs to be taken to preserve the original indexing:

def transform(frame):
    _df = frame.reset_index(drop=True)
    _df['ChunkId'] = _df.index // chunk_size
    _df['ProcEnds'] = (_df.ProcId + '%22%2C%2').where(
        _df.index % chunk_size != chunk_size - 1,
        _df.ProcId + '%22%29')
    _df['ChunkString'] = _df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
    return _df.drop(columns=['ChunkId', 'ProcEnds'])

idx = df.index
df.groupby('Metric').apply(transform).set_index(idx)

This produces the same output as earlier, elided here for brevity.

Solution 2:[2]

You can use Python integer division (//) to form the index into groups of N:

N = 3
df['ChunkString'] = df.groupby(df.index // N)['ProcId'].transform(
    lambda x: '%22%2C%2'.join(x.tolist() + [''] * (N - len(x)))
              + ('%22%29' if len(x) == N else ''))

Notes:

  • x.tolist() + ['']*(N-len(x)) just converts x to a list and pads it with empty strings until it reaches length N; see the example below.
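For example, on a plain Python list standing in for a final, partial chunk:

N = 3
x = ['proce_123', 'proce_126']      # only 2 items in the last chunk
padded = x + [''] * (N - len(x))    # ['proce_123', 'proce_126', '']
'%22%2C%2'.join(padded)             # 'proce_123%22%2C%2proce_126%22%2C%2'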

Output

>>> df
   Metric       ProcId           TimeStamp      Value                                           ChunkString
0     CPU    proce_123 2022-03-11 11:00:00   1.445312     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1     CPU    proce_126 2022-03-11 11:00:00   0.058320     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2     CPU    proce_123 2022-03-11 11:00:00   0.095274     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3     CPU    proce_000 2022-03-11 11:00:00   0.019654       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4     CPU    proce_144 2022-03-11 11:00:00   0.019841       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5     CPU      proce_1 2022-03-11 11:00:00   0.234742       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6     CPU    proce_100 2022-03-11 11:00:00   5.329458  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7     CPU  proce_57777 2022-03-11 11:00:00   0.253906  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8     CPU   proce_0000 2022-03-11 11:00:00   0.019350  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9     CPU    proce_123 2022-03-11 11:00:00   0.019501     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10    CPU    proce_123 2022-03-11 11:00:00   2.324219     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11    CPU    proce_123 2022-03-11 11:00:00  68.390366     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12    CPU    proce_123 2022-03-11 11:00:00   0.057781                    proce_123%22%2C%2proce_123%22%2C%2
13    CPU    proce_123 2022-03-11 11:00:00   0.416667                    proce_123%22%2C%2proce_123%22%2C%2

With N = 5:

>>> df
   Metric       ProcId           TimeStamp      Value                                                                           ChunkString
0     CPU    proce_123 2022-03-11 11:00:00   1.445312   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
1     CPU    proce_126 2022-03-11 11:00:00   0.058320   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
2     CPU    proce_123 2022-03-11 11:00:00   0.095274   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
3     CPU    proce_000 2022-03-11 11:00:00   0.019654   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
4     CPU    proce_144 2022-03-11 11:00:00   0.019841   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
5     CPU      proce_1 2022-03-11 11:00:00   0.234742  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
6     CPU    proce_100 2022-03-11 11:00:00   5.329458  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
7     CPU  proce_57777 2022-03-11 11:00:00   0.253906  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
8     CPU   proce_0000 2022-03-11 11:00:00   0.019350  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
9     CPU    proce_123 2022-03-11 11:00:00   0.019501  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
10    CPU    proce_123 2022-03-11 11:00:00   2.324219                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
11    CPU    proce_123 2022-03-11 11:00:00  68.390366                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
12    CPU    proce_123 2022-03-11 11:00:00   0.057781                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
13    CPU    proce_123 2022-03-11 11:00:00   0.416667                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2

Solution 3:[3]

chunk_size = 3

# First, gather the ProcIds into a list (here taken straight from the dataframe)
list_of_proc_ids = df['ProcId'].tolist()

final_str = ''
# Then enumerate through that list, adding a unique ending after every third item
for index, obj in enumerate(list_of_proc_ids):
    final_str += str(obj)
    if (index + 1) % chunk_size == 0:  # divisible by chunk_size, accounting for the 0-based index
        final_str += '%22%29'
    else:
        final_str += '%22%2C%2'

Solution 4:[4]

If your goal is to deal with a large amount of data, pandas has a great page about scaling that describes an efficient way to process data in chunks: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
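For instance, pandas itself can read a large file in fixed-size chunks (a minimal sketch; the file name and per-chunk processing are placeholders):

import pandas as pd

# read 10,000 rows at a time instead of loading the whole file at once
for chunk in pd.read_csv('myfile1.csv', chunksize=10_000):
    ...  # process each chunk here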

My advice is to go with a Dask implementation to process chunks efficiently (locally, or to prepare for even larger datasets by distributing compute on a cluster). Dask comes with a DataFrame API that mirrors the pandas one, so you should feel comfortable using it.

import dask.dataframe as dd

df1 = dd.read_csv('myfile1.csv')
df2 = dd.read_csv('myfile2.csv')
# do something with your dataframes
# combine them (concatenate -- but you could also join etc.)
final_df = dd.concat([df1, df2])

You'll find numerous great examples in the Dask documentation.

Solution 5:[5]

You can use GroupBy.apply:

N = 3
out = (df.groupby(df.index//N)['ProcId']
         .apply(lambda x: '%22%2C%2'.join(x)+'%22%29')
       )

Output (as Series):

0       proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1         proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
2    proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
3       proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
4                        proce_123%22%2C%2proce_123%22%29
Name: ProcId, dtype: object

Or output as list:

# out.to_list()
['proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29',
 'proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29',
 'proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29',
 'proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29',
 'proce_123%22%2C%2proce_123%22%29']

Older answer, from before the question was updated:

For efficiency, use a vectorized approach:

import numpy as np
N = 3

# pick the suffix for each row: the last position in every chunk of N gets '%22%29'
s = np.where(np.arange(len(df)) % N < N - 1, '%22%2C%2', '%22%29')

# concatenate all the strings
out = (df['ProcId'] + '_' + s).str.cat()

Output: 'proce_123_%22%2C%2proce_126_%22%2C%2proce_123_%22%29proce_000_%22%2C%2proce_144_%22%2C%2proce_1_%22%29proce_100_%22%2C%2proce_57777_%22%2C%2proce_0000_%22%29proce_123_%22%2C%2proce_123_%22%2C%2proce_123_%22%29proce_123_%22%2C%2proce_123_%22%2C%2'

Solution 6:[6]

I'm guessing you want to apply .rolling, but to string columns. This modifies the approach from "Rolling sum with strings" a bit.

import pandas as pd

def create_chunk_string(dff):
    # concatenate each ProcId with the next two: a rolling window of 3 over strings
    res_df = pd.concat(
        [dff['ProcId'].shift(-i).fillna('') for i in range(3)],
        axis=1
    ).sum(axis=1)
    # drop the trailing rows that do not have a full window
    res_df = res_df.iloc[:-3+1].to_frame(name='ChunkStr')
    return dff.join(res_df, how='left')

print(df.groupby('Metric').apply(create_chunk_string))

   Metric       ProcId             TimeStamp      Value                        ChunkStr
0     CPU    proce_123  Mar-11-2022 11:00:00   1.445312     proce_123proce_126proce_123
1     CPU    proce_126  Mar-11-2022 11:00:00   0.058320     proce_126proce_123proce_000
2     CPU    proce_123  Mar-11-2022 11:00:00   0.095274     proce_123proce_000proce_144
3     CPU    proce_000  Mar-11-2022 11:00:00   0.019654       proce_000proce_144proce_1
4     CPU    proce_144  Mar-11-2022 11:00:00   0.019841       proce_144proce_1proce_100
5     CPU      proce_1  Mar-11-2022 11:00:00   0.234742                             NaN
6     CPU    proce_100  Mar-11-2022 11:00:00   5.329458                             NaN
7     GPU  proce_57777  Mar-11-2022 11:00:00   0.253906  proce_57777proce_0000proce_123
8     GPU   proce_0000  Mar-11-2022 11:00:00   0.019350    proce_0000proce_123proce_123
9     GPU    proce_123  Mar-11-2022 11:00:00   0.019501     proce_123proce_123proce_123
10    GPU    proce_123  Mar-11-2022 11:00:00   2.324219     proce_123proce_123proce_123
11    GPU    proce_123  Mar-11-2022 11:00:00  68.390366     proce_123proce_123proce_123
12    GPU    proce_123  Mar-11-2022 11:00:00   0.057781                             NaN
13    GPU    proce_123  Mar-11-2022 11:00:00   0.416667                             NaN

You might want to sort your dataframe before doing this. Adding the separator string in between and the terminator at the end of ChunkStr should be straightforward; a sketch follows.
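For instance, one way to do it (my own sketch, not from the original answer) is to append the separator to each shifted column before summing, and to terminate the last column of each window with '%22%29':

import pandas as pd

def create_chunk_string_sep(dff, n=3, sep='%22%2C%2', end='%22%29'):
    # the first n-1 positions of each window carry the separator, the last carries the terminator
    parts = [(dff['ProcId'] + sep).shift(-i).fillna('') for i in range(n - 1)]
    parts.append((dff['ProcId'] + end).shift(-(n - 1)).fillna(''))
    res = pd.concat(parts, axis=1).sum(axis=1)
    # keep only rows that start a complete window
    res = res.iloc[:len(res) - n + 1].to_frame(name='ChunkStr')
    return dff.join(res, how='left')

# usage, mirroring the original answer:
# print(df.groupby('Metric').apply(create_chunk_string_sep))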

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: unattributed
Solution 2: richardec
Solution 3: Josh
Solution 4: fpajot
Solution 5: unattributed
Solution 6: Gunawan L G