How do you go through dataframe values in chunks and combine them?
I have this dataframe:
Metric ProcId TimeStamp Value
CPU proce_123 Mar-11-2022 11:00:00 1.4453125
CPU proce_126 Mar-11-2022 11:00:00 0.058320373
CPU proce_123 Mar-11-2022 11:00:00 0.095274389
CPU proce_000 Mar-11-2022 11:00:00 0.019654088
CPU proce_144 Mar-11-2022 11:00:00 0.019841269
CPU proce_1 Mar-11-2022 11:00:00 0.234741792
CPU proce_100 Mar-11-2022 11:00:00 5.32945776
CPU proce_57777 Mar-11-2022 11:00:00 0.25390625
CPU proce_0000 Mar-11-2022 11:00:00 0.019349845
CPU proce_123 Mar-11-2022 11:00:00 0.019500781
CPU proce_123 Mar-11-2022 11:00:00 2.32421875
CPU proce_123 Mar-11-2022 11:00:00 68.3903656
CPU proce_123 Mar-11-2022 11:00:00 0.057781201
CPU proce_123 Mar-11-2022 11:00:00 0.416666627
This is just a sample dataframe; the actual dataframe has thousands of rows. I need to go through the ProcId column of this dataframe in chunks, and for each iteration I need to create a string combining the ProcIds in that chunk.
For example the string needs to be like this given the chunk size 3:
proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
proce_123%22%2C%2proce_126%22%2C%2proce_111%22%29
Please note that after the 3rd item in each chunk we need to add %22%29; after the first and second we need to add %22%2C%2.
I can do something like this to print out the chunks:
n = 3  # size of chunks
chunks = []  # list of chunks
for i in range(0, len(id), n):
    chunks.append(id[i:i + n])
I am not sure how I would combine these 3 items into one string and append the other strings at the end.
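For reference, a minimal sketch of that missing step, building on the chunking loop above (here the ids are taken straight from the dataframe column):

n = 3
ids = df['ProcId'].tolist()
# join each chunk of 3 ids with %22%2C%2 and close it with %22%29
chunk_strings = ['%22%2C%2'.join(ids[i:i + n]) + '%22%29'
                 for i in range(0, len(ids), n)]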
Solution 1:[1]
Avoid iterating through your dataframe in a for loop; your performance is almost guaranteed to be poorer than if you used some combination of groupby, merge, shift, and other array-oriented numpy or pandas operations.
Make chunk ids out of your dataframe through integer division over the index (this assumes incremental index values):
chunk_size = 3
df['ChunkId'] = df.index // chunk_size
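For example, with chunk_size = 3 a default RangeIndex maps like this:

# index:    0  1  2  3  4  5  6  7  8 ...
# ChunkId:  0  0  0  1  1  1  2  2  2 ...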
Add the suffix to each ProcId to create a new column ProcEnds, then join these within each group:
df['ProcEnds'] = (df.ProcId + '%22%2C%2').where(
    df.index % chunk_size != chunk_size - 1,
    df.ProcId + '%22%29')
# note DataFrame.where replaces values with other when cond is False
df['ChunkString'] = df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
Optionally, drop the ChunkId and ProcEnds columns to get an output whose only additional column is ChunkString:
df = df.drop(columns=['ChunkId', 'ProcEnds'])
df now outputs:
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2
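If you only need one string per chunk rather than the value repeated on every row, a small follow-up sketch (assuming the default RangeIndex is still in place):

# one row per chunk: take the first ChunkString in each block of chunk_size rows
chunk_strings = df.groupby(df.index // chunk_size)['ChunkString'].first()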
update
Google Colab notebook showing the output with sample data: https://colab.research.google.com/drive/1f9ZHXE2ATZXD2qWsoATxEWABIBt0tMRN?usp=sharing
update 2
OP asked:
quick question. can we group this based on df['Metric']? For example, it would be CPU, Memory. I would need ChunkString based on CPU or memory?
To apply this transformation within each Metric group, it is easiest to put the transformation logic inside a function and apply that to the data.
Extra care needs to be taken to preserve the original indexing.
def transform(frame):
    _df = frame.reset_index(drop=True)
    _df['ChunkId'] = _df.index // chunk_size
    _df['ProcEnds'] = (_df.ProcId + '%22%2C%2').where(
        _df.index % chunk_size != chunk_size - 1,
        _df.ProcId + '%22%29')
    _df['ChunkString'] = _df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
    return _df.drop(columns=['ChunkId', 'ProcEnds'])
idx = df.index
df.groupby('Metric').apply(transform).set_index(idx)
This produces the same output as earlier; elided for brevity.
Solution 2:[2]
You can use Python integer division (//) to split the index into groups of N:
N = 3
df['ChunkString'] = df.groupby(df.index // N)['ProcId'].transform(
    lambda x: '%22%2C%2'.join(x.tolist() + [''] * (N - len(x)))
              + ('%22%29' if len(x) == N else ''))
Notes:
x.tolist() + ['']*(N-len(x)) just converts x to a list and pads it with empty strings until it reaches length N
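For instance, on the short final chunk of the sample data the padding behaves like this (plain-list illustration; in the lambda x is a Series, hence x.tolist()):

N = 3
x = ['proce_123', 'proce_123']       # final chunk holds only 2 items
padded = x + [''] * (N - len(x))     # ['proce_123', 'proce_123', '']
'%22%2C%2'.join(padded)              # 'proce_123%22%2C%2proce_123%22%2C%2'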
Output
>>> df
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2
With N = 5:
>>> df
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
Solution 3:[3]
chunk_size = 3

# First, generate a list of the ProcIds (taken directly from the dataframe column)
list_of_proc_ids = df['ProcId'].tolist()

final_str = ''
# Then enumerate through that list, adding the closing suffix after every third item
for index, obj in enumerate(list_of_proc_ids):
    final_str += str(obj)
    if (index + 1) % chunk_size == 0:  # divisible by 3, accounting for the 0-based index
        final_str += '%22%29'
    else:
        final_str += '%22%2C%2'
Solution 4:[4]
If your goal is to deal with a large amount of data, pandas offers a great page about it that describes an efficient way to process data in chunks: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
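For instance, plain pandas can already stream a large CSV in pieces via the chunksize argument (a sketch; process_chunk is a hypothetical per-chunk handler):

import pandas as pd

# read the file 100,000 rows at a time instead of all at once
for chunk in pd.read_csv('myfile1.csv', chunksize=100_000):
    process_chunk(chunk)  # hypothetical: apply your per-chunk logic here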
My advice is to go with a Dask implementation to process chunks efficiently (locally, or prepare for even larger datasets by distributing the compute over a cluster). Dask comes with a DataFrame API that looks like pandas' own, which will make you feel comfortable using it.
import dask.dataframe as dd

df1 = dd.read_csv('myfile1.csv')
df2 = dd.read_csv('myfile2.csv')
# do something with your dataframes
# combine them (concatenate -- but you could also join etc.;
# dd.concat is preferred since DataFrame.append is deprecated)
final_df = dd.concat([df1, df2])
# Dask is lazy: call final_df.compute() when you want the materialized result
You'll find numerous great examples in the Dask documentation.
Solution 5:[5]
You can use GroupBy.apply:
N = 3
out = (df.groupby(df.index // N)['ProcId']
         .apply(lambda x: '%22%2C%2'.join(x) + '%22%29')
      )
Output (as a Series):
0 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
2 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
3 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
4 proce_123%22%2C%2proce_123%22%29
Name: ProcId, dtype: object
Or output as list:
# out.to_list()
['proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29',
'proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29',
'proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29',
'proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29',
'proce_123%22%2C%2proce_123%22%29']
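If you also need this per Metric group (as in the follow-up under Solution 1), one possible variant, not from the original answer, chunks within each group using cumcount:

# chunk positions restart at 0 inside every Metric group
out = (df.groupby(['Metric', df.groupby('Metric').cumcount() // N])['ProcId']
         .apply(lambda x: '%22%2C%2'.join(x) + '%22%29'))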
older answer before question update
For efficiency, use a vectorized approach:
import numpy as np

N = 3
# suffix for each position: %22%2C%2 within a chunk, %22%29 on every Nth item
s = np.where(np.arange(len(df)) % N < N - 1, '%22%2C%2', '%22%29')
# concatenate ProcIds and suffixes into a single string
out = (df['ProcId'] + s).str.cat()
Output: 'proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29proce_123%22%2C%2proce_123%22%2C%2'
Solution 6:[6]
I'm guessing you want to apply .rolling, but to string columns. This modifies the approach from Rolling sum with strings a bit.
def create_chunk_string(dff):
    res_df = pd.concat(
        [dff['ProcId'].shift(-i).fillna('') for i in range(3)],
        axis=1
    ).sum(axis=1)
    res_df = res_df.iloc[:-3+1].to_frame(name='ChunkStr')
    return dff.join(res_df, how='left')

print(df.groupby('Metric').apply(create_chunk_string))
Metric ProcId TimeStamp Value ChunkStr
0 CPU proce_123 Mar-11-202211:00:00 1.445312 proce_123proce_126proce_123
1 CPU proce_126 Mar-11-202211:00:00 0.058320 proce_126proce_123proce_000
2 CPU proce_123 Mar-11-202211:00:00 0.095274 proce_123proce_000proce_144
3 CPU proce_000 Mar-11-202211:00:00 0.019654 proce_000proce_144proce_1
4 CPU proce_144 Mar-11-202211:00:00 0.019841 proce_144proce_1proce_100
5 CPU proce_1 Mar-11-202211:00:00 0.234742 NaN
6 CPU proce_100 Mar-11-202211:00:00 5.329458 NaN
7 GPU proce_57777 Mar-11-202211:00:00 0.253906 proce_57777proce_0000proce_123
8 GPU proce_0000 Mar-11-202211:00:00 0.019350 proce_0000proce_123proce_123
9 GPU proce_123 Mar-11-202211:00:00 0.019501 proce_123proce_123proce_123
10 GPU proce_123 Mar-11-202211:00:00 2.324219 proce_123proce_123proce_123
11 GPU proce_123 Mar-11-202211:00:00 68.390366 proce_123proce_123proce_123
12 GPU proce_123 Mar-11-202211:00:00 0.057781 NaN
13 GPU proce_123 Mar-11-202211:00:00 0.416667 NaN
You might want to sort your dataframe before doing this.
Adding the separator string in between and the closing %22%29 at the end of ChunkStr should be straightforward; see the sketch below.
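For example, one way to weave the separators into the same shift-and-concatenate trick (a sketch, not from the original answer; it keeps the 3-row window):

sep = '%22%2C%2'

def create_chunk_string_sep(dff):
    # append the separator to every id, concatenate 3 shifted copies,
    # then trim the trailing separator and close with %22%29
    parts = [(dff['ProcId'] + sep).shift(-i).fillna('') for i in range(3)]
    res = pd.concat(parts, axis=1).sum(axis=1).str[:-len(sep)] + '%22%29'
    return dff.join(res.iloc[:-3+1].to_frame(name='ChunkStr'), how='left')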
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | richardec |
| Solution 3 | Josh |
| Solution 4 | fpajot |
| Solution 5 | |
| Solution 6 | Gunawan L G |
