Combine files from multiple folders into one file per folder in an S3 bucket
I am working with an S3 bucket that has multiple levels, and every subfolder contains multiple files. I am trying to run a Python script in Glue that needs to combine all these files into one dataframe per folder, run a process on it, and then save the result to another S3 bucket under a similar file path.
Here is the hierarchy of the folders:
long_path/folder1:
long_path/folder1/A: A1.csv, A2.csv, A3.csv
long_path/folder1/B: B1.csv, B2.csv
long_path/folder2:
long_path/folder2/C: C1.csv, C2.csv... C5.csv
long_path/folder3:
long_path/folder3/D: D1.csv
long_path/folder3/E: E1.csv...E4.csv
I would like to combine all the CSVs in folders A, B, C, D, and E, creating individual dataframes called df_a, df_b, df_c, df_d, and df_e.
So far, my approach has been to create a list of these paths and build the dataframes by iterating over it:
prefixes = ["long_path/folder1/A", "long_path/folder1/B", "long_path/folder2/C", "long_path/folder3/D", "long_path/folder3/E"]
for prefix in prefixes:
    # collect the .csv keys under this prefix
    files = []
    for item in s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
        if item['Key'].endswith(".csv"):
            files.append(item['Key'])
    # read each CSV and concatenate them into one dataframe
    list_df = []
    for key in files:
        path = "s3://bucket-name/" + key
        df = pd.read_csv(path, engine='pyarrow')
        list_df.append(df)
    final_df = pd.concat(list_df)
And then I do the processing within this loop. This code, however, looks very clunky. Is there a more efficient and cleaner way to do this task? How do I combine all the files in a folder, for multiple folders?
Thanks in advance!
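One cleaner way to structure this (a sketch, not tested against the bucket in question; `s3_client` and `bucket` are assumed to exist as in the original code) is to pull the per-folder work into a small function and use a `list_objects_v2` paginator, which also handles prefixes holding more than 1,000 objects:

```python
import pandas as pd

def csv_keys(pages):
    """Pull the .csv object keys out of list_objects_v2 response pages."""
    return [
        obj["Key"]
        for page in pages
        for obj in page.get("Contents", [])  # "Contents" is absent for empty prefixes
        if obj["Key"].endswith(".csv")
    ]

def folder_name(prefix):
    """Last path component, lower-cased: 'long_path/folder1/A' -> 'a'."""
    return prefix.rstrip("/").rsplit("/", 1)[-1].lower()

def combine_folder(s3_client, bucket, prefix):
    """Concatenate every CSV under one S3 prefix into a single dataframe."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = csv_keys(paginator.paginate(Bucket=bucket, Prefix=prefix))
    return pd.concat(
        (pd.read_csv(f"s3://{bucket}/{key}") for key in keys),
        ignore_index=True,
    )

prefixes = ["long_path/folder1/A", "long_path/folder1/B",
            "long_path/folder2/C", "long_path/folder3/D",
            "long_path/folder3/E"]
```

With these helpers, the per-folder dataframes can be built in one expression, keyed by folder name instead of held in separately named variables: `dfs = {folder_name(p): combine_folder(s3_client, bucket, p) for p in prefixes}`, after which `dfs["a"]` plays the role of `df_a`, and the processing step runs once per dictionary entry.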
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
