Combine files from multiple folders into one file per folder in an S3 bucket
I am working with an S3 bucket that has multiple levels, and every subfolder contains multiple files. I am trying to run a Python script in Glue that needs to combine all these files into one dataframe per folder, run a process on it, and then save the result to another S3 bucket under a similar file path.
Here is the hierarchy of the folders:
long_path/folder1:
long_path/folder1/A: A1.csv, A2.csv, A3.csv
long_path/folder1/B: B1.csv, B2.csv
long_path/folder2:
long_path/folder2/C: C1.csv, C2.csv... C5.csv
long_path/folder3:
long_path/folder3/D: D1.csv
long_path/folder3/E: E1.csv...E4.csv
I would like to combine all the CSVs in folders A, B, C, D, and E, creating individual dataframes called df_a, df_b, df_c, df_d, and df_e.
So far, my approach has been to create a list of these paths and build the dataframes by iterating over it:
prefixes = ["long_path/folder1/A", "long_path/folder1/B", "long_path/folder2/C", "long_path/folder3/D", "long_path/folder3/E"]
for prefix in prefixes:
    # collect the .csv keys under this prefix
    files = []
    for item in s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
        if item['Key'].endswith(".csv"):
            files.append(item['Key'])
    # read each CSV and concatenate them into one dataframe
    list_df = []
    for key in files:
        path = "s3://bucket-name/" + key
        df = pd.read_csv(path, engine='pyarrow')
        list_df.append(df)
    final_df = pd.concat(list_df)
And then I do the processing within this loop. This code, however, looks very clunky. Is there a more efficient and cleaner way to do this task? How do I combine all the files in a folder, for multiple folders?
Thanks in advance!
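One cleaner way to structure this (a sketch, not tested against the bucket in question; `s3_client` and `bucket` are assumed to exist as in the original code) is to pull the per-folder work into a small function and use a `list_objects_v2` paginator, which also handles prefixes holding more than 1,000 objects:

```python
import pandas as pd

def csv_keys(pages):
    """Pull the .csv object keys out of list_objects_v2 response pages."""
    return [
        obj["Key"]
        for page in pages
        for obj in page.get("Contents", [])  # "Contents" is absent for empty prefixes
        if obj["Key"].endswith(".csv")
    ]

def folder_name(prefix):
    """Last path component, lower-cased: 'long_path/folder1/A' -> 'a'."""
    return prefix.rstrip("/").rsplit("/", 1)[-1].lower()

def combine_folder(s3_client, bucket, prefix):
    """Concatenate every CSV under one S3 prefix into a single dataframe."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = csv_keys(paginator.paginate(Bucket=bucket, Prefix=prefix))
    return pd.concat(
        (pd.read_csv(f"s3://{bucket}/{key}") for key in keys),
        ignore_index=True,
    )

prefixes = ["long_path/folder1/A", "long_path/folder1/B",
            "long_path/folder2/C", "long_path/folder3/D",
            "long_path/folder3/E"]
```

With these helpers, the per-folder dataframes can be built in one expression, keyed by folder name instead of held in separately named variables: `dfs = {folder_name(p): combine_folder(s3_client, bucket, p) for p in prefixes}`, after which `dfs["a"]` plays the role of `df_a`, and the processing step runs once per dictionary entry.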
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
