Is it possible to load multiple directories separately in PySpark but process them in parallel?
I have an S3 or Azure Blob directory structure like the following:
parent_dir
    child_dir1
       avro_1
       avro_2
       ...
    child_dir2
       ...
There are one to two hundred child_dirs, each containing a couple of files of a couple of GB each.
I want to use PySpark to apply some transformation to each Avro file and write the output using the same directory structure (the number of Avro files can change, but the directory structure and data ownership have to stay the same, i.e. data in one child_dir cannot be written out to another child_dir).
Right now I list the directories under parent_dir and read all the Avro files within each child_dir into a DataFrame:
parent_dir
    child_dir1 -> df_for_child1
    child_dir2 -> df_for_child2
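Roughly, the read step looks like this (simplified sketch; the bucket name and the hardcoded child_dir list are placeholders, and I'm assuming the spark-avro package is on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hardcoded here for illustration; in practice the child dirs are listed
# from S3/Blob storage (e.g. with boto3 or the Azure storage SDK).
child_dirs = ["child_dir1", "child_dir2"]

# One DataFrame per child_dir; reading Avro requires the spark-avro package
# (e.g. --packages org.apache.spark:spark-avro_2.12:<spark version>).
df_by_child_dir = {
    child: spark.read.format("avro").load(f"s3a://my-bucket/parent_dir/{child}")
    for child in child_dirs
}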
I then iterate over the list of DataFrames, transform each one, and write it out based on which child_dir it came from:
for df_container in df_container_list:
    df = df_container.getdf()
    df = df.transform(my_transform_func)
    df.write.format("avro").save(generate_output_path(df_container.get_sub_dir(), df))
I was wondering if it is possible to parallelize the transformations across the child_dirs, or are the transformation and the read/write already parallelized internally by Spark?
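To make the question concrete, this is the kind of driver-side parallelism I have in mind: submitting each per-directory job from its own thread so Spark can schedule them concurrently. This is only a sketch building on the names from the snippet above (df_container_list, my_transform_func, generate_output_path); max_workers is an arbitrary choice.
from concurrent.futures import ThreadPoolExecutor

def run_one(df_container):
    # Each call submits an independent Spark job; Spark's scheduler can run
    # jobs from different threads concurrently if the cluster has capacity.
    df = df_container.getdf().transform(my_transform_func)
    df.write.format("avro").mode("overwrite").save(
        generate_output_path(df_container.get_sub_dir(), df)
    )

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_one, df_container_list))
Is something like this the right direction, or is it unnecessary because each read/transform/write is already distributed across the cluster?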
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.