How do I outer join data frames coming from a for loop when they have different numbers of columns?

I have a few data frames coming from a location, and I want to combine them (outer join them). The problem is that some of them have different numbers of columns, but the column to join them on is the same in all of them. How do I achieve this in PySpark?

The sample code I have is:

from functools import reduce
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Positional union: requires every frame to have the same number of columns
    return reduce(DataFrame.union, dfs)

attributes_df = []
attributes_file_list = []
files = bucket.objects.filter(Prefix="data-process/attributes_extraction/Attributes_/")
for obj in files:
    file_path = obj.key
    if file_path.endswith(".xlsx"):
        attributes_file_list.append(file_path)
        fileobj = s3.Object(S3_BUCKET, file_path)
        data = fileobj.get()['Body'].read()
        # Read the Excel file into pandas, then convert to a Spark DataFrame
        df = pd.read_excel(io.BytesIO(data))
        df = spark.createDataFrame(df.astype(str))
        attributes_df.append(df)
attributes_ = unionAll(*attributes_df)
display(attributes_)
display(attributes_)

The error I have is:

Union can only be performed on tables with the same number of columns, but the first table has 4 columns and the second table has 3 columns;

So, I believe one of the data frames has 4 columns and the others have 3. But can't I outer join them on a common column?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
