How do I outer join data frames coming from a for loop when they have different column counts?
I have a few data frames coming from one location and want to combine them (outer join them). The problem is that some of them have different numbers of columns, although the column to join on is the same in all of them. How do I achieve this in PySpark?
The sample code I have is:
import io
from functools import reduce

import pandas as pd
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.union, dfs)

# s3, bucket, S3_BUCKET, spark and display are set up elsewhere (Databricks notebook)
attributes_df = []
attributes_file_list = []
files = bucket.objects.filter(Prefix="data-process/attributes_extraction/Attributes_/")
for obj in files:
    file_path = obj.key
    if file_path[-5:] == ".xlsx":
        attributes_file_list.append(file_path)
        fileobj = s3.Object(S3_BUCKET, file_path)
        data = fileobj.get()['Body'].read()
        df = pd.read_excel(io.BytesIO(data))
        df = spark.createDataFrame(df.astype(str))
        attributes_df.append(df)

attributes_ = unionAll(*attributes_df)
display(attributes_)
The error I have is:
Union can only be performed on tables with the same number of columns, but the first table has 4 columns and the second table has 3 columns;
So, I believe one of the data frames has 4 columns and the others 3. But can't I still outer join them on a common column?
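One way to see the shape of a fix: since the files are already read into pandas before being converted to Spark, the frames can be outer-combined in pandas first; pd.concat with its default join="outer" aligns columns by name and fills cells missing from a frame with NaN, after which a single spark.createDataFrame call suffices. (On the Spark side, DataFrame.unionByName(other, allowMissingColumns=True), available since Spark 3.1, performs the same by-name alignment.) A minimal sketch with made-up three- and four-column frames sharing an "id" column:

```python
import pandas as pd

# Hypothetical frames mirroring the question: different column sets,
# but a common key column "id"
df_a = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "extra": ["x", "y"]})
df_b = pd.DataFrame({"id": [3], "name": ["c"]})

# join="outer" (the default) aligns columns by name; cells absent
# from a frame are filled with NaN
combined = pd.concat([df_a, df_b], ignore_index=True)

print(sorted(combined.columns))  # ['extra', 'id', 'name']
print(len(combined))             # 3
```

The resulting single frame can then be passed to spark.createDataFrame once, avoiding the column-count check that DataFrame.union performs.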
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow