'PySpark: dynamic union of DataFrames with different columns

Consider the arrays as shown here. I have 3 sets of array:

Array 1:

C1  C2  C3
1   2   3
9   5   6

Array 2:

C2 C3 C4
11 12 13
10 15 16

Array 3:

C1   C4
111  112
110  115

I need the output as following, the input I can get any one value for C1, ..., C4 but while joining I need to get correct values and if the value is not there then it should be zero.

Expected output:

C1 C2 C3 C4
1  2  3  0
9  5  6  0
0  11 12 13
0 10 15 16
111 0 0 112
110 0 0 115

I have written pyspark code but I have hardcoded the value for the new column and its RAW, I need to convert the below code to method overloading, so that I can use this script as automatic one. I need to use only python/pyspark not pandas.

import pyspark
from pyspark import SparkContext
from pyspark.sql.functions import lit
from pyspark.sql import SparkSession

sqlContext = pyspark.SQLContext(pyspark.SparkContext())

df01 = sqlContext.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = sqlContext.createDataFrame([(11,12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = sqlContext.createDataFrame([(111,112), (110, 115)], ("C1", "C4"))

df01_add = df01.withColumn("C4", lit(0)).select("c1","c2","c3","c4")
df02_add = df02.withColumn("C1", lit(0)).select("c1","c2","c3","c4")
df03_add = df03.withColumn("C2", lit(0)).withColumn("C3", lit(0)).select("c1","c2","c3","c4")

df_uni = df01_add.union(df02_add).union(df03_add)
df_uni.show()

Method Overloading Example:

class Student:
     def ___Init__ (self,m1,m2):
         self.m1 = m1
         self.m2 = m2

     def sum(self,c1=None,c2=None,c3=None,c4=None):
         s = 0
         if c1!= None and c2 != None and c3 != None:
            s = c1+c2+c3
         elif c1 != None and c2 != None:
             s = c1+c2
         else:
            s = c1
         return s

print(s1.sum(55,65,23))


Solution 1:[1]

Since Spark 3.1.0 you can use unionByName with allowMissingColumns=True which almost exactly does what you want. You'll need a fillna to replace the nulls with 0.

Note that any nulls in your original dataframes will also be replaced with 0.

dfs = [df01, df02, df03]

res = dfs[0]
for df in dfs[1:]:
    res = res.unionByName(df, allowMissingColumns=True)

res.fillna(0).show()

+---+---+---+---+                                                               
| C1| C2| C3| C4|
+---+---+---+---+
|  1|  2|  3|  0|
|  9|  5|  6|  0|
|  0| 11| 12| 13|
|  0| 10| 15| 16|
|111|  0|  0|112|
|110|  0|  0|115|
+---+---+---+---+

Solution 2:[2]

I would try

df = df1.join(df2, ['each', 'shared', 'col], how='full')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ScootCork
Solution 2 ehacinom