Consolidating address columns from multiple tables into one column (3 million rows)

I have a table that looks like this:

| common_id | table1_address      | table2_address      | table3_address      | table4_address    |
|-----------|---------------------|---------------------|---------------------|-------------------|
| 123       | null                | null                | stack building12    | null              |
| 157       | 123road street12    | 123road street 33   | 123road street 44   | 123road street 45 |
| 158       | wolf building 451-2 | 451-2 building wolf | wolf building 451-2 | null              |
| 163       | null                | sweet rd. 254-11    | null                | --                |

I have about 3 million rows containing address information from four different tables that share a common_id. I joined the 4 tables into one table. Now I want to consolidate the address columns into a single address column, like this:

| common_id | collaborated_address |
|-----------|----------------------|
| 123       | stack building12     |
| 157       | 123road street12     |
| 158       | wolf building 451-2  |
| 163       | sweet rd. 254-11     |

I tried to do this with pandas, but it takes too long, so I want to do it with Spark SQL or PySpark functions instead.

Conditions:

  1. When consolidating, only values that are not null and not "--" should be collected.
  2. When several columns hold the same address, as in row common_id 158, that majority address should be picked. In this case, "wolf building 451-2" appears in both table1_address and table3_address.
  3. If every column contains a slightly different address, as in row common_id 157, then any one of those addresses may be picked.


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
