Consolidating address columns from multiple tables into one column (3 million rows)

I have a table that looks like this:
| common_id | table1_address | table2_address | table3_address | table4_address |
|---|---|---|---|---|
| 123 | null | null | stack building12 | null |
| 157 | 123road street12 | 123road street 33 | 123road street 44 | 123road street 45 |
| 158 | wolf building 451-2 | 451-2 building wolf | wolf building 451-2 | null |
| 163 | null | sweet rd. 254-11 | null | -- |
I have about 3 million rows containing address information from four different tables that share a common_id. I joined the four tables into one table. Now I want to collapse the four address columns into a single address column, like this:
| common_id | collaborated_address |
|---|---|
| 123 | stack building12 |
| 157 | 123road street12 |
| 158 | wolf building 451-2 |
| 163 | sweet rd. 254-11 |
I tried to do this with pandas, but it takes too long, so I would like to do it with Spark SQL or PySpark functions instead.
Conditions:
- When collapsing, collect only values that are not null and not "--".
- If the same address appears in more than one column, pick that repeated one. For example, in row common_id 158, "wolf building 451-2" appears in both table1_address and table3_address, so it should be chosen.
- If every column contains a slightly different address, as in row common_id 157, then picking any one of them at random is fine.
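The per-row rule described above (drop nulls and `"--"`, prefer the most frequent address, otherwise take any) can be sketched as a plain Python function. This is a minimal sketch, not a tested full pipeline; the function name `collapse_addresses` and the UDF wiring shown in the comments are illustrative assumptions:

```python
from collections import Counter

def collapse_addresses(*addresses):
    """Pick one address per row.

    Rules from the question:
      * ignore nulls and the literal "--"
      * if an address appears in more than one column, prefer it
      * otherwise any remaining address is acceptable (first seen here)
    """
    valid = [a for a in addresses if a is not None and a.strip() != "--"]
    if not valid:
        return None
    # most_common(1) returns the address with the highest count;
    # on ties it falls back to first-seen order, which satisfies
    # the "random address is fine" condition
    return Counter(valid).most_common(1)[0][0]

# In PySpark this row-level function could be wrapped as a UDF
# (illustrative, assuming a DataFrame `df` with the joined columns):
#
#   from pyspark.sql import functions as F
#   from pyspark.sql.types import StringType
#
#   collapse_udf = F.udf(collapse_addresses, StringType())
#   result = df.select(
#       "common_id",
#       collapse_udf("table1_address", "table2_address",
#                    "table3_address", "table4_address"
#                    ).alias("collaborated_address"))
```

A Python UDF serializes each row to the Python worker, so on 3 million rows it is slower than native column expressions, but still far faster than pandas on a single machine; if profiling shows the UDF is a bottleneck, the same logic can be rewritten with built-in functions such as `coalesce` plus an array-based mode calculation.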
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow