Consolidating address columns from multiple tables into one column (3 million rows)

I have a table that looks like this:
| common_id | table1_address | table2_address | table3_address | table4_address |
|---|---|---|---|---|
| 123 | null | null | stack building12 | null |
| 157 | 123road street12 | 123road street 33 | 123road street 44 | 123road street 45 |
| 158 | wolf building 451-2 | 451-2 building wolf | wolf building 451-2 | null |
| 163 | null | sweet rd. 254-11 | null | -- |
I have about 3 million rows containing address information from four different tables that share a common_id. I joined the four tables into one table. Now I want to collapse the four address columns into a single address column, like this:
| common_id | collaborated_address |
|---|---|
| 123 | stack building12 |
| 157 | 123road street12 |
| 158 | wolf building 451-2 |
| 163 | sweet rd. 254-11 |
I tried to do this with pandas, but it takes too long, so I would like to do it with Spark SQL or PySpark functions instead.
Conditions:
- When collapsing, collect only values that are not null and not "--".
- If the same address appears in more than one column, pick that repeated one. For example, in row common_id 158, "wolf building 451-2" appears in both table1_address and table3_address, so it should be chosen.
- If every column contains a slightly different address, as in row common_id 157, then picking any one of them at random is fine.
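The per-row rule described above (drop nulls and `"--"`, prefer the most frequent address, otherwise take any) can be sketched as a plain Python function. This is a minimal sketch, not a tested full pipeline; the function name `collapse_addresses` and the UDF wiring shown in the comments are illustrative assumptions:

```python
from collections import Counter

def collapse_addresses(*addresses):
    """Pick one address per row.

    Rules from the question:
      * ignore nulls and the literal "--"
      * if an address appears in more than one column, prefer it
      * otherwise any remaining address is acceptable (first seen here)
    """
    valid = [a for a in addresses if a is not None and a.strip() != "--"]
    if not valid:
        return None
    # most_common(1) returns the address with the highest count;
    # on ties it falls back to first-seen order, which satisfies
    # the "random address is fine" condition
    return Counter(valid).most_common(1)[0][0]

# In PySpark this row-level function could be wrapped as a UDF
# (illustrative, assuming a DataFrame `df` with the joined columns):
#
#   from pyspark.sql import functions as F
#   from pyspark.sql.types import StringType
#
#   collapse_udf = F.udf(collapse_addresses, StringType())
#   result = df.select(
#       "common_id",
#       collapse_udf("table1_address", "table2_address",
#                    "table3_address", "table4_address"
#                    ).alias("collaborated_address"))
```

A Python UDF serializes each row to the Python worker, so on 3 million rows it is slower than native column expressions, but still far faster than pandas on a single machine; if profiling shows the UDF is a bottleneck, the same logic can be rewritten with built-in functions such as `coalesce` plus an array-based mode calculation.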
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow