PySpark Design Pattern for Combining Values Based on Criteria

Hi, I am new to PySpark and want to create a function that takes a table of duplicate rows and a priority dict of the form {field_name: [(source, approach), ...]} as input and builds a single new record. For each field, the new record gets the first non-null value found by walking that field's priority list, where each "approach" is a function that extracts a value from the given source's rows.

For example, the input table for a specific component looks like this:

[image: duplicate records for one component]

And given this priority dict:

[image: priority dict for the fields]

The output record should look like this:

[image: the combined output record]

The new record looks like this because, for each field, the selected function dictates how the value is chosen. For example, phone ends up as 0.75: Amazon's most complete record has a null phone, so we coalesce to the next approach in the list, which takes phone from Google's most complete record (0.75).
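To make the input format concrete, here is a minimal sketch of the shape of the priority dict (the field names, sources, and approach names are made up for illustration):

```python
# Hypothetical shape of the priority dict: each field maps to an ordered list
# of (source, approach) pairs, tried in order until one yields a non-null value.
priority = {
    "phone": [("Amazon", "most_complete_record"),
              ("Google", "most_complete_record")],
    "name":  [("Google", "most_complete_record"),
              ("Amazon", "most_complete_record")],
}
```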

Essentially, I want to write a PySpark function that groups by component and then applies the appropriate function to each column to get the correct value. I have a function that "works", but its performance is terrible because I naively loop through each component, then each column, then each approach in the list to build the record, as sketched below.
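To make the performance problem concrete, here is a simplified, driver-side sketch of my current naive approach (the sample rows, field names, and the single "most complete record" approach are made up; in the real job the rows come from collecting the DataFrame):

```python
from collections import defaultdict

# Duplicate rows for two components, pulled to the driver (in the real code this
# comes from df.collect()); data and field names are illustrative only.
rows = [
    {"component": 1, "source": "Amazon", "name": "Foo Inc",          "phone": None},
    {"component": 1, "source": "Google", "name": "Foo Incorporated", "phone": 0.75},
    {"component": 2, "source": "Amazon", "name": "Bar LLC",          "phone": 0.20},
]

# Priority dict in the shape sketched earlier.
priority = {
    "phone": [("Amazon", "most_complete_record"), ("Google", "most_complete_record")],
    "name":  [("Google", "most_complete_record"), ("Amazon", "most_complete_record")],
}

def most_complete_record(recs, source, field):
    # Value of `field` from `source`'s most complete record (most non-null values).
    cands = [r for r in recs if r["source"] == source]
    if not cands:
        return None
    best = max(cands, key=lambda r: sum(v is not None for v in r.values()))
    return best[field]

APPROACHES = {"most_complete_record": most_complete_record}

groups = defaultdict(list)
for r in rows:
    groups[r["component"]].append(r)

golden_records = []
for component, recs in groups.items():        # loop over every component...
    out = {"component": component}
    for field, rules in priority.items():     # ...then every column...
        value = None
        for source, approach in rules:        # ...then every approach in the list
            value = APPROACHES[approach](recs, source, field)
            if value is not None:             # first non-null value wins
                break
        out[field] = value
    golden_records.append(out)

print(golden_records)
```

For component 1 this reproduces the phone example above (Amazon's most complete record has a null phone, so the value falls through to Google's 0.75), but doing this row-by-row on the driver clearly does not scale.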

Any help is much appreciated!


