Create df key->count mapping from multiple dfs
I have 3 input dfs, all in the format:

key | irrelevant_data
----------------------
A   | asdfg
B   | asdfg

key | irrelevant_data
----------------------
C   | asdfg
B   | asdfg
I want to combine the 3 into a dictionary-like df that maps each key to the count of how many times it has shown up, i.e. from the above example:

key | count
----------------------
A   | 1
C   | 1
B   | 2
After this runs once, I need to keep the data in the dict for the next iteration, which will have 3 new input dfs. We might come across the same keys; in that case, increase the count. The purpose of this is that once a count reaches 3, I want to remove that key from the table and retrieve it.
I was thinking of converting one of the input dfs to a MapType (it's guaranteed within a df that the keys are unique, but this is not true among all 3 input dfs):
df1 = df1.withColumn("propertiesMap", F.create_map(
    F.col("key"), F.lit(1)
))
But after that I'm not sure how to go about adding in rows from the other 2 dfs, increasing the count if the key already exists vs creating a new row if it doesn't. I'm familiar with Python, where it would be so simple:

# pseudocode of what I essentially want in PySpark, where counts is a df
counts = defaultdict(int)
for curr_df in dfs:
    for key in curr_df["key"]:
        counts[key] += 1
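The asker's intended bookkeeping (accumulate counts across iterations, pop a key once it hits 3) can be sketched in plain Python with a Counter; the names here are illustrative, not from the question:

```python
from collections import Counter

def tally_keys(counts, key_lists, threshold=3):
    """Add 1 per key occurrence; pop and return keys whose count reaches threshold.

    counts is the persistent key -> count mapping carried between iterations;
    key_lists holds the key columns of one iteration's input dfs.
    """
    completed = []
    for keys in key_lists:
        for k in keys:
            counts[k] += 1
            if counts[k] >= threshold:
                completed.append(k)  # key hit the threshold: report it...
                del counts[k]        # ...and drop it from the mapping
    return completed
```

Calling this once per batch of 3 dfs, with the same Counter each time, mirrors the desired cross-iteration behaviour.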
Solution 1
So you have 6 dfs in total. You can union or unionByName all of them, then groupBy('key') and aggregate using count.
df = (
    df1
    .unionByName(df2)
    .unionByName(df3)
    .unionByName(df4)
    .unionByName(df5)
    .unionByName(df6)
    .groupBy('key')
    .count()
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ZygD |
