Create df key->count mapping from multiple dfs
I have 3 input dfs, all in the format:

key | irrelevant_data
----------------------
A   | asdfg
B   | asdfg

key | irrelevant_data
----------------------
C   | asdfg
B   | asdfg
I want to combine the 3 into a dictionary-like df that maps each key to the count of how many times it has shown up, i.e. from the above example:

key | count
----------------------
A   | 1
C   | 1
B   | 2
After this runs once, I need to keep the data in the dict for the next iteration, which will have 3 new input dfs. We might come across the same keys; in that case, increase the count. The purpose of this is that once a count reaches 3, I want to remove that key from the table and retrieve it.
I was thinking of converting one of the input dfs to a MapType (it's guaranteed within a df that the keys are unique, but this is not true among all 3 input dfs):
df1 = df1.withColumn("propertiesMap", F.create_map(
    F.col("key"), F.lit(1)
))
But after that I'm not sure how to go about adding in rows from the other 2 dfs, increasing the count if the key already exists vs creating a new row if it doesn't. I'm familiar with Python, where it would be so simple:

# pseudocode of what I essentially want in PySpark, where counts is a df
counts = defaultdict(int)
for curr_df in dfs:
    for key in curr_df["key"]:
        counts[key] += 1
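The asker's intended bookkeeping (accumulate counts across iterations, pop a key once it hits 3) can be sketched in plain Python with a Counter; the names here are illustrative, not from the question:

```python
from collections import Counter

def tally_keys(counts, key_lists, threshold=3):
    """Add 1 per key occurrence; pop and return keys whose count reaches threshold.

    counts is the persistent key -> count mapping carried between iterations;
    key_lists holds the key columns of one iteration's input dfs.
    """
    completed = []
    for keys in key_lists:
        for k in keys:
            counts[k] += 1
            if counts[k] >= threshold:
                completed.append(k)  # key hit the threshold: report it...
                del counts[k]        # ...and drop it from the mapping
    return completed
```

Calling this once per batch of 3 dfs, with the same Counter each time, mirrors the desired cross-iteration behaviour.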
Solution 1
So you have 6 dfs in total. You can union or unionByName all of them, then groupBy('key') and aggregate using count.
df = (
    df1
    .unionByName(df2)
    .unionByName(df3)
    .unionByName(df4)
    .unionByName(df5)
    .unionByName(df6)
    .groupBy('key')
    .count()
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ZygD |
