Return a unique column value from a file using Spark Scala/Python
Consider a file with the two columns shown below. The products column lists the products carried by each store. I need to return the product that is sold in only one store, together with that store's name. I have tried the approach below, but I'm looking for a more efficient solution. Thanks in advance.
| store | products |
|---|---|
| walmart | eggs,cereals,milk |
| target | toys,eggs,cereals |
| costco | eggs,cereals,milk |
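For reference, here is a minimal sketch of how this sample data could be built as a DataFrame (the SparkSession setup and the names dataDF, store, and products are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unique-product").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data matching the table above: one row per store, comma-separated products
val dataDF = Seq(
  ("walmart", "eggs,cereals,milk"),
  ("target",  "toys,eggs,cereals"),
  ("costco",  "eggs,cereals,milk")
).toDF("store", "products")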
import org.apache.spark.sql.functions.collect_list

// Collect all the products strings to the driver, split on commas,
// count occurrences, and keep the product that appears exactly once
val df1 = dataDF.select("products").agg(collect_list("products")).collect()
df1(0).getSeq[String](0).flatMap(_.split(",")).groupBy(identity).mapValues(_.length).filter(_._2 == 1).keys.head
This returns toys, and I then filter the corresponding store back out of the DataFrame (sketched below). It works, but collecting every row to the driver doesn't seem efficient.
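For completeness, that follow-up filter might look like this (a sketch; uniqueProduct is a hypothetical val holding the product found above, and the implicits import from the earlier snippet is assumed):

import org.apache.spark.sql.functions.{array_contains, split}

// Hypothetical: uniqueProduct holds the single-store product found above, e.g. "toys"
val uniqueProduct = "toys"
dataDF.filter(array_contains(split($"products", ","), uniqueProduct)).select("store").show()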
The expected output:
target toys
Solution 1:[1]
You could try this:
import org.apache.spark.sql.functions.{split, explode, collect_list, size}

dataDF
  .withColumn("products", split($"products", ","))  // Parse the comma-separated string as an array
  .withColumn("product", explode($"products"))      // Explode into one row per (store, product)
  .groupBy($"product")
  .agg(collect_list($"store").as("stores"))         // Collect the stores selling each product
  .filter(size($"stores") === 1)                    // Keep products sold in only one store
  .show()
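To also surface the store name, as in the expected output, the single-element stores array can be unpacked (a sketch; element_at requires Spark 2.4+, and the imports and implicits from the snippets above are assumed):

import org.apache.spark.sql.functions.element_at

val result = dataDF
  .withColumn("product", explode(split($"products", ",")))
  .groupBy($"product")
  .agg(collect_list($"store").as("stores"))
  .filter(size($"stores") === 1)
  .select(element_at($"stores", 1).as("store"), $"product")  // Take the only store in the array

result.show()
// +------+-------+
// | store|product|
// +------+-------+
// |target|   toys|
// +------+-------+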
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Kombajn zbożowy |
