'Filter dictionary in pyspark with key names
Given a dictionary like column in a dataset, I want to grab the value from a key given that the value from another key is satisfied.
Example: Say I have a column 'statistics' in a dataset, where each data row looks as:
array
0: {"hair": "black", "eye": "white", "metric": "feet"}
1: {"hair": "blue", "eye": "white", "metric": "m"}
2: {"hair": "red", "eye": "brown", "metric": "feet"}
3: {"hair": "yellow", "eye": "white", "metric": "cm"}
I want to get the value of 'eye' whenever hair is 'black'
I tried:
select
statistics.eye("*").filter(statistics.hair, x -> x == 'black')
from arrayData
but it gives an error and I'm unable to grab the value for eye, please assist.
Solution 1:[1]
I eventually figured it out without having to first convert to a dataframe.
The aggregate command allows you to grab the value from a key given that the value from another key is satisfied. For this instance, the command below will suffice:
select
aggregate(statistics,"",(agg,item)->concat(agg,CASE WHEN item.hair == 'black' THEN item.eye ELSE "" END)) as EyeColor
from arrayData
For more details on how to use this function, see here
Solution 2:[2]
You can convert to dataframe and read it .. You can also register this as temptable and read as sql
from pyspark.sql import functions as F
df=sc.parallelize([{"hair": "black", "eye": "white", "metric": "feet"},{"hair": "blue", "eye": "white", "metric": "m"},{"hair": "red", "eye": "brown", "metric": "feet"},{"hair": "yellow", "eye": "white", "metric": "cm"}]).toDF()
>>> df.show()
+-----+------+------+
| eye| hair|metric|
+-----+------+------+
|white| black| feet|
|white| blue| m|
|brown| red| feet|
|white|yellow| cm|
+-----+------+------+
>>> df.filter(F.col("hair") == 'black').show()
+-----+-----+------+
| eye| hair|metric|
+-----+-----+------+
|white|black| feet|
+-----+-----+------+
df.createOrReplaceTempView("data")
spark.sql("select * from data where hair ='black'")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Michael Okelola |
| Solution 2 | Rafa |
