Extract element of exploded JSON via name of list element

I have JSON data that I read in using:

data_fields = spark.read.json(json_files)

where json_files is the path to the json files. To extract the data from the JSON I then use:

data_fields = data_fields.select('datarecords.fields')

I then give each record its own row via:

input_data = data_fields.select(explode('fields').alias('fields'))

Resulting in data in the fields column that looks like:

fields
[[ID,, 101],[other_var,, 'some_value']]
[[other_var,, 'some_value'],[ID,, 102],[other_var_2,, 'some_value_2']]

Each sub-list element can be referred to using "name", "status" and "value" as its components. For example:

input_data = input_data.withColumn('new_col', col('fields.name'))

This extracts the names of the sub-list elements, so for the first row in the above example, "ID" and "other_var". I am trying to extract the id for each record into its own column, to end up with:

id fields
101 [[ID,, 101],[other_var,, 'some_value']]
102 [[other_var,, 'some_value'],[ID,, 102],[other_var_2,, 'some_value_2']]

For those cases where the id is the first element in the fields column, row 1 above, I can do this via:

input_data = input_data.withColumn('id', col('fields')[0].value)
    

However, as shown, the "id" is not always the first element in the list in the fields column, and there are many hundreds of potential sub-list elements. I have therefore been trying to extract the "id" via its name rather than its position in the list, but have drawn a blank. The nearest I have come is to use the below to identify which element it is in:

    input_data = input_data.withColumn('id', array_position(col('fields.name'),"ID")) 

This returns the position, but I am not sure how to get from there to the value, unless I do something like:

result = input_data.withColumn('id',
    when(col('fields.name')[0] == 'ID', col('fields')[0].value)
    .when(col('fields.name')[1] == 'ID', col('fields')[1].value)
    .when(col('fields.name')[2] == 'ID', col('fields')[2].value))

Of course, the above is impractical with potentially hundreds of sub-list elements in the fields column.

Any help with extracting the id efficiently, regardless of its position in the list, would be appreciated.

Hopefully the above minimal example is clear.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
