How to flatten a nested JSON struct using Python in Databricks
Trying to flatten a nested JSON response using a Python Databricks DataFrame. I was able to flatten the "survey" struct successfully, but I get errors when I try the same code for "questions". Attached are the JSON response and the Databricks code that I am using. Any idea what I am doing wrong here?
ta_team_customer_experience_survey_raw_path = "abfss://raw@{}.dfs.core.windows.net/cultureamp/TA Team - Customer Experience Survey/{}".format(storageAccountName, ingest_date_path)
ta_team_customer_experience_survey_delta_path = "abfss://transformed@{}.dfs.core.windows.net/ingested/cultureamp/ta_team_customer_experience_survey/full".format(storageAccountName)

if file_exists(ta_team_customer_experience_survey_raw_path):
    ta_team_customer_experience_survey_raw = spark.read \
        .json("{}/*.json".format(ta_team_customer_experience_survey_raw_path), schema) \
        .withColumn("responses", F.explode("responses")) \
        .withColumn("survey_name", F.col("survey.name")) \
        .withColumn("survey_status", F.col("survey.status")) \
        .withColumn("survey_response_count", F.col("survey.response_count")) \
        .withColumn("survey_type", F.col("survey.type")) \
        .withColumn("survey_created_at", F.col("survey.created_at")) \
        .withColumn("survey_launched_at", F.col("survey.launched_at")) \
        .withColumn("questions_factors", F.col("questions.'ukics.question.241b1142.2738.km'.type"))  # this line raises the error
Solution 1:[1]
You can refer to the code below to flatten a complex JSON input.
from typing import Dict, List, Tuple
from pyspark.sql import DataFrame

def flatten(dataframe: DataFrame,
            explode_outer: bool = True,
            explode_pos: bool = True,
            name: str = "root") -> Dict[str, DataFrame]:
    """
    Convert a complex nested DataFrame into one (or many) flat DataFrames.
    If a column is a struct, it is flattened directly.
    If a column is an array or map, then child DataFrames are created at different granularities.
    :param dataframe: Spark DataFrame
    :param explode_outer: Should we preserve the null values on arrays?
    :param explode_pos: Create columns with the index of the ex-array
    :param name: The name of the root DataFrame
    :return: A dictionary with the names as keys and the DataFrames as values
    """
    cols_exprs: List[Tuple[str, str, str]] = Spark._flatten_struct_dataframe(df=dataframe,
                                                                             explode_outer=explode_outer,
                                                                             explode_pos=explode_pos)
    exprs_arr: List[str] = [x[2] for x in cols_exprs if Spark._is_array_or_map(x[1])]
    exprs: List[str] = [x[2] for x in cols_exprs if not Spark._is_array_or_map(x[1])]
    dfs: Dict[str, DataFrame] = {name: dataframe.selectExpr(exprs)}
    exprs = [x[2] for x in cols_exprs if not Spark._is_array_or_map(x[1]) and not x[0].endswith("_pos")]
    for expr in exprs_arr:
        df_arr = dataframe.selectExpr(exprs + [expr])
        name_new: str = Spark._build_name(name=name, expr=expr)
        dfs_new = Spark.flatten(dataframe=df_arr,
                                explode_outer=explode_outer,
                                explode_pos=explode_pos,
                                name=name_new)
        dfs = {**dfs, **dfs_new}
    return dfs
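The helper above depends on private methods (`Spark._flatten_struct_dataframe`, `Spark._is_array_or_map`, `Spark._build_name`) from the linked gist, so it is not runnable on its own. As a rough, self-contained illustration of the struct part of that flattening, here is a plain-Python sketch (the dict-shaped schema and the `flatten_struct_exprs` name are my own illustrative choices, not from the gist) that walks a nested schema and emits the dotted `selectExpr` strings a struct column expands into:

```python
# Illustrative sketch only: a dict stands in for a Spark struct schema.
# Real code would walk pyspark.sql.types.StructType instead.
def flatten_struct_exprs(schema, prefix=""):
    """Recursively expand nested struct fields into dotted select expressions,
    aliasing each leaf with an underscore-joined column name."""
    exprs = []
    for field, child in schema.items():
        path = f"{prefix}.{field}" if prefix else field
        if isinstance(child, dict):          # nested struct: recurse
            exprs.extend(flatten_struct_exprs(child, path))
        else:                                # leaf: emit "path AS alias"
            exprs.append(f"{path} AS {path.replace('.', '_')}")
    return exprs

schema = {"survey": {"name": "string", "status": "string"}, "id": "string"}
print(flatten_struct_exprs(schema))
# -> ['survey.name AS survey_name', 'survey.status AS survey_status', 'id AS id']
```

Arrays and maps need the extra explode step that the full gist handles; this sketch covers only the struct case.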
Refer to this GitHub link by Igor Tavares.
Solution 2:[2]
Finally managed to resolve the issue. If a nested struct has a dot (.) in its name, the struct name must be enclosed in backticks (grave accents) in the DataFrame column reference, not in single quotes. Below is the working code:
.withColumn("questions_factors", F.col("questions.`ukics.question.241b1142.2738.km`.type"))
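To generalize the rule: a dot inside a field name must be backtick-quoted so Spark treats it as part of the name, while the dots that separate struct levels stay unquoted. A tiny helper like the following (my own illustrative sketch, not part of the PySpark API) captures the quoting rule:

```python
def quote_field(field_name):
    """Wrap a struct field name in backticks if it contains a dot,
    so Spark reads the dots as part of the name, not as path separators."""
    return f"`{field_name}`" if "." in field_name else field_name

def col_path(*fields):
    """Join struct field names into a column reference, quoting as needed."""
    return ".".join(quote_field(f) for f in fields)

print(col_path("questions", "ukics.question.241b1142.2738.km", "type"))
# -> questions.`ukics.question.241b1142.2738.km`.type
```

The resulting string can be passed straight to `F.col(...)`, as in the working line above.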
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | AbhishekKhandave-MT |
| Solution 2 | |