'HowTo Flatten simple Json file in Azure Synapse Spark Notebook and convert to Parquet
I needed to flatten a simple Json file (json lines) and convert it into a Parquet format within a Spark Notebook in Azure Synapse Analytics. There is only one level of nested object for any column. However, I discovered that getting the schema of the dataframe did not return the schema of the nested object. I was using c# so that other company developers would not have to learn the other supported languages.
Solution 1:[1]
The code below will work for the situation indicated above. Hope it saves someone else a few hours. It also drops the parent column from the data frame after the properties of the child are added into the parent dataframe.
I had no need to make this code into a recursive transversal because we have no nested objects.
using System;
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using System.Diagnostics;
var df = spark.Read().Json("{Your source file path here}");
//get the schema of the data frame
var dfSchema = df.Schema() ;
// traverse the schema of the dataframe
foreach(var parentSchemaField in dfSchema.Fields) {
if (parentSchemaField.DataType is StructType) {
// get a new dataframe that just contains the child data from the parent
var childFrame = df.Select($"{parentSchemaField.Name}.*") ;
// traverse the schema of the child dataframe
foreach(var childSchemaField in childFrame.Schema().Fields) {
//make a new column in the parent dataframe for each parents child property
df = df.WithColumn($"{parentSchemaField.Name}.{childSchemaField.Name}",Col($"{parentSchemaField.Name}.{childSchemaField.Name}")) ;
}
// drop the parent column from the data frame its no longer needed
df = df.Drop(parentSchemaField.Name) ;
}
}
df.Write().Parquet("{Your sink file path here}") ;
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | bmukes |
