'HowTo Flatten simple Json file in Azure Synapse Spark Notebook and convert to Parquet

I needed to flatten a simple Json file (json lines) and convert it into a Parquet format within a Spark Notebook in Azure Synapse Analytics. There is only one level of nested object for any column. However, I discovered that getting the schema of the dataframe did not return the schema of the nested object. I was using c# so that other company developers would not have to learn the other supported languages.

Solution 1:^[1]

The code below will work for the situation indicated above. Hope it saves someone else a few hours. It also drops the parent column from the data frame after the properties of the child are added into the parent dataframe.

I had no need to make this code into a recursive transversal because we have no nested objects.

using System;
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using System.Diagnostics;


var df = spark.Read().Json("{Your source file path here}");

//get the schema of the data frame
var dfSchema = df.Schema() ;
// traverse the schema of the dataframe
foreach(var parentSchemaField in dfSchema.Fields) {
    
    if (parentSchemaField.DataType is StructType) {
        // get a new dataframe that just contains the child data from the parent
        var childFrame = df.Select($"{parentSchemaField.Name}.*") ;
        // traverse the schema of the child dataframe
        foreach(var childSchemaField in childFrame.Schema().Fields) {   
            //make a new column in the parent dataframe for each parents child property
            df = df.WithColumn($"{parentSchemaField.Name}.{childSchemaField.Name}",Col($"{parentSchemaField.Name}.{childSchemaField.Name}")) ;
        }
        // drop the parent column from the data frame its no longer needed
        df = df.Drop(parentSchemaField.Name) ;
    }
}
df.Write().Parquet("{Your sink file path here}") ;

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	bmukes

'HowTo Flatten simple Json file in Azure Synapse Spark Notebook and convert to Parquet

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]