Data Factory: How to flatten a JSON hierarchy
I have a JSON file in a blob container in an Azure Storage Account and I want to use the Copy Data activity in ADF to get the data into a SQL DB. I have also looked into using Data Flows in ADF but haven't succeeded there either.
Now when I use the Copy Data activity, the output only contains the first entry in "lines".
The JSON file has the following hierarchy:
And my goal is to have each "line" in each "order" as a separate row in the SQL DB.
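Roughly, the flattened shape I'm aiming for would be something like the table below (the table name and data types are only illustrative guesses; the column names mirror the fields in the JSON):
-- Illustrative target only: one row per order line, with the parent
-- order and customer fields repeated on each row.
CREATE TABLE dbo.OrderLines (
    customerId    INT,
    soid          BIGINT,            -- order id
    orderId       UNIQUEIDENTIFIER,  -- "id" in the JSON
    tableId       INT,
    revenueUnitId INT,
    solid         BIGINT,            -- line id
    articleId     INT,
    quantity      INT,
    taxPctValue   DECIMAL(5,2),
    articleName   NVARCHAR(200),
    netAmount     DECIMAL(18,2),
    grossAmount   DECIMAL(18,2)
);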
EDIT 1: I am now using Data Flows, and data is added to both the blob (sink1) and the SQL DB (sink2) the way I want, i.e. the data is flattened. The problem is that the Data Flow gives errors that I do not understand.
The flow looks like this:
And even though I have specified the file name in the Data Flow, the output file is named "part-00000-609332d2-8494-4b68-b481-f237f62cc6c8-c000.json".
The output error details of the pipeline which runs the data flow is as follows:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sink1': org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.","Details":"org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2607)\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2617)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteFile(NativeAzureFileSystem.java:2657)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem$2.execute(NativeAzureFileSystem.java:2391)\n\tat org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor.executeParallel(AzureFileSystemThreadPoolExecutor.java:223)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteWithoutAuth(NativeAzureFileSystem.java:2403)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:2453)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:1936)\n\tat org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter."}"
Here is a sample of the JSON data as text:
{ "customerId": 2241, "soidGt": null, "timestampGt": "2022-04-25T00:00:00", "timestampLt": null, "orders": [ { "soid": 68810264, "id": "a4b84f56-c6a4-4b37-bffb-a34d04c513c4", "tableId": 4676, "revenueUnitId": 682, "lines": [ { "solid": 147557444, "articleId": 70949, "quantity": 3, "taxPctValue": 25, "articleName": "Diavola", "netAmount": 516, "grossAmount": 645 }, { "solid": 147557445, "articleId": 70961, "quantity": 1, "taxPctValue": 25, "articleName": "Parma ai pomodori secchi", "netAmount": 183.2, "grossAmount": 229 } ], "payments": [ { "soptid": 70655447, "paymentTypeId": 2, "amount": 874 } ] }, { "soid": 68810622, "id": "1b356f45-7df7-42ba-8d50-8b14cf67180d", "tableId": 4546, "revenueUnitId": 83, "lines": [ { "solid": 147557985, "articleId": 71159, "quantity": 2, "taxPctValue": 25, "articleName": "Hansa 0,4L", "netAmount": 152, "grossAmount": 190 }, { "solid": 147557986, "articleId": 70948, "quantity": 1, "taxPctValue": 25, "articleName": "Parma", "netAmount": 175.2, "grossAmount": 219 }, { "solid": 147557987, "articleId": 70918, "quantity": 1, "taxPctValue": 25, "articleName": "Focaccia sarda", "netAmount": 71.2, "grossAmount": 89 }, { "solid": 147557988, "articleId": 70935, "quantity": 1, "taxPctValue": 25, "articleName": "Pasta di manzo", "netAmount": 196, "grossAmount": 245 } ], "payments": [ { "soptid": 70655798, "paymentTypeId": 2, "amount": 750 } ] }
Solution 1:[1]
In Data Flows you can find the Flatten transformation, which can unroll arrays. It's the cyan one in the Formatters section of the transformation menu.
According to this Stack Overflow answer it is also possible with a ForEach loop, but that feels hacky.
Solution 2:[2]
Just in case you do ever consider using Azure SQL DB's built-in abilities to work with JSON, e.g. OPENJSON, JSON_VALUE and JSON_QUERY, here is a common pattern I use: land the data in a SQL table using Azure Data Factory (ADF), then work with it in SQL, e.g.:
CREATE TABLE #tmp (
jsonId INT IDENTITY PRIMARY KEY,
[json] NVARCHAR(MAX)
);
GO
DECLARE @json NVARCHAR(MAX) = '{
"customerId": 2241,
"soidGt": null,
"timestampGt": "2022-04-25T00:00:00",
"timestampLt": null,
"orders": [
{
"soid": 68810264,
"id": "a4b84f56-c6a4-4b37-bffb-a34d04c513c4",
"tableId": 4676,
"revenueUnitId": 682,
"lines": [
{
"solid": 147557444,
"articleId": 70949,
"quantity": 3,
"taxPctValue": 25,
"articleName": "Diavola",
"netAmount": 516,
"grossAmount": 645
},
{
"solid": 147557445,
"articleId": 70961,
"quantity": 1,
"taxPctValue": 25,
"articleName": "Parma ai pomodori secchi",
"netAmount": 183.2,
"grossAmount": 229
}
],
"payments": [
{
"soptid": 70655447,
"paymentTypeId": 2,
"amount": 874
}
]
},
{
"soid": 68810622,
"id": "1b356f45-7df7-42ba-8d50-8b14cf67180d",
"tableId": 4546,
"revenueUnitId": 83,
"lines": [
{
"solid": 147557985,
"articleId": 71159,
"quantity": 2,
"taxPctValue": 25,
"articleName": "Hansa 0,4L",
"netAmount": 152,
"grossAmount": 190
},
{
"solid": 147557986,
"articleId": 70948,
"quantity": 1,
"taxPctValue": 25,
"articleName": "Parma",
"netAmount": 175.2,
"grossAmount": 219
},
{
"solid": 147557987,
"articleId": 70918,
"quantity": 1,
"taxPctValue": 25,
"articleName": "Focaccia sarda",
"netAmount": 71.2,
"grossAmount": 89
},
{
"solid": 147557988,
"articleId": 70935,
"quantity": 1,
"taxPctValue": 25,
"articleName": "Pasta di manzo",
"netAmount": 196,
"grossAmount": 245
}
],
"payments": [
{
"soptid": 70655798,
"paymentTypeId": 2,
"amount": 750
}
]
}
]
}'
INSERT INTO #tmp ( json ) VALUES ( @json );
-- Shred the JSON with OPENJSON and JSON_VALUE
SELECT
JSON_VALUE( t.[json], '$.customerId' ) customerId,
JSON_VALUE( t.[json], '$.soidGt' ) soidGt,
JSON_VALUE( t.[json], '$.timestampGt' ) timestampGt,
JSON_VALUE( t.[json], '$.timestampLt' ) timestampLt,
-- orders
JSON_VALUE( o.[value], '$.soid' ) soid,
JSON_VALUE( o.[value], '$.id' ) id,
JSON_VALUE( o.[value], '$.tableId' ) tableId,
JSON_VALUE( o.[value], '$.revenueUnitId' ) revenueUnitId,
-- lines
JSON_VALUE( l.[value], '$.solid' ) solid,
JSON_VALUE( l.[value], '$.articleId' ) articleId,
JSON_VALUE( l.[value], '$.quantity' ) quantity,
JSON_VALUE( l.[value], '$.taxPctValue' ) taxPctValue,
JSON_VALUE( l.[value], '$.articleName' ) articleName,
JSON_VALUE( l.[value], '$.netAmount' ) netAmount,
JSON_VALUE( l.[value], '$.grossAmount' ) grossAmount,
-- payments
JSON_VALUE( p.[value], '$.soptid' ) soptid,
JSON_VALUE( p.[value], '$.paymentTypeId' ) paymentTypeId,
JSON_VALUE( p.[value], '$.amount' ) amount
FROM #tmp t
CROSS APPLY OPENJSON( t.[json], '$.orders' ) o
CROSS APPLY OPENJSON( o.[value], '$.lines' ) l
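-- note: cross applying both lines and payments returns one row per line/payment combination within each order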
CROSS APPLY OPENJSON( o.[value], '$.payments' ) p;
My results:
This is more of an ELT (and code-heavy) approach, taking advantage of the existing compute you already have running (Azure SQL DB), rather than an ETL approach such as Mapping Data Flows (low code), which spins up additional compute in the background. It's all managed for you either way, and there's always more than one way to do things; whatever works for you.
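As a variant of the same pattern, if you'd rather get typed columns straight out of OPENJSON instead of calling JSON_VALUE per attribute, something like the sketch below should work for the order lines (the data types here are just guesses at sensible ones):
-- Same #tmp table as above; OPENJSON ... WITH projects typed columns,
-- and the nested "lines" array is passed through AS JSON and shredded again.
SELECT
    JSON_VALUE( t.[json], '$.customerId' ) customerId,
    o.soid, o.id, o.tableId, o.revenueUnitId,
    l.solid, l.articleId, l.quantity, l.taxPctValue, l.articleName, l.netAmount, l.grossAmount
FROM #tmp t
CROSS APPLY OPENJSON( t.[json], '$.orders' )
    WITH (
        soid          BIGINT           '$.soid',
        id            UNIQUEIDENTIFIER '$.id',
        tableId       INT              '$.tableId',
        revenueUnitId INT              '$.revenueUnitId',
        [lines]       NVARCHAR(MAX)    '$.lines' AS JSON
    ) o
CROSS APPLY OPENJSON( o.[lines] )
    WITH (
        solid         BIGINT         '$.solid',
        articleId     INT            '$.articleId',
        quantity      INT            '$.quantity',
        taxPctValue   DECIMAL(5,2)   '$.taxPctValue',
        articleName   NVARCHAR(200)  '$.articleName',
        netAmount     DECIMAL(18,2)  '$.netAmount',
        grossAmount   DECIMAL(18,2)  '$.grossAmount'
    ) l;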
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | wBob |



