'Unable to open or query .parquet files due to corrupted column
I am sending JSON telemetry data from Azure Stream Analytics to Azure Data Lake Gen2 serialized as .parquet files. From the data lake I've then created a view in my Azure Synapse Serverless SQL pool that I can connect to and query data for reports.
Every once in a while I will run a query and get back the following error:
Error handling external file: 'Invalid metadata in parquet file. Number of rows in metadata does not match actual number of rows in parquet file.'. File/External table name: 'https://test123.dfs.core.windows.net/devicetelemetry/2021/12/03/20/-1875592941_d9a0239529f04e1eb587b83d50bbb590_1.parquet'.
When I try to open the .paqruet file in question with "Apache Parquet Viewer" or any other parquet viewer it errors and refuses to open the file complaining that the column named 'data' is undefined. The data column is a JSON string with various sensor readings from IoT devices.
My StreamAnalytics query that sends data to my ADLS looks like this:
SELECT
Tel.identities.corporationId AS corporationId, Tel.identities.deviceId,
Tel.deviceTelemetry.version, Tel.deviceTelemetry.TimeStamp AS dateTimeStamp,
Tel.deviceTelemetry.data
INTO
deviceTelemetryADLS
FROM
data AS Tel
WHERE Tel.deviceTelemetry.data IS NOT null
Question
Wouldn't the WHERE clause used here ensure that the data is always present? Otherwise it would not be sent? Otherwise how else are my .parquet files being corrupted?
Solution 1:[1]
The "data" column in question was a nested JSON array of dynamic objects which SA didn't seem to be able to convert correctly into parquet format. What we ended up doing was creating a Stream Analytics Function that took in this array and converted it to a string
Function
function main(InputJSON) {
var InputJSONString = JSON.stringify(InputJSON);
return InputJSONString;
}
Query
Tel.identities.corporationId AS corporationId, Tel.identities.deviceId, Tel.deviceTelemetry.version, Tel.deviceTelemetry.TimeStamp AS dateTimeStamp, udf.ConvertToJSONString(Tel.deviceTelemetry.data) as deviceData
INTO
deviceTelemetryADLS
FROM
data AS Tel
WHERE Tel.deviceTelemetry.data IS NOT null
So we stored this nested JSON as a single column in our SQL as we didn't need to index these "data" fields. We deserialized it into an object when querying it to our .NET application
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Josh L |