ADF MergeFiles copy activity inexplicably slow
We have a case where there are two parquet files in blob storage that need to be merged and written out as a single parquet file in a different location in blob storage. The activity that does this looks like:
{
    "name": "Mergefiles",
    "type": "Copy",
    "policy": {
        "timeout": "7.00:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false,
        "secureInput": false
    },
    "userProperties": [],
    "typeProperties": {
        "source": {
            "type": "ParquetSource",
            "storeSettings": {
                "type": "AzureBlobStorageReadSettings",
                "recursive": false,
                "wildcardFileName": "*.*",
                "enablePartitionDiscovery": false
            }
        },
        "sink": {
            "type": "ParquetSink",
            "storeSettings": {
                "type": "AzureBlobStorageWriteSettings",
                "copyBehavior": "MergeFiles"
            }
        },
        "enableStaging": false
    },
    "inputs": [
        {
            "referenceName": "{STAGING_DATASET_NAME}",
            "type": "DatasetReference",
            "parameters": { ... }
        }
    ],
    "outputs": [
        {
            "referenceName": "{MERGE_OUTPUT_DATASET_NAME}",
            "type": "DatasetReference",
            "parameters": { ... }
        }
    ]
}
Some timings, from two examples:
- Folder has two files: a 3 MB and a 450 MB parquet file. The merge takes 16 minutes.
- Folder has just one file: a 550 MB parquet file. The "merge" takes 20 minutes.
With example 2 it's not even merging anything, since the `*.*` wildcard only matches the single parquet file.
My understanding is that MergeFiles simply appends the files together with no other logic, so what could be taking it so long? The parquet files should have the same columns, although we haven't defined a specific schema because we're not doing any sort of mapping; we just want to combine similar parquet files together.
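To put the timings above in perspective, the implied effective throughput works out to well under 1 MB/s for both runs, which is far below what a blob-to-blob copy normally achieves. A quick back-of-the-envelope check (the function name here is just for illustration):

```python
def throughput_mb_per_s(size_mb: float, minutes: float) -> float:
    """Effective copy throughput in MB/s for a given payload size and duration."""
    return size_mb / (minutes * 60)

# Example 1: 3 MB + 450 MB merged in 16 minutes
print(round(throughput_mb_per_s(453, 16), 2))  # ~0.47 MB/s
# Example 2: a single 550 MB file "merged" in 20 minutes
print(round(throughput_mb_per_s(550, 20), 2))  # ~0.46 MB/s
```

The near-identical throughput in both cases suggests the cost scales with total bytes processed rather than with the number of files being merged.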
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow