Azure Databricks performance

My requirement was to process approximately 1 TB of data stored in an Azure storage container. The container holds millions of JSON files, which are multi-part in nature.
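The processing is essentially a Spark read over the whole container. A minimal sketch of that step (the storage account, container, and paths below are placeholders, not my exact job) looks like this:

```python
# Minimal sketch of the read step, assuming ABFS access to the container.
# Storage account, container, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-ingest").getOrCreate()

# Millions of multi-part JSON files; Spark lists and reads them
# in parallel across the worker nodes.
df = spark.read.json(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/input/"
)

# Placeholder for the actual processing, then write the result back out.
df.write.mode("overwrite").parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/output/"
)
```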

For this I was using HDInsight, which was able to process the data in approximately 45 minutes:

Worker nodes (1-4, autoscale): 16 cores, 112 GB
Head nodes (2): 4 cores, 28 GB

We planned to migrate to an Azure Databricks Spark cluster.

Configuration of the cluster used:

Worker nodes (4-10, autoscale): 8 cores, 56 GB, memory optimized
Head nodes: 4 cores, 28 GB

But the job kept running for more than 2.5 hours without completing, and I can see that it used at most 4 worker nodes and did not scale up to leverage the remaining worker nodes to speed up the process.

Can anyone help if I am doing something wrong here?



Solution 1:[1]

Which task type did you select while creating the task?


In the Type drop-down, select Notebook, JAR, Spark Submit, Python, or Pipeline.

If you have chosen Spark Submit, autoscaling isn't supported.

For autoscaling, you need to choose one of the other task types.

Additionally, autoscaling behaves differently depending on whether it is optimized or standard, and whether it is applied to an all-purpose or a job cluster.

I suggest you go through the official documentation on cluster size and autoscaling to get better insight into this.
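For reference, an autoscaling cluster is defined with an autoscale block (min/max workers) instead of a fixed worker count. A minimal sketch of creating such a cluster through the Clusters REST API (the workspace URL, token, runtime version, and node type below are placeholder assumptions, not your exact setup) could look like this:

```python
# Hedged sketch: creating a Databricks cluster with autoscaling via the REST API.
# Workspace URL, token, runtime version, and node type are placeholders.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "json-processing",
    "spark_version": "11.3.x-scala2.12",  # any supported Databricks runtime
    "node_type_id": "Standard_DS13_v2",   # 8 cores / 56 GB, memory optimized
    "autoscale": {                        # autoscale block instead of num_workers
        "min_workers": 4,
        "max_workers": 10,
    },
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id
```

Whether the cluster actually scales past the minimum then depends on the task type and on having enough parallel work (tasks/partitions) queued to justify additional workers.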

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 UtkarshPal-MT