Stop AutoML sampling the dataset

Whenever I run an Azure Databricks AutoML experiment, it samples the dataset, using only around 66% of the rows.

I currently have 40,000 rows, each with 600 features.

Is there a way to force AutoML to use all the rows? I have tried increasing the memory of the compute I am using, but it does not appear to help.



Solution 1:[1]

Although AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node.

AutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary.

In Databricks Runtime 9.1 LTS ML through Databricks Runtime 10.5 ML, the sampling fraction does not depend on the cluster’s node type or the amount of memory on each node.

In Databricks Runtime 11.0 ML and above:

• The sampling fraction increases for worker nodes with more memory.

You can increase the sample size by choosing a Memory optimized worker type when you create the cluster.

You can also increase the sample size by choosing a larger value for spark.task.cpus in the Spark configuration for the cluster. The default setting is 1; the maximum value is the number of CPUs on the worker node. When you increase this value, the sample size is larger, but fewer trials run in parallel.

For example, on a machine with 4 cores and 64GB of total RAM, the default spark.task.cpus=1 runs 4 trials per worker, with each trial limited to 16GB of RAM. If you set spark.task.cpus=4, each worker runs only one trial, but that trial can use all 64GB of RAM.
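The trade-off described above can be sketched as a quick back-of-the-envelope calculation. This is not a Databricks API, just illustrative arithmetic; the worker specs are the example's assumptions:

```python
def trial_layout(worker_cores: int, worker_ram_gb: int, task_cpus: int):
    """Estimate concurrent AutoML trials per worker and RAM available to each.

    Assumes each trial occupies `task_cpus` cores and that worker RAM is
    split evenly across the trials running concurrently on that worker.
    """
    trials_per_worker = worker_cores // task_cpus
    ram_per_trial_gb = worker_ram_gb / trials_per_worker
    return trials_per_worker, ram_per_trial_gb

# 4-core, 64GB worker, as in the example above
print(trial_layout(4, 64, 1))  # → (4, 16.0): 4 parallel trials, 16GB each
print(trial_layout(4, 64, 4))  # → (1, 64.0): 1 trial with all 64GB
```

A larger RAM budget per trial raises the memory estimate AutoML can work with, which is why increasing spark.task.cpus (or choosing a memory-optimized worker) lets it sample a larger fraction of the dataset.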

Reference: https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/automl#--sampling-large-datasets

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: AbhishekKhandave-MT