Triggering a Databricks job from Airflow without starting a new cluster
I am using Airflow to trigger jobs on Databricks. I have many DAGs running Databricks jobs, and I wish to use only one cluster instead of many, since to my understanding this will reduce the costs these tasks generate.
Using DatabricksSubmitRunOperator there are two ways to run a job on Databricks: either on an existing cluster, referencing it by id,
'existing_cluster_id' : '1234-567890-word123',
or by starting a new cluster:
'new_cluster': {
'spark_version': '2.1.0-db3-scala2.11',
'num_workers': 2
},
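The two variants above differ only in the `json` payload handed to the operator. A minimal sketch of both payloads as plain Python dicts, using the placeholder cluster id and Spark version from the question (the notebook path is made up for illustration):

```python
# Sketch of the two `json` payloads for DatabricksSubmitRunOperator.
# Cluster id and Spark version are the placeholder values from the question;
# the notebook path is hypothetical.

notebook_task = {"notebook_path": "/Users/me/my_notebook"}

# Option 1: run on an already-running cluster, referenced by id.
run_on_existing = {
    "existing_cluster_id": "1234-567890-word123",
    "notebook_task": notebook_task,
}

# Option 2: spin up a fresh cluster just for this run.
run_on_new = {
    "new_cluster": {
        "spark_version": "2.1.0-db3-scala2.11",
        "num_workers": 2,
    },
    "notebook_task": notebook_task,
}
```

In a DAG you would pass one of these dicts as the operator's `json` argument.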
Now I would like to avoid starting a new cluster for each task. However, the existing cluster shuts down during downtime, so it will no longer be reachable through its id and I will get an error. As I see it, the only option is a new cluster.
1) Is there a way to have a cluster being callable by id even when it is down?
2) Do people simply keep the clusters alive?
3) Or am I completely wrong and starting clusters for each task won't generate more costs?
4) Is there something I missed completely?
Solution 1:[1]
It seems Databricks has added an option recently to reuse a job cluster within a job, sharing it between tasks.
Until now, each task had its own cluster to accommodate the different types of workloads. While this flexibility allows for fine-grained configuration, it can also introduce a time and cost overhead for cluster startup or underutilization during parallel tasks.
In order to maintain this flexibility, but further improve utilization, we are excited to announce cluster reuse. By sharing job clusters over multiple tasks customers can reduce the time a job takes, reduce costs by eliminating overhead and increase cluster utilization with parallel tasks.
This seems to be available in the new API as well. https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate
job_clusters Array of objects (JobCluster) <= 100 items
A list of job cluster specifications that can be shared and reused by tasks of this job. Libraries cannot be declared in a shared job cluster. You must declare dependent libraries in task settings.
To fit your use case, you could start a new job cluster, share it between your tasks, and it will automatically shut down when the job ends.
I still don't fully understand how we might keep a job cluster hot all the time if we want to have jobs start with no latency. I also don't think it's possible to share these clusters between jobs.
For now this information should provide a decent lead.
Solution 2:[2]
In fact, when you want to execute a notebook via Airflow, you have to specify the characteristics of your cluster.
Databricks will treat your notebook as a new job and run it on the cluster you specified. When the execution finishes, the created cluster is deleted automatically.
To verify this: while the job is running in Airflow, go to the logs; they give you a link that forwards you to Databricks. There, click on View cluster and you will see the run executing on a newly created cluster named, for example, job-1310-run-980.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | WarSame |
| Solution 2 | ben othman zied |
