DataprocClusterCreateOperator doesn't have a temp_bucket variable to define
I am trying to create a Dataproc cluster via DataprocClusterCreateOperator in Apache Airflow. Airflow version: 1.10.15. Composer version: 1.16.4. I want to assign a project-owned temp bucket to the cluster instead of the bucket Google creates at runtime. This option is available when creating a cluster from the command line via the --temp-bucket flag, but the same parameter cannot be passed to DataprocClusterCreateOperator.
Dataproc operator info: https://airflow.apache.org/docs/apache-airflow/1.10.15/_modules/airflow/contrib/operators/dataproc_operator.html
Creating the cluster via the gcloud command (note the --temp-bucket flag):
```bash
gcloud dataproc clusters create cluster-name \
    --properties=core:fs.defaultFS=gs://defaultFS-bucket-name \
    --region=region \
    --bucket=staging-bucket-name \
    --temp-bucket=project-owned-temp-bucket-name \
    other args ...
```
And the corresponding operator in my DAG:

```python
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

create_cluster = DataprocClusterCreateOperator(
    task_id="create_cluster",
    project_id="my-project-id",
    cluster_name="my-dataproc-{{ ds_nodash }}",
    num_workers=2,
    storage_bucket="project_bucket",
    region="us-east4",
    # ... other params ...
)
```
Solution 1:[1]
Unfortunately, the DataprocClusterCreateOperator in Airflow does not support the temp-bucket property. You can set this property only with the gcloud command or the REST API.
With the REST API, you can set the fields ClusterConfig.configBucket and ClusterConfig.tempBucket in a clusters.create request.
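For reference, here is a minimal sketch of that request using the google-cloud-dataproc Python client (v2.x assumed); all project, region, and bucket names are placeholders:

```python
from google.cloud import dataproc_v1

project_id = "my-project-id"
region = "us-east4"

# The cluster client must target the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "my-dataproc-cluster",
    "config": {
        # These map to ClusterConfig.configBucket / ClusterConfig.tempBucket
        # in the REST API.
        "config_bucket": "staging-bucket-name",
        "temp_bucket": "project-owned-temp-bucket-name",
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() blocks
# until the cluster is actually up.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
```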
A possible workaround is to create the cluster from a scheduled job that calls the REST API; see the documentation for details.
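Alternatively, since the gcloud CLI does accept --temp-bucket, one sketch of a workaround inside Airflow 1.10.15 itself is to shell out to gcloud from a BashOperator. All names below are placeholders, and this assumes gcloud is available on the workers (it is on Composer 1 environments):

```python
from airflow.operators.bash_operator import BashOperator

# Creates the cluster with gcloud instead of DataprocClusterCreateOperator,
# solely to gain access to the --temp-bucket flag. bash_command is a
# templated field, so {{ ds_nodash }} is rendered per run.
create_cluster = BashOperator(
    task_id="create_cluster",
    bash_command=(
        "gcloud dataproc clusters create my-dataproc-{{ ds_nodash }} "
        "--project=my-project-id "
        "--region=us-east4 "
        "--bucket=staging-bucket-name "
        "--temp-bucket=project-owned-temp-bucket-name "
        "--num-workers=2"
    ),
)
```

The trade-off is that you lose the operator's built-in idempotency and cluster-state handling, so a deletion or cleanup task may be needed alongside it.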
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
