Is it possible to loop through PySpark job submissions and parameterize each submit?
I have a DAG (partial code snippet below) that will submit multiple jobs to Dataproc on Google Cloud:
for collection in collection_list:
    op = dataproc.DataprocSubmitJobOperator(
        task_id='load_collection_{}'.format(collection),
        job=PYSPARK_JOB_READ_FROM_MONGO
    )
and I have
PYSPARK_JOB_READ_FROM_MONGO = {
    "reference": {"project_id": project_id},
    "placement": {"cluster_name": 'spark-ingest-mongodb-{{ ds_nodash }}'},
    "pyspark_job": {
        "main_python_file_uri": "gs://greenline-demo-341617-spark-src/Demo/01_MongoDB/ingest.py",
        "args": [
            "gcsPath=gs://testing_part/path/",
            "test.id=404",
            "correlation.id=AUDIT_2222",
            "b.url=http://testing-v1.net/test/"
        ]
    }
}
Let's say I need to change the "gcsPath" parameter for each submit. How do I pass it?
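One common way to handle this (a minimal sketch, not from the original post) is to build a fresh job dict for each collection inside the loop, deep-copying the base spec and rewriting the gcsPath argument before constructing each operator. The collection_gcs_paths mapping, its bucket paths, and the region value below are hypothetical placeholders; PYSPARK_JOB_READ_FROM_MONGO is assumed to be defined as shown above.

import copy

from airflow.providers.google.cloud.operators import dataproc

# Hypothetical mapping from collection name to its GCS output path
# (placeholder paths; adjust to your own layout).
collection_gcs_paths = {
    "customers": "gs://testing_part/customers/",
    "orders": "gs://testing_part/orders/",
}
collection_list = list(collection_gcs_paths)

for collection in collection_list:
    # Deep-copy the base job spec so each task gets its own dict
    # instead of every task sharing (and mutating) the same object.
    job = copy.deepcopy(PYSPARK_JOB_READ_FROM_MONGO)

    # Swap in the per-collection gcsPath argument, leaving the other args untouched.
    job["pyspark_job"]["args"] = [
        "gcsPath={}".format(collection_gcs_paths[collection])
        if arg.startswith("gcsPath=") else arg
        for arg in job["pyspark_job"]["args"]
    ]

    op = dataproc.DataprocSubmitJobOperator(
        task_id="load_collection_{}".format(collection),
        job=job,
        region="us-central1",  # assumed region; use the region your cluster runs in
    )

Because each iteration passes its own copy of the job dict, every generated task submits with its own gcsPath while the rest of the job specification stays identical.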
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.
