Dataflow: Apache Beam WARNING: apache_beam.utils.retry:Retry with exponential backoff
I have a simple pipeline that ran fine three weeks ago, but when I went back to the code to enhance it and tried to run it again, it returned the following error:
WARNING:apache_beam.utils.retry:Retry with exponential backoff: waiting for 10.280192286716229 seconds before retrying exists because we caught exception: TypeError: string indices must be integers Traceback for above exception (most recent call last):
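For context, that TypeError typically means some code indexed a string as if it were a dict, for example when a response that was expected to be parsed JSON is still a raw string. A generic illustration of the failure mode (not Beam's internal code):

```python
import json

payload = '{"access_token": "abc123"}'  # a raw JSON string, not a dict

# Indexing a str with a str key raises the TypeError seen in the logs.
try:
    payload["access_token"]
except TypeError as exc:
    print(f"TypeError: {exc}")

# Parse the JSON first, then index the resulting dict.
token = json.loads(payload)["access_token"]
print(token)  # abc123
```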
I am running the Dataflow script from Cloud Shell on the Google Cloud Platform, by simply executing python3 dataflow.py.
The code is as follows; it used to submit the job to Dataflow without issue:
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam import coders
pipeline_args = [
    '--runner=DataflowRunner',
    '--job_name=my-job-name',
    '--project=my-project-id',
    '--region=europe-west2',
    '--temp_location=gs://mybucket/temp',
    '--staging_location=gs://mybucket/staging',
]
options = PipelineOptions(pipeline_args)
p = beam.Pipeline(options=options)
rows = (
    p | 'Read daily Spot File' >> beam.io.ReadFromText(
        file_pattern='gs://bucket/filename.gz',
        compression_type='gzip',
        coder=coders.BytesCoder(),
        skip_header_lines=0))
p.run()
Any advice as to why this has started happening would be great to know. Thanks in advance.
Solution 1:[1]
I ran into the same kind of error while following this [tutorial][1]:
python3 PubSubToGCS.py \
--project=mango-s-11-a81277cf \
--region=us-east1 \
--input_topic=projects/mango-s-11-a81277cf/topics/pubsubtest \
--output_path=gs://mangobucket001/samples/output \
--runner=DataflowRunner \
--window_size=2 \
--num_shards=2 \
--temp_location=gs://mangobucket001/temp
INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/tmpaelpxw64', 'apache-beam==2.33.0', '--no-deps', '--no-binary', ':all:']
INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI: dataflow_python_sdk.tar
INFO:apache_beam.runners.portability.stager:Downloading binary distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/tmpaelpxw64', 'apache-beam==2.33.0', '--no-deps', '--only-binary', ':all:', '--python-version', '38', '--implementation', 'cp', '--abi', 'cp38', '--platform', 'manylinux1_x86_64']
INFO:apache_beam.runners.portability.stager:Staging binary distribution of the SDK from PyPI: apache_beam-2.33.0-cp38-cp38-manylinux1_x86_64.whl
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
INFO:root:Default Python SDK image for environment is apache/beam_python3.8_sdk:2.33.0
INFO:root:Using provided Python SDK container image: gcr.io/cloud-dataflow/v1beta3/python38-fnapi:2.33.0
INFO:root:Python SDK container image set to "gcr.io/cloud-dataflow/v1beta3/python38-fnapi:2.33.0" for Docker environment
INFO:apache_beam.runners.dataflow.internal.apiclient:Defaulting to the temp_location as staging_location: gs://mangobucket001/temp
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://mangobucket001/temp/beamapp-maan-0214122440-788240.1644841480.788658/pickled_main_session...
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
WARNING:apache_beam.utils.retry:Retry with exponential backoff: waiting for 3.054787201157167 seconds before retrying _gcs_file_copy because we caught exception: OSError: Could not upload to GCS path gs://mangobucket001/temp/beamapp-maan-0214122440-788240.1644841480.788658: access denied. Please verify that credentials are valid and that you have write access to the specified path.
Traceback for above exception (most recent call last):
File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/utils/retry.py", line 253, in wrapper
return fun(*args, **kwargs)
File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 559, in _gcs_file_copy
self.stage_file(to_folder, to_name, f, total_size=total_size)
File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 642, in stage_file
raise IOError((
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://mangobucket001/temp/beamapp-maan-0214122440-788240.1644841480.788658/pickled_main_session...
WARNING:apache_beam.utils.retry:Retry with exponential backoff: waiting for 5.22755214575824 seconds before retrying _gcs_file_copy because we caught exception: OSError: Could not upload to GCS path gs://mangobucket001/temp/beamapp-maan-0214122440-788240.1644841480.788658: access denied. Please verify that credentials are valid and that you have write access to the specified path.
Traceback for above exception (most recent call last):
File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/utils/retry.py", line 253, in wrapper
return fun(*args, **kwargs)
File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 559, in _gcs_file_copy
self.stage_file(to_folder, to_name, f, total_size=total_size)
File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 642, in stage_file
raise IOError((
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://mangobucket001/temp/beamapp-maan-0214122440-788240.1644841480.788658/pickled_main_session...
[... the same upload attempt, access-denied warning, and traceback repeat with increasing backoff (roughly 19, 39, 61, then 122 seconds between retries) ...]
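The growing wait times in the warnings above come from jittered exponential backoff, which Beam's retry helper uses so that concurrent clients don't all retry at the same instant. A rough sketch of the scheme (an approximation for illustration, not the exact algorithm in apache_beam.utils.retry):

```python
import random

def backoff_delays(initial=3.0, factor=2.0, num_retries=6, fuzz=0.5, seed=None):
    """Yield jittered, exponentially growing wait times in seconds."""
    rng = random.Random(seed)
    for attempt in range(num_retries):
        base = initial * (factor ** attempt)
        # Jitter: shrink each delay by up to `fuzz` of its value, so
        # retries from different clients spread out over time.
        yield base * (1 - fuzz * rng.random())

for wait in backoff_delays(seed=42):
    print(f"waiting for {wait} seconds before retrying")
```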
[1]: https://cloud.google.com/pubsub/docs/pubsub-dataflow
Solution 2:[2]
I'm not sure about the original issue, but I can speak to Usman's post, which describes a problem I ran into myself.
Python doesn't use gcloud auth to authenticate; it reads the environment variable GOOGLE_APPLICATION_CREDENTIALS instead. So before you run the python command to launch the Dataflow job, you need to set that environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key"
More info on setting up the environment variable: https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable
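As a quick pre-flight check before launching the job, you can verify that the variable actually points at a readable service-account key file. A minimal sketch using only the standard library (it checks the file locally; it does not validate the key against Google):

```python
import json
import os

def check_credentials(path=None):
    """Return the service account's email if the key file looks usable,
    otherwise None. Service-account key files are JSON documents that
    include a client_email field."""
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path or not os.path.isfile(path):
        return None
    with open(path) as f:
        key = json.load(f)
    return key.get("client_email")

email = check_credentials()
print(email or "No usable key file; set GOOGLE_APPLICATION_CREDENTIALS")
```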
Then you'll have to make sure that the account you set up has the necessary permissions in your GCP project.
Permissions and service accounts:
User service account or user account: it needs the Dataflow Admin role at the project level and to be able to act as the worker service account (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account).
Worker service account: there is one worker service account per Dataflow pipeline. It needs the Dataflow Worker role at the project level, plus whatever permissions the resources accessed by the pipeline require (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account). Example: if the pipeline's input is a Pub/Sub topic and its output is a BigQuery table, the worker service account needs read access to the topic and write permission on the BigQuery table.
Dataflow service account: this is the account that gets automatically created when you enable the Dataflow API in a project. It automatically gets the Dataflow Service Agent role at the project level (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Usman Ali Maan |
| Solution 2 | mellybear |
