Dataflow: Apache Beam WARNING: apache_beam.utils.retry:Retry with exponential backoff

I have a simple pipeline that worked fine three weeks ago. I've since gone back to the code to enhance it, and when I tried to run it again it returned the following error:

WARNING:apache_beam.utils.retry:Retry with exponential backoff: waiting for 10.280192286716229 seconds before retrying exists because we caught exception: TypeError: string indices must be integers
Traceback for above exception (most recent call last):

I am running the Dataflow script via Cloud Shell on Google Cloud Platform, simply by executing `python3 dataflow.py`.

The code is as follows, and it used to submit the job to Dataflow without issue:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam import coders

pipeline_args = [
    '--runner=DataflowRunner',
    '--job_name=my-job-name',
    '--project=my-project-id',
    '--region=europe-west2',
    '--temp_location=gs://mybucket/temp',
    '--staging_location=gs://mybucket/staging'
]

options = PipelineOptions(pipeline_args)
p = beam.Pipeline(options=options)
rows = (
    p | 'Read daily Spot File' >> beam.io.ReadFromText(
        file_pattern='gs://bucket/filename.gz',
        compression_type='gzip',
        coder=coders.BytesCoder(),
        skip_header_lines=0))

p.run()

Any advice as to why this has started happening would be great to know. Thanks in advance.



Solution 1:[1]

I ran into the same kind of error following this [tutorial][1], launched as:

python3 PubSubToGCS.py \
  --project=mango-s-11-a81277cf \
  --region=us-east1 \
  --input_topic=projects/mango-s-11-a81277cf/topics/pubsubtest \
  --output_path=gs://mangobucket001/samples/output \
  --runner=DataflowRunner \
  --window_size=2 \
  --num_shards=2 \
  --temp_location=gs://mangobucket001/temp

    INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
    INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/tmpaelpxw64', 'apache-beam==2.33.0', '--no-deps', '--no-binary', ':all:']
    INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI: dataflow_python_sdk.tar
    INFO:apache_beam.runners.portability.stager:Downloading binary distribution of the SDK from PyPi
    INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/tmpaelpxw64', 'apache-beam==2.33.0', '--no-deps', '--only-binary', ':all:', '--python-version', '38', '--implementation', 'cp', '--abi', 'cp38', '--platform', 'manylinux1_x86_64']
    INFO:apache_beam.runners.portability.stager:Staging binary distribution of the SDK from PyPI: apache_beam-2.33.0-cp38-cp38-manylinux1_x86_64.whl
    WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
    INFO:root:Default Python SDK image for environment is apache/beam_python3.8_sdk:2.33.0
    INFO:root:Using provided Python SDK container image: gcr.io/cloud-dataflow/v1beta3/python38-fnapi:2.33.0
    INFO:root:Python SDK container image set to "gcr.io/cloud-dataflow/v1beta3/python38-fnapi:2.33.0" for Docker environment
    INFO:apache_beam.runners.dataflow.internal.apiclient:Defaulting to the temp_location as staging_location: gs://mangobucket001/temp
    INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
    INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
    INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://mangobucket001/temp/beamapp-maan-0214122440-788240.1644841480.788658/pickled_main_session...
    INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
    INFO:oauth2client.client:Refreshing access_token
    INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
    INFO:oauth2client.client:Refreshing access_token
    WARNING:apache_beam.utils.retry:Retry with exponential backoff: waiting for 3.054787201157167 seconds before retrying _gcs_file_copy because we caught exception: OSError: Could not upload to GCS path gs://mangobucket001/temp/beamapp-maan-0214122440-788240.1644841480.788658: access denied. Please verify that credentials are valid and that you have write access to the specified path.
     Traceback for above exception (most recent call last):
      File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/utils/retry.py", line 253, in wrapper
        return fun(*args, **kwargs)
      File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 559, in _gcs_file_copy
        self.stage_file(to_folder, to_name, f, total_size=total_size)
      File "/home/maan/.local/lib/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 642, in stage_file
        raise IOError((
    
    [the same access-denied warning and traceback repeat with increasing backoff delays (5.2 s, 19.2 s, 39.5 s, 61.2 s, 122.5 s) until the upload is abandoned]
    
      [1]: https://cloud.google.com/pubsub/docs/pubsub-dataflow

Solution 2:[2]

Not sure about the original issue, but I can speak to Usman's post, which seems to describe an issue I ran into myself.

Python doesn't authenticate via your gcloud auth login; it uses the environment variable GOOGLE_APPLICATION_CREDENTIALS. So before you run the Python command to launch the Dataflow job, you will need to set that environment variable:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key"

More info on setting up the environment variable: https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable

Then you'll have to make sure that the service account you set up has the necessary permissions in your GCP project, in particular write access to the staging and temp buckets.
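As a quick sanity check before launching the job, you can verify the variable from Python itself. This is a minimal sketch (the helper name `check_adc_key` is mine, and the field list assumes a standard service-account JSON key file):

```python
import json
import os

def check_adc_key(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> list:
    """Return a list of problems with the credentials setup; empty means it looks OK."""
    path = os.environ.get(env_var)
    if not path:
        return [f"{env_var} is not set"]
    if not os.path.isfile(path):
        return [f"{env_var} points to a missing file: {path}"]
    try:
        with open(path) as f:
            key = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"key file is not valid JSON: {exc}"]
    # A service-account key normally carries these fields.
    problems = []
    for field in ("type", "project_id", "client_email", "private_key"):
        if field not in key:
            problems.append(f"key file is missing the '{field}' field")
    return problems
```

This only checks that the key file is present and well-formed; it cannot tell you whether the service account actually has write access to the bucket, which you still need to confirm in the GCP console.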

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Usman Ali Maan
Solution 2: mellybear