ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream
I am doing distributed training using the GCP Vertex AI platform. The model is trained in parallel across 4 GPUs using PyTorch and HuggingFace. After training, when I copy the saved model from the local container to a GCP bucket, it throws the error below.
Here is the code:
I launch train.py this way:
python -m torch.distributed.launch --nproc_per_node 4 train.py
After training is complete, I save the model files using the code below. There are 3 files that need to be saved.
trainer.save_model("model_mlm")  # Saves to the local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE)  # Copies from the local container to the GCS bucket
Error:
ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded
And sometimes I get this error:
ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.
Solution 1:[1]
As per the documentation on name conflicts, you are trying to overwrite a file that has already been created.
So I would recommend that you change the destination location to a unique path per training run so you don't receive this type of error. For example, append the timestamp in string format to the end of your bucket path, like:
- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
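
For instance, the upload step from the question could build the destination path with a timestamp before invoking gsutil. This is a minimal sketch: the local path and masked bucket name are taken from the question, and the `destination` variable is illustrative.

```python
import subprocess
from datetime import datetime

# Unique suffix per training run, e.g. 20220407150000
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
destination = f"gs://*****/model_mlm/{timestamp}"  # bucket name masked as in the question

# Same gsutil copy as in the question, but to a per-run destination
subprocess.call(
    "gsutil -o GSUtil:parallel_composite_upload_threshold=0 "
    f"cp -r /pythonPackage/trainer/model_mlm {destination}",
    shell=True,
    stdout=subprocess.PIPE,
)
```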
I would like to mention that this kind of error is retryable, as mentioned in the error docs.
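
Because the error is retryable, you could also wrap the copy in a simple retry loop. The helper below is a hedged sketch rather than anything from the docs: the attempt count and exponential backoff are arbitrary choices.

```python
import subprocess
import time

def upload_with_retries(src: str, dst: str, attempts: int = 3) -> None:
    """Retry the gsutil copy a few times with simple exponential backoff."""
    for attempt in range(1, attempts + 1):
        returncode = subprocess.call(
            f"gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r {src} {dst}",
            shell=True,
        )
        if returncode == 0:  # gsutil exited cleanly; the upload succeeded
            return
        time.sleep(2 ** attempt)  # back off before the next attempt
    raise RuntimeError(f"Upload of {src} failed after {attempts} attempts")
```

For example: `upload_with_retries("/pythonPackage/trainer/model_mlm", destination)`.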
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sandeep Vokkareni |
