How can I feed an augmented manifest file output by one step as input to BlazingText in a pipeline?
I'm creating a pipeline with multiple steps.
One step preprocesses a dataset, and the next takes the preprocessed data as input to train a BlazingText model for classification.
My first ProcessingStep outputs augmented manifest files:
step_process = ProcessingStep(
    name="Nab3Process",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=raw_input_data, destination=raw_dir),
        ProcessingInput(source=categories_input_data, destination=categories_dir),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source=train_dir),
        ProcessingOutput(output_name="validation", source=validation_dir),
        ProcessingOutput(output_name="test", source=test_dir),
        ProcessingOutput(output_name="mlb_train", source=mlb_data_train_dir),
        ProcessingOutput(output_name="mlb_validation", source=mlb_data_validation_dir),
        ProcessingOutput(output_name="mlb_test", source=mlb_data_test_dir),
        ProcessingOutput(output_name="le_vectorizer", source=le_vectorizer_dir),
        ProcessingOutput(output_name="mlb_vectorizer", source=mlb_vectorizer_dir),
    ],
    code=preprocessing_dir,
)
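The preprocessing script itself is not shown. As a rough sketch of what it might do for the "train" output, the helper below writes one JSON object per line with the `source` and `label` keys that the training step later lists in `attribute_names`. The function name `write_augmented_manifest`, the default filename `train.json` (taken from the working path quoted later in the question), and the integer labels are assumptions for illustration, not the original script.

```python
import json
import os


def write_augmented_manifest(records, out_dir, filename="train.json"):
    """Write (text, label) pairs as JSON Lines with 'source' and 'label' keys.

    Hypothetical helper: out_dir stands in for the local directory that the
    ProcessingOutput named "train" uploads to S3 (train_dir in the question).
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        for text, label in records:
            # One JSON object per line; keys must match attribute_names.
            f.write(json.dumps({"source": text, "label": label}) + "\n")
    return path
```

Writing the manifest under a fixed, known filename matters here: the uploaded object's key then becomes the output prefix plus that filename, which is what a later step needs to reference.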
I'm having trouble feeding the train output of this step into the training step as a TrainingInput:
step_train = TrainingStep(
    name="Nab3Train",
    estimator=bt_train,
    inputs={
        "train": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
        "validation": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
    },
)
And I'm getting the following error:
'FailureReason': 'ClientError: Could not download manifest file with S3 URL "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train". Please ensure that the bucket exists in the selected region (us-east-1), that the manifest file exists at that S3 URL, and that the role "arn:aws:iam::xxxxxxxxxx:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole" has "s3:GetObject" permissions on the manifest file. Error message from S3: The specified key does not exist.'
What should I do?
[EDIT]
I made sure the role has the permissions and that the file exists at the expected path in the expected bucket. I even used the training files generated by a failed pipeline run as a static input to the training step in a new pipeline run, and it worked. The problem is that the training step needs a file path, but the preprocessing step outputs a directory path.
The training step works with a path like this one:
"s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train/train.json"
but not with one like this:
"s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train"
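One possible way to bridge the gap, offered as a sketch rather than a confirmed fix: with `s3_data_type="AugmentedManifestFile"`, the S3 URI has to point at the manifest object itself, and the SageMaker Python SDK provides `Join` in `sagemaker.workflow.functions` to append a filename to a step property when the pipeline runs. The fragment below reuses `step_process` from the question and assumes the processing script writes the manifest as `train.json` directly under the "train" output prefix, as in the working path above.

```python
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.functions import Join

# Assumption: the manifest is named "train.json" and sits directly under
# the "train" output prefix. The join is resolved at pipeline run time.
train_manifest_uri = Join(
    on="/",
    values=[
        step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        "train.json",
    ],
)

train_input = TrainingInput(
    s3_data=train_manifest_uri,
    distribution="FullyReplicated",
    content_type="application/x-recordio",
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source", "label"],
    input_mode="Pipe",
    record_wrapping="RecordIO",
)
```

The "validation" channel could be built the same way with whatever filename the processing script uses for that output. A plain Python string concatenation would not work here, because the step property only has a value once the pipeline executes.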
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
