How can I feed an output augmented manifest file as input to BlazingText in a pipeline?

I'm creating a pipeline with multiple steps: one preprocesses a dataset, and the other takes the preprocessed data as input to train a BlazingText model for classification.

My first ProcessingStep outputs augmented manifest files:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

# Preprocessing step: each ProcessingOutput is uploaded to its own S3 prefix.
step_process = ProcessingStep(
    name="Nab3Process",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=raw_input_data, destination=raw_dir),
        ProcessingInput(source=categories_input_data, destination=categories_dir),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source=train_dir),
        ProcessingOutput(output_name="validation", source=validation_dir),
        ProcessingOutput(output_name="test", source=test_dir),
        ProcessingOutput(output_name="mlb_train", source=mlb_data_train_dir),
        ProcessingOutput(output_name="mlb_validation", source=mlb_data_validation_dir),
        ProcessingOutput(output_name="mlb_test", source=mlb_data_test_dir),
        ProcessingOutput(output_name="le_vectorizer", source=le_vectorizer_dir),
        ProcessingOutput(output_name="mlb_vectorizer", source=mlb_vectorizer_dir),
    ],
    code=preprocessing_dir,
)
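
For reference, an augmented manifest is a JSON Lines file: one JSON object per line. The sketch below shows how the preprocessing script might write `train.json` with the `source` and `label` attributes that the training step expects. The sample records, the helper name `write_augmented_manifest`, and the temporary output directory are assumptions for illustration, not the actual preprocessing code.

```python
import json
import os
import tempfile

# Hypothetical sample records; the real script would build these from the raw dataset.
records = [
    {"source": "first example sentence", "label": 0},
    {"source": "second example sentence", "label": 1},
]


def write_augmented_manifest(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Assumed local directory standing in for train_dir inside the processing job.
out_dir = tempfile.mkdtemp()
manifest_path = os.path.join(out_dir, "train.json")
write_augmented_manifest(records, manifest_path)
```

Everything written under the output directory is uploaded beneath the output's S3 prefix, so the manifest ends up at `<prefix>/train.json` rather than at the prefix itself.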

But I'm having a hard time feeding my train output as a TrainingInput to the training step.

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# Training step: both channels point at the S3 prefixes produced by step_process.
step_train = TrainingStep(
    name="Nab3Train",
    estimator=bt_train,
    inputs={
        "train": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
        "validation": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
    },
)

And I'm getting the following error:

'FailureReason': 'ClientError: Could not download manifest file with S3 URL "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train". Please ensure that the bucket exists in the selected region (us-east-1), that the manifest file exists at that S3 URL, and that the role "arn:aws:iam::xxxxxxxxxx:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole" has "s3:GetObject" permissions on the manifest file. Error message from S3: The specified key does not exist.'

What should I do?

[EDIT]

I made sure the role has the permissions and that the file exists at the required path in the required bucket. I even used the training files generated by a failed pipeline run as a static input to the model training step in a new pipeline run, and it worked. The problem is that the training step needs a file path, but the preprocessing step outputs a directory path.

The training step works with a path like this one:

"s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train/train.json"

but not with one like this:

"s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train"
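
To make the difference concrete, here is a minimal sketch that turns the directory-style URI into the file-style URI using plain strings. The helper name `manifest_uri` is hypothetical, and the file name `train.json` is taken from the working path above. Inside a pipeline definition the `S3Uri` property is a placeholder resolved at run time, so plain string concatenation does not apply there. The SageMaker Python SDK provides `sagemaker.workflow.functions.Join` for run-time concatenation of this kind; whether it resolves the error in this pipeline is untested here.

```python
def manifest_uri(output_dir_uri: str, filename: str = "train.json") -> str:
    """Append the manifest file name to an S3 prefix, avoiding a double slash."""
    return output_dir_uri.rstrip("/") + "/" + filename


# Placeholder URI copied from the error message; the x's are redactions.
output_dir = "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train"
file_uri = manifest_uri(output_dir)
print(file_uri)

# In the pipeline definition, the analogous run-time expression would be
# (assumption, not verified against this pipeline):
#   Join(on="/", values=[<train output S3Uri property>, "train.json"])
# imported from sagemaker.workflow.functions.
```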

machine-learning pipeline amazon-sagemaker

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
