How can I feed an augmented manifest file output by one step as input to BlazingText in a pipeline?

I'm creating a pipeline with multiple steps: one preprocesses a dataset, and the other takes the preprocessed data as input to train a BlazingText model for classification.

My first ProcessingStep outputs augmented manifest files:

step_process = ProcessingStep(
    name="Nab3Process",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=raw_input_data, destination=raw_dir),
        ProcessingInput(source=categories_input_data, destination=categories_dir),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source=train_dir),
        ProcessingOutput(output_name="validation", source=validation_dir),
        ProcessingOutput(output_name="test", source=test_dir),
        ProcessingOutput(output_name="mlb_train", source=mlb_data_train_dir),
        ProcessingOutput(output_name="mlb_validation", source=mlb_data_validation_dir),
        ProcessingOutput(output_name="mlb_test", source=mlb_data_test_dir),
        ProcessingOutput(output_name="le_vectorizer", source=le_vectorizer_dir),
        ProcessingOutput(output_name="mlb_vectorizer", source=mlb_vectorizer_dir),
    ],
    code=preprocessing_dir,
)

But I'm having a hard time feeding the train output to the training step as a TrainingInput:

step_train = TrainingStep(
    name="Nab3Train",
    estimator=bt_train,
    inputs={
        "train": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
        "validation": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
    },
)

The training step then fails with the following error:

'FailureReason': 'ClientError: Could not download manifest file with S3 URL "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train". Please ensure that the bucket exists in the selected region (us-east-1), that the manifest file exists at that S3 URL, and that the role "arn:aws:iam::xxxxxxxxxx:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole" has "s3:GetObject" permissions on the manifest file. Error message from S3: The specified key does not exist.'

What should I do?

  • Can you see the files from your notebook? For example, run aws s3 ls on the prefix and make sure the manifest file exists there. I would also check that your processing job executed successfully and wrote the file. Since the bucket name has sagemaker in it, ServiceCatalogProductsUseRole would have s3:GetObject permissions by default.
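
The check suggested in the comment above can also be done from Python. The sketch below is only an illustration and makes some assumptions: boto3 is installed and has credentials for the bucket, and the helper names (split_s3_uri, list_keys, manifest_uri) are hypothetical, not part of any SDK. One thing worth checking is whether the URI in the error message is a prefix (a "folder") rather than an object. The error says the key ".../output/train" does not exist, which would be consistent with the processing output URI being a prefix, while s3_data_type='AugmentedManifestFile' expects the URI of the manifest object itself. If that turns out to be the cause, Join from sagemaker.workflow.functions can append the file name to the step property.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(uri, client=None):
    """Return every object key stored under the given S3 URI prefix."""
    if client is None:
        import boto3  # imported lazily; assumes credentials are configured

        client = boto3.client("s3")
    bucket, prefix = split_s3_uri(uri)
    keys = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def manifest_uri(step, output_name, manifest_file_name):
    """Build '<output prefix>/<manifest file name>' as a pipeline expression.

    manifest_file_name is an assumption: it must match whatever file name
    the preprocessing script actually writes into that output directory.
    """
    from sagemaker.workflow.functions import Join  # needs the SageMaker SDK

    prefix = step.properties.ProcessingOutputConfig.Outputs[
        output_name
    ].S3Output.S3Uri
    return Join(on="/", values=[prefix, manifest_file_name])
```

Calling list_keys on the URI from the error message shows what the processing job really wrote. If the manifest sits under the prefix, for example as a file named train.manifest (a hypothetical name), then passing manifest_uri(step_process, "train", "train.manifest") as the first argument of TrainingInput, instead of the bare S3Uri property, should point the training job at the object rather than the prefix.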

No Answers
