How can I feed an output augmented manifest file as input to BlazingText in a pipeline?

I'm creating a pipeline with multiple steps: one preprocesses a dataset, and the other takes the preprocessed data as input to train a BlazingText model for classification.

My first ProcessingStep outputs augmented manifest files:

step_process = ProcessingStep(
    name="Nab3Process",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=raw_input_data, destination=raw_dir),
        ProcessingInput(source=categories_input_data, destination=categories_dir),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source=train_dir),
        ProcessingOutput(output_name="validation", source=validation_dir),
        ProcessingOutput(output_name="test", source=test_dir),
        ProcessingOutput(output_name="mlb_train", source=mlb_data_train_dir),
        ProcessingOutput(output_name="mlb_validation", source=mlb_data_validation_dir),
        ProcessingOutput(output_name="mlb_test", source=mlb_data_test_dir),
        ProcessingOutput(output_name="le_vectorizer", source=le_vectorizer_dir),
        ProcessingOutput(output_name="mlb_vectorizer", source=mlb_vectorizer_dir),
    ],
    code=preprocessing_dir,
)

But I'm having trouble feeding the train output as a TrainingInput to the training step:

step_train = TrainingStep(
    name="Nab3Train",
    estimator=bt_train,
    inputs={
        "train": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
        "validation": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type="AugmentedManifestFile",
            attribute_names=["source", "label"],
            input_mode="Pipe",
            record_wrapping="RecordIO",
        ),
    },
)

And I'm getting the following error:

'FailureReason': 'ClientError: Could not download manifest file with S3 URL "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train". Please ensure that the bucket exists in the selected region (us-east-1), that the manifest file exists at that S3 URL, and that the role "arn:aws:iam::xxxxxxxxxx:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole" has "s3:GetObject" permissions on the manifest file. Error message from S3: The specified key does not exist.'

What should I do?

  • Are you able to view the files from your notebook, for example by running aws s3 ls on the prefix to make sure it exists? I would also check that your processing job executed successfully and wrote the file there. Since the bucket name has sagemaker in it, ServiceCatalogProductsUseRole would have s3:GetObject permissions by default.
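The check suggested in the comment can also be done from Python. The following is a minimal sketch, not a definitive diagnostic: it lists the keys under an S3 URI and reports whether the URI names an actual object or only a prefix. That distinction matters here because the error message asks that "the manifest file exists at that S3 URL". The helper names (`split_s3_uri`, `list_keys`, `uri_is_object`) are hypothetical, and the client is passed in so the functions do not depend on how credentials are configured.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(s3_client, uri):
    """Return every object key that starts with the key part of the URI."""
    bucket, prefix = split_s3_uri(uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty listing carry no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def uri_is_object(s3_client, uri):
    """True if an object exists at exactly this URI, False if it is only a prefix."""
    _, key = split_s3_uri(uri)
    return key in list_keys(s3_client, uri)


# Example usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   uri = "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train"
#   print(list_keys(s3, uri))
#   print(uri_is_object(s3, uri))
```

If `list_keys` returns manifest files under the URI but `uri_is_object` is false, the processing job wrote its output and the URI passed to TrainingInput is a prefix rather than the manifest object itself.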

No answers
