How to save a .html file created in a SageMaker processing container to S3


Error message: "FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/processing/output/profile_case.html'"

Background: I am working in SageMaker, using Python and pandas-profiling to profile a dataframe stored in an S3 bucket. The data is very large, so instead of spinning up a large EC2 instance, I am using an SKLearn processor.

Everything runs fine, but when the job finishes it does not save the pandas profile (a .html file) to an S3 bucket or back to the instance SageMaker is running on.

When I try to export the .html file created by the pandas profile, I keep getting errors saying that the file cannot be found.

Does anyone know of a way to export the .html file from the temporary ml.m5.24xlarge instance that the SKLearn processor runs on to S3? Below is the exact code I am using:

# Notebook cell: install dependencies into the notebook environment
import os
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
install('awswrangler')
install('tqdm')
install('pandas')
install('botocore==1.19.4')
install('ruamel.yaml')
install('pandas-profiling==2.13.0')
import awswrangler as wr
import pandas as pd
import numpy as np
import datetime as dt
from dateutil.relativedelta import relativedelta
from string import Template
import gc
import boto3

from pandas_profiling import ProfileReport

client = boto3.client('s3')
session = boto3.Session(region_name="eu-west-2")

%%writefile casetableprofile.py
# This cell writes the processing script that will run inside the container.

import os
import sys
import subprocess
def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
install('awswrangler')
install('tqdm')
install('pandas')
install('botocore')
install('ruamel.yaml')
install('pandas-profiling')
import awswrangler as wr
import pandas as pd
import numpy as np
import datetime as dt
from dateutil.relativedelta import relativedelta
from string import Template
import gc
import boto3

from pandas_profiling import ProfileReport

client = boto3.client('s3')
session = boto3.Session(region_name="eu-west-2")




def run_profile():
    query = """
    SELECT  * FROM "healthcloud-refined"."case"
    ;
    """
    tableforprofile = wr.athena.read_sql_query(query,
                                            database="healthcloud-refined",
                                            boto3_session=session,
                                            ctas_approach=False,
                                            workgroup='DataScientists')
    print("read in the table queried above")

    print("got rid of missing and added a new index")

    profile_tblforprofile = ProfileReport(tableforprofile, 
                                  title="Pandas Profiling Report", 
                                  minimal=True)

    print("Generated carerequest profile")
                                      
    return profile_tblforprofile


if __name__ == '__main__':

    profile_tblforprofile = run_profile()
    
    print("Generated outputs")

    output_path_tblforprofile = ('profile_case.html')
    print(output_path_tblforprofile)
    
    profile_tblforprofile.to_file(output_path_tblforprofile)

    
    #Below is the only part where I am getting errors
import boto3
import os   
s3 = boto3.resource('s3')
s3.meta.client.upload_file('/opt/ml/processing/output/profile_case.html', 'intl-euro-uk-datascientist-prod','Mark/healthclouddataprofiles/{}'.format(output_path_tblforprofile))  

# Notebook cell: upload the processing script to S3
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput

session = boto3.Session(region_name="eu-west-2")

bucket = 'intl-euro-uk-datascientist-prod'

prefix = 'Mark'

sm_session = sagemaker.Session(boto_session=session, default_bucket=bucket)
sm_session.upload_data(path='./casetableprofile.py',
                                bucket=bucket,
                                key_prefix=f'{prefix}/source')

# Notebook cell: configure and run the SKLearn processor
import boto3
#import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name


S3_ROOT_PATH = "s3://{}/{}".format(bucket, prefix)

role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sm_session,
                                     instance_type='ml.m5.24xlarge',
                                     instance_count=1)
sklearn_processor.run(code='s3://{}/{}/source/casetableprofile.py'.format(bucket, prefix),
                      inputs=[],
                      outputs=[ProcessingOutput(output_name='output',
                                                source='/opt/ml/processing/output',
                                                destination='s3://intl-euro-uk-datascientist-prod/Mark/')])

Thank you in advance!!!

1 Answer
Accepted Answer

Hi,

Firstly, you shouldn't (usually) need to interact with S3 directly from your processing script. Because you've configured a ProcessingOutput, any files your script saves in /opt/ml/processing/output are automatically uploaded to your s3://... destination URL when the job completes. There are special cases where direct S3 access from the script makes sense, but in general the processing job's inputs and outputs should do it for you, which keeps your code nice and simple.
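In other words, the ProcessingOutput you already pass to sklearn_processor.run defines the whole mapping for you:

    # Anything the script writes under `source` (inside the container)...
    ProcessingOutput(output_name='output',
                     source='/opt/ml/processing/output',
                     # ...is uploaded by SageMaker to `destination` when the job completes:
                     destination='s3://intl-euro-uk-datascientist-prod/Mark/')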

I'm no Pandas Profiler expert, but I think the error might be coming from here:

    output_path_tblforprofile = ('profile_case.html')
    print(output_path_tblforprofile)
    
    profile_tblforprofile.to_file(output_path_tblforprofile)

Doesn't this just save the report to profile_case.html in the current working directory? That's not the /opt/ml/processing/output directory: it's usually the folder the script is downloaded to in the container, I believe. The FileNotFound error is telling you that the HTML file isn't being created in the folder you expect.
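If you want to confirm that from inside the job, a couple of print statements will show where relative paths actually resolve (just a debugging sketch, not part of the fix):

    import os
    print(os.getcwd())                           # directory that relative paths resolve against
    print(os.path.abspath('profile_case.html'))  # where the report is actually written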

So I would suggest making your output path explicit, e.g. /opt/ml/processing/output/profile_case.html, and also removing the boto3/S3 section at the end - hope that helps!
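Concretely, the end of casetableprofile.py could look something like this (a minimal sketch of the suggested change; the os.makedirs call is just a safety net in case the directory doesn't already exist):

    import os

    output_dir = '/opt/ml/processing/output'  # must match the ProcessingOutput source
    os.makedirs(output_dir, exist_ok=True)    # safety net; SageMaker normally creates this directory
    output_path_tblforprofile = os.path.join(output_dir, 'profile_case.html')

    profile_tblforprofile.to_file(output_path_tblforprofile)
    # No boto3 upload needed: SageMaker copies everything under
    # /opt/ml/processing/output to the ProcessingOutput destination on S3.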

Alex_T (AWS, Expert)
answered 2 years ago
  • This worked!!!!!!! Thank you so much!!!!!!!!!!!!
