How to save a .html file created in a SageMaker Processing container to S3


Error message: "FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/processing/output/profile_case.html'"

Background: I am working in SageMaker, using Python to profile a dataframe stored in an S3 bucket with pandas-profiling. The data is very large, so instead of spinning up a large EC2 instance I am using an SKLearn processor.

Everything runs fine, but when the job finishes it does not save the pandas profile (a .html file) to an S3 bucket or back to the instance SageMaker is running on.

When I try to export the .html file that is created from the pandas profile, I keep getting errors saying that the file cannot be found.

Does anyone know of a way to export the .html file from the temporary ml.m5.24xlarge instance that the SKLearn processor runs on to S3? Below is the exact code I am using:

import os
import sys
import subprocess
def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
install('awswrangler')
install('tqdm')
install('pandas')
install('botocore==1.19.4')
install('ruamel.yaml')
install('pandas-profiling==2.13.0')
import awswrangler as wr
import pandas as pd
import numpy as np
import datetime as dt
from dateutil.relativedelta import relativedelta
from string import Template
import gc
import boto3

from pandas_profiling import ProfileReport

client = boto3.client('s3')
session = boto3.Session(region_name="eu-west-2")
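
The next notebook cell writes the processing script to disk with %%writefile: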
%%writefile casetableprofile.py

import os
import sys
import subprocess
def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
install('awswrangler')
install('tqdm')
install('pandas')
install('botocore')
install('ruamel.yaml')
install('pandas-profiling')
import awswrangler as wr
import pandas as pd
import numpy as np
import datetime as dt
from dateutil.relativedelta import relativedelta
from string import Template
import gc
import boto3

from pandas_profiling import ProfileReport

client = boto3.client('s3')
session = boto3.Session(region_name="eu-west-2")




def run_profile():



    query = """
    SELECT  * FROM "healthcloud-refined"."case"
    ;
    """
    tableforprofile = wr.athena.read_sql_query(query,
                                            database="healthcloud-refined",
                                            boto3_session=session,
                                            ctas_approach=False,
                                            workgroup='DataScientists')
    print("read in the table queried above")

    print("got rid of missing and added a new index")

    profile_tblforprofile = ProfileReport(tableforprofile, 
                                  title="Pandas Profiling Report", 
                                  minimal=True)

    print("Generated carerequest profile")
                                      
    return profile_tblforprofile


if __name__ == '__main__':

    profile_tblforprofile = run_profile()
    
    print("Generated outputs")

    output_path_tblforprofile = ('profile_case.html')
    print(output_path_tblforprofile)
    
    profile_tblforprofile.to_file(output_path_tblforprofile)

    
    # Below is the only part where I am getting errors
import boto3
import os   
s3 = boto3.resource('s3')
s3.meta.client.upload_file('/opt/ml/processing/output/profile_case.html', 'intl-euro-uk-datascientist-prod','Mark/healthclouddataprofiles/{}'.format(output_path_tblforprofile))  
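
Back in the notebook, the remaining cells upload the script to S3 and launch the processor: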
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput

session = boto3.Session(region_name="eu-west-2")

bucket = 'intl-euro-uk-datascientist-prod'

prefix = 'Mark'

sm_session = sagemaker.Session(boto_session=session, default_bucket=bucket)
sm_session.upload_data(path='./casetableprofile.py',
                                bucket=bucket,
                                key_prefix=f'{prefix}/source')
import boto3
#import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name


S3_ROOT_PATH = "s3://{}/{}".format(bucket, prefix)

role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sm_session,
                                     instance_type='ml.m5.24xlarge',
                                     instance_count=1)
sklearn_processor.run(code='s3://{}/{}/source/casetableprofile.py'.format(bucket, prefix),
                      inputs=[],
                      outputs=[ProcessingOutput(output_name='output',
                                                source='/opt/ml/processing/output',
                                                destination='s3://intl-euro-uk-datascientist-prod/Mark/')])

Thank you in advance!!!

1 Answer

Accepted Answer

Hi,

Firstly, you should not usually need to interact with S3 directly from your processing script: because you've configured a ProcessingOutput, any files your script saves in /opt/ml/processing/output are automatically uploaded to your s3://... destination URL. There are special cases where you might want to access S3 directly from your script, but in general the processing job's inputs and outputs will do it for you, which keeps your code nice and simple.
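
For example, anything the script writes under that directory ends up at the destination, along these lines (a minimal sketch; report.html is just a placeholder name):

    # Inside the processing script: write to the ProcessingOutput source dir.
    # When the job completes, SageMaker uploads this directory's contents to
    # the configured s3://... destination.
    with open('/opt/ml/processing/output/report.html', 'w') as f:
        f.write('<html><body>placeholder report</body></html>')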

I'm no Pandas Profiler expert, but I think the error might be coming from here:

    output_path_tblforprofile = ('profile_case.html')
    print(output_path_tblforprofile)
    
    profile_tblforprofile.to_file(output_path_tblforprofile)

Doesn't this just save the report to profile_case.html in your current working directory? That isn't the /opt/ml/processing/output directory: it's usually the folder the script was downloaded to inside the container, I believe. The FileNotFound error is telling you that the HTML file never gets created in the folder you expect.

So I would suggest making your output path explicit, e.g. /opt/ml/processing/output/profile_case.html, and also removing the boto3/S3 section at the end. Hope that helps!
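
For example, the end of your casetableprofile.py could look something like this (a sketch reusing the names from your script, not something I've run against your setup):

    import os

    # Write the report under the ProcessingOutput source directory so the
    # processing job uploads it to S3 automatically when it completes.
    output_dir = '/opt/ml/processing/output'
    output_path_tblforprofile = os.path.join(output_dir, 'profile_case.html')
    print(output_path_tblforprofile)

    profile_tblforprofile.to_file(output_path_tblforprofile)
    # No boto3 upload needed here: the ProcessingOutput copies the directory
    # to s3://intl-euro-uk-datascientist-prod/Mark/ for you.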

Alex_T (AWS Expert), answered 2 years ago
  • This worked!!!!!!! Thank you so much!!!!!!!!!!!!
