How to load a large amount of data from S3 into SageMaker?

I have a notebook in SageMaker Studio and want to read data from S3. I am using the code below:

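import io

import boto3
import pandas as pd

s3_client = boto3.client('s3')
bucket = 'bucket_name'
data_key = 'file_key.csv'

# Download the whole object into memory, then parse it as CSV
obj = s3_client.get_object(Bucket=bucket, Key=data_key)
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
df.head()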

It works for small datasets but fails partway through on the dataset I'm trying to load, which is 15 GB. I changed the instance to ml.g4dn.xlarge (accelerated computing, 4 vCPU + 16 GiB + 1 GPU), but it still fails. What am I missing here? Is it about the instance type or about the code? What is the best way to import large datasets from S3 into SageMaker?

Thank you

Asked 2 years ago, 3949 views
3 Answers

Why do you need to load such a large dataset onto the notebook? If you are pre-processing, there are better ways to do it: a SageMaker Spark processing job, your own Spark cluster, or possibly even Glue. If you are exploring the data, you should just use a smaller dataset. If you are loading the data for training, SageMaker supports different modes for reading the data, and the data doesn't have to be downloaded onto the notebook.
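For the exploration case, one way to get a smaller dataset without copying the whole file is to request only the first bytes of the object. The sketch below is an illustration, not an official recipe: `read_head_lines` and its `max_bytes` parameter are hypothetical names, and it assumes a UTF-8 CSV with no newlines inside quoted fields. It relies on the `Range` parameter of `get_object`, which issues a standard HTTP byte-range request.

```python
def read_head_lines(s3_client, bucket, key, max_bytes=1_000_000):
    """Fetch roughly the first `max_bytes` of an S3 object, keeping only complete lines.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # Ask S3 for a byte range instead of the full object.
    resp = s3_client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{max_bytes - 1}"
    )
    chunk = resp["Body"].read()

    # If the range was filled completely, it probably ends mid-row,
    # so drop everything after the last newline.
    if len(chunk) == max_bytes:
        last_newline = chunk.rfind(b"\n")
        if last_newline != -1:
            chunk = chunk[: last_newline + 1]

    return chunk.decode("utf-8", errors="replace")
```

The returned text can then be parsed as usual, for example with `pd.read_csv(io.StringIO(sample))`. Note that this yields the first rows of the file rather than a random sample, so it suits a quick look at columns and types more than statistics.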

AWS
Answered 2 years ago
AWS Expert
Reviewed 2 years ago

If you want to download the data onto the notebook so that you don't have to load it from S3 each time you want it in Pandas, you should confirm that the volume size is sufficient for the data. It is set to 5 GB by default, which lines up with your scenario.

To change this, you'll need to edit your notebook instance, expand the "additional configuration" drop down and look for the "Volume size in GB" field.
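As a rough sketch of that check in code, the helper below compares the object's size with the free space on the target volume before downloading. `download_if_it_fits` and the `headroom` factor are hypothetical names chosen for illustration; `head_object` and `download_file` are standard boto3 S3 client calls, and the client is passed in rather than created inside the function.

```python
import os
import shutil


def download_if_it_fits(s3_client, bucket, key, dest_dir, headroom=1.2):
    """Download s3://bucket/key into dest_dir only if the volume has room for it.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    `headroom` is an arbitrary safety margin (assumption), not an AWS setting.
    """
    # head_object returns metadata, including ContentLength, without the body.
    size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    free = shutil.disk_usage(dest_dir).free

    if size * headroom > free:
        raise OSError(
            f"Need about {size * headroom / 1e9:.1f} GB, "
            f"only {free / 1e9:.1f} GB free in {dest_dir}"
        )

    dest_path = os.path.join(dest_dir, os.path.basename(key))
    # download_file writes straight to disk, so the object never has to fit in memory.
    s3_client.download_file(bucket, key, dest_path)
    return dest_path
```

A call might look like `download_if_it_fits(boto3.client("s3"), "bucket_name", "file_key.csv", ".")`. Keep in mind that having the file on disk only solves the storage side; parsing a 15 GB CSV into a single DataFrame can still exceed 16 GiB of RAM.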

AWS
Ben_F
Answered 2 years ago

I have used a much simpler approach to reading a single data file from S3 using pandas. For example:

import pandas as pd

bucket = 'bucket_name'
data_key = 'file_key.csv'

# pandas reads s3:// URLs directly when the s3fs package is installed
df = pd.read_csv('s3://{}/{}'.format(bucket, data_key))
df.head()

Maybe this will perform better?
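One caveat: reading through an `s3://` URL still builds the whole DataFrame in memory, so on its own it is unlikely to help with a 15 GB file on a 16 GiB instance. A possible variation is to pass `chunksize` to `pd.read_csv` and process the file piece by piece. The sketch below is only an illustration; `summarize_in_chunks` is a hypothetical helper, and counting rows stands in for whatever per-chunk work is actually needed.

```python
import pandas as pd


def summarize_in_chunks(source, chunksize=1_000_000):
    """Stream a CSV in row chunks and return (row_count, column_names).

    `source` can be a local path, a file-like object, or an s3:// URL
    (the latter assumes the s3fs package is installed).
    """
    row_count = 0
    columns = None

    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        if columns is None:
            columns = list(chunk.columns)
        row_count += len(chunk)
        # Filter, aggregate, or write out each chunk here rather than keeping them all.

    return row_count, columns
```

For the file in the question this would be called as `summarize_in_chunks('s3://bucket_name/file_key.csv')`. Only one chunk is held in memory at a time, so peak usage depends on `chunksize` rather than on the total file size.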

Answered 2 years ago
