Why do you need to load a large dataset onto the notebook? If you are pre-processing, there are better ways to do this: a SageMaker Spark processing job, your own Spark cluster, or possibly even Glue. If you are exploring the data, you should just use a smaller dataset. If you are loading the data for training, SageMaker supports different input modes for reading the data, so it doesn't have to be downloaded onto the notebook.
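As a rough sketch of that last point: when a training job is created, each input channel declares how SageMaker should deliver the S3 data to the training container. The helper below builds one channel entry in the shape used by the `InputDataConfig` parameter of the `CreateTrainingJob` API. The function name `build_channel`, the bucket, and the prefix are placeholders for illustration, not something from this thread.

```python
def build_channel(name, s3_uri, input_mode="FastFile"):
    """Build one InputDataConfig channel entry for a SageMaker training job.

    Hypothetical helper: it only assembles a dict and does not call AWS.
    """
    allowed = {"File", "Pipe", "FastFile"}
    if input_mode not in allowed:
        raise ValueError(f"input_mode must be one of {sorted(allowed)}")
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,  # placeholder location, replace with your own
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "InputMode": input_mode,
    }


channel = build_channel("train", "s3://bucket_name/train/")
print(channel["InputMode"])
```

In File mode the dataset is copied to the training instance's volume before training starts, while Pipe and FastFile stream it from S3. In none of these modes does the data pass through the notebook instance.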
If you want to download the data onto the notebook so that you don't have to load it from S3 each time you want it in Pandas, you should confirm that the volume size is sufficient for the data. It is set to 5 GB by default, which lines up with your scenario.
To change this, you'll need to edit your notebook instance, expand the "additional configuration" drop down and look for the "Volume size in GB" field.
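Before downloading anything, it may help to check how much space is actually free on the volume. This is a minimal sketch using only the standard library; the helper names and the 20% headroom are assumptions. On a notebook instance the attached volume is mounted at /home/ec2-user/SageMaker, so that is the path you would pass instead of the default.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1024**3


def fits_on_volume(dataset_bytes, path=".", headroom=0.2):
    """Check whether a dataset of the given size fits with some spare room.

    `headroom` is an arbitrary safety margin (20% by default).
    """
    free = shutil.disk_usage(path).free
    return dataset_bytes * (1 + headroom) <= free


# On a notebook instance, use path="/home/ec2-user/SageMaker" instead of ".".
print(f"{free_gib():.1f} GiB free")
print(fits_on_volume(4 * 1024**3))  # would a 4 GiB file fit here?
```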
I have used a much simpler approach to reading a single data file from S3 using pandas. For example:
import pandas as pd
bucket = 'bucket_name'
data_key = 'file_key.csv'
df = pd.read_csv( 's3://{}/{}'.format(bucket,data_key) )
df.head()
Maybe this will perform better?
Tried it, this is the error I'm getting:
AttributeError: 'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'
This error is a version mismatch between pandas and s3fs. Try
%pip install -U pandas s3fs
then restart the kernel and run again. But even faster, I'd try AWS Data Wrangler and replace pd.read_csv with wr.s3.read_csv: https://aws-data-wrangler.readthedocs.io/en/stable/
An example of reading in chunks is illustrated here: https://github.com/data-science-on-aws/oreilly_book/blob/workshop/04_ingest/05_Query_Data_With_AWS_DataWrangler.ipynb
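To illustrate what reading in chunks looks like, here is a small self-contained sketch. It uses plain pandas with an in-memory CSV standing in for the S3 object, so the data and column names are made up. With a real file you would pass the s3:// path instead; as far as I know wr.s3.read_csv accepts the same chunksize argument.

```python
import io

import pandas as pd

# Stand-in for a large CSV on S3 (made-up data for illustration).
csv_source = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")

rows = 0
total = 0
# With chunksize set, read_csv yields DataFrames of at most that many rows
# instead of loading the whole file into memory at once.
for chunk in pd.read_csv(csv_source, chunksize=2):
    rows += len(chunk)
    total += int(chunk["value"].sum())

print(rows, total)
```

Only one chunk is held in memory at a time, so this works for aggregations or filtering even when the full file would not fit in RAM.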
@AWS-cm-1462515 @Hrushi_G can somebody please point to a resource on AWS (or someplace else) with instructions on how to read data in SageMaker for training and yet not load them in the notebook?