I want to forecast future demand from 69 million historical demand records in a CSV file. What is the best practice?


I have historical demand data: 18 GB in CSV format, 69M records, 30 columns.

I'm exploring SageMaker options and see several: Amazon Forecast, SageMaker Studio, Canvas, Training Jobs, and a plain Jupyter Notebook instance. I believe all of them can be used in theory, but I'm not sure which can actually handle such a huge dataset without taking forever.

I think I heard that some of these support only a few million records. I'd like to know the best approach for forecasting future demand with this many data points.

Should I use Spark? Can someone lay out how to do this?

asked 9 months ago · 370 views
1 Answer
Accepted Answer


For a dataset this large, SageMaker Data Wrangler looks like a good fit for data preparation. The post at https://aws.amazon.com/blogs/machine-learning/process-larger-and-wider-datasets-with-amazon-sagemaker-data-wrangler/ benchmarks it on a dataset of around 100 GB with 80 million rows and 300 columns.

For training large models with Amazon SageMaker, see this video: https://www.youtube.com/watch?v=XKLIhIeDSCY

Also, regarding training your model, this post helps you choose the best data source: https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/
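
Whichever service you pick, it often helps to shrink the data first by aggregating the raw records to the granularity you actually want to forecast (for example, one row per item per day). Below is a minimal sketch of that idea in plain Python, streaming the CSV row by row so the full 18 GB never has to sit in memory. The column names `item_id`, `date`, and `demand` and the helper names are hypothetical, since the question does not describe the schema; this is not how Data Wrangler or any SageMaker service does it internally.

```python
import csv
import io
from collections import defaultdict


def aggregate_demand(rows, key_cols=("item_id", "date"), value_col="demand"):
    """Sum demand per key while streaming rows; only the totals are kept in memory.

    The column names are assumptions for illustration, not taken from the question.
    """
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        totals[key] += float(row[value_col])
    return dict(totals)


def aggregate_csv(path, **kwargs):
    """Apply aggregate_demand to a CSV file on disk, reading it lazily."""
    with open(path, newline="") as f:
        return aggregate_demand(csv.DictReader(f), **kwargs)


# Tiny in-memory stand-in for the real file, just to show the call.
sample = io.StringIO(
    "item_id,date,demand\n"
    "A,2023-01-01,3\n"
    "A,2023-01-01,2\n"
    "B,2023-01-01,5\n"
    "A,2023-01-02,1\n"
)
daily = aggregate_demand(csv.DictReader(sample))
```

On the real file you would call something like `aggregate_csv("demand.csv")` (the file name is a placeholder). If the aggregated result is still too large for one machine, the same group-and-sum step maps directly onto Spark or a SageMaker Processing job.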



AWS
answered 9 months ago
