SageMaker and Data on Databases


A customer has a question about data sources

“most of our data is stored in SQL databases, while the SageMaker docs say that I have to put it all in S3. It’s not obvious what the best way to do this is. I can think for example of splitting my analysis code in two; one pre-processing step to go from SQL queries to tabular data, and e.g. store that as Parquet files. For high-dimensional tensor data it’s even less obvious.”

Can someone comment on that?

2 Answers
Accepted Answer

We have an example notebook for interacting with Redshift data from a SageMaker managed notebook, which I believe is suitable for an Exploratory Data Analysis (EDA) use case:

For production purposes, the customer should consider splitting the work into two stages: first extract the data from the relational databases into S3 (building out a data lake), then use that data for downstream processing and machine learning (SageMaker, EMR, Athena, Redshift Spectrum, etc.). Customers can build extraction pipelines from popular relational databases using AWS Glue, EMR, or their preferred ETL engine, such as those available on the AWS Marketplace.
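As a rough illustration of the extraction stage, here is a minimal Python sketch that runs a SQL query and writes the result as numbered part files, fetching in chunks so that large result sets don't have to fit in memory. It is a sketch under stated assumptions, not a recommended pipeline: `sqlite3` stands in for the customer's actual database driver, CSV stands in for Parquet to keep the example standard-library only, and the function name `export_query_in_chunks` and the `part-NNNNN.csv` naming are made up for illustration. In a real setup the part files would more likely be written as Parquet (for example with `pandas.DataFrame.to_parquet`) and uploaded to S3 (for example with boto3's `upload_file`), or the whole step would be handled by Glue or EMR as described above.

```python
import csv
import sqlite3
from pathlib import Path


def export_query_in_chunks(conn, query, out_dir, chunk_rows=1000):
    """Run `query` on a DB-API connection and write numbered CSV part files.

    Hypothetical helper for illustration. Returns the list of files written.
    Rows are fetched `chunk_rows` at a time to keep memory use bounded.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    cursor = conn.execute(query)
    # Column names come from the cursor metadata, so the query defines the schema.
    header = [col[0] for col in cursor.description]

    written = []
    part = 0
    while True:
        rows = cursor.fetchmany(chunk_rows)
        if not rows:
            break
        path = out_dir / f"part-{part:05d}.csv"
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # repeat the header in every part file
            writer.writerows(rows)
        written.append(path)
        part += 1
    return written


if __name__ == "__main__":
    import tempfile

    # In-memory SQLite database as a stand-in for the real SQL source.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(i, i * 0.5) for i in range(2500)],
    )
    with tempfile.TemporaryDirectory() as tmp:
        files = export_query_in_chunks(
            conn, "SELECT id, value FROM events ORDER BY id", tmp
        )
        print(f"wrote {len(files)} part files to {tmp}")
```

Keeping extraction separate like this means the downstream training or processing job only ever reads files from S3 and never needs database credentials or network access to the database.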

answered 4 years ago

I'd recommend using SageMaker Data Wrangler to connect the dots between the different SageMaker services.

answered 5 months ago
