Glue Interactive vs SageMaker Processing?


Greetings! I'm a data scientist working in SageMaker notebooks. Could someone explain when I should use Glue Interactive rather than SageMaker Processing jobs? As far as I can tell they are very similar, and I can't tell them apart.

Thank you!

2 Answers

Hello! It depends on what you are trying to achieve.

Let's talk about notebooks first. A SageMaker notebook (or a Glue notebook) works well for quick prototyping and data analysis. For example, if you just want to draw a few charts from a CSV file or do some quick data wrangling, a notebook is often the preferred choice. Notebooks are also well suited to documenting algorithms. The interactivity lets you process data step by step and adjust your processing based on what you see in the data. From a tooling perspective, Glue notebooks give data engineers the ability to run Jupyter or Zeppelin notebooks. SageMaker notebooks are the tool preferred by data scientists and machine learning engineers, and they provide the Jupyter notebook interface.

SageMaker provides multiple compute options, including the ability to choose EC2 instance types. In SageMaker Processing you can also customize the execution environment by providing your own Docker image.
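To make the instance and Docker image choices concrete, here is a minimal sketch of the request a Processing job takes. It only builds the parameter dictionary for the boto3 SageMaker `create_processing_job` call and does not contact AWS. The function name, the entrypoint path `/opt/program/clean.py`, and the default instance type are assumptions for illustration, not anything from your setup.

```python
def build_processing_job_request(
    job_name,
    image_uri,
    role_arn,
    input_s3_uri,
    output_s3_uri,
    instance_type="ml.m5.xlarge",  # assumed default; pick what fits your data
    instance_count=1,
    volume_size_gb=30,
):
    """Build keyword arguments for the SageMaker create_processing_job API."""
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        # The execution environment is your own Docker image.
        "AppSpecification": {
            "ImageUri": image_uri,
            # Hypothetical script path baked into the image.
            "ContainerEntrypoint": ["python3", "/opt/program/clean.py"],
        },
        # The compute choice: instance type, count, and attached volume size.
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
                "VolumeSizeInGB": volume_size_gb,
            }
        },
        # Raw data is copied from S3 into the container before the script runs.
        "ProcessingInputs": [
            {
                "InputName": "raw",
                "S3Input": {
                    "S3Uri": input_s3_uri,
                    "LocalPath": "/opt/ml/processing/input",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                },
            }
        ],
        # Files written to the local output path are uploaded when the job ends.
        "ProcessingOutputConfig": {
            "Outputs": [
                {
                    "OutputName": "clean",
                    "S3Output": {
                        "S3Uri": output_s3_uri,
                        "LocalPath": "/opt/ml/processing/output",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
    }
```

With credentials configured, the result could be passed as `boto3.client("sagemaker").create_processing_job(**request)`. The SageMaker Python SDK wraps the same settings in higher-level classes if you prefer that.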

A Glue job is typically built for ETL: it runs as a serverless Spark-based or Python job on a cluster of nodes that process the data in parallel. AWS Glue is a serverless data integration platform that makes it easier to combine, prepare, and discover data for application development, machine learning, and analytics. It provides the features required for data integration, so you can begin analyzing and using your data in minutes rather than months.

To make data integration simpler, AWS Glue offers both code-based and visual interfaces. The AWS Glue Data Catalog allows users to quickly locate and retrieve data. With just a few clicks in AWS Glue Studio, ETL (extract, transform, and load) developers and data engineers can graphically construct, execute, and monitor ETL processes. AWS Glue DataBrew allows data analysts and data scientists to visually enrich, clean, and standardize data without writing code.

AWS Glue scans your data sources, recognizes data types, and recommends schemas for storing your data. It automatically generates the code needed to perform your data transformations and loading operations. AWS Glue makes it simple to run and manage hundreds of ETL processes, as well as to combine and replicate data across multiple data stores using SQL.
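For comparison, a Glue interactive session is configured in terms of Glue workers rather than an instance type and a Docker image. The sketch below builds the parameters for the boto3 Glue `create_session` call without contacting AWS. The function name and the default values (Glue version, worker type, worker count, idle timeout) are assumptions chosen for illustration.

```python
def build_glue_session_request(
    session_id,
    role_arn,
    glue_version="4.0",      # assumed; use the version your account supports
    worker_type="G.1X",      # assumed worker size
    number_of_workers=2,
    idle_timeout_minutes=30,
):
    """Build keyword arguments for the Glue create_session API."""
    return {
        "Id": session_id,
        "Role": role_arn,
        # "glueetl" requests a Spark session.
        "Command": {"Name": "glueetl", "PythonVersion": "3"},
        "GlueVersion": glue_version,
        # Capacity is expressed as a number of workers of a given type.
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
        # The session stops after this many idle minutes.
        "IdleTimeout": idle_timeout_minutes,
    }
```

With credentials configured, this could be passed as `boto3.client("glue").create_session(**request)`. In a Glue notebook the same settings are usually set through notebook magics instead of a direct API call.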

AWS
answered 2 years ago
AWS EXPERT
reviewed 2 years ago
  • Thank you for the elaborate answer. I'll clarify the scenario. I'm a data scientist working in a Jupyter notebook, and I need some data cleaning done beforehand. Should I choose SageMaker Processing or Glue Interactive? Thanks!

Accepted Answer

I would suggest using SageMaker Processing for the data cleansing and preparation. I have led projects where all the data cleansing, preparation, model building, and testing were done in SageMaker, and the data scientists love the tool.
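As a rough idea of what such a cleansing step might look like, here is a minimal sketch of a script that could run inside a Processing container. The cleaning rules (trim whitespace, drop rows with empty fields, drop duplicates) and the function names are assumptions for illustration; real rules depend on your data. It uses only the standard library.

```python
import csv
from pathlib import Path


def clean_rows(rows):
    """Strip whitespace, drop rows with any empty field, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # csv.DictReader stores surplus columns under the key None: skip those rows.
        if None in row:
            continue
        stripped = {key: (value or "").strip() for key, value in row.items()}
        if any(value == "" for value in stripped.values()):
            continue
        fingerprint = tuple(sorted(stripped.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(stripped)
    return cleaned


def clean_file(input_path, output_path):
    """Read a CSV file, clean it, write the result, and return the row count."""
    with open(input_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = clean_rows(reader)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Inside a Processing job you would typically call something like `clean_file("/opt/ml/processing/input/data.csv", "/opt/ml/processing/output/data.csv")`, where the file name `data.csv` is hypothetical and the directories match whatever local paths the job was configured with.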

AWS
answered 2 years ago
