Best AWS Service to Put PySpark / ML Code into Production


I am very new to MLOps and am seeking guidance on how to put a piece of ML Python code into production on AWS.

Here are my requirements:

  1. Model is run once a week
  2. Pick up very large data files from S3
  3. Models are stored in S3 and will need to be loaded from there as well
  4. Data processing is done in PySpark
  5. Processed data is then converted into Pandas for further processing before being fed into models
  6. Models are typically scikit-learn or XGBoost
  7. Model results should be stored back in S3

Any idea of what service is best for this? Again, I have very limited experience outside of writing ML models in Python, so anything simple where I do not have to manage or worry about infrastructure is preferable. However, I will do what I must to put the code into production.

1 Answer

From what you have described, Amazon SageMaker is your best bet. It is a managed service (meaning you don't have to worry about managing the platform) and has built-in support for feature engineering (data processing) in PySpark through SageMaker Processing. You can find an example here. It integrates natively with S3, which lets you fetch data from S3 at runtime, and it saves model artefacts to an S3 bucket (you can specify which bucket and prefix). Commonly used frameworks such as scikit-learn and XGBoost are supported, and you can run your model training tasks with SageMaker managed training. Finally, to make the code production ready, you can use SageMaker Pipelines, the MLOps tool that takes care of moving your code to the production environment.
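To make that concrete, here is a rough sketch (not the exact example linked above) of what the two managed pieces could look like with the SageMaker Python SDK. The bucket name, script names, framework versions, and IAM role below are placeholders you would replace with your own:

```python
# Sketch: weekly PySpark preprocessing via SageMaker Processing,
# followed by a managed scikit-learn training job.
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.sklearn.estimator import SKLearn

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN outside SageMaker
bucket = "my-ml-bucket"                # placeholder bucket name

# 1. Feature engineering in PySpark through SageMaker Processing
spark_processor = PySparkProcessor(
    base_job_name="weekly-preprocess",
    framework_version="3.3",           # check which Spark versions your region supports
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
)
spark_processor.run(
    submit_app="preprocess.py",        # your PySpark script: read raw S3 data, write processed data
    arguments=[
        "--input", f"s3://{bucket}/raw/",
        "--output", f"s3://{bucket}/processed/",
    ],
)

# 2. Managed training with the built-in scikit-learn container
sklearn_estimator = SKLearn(
    entry_point="train.py",            # your script: load processed data, fit and save the model
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    output_path=f"s3://{bucket}/models/",  # model.tar.gz is written here
)
sklearn_estimator.fit({"train": f"s3://{bucket}/processed/"})
```

These two steps can then be wrapped as a SageMaker Pipeline and triggered on a weekly schedule (for example with Amazon EventBridge).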

  • Thank you, I will investigate SageMaker Pipelines. Why would you not recommend Amazon EMR? Is it overkill for this use case?

  • If your intent is to train a machine learning model, SageMaker is a better fit, as it provides tooling for all steps of the model development lifecycle. You have access to features like SageMaker Model Monitor (to monitor your model in production), SageMaker Debugger, and a number of model inference options.

  • I'm starting to dig into SageMaker Pipelines and watch tutorial videos. Before I get too far, I want to ask... is it possible to create a highly custom modeling pipeline? For example, I have my own custom cross-validation class, where hundreds of models are trained and tested, and my own functions which stop training when very niche custom criteria are hit. I'd like to save this class as a pickled object which is then picked up in production from S3 and fed the same set of features.

    Is SageMaker Pipelines still the best route for me? Any suggestions would be appreciated, including if I should rethink how I'm doing things.
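For reference, the save/load part of that workflow is framework-agnostic: SageMaker does not care what your training script does internally, so a custom cross-validation class can be pickled to S3 and restored by a later production step. A minimal sketch of that pattern, assuming boto3 and placeholder bucket, key, and class names:

```python
# Sketch: persist a custom (already trained) object to S3, then restore it in production.
import pickle
import boto3

s3 = boto3.client("s3")
bucket = "my-ml-bucket"            # placeholder
key = "artifacts/custom_cv.pkl"    # placeholder

# --- training side: persist the fitted custom object ---
# custom_cv = MyCustomCrossValidator(...)   # hypothetical custom class, trained elsewhere
# s3.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(custom_cv))

# --- production side: restore the object and score new data ---
obj = s3.get_object(Bucket=bucket, Key=key)
custom_cv = pickle.loads(obj["Body"].read())
# predictions = custom_cv.predict(features)  # same feature columns as in training
```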
