No, Glue doesn't allow you to use your own Docker images, and you normally don't need that.
The purpose of using a managed service like Glue is to use the binaries, libraries and engines it provides, whether that is Python Shell, Spark or Ray.
If you want to do custom things, you can use EKS or ECS to run your own containers, but I would advise you to give Glue a chance.
I think you're misunderstanding a bit what Glue is for. Glue and Airflow are not comparable things.
Glue ETL jobs aren't for running general-purpose Python scripts; they're highly managed environments for heavy distributed data processing workloads. There are multiple types of Glue ETL jobs, one of which is based on Apache Spark, which is for processing/transforming large quantities of data across multiple machines ('nodes' in a Spark cluster). If your goal is to load a machine learning model from disk and serve it up, Glue is definitely not what you want.
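To give a sense of what a Spark-based Glue ETL job typically looks like, here's a minimal sketch; the catalog database, table name, and output bucket are placeholders, not anything from your account:

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: Spark context, Glue context, job bookkeeping
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (placeholder names)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# A typical transform: drop rows with a null id, then write Parquet back to S3
filtered = Filter.apply(frame=dyf, f=lambda row: row["id"] is not None)
glueContext.write_dynamic_frame.from_options(
    frame=filtered,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder bucket
    format="parquet",
)

job.commit()
```

That kind of read/transform/write over large datasets is the workload Glue is built for, which is why it's a poor fit for model serving.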
If you want something very general purpose, I'd look into AWS Batch or AWS ECS (running on Fargate). Both of those are basically mechanisms for running containers you already have, and they require significantly less effort/maintenance than running your own EC2 instances.
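For example, with AWS Batch you define a job queue and a job definition pointing at your container image once, and then submit runs programmatically. A rough sketch with boto3 follows; the queue name, job definition name, and command are hypothetical:

```python
import boto3

batch = boto3.client("batch")

# Submit a container-based job; the job definition would point at your own
# image in ECR, so the script inside can use whatever libraries you've installed.
response = batch.submit_job(
    jobName="model-inference-run",
    jobQueue="my-job-queue",           # hypothetical queue you created earlier
    jobDefinition="my-model-job-def",  # hypothetical job definition -> your image
    containerOverrides={
        "command": ["python", "run_inference.py", "--input", "s3://my-bucket/input/"],
    },
)
print(response["jobId"])
```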
If the model is small you might be able to use Lambda (much simpler than ECS/Batch and with some nice qualities, but it can only run for a short time on limited hardware). There's also SageMaker, another managed service specifically for machine-learning workloads, though again, SageMaker is an opinionated managed framework, so your code would need to be designed to work with it (more involved than just dropping in your own container).
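If Lambda fits, the handler usually downloads the model once per container and reuses it across invocations. A minimal sketch, assuming a small pickled scikit-learn-style model in S3 (bucket, key, and the `features` event field are placeholders):

```python
import json
import os
import pickle

import boto3

s3 = boto3.client("s3")

# Hypothetical model location, configured via environment variables
MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "my-model-bucket")
MODEL_KEY = os.environ.get("MODEL_KEY", "models/model.pkl")
LOCAL_PATH = "/tmp/model.pkl"  # /tmp is the only writable path in Lambda

_model = None


def _load_model():
    # Download and deserialize the model once per warm container
    global _model
    if _model is None:
        s3.download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
        with open(LOCAL_PATH, "rb") as f:
            _model = pickle.load(f)
    return _model


def lambda_handler(event, context):
    model = _load_model()
    features = event.get("features", [])
    prediction = model.predict([features])[0]  # assumes a predict() interface
    return {"statusCode": 200, "body": json.dumps({"prediction": float(prediction)})}
```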
So what environment does my Python script run in? By that I mean, does Glue offer a virtualised Debian instance as an environment without anything else installed? I just noticed there is an option for 'Temporary path' in the job details; can I control what kind of external resources (e.g. config files, TensorFlow models persisted on disk, etc.) my script can access? Also, if I cannot set up my fine-tuned environment, what are the constraints in terms of what the Python script can do?