Can I Run a Docker Container for Batch Processing from AWS Glue


Getting started with AWS Glue at my new workplace (I previously used Airflow). My colleagues run scheduled jobs via Glue, but they usually paste a Python script into the Glue editor and then adjust the schedule and some arguments. They cannot tell me how to get finer control over the dependencies and binaries that determine the environment the script runs in. It is also not clear how the script loads assets from disk (such as persisted machine learning models, which is my use case).

So, is there a way to specify a Dockerfile that builds the Glue environment before my script runs? Or can I point Glue at a GitLab repository containing the Dockerfile for this purpose?

Alternatively, is it possible to run a container fetched from Docker Hub? In that case the Glue environment would need access to the Docker Hub registry, right?

To make the question concrete, consider the following script, which I currently run via a cron job every Monday at 2 am.

#!/usr/bin/env bash
# encoding: utf-8

# Assumes the host has Docker installed and `docker login`
# has been run so the image below can be pulled.
# Run this script via cron.
TIMESTAMP() { printf "%s" "$(date "+%Y-%m-%d %H-%M")"; }

NAMESPACE=samy                      # Docker Hub namespace; registry access required
IMAGE="$NAMESPACE/tf-clustering"
CONTAINER=clustering-container

docker container stop "$CONTAINER" # Stop so the container can be removed
docker container rm "$CONTAINER"
docker image rm "$IMAGE"           # Remove so the latest image is pulled fresh

set -e # From this point onwards, any failure must exit the script
docker container create --hostname "$CONTAINER" --name "$CONTAINER" "$IMAGE"
docker container start "$CONTAINER"
printf "%s %s Completed.\n\n\n\n" "$(TIMESTAMP)" "$CONTAINER"

Now, if instead of cron on a bare-metal (or virtual) EC2 instance I need to run this in Glue, what are the necessary steps?

asked 5 months ago · 240 views
2 Answers

No, Glue doesn't allow you to use your own Docker images, and you normally don't need that.
The purpose of using a managed service like Glue is to use the binaries, libraries and engines it provides, whether that is Python Shell, Spark or Ray.
If you want to do custom things, you can use EKS or ECS to run your own containers, but I would advise you to give Glue a chance.
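
For what it's worth, Glue does offer some dependency control without containers: jobs accept an --additional-python-modules argument that pip-installs packages when the job starts. A rough sketch with the AWS CLI follows; the job name, role, S3 path and pinned versions are placeholders, not values from this thread:

# Hypothetical names throughout; the script itself must live in S3.
aws glue create-job \
  --name tf-clustering-job \
  --role MyGlueServiceRole \
  --command '{"Name": "pythonshell", "ScriptLocation": "s3://my-bucket/scripts/cluster.py", "PythonVersion": "3.9"}' \
  --default-arguments '{"--additional-python-modules": "tensorflow==2.12.0,scikit-learn==1.3.0"}'

Assets such as persisted models are then typically read from S3 at runtime (e.g. with boto3) rather than from a local disk.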

AWS
EXPERT
answered 5 months ago
  • So what environment does my Python script run in? Does Glue offer a virtualised Debian instance with nothing else installed? I also noticed there is a Temporary path option in the job details; can I control what external resources (e.g. config files, TensorFlow models persisted on disk) my script can access? And if I cannot set up my own fine-tuned environment, what are the constraints on what the Python script can do?


I think you're misunderstanding a bit what Glue is for. Glue and Airflow are not comparable things.

Glue ETL jobs aren't for running general-purpose Python scripts; they're highly managed environments for heavy, distributed data-processing workloads. There are multiple types of Glue ETL jobs, one of which is based on Apache Spark, which processes and transforms large quantities of data across multiple machines ('nodes' in a Spark cluster). If your goal is to load a machine learning model from disk and serve it up, Glue is definitely not what you want.

If you want something very general purpose, I'd look into AWS Batch or AWS ECS (running on Fargate) - both are essentially mechanisms for running containers you already have, and they require significantly less effort/maintenance than running your own EC2 instances.
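
To make that concrete, here's a rough sketch of the Batch route with the AWS CLI, reusing the image from your script. The queue and rule names are placeholders, and it assumes a Batch compute environment and job queue already exist:

# Register a job definition for the existing Docker Hub image.
aws batch register-job-definition \
  --job-definition-name tf-clustering \
  --type container \
  --container-properties '{"image": "samy/tf-clustering", "resourceRequirements": [{"type": "VCPU", "value": "2"}, {"type": "MEMORY", "value": "4096"}]}'

# One-off run to verify the definition:
aws batch submit-job \
  --job-name tf-clustering-test \
  --job-queue my-job-queue \
  --job-definition tf-clustering

# Replacement for the cron entry: Mondays at 02:00 UTC.
# The rule can then target the job queue via put-targets.
aws events put-rule \
  --name weekly-clustering \
  --schedule-expression "cron(0 2 ? * MON *)"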

If the model is small you might be able to use Lambda (much simpler than ECS/Batch and with some nice qualities, but it can only run for a short time on limited hardware). There's also SageMaker, another managed service specifically for machine-learning workloads - though again, SageMaker is an opinionated managed framework, so your code would need to be designed to work with it (more involved than just dropping in your own container).
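
If Lambda fits (the hard limits are 15 minutes of runtime and 10 GB of memory), a container-image function looks roughly like this. Note that Lambda pulls images from ECR, not Docker Hub, so you'd push the image there first; the account ID, region and role below are placeholders:

# Hypothetical account/region/role; the image must live in ECR
# and implement the Lambda runtime interface (AWS base images do).
aws lambda create-function \
  --function-name tf-clustering \
  --package-type Image \
  --code ImageUri=123456789012.dkr.ecr.us-east-1.amazonaws.com/tf-clustering:latest \
  --role arn:aws:iam::123456789012:role/my-lambda-execution-role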

Ted
answered 4 months ago
