Can I Run a Docker Container for Batch Processing from AWS Glue?

I am getting started with AWS Glue at my new workplace (I previously used Airflow). My colleagues run scheduled jobs via Glue, but they usually just paste a Python script into the Glue editor, then adjust the schedule and some arguments. They cannot advise me on how to get finer control over the dependencies and binaries that make up the environment the script runs in. It is also not clear how the script loads assets from disk (such as persisted machine learning models, which is my use case).

So, is there a way to specify a Dockerfile that builds the Glue environment before my script runs, or can I point to a GitLab repository containing the Dockerfile for this purpose?

Or else, is it possible to run a container fetched from Docker Hub? In that case the Glue environment would need access to the Docker Hub registry, right?

To make the question concrete, consider the following script, which I can run via a cron job every Monday at 2 am.

#!/usr/bin/env bash

# Assumes the host has docker installed
# and docker login set up to access the image.
# Run this script via cron.
TIMESTAMP() { date "+%Y-%m-%d %H-%M"; }
NAMESPACE=samy # Docker Hub namespace; the host needs access to it
IMAGE="$NAMESPACE/tf-clustering"
CONTAINER=clustering-container
# Clean up the previous run; these commands may fail harmlessly on the first run
docker container stop "$CONTAINER" # Stop to enable deletion of the container
docker container rm "$CONTAINER"
docker image rm "$IMAGE" # Forces a fresh pull of the latest image from Docker Hub
set -e # From this point onwards, any failure must exit the script
docker container create --hostname "$CONTAINER" --name "$CONTAINER" "$IMAGE"
docker container start --attach "$CONTAINER" # Attach so the script waits for the job to finish
printf "%s %s Completed.\n\n\n\n" "$(TIMESTAMP)" "$CONTAINER"

Now, if I need to run this in Glue instead of via cron on a bare-metal (or virtual) EC2 instance, what are the necessary steps?

Asked 5 months ago · 256 views
2 Answers

No, Glue doesn't let you use your own Docker images, and you normally don't need to.
The purpose of a managed service like Glue is to use the binaries, libraries and engines it provides, whether that is Python Shell, Spark or Ray.
If you want to do custom things, you can use EKS or ECS to run your own containers, but I would advise you to give Glue a chance.
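
As a rough illustration of the ECS route, the sketch below starts the container as a one-off task on Fargate through the AWS CLI. It assumes a task definition that references the samy/tf-clustering image has already been registered; the cluster name, task definition name and subnet ID are hypothetical placeholders, and assignPublicIp=ENABLED is only an assumption that the task sits in a public subnet and needs to reach Docker Hub.

```shell
#!/usr/bin/env bash
# Sketch only: start the container as a one-off ECS task on Fargate.
# All values below are hypothetical placeholders, not taken from the question.
CLUSTER="${CLUSTER:-batch-cluster}"
TASK_DEFINITION="${TASK_DEFINITION:-tf-clustering}" # must reference the image
SUBNET_ID="${SUBNET_ID:-subnet-REPLACE_ME}"

# Build the awsvpc network configuration in AWS CLI shorthand syntax.
network_configuration() {
    printf 'awsvpcConfiguration={subnets=[%s],assignPublicIp=ENABLED}' "$SUBNET_ID"
}

# Assemble the run-task command; with DRY_RUN=1 it is printed, not executed.
run_clustering_task() {
    set -- aws ecs run-task \
        --cluster "$CLUSTER" \
        --task-definition "$TASK_DEFINITION" \
        --launch-type FARGATE \
        --network-configuration "$(network_configuration)"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Calling run_clustering_task with DRY_RUN=1 prints the command so it can be reviewed before anything is launched. A private Docker Hub image would additionally need registry credentials configured in the task definition.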

AWS Expert
Answered 5 months ago
  • So what environment does my Python script run in? That is, does Glue offer a virtualised Debian instance with nothing else installed? I just noticed there is a Temporary path option in the job details; can I control which external resources (e.g. config files, TensorFlow models persisted on disk) my script can access? Also, if I cannot set up my own fine-tuned environment, what are the constraints on what the Python script can do?

I think you're slightly misunderstanding what Glue is for. Glue and Airflow are not comparable things.

Glue ETL jobs aren't for running general-purpose Python scripts; they're highly managed environments for heavy distributed data processing workloads. There are multiple types of Glue ETL jobs, one of which is based on Apache Spark and is meant for processing and transforming large quantities of data on multiple machines ('nodes' in a Spark cluster). If your goal is to load a machine learning model from disk and serve it up, Glue is definitely not what you want.

If you want something very general purpose, I'd look into AWS Batch or Amazon ECS (running on Fargate). Both are essentially mechanisms for running containers you already have, and they require significantly less effort and maintenance than running your own EC2 instances.
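
For the AWS Batch option, a minimal sketch of what replaces the docker commands in the question might look like the following. It assumes a job queue and a job definition pointing at the container image already exist; the names clustering-queue and tf-clustering-job are hypothetical, as is the job-name scheme.

```shell
#!/usr/bin/env bash
# Sketch only: submit the container as an AWS Batch job.
# Queue and job definition names are hypothetical placeholders.
JOB_QUEUE="${JOB_QUEUE:-clustering-queue}"
JOB_DEFINITION="${JOB_DEFINITION:-tf-clustering-job}" # must reference the image

# Job names may contain letters, numbers, hyphens and underscores.
job_name() { printf 'clustering-%s' "$(date '+%Y%m%d-%H%M')"; }

# Assemble the submit-job command; with DRY_RUN=1 it is printed, not executed.
submit_clustering_job() {
    set -- aws batch submit-job \
        --job-name "$(job_name)" \
        --job-queue "$JOB_QUEUE" \
        --job-definition "$JOB_DEFINITION"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

The weekly trigger that cron provided would then come from a scheduler outside the script, for example an EventBridge rule with the schedule expression cron(0 2 ? * MON *), which is evaluated in UTC.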

If the model is small you might be able to use Lambda (much simpler than ECS/Batch and has some nice qualities, but can only run for a short time on limited hardware). There's also SageMaker, another managed service specifically for machine-learning-type workloads, though again, SageMaker is an opinionated managed framework, so your code would need to be designed to work with it (more complicated than just dropping in your own container).

Ted
Answered 5 months ago
