Workflow information needed for AWS Lambda image processing


We have an EC2 instance that handles our image processing using PHP and ImageMagick. Processing 5,000 images takes about 5 hours, so I've been looking at implementing Lambda with Sharp. There are times we will have 10-15k images to process as well, but that's rare. Currently, when an export is triggered by a user, our steps are:

  1. Retrieve each image using its URL and save it to a folder. The images are very big.
  2. Resize each image to less than 1500x1500 and less than 600 KB, and store it in a secondary folder.
  3. Create a CSV file with data for each image, stored in the secondary folder.
  4. TAR and GZ the resized photos and the CSV, move the tar.gz to an export directory, and update the database.

In my reading, the image processing is best handled in Lambda. However, should the retrieval of images still happen on the EC2 instance, saving them to an S3 bucket that triggers the Lambda? If so, how do I know when all the processing is done so I can zip, move, and delete all images and folders from the bucket? Or is it better to send the URL via an API to Lambda, process the image, and save it? Can you hit the API 5,000+ times and have Lambda scale? The former solution sounds more reasonable.
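For what it's worth, here is roughly what I picture the resize step looking like in a Node.js Lambda with Sharp -- just a sketch I put together while reading, not tested code, and the quality-stepping loop and limits are my own assumptions:

```typescript
import sharp from "sharp";

// Sketch: resize to fit within 1500x1500, then step the JPEG quality down
// until the output is under ~600 KB (or we hit a quality floor).
export async function resizeToLimits(original: Buffer): Promise<Buffer> {
  const MAX_DIMENSION = 1500;
  const MAX_BYTES = 600 * 1024;

  let best: Buffer | null = null;
  for (let quality = 85; quality >= 40; quality -= 10) {
    best = await sharp(original)
      .rotate() // respect EXIF orientation
      .resize(MAX_DIMENSION, MAX_DIMENSION, { fit: "inside", withoutEnlargement: true })
      .jpeg({ quality })
      .toBuffer();

    if (best.length <= MAX_BYTES) {
      return best;
    }
  }

  // Fall back to the lowest-quality attempt if nothing came in under 600 KB.
  return best!;
}
```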

Anyway, I'm looking for anyone with some experience in this to comment. I would appreciate some answers.

3 Answers
Accepted Answer

Your plan is what I would do - Lambda is a good fit here as long as you can process the images within Lambda's memory, local storage, and maximum runtime constraints. You don't say what "very big" means - that would be a useful metric to have here.

Why store the images on an EC2 instance? Why not store them in S3 to start with? That would be far more cost-effective, and S3 scales as your application does, so there's no need to determine what size your instance has to be. So yes, definitely send the URL to the Lambda function using API Gateway or by invoking the Lambda function directly.
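As a rough, untested sketch (the function name and region here are placeholders), invoking the function asynchronously for each URL would look something like this with the JavaScript SDK; the AWS SDK for PHP has an equivalent invoke call:

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambdaClient = new LambdaClient({ region: "us-east-1" }); // placeholder region

// Fire-and-forget ("Event") invocation: one asynchronous invoke per image URL.
// "resize-image" is a placeholder function name.
export async function dispatchImage(url: string): Promise<void> {
  await lambdaClient.send(
    new InvokeCommand({
      FunctionName: "resize-image",
      InvocationType: "Event", // asynchronous; don't wait for the result
      Payload: Buffer.from(JSON.stringify({ url })),
    })
  );
}
```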

To determine when the images are processed, why not have the Lambda function move each original image to a new prefix in S3? Then you could have another Lambda function triggered by CloudWatch Events on a periodic basis (once every minute, two minutes, five minutes - whatever you like) which scans the upload prefix and, if it is empty, does the final processing.
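A minimal sketch of that periodic check, assuming the originals land under an uploads/ prefix (the bucket name, prefix, and final-processing call are placeholders):

```typescript
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Runs on a schedule (e.g. every few minutes via a CloudWatch Events / EventBridge rule).
// If nothing is left under uploads/, kick off the final packaging step.
export async function handler(): Promise<void> {
  const remaining = await s3.send(
    new ListObjectsV2Command({
      Bucket: "my-export-bucket", // placeholder
      Prefix: "uploads/",
      MaxKeys: 1, // we only need to know whether anything is left
    })
  );

  if ((remaining.KeyCount ?? 0) === 0) {
    // All originals have been moved to the processed prefix:
    // build the CSV, tar/gzip the results, update the database, etc.
    await runFinalProcessing();
  }
}

// Placeholder for your packaging logic -- could be another Lambda invocation.
async function runFinalProcessing(): Promise<void> {}
```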

AWS
EXPERT
answered 2 years ago
  • To be honest, we started with just the EC2 instance because AWS is quite confusing at times. :) We were in a hurry to get the site off the campus-run servers, so I simply created what we needed. Now I'm looking at better ways to handle things, the image processing being the major item that uses resources. The export files only live on the server for 3 months before being deleted, so yes, I'm looking to store those as well.

    By big, I mean some retrieved images are 20 MB to 150 MB; at least, that's about the largest I've seen. We don't really have control over the size of the images. They are specimen images, and it really depends on who took them.

    It seems in your answer you offer several solutions, but it's still mixing me up.

    1. When you say send the URL to an API, can I just loop through the array of URLs in PHP and send each one to the API endpoint? No worries if it's 5,000 or 15,000 URLs I'm sending? I assume Lambda would then scale accordingly? In this scenario, I can retrieve each image, resize it, and store it on S3 all in one go, so there's no original upload destination to check with CloudWatch Events. How would I know it's done in this scenario?

    2. The other scenario is to have one Lambda download the images, then another Lambda triggered by the bucket to process each image and move it to another prefix (or another bucket?), and use CloudWatch to know when it's done.

    Does either of these sound right? I obviously need to read up on CloudWatch too. Thanks

  • If it were me - because we're talking about decoupling the part that gathers the URLs from the part that deals with the images - I would create an SQS queue. Have your PHP code send each URL to the SQS queue and make that queue the trigger for your Lambda function. That way Lambda will scale, but if it reaches a limit (say you send 15,000 URLs and the default number of concurrent Lambda functions is 1,000), the queue will hold the messages until Lambda "catches up".
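    To give a feel for the consumer side, here is a rough, untested sketch of an SQS-triggered handler in Node.js/TypeScript -- each message body is assumed to be one image URL, and the resize helper is only a stand-in for whatever Sharp logic you end up with:

```typescript
import { SQSEvent } from "aws-lambda";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Triggered by the SQS queue; each record body is assumed to be one image URL.
export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    const url = record.body;

    // Download the original (these can be 20-150 MB, so size Lambda memory accordingly).
    // fetch() is built in on the Node.js 18+ runtimes.
    const response = await fetch(url);
    const original = Buffer.from(await response.arrayBuffer());

    const resized = await resizeToLimits(original);

    // Derive a key from the URL; adjust to whatever naming you need.
    const filename = new URL(url).pathname.split("/").pop() ?? "image.jpg";

    await s3.send(
      new PutObjectCommand({
        Bucket: "my-export-bucket", // placeholder
        Key: "resized/" + filename,
        Body: resized,
        ContentType: "image/jpeg",
      })
    );
  }
}

// Stand-in for the Sharp resize logic (<1500x1500, <600 KB).
async function resizeToLimits(original: Buffer): Promise<Buffer> {
  return original; // placeholder
}
```

    Set the queue's visibility timeout comfortably above the function timeout, and attach a dead-letter queue so one bad image doesn't get retried forever.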

  • Yes, I was reading about the limits last night. Thank you for your help.


If your team's current experience level with AWS is such that "AWS is too confusing" (and it definitely is confusing until you gain more experience with it), then I would stick with what you have. More scalable solutions are inherently 'more confusing' -- until they're not :) -- there are definitely many more moving pieces and things to learn. However, it's not difficult -- it just takes time to learn to think about architecture differently and to learn how to build complex things from many simple parts.
How I would do this (and how I have done similar jobs in the past) is different from the other suggestions: I would start with Step Functions as the core 'workflow engine'. Step Functions are managed workflows which you can author entirely online, graphically, or via simple JSON or YAML configuration documents. I would take each of your 'steps' and make it a separate 'activity' in Step Functions -- starting with a 'dummy' step for each (use a Pass state as a placeholder). Once you have the basic structure in place, you replace each Pass state with either a Lambda Task step or a direct AWS API integration step. Most AWS APIs can be invoked directly from Step Functions. Lambda may be easier for some things, depending on how complex the API is -- i.e. if it's too complicated to figure out the raw REST/JSON payload, you may find it easier to use a Lambda with a language-specific AWS SDK that supports typed API calls and more traditional integrations. Everything you describe can be implemented with a simple Step Functions workflow and a few Lambda calls -- except possibly the 'update the database' part. Depending on what database that is, you may find it easier to integrate using the Step Functions Activity API, which allows any program (running on EC2, in Docker, or anywhere in your datacenter, including a traveling laptop) to take part in the workflow.

What Step Functions brings to the table is visibility and reliability.

Visibility: every step is fully exposed as an input and output JSON document, and full logging and persistence of every execution step comes built in. You can literally see everything that went into and out of each step and function call. I find this makes understanding and debugging much easier than in a program running on EC2, Lambda, or Docker, where it may be difficult to debug directly and time-consuming to instrument with enough logs -- let alone to get at the logs and match them up against the executions. Step Functions provides a GUI and API that tracks everything cleanly.

Reliability: advice I have learned 'the hard way' -- try to keep your 'orchestration' physically separate from your 'execution'. When you put everything in one basket -- say, a program on EC2 -- then any error in a single bit of the code -- one line gone bad -- can break everything. That can mean it breaks the debug logs; it also means bugs in one part of your code -- seemingly unrelated to any other -- can break other parts mysteriously. Run out of temp space in the resize step? You might break the compress step. A bug in the S3 copy might crash the logs. Figuring out where something breaks is much harder in 'the cloud' than 'on your desktop' due to the lack of visibility. Run it in Step Functions, and Step Functions itself will not break no matter what your code does -- you can build in retries and logging that are impervious to your code's bugs. Run a 'monolithic' app that does 10 steps and 'something goes wrong' -- where? What part of the app is responsible for managing errors? The same thing happens if you split it out into pure Lambda calls -- which Lambda call is responsible for error recovery or for figuring out 'the next step'? How do you debug that one? Lambda can be 'fragile', and when it breaks it simply doesn't return. Will there be debug logs? Maybe, maybe not, depending on how badly it broke. What if the Lambda code worked but the 'calling the next one' part breaks? What manages that? Now you start to need queues and dead-letter queues and Lambdas watching Lambdas and more Lambdas cleaning up the problems of the other ones... Really messy and hard to code and debug. Think spaghetti code is bad? Try 20 Lambdas all trying to coordinate a workflow blindly and independently.

Step Functions solves that problem by reversing the control paradigm: make your code 'do one thing' and let the workflow 'decide what to do next'.

Step Functions comes with a built-in, controlled parallel Map task. Give it your 'array of 1000 file names' and it will divide and conquer -- splitting the work into batches and iterating over each in an independent thread of execution -- scaling to whatever concurrency you want without getting 'blown out of the water' if you throw 100x more at it than expected.
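To make that concrete, here is a rough sketch of the shape of such a workflow -- shown with the AWS CDK in TypeScript rather than the raw JSON/YAML I mentioned, and with made-up names throughout -- a Map state fanning out the resize work, with Pass states standing in for steps that have not been built yet:

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";
import * as lambda from "aws-cdk-lib/aws-lambda";

export class ExportWorkflowStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Existing resize function, referenced by a placeholder ARN.
    const resizeFn = lambda.Function.fromFunctionArn(
      this,
      "ResizeFn",
      "arn:aws:lambda:us-east-1:123456789012:function:resize-image"
    );

    // Fan out over the array of image URLs in the execution input,
    // resizing up to 40 images at a time.
    const resizeEach = new sfn.Map(this, "ResizeEachImage", {
      itemsPath: sfn.JsonPath.stringAt("$.imageUrls"),
      maxConcurrency: 40,
    }).itemProcessor(
      new tasks.LambdaInvoke(this, "ResizeImage", {
        lambdaFunction: resizeFn,
        outputPath: "$.Payload",
      })
    );

    // Later steps start life as Pass placeholders and get swapped for
    // Lambda tasks or direct API integrations once the skeleton works.
    const buildCsv = new sfn.Pass(this, "BuildCsv");
    const packageTarGz = new sfn.Pass(this, "PackageTarGz");
    const updateDatabase = new sfn.Pass(this, "UpdateDatabase");

    new sfn.StateMachine(this, "ExportStateMachine", {
      definitionBody: sfn.DefinitionBody.fromChainable(
        resizeEach.next(buildCsv).next(packageTarGz).next(updateDatabase)
      ),
    });
  }
}
```

The same structure can be written directly as JSON in the console; the point is that each placeholder becomes a one-line swap for a real task once you are ready.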

It takes some practice to get used to designing the logic -- it's definitely different from procedural programming, and it is very, VERY limited with respect to expressions and data manipulation -- you may have to call a Lambda to do simple things like multiply... But as you get used to it, you will find ways to work within the limits. The design/edit/test iteration cycle is fast -- as short as a few seconds -- and development can happen entirely in the console GUI, offline in a text editor, or both.

DALDEI
answered 2 years ago

There are many possible solutions for your use case, and Lambda is a good fit as part of the solution.

1 - How are the images detected for processing? Do you fetch them from a database or something like that?

2 - Are the images uploaded somewhere? If so, can you set the upload location to S3? If possible, simply send an event to SQS and trigger the Lambda behind that SQS queue to avoid concurrency risks.

3 - Or simply put an event on SQS from EC2 and trigger the Lambda from SQS.

4 - Use Step Functions, which has an internal queue and will help you avoid reaching concurrency limits.

As a complement: Lambda has a default concurrency soft limit that can be increased, and there is also burst capacity that is added on top of your concurrency limit if you have not been using it for a while; depending on the region, this will not exceed 3,000.
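If you want to stop the image-processing function from consuming the whole account concurrency pool, one option (sketched here with the JavaScript SDK; the function name and the value 500 are placeholders) is to set reserved concurrency on it:

```typescript
import {
  LambdaClient,
  PutFunctionConcurrencyCommand,
} from "@aws-sdk/client-lambda";

const lambdaClient = new LambdaClient({});

// Cap the resize function at 500 concurrent executions so the rest of the
// account's concurrency stays available for other functions.
export async function capResizeConcurrency(): Promise<void> {
  await lambdaClient.send(
    new PutFunctionConcurrencyCommand({
      FunctionName: "resize-image", // placeholder
      ReservedConcurrentExecutions: 500,
    })
  );
}
```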

answered 2 years ago
