How to set up Step Functions to split a DataFrame and process it on EC2


Hi everyone, I'm a student doing research, and part of that research requires me to download over 2 million files and analyse them. I have a Python script that does what I need, but I would like to split the DataFrame into 256 chunks and run each chunk on a separate EC2 instance to download its files into an S3 bucket. Then, as each file lands in that bucket, I would like a second Python script to run and analyse it. I know this can be done a number of ways, but I'm hoping someone can steer me in the right direction. From what little I have done with AWS, I'm thinking something like Step Functions could help achieve this?
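Roughly what I have in mind for the splitting step is something like this (the bucket name and file list below are just placeholders for my actual data):

```python
import boto3
import numpy as np
import pandas as pd

# Placeholders -- the real bucket name and file list would go here.
BUCKET = "my-research-chunks"
df = pd.read_csv("file_list.csv")  # the ~2 million rows describing files to fetch

s3 = boto3.client("s3")

# Split the DataFrame into 256 roughly equal chunks and store each one in S3,
# so that every EC2 instance (or other worker) picks up exactly one chunk.
for i, chunk in enumerate(np.array_split(df, 256)):
    s3.put_object(
        Bucket=BUCKET,
        Key=f"chunks/chunk-{i:03d}.csv",
        Body=chunk.to_csv(index=False).encode("utf-8"),
    )
```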

Thanks in advance, Tom

1 Answer

I would look into Lambda functions. You would have two buckets: one for the large files and one for the small files. One function triggers on the first bucket; it reads each large file, splits it into multiple smaller files, and saves them to the second bucket. The second function triggers on the second bucket and runs the analysis on each small file.
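A rough sketch of what the first function could look like, assuming the large files are CSVs and using placeholder bucket names (adapt to your actual file format):

```python
import boto3
import numpy as np
import pandas as pd

s3 = boto3.client("s3")

SMALL_FILE_BUCKET = "my-small-files"  # placeholder name for the second bucket
N_CHUNKS = 256                        # matches the 256 chunks from the question

def split_handler(event, context):
    # Triggered by an s3:ObjectCreated:* notification on the first (large-file) bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the large file into a DataFrame (it has to fit in the function's memory).
        obj = s3.get_object(Bucket=bucket, Key=key)
        df = pd.read_csv(obj["Body"])

        # Write each piece to the second bucket; every put_object call will in turn
        # trigger the analysis function once per small file.
        for i, chunk in enumerate(np.array_split(df, N_CHUNKS)):
            s3.put_object(
                Bucket=SMALL_FILE_BUCKET,
                Key=f"{key}.part-{i:03d}.csv",
                Body=chunk.to_csv(index=False).encode("utf-8"),
            )
```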

This assumes that a large file fits into a Lambda function's memory, and that splitting a large file takes less than 15 minutes (the Lambda timeout), as does analyzing a small file.
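The second function would look roughly like this; run_analysis stands in for your existing analysis script, and the results bucket name is a placeholder:

```python
import boto3

s3 = boto3.client("s3")
RESULTS_BUCKET = "my-analysis-results"  # placeholder for wherever results should go

def run_analysis(data: bytes) -> bytes:
    # Stand-in for your existing analysis script.
    raise NotImplementedError

def analyze_handler(event, context):
    # Triggered by an s3:ObjectCreated:* notification on the second (small-file) bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Fetch the small file and analyze it; this has to finish within the
        # 15-minute Lambda timeout mentioned above.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        results = run_analysis(body)

        s3.put_object(Bucket=RESULTS_BUCKET, Key=f"results/{key}", Body=results)
```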

Uri (AWS Expert), answered 2 years ago
