How to set up Step Functions to split a DataFrame and process it on EC2
Hi everyone, I'm a student doing research, and part of that research requires me to download over 2 million files and analyse them. I have a Python script that does what I need, but I would like to split the data frame into 256 chunks and run each chunk on a separate EC2 instance, downloading the files into an S3 bucket. As each file lands in that bucket, I would like a second Python script to run and analyse it. I know this can be done in a number of ways, but I'm hoping someone can steer me in the right direction. From what little I have done with AWS, I'm thinking something like Step Functions could help achieve this?
Thanks in advance, Tom
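For reference, the 256-way split described in the question could be sketched roughly as below. This is only an illustration, not the poster's script: `chunk_bounds` is a hypothetical helper name, and the row count of 2,000,000 is an assumption standing in for "over 2 million files". It computes near-equal (start, stop) row offsets in plain Python; with pandas, each pair could then be applied as `df.iloc[start:stop]`.

```python
from typing import List, Tuple


def chunk_bounds(n_rows: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Return (start, stop) row offsets for n_chunks near-equal slices.

    Hypothetical helper: the first (n_rows % n_chunks) chunks get one
    extra row so that every row is covered exactly once.
    """
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(n_rows, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


# Assumed figures: 2,000,000 rows split across 256 workers.
bounds = chunk_bounds(2_000_000, 256)
```

Each worker would then receive one (start, stop) pair, for example as its input in a fan-out step, and process only that slice of the data frame.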
I would look into Lambda functions. You would have two buckets: one for the large files and one for the small files. The first function is triggered by the first bucket; it reads each large file, splits it into multiple smaller files, and saves them to the second bucket. The second function is triggered by the second bucket and runs the analysis on the small files.
This assumes that each large file can be loaded into a function (size-wise), and that it takes less than 15 minutes (the maximum Lambda execution time) to split a large file and less than 15 minutes to analyze a small file.
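A minimal sketch of what the second function's entry point might look like, assuming the standard S3 event notification payload (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The names `s3_objects_from_event` and `analyse` are hypothetical placeholders; the actual download and analysis logic would replace the body of `analyse`.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def analyse(bucket, key):
    """Hypothetical placeholder for the real analysis script."""
    return {"bucket": bucket, "key": key}


def lambda_handler(event, context):
    """Entry point: run the analysis once per object named in the event."""
    results = [analyse(b, k) for b, k in s3_objects_from_event(event)]
    return {"processed": len(results)}
```

Keeping the event parsing separate from the analysis makes it possible to test the handler locally with a hand-written event before wiring up the bucket trigger.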