Data Pipeline stops processing files in S3 bucket
I have a Data Pipeline which reads CSV files from an S3 bucket and copies the data into an RDS database. I specify the bucket/folder name and it goes through each CSV file in the bucket/folder and processes it. When it is done, a ShellCommandActivity moves the files to another 'folder' in the S3 bucket. That's how it works in testing. With the real data it just stops after a few files.
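For context, the 'move' step is nothing special: since S3 has no real folders, it is just a copy followed by a delete for each object. The ShellCommandActivity does the equivalent of this boto3 sketch (bucket and prefix names here are placeholders, not my real ones):
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"        # placeholder bucket/prefix names
SRC_PREFIX = "incoming/"
DEST_PREFIX = "processed/"

# An S3 "move" is a copy followed by a delete per object
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SRC_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.lower().endswith(".csv"):
            continue
        dest_key = DEST_PREFIX + key[len(SRC_PREFIX):]
        s3.copy_object(Bucket=BUCKET,
                       CopySource={"Bucket": BUCKET, "Key": key},
                       Key=dest_key)
        s3.delete_object(Bucket=BUCKET, Key=key)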
The last line in the logs is:
07 Dec 2021 09:57:55,755 [INFO] (TaskRunnerService-resource:df-1234xxx1_@Ec2Instance_2021-12-07T09:53:00-0) df-1234xxx1 amazonaws.datapipeline.connector.s3.RetryableS3Reader: Reopening connection and advancing 0
The logs show that it usually downloads the CSV file, then writes the 'Reopening connection and advancing 0' line, then deletes a temp file, then goes on to the next file. But on the seventh file it just stops at 'Reopening connection and advancing 0'.
It isn't the next file that is the problem, as it processes fine on its own. I've already tried making the files smaller - originally it was stopping on the second file, but now that the file sizes are about 1.7 MB it gets through six of them before it stops.
The status of each task (both the DataLoadActivity and the ShellCommandActivity) shows 'CANCELLED' after one attempt (3 attempts are allowed), and there is no error message.
I'm guessing this is some sort of timeout. How can I make the pipeline reliable so that it will process all of the files?
The place to start checking is the ShellCommandActivity node - does it have an Attempt Timeout field set? That would cause the node to fail after a given time. You can also look at other nodes in the pipeline (such as Ec2Resource nodes), as these can have timeouts as well.
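If it's easier than clicking through every node in the console, you can also dump the pipeline definition with boto3 and look for the timeout fields. A rough sketch (the pipeline id below is just copied from your log line - replace it with yours):
import boto3

dp = boto3.client("datapipeline")
PIPELINE_ID = "df-1234xxx1"   # taken from the log line above; use your own pipeline id

# Print any attemptTimeout / terminateAfter values defined on the pipeline's objects
definition = dp.get_pipeline_definition(pipelineId=PIPELINE_ID)
for obj in definition["pipelineObjects"]:
    for field in obj["fields"]:
        if field["key"] in ("attemptTimeout", "terminateAfter"):
            print(obj["id"], field["key"], field.get("stringValue"))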
Aha! The Ec2 Resource is set to Terminate after 10 minutes! I will try changing that.
That was it. It's still running and it'll take a few more hours to complete. Thanks for pointing me in the right direction.
While I think you have found your answer, you could also look at AWS Glue to process your data from S3 into RDS. It is a serverless service and has a job bookmark feature that lets you process each file only once, without needing to relocate the files (unless you are moving them for other reasons).
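Very roughly, a bookmark-enabled Glue job reading the CSVs and writing to RDS could look like the sketch below. The bucket, table, and connection names are placeholders, and job bookmarks also need to be enabled on the job itself:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is what lets job bookmarks track which files have already been read
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/incoming/"]},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="source",
)

# Write to RDS through a JDBC connection defined in the Glue console
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="my-rds-connection",
    connection_options={"dbtable": "my_table", "database": "my_database"},
    transformation_ctx="sink",
)

# Committing the job records the bookmark state
job.commit()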
Thanks for the tip! I will check out AWS Glue.