How to merge AWS Data Pipeline output files into a single file?


Hi there,

I'm sourcing data from DynamoDB and landing it in an S3 bucket using AWS Data Pipeline. I run this pipeline once a week to get up-to-date records from the DynamoDB tables. The pipeline itself works fine, and since my tables are small I pull all the records every time. The problem is that AWS Data Pipeline writes the exported data to S3 in chunks, which is becoming hard to work with because I have to read the files one by one, and I'd rather not do that. I'm pretty new to the AWS Data Pipeline service. Can someone guide me on how to configure the pipeline so that it produces just one output file per table? Or is there a better way to resolve this?

Any help would be appreciated!

1 Answer

I don't believe you have the option to output only a single file when using Data Pipeline. You are using a pre-built solution based on the emr-dynamodb-connector, which limits your ability to customize the export. You can, of course, provide your own code to Data Pipeline, with which you can achieve your goal of a single output file.

You could use AWS Glue to achieve this with Spark: before you write the data to S3, call repartition or coalesce to reduce the output to a single partition. If you understand Hadoop or Spark, you will recognize that reducing to one partition funnels the job through essentially a single reducer. This can cause issues if the table holds a lot of data, since a single node in your cluster must hold the entire contents of the table, which can lead to storage or OOM problems.

  1. Some Guidance on Glue
  2. Repartition/Coalesce
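
As a rough illustration of the Glue approach, here is a minimal PySpark job sketch. The table name (`my_table`) and S3 path (`s3://my-bucket/exports/my_table/`) are placeholders you would replace with your own, and this is only one reasonable way to wire it up, not a complete production job:

```python
# Minimal AWS Glue job sketch: read a DynamoDB table and write a single
# output file to S3 by coalescing to one partition. Runs inside a Glue
# job environment; table and bucket names below are placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)

# Read the DynamoDB table via Glue's DynamoDB connection type.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "my_table"},
)

# coalesce(1) funnels all data through a single task, so the entire table
# must fit on one executor -- fine for small tables, risky for large ones.
df = dyf.toDF().coalesce(1)

# Writes one part file (plus Spark metadata files) under the target prefix.
df.write.mode("overwrite").json("s3://my-bucket/exports/my_table/")
```

Note that Spark still names the output `part-00000-...` under the prefix; if you need an exact filename, a small post-step (e.g. an S3 copy/rename) is required.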
answered 6 months ago
reviewed 6 months ago
