How to merge AWS Data Pipeline output files into a single file?


Hi there,

I'm sourcing data from DynamoDB and writing it into an S3 bucket using AWS Data Pipeline. I run this pipeline once a week to get up-to-date records from the DynamoDB table. Everything is working fine and there are no issues with the pipeline; since my tables are small, I pull all records every time. The problem is that AWS Data Pipeline writes the exported data to S3 in chunks, which is becoming hard to work with because I have to read the files one by one, and I'd rather not do that. I'm pretty new to the AWS Data Pipeline service. Can someone guide me on how to configure AWS Data Pipeline so that it produces just one output file per table, or suggest a better way to resolve this?

Any help would be appreciated!

1 Answer

I don't believe you have the option to output a single file when using Data Pipeline. You are using a pre-built solution based on the emr-dynamodb-connector, which limits your ability to customize it. You can of course provide your own code to Data Pipeline, in which case you could achieve your goal of a single output file.

You could use AWS Glue to achieve this with Spark: before you write the data to S3, call repartition or coalesce to reduce the data to a single partition. If you have some understanding of Hadoop or Spark, you will know that reducing the partitions reduces the distribution of the job to essentially a single reducer. This can lead to issues if the table has a lot of data, since a single node in your cluster will need to hold the entire contents of the table, leading to storage or OOM issues. A minimal sketch of such a Glue job is shown after the links below.

  1. Some Guidance on Glue
  2. Repartition/Coalesce
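
As an illustration only, here is a minimal sketch of what such a Glue (PySpark) job could look like. The table name, S3 path, and read-throughput value are placeholders, and it assumes a Glue version that supports the built-in DynamoDB connection type; adjust everything for your environment.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the DynamoDB table through Glue's DynamoDB connection.
# "my-table" and the throughput percentage are placeholder values.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my-table",
        "dynamodb.throughput.read.percent": "0.5",
    },
)

# coalesce(1) collapses the data onto a single partition, so Spark writes
# exactly one output file. This is fine for small tables; a large table
# would have to fit on a single executor, which risks storage or OOM issues.
single_partition = dyf.toDF().coalesce(1)

# Placeholder S3 path; pick the output format you need (JSON, CSV, Parquet, ...).
single_partition.write.mode("overwrite").json("s3://my-bucket/exports/my-table/")

job.commit()
```

Note that even with coalesce(1) the output lands in the target prefix as a single part file with a Spark-generated name, so you may still want a small post-step if you need a specific file name.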
AWS
EXPERT
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
