How to merge AWS Data Pipeline output files into a single file?


Hi There,

I'm exporting data from DynamoDB into an S3 bucket using AWS Data Pipeline. I run this pipeline once a week to get up-to-date records from the DynamoDB table. Everything is working fine and there are no issues with the pipeline; since my tables are small, I pull all the records every time. The problem is that AWS Data Pipeline writes the exported files to S3 in chunks, which is becoming hard to manage because I have to read the files one by one, and I don't want to do that. I'm pretty new to the AWS Data Pipeline service. Can someone guide me on how to configure AWS Data Pipeline so that it produces just one output file per table? Or is there a better way to resolve this?

Any help would be appreciated!

1 Answer

I don't believe you have the option to output only a single file when using Data Pipeline. You are using a pre-built solution based on the emr-dynamodb-connector, which limits your ability to customize it. You can, of course, provide your own code to Data Pipeline, with which you can achieve your goal of a single output file.

You could use AWS Glue to achieve this with Spark: before you write the data to S3, call repartition or coalesce to reduce the data to a single partition. If you have an understanding of Hadoop or Spark, you will know that reducing the partitions reduces the distribution of the job to essentially a single reducer. This can lead to issues if the table holds a lot of data, as a single node in your cluster will need to hold the entire contents of the table, leading to storage or OOM issues. (A sketch of this approach follows the links below.)

  1. Some Guidance on Glue
  2. Repartition/Coalesce
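
For illustration, here is a minimal sketch of a Glue PySpark job that does this. It assumes you read the DynamoDB table directly with Glue's DynamoDB connector; the table name, read-throughput setting, and S3 output path are placeholders, not values from the question.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job bootstrap.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the DynamoDB table (table name and throughput fraction are placeholders).
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my-table",
        "dynamodb.throughput.read.percent": "0.5",
    },
)

# coalesce(1) collapses the data to a single partition, so Spark writes a
# single output file. Fine for small tables; for large ones this funnels
# everything through one executor and risks storage/OOM problems.
df = dyf.toDF().coalesce(1)

# Write one JSON file per run to the target prefix (placeholder path).
df.write.mode("overwrite").json("s3://my-bucket/exports/my-table/")

job.commit()
```

Note that even with a single partition, Spark writes the output under the target prefix with a generated part-* file name, so if you need a specific object key you would add a copy/rename step after the job.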
AWS Expert
answered 2 years ago
AWS Expert
reviewed 2 years ago
