How to merge AWS Data Pipeline output files into a single file?


Hi There,

I'm exporting data from DynamoDB to an S3 bucket using AWS Data Pipeline. I run this pipeline once a week to get up-to-date records from the DynamoDB table. The pipeline itself works fine, and since my tables are small I pull all records every time. The problem is that Data Pipeline writes the exported data to S3 in chunks, which is becoming hard to deal with because I have to read the files one by one, and I'd rather not do that. I'm pretty new to the AWS Data Pipeline service. Can someone guide me on how to configure the pipeline so that it produces just one output file per table, or suggest a better way to resolve this?

Any help would be appreciated!

1 Answer

I don't believe you have the option to output only a single file when using Data Pipeline. You are using a pre-built solution based on the emr-dynamodb-connector, which limits your room for customization. You can, of course, provide your own code to Data Pipeline, in which case you can achieve your goal of a single-file output.

You could use AWS Glue to achieve this with Spark: before you write the data to S3, call repartition or coalesce to reduce the output to a single partition. If you have an understanding of Hadoop or Spark, you'll know that reducing the partitions concentrates the job onto essentially a single reducer. This can cause issues if the table holds a lot of data, because a single node in your cluster will need to hold the entire contents of the table, leading to storage or out-of-memory (OOM) issues. A sketch of this approach follows the links below.

  1. Some Guidance on Glue
  2. Repartition/Coalesce
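
To make that concrete, here is a minimal, untested sketch of the Glue approach in PySpark. The table name, S3 path, and read-throughput setting are placeholders you'd swap for your own values; treat it as an illustration of the coalesce technique, not a drop-in job.

```python
# Hypothetical Glue PySpark job: export one DynamoDB table to a single file in S3.
# "my-table" and the s3:// path below are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the whole DynamoDB table through Glue's DynamoDB connector.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my-table",     # placeholder table name
        "dynamodb.throughput.read.percent": "0.5",  # cap read-capacity usage
    },
)

# coalesce(1) pulls everything onto one partition, so the write below
# emits exactly one output file. Fine for small tables; on large ones
# that single node can hit the storage/OOM limits mentioned above.
df = dyf.toDF().coalesce(1)

df.write.mode("overwrite").json("s3://my-bucket/exports/my-table/")  # placeholder path
```

One caveat: Spark still writes a directory containing a single part-* file rather than a file with a name of your choosing, so if you need a fixed key you'd have to copy or rename the object afterwards (for example with boto3).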
AWS EXPERT — answered 2 years ago
AWS EXPERT — reviewed 2 years ago
