Reprocessing a file in EMR

While processing a file through EMR, the cluster was terminated and only a few records were written to the target. When reprocessing the file, should we delete the output at the target location first so we can process the file without creating duplicates, or is there another way to handle this?

asked 3 years ago · 284 views
1 Answer
Hello Rohith,

Thank you for raising this question on re:Post.

To give you an accurate answer, I would need more details: what you are using to process this file, how you are writing it to the target, and what the target is.

In general, non-text output formats (Parquet, for example) are known to handle this scenario better in application frameworks such as Spark and Hive, among others. Text file formats are prone to leaving partial files at the target when the application fails or exits abruptly.

At a high level, if you write the application to overwrite the target, that should take care of this and not leave any duplicates. That said, there are multiple variables at play here, so I cannot confirm whether this would hold for your specific use case.
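To illustrate the overwrite approach, here is a minimal PySpark sketch, assuming a Spark job reading a file and writing Parquet to S3. The bucket names and paths are hypothetical placeholders, not anything from your environment:

```python
# Minimal sketch: idempotent reprocessing via overwrite semantics.
# Assumes PySpark writing Parquet to S3; the bucket and path names
# below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reprocess-example").getOrCreate()

# Read the source file (adjust the reader to your actual input format).
df = spark.read.csv("s3://example-input-bucket/raw/file.csv", header=True)

# mode("overwrite") replaces whatever already exists at the target path,
# so rerunning the job after a failed attempt does not accumulate
# duplicates from the earlier partial write.
(df.write
   .mode("overwrite")
   .parquet("s3://example-output-bucket/processed/"))
```

Because the target path is replaced on every run, rerunning the same job after a cluster termination yields the same output as a clean first run, with no manual deletion needed.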

Please share more details on the output file format, the execution engine (Hive, Spark, etc.), and the target location (RDBMS, S3, or something else) so that I can review this again and answer your question. The more details you share, the easier it will be to answer.

AWS
SUPPORT ENGINEER
answered 3 years ago
  • Output format will be Parquet and the target location can be S3 or Redshift. The execution engine is Spark.
