Parquet vs ORC on EMR


What are the general pros and cons of Parquet vs. ORC, specifically as they relate to EMR (EMRFS)?

If the customer also plans to use Redshift and Athena on the same data lake, does this change the equation?

AWS
asked 7 years ago, 568 views
1 Answer
Accepted Answer

For EMR:

Parquet and ORC overlap quite a bit in terms of use cases, since both are columnar formats. The last time I was involved in a design evaluation to choose between the two (a few years ago), ORC's native indexing turned out to be a measurable performance advantage for our use case: Hive queries that filtered on a handful of columns with relatively low cardinality compared with the number of rows in the data set. If that matches the customer's workload, it may be a good reason to go the ORC route. The caveat is that third-party solutions in the ecosystem can help close that indexing gap, if the customer is willing to install and manage them.
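To make the indexing point more concrete, here is a toy Python sketch of min/max statistics-based skipping, which is the general idea behind ORC's built-in indexes. It is not ORC's actual implementation: the chunk size, the `build_index` and `scan_equal` names, and the sample data are all made up for illustration.

```python
from typing import List, Tuple

# Toy model only: real ORC keeps min/max statistics at several granularities;
# here a "chunk" is a hypothetical stand-in for any one of those levels.
Chunk = List[int]
Index = List[Tuple[int, int, Chunk]]


def build_index(values: List[int], chunk_size: int) -> Index:
    """Split a column into chunks and record (min, max, rows) for each one."""
    index = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        index.append((min(chunk), max(chunk), chunk))
    return index


def scan_equal(index: Index, target: int) -> Tuple[int, int]:
    """Count rows equal to target, reading only chunks whose range can match.

    Returns (matching_rows, chunks_read).
    """
    matches = 0
    chunks_read = 0
    for lo, hi, chunk in index:
        if lo <= target <= hi:  # chunks outside this range are skipped unread
            chunks_read += 1
            matches += sum(1 for v in chunk if v == target)
    return matches, chunks_read


# A low-cardinality column (8 distinct values) over 8,000 rows.
clustered = [i // 1000 for i in range(8000)]  # rows sorted by the filter column
scattered = [i % 8 for i in range(8000)]      # same values, interleaved

clustered_result = scan_equal(build_index(clustered, 1000), 3)
scattered_result = scan_equal(build_index(scattered, 1000), 3)
```

In this toy setup both scans find the same 1,000 matching rows, but the clustered layout reads one chunk while the interleaved layout reads all eight. How much an index of this kind helps therefore depends on how the data is laid out relative to the filter columns.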

For Athena/Redshift:

  • On compatibility, Athena supports both formats. Assuming the same compression codec is used with both (the two have different defaults), I am not aware of a significant performance difference between them, all other things being equal.
  • Assuming your Redshift question is about leaving the data in S3 and querying it through Spectrum: based on the docs, Parquet is currently supported but ORC is not.

http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html
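On the compression point in the first bullet, one way to keep a comparison fair is to set the codec explicitly on both tables rather than rely on each format's default. Below is a minimal Python sketch that only builds Hive-style DDL strings. The table names, columns, and S3 location are hypothetical, and `orc.compress` and `parquet.compression` are the table properties as I understand them from Hive, so treat them as assumptions and check them against the Hive, EMR, or Athena version in use.

```python
# Assumed Hive table property that selects the compression codec per format.
CODEC_PROPERTY = {
    "ORC": "orc.compress",
    "PARQUET": "parquet.compression",
}


def create_table_ddl(table: str, columns: dict, fmt: str, codec: str, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement with an explicit codec."""
    fmt = fmt.upper()
    if fmt not in CODEC_PROPERTY:
        raise ValueError(f"unsupported format: {fmt}")
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"STORED AS {fmt} "
        f"LOCATION '{location}' "
        f"TBLPROPERTIES ('{CODEC_PROPERTY[fmt]}'='{codec.upper()}')"
    )


# Hypothetical schema and bucket, used only to show the call shape.
columns = {"event_id": "BIGINT", "country": "STRING"}
orc_ddl = create_table_ddl(
    "events_orc", columns, "orc", "snappy", "s3://example-bucket/events_orc/"
)
parquet_ddl = create_table_ddl(
    "events_parquet", columns, "parquet", "snappy", "s3://example-bucket/events_parquet/"
)
```

Generating both statements from one helper keeps everything except the format and its codec property identical, which is the "all other things being equal" condition the comparison relies on.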

MODERATOR
answered 7 years ago
AWS
SUPPORT ENGINEER
reviewed a month ago
