I understand that you are seeing performance degradation when using Hive with S3 as the storage layer. Because S3 is a storage layer that lives outside of your VPC, every read and write incurs additional network traffic, which adds latency and cost. That said, I hope the following helps you minimize the degradation, even if it cannot be prevented entirely.
- Please have a look at the Operational differences and considerations, which detail the drawbacks of using versioned buckets with Hive. The mitigation is to update the S3 bucket Lifecycle policy so that old object versions in the "/tmp" directory are deleted frequently.
- Enable the Hive EMRFS S3 optimized committer to take advantage of the performance enhancements built into the EMRFS library for S3 on EMR.
- Please check CloudTrail for S3 503 "Slow Down" errors from your Hive jobs. If you find any, follow the recommendations in the Knowledge Center article on that error.
- In the worst case, the recommendations above may still not improve performance. The case I'm referring to involves the S3 (internal) partitions, which should not be confused with data partitioning. The S3 limit is 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned prefix. If your Hive queries read a very large number of objects in your bucket, they may issue a GetObject request for each of those objects and trigger 503 "Slow Down" responses. In most cases retrying will help, but in extreme cases even retrying may not succeed, because many worker nodes are requesting objects at the same time. In that situation you will have to reach out to AWS Support to get the limit increased for your bucket, which should give you better S3 read/write performance.
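
To make the last point more concrete, here is a rough Python sketch of the two ideas involved: checking whether the combined request rate of your worker nodes is likely to exceed the per-prefix limit, and retrying a throttled request with exponential backoff and jitter. It is only an illustration. `SlowDownError`, `exceeds_get_limit`, `backoff_delays` and `call_with_retries` are hypothetical names I made up for this example; they are not part of Hive, EMRFS or any AWS SDK, and the per-node request rate is something you would have to estimate for your own workload.

```python
import random
import time

# Per-prefix request rate for GET/HEAD quoted in the answer above (requests per second).
S3_GET_LIMIT_PER_PREFIX = 5500


class SlowDownError(Exception):
    """Hypothetical stand-in for an S3 503 "Slow Down" response."""


def exceeds_get_limit(worker_nodes, gets_per_node_per_second, prefixes=1):
    """Return True if the estimated GET rate per prefix is above the limit.

    Assumes the requests are spread evenly across the given number of prefixes.
    """
    total = worker_nodes * gets_per_node_per_second
    return total / prefixes > S3_GET_LIMIT_PER_PREFIX


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one jittered delay in seconds per retry.

    Each delay is a random fraction of min(cap, base * 2**attempt), so that
    many nodes throttled at the same moment do not all retry together.
    """
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_retries=5, sleep=time.sleep, rng=random.random):
    """Call operation(); on SlowDownError wait with jittered backoff and retry."""
    for delay in backoff_delays(max_retries, rng=rng):
        try:
            return operation()
        except SlowDownError:
            sleep(delay)
    # Last attempt: if the prefix is still throttled, let the error propagate.
    return operation()
```

For example, `exceeds_get_limit(100, 100)` estimates 10,000 GET requests per second against a single prefix, which is above 5,500, so throttling would be expected. The jitter matters here precisely because many worker nodes hit the same prefix at once; without it they would all retry at the same instant and get throttled again.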
I believe the above will help you improve the performance of Hive on S3. Please reply to this thread if you have any follow-up questions.