2 Answers
Hi, for best practices on optimizing Amazon Athena performance, please refer to this blog post.
From what you describe, there are two data-storage anti-patterns in the current implementation (not only for Athena, but for any distributed query/processing engine, Spark included):
- Small files: the best practice for storage is a file size of around 64 MB, up to 1 GB in some cases.
- Too many small files and partitions: partitioning on a unique key suggests you could have hundreds of thousands of partitions, each containing only a few small files (a quick way to check is sketched below).
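As a rough way to confirm the small-file issue, something like the following boto3 sketch can summarize object sizes under the table's S3 location (the bucket and prefix names here are placeholders, not taken from your setup):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Placeholder location: point these at the table's actual S3 path.
bucket, prefix = "my-data-lake-bucket", "stock_data/"

sizes = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        sizes.append(obj["Size"])

mb = 1024 * 1024
print(f"objects:      {len(sizes)}")
print(f"total size:   {sum(sizes) / mb:.1f} MB")
print(f"average size: {sum(sizes) / max(len(sizes), 1) / mb:.2f} MB")
print(f"under 64 MB:  {sum(s < 64 * mb for s in sizes)}")
```

If most objects fall well under 64 MB, you are in small-file territory and query planning/listing overhead will dominate.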
Also, a question: when you mention an index, what are you referring to? AWS Glue partition indexes?
There are a few approaches you may benefit from, but it depends on which services will access your data:
- Only Athena or Hive? You could consider bucketing by your unique id instead of partitioning on it. By limiting the number of buckets you will still be able to query by id, but you will have fewer, larger files and query execution will be much faster (see the sketch after this list).
- Also using Spark? Then bucketing is not the best option (the Spark and Hive bucketing implementations differ, which adds complexity), so you may want to look for other design options that limit the number of partitions and increase file sizes.
- If you have not yet tested AWS Glue partition indexes, they may help without changing the data storage layout; read this blog post for more info.
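To illustrate the bucketing idea for Athena, a CTAS statement run through boto3 might look like the sketch below. The database, table, column names, S3 paths, and bucket count are all placeholders and would need to be adapted to your data:

```python
import boto3

athena = boto3.client("athena")

# Placeholder names: stock_data, stock_id, and the S3 locations are illustrative only.
ctas = """
CREATE TABLE stock_data_bucketed
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-lake-bucket/stock_data_bucketed/',
    bucketed_by = ARRAY['stock_id'],
    bucket_count = 50  -- far fewer buckets than unique ids => fewer, larger files
) AS
SELECT * FROM stock_data
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```

Queries that filter on the bucketed column (e.g. `WHERE stock_id = 'GOOG'`) then only need to read the files of the matching bucket rather than scanning one tiny partition per id.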
Hope this helps.
Hi, having a large number of partitions is almost never the best approach, and your file sizes are very small. Would it be possible for you to show the table DDL?
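If sharing the full DDL is awkward, the partition keys, location, and storage format can also be pulled from the Glue Data Catalog; a minimal sketch (database and table names are placeholders) would be:

```python
import boto3

glue = boto3.client("glue")

# Placeholder database/table names.
table = glue.get_table(DatabaseName="my_database", Name="stock_data")["Table"]

print("location:      ", table["StorageDescriptor"]["Location"])
print("serde:         ", table["StorageDescriptor"].get("SerdeInfo", {}).get("SerializationLibrary"))
print("partition keys:", [k["Name"] for k in table.get("PartitionKeys", [])])
```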
I am talking about the Glue Partition Index. Basically, I have hundreds of thousands of stocks with data on each and want to be able to collect different stocks and build different portfolios. I am storing each stock as a partition in S3 because when I grab the data, I want the full time series, never a subset. So the workflow is: 1) Extract data from the db 2) Transform the data to a clean state 3) Store the data using a unique id as the partition value 4) Load the data by any set of partition values (e.g., stock data: Google, Amazon, Microsoft) to run a workflow that builds different portfolio attributes.
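For reference, a Glue partition index on that unique id can be added to the existing table without rewriting any data. A minimal boto3 sketch (database, table, and key names are placeholders; the key must be one of the table's existing partition columns):

```python
import boto3

glue = boto3.client("glue")

# Placeholder names: adjust to your catalog; the index key must be an existing partition column.
glue.create_partition_index(
    DatabaseName="my_database",
    TableName="stock_data",
    PartitionIndex={
        "Keys": ["stock_id"],
        "IndexName": "stock_id_idx",
    },
)
```

Once the index is active, Athena queries that filter on that partition column (e.g. `WHERE stock_id IN ('GOOG', 'AMZN', 'MSFT')`) can prune partitions through the index instead of listing all of them; note that Athena also expects the table property `partition_filtering.enabled` to be set to `true` for this, as described in the Athena documentation on Glue partition indexing.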