Best way to structure data for fast access via Athena?


I am storing 400,000 parquet files in S3 that are partitioned based on a unique id (e.g. 412812). The files range in size from 25kb to 250kb of data. I then want to query the data using Athena. Like so,

Select * 
From Table 
where id in (412812, 412813, 412814)

This query is much slower than anticipated. I want to be able to search for any set of ids and get a fast response. I believe it is slow is because Athena must search through the entire glue catalog looking for the right file (i.e., a full scan of files).

The following query is extremely fast. Less than a second.

Select * 
From Table 
where id = 412812

partition.filtering is enabled on the table. I tried adding an index to the table that was the same as the partition, but it did not speed anything up.

Is there something wrong with my approach or a table configuration that would make this process faster.


asked 2 years ago2637 views
2 Answers

Hi, for best practices on optimizing Amazon Athena performance plese refer to this blog post.

For what you explain there are 2 data storage anti-pattern design in the current implementation (not only for Athena, but for any distributed query/processing engine - Spark as well):

  1. small files. Best practice for for storage would be to have a file size of 64 MB up to 1 GB in some cases.
  2. too many small files and partitions : Based on the partition on a unique key make me think you could have hundreds of thousands of partitions, each of which contains a few small files.

Also, as a questions when you mention index what are you speaking of? AWS Glue Partition Indexes?

There are a few approaches you may benefit from but it would depend on which services will access your data.

  1. only Athena or Hive ? you could consider to use bucketing instead of partition bu your unique id. Limiting the number of buckets you will be able still to query by id but you will have fewer larger files and the execution time of the queries will be much faster.
  2. also using Spark? then bucketing it is not the best option (there are difference between Spark and Hive implementation and it becomes more complex) so probably you may want to see if you have additional design option to limit the number of partitions and increase the size of files.
  3. if you have not yet tested the AWS Glue partition index , they may be able to help without changing the data storage layout, read this blog post for more info.

hope this helps

answered 2 years ago
  • I am talking about Glue Partition Index. Basically, I have hundreds of thousands of stocks with data on each and want to be able to collect different stocks and build different portfolios. I am storing each stock as a partition in s3 because when I grab the data, I want the full time series, never a subset. So the workflow is: 1) Extract data from db 2) Transform data to clean state 3) Store the data using a unique id as the partition value 4) Load the data by any set of partition values (i.e., stock data: Google, Amazon, Microsoft) to run a workflow that builds different portfolio attributes.


Hi, having a big number of partitions is almost always not the best approach, also the file sizes are very small. Would it be possible for you to show the table DDL ?

answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions