Best way to structure data for fast access via Athena?
I am storing 400,000 Parquet files in S3, partitioned by a unique id (e.g. 412812). The files range in size from 25 KB to 250 KB. I then want to query the data using Athena, like so:
SELECT * FROM Table WHERE id IN (412812, 412813, 412814)
This query is much slower than anticipated. I want to be able to search for any set of ids and get a fast response. I believe it is slow because Athena must search through the entire Glue catalog looking for the right files (i.e., a full scan of the file listing).
The following query is extremely fast, returning in less than a second:
SELECT * FROM Table WHERE id = 412812
Partition filtering is enabled on the table. I tried adding an index on the same column as the partition key, but it did not speed anything up.
Is there something wrong with my approach, or is there a table configuration that would make this process faster?
Hi, for best practices on optimizing Amazon Athena performance, please refer to this blog post.
From what you describe, there are two data-storage anti-patterns in the current implementation (not only for Athena, but for any distributed query/processing engine, Spark included):
- Small files. Best practice for storage would be a file size of 64 MB, up to 1 GB in some cases.
- Too many small files and partitions: partitioning on a unique key makes me think you could have hundreds of thousands of partitions, each of which contains a few small files. Compacting the data addresses both points, as shown in the sketch below.
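One way to do that compaction with Athena itself is a CTAS statement that rewrites the data into fewer, larger files. A minimal sketch, assuming hypothetical names (source_table, compacted_table, and the s3://my-bucket/compacted/ location):

-- Rewrite the many small Parquet files into a handful of larger ones.
-- With no PARTITIONED BY clause, Athena coalesces the output files.
CREATE TABLE compacted_table
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/compacted/'
) AS
SELECT *
FROM source_table;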
Also, as a question: when you mention an index, what are you referring to? AWS Glue Partition Indexes?
There are a few approaches you may benefit from, but it depends on which services will access your data.
- Only Athena or Hive? You could consider using bucketing instead of partitioning by your unique id. With a limited number of buckets you will still be able to query by id, but you will have fewer, larger files and the queries will execute much faster (see the bucketing sketch after this list).
- Also using Spark? Then bucketing is not the best option (there are differences between the Spark and Hive implementations and it becomes more complex), so you may want to look for other design options to limit the number of partitions and increase the size of the files.
- If you have not yet tested AWS Glue partition indexes, they may help without changing the data storage layout; read this blog post for more info.
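A minimal bucketing sketch for the Athena/Hive case, assuming hypothetical names (stocks_bucketed, source_table, s3://my-bucket/stocks-bucketed/) and an arbitrary bucket count. Athena hashes each id into a fixed number of buckets, so lookups by id only need to read the matching bucket files:

-- Hash-distribute rows by the lookup key into a fixed number of buckets.
CREATE TABLE stocks_bucketed
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/stocks-bucketed/',
    bucketed_by = ARRAY['id'],
    bucket_count = 100
) AS
SELECT *
FROM source_table;

-- Lookups by id now prune down to the buckets that can contain those ids.
SELECT * FROM stocks_bucketed WHERE id IN (412812, 412813, 412814);

Treat bucket_count as a tuning knob: pick it so each bucket file lands in the 64 MB to 1 GB range mentioned above.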
Hope this helps.
I am talking about the Glue Partition Index. Basically, I have hundreds of thousands of stocks, with data on each, and want to be able to collect different stocks and build different portfolios. I am storing each stock as a partition in S3 because when I grab the data, I want the full time series, never a subset. So the workflow is: 1) extract data from the DB, 2) transform the data to a clean state, 3) store the data using a unique id as the partition value, 4) load the data by any set of partition values (i.e., stock data: Google, Amazon, Microsoft) to run a workflow that builds different portfolio attributes.
Hi, having a large number of partitions is almost always not the best approach, and the file sizes here are very small. Would it be possible for you to show the table DDL?
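For context, a table matching the layout described would have a DDL along these lines (the table name and data columns here are assumptions, not the actual schema), with the unique id as the sole partition key:

-- Hypothetical DDL for the layout described: one partition per stock id.
CREATE EXTERNAL TABLE stock_data (
    trade_date   date,
    open_price   double,
    close_price  double
)
PARTITIONED BY (id bigint)
STORED AS PARQUET
LOCATION 's3://my-bucket/stock-data/';

With hundreds of thousands of distinct id values, this yields as many Glue partitions, each holding a single small file, which is the layout the replies above recommend moving away from.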