I want to create partitioned tables in Amazon Athena, and use them to improve my queries.
Short description
To create a partitioned Athena table, complete the following steps:
- Store your data as a partition in Amazon Simple Storage Service (Amazon S3) buckets.
- Specify the partitioning columns and the root location of partitioned data when you create the table.
- Upload the partitions into the AWS Glue Data Catalog.
Note: The table refers to the Data Catalog when you run your queries.
Resolution
Store your data as a partition in Amazon S3
Choose one of the following formats to store your partitioned data:
- Hive style format: s3://doc-example-bucket/example-folder/year=2021/month=01/day=01/myfile.csv
Note: The path includes the names of the partition keys and their values. For example, year=2021.
- Non-Hive style format: s3://doc-example-bucket/example-folder/2021/01/01/myfile.csv
Specify the partitioning information when you create the table
In your CREATE TABLE statement, include PARTITIONED BY to define the partition columns and LOCATION to specify the root location of the partitioned data.
Example query:
CREATE EXTERNAL TABLE doc-example-table (
first string,
last string,
username string
)
PARTITIONED BY (year string, month string, day string)
STORED AS parquet
LOCATION 's3://doc-example-bucket/example-folder'
Note: In the preceding query, replace the following values with your values:
- doc-example-table with the name of the table that you're creating
- doc-example-bucket with the name of the S3 bucket where you store your table
- example-folder with the name of your S3 folder
- first, last, and username with the names of the columns
- year string, month string, and day string with the names of the partition columns
Upload partitions into the Data Catalog
After you create the table, choose one of the following methods to add the partitions to the Data Catalog.
Use the MSCK REPAIR TABLE query for Hive style format data
To update the Data Catalog metadata after you add the partitions, run the MSCK REPAIR TABLE command:
MSCK REPAIR TABLE doc-example-table
Note: Replace doc-example-table with your table.
If you have more than a few thousand partitions, then this method isn't a best practice to use. Your DDL queries might experience timeout issues. For more information, see Why doesn't my MSCK REPAIR TABLE query add partitions to the AWS Glue Data Catalog?
Use the ALTER TABLE ADD PARTITION query for both Hive style and non-Hive style format data
To add partitions to the Data catalog, run the ALTER TABLE ADD PARTITION command.
To add a single partition, run the following command:
ALTER TABLE doc-example-table ADD PARTITION (year='2021', month='01', day='01') LOCATION 's3://doc-example-bucket/example-folder/2021/01/01/'
To add multiple partitions in one query, run the following command:
ALTER TABLE doc-example-table ADD
PARTITION (year='2021', month='01', day='01') LOCATION 's3://doc-example-bucket/example-folder/2021/01/01/'
PARTITION (year='2020', month='06', day='01') LOCATION 's3://doc-example-bucket/example-folder/2020/06/01/'
Note: In the preceding commands, replace the example values with your values.
Use the AWS Glue crawler for both Hive and non-Hive style format data
To automatically infer a table schema from your dataset, create the table, and add the partitions to the Data Catalog, use the AWS Glue crawler. You can also use the crawler to only add partitions to a table that you manually create with the CREATE TABLE statement.
For more information, see Customizing crawler behavior.
Use partition projection for highly partitioned data in Amazon S3
It's not a best practice to run queries against highly partitioned data in Amazon S3 because the queries are slow. If your data is highly partitioned, then use partition projection to speed up the query process and automate partition management.
Related information
Why do I get zero records when I query my Amazon Athena table?
Why didn't my Athena query return results after I added new partitions?