How do I create and use partitioned tables in Amazon Athena?

4 minute read
0

I want to create partitioned tables in Amazon Athena, and use them to improve my queries.

Short description

To create a partitioned Athena table, complete the following steps:

  1. Store your data as a partition in Amazon Simple Storage Service (Amazon S3) buckets.
  2. Specify the partitioning columns and the root location of partitioned data when you create the table.
  3. Upload the partitions into the AWS Glue Data Catalog.
    Note: The table refers to the Data Catalog when you run your queries.

Resolution

Store your data as a partition in Amazon S3

Choose one of the following formats to store your partitioned data:

  • Hive style format: s3://doc-example-bucket/example-folder/year=2021/month=01/day=01/myfile.csv
    Note: The path includes the names of the partition keys and their values. For example, year=2021.
  • Non-Hive style format: s3://doc-example-bucket/example-folder/2021/01/01/myfile.csv

Specify the partitioning information when you create the table

In your CREATE TABLE statement, include PARTITIONED BY to define the partition columns and LOCATION to specify the root location of the partitioned data.

Example query:

CREATE EXTERNAL TABLE doc-example-table (
first string,
last string,
username string
)
PARTITIONED BY (year string, month string, day string)
STORED AS parquet
LOCATION 's3://doc-example-bucket/example-folder'

Note: In the preceding query, replace the following values with your values:

  • doc-example-table with the name of the table that you're creating
  • doc-example-bucket with the name of the S3 bucket where you store your table
  • example-folder with the name of your S3 folder
  • first, last, and username with the names of the columns
  • year string, month string, and day string with the names of the partition columns

Upload partitions into the Data Catalog

After you create the table, choose one of the following methods to add the partitions to the Data Catalog.

Use the MSCK REPAIR TABLE query for Hive style format data

To update the Data Catalog metadata after you add the partitions, run the MSCK REPAIR TABLE command:

MSCK REPAIR TABLE doc-example-table

Note: Replace doc-example-table with your table.

If you have more than a few thousand partitions, then this method isn't a best practice to use. Your DDL queries might experience timeout issues. For more information, see Why doesn't my MSCK REPAIR TABLE query add partitions to the AWS Glue Data Catalog?

Use the ALTER TABLE ADD PARTITION query for both Hive style and non-Hive style format data

To add partitions to the Data catalog, run the ALTER TABLE ADD PARTITION command.

To add a single partition, run the following command:

ALTER TABLE doc-example-table ADD PARTITION (year='2021', month='01', day='01') LOCATION 's3://doc-example-bucket/example-folder/2021/01/01/'

To add multiple partitions in one query, run the following command:

ALTER TABLE doc-example-table ADD  
PARTITION (year='2021', month='01', day='01') LOCATION 's3://doc-example-bucket/example-folder/2021/01/01/'  
PARTITION (year='2020', month='06', day='01') LOCATION 's3://doc-example-bucket/example-folder/2020/06/01/'

Note: In the preceding commands, replace the example values with your values.

Use the AWS Glue crawler for both Hive and non-Hive style format data

To automatically infer a table schema from your dataset, create the table, and add the partitions to the Data Catalog, use the AWS Glue crawler. You can also use the crawler to only add partitions to a table that you manually create with the CREATE TABLE statement.

For more information, see Customizing crawler behavior.

Use partition projection for highly partitioned data in Amazon S3

It's not a best practice to run queries against highly partitioned data in Amazon S3 because the queries are slow. If your data is highly partitioned, then use partition projection to speed up the query process and automate partition management.

Related information

Why do I get zero records when I query my Amazon Athena table?

Why didn't my Athena query return results after I added new partitions?

AWS OFFICIAL
AWS OFFICIALUpdated a month ago