How can I set the number or size of files when I run a CTAS query in Athena?
When I run a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena, I want to define the number of files or the amount of data per file.
Resolution
Use bucketing to set the file size or number of files in a CTAS query.
Note: The following steps use the Global Historical Climatology Network Daily public dataset (s3://noaa-ghcn-pds/csv.gz/) to illustrate the solution. For more information about this dataset, see Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight. These steps show how to examine your dataset, create the environment, and then modify the dataset in the following ways:
- Modify the number of files in the Amazon Simple Storage Service (Amazon S3) dataset.
- Set the approximate size of each file.
- Convert the data format and set the approximate file size.
Examine the dataset
Run the following AWS Command Line Interface (AWS CLI) command to verify the number of files and the size of the dataset:
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
aws s3 ls s3://noaa-ghcn-pds/csv.gz/ --summarize --recursive --human-readable
The output looks similar to the following:
2019-11-30 01:58:05    3.3 KiB csv.gz/1763.csv.gz
2019-11-30 01:58:06    3.2 KiB csv.gz/1764.csv.gz
2019-11-30 01:58:06    3.3 KiB csv.gz/1765.csv.gz
2019-11-30 01:58:07    3.3 KiB csv.gz/1766.csv.gz
...
2019-11-30 02:05:43  199.7 MiB csv.gz/2016.csv.gz
2019-11-30 02:05:50  197.7 MiB csv.gz/2017.csv.gz
2019-11-30 02:05:54  197.0 MiB csv.gz/2018.csv.gz
2019-11-30 02:05:57  168.8 MiB csv.gz/2019.csv.gz

Total Objects: 257
   Total Size: 15.4 GiB
Create the environment
1. Run a statement similar to the following to create a table:
CREATE EXTERNAL TABLE historic_climate_gz(
  id string,
  yearmonthday int,
  element string,
  temperature int,
  m_flag string,
  q_flag string,
  s_flag string,
  obs_time int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://noaa-ghcn-pds/csv.gz/'
2. Run the following command to test the table:
SELECT * FROM historic_climate_gz LIMIT 10
The output shows ten lines from the dataset. After the environment is created, use one or more of the following methods to modify the dataset when you run CTAS queries.
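Optionally, you can also record a baseline row count for the source table now, so that you can confirm later that each CTAS copy is complete. This check is a minimal sketch and isn't part of the original procedure:

-- Optional sanity check: record the source table's row count
-- so that you can compare it against the CTAS output tables later.
SELECT COUNT(*) AS total_rows FROM historic_climate_gz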
Modify the number of files in the dataset
It's a best practice to bucket data by a column that has high cardinality and evenly distributed values. For more information, see Bucketing vs Partitioning. In the following example, we use the yearmonthday field.
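To gauge whether a column is a good bucketing key, you can compare distinct counts. The following query is a minimal sketch (not part of the original procedure) that uses columns from the table created earlier:

-- Minimal sketch: compare distinct counts to gauge column cardinality.
-- A higher distinct count generally spreads data more evenly across buckets.
SELECT
  COUNT(DISTINCT yearmonthday) AS yearmonthday_values,
  COUNT(DISTINCT element) AS element_values
FROM historic_climate_gz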
1. To convert the dataset into 20 files, run a statement similar to the following:
CREATE TABLE "historic_climate_gz_20_files" WITH ( external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/', format = 'TEXTFILE', bucket_count=20, bucketed_by = ARRAY['yearmonthday'] ) AS SELECT * FROM historic_climate_gz
Replace the following values in the query:
- external_location: the Amazon S3 location where Athena saves your CTAS query
- format: the format that you want for the output (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
- bucket_count: the number of files that you want (for example, 20)
- bucketed_by: the field for hashing and saving the data in the bucket (for example, yearmonthday)
2. Run the following command to confirm that the bucket contains the desired number of files:
aws s3 ls s3://awsexamplebucket/historic_climate_gz_20_files/ --summarize --recursive --human-readable

The output looks similar to the following:

Total Objects: 20
   Total Size: 15.6 GiB
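As an alternative to the AWS CLI, you can list the output files from within Athena by using the $path pseudo-column. This optional check isn't part of the original procedure:

-- Optional: list each S3 object that backs the bucketed table.
-- "$path" is an Athena pseudo-column that returns the file for each row.
SELECT DISTINCT "$path" FROM "historic_climate_gz_20_files"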
Set the approximate size of each file
1. Determine how many files you need to achieve the desired file size. For example, to split the 15.4 GiB dataset into 2 GiB files, you need 8 files (15.4 / 2 = 7.7, rounded up to 8). A query-based version of this arithmetic appears in the sketch after this procedure.
2. Run a statement similar to the following:
CREATE TABLE "historic_climate_gz_2GB_files" WITH ( external_location = 's3://awsexamplebucket/historic_climate_gz_2GB_file/', format = 'TEXTFILE', bucket_count=8, bucketed_by = ARRAY['yearmonthday']) AS SELECT * FROM historic_climate_gz
Replace the following values in the query:
- external_location: the Amazon S3 location where Athena saves your CTAS query
- format: must be the same format as the source data (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
- bucket_count: the number of files that you want (for example, 8)
- bucketed_by: the field for hashing and saving the data in the bucket. Choose a field with high cardinality.
3. Run the following command to confirm that the dataset contains the desired number of files:
aws s3 ls s3://awsexamplebucket/historic_climate_gz_2GB_file/ --summarize --recursive --human-readable
The output looks similar to the following:
2019-09-03 10:59:20    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00000.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00001.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00002.gz
2019-09-03 10:59:19    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00003.gz
2019-09-03 10:59:17    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00004.gz
2019-09-03 10:59:21    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00005.gz
2019-09-03 10:59:18    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00006.gz
2019-09-03 10:59:17    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00007.gz

Total Objects: 8
   Total Size: 15.0 GiB
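The file-count arithmetic from step 1 can also be done in Athena itself. A minimal sketch, assuming a 15.4 GiB dataset and a 2 GiB target file size:

-- Minimal sketch: divide the dataset size by the target file size and
-- round up to the next integer to get bucket_count (15.4 / 2 -> 8).
SELECT CAST(ceil(15.4 / 2) AS integer) AS bucket_count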
Convert the data format and set the approximate file size
1. Run a statement similar to the following to convert the data to a different format:
CREATE TABLE "historic_climate_parquet" WITH ( external_location = 's3://awsexamplebucket/historic_climate_parquet/', format = 'PARQUET') AS SELECT * FROM historic_climate_gz
Replace the following values in the query:
- external_location: the Amazon S3 location where Athena saves your CTAS query
- format: the format that you want to convert to (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
2. Run the following command to confirm the size of the dataset:
aws s3 ls s3://awsexamplebucket/historic_climate_parquet/ --summarize --recursive --human-readable
The output looks similar to the following:
Total Objects: 30
   Total Size: 9.8 GiB
3. Determine how many files you need to achieve the desired file size. For example, if you want 500 MB files and the dataset is 9.8 GB, then you need 20 files (9,800 / 500 = 19.6, rounded up to 20).
4. To convert the dataset into 500 MB files, run a statement similar to the following:
CREATE TABLE "historic_climate_parquet_500mb" WITH ( external_location = 's3://awsexamplebucket/historic_climate_parquet_500mb/', format = 'PARQUET', bucket_count=20, bucketed_by = ARRAY['yearmonthday'] ) AS SELECT * FROM historic_climate_parquet
Replace the following values in the query:
- external_location: the Amazon S3 location where Athena saves your CTAS query
- bucket_count: the number of files that you want (for example, 20)
- bucketed_by: the field for hashing and saving the data in the bucket. Choose a field with high cardinality.
5. Run the following command to confirm that the dataset contains the desired number of files:
aws s3 ls s3://awsexamplebucket/historic_climate_parquet_500mb/ --summarize --recursive --human-readable
The output looks similar to the following:
2019-09-03 12:01:45  333.9 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00000
2019-09-03 12:01:01  666.7 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00001
2019-09-03 12:01:00  665.6 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00002
2019-09-03 12:01:06  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00003
2019-09-03 12:00:59  667.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00004
2019-09-03 12:01:27  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00005
2019-09-03 12:01:10  666.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00006
2019-09-03 12:01:12  668.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00007
2019-09-03 12:01:03  666.8 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00008
2019-09-03 12:01:10  646.4 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00009
2019-09-03 12:01:35  639.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00010
2019-09-03 12:00:52  529.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00011
2019-09-03 12:01:29  334.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00012
2019-09-03 12:01:32  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00013
2019-09-03 12:01:34  332.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00014
2019-09-03 12:01:44  333.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00015
2019-09-03 12:01:51  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00016
2019-09-03 12:01:39  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00017
2019-09-03 12:01:47  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00018
2019-09-03 12:01:49  332.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00019

Total Objects: 20
   Total Size: 9.9 GiB
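To confirm that the conversion didn't drop any rows, you can compare row counts between the source and converted tables. This optional check is a minimal sketch, not part of the original procedure:

-- Optional: verify that the converted table has the same row count as the source.
SELECT
  (SELECT COUNT(*) FROM historic_climate_gz) AS source_rows,
  (SELECT COUNT(*) FROM historic_climate_parquet_500mb) AS converted_rows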
Note: The INSERT INTO statement isn't supported on bucketed tables. For more information, see Bucketed tables not supported.
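Because INSERT INTO isn't supported on bucketed tables, one way to append data is to rebuild the table with another CTAS query that unions the existing table with the new rows. A minimal sketch, assuming a hypothetical staging table named new_climate_rows with the same schema:

-- Minimal sketch: "append" to a bucketed table by rebuilding it with CTAS.
-- new_climate_rows is a hypothetical staging table with the same schema.
CREATE TABLE "historic_climate_parquet_500mb_v2"
WITH (
  external_location = 's3://awsexamplebucket/historic_climate_parquet_500mb_v2/',
  format = 'PARQUET',
  bucket_count = 20,
  bucketed_by = ARRAY['yearmonthday']
) AS
SELECT * FROM historic_climate_parquet_500mb
UNION ALL
SELECT * FROM new_climate_rows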