How do I adjust the number or size of files when I run a CTAS query in Amazon Athena?
When I run a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena, I want to define the number of files or the amount of data for each file.
Resolution
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
You can use bucketing in a CTAS query to control the number of output files. However, the number of files that are created might not always match the specified number of buckets. The bucketing feature can group similar data together, but it can't precisely control the file count.
The following resolution uses the Global Historical Climatology Network Daily public dataset, s3://noaa-ghcn-pds/csv.gz/. For more information, see Visualize over 200 years of global climate data using Amazon Athena and Amazon Quick Sight.
Note: In the following example commands, replace the following values with your values:
external_location: Amazon Simple Storage Service (Amazon S3) location where you saved your CTAS query.
format: The format that you want for the output, such as ORC, PARQUET, AVRO, JSON, or TEXTFILE.
bucket_count: The number of buckets that you want.
bucketed_by: The field to hash and save the data in the bucket, for example yearmonthday.
Examine the dataset
To verify the number of files and the size of the dataset, run the following ls command:
aws s3 ls s3://noaa-ghcn-pds/csv.gz/ --summarize --recursive --human-readable
Note: Replace s3://noaa-ghcn-pds/csv.gz/ with your s3 bucket path.
Example output:
2019-11-30 01:58:05 3.3 KiB csv.gz/1763.csv.gz 2019-11-30 01:58:06 3.2 KiB csv.gz/1764.csv.gz 2019-11-30 01:58:06 3.3 KiB csv.gz/1765.csv.gz 2019-11-30 01:58:07 3.3 KiB csv.gz/1766.csv.gz ... 2019-11-30 02:05:43 199.7 MiB csv.gz/2016.csv.gz 2019-11-30 02:05:50 197.7 MiB csv.gz/2017.csv.gz 2019-11-30 02:05:54 197.0 MiB csv.gz/2018.csv.gz 2019-11-30 02:05:57 168.8 MiB csv.gz/2019.csv.gz Total Objects: 257 Total Size: 15.4 GiB
Create the environment
Complete the following steps:
-
To create a table, run the following command:
CREATE EXTERNAL TABLE historic_climate_gz( id string, yearmonthday int, element string, temperature int, m_flag string, q_flag string, s_flag string, obs_time int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://noaa-ghcn-pds/csv.gz/' -
To test the table, run the following command:
SELECT * FROM historic_climate_gz LIMIT 10The output shows 10 lines from the dataset.
After you create the environment, use the following methods to modify the dataset when you run CTAS queries.
Modify the number of files in the dataset
It's a best practice to bucket data by a column that has high cardinality and evenly distributed values. For more information, see Bucketing benefits.
Complete the following steps:
-
To convert the dataset into 20 files, run the following command:
CREATE TABLE "historic_climate_gz_20_files" WITH ( external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/', format = 'TEXTFILE', bucket_count=20, bucketed_by = ARRAY['yearmonthday'] ) AS SELECT * FROM historic_climate_gzNote: The preceding example uses the yearmonthday field.
-
To confirm that the bucket contains the desired number of files, run the following ls command:
aws s3 ls s3://awsexamplebucket/historic_climate_gz_20_files/ --summarize --recursive --human-readableExample output:
Total Objects: 20 Total Size: 15.6 Gib
Set the approximate size of each file
Complete the following steps:
-
Determine the number of buckets that you must use to achieve the number of desired files. For example, to split the 15.4 GB dataset into 2 GB files, you must have 8 files (15.4 / 2 = 7.7, rounded up to 8).
-
To create a new bucketed table, run the following command:
CREATE TABLE "historic_climate_gz_2GB_files" WITH ( external_location = 's3://awsexamplebucket/historic_climate_gz_2GB_file/', format = 'TEXTFILE', bucket_count=8, bucketed_by = ARRAY['yearmonthday']) AS SELECT * FROM historic_climate_gz -
To confirm that the dataset contains the desired number of files, run the following ls command:
aws s3 ls s3://awsexamplebucket/historic_climate_gz_2GB_file/ --summarize --recursive --human-readableExample output:
2019-09-03 10:59:20 1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00000.gz 2019-09-03 10:59:20 2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00001.gz 2019-09-03 10:59:20 2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00002.gz 2019-09-03 10:59:19 1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00003.gz 2019-09-03 10:59:17 1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00004.gz 2019-09-03 10:59:21 1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00005.gz 2019-09-03 10:59:18 1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00006.gz 2019-09-03 10:59:17 1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00007.gz Total Objects: 8 Total Size: 15.0 GiB
Convert the data format and set the approximate file size
Complete the following steps:
-
To convert the data to a different format, run the following command:
CREATE TABLE "historic_climate_parquet" WITH ( external_location = 's3://awsexamplebucket/historic_climate_parquet/', format = 'PARQUET') AS SELECT * FROM historic_climate_gz -
To confirm the size of the dataset, run the following ls command:
aws s3 ls s3://awsexamplebucket/historic_climate_parquet/ --summarize --recursive --human-readableExample output:
Total Objects: 30 Total Size: 9.8 GiB -
Determine the number of buckets that you must use to achieve the number of desired files. For example, for 500 MB files and a dataset that's 9.8 GB, you must have 20 files.
-
To convert the dataset into 500 MB files, run the following command:
CREATE TABLE "historic_climate_parquet_500mb" WITH ( external_location = 's3://awsexamplebucket/historic_climate_parquet_500mb/', format = 'PARQUET', bucket_count=20, bucketed_by = ARRAY['yearmonthday'] ) AS SELECT * FROM historic_climate_parquet -
To confirm that the dataset contains the desired number of files, run the following ls command:
aws s3 ls s3://awsexamplebucket/historic_climate_parquet_500mb/ --summarize --recursive --human-readableExample output:
2019-09-03 12:01:45 333.9 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00000 2019-09-03 12:01:01 666.7 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00001 2019-09-03 12:01:00 665.6 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00002 2019-09-03 12:01:06 666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00003 2019-09-03 12:00:59 667.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00004 2019-09-03 12:01:27 666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00005 2019-09-03 12:01:10 666.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00006 2019-09-03 12:01:12 668.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00007 2019-09-03 12:01:03 666.8 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00008 2019-09-03 12:01:10 646.4 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00009 2019-09-03 12:01:35 639.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00010 2019-09-03 12:00:52 529.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00011 2019-09-03 12:01:29 334.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00012 2019-09-03 12:01:32 333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00013 2019-09-03 12:01:34 332.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00014 2019-09-03 12:01:44 333.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00015 2019-09-03 12:01:51 333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00016 2019-09-03 12:01:39 333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00017 2019-09-03 12:01:47 333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00018 2019-09-03 12:01:49 332.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00019 Total Objects: 20 Total Size: 9.9 GiB
Note: Bucketed tables don't support the INSERT INTO statement.
Related information
- Topics
- Analytics
- Tags
- Amazon Athena
- Language
- English

Relevant content
- Accepted Answerasked 4 years ago
- asked a year ago
- asked 7 years ago
AWS OFFICIALUpdated 6 months ago
AWS OFFICIALUpdated 7 months ago