Skip to content

How do I adjust the number or size of files when I run a CTAS query in Amazon Athena?

6 minute read
0

When I run a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena, I want to define the number of files or the amount of data for each file.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

You can use bucketing in a CTAS query to control the number of output files. However, the number of files that are created might not always match the specified number of buckets. The bucketing feature can group similar data together, but it can't precisely control the file count.

The following resolution uses the Global Historical Climatology Network Daily public dataset, s3://noaa-ghcn-pds/csv.gz/. For more information, see Visualize over 200 years of global climate data using Amazon Athena and Amazon Quick Sight.

Note: In the following example commands, replace the following values with your values:

external_location: Amazon Simple Storage Service (Amazon S3) location where you saved your CTAS query.

format: The format that you want for the output, such as ORC, PARQUET, AVRO, JSON, or TEXTFILE.

bucket_count: The number of buckets that you want.

bucketed_by: The field to hash and save the data in the bucket, for example yearmonthday.

Examine the dataset

To verify the number of files and the size of the dataset, run the following ls command:

aws s3 ls s3://noaa-ghcn-pds/csv.gz/ --summarize --recursive --human-readable

Note: Replace s3://noaa-ghcn-pds/csv.gz/ with your s3 bucket path.

Example output:

2019-11-30 01:58:05    3.3 KiB csv.gz/1763.csv.gz
2019-11-30 01:58:06    3.2 KiB csv.gz/1764.csv.gz
2019-11-30 01:58:06    3.3 KiB csv.gz/1765.csv.gz
2019-11-30 01:58:07    3.3 KiB csv.gz/1766.csv.gz
...
2019-11-30 02:05:43  199.7 MiB csv.gz/2016.csv.gz
2019-11-30 02:05:50  197.7 MiB csv.gz/2017.csv.gz
2019-11-30 02:05:54  197.0 MiB csv.gz/2018.csv.gz
2019-11-30 02:05:57  168.8 MiB csv.gz/2019.csv.gz

Total Objects: 257
Total Size: 15.4 GiB

Create the environment

Complete the following steps:

  1. To create a table, run the following command:

    CREATE EXTERNAL TABLE historic_climate_gz(  
      id string,
      yearmonthday int,
      element string,
      temperature int,
      m_flag string,
      q_flag string,
      s_flag string,
      obs_time int)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
    STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://noaa-ghcn-pds/csv.gz/'
  2. To test the table, run the following command:

    SELECT * FROM historic_climate_gz LIMIT 10

    The output shows 10 lines from the dataset.

After you create the environment, use the following methods to modify the dataset when you run CTAS queries.

Modify the number of files in the dataset

It's a best practice to bucket data by a column that has high cardinality and evenly distributed values. For more information, see Bucketing benefits.

Complete the following steps:

  1. To convert the dataset into 20 files, run the following command:

    CREATE TABLE "historic_climate_gz_20_files" WITH (
          external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/',  
          format = 'TEXTFILE',
          bucket_count=20,
          bucketed_by = ARRAY['yearmonthday']
         ) AS
    SELECT * FROM historic_climate_gz

    Note: The preceding example uses the yearmonthday field.

  2. To confirm that the bucket contains the desired number of files, run the following ls command:

    aws s3 ls s3://awsexamplebucket/historic_climate_gz_20_files/ --summarize --recursive --human-readable

    Example output:

    Total Objects: 20
    Total Size: 15.6 Gib

Set the approximate size of each file

Complete the following steps:

  1. Determine the number of buckets that you must use to achieve the number of desired files. For example, to split the 15.4 GB dataset into 2 GB files, you must have 8 files (15.4 / 2 = 7.7, rounded up to 8).

  2. To create a new bucketed table, run the following command:

    CREATE TABLE "historic_climate_gz_2GB_files" WITH (
          external_location = 's3://awsexamplebucket/historic_climate_gz_2GB_file/',  
          format = 'TEXTFILE',
          bucket_count=8,
          bucketed_by = ARRAY['yearmonthday']) AS
    SELECT * FROM historic_climate_gz
  3. To confirm that the dataset contains the desired number of files, run the following ls command:

    aws s3 ls s3://awsexamplebucket/historic_climate_gz_2GB_file/ --summarize --recursive --human-readable

    Example output:

    2019-09-03 10:59:20    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00000.gz
    2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00001.gz
    2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00002.gz
    2019-09-03 10:59:19    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00003.gz
    2019-09-03 10:59:17    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00004.gz
    2019-09-03 10:59:21    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00005.gz
    2019-09-03 10:59:18    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00006.gz
    2019-09-03 10:59:17    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00007.gz
    
    Total Objects: 8
    Total Size: 15.0 GiB

Convert the data format and set the approximate file size

Complete the following steps:

  1. To convert the data to a different format, run the following command:

    CREATE TABLE "historic_climate_parquet" WITH (
          external_location = 's3://awsexamplebucket/historic_climate_parquet/',
          format = 'PARQUET') AS
    SELECT * FROM historic_climate_gz
  2. To confirm the size of the dataset, run the following ls command:

    aws s3 ls s3://awsexamplebucket/historic_climate_parquet/ --summarize --recursive --human-readable

    Example output:

    Total Objects: 30
    Total Size: 9.8 GiB
  3. Determine the number of buckets that you must use to achieve the number of desired files. For example, for 500 MB files and a dataset that's 9.8 GB, you must have 20 files.

  4. To convert the dataset into 500 MB files, run the following command:

    CREATE TABLE "historic_climate_parquet_500mb" WITH (
          external_location = 's3://awsexamplebucket/historic_climate_parquet_500mb/',
          format = 'PARQUET',
          bucket_count=20,
          bucketed_by = ARRAY['yearmonthday']
           ) AS
    SELECT * FROM historic_climate_parquet
  5. To confirm that the dataset contains the desired number of files, run the following ls command:

    aws s3 ls s3://awsexamplebucket/historic_climate_parquet_500mb/ --summarize --recursive --human-readable

    Example output:

    2019-09-03 12:01:45  333.9 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00000
    2019-09-03 12:01:01  666.7 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00001
    2019-09-03 12:01:00  665.6 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00002
    2019-09-03 12:01:06  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00003
    2019-09-03 12:00:59  667.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00004
    2019-09-03 12:01:27  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00005
    2019-09-03 12:01:10  666.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00006
    2019-09-03 12:01:12  668.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00007
    2019-09-03 12:01:03  666.8 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00008
    2019-09-03 12:01:10  646.4 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00009
    2019-09-03 12:01:35  639.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00010
    2019-09-03 12:00:52  529.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00011
    2019-09-03 12:01:29  334.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00012
    2019-09-03 12:01:32  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00013
    2019-09-03 12:01:34  332.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00014
    2019-09-03 12:01:44  333.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00015
    2019-09-03 12:01:51  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00016
    2019-09-03 12:01:39  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00017
    2019-09-03 12:01:47  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00018
    2019-09-03 12:01:49  332.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00019
    Total Objects: 20
    Total Size: 9.9 GiB

Note: Bucketed tables don't support the INSERT INTO statement.

Related information

Examples of CTAS queries

Considerations and limitations for CTAS queries

AWS OFFICIALUpdated a year ago