How can I set the number or size of files when I run a CTAS query in Athena?

When I run a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena, I want to define the number of files or the amount of data per file.

Resolution

Use bucketing to set the file size or number of files in a CTAS query.

Note: The following steps use the Global Historical Climatology Network Daily public dataset (s3://noaa-ghcn-pds/csv.gz/) to illustrate the solution. For more information about this dataset, see Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight. These steps show how to examine your dataset, create the environment, and then modify the dataset:

  1. Modify the number of files in the Amazon Simple Storage Service (Amazon S3) dataset.
  2. Set the approximate size of each file.
  3. Convert the data format and set the approximate file size.

Examine the dataset

Run the following AWS Command Line Interface (AWS CLI) command to verify the number of files and the size of the dataset:

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.

aws s3 ls s3://noaa-ghcn-pds/csv.gz/ --summarize --recursive --human-readable

The output looks similar to the following:

2019-11-30 01:58:05    3.3 KiB csv.gz/1763.csv.gz
2019-11-30 01:58:06    3.2 KiB csv.gz/1764.csv.gz
2019-11-30 01:58:06    3.3 KiB csv.gz/1765.csv.gz
2019-11-30 01:58:07    3.3 KiB csv.gz/1766.csv.gz
...
2019-11-30 02:05:43  199.7 MiB csv.gz/2016.csv.gz
2019-11-30 02:05:50  197.7 MiB csv.gz/2017.csv.gz
2019-11-30 02:05:54  197.0 MiB csv.gz/2018.csv.gz
2019-11-30 02:05:57  168.8 MiB csv.gz/2019.csv.gz

Total Objects: 257
Total Size: 15.4 GiB

Create the environment

1.    Run a statement similar to the following to create a table:

CREATE EXTERNAL TABLE historic_climate_gz(
  id string,
  yearmonthday int,
  element string,
  temperature int,
  m_flag string,
  q_flag string,
  s_flag string,
  obs_time int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://noaa-ghcn-pds/csv.gz/'

2.    Run the following command to test the table:

SELECT * FROM historic_climate_gz LIMIT 10

The output shows ten rows from the dataset. After the environment is created, use one or more of the following methods to modify the dataset when you run CTAS queries.
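
Optionally, before you modify the dataset, you can run a full row count as a sanity check that the table reads every file in the source location. This is a minimal sketch; the exact count changes as the public dataset is updated:

SELECT COUNT(*) AS source_row_count
FROM historic_climate_gz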

Modify the number of files in the dataset

It's a best practice to bucket data by a column that has high cardinality and evenly distributed values. For more information, see Bucketing vs Partitioning. In the following example, we use the yearmonthday field.

1.    To convert the dataset into 20 files, run a statement similar to the following:

CREATE TABLE "historic_climate_gz_20_files"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/',
      format = 'TEXTFILE',
      bucket_count=20,
      bucketed_by = ARRAY['yearmonthday']
     ) AS
SELECT * FROM historic_climate_gz

Replace the following values in the query:
external_location: Amazon S3 location where Athena saves your CTAS query
format: format that you want for the output (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
bucket_count: number of files that you want (for example, 20)
bucketed_by: field for hashing and saving the data in the bucket (for example, yearmonthday)

2.    Run the following command to confirm that the bucket contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_gz_20_files/ --summarize --recursive --human-readable

The output looks similar to the following:

Total Objects: 20
Total Size: 15.6 GiB
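
You can also confirm the file count from within Athena itself. The following query is a minimal sketch that uses the "$path" pseudo-column to count the distinct Amazon S3 objects that back the bucketed table:

SELECT COUNT(DISTINCT "$path") AS output_files
FROM historic_climate_gz_20_files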

Set the approximate size of each file

1.    Determine how many files you need to achieve the desired file size. For example, to split the 15.4-GiB dataset into files of about 2 GB each, you need 8 files (15.4 / 2 = 7.7, rounded up to 8).

2.    Run a statement similar to the following:

CREATE TABLE "historic_climate_gz_2GB_files"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_gz_2GB_file/',
      format = 'TEXTFILE',
      bucket_count=8,
      bucketed_by = ARRAY['yearmonthday']) AS
SELECT * FROM historic_climate_gz

Replace the following values in the query:
external_location: Amazon S3 location where Athena saves your CTAS query
format: format that you want for the output (in this example, TEXTFILE, to match the source data)
bucket_count: number of files that you want (for example, 8)
bucketed_by: field for hashing and saving the data in the bucket. Choose a field with high cardinality.

3.    Run the following command to confirm that the dataset contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_gz_2GB_file/ --summarize --recursive --human-readable

The output looks similar to the following:

2019-09-03 10:59:20    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00000.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00001.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00002.gz
2019-09-03 10:59:19    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00003.gz
2019-09-03 10:59:17    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00004.gz
2019-09-03 10:59:21    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00005.gz
2019-09-03 10:59:18    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00006.gz
2019-09-03 10:59:17    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00007.gz
Total Objects: 8
Total Size: 15.0 GiB
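
Because bucketing hashes rows on the yearmonthday value, each file's size depends on how many rows hash into that bucket, so the files are only approximately 2 GB. The following query is a minimal sketch that uses the "$path" pseudo-column to show the row distribution across the output files so that you can check for skew:

SELECT "$path" AS output_file,
       COUNT(*) AS row_count
FROM historic_climate_gz_2GB_files
GROUP BY 1
ORDER BY row_count DESC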

Convert the data format and set the approximate file size

1.    Run a statement similar to the following to convert the data to a different format:

CREATE TABLE "historic_climate_parquet"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_parquet/',
      format = 'PARQUET') AS
SELECT * FROM historic_climate_gz

Replace the following values in the query:
external_location: Amazon S3 location where Athena saves your CTAS query
format: format that you want to convert to (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)

2.    Run the following command to confirm the size of the dataset:

aws s3 ls s3://awsexamplebucket/historic_climate_parquet/ --summarize --recursive --human-readable

The output looks similar to the following:

Total Objects: 30
Total Size: 9.8 GiB

3.    Determine how many files you need to achieve the desired file size. For example, if you want files of about 500 MB each and the dataset is 9.8 GiB, then you need 20 files (9,800 / 500 = 19.6, rounded up to 20).

4.    To convert the dataset into 500 MB files, run a statement similar to the following:

CREATE TABLE "historic_climate_parquet_500mb"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_parquet_500mb/',
      format = 'PARQUET',
      bucket_count=20,
      bucketed_by = ARRAY['yearmonthday']
       ) AS
SELECT * FROM historic_climate_parquet

Replace the following values in the query:
external_location: Amazon S3 location where Athena saves your CTAS query
bucket_count: number of files that you want (for example, 20)
bucketed_by: field for hashing and saving the data in the bucket. Choose a field with high cardinality.

5.    Run the following command to confirm that the dataset contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_parquet_500mb/ --summarize --recursive --human-readable

The output looks similar to the following:

2019-09-03 12:01:45  333.9 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00000
2019-09-03 12:01:01  666.7 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00001
2019-09-03 12:01:00  665.6 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00002
2019-09-03 12:01:06  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00003
2019-09-03 12:00:59  667.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00004
2019-09-03 12:01:27  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00005
2019-09-03 12:01:10  666.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00006
2019-09-03 12:01:12  668.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00007
2019-09-03 12:01:03  666.8 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00008
2019-09-03 12:01:10  646.4 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00009
2019-09-03 12:01:35  639.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00010
2019-09-03 12:00:52  529.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00011
2019-09-03 12:01:29  334.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00012
2019-09-03 12:01:32  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00013
2019-09-03 12:01:34  332.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00014
2019-09-03 12:01:44  333.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00015
2019-09-03 12:01:51  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00016
2019-09-03 12:01:39  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00017
2019-09-03 12:01:47  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00018
2019-09-03 12:01:49  332.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00019
Total Objects: 20
Total Size: 9.9 GiB
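
To verify that the format conversion and bucketing didn't drop any rows, you can compare row counts between the source table and the final table. This is a minimal sketch; the two counts should match:

SELECT
  (SELECT COUNT(*) FROM historic_climate_gz) AS source_row_count,
  (SELECT COUNT(*) FROM historic_climate_parquet_500mb) AS parquet_row_count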

Note: The INSERT INTO statement isn't supported on bucketed tables. For more information, see Bucketed tables not supported.
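
If you expect to append data later, one option is to keep a separate table that isn't bucketed, because INSERT INTO works on non-bucketed CTAS tables. The following statements are a minimal sketch (run each one as a separate query); the table name historic_climate_parquet_appendable, its Amazon S3 location, and the yearmonthday cutoff are illustrative values, not part of the original setup:

CREATE TABLE "historic_climate_parquet_appendable"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_parquet_appendable/',
      format = 'PARQUET') AS
SELECT * FROM historic_climate_gz
WHERE yearmonthday < 20200101

INSERT INTO historic_climate_parquet_appendable
SELECT * FROM historic_climate_gz
WHERE yearmonthday >= 20200101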


Related information

Examples of CTAS queries

Considerations and limitations for CTAS queries
