Charges for CREATE TABLE AS SELECT

0

In the Athena documentation, it suggests that there are no charges for CREATE TABLE commands. However, if a user runs a CTAS query to create a new table from an original table, Athena runs a SELECT statement and scans the data in the original table in order to generate the new table.

So I have two question based on this observation:

  1. Does as CREATE TABLE AS SELECT command generate costs the same way as any other query would would? I can't see this mentioned in the docs, but looking at the Console a large amount of data is scanned when creating a table with CTAS.

  2. Let's imagine a new CTAS table is created from an original Athena table that gets its data from text files in an s3 bucket. Once the new table is created using CTAS, in order for it to remain in sync with new files (in s3), does it intermittently run a query against the original table, thus generating a regular new cost?

This seems like things could get expensive just to keep the CTAS table up to date?

asked 2 years ago913 views
1 Answer
1
Accepted Answer
  1. CREATE TABLE AS SELECT (CTAS) queries are considered DML queries and do incur costs based on the amount of data scanned by the SELECT query.
  2. The table created as a result of running CTAS query is not a pointer to the original table but a separate copy of it and hence does not involve any recurring costs.
AWS
SUPPORT ENGINEER
Manu_G
answered 2 years ago
  • okay thanks, about point 2, does that me it does not get updated after it's been created, and remains static from the time it's first created?

  • AWS-User-5296216, it's not precise to call the table "static". It might change. It might change a lot it you make lots of updates! But it will not be affected in any way at all by things that happen to the original table, which is I think what you're asking about.

  • Yes, after you create a table as select, it is a separate, unrelated, table to the original. The data for the table is stored separately, in whichever location you chose when creating it (likely an s3 bucket, with a different prefix). Any change to the original table will not propagate to the new table. You're basically creating a brand new data set that's pre-seeded with data. If you want changes to the original data to be seen in the derived table, you could use Glue to load it periodically into the new table.

  • The answer is correct, it is important to notice that the CTAS is a one time operation that will create a copy of the data in the original table at the time of execution.

    If you update or add data to the original table and you NEED to keep the 2 tables in sync , then you will need to schedule an insert statement that reads from the original table ONLY the new data (if possible).

  • @Fabrizio@AWS Thanks for the info. Inserting at regular intervals would be perfect. However, as the original logs that the primary table reads are txt files, not columnar data, generating a regular insert statement (which would need to read datetime column data) would mean that all of the data has to once again be read, which means much of the cost saving advantage of using a CTAS statement to generate a columnar table is lost. Unless I've misunderstood. I know I could generate a SELECT/INSERT statement using a partition column which includes a date, but if that partition is created from a non-hive style format like that of alb logs (e.g 'OBJECTDIR/2022/01/01'), the regular insert can only be done accurately at most once a day, which is not ideal for monitoring applications. Again, unless I've missed something?

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions