How to limit the number of columns in a CUR report

0

I am using AWS CUR reports to analyze my billing and usage data. I want to copy my CUR report into a SQL table, but the columns in CUR reports don't remain constant, so I can't load the data until I modify the table columns each time they change. Can anybody suggest a way to limit or restrict the columns in a CUR report, or an alternate way to copy the data into a SQL table?

6 Answers
3

I assume your final objective is to be able to query your CUR. If that's the case, instead of doing extra transformation and data migration, and paying to run a database that holds the CUR data, a better approach is to use Amazon Athena, which lets you query the CUR directly in S3; you will likely also get better performance. If that's something that appeals to you, here (1) is how it's done.
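For example, once the Athena integration is set up, a query like the one below returns your unblended cost per service for a month. This is a minimal sketch: `cur_db.cur_table` is a placeholder for the database/table the integration creates, and the `year`/`month` partition values are assumptions you'd adjust.

```sql
-- Minimal sketch: unblended cost per AWS service for one billing month.
-- cur_db.cur_table is a placeholder; the CUR Athena integration creates
-- the actual database/table, and partitions it by year/month strings.
SELECT
    line_item_product_code,
    SUM(line_item_unblended_cost) AS unblended_cost
FROM cur_db.cur_table
WHERE year = '2023'
  AND month = '5'
GROUP BY line_item_product_code
ORDER BY unblended_cost DESC;
```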

References

(1) https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html

AWS
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
2

I agree with @LorenzoAWS that Athena is one good option to run ad-hoc queries against your CUR.

That said, there are cases where a relational SQL database might still be preferred or needed. For example, you might have an existing SQL data warehouse that integrates with your financial reporting system and has no practical path to reading from S3 or Athena directly.

Also, @LorenzoAWS said that "[Athena] will give better performance [over a SQL database]". I don't think this is necessarily true in all cases, because it depends on how we define performance and what type of SQL database we are comparing against (e.g. OLTP like Postgres? A distributed OLAP warehouse like Amazon Redshift or Snowflake?). For example:

  1. If performance is defined as query throughput (time to complete) for a single OLAP query over a very large CUR data set, then sure - I imagine that Athena could outperform an OLTP SQL database because under the hood it's a distributed Presto cluster. However, if comparing to modern distributed OLAP warehouses like Redshift, I'm not sure who would win. Each Redshift node (typically) reads data into memory from disks locally attached to the node, which I imagine may be faster than Athena reading from S3 over HTTPS.

  2. If your use case calls for a high volume of concurrent queries throughout the day, you might not care about the raw speed of a single query; instead, good performance might mean "no user waits longer than 3 seconds for results". By default, Athena supports 20-25 concurrent queries depending on the region, and while this is a soft limit that can be raised, I'm guessing it still won't be as high as, say, an RDS database or a Redshift cluster with concurrency scaling enabled.

Anyway, regardless of your reason, if you want to get the CUR into a SQL database, I want to share a suggestion that builds on Lorenzo's answer and can eliminate the impact of some (but not all) schema changes to the CUR.

Create a staging view / materialized view

You can normally configure the CUR to deliver to S3 in either CSV (with optional ZIP/GZIP compression) or Parquet format. However, if you choose to include the optional "data integration", the CUR makes the format decision for you (Athena = Parquet; Redshift & QuickSight = CSV+GZIP).

Anyway, if your CUR is delivered as CSV and you are loading these files directly into a SQL database, then you're correct that any column addition, deletion, rename, or re-ordering would break many traditional ETL strategies.

My recommendation is that you configure your CUR to deliver with the "Athena integration" in Parquet format. Then, [use CREATE TABLE AS](https://docs.aws.amazon.com/athena/latest/ug/create-view.html) (aka CTAS) to create a materialized view (new CSV file(s) in S3) that contains only the columns you care about. It's this new "staging" file, rather than the files originally delivered by the CUR, that you would then import into your SQL database. The secret to this approach is that your CTAS query will not fail if new Parquet files have unexpected column additions, removals, or re-ordering.
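Here is a minimal sketch of what that CTAS query could look like. The database/table names, S3 location, and selected columns are all placeholders; substitute your own, and note that the target table must not already exist (drop it or write to a new location on each run).

```sql
-- Minimal sketch: materialize a fixed-column "staging" copy of the CUR
-- as CSV-style text files in S3. All names and locations are placeholders.
CREATE TABLE cur_db.cur_staging
WITH (
    format            = 'TEXTFILE',  -- plain-text output files
    field_delimiter   = ',',         -- comma-separated, loadable via COPY/bulk insert
    external_location = 's3://my-staging-bucket/cur-staging/'
) AS
SELECT
    line_item_usage_account_id,
    line_item_product_code,
    line_item_usage_start_date,
    line_item_unblended_cost
FROM cur_db.cur_table
WHERE year = '2023' AND month = '5';  -- limit to the partition(s) you're loading
```

Because the SELECT names an explicit, fixed column list, the staging output keeps the same shape no matter what extra columns appear in (or disappear from) future CUR deliveries.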

There are still two small potential issues that might arise with subsequent Parquet files:

  1. By default, Athena matches columns by name, so a renamed column will be seen by Athena as a brand-new column. This means that the old column name in your SELECT will have null values for rows in the new file (one possible mitigation is sketched after this list).

  2. If a column data type changes in a new Parquet file, I believe (but can't remember for certain) that the query will succeed but show the column as null for the affected rows. I think that solving this depends on whether you are creating and loading incremental staging files for each new CUR report versus creating a "rolling" staging table (e.g. last X days).
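For the rename case (issue 1), one possible mitigation is to coalesce the old and new column names in your staging SELECT, so that whichever one a given file actually contains wins. A minimal sketch, where both column names are hypothetical; this assumes both names exist in the table schema (which the CUR's crawler would typically ensure after a rename):

```sql
-- Minimal sketch: survive a hypothetical column rename by selecting
-- whichever of the two columns is non-null for each row.
-- old_column_name and new_column_name are made-up placeholders.
SELECT
    line_item_usage_account_id,
    COALESCE(old_column_name, new_column_name) AS stable_column_name
FROM cur_db.cur_table;
```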

My suspicion is that it's unlikely AWS would rename a column or change a data type in the CUR, so hopefully neither of the above will ever come up.

Hope this helps!

~Mat

mwrb
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
2

You can automate this by following the guide outlined in the Well-Architected Labs: https://wellarchitectedlabs.com/cost/300_labs/300_automated_cur_updates_and_ingestion/ (repo: https://github.com/awslabs/aws-well-architected-labs). It walks you through setting up CUR delivery and automated ingestion into Athena so you can report off of it. You can then tie it into the CUDOS solution (https://github.com/aws-samples/aws-cudos-framework-deployment) to easily report on the data. The labs outline the steps to set up the automated ingestion as well as the reporting and standard dashboards in QuickSight using CUDOS.

AWS
answered 2 years ago
1

If you do automate this with Athena (which, in my opinion, is the better route), there is a CloudFormation template that provisions all the resources for you, so you don't need to set things up manually with the Well-Architected Lab mentioned above. The documentation is located here:

https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html

Once you have it set up, there are also example queries for your CUR data in this Well-Architected Labs link:

https://wellarchitectedlabs.com/cost/300_labs/300_cur_queries/

-cs

AWS
answered 2 years ago
0

As others suggested, it would be cost-effective to store the CUR reports on S3 and apply schema-on-read; in other words, query them via Athena or Redshift Spectrum. Please refer to this blog: "How Aruba Networks built a cost analysis solution using AWS Glue, Amazon Redshift, and Amazon QuickSight".
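As a minimal sketch of the Redshift Spectrum route (the schema/database names, IAM role ARN, and region are all placeholders), you can map the Glue Data Catalog database that holds the CUR table into Redshift as an external schema and query it in place:

```sql
-- Minimal sketch: expose the CUR's Glue Data Catalog database to
-- Redshift as an external schema; all identifiers are placeholders.
CREATE EXTERNAL SCHEMA cur_spectrum
FROM DATA CATALOG
DATABASE 'cur_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
REGION 'us-east-1';

-- Then query the CUR directly from S3, schema-on-read:
SELECT line_item_product_code,
       SUM(line_item_unblended_cost) AS unblended_cost
FROM cur_spectrum.cur_table
GROUP BY line_item_product_code;
```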

AWS
Kunal_G
answered 2 years ago
