How to limit the number of columns in a CUR report

0

I am using AWS CUR reports to analyze my billing and usage data. I want to copy my CUR report into a SQL table, but because the columns in the CUR report don't remain constant, I am unable to copy it into the table without modifying the table columns each time they change. Can anybody suggest a way to limit or restrict the columns in the CUR report, or an alternate way to copy the data into a SQL table?

Asked 2 years ago · Viewed 682 times
6 Answers
3

I assume your final objective is to then be able to query your CUR. If that's the case, instead of doing extra transformation and data migration, and paying the cost of running a database to hold the CUR data, a better approach is to use Amazon Athena, which lets you query the CUR directly in S3; not to mention you will likely get better performance. If that's something that appeals to you, here (1) is how it's done.
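For illustration, a query over the CUR via Athena can look like the sketch below. The database and table names (athenacurcfn_my_report.my_report) and the year/month partition values are assumptions; substitute whatever the integration created for you.

```sql
-- Monthly unblended cost per service from the CUR.
-- Database/table names and partition values below are assumptions.
SELECT
    line_item_product_code        AS service,
    SUM(line_item_unblended_cost) AS unblended_cost
FROM athenacurcfn_my_report.my_report
WHERE year = '2023'
  AND month = '6'
GROUP BY line_item_product_code
ORDER BY unblended_cost DESC;
```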

References

(1) https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html

AWS
Answered 2 years ago
Expert
Reviewed 2 years ago
2

I agree with @LorenzoAWS that Athena is one good option to run ad-hoc queries against your CUR.

Though, there are cases where a relational SQL database might still be preferred or needed. For example, one might have an existing SQL data warehouse that integrates with your financial reporting system and doesn't have a practical path to reading from S3 or Athena directly.

@LorenzoAWS also said that "[Athena] will give better performance [over a SQL database]". I don't think this is necessarily true in all cases, because it depends on how we define performance and what type of SQL database we are comparing against (e.g. OLTP like Postgres? A distributed OLAP warehouse like Amazon Redshift or Snowflake?). For example:

  1. If performance is defined as query throughput (time to complete) for a single OLAP query over a very large CUR data set, then sure - I imagine that Athena could outperform an OLTP SQL database because under the hood it's a distributed Presto cluster. However, if comparing to modern distributed OLAP warehouses like Redshift, I'm not sure who would win. Each Redshift node (typically) reads data into memory from disks locally attached to the node, which I imagine may be faster than Athena reading from S3 over HTTPS.

  2. If your use case calls for a high volume of concurrent queries throughout the day, you might not care about the raw speed of a single query; instead, good performance might mean "no user waits longer than 3 seconds for results". By default, Athena supports 20-25 concurrent queries depending on the region, and while this is a soft limit that can be raised, I'm guessing it still won't be as high as, say, an RDS database or a Redshift cluster with concurrency scaling enabled.

Anyway, regardless of the reason, if you want to get the CUR into a SQL database, I want to share a suggestion that builds on Lorenzo's answer and can eliminate the impact of some (but not all) schema changes to the CUR.

Create a staging view / materialized view

You can normally configure the CUR to be delivered to S3 in either CSV (with optional zip/gzip compression) or Parquet format. However, if you choose to include one of the optional "data integrations", the CUR makes the decision for you (Athena = Parquet; Redshift & QuickSight = CSV+GZIP).

Anyway, if your CUR is delivered as CSV and you are loading these files directly into a SQL database, then you're correct that any column addition, deletion, rename, or re-ordering would break many traditional ETL strategies.

My recommendation is that you configure your CUR to deliver with "Athena integration" in Parquet format. Then, [use CREATE TABLE AS](https://docs.aws.amazon.com/athena/latest/ug/create-view.html) (aka CTAS) to create a materialized view (new CSV file(s) in S3) that only contains the columns you care about. It's this new "staging" file, rather than the files originally delivered by CUR, that you would then import into your SQL database. The secret to this approach is that your CTAS query will not fail if new Parquet files have unexpected column additions, removals, or re-orderings.
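A minimal sketch of such a CTAS statement, assuming a CUR table named athenacurcfn_my_report.my_report, an output bucket of your choosing, and a handful of columns you care about (all of these names are assumptions):

```sql
-- Materialize a fixed-schema staging copy of the CUR containing only the
-- columns we care about, written as CSV to S3. All names are assumptions.
CREATE TABLE cur_staging
WITH (
    format            = 'TEXTFILE',
    field_delimiter   = ',',
    external_location = 's3://my-bucket/cur-staging/'
) AS
SELECT
    line_item_usage_start_date,
    line_item_usage_account_id,
    line_item_product_code,
    line_item_usage_type,
    line_item_unblended_cost
FROM athenacurcfn_my_report.my_report;
```

Because the SELECT names its columns explicitly, extra or re-ordered columns in future Parquet files simply don't show up in the staging output.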

There are still two small potential issues that might arise with subsequent Parquet files:

  1. By default, Athena matches columns by name, and thus a renamed column will be seen as a new column by Athena. This means that the old column name in your SELECT will have null values for rows in the new file.

  2. If a column data type changes in a new Parquet file, I believe (but can't remember for certain) that the query will succeed but show the column as null for the affected rows. I think that solving this depends on whether you are creating and loading incremental staging files for each new CUR report versus creating a "rolling" staging table (e.g. last X days). One way to soften the blow is to cast defensively in the staging SELECT, as in the sketch after this list.
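Here is what that defensive casting could look like; it's only a sketch with made-up names. TRY_CAST returns null instead of failing when a value can't be converted:

```sql
-- Cast defensively so a data type drift in a future Parquet file degrades
-- to nulls rather than a hard failure. Names below are assumptions.
SELECT
    TRY_CAST(line_item_unblended_cost AS DECIMAL(20, 10)) AS unblended_cost,
    COALESCE(line_item_usage_account_id, 'unknown')       AS usage_account_id
FROM athenacurcfn_my_report.my_report;
```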

My suspicion is that it's unlikely AWS would rename a column or change a data type in the CUR, so hopefully neither of the above will ever come up.
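Once the staging files exist, the load itself is routine. For example, if your destination happens to be Amazon Redshift, it could be a plain COPY (the table name, S3 prefix, and IAM role ARN below are all assumptions):

```sql
-- Load the fixed-schema staging CSVs into an existing Redshift table.
-- Table name, S3 prefix, and IAM role ARN are assumptions.
COPY cur_staging
FROM 's3://my-bucket/cur-staging/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV;
```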

Hope this helps!

~Mat

mwrb
Answered 2 years ago
AWS
Expert
Reviewed 2 years ago
2

You can automate this by following the guide outlined in the Well-Architected Labs: https://wellarchitectedlabs.com/cost/300_labs/300_automated_cur_updates_and_ingestion/ (repo: https://github.com/awslabs/aws-well-architected-labs). It shows you how to set up the delivery and automated ingestion into Athena so you can report off of it. You can then tie it into the CUDOS solution (https://github.com/aws-samples/aws-cudos-framework-deployment) to report off of it easily. The labs outline the steps to set up the automated ingestion as well as the reporting and standard dashboards in QuickSight using CUDOS.

AWS
Answered 2 years ago
1

If you do automate this with Athena (which, in my opinion, is the better route), there is a CloudFormation template that provisions all the resources for you, so you don't need to do it manually with the Well-Architected Lab mentioned above. The documentation is located here:

https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html

Once you have it set up, there are also example queries for your CUR data in this Well-Architected Labs link:

https://wellarchitectedlabs.com/cost/300_labs/300_cur_queries/
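In the spirit of that query library, here is the shape of a typical CUR query; the database/table names and partition values are assumptions:

```sql
-- Daily unblended cost per linked account for a given month.
-- Database/table names and partition values are assumptions.
SELECT
    line_item_usage_account_id       AS account_id,
    DATE(line_item_usage_start_date) AS usage_date,
    SUM(line_item_unblended_cost)    AS unblended_cost
FROM athenacurcfn_my_report.my_report
WHERE year = '2023'
  AND month = '6'
GROUP BY line_item_usage_account_id, DATE(line_item_usage_start_date)
ORDER BY usage_date, unblended_cost DESC;
```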

-cs

AWS
Answered 2 years ago
0

As others have suggested, it would be cost-effective to store the CUR reports in S3 and apply schema on read; in other words, query them via Athena or Redshift Spectrum. Please refer to this blog: How Aruba Networks built a cost analysis solution using AWS Glue, Amazon Redshift, and Amazon QuickSight.
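With Redshift Spectrum, schema on read looks roughly like the sketch below: you map the Glue Data Catalog database that holds the CUR table into Redshift as an external schema and query the files in place. The database name and IAM role ARN are assumptions:

```sql
-- Expose the Glue catalog database containing the CUR table to Redshift
-- as an external schema (schema on read); names/ARNs are assumptions.
CREATE EXTERNAL SCHEMA spectrum_cur
FROM DATA CATALOG
DATABASE 'athenacurcfn_my_report'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

-- The CUR files stay in S3; Redshift reads them at query time.
SELECT line_item_product_code, SUM(line_item_unblended_cost) AS cost
FROM spectrum_cur.my_report
WHERE year = '2023' AND month = '6'
GROUP BY line_item_product_code;
```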

AWS
Kunal_G
Answered 2 years ago
0
AWS
AWS TAM
Answered a month ago
