- Newest
- Most votes
- Most comments
I assume your final objective is to then be able to query your CUR. If that's the case, instead of having to do some extra transformation, data migration, and also account for the cost of running a database and holding the CUR data in it, a better approach is to use Amazon Athena, which lets you query the CUR directly in S3, not to mention you will get a better performance. If that's something that appeals you, here (1) is how it's done.
References
(1) https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html
I agree with @LorenzoAWS that Athena is one good option to run ad-hoc queries against your CUR.
Though, there are cases where a relational SQL database might still be preferred or needed. For example, one might have an existing SQL data warehouse that integrates with your financial reporting system and doesn't have a practical path to reading from S3 or Athena directly.
Also, @LorenzoAWS also said that "[Athena] will give better performance [over a SQL database]". I don't think this is necessarily true in all cases because it depends on how we define performance and what type of SQL database we are comparing to (e.g. OLTP like Postgres? Distributed OLAP warehouse like Amazon Redshift or Snowflake?). For example:
-
If performance is defined as query throughput (time to complete) for a single OLAP query over a very large CUR data set, then sure - I imagine that Athena could outperform an OLTP SQL database because under the hood its a distributed Presto cluster. However, if comparing to modern distributed OLAP warehouses like Redshift, I'm not sure who win. Each Redshift node (typically) reads data into memory from disks locally attached to the node which I imagine may be faster than Athena reading from S3 over HTTPS.
-
If your use case calls for a high volume of concurrent queries throughout the day, you might not care about the raw speed of a single query and instead good performance might mean "no user waits longer than 3 seconds for results". By default, Athena supports 20-25 concurrent queries depending on the region, and while this is a soft limit that can be raised, I'm guessing that it still won't be as high as say an RDS database or Redshift cluster with concurrency scaling enabled.
Anyway, regardless of reason, if you want to get the CUR into a SQL database, I want to share a suggestion that builds on Lorenzo's answer and can eliminate the impact of some (but not all) schema changes to the CUR.
Create a staging view / materialized view
You can normally configure the CUR to deliver to S3 in either CSV (with optional zip/gzip) or Parquet format. Though, if you choose to include the optional "data integration" the CUR will make the decision for you (Athena = Parquet, Redshift & QuickSight = CSV+GZIP).
Anyway, if your CUR is delivered as CSV and you are loading these files directly into a SQL database, then you're correct in that any column addition, delete, rename, or re-ordering would break many traditional ETL strategies.
My recommendation is that you configure your CUR to deliver with "Athena integration" in Parquet format. Then, [use CREATE TABLE AS] (https://docs.aws.amazon.com/athena/latest/ug/create-view.html) (aka CTAS) to create a materialized view (new CSV file(s) in S3) that only contains the columns you care about. It's this new "staging" file, rather than the files originally delivered by CUR, that you would then import to your SQL database. The secret to this approach is that your CTAS query will not fail if new Parquet files have unexpected column additions, removals, or re-ordering.
There are still two small potential issues that might arise with subsequent Parquet files:
-
By default, Athena matches columns by name and thus a renamed column will be seen as a new column to Athena. This means that the old column name in your
SELECT
will have null values for rows in the new file. -
If a column data type changes in a new Parquet file, I believe (but can't remember for certain) that the query will succeed but show the column as null for the affected rows. I think that solving this depends on whether you are creating and loading incremental staging files for each new CUR report versus creating a "rolling" staging table (e.g. last X days).
My suspicion is that its unlikely AWS would rename a column or change a data type in the CUR, so hopefully neither of the above will ever come up.
Hope this helps!
~Mat
You can automate this following the guide that is outlined in the Well Architected Labs, https://wellarchitectedlabs.com/cost/300_labs/300_automated_cur_updates_and_ingestion/ (Repo https://github.com/awslabs/aws-well-architected-labs). It allows you to setup that delivery and automation to Athena to then be able to report off of it. You can then tie it into the CUDOS (https://github.com/aws-samples/aws-cudos-framework-deployment) solution that was published to be able to easily report off of it. The Labs outline the steps to setup the automated ingestion and also the reporting and standard dashboards in Quicksight using CUDOS.
If you do automate this with Athena (which in my opinion is the better route) there is a CloudFormation template which provisions all the resources for you. You don't need to do it manually with the Well Architected Lab mentioned above. The documentation is located here:
https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html
Once you have it setup, there are also example queries to query your CUR data located in this Well Architected Labs link:
https://wellarchitectedlabs.com/cost/300_labs/300_cur_queries/
-cs
As others suggested, it would be cost effective to store the CUR reports on S3 and apply schema on read. In other words query query it via Athena or Redshift Spectrum. Please refer to this blog about How Aruba Networks built a cost analysis solution using AWS Glue, Amazon Redshift, and Amazon QuickSight.
https://docs.aws.amazon.com/cur/latest/userguide/dataexports-create-standard.html is a new feature that allows you to do this.
Relevant content
- asked 3 years ago
- AWS OFFICIALUpdated 9 months ago
- AWS OFFICIALUpdated 6 months ago
- AWS OFFICIALUpdated a month ago
- AWS OFFICIALUpdated 2 years ago