How to investigate an Aurora PostgreSQL cost increase using Performance Insights, Cost Explorer, and CloudWatch logs and metrics


The cost graph above shows spikes on 8th Oct, 14th Oct, and 7th Nov, due to Aurora:StorageIOUsage (highlighted in red).

We want to know what caused this cost increase by relating it to logs and metrics from CloudWatch and Performance Insights.

These graphs show the ReadIOPS of the blue instance-2 reader on 8th Oct, and of the orange instance-1 writer on 14th Oct and 7th Nov.

Is Aurora:StorageIOUsage related to these IOPS?

Why did the orange ReadIOPS suddenly change from a short spike (~5 minutes) on 6th Nov to extended high values (~14 hours) on 7th Nov?

# Find Postgres log lines reporting an elapsed time over 1000 seconds
fields @timestamp, @message, @logStream, @log
| parse @message "elapsed: * s" as elapsed_time
| filter elapsed_time > 1000
| sort @timestamp
| limit 10000

This query was run over the period 7th Oct to 8th Nov.

Is autovacuum one of the reasons for increased ReadIOPS costs?

What other ways are there to query CloudWatch logs or metrics to investigate the cost spikes?

The image above shows the Performance Insights graph.

Is the extra blue IO:DataFileRead related to Aurora:StorageIOUsage and the cost spike?

2 Answers

To investigate the cost increase of Aurora PostgreSQL using Performance Insights, Cost Explorer, and CloudWatch logs and metrics, we need to analyze several aspects:

  1. Aurora:StorageIOUsage and IOPS: Yes, Aurora:StorageIOUsage is directly related to the IOPS (Input/Output Operations Per Second) you're seeing in the graphs. It is the Cost Explorer usage type under which Aurora bills I/O requests against the cluster volume, covering both read and write operations; the cluster-level CloudWatch metrics VolumeReadIOPs and VolumeWriteIOPs track the same billed request counts. The cost spikes on October 8th, October 14th, and November 7th correspond to the increased ReadIOPS observed in the graphs for the reader and writer instances.
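
To sanity-check the scale of a spike, note that Aurora standard storage bills I/O per million requests (about $0.20 per million in many regions; confirm your region's rate). As a purely illustrative example, an extra 2,000 ReadIOPS sustained for 14 hours amounts to 2,000 × 3,600 × 14 ≈ 100 million I/O requests, or roughly $20 of Aurora:StorageIOUsage. Arithmetic like this, compared against each spike in Cost Explorer, tells you whether the IOPS in the graphs fully account for the charge.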

  2. Change in ReadIOPS pattern: The sudden change from a short spike on November 6th to extended high values on November 7th could be due to several factors:

  • A long-running query or batch job
  • Data replication or backup processes
  • Increased application load or user activity
  • Index rebuilding or database maintenance tasks

To pinpoint the exact cause, you'd need to correlate this with application logs and database query logs.
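
For example, assuming log_min_duration_statement is enabled so that Postgres writes "duration:" lines to the log, a CloudWatch Logs Insights query along these lines can surface slow statements in the 7th Nov window (the 60-second threshold and the limit are illustrative):

# Postgres statements that ran longer than 60 seconds
fields @timestamp, @message, @logStream
| filter @message like /duration:/
| parse @message "duration: * ms" as duration_ms
| filter duration_ms > 60000
| sort duration_ms desc
| limit 20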

  3. Autovacuum and increased ReadIOPS costs: Autovacuum can indeed contribute to increased ReadIOPS and, consequently, higher costs. The CloudWatch Logs query you've shown looks for operations that took more than 1000 seconds (about 16.7 minutes). If these long-running operations are autovacuum processes, they could be causing significant I/O activity, leading to increased costs.
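
To test this directly, you could narrow the search to autovacuum log lines and see which tables they touched. This is a sketch, assuming log_autovacuum_min_duration is set so that autovacuum runs are logged at all:

# Slowest logged autovacuum runs and the tables they vacuumed
fields @timestamp, @message
| filter @message like /automatic vacuum of table/
| parse @message /automatic vacuum of table "(?<table_name>[^"]+)"/
| parse @message "elapsed: * s" as elapsed_s
| sort elapsed_s desc
| limit 50

If the same tables appear repeatedly with long elapsed times around the spike dates, tuning their autovacuum settings (or scheduling manual vacuums off-peak) is a reasonable next step.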

  4. Additional ways to investigate cost spikes (see the example query after this list):

  • Analyze query performance using Performance Insights
  • Check for long-running queries in PostgreSQL logs
  • Monitor CPU utilization and memory usage
  • Examine network throughput metrics
  • Look at the number of client connections over time
  • Analyze I/O patterns for read, write, and metadata operations
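
As an example for the client-connections point, assuming log_connections is on so that each new connection is logged, connection churn can be charted straight from the Postgres logs (the 1-hour bin is arbitrary):

# New client connections per hour
filter @message like /connection received/
| stats count(*) as connections by bin(1h)
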
  5. Performance Insights and cost spike: The extra IO:DataFileRead wait activity shown in blue in the Performance Insights graph is likely related to Aurora:StorageIOUsage and the cost spike. IO:DataFileRead means sessions were waiting on reads from storage, i.e. page reads that missed the buffer cache, and those are exactly the read requests Aurora bills as storage I/O. Slicing database load by the SQL dimension in Performance Insights during the spike will show which statements drove those reads.

To further investigate and potentially mitigate these issues:

  • Review and optimize your database queries
  • Consider implementing connection pooling to reduce connection churn
  • Analyze and adjust your autovacuum settings if necessary
  • Monitor and tune your instance's resources (CPU, memory, network)
  • Consider scaling your instance size if the workload consistently exceeds current capacity

Remember to establish performance baselines and regularly monitor your Aurora PostgreSQL cluster to identify anomalies and address issues proactively.

Sources
Analyzing PostgreSQL logs using CloudWatch Logs Insights - Amazon Aurora
Using Amazon CloudWatch metrics to analyze resource usage for Aurora PostgreSQL - Amazon Aurora
Managing Aurora PostgreSQL connection churn with pooling - Amazon Aurora
Monitoring metrics in an Amazon Aurora cluster - Amazon Aurora

answered 2 months ago

For a production environment, beyond Cost Explorer, I recommend deploying the full set of dashboards in the Cloud Intelligence Dashboards framework.

I suggest deploying all three foundational dashboards (CUDOS, CID, and KPI), and also adding the Compute Optimizer Dashboard, the Trusted Advisor Organizational Dashboard, and the Cost Anomaly Dashboard. These are low-cost, low-effort, and high-value.

I think it's also worth the (minimal) effort to deploy the remaining advanced dashboards, at least the Graviton Savings Dashboard and the Extended Support - Cost Projection. I also suggest deploying the Health Events Dashboard; it provides useful information when reviewing and correlating a service event and documenting an RCA.

Thoughts? We all benefit from feedback! Tyia!

AWS
answered 2 months ago
