How do I identify and troubleshoot performance issues and slow-running queries in my Amazon RDS for PostgreSQL or Aurora PostgreSQL-Compatible DB instance?


My Amazon Relational Database Service (Amazon RDS) for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition DB instance is slow. I want to identify and troubleshoot the slow-running queries.

Resolution

Undersized hardware, changes in workload, increased traffic, memory issues, or unoptimized queries can affect DB instance performance. To resolve performance issues, take the following actions.

Identify the cause

Check CloudWatch metrics

To identify performance bottlenecks that are caused by insufficient resources, monitor the CPUUtilization, FreeableMemory, and SwapUsage Amazon CloudWatch metrics.

When CPU utilization is high, the active workload on the DB instance requires more CPU resources. When available memory is low for the workload, frequent swapping occurs, which results in high memory utilization and high swap space usage. Long-running queries, sudden increases in traffic, or a large number of idle connections can cause high CPU utilization and high memory usage.

To view the active queries with runtime, run the following command:

SELECT pid, usename, age(now(),xact_start) query_time, query FROM pg_stat_activity WHERE state='active';

To view the idle connections in the database, run the following command:

SELECT count(*) FROM pg_stat_activity WHERE state='idle';
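
To list the individual idle connections with their process IDs and how long they have been idle, you can run a query similar to the following (the columns shown are illustrative):

SELECT pid, usename, state, now() - state_change AS idle_duration FROM pg_stat_activity WHERE state = 'idle' ORDER BY idle_duration DESC;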

To terminate idle connections, run the following command:

SELECT pg_terminate_backend(example-pid);

Note: Replace example-pid with the idle connection's process ID.

To verify that your DB instance reaches the expected network throughput, check the NetworkReceiveThroughput and NetworkTransmitThroughput metrics. Undersized or unoptimized instance classes might limit network and Amazon Elastic Block Store (Amazon EBS) throughput and result in a slow instance. Low network throughput can result in slow responses for all application requests, regardless of database performance.

To evaluate the I/O performance, check the ReadIOPS, WriteIOPS, ReadLatency, WriteLatency, ReadThroughput, WriteThroughput, and DiskQueueDepth metrics. For more information, see How do I troubleshoot the latency of Amazon EBS volumes caused by an IOPS bottleneck in my Amazon RDS instance?

Use Enhanced Monitoring

Use Enhanced Monitoring to view metrics at the operating system (OS) level and list the top 100 processes that use high CPU and memory. Activate Enhanced Monitoring with Granularity set to 1 second to identify intermittent performance issues on your DB instance.

Evaluate the available OS metrics to diagnose performance issues that are related to CPU, workload, I/O, memory, and network. From the process list, identify processes with high values for CPU% or Mem%.

Example:

NAME: postgres: postgres postgres 178.2.0.44(42303) SELECT [10322]
VIRT: 250.66 MB
RES: 27.7 MB
CPU%: 85.93
MEM%: 2.21
VMLIMIT: unlimited

Connect to the database, and then run the following query to find the connection with high CPU in the database:

SELECT * FROM pg_stat_activity WHERE pid = 10322;

Note: Replace 10322 with the connection's process ID.

Check Performance Insights metrics

Use Performance Insights to evaluate database workloads by waits, SQL, hosts, or users. You can also view database-level and SQL-level metrics.

Use the Top SQL tab on the Performance Insights dashboard to view the SQL statements that contribute the most to the DB load. If the DB load or load by waits, measured in average active sessions (AAS), is higher than the Max vCPU line, then the workload is throttled on the DB instance class.

Use the average latency per call in SQL statistics to view the average running time of a query. Top SQL is based on total running time. As a result, the SQL with the highest running time is often different from the SQL that's the top DB load contributor.

Check database statistics

To evaluate database performance in PostgreSQL, check the data distribution statistics, extended statistics, and monitoring statistics. For information about the statistics, see Understanding statistics in PostgreSQL.
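
For example, to review the data distribution statistics that the query planner keeps for a table's columns, you can query the pg_stats view. The table name in this sketch is a placeholder:

SELECT attname, null_frac, n_distinct, most_common_vals FROM pg_stats WHERE schemaname = 'public' AND tablename = 'example_table';

Note: Replace example_table with your table's name.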

Check native database tools

To identify slow queries, use the pgbadger tool on the GitHub website. For more information, see Optimizing and tuning queries in Amazon RDS for PostgreSQL based on native and external tools.

Optimize performance

Tune memory settings

The shared_buffers parameter determines how much memory PostgreSQL dedicates to its shared buffer cache. Set shared_buffers to a value that balances caching for your workload against the memory that's left for connections and the OS, and that helps improve query performance.

The work_mem and maintenance_work_mem parameters define the amount of memory that's used for backend processes. For more information, see 20.4 Resource consumption on the PostgreSQL website. If you frequently experience high memory usage on the DB instance, then lower the parameter values in the custom parameter group that's attached to your instance.
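
To check the values that these memory parameters currently use on your DB instance, you can query the pg_settings view:

SELECT name, setting, unit FROM pg_settings WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem');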

Use Aurora PostgreSQL-Compatible query plan management

Use Aurora PostgreSQL-Compatible query plan management to control how and when query run plans change. For more information, see Best practices for Aurora PostgreSQL query plan management.
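
As a minimal sketch, after you add apg_plan_mgmt to shared_preload_libraries in your DB cluster parameter group and reboot the instance, you can create the extension and review the plans that query plan management has captured. The capture and enforcement behavior depends on the apg_plan_mgmt parameters in your parameter group:

CREATE EXTENSION IF NOT EXISTS apg_plan_mgmt;

SELECT * FROM apg_plan_mgmt.dba_plans;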

Troubleshoot slow-running queries

Infrastructure issues, unoptimized query planning, or high overall resource usage cause slow-running queries. The PostgreSQL query planner uses table statistics to create query plans. Schema changes and old statistics might affect the plans. Overloaded tables and indexes can also result in slow-running queries.

When a table reaches the dead tuple threshold, the autovacuum daemon creates autovacuum worker processes that remove dead tuples from the table. The autovacuum daemon also runs the ANALYZE operation that refreshes the table's statistics.

Run the following query to check for dead tuples and autovacuum or vacuum operations and autoanalyze or analyze runs:

SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autoanalyze, last_analyze, last_autovacuum, last_vacuum,
autovacuum_count + vacuum_count AS vacuum_count, analyze_count + autoanalyze_count AS analyze_count
FROM pg_stat_user_tables
ORDER BY 5 DESC;
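
Autovacuum vacuums a table when n_dead_tup exceeds approximately autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * the table's row count. The following query compares each table's dead tuples against that threshold. This sketch uses the instance-level settings and n_live_tup as an approximation of the row count, and it ignores any per-table storage parameters:

SELECT relname, n_dead_tup, current_setting('autovacuum_vacuum_threshold')::int + current_setting('autovacuum_vacuum_scale_factor')::numeric * n_live_tup AS approx_threshold FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;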

Use the pg_stat_activity view to find data that's related to current activities, such as a backend process ID or query. To find long-running queries, run the following query:

SELECT pid, datname, query,
extract(epoch from now()) - extract(epoch from xact_start) AS duration,
CASE WHEN wait_event IS NULL THEN 'CPU'
ELSE wait_event_type || ':' || wait_event END AS wait
FROM pg_stat_activity
WHERE query != current_query() AND xact_start IS NOT NULL ORDER BY 4 DESC;

Queries that are waiting for locks might be slow. To check whether a query is waiting for locks, run the following query:

SELECT pid, virtualtransaction AS vxid, locktype AS lock_type, mode AS lock_mode, granted,fastpath,CASE
WHEN virtualxid IS NOT NULL AND transactionid IS NOT NULL
THEN virtualxid || ' ' || transactionid
WHEN virtualxid::text IS NOT NULL
THEN virtualxid
ELSE transactionid::text
END AS xid_lock, relname, page, tuple, classid, objid, objsubid
FROM pg_locks LEFT OUTER JOIN pg_class ON (pg_locks.relation = pg_class.oid)
WHERE -- do not show our view's locks
pid != pg_backend_pid() AND
virtualtransaction IS DISTINCT FROM virtualxid
ORDER BY 1, 2, 5 DESC, 6, 3, 4, 7;
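
To see which sessions are blocked and which process IDs hold the blocking locks, you can also use the pg_blocking_pids function (PostgreSQL 9.6 and later):

SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, wait_event, query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;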

Use pg_stat_statements to view the statistics of queries. Before you create the pg_stat_statements extension, add the pg_stat_statements entry to shared_preload_libraries, and then reboot the DB instance so that the change takes effect. To create the pg_stat_statements extension within the database, run the following query:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Note: You can modify parameters for pg_stat_statements only when a custom parameter group is attached to your DB instance.

To identify the SQL queries that affect the performance of your DB instance, run the following queries.

PostgreSQL versions 12 and earlier:

SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY 3 DESC;

PostgreSQL versions 13 and later:

SELECT query, calls, total_plan_time+total_exec_time AS total_time, mean_plan_time + mean_exec_time AS mean_time FROM pg_stat_statements ORDER BY 3 DESC;
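
To instead rank statements by average latency per call, as described in the Performance Insights SQL statistics, order by the mean times. The following example applies to PostgreSQL versions 13 and later:

SELECT query, calls, mean_plan_time + mean_exec_time AS mean_time FROM pg_stat_statements ORDER BY 3 DESC LIMIT 10;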

To find queries with a low buffer cache hit ratio, run the following queries.

PostgreSQL versions 12 and earlier:

SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC
LIMIT 10;

PostgreSQL versions 13 and later:

SELECT query, calls, total_plan_time + total_exec_time AS total_time, rows, 100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY 5 ASC
LIMIT 10;
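
After you tune a query, you can reset the collected statistics so that later measurements reflect only the new workload:

SELECT pg_stat_statements_reset();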

To find long-running queries or query plans in database error logs, configure the log_min_duration_statement parameter for your DB instance, and then use the auto_explain module.

You can also use the explain and explain analyze commands to get the query plan. Use the auto_explain module or the explain commands to identify how to tune your queries. For more information, see 14.1 Using EXPLAIN and F3. auto_explain - log execution plans of slow queries on the PostgreSQL website.
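
For example, to view the actual run time and buffer usage for a statement, run explain analyze with the buffers option. The table and filter in this sketch are placeholders:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM example_table WHERE example_column = 'example_value';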

If you optimized your system and you still experience performance issues, then it's a best practice to scale up the DB instance class. When you scale up the DB instance, you allocate more compute and memory resources.

Related information

How can I troubleshoot high CPU utilization for Amazon RDS or Amazon Aurora PostgreSQL?

Working with parameters on your RDS for PostgreSQL DB instance

Why is my Amazon RDS DB instance using swap memory when I have sufficient memory?