Why am I seeing high CPU usage on my Amazon Redshift leader node?

5 minute read
0

My Amazon Redshift cluster's leader node is experiencing high CPU utilization. Why is this happening?

Short description

Your Amazon Redshift cluster's leader node parses and develops execution plans to carry out database operations. The leader node also performs final processing of queries and merging or sorting of data before returning that data to the client. Depending on how complex or resource-intensive the database operations are, the CPU utilization can spike for your cluster's leader node.

In particular, your leader node's CPU utilization can spike for the following reasons:

  • Increase in database connections
  • Query compilation and recompilation
  • High number of concurrent connections
  • High number of concurrent queries running in WLM
  • Leader node-only functions and catalog queries

Note: You can't check for specific processes that occupy your leader node. Use the STV_RECENTS table to check for the queries that are running at a particular time.

Resolution

Increase in database connections

The client server communicates with the Amazon Redshift cluster through the leader node. If there are a growing number of database connections, the CPU utilization increases to process those connections. Check Amazon CloudWatch metrics to make sure the DatabaseConnections limit hasn't been exceeded.

Query compilation and recompilation

Amazon Redshift generates and compiles code for each query execution plan. Query compilation and recompilation are resource-intensive operations, and this can result in high CPU usage of the leader node. However, CPU performance will return to normal when the query compilation or recompilation operations are complete.

Note also that Amazon Redshift caches compiled code. When a query is submitted, Amazon Redshift reuses whatever segments are available while the remaining segments are recompiled. As a result, this process can contribute to high CPU usage of the leader node.

Note: After an Amazon Redshift cluster reboots, the compiled segment stills persists, even though the results cache gets cleared. Amazon Redshift doesn't run the query if your query was previously cached. All caches are removed when a patch is applied.

To check the compilation time (in seconds) and segment execution location for each query segment, use the SVL_COMPILE system view:

select userid,xid,pid,query,segment,locus,starttime, endtime,
datediff(second,starttime,endtime) as TimetoCompile,compile from svl_compile;

High number of concurrent connections

More connections can lead to a higher concurrency and an increase in transactions of your Amazon Redshift cluster. The increase in transactions can result in high CPU utilization of the leader node.

To check for concurrent connections, run the following query:

select s.process as process_id,
c.remotehost || ':' || c.remoteport as remote_address,
s.user_name as username,
s.starttime as session_start_time,
s.db_name,
i.starttime as current_query_time,
i.text as query 
from stv_sessions s
left join pg_user u on u.usename = s.user_name
left join stl_connection_log c
on c.pid = s.process
and c.event = 'authenticated'
left join stv_inflight i 
          on u.usesysid = i.userid
          and s.process = i.pid
where username <> 'rdsdb'
order by session_start_time desc;

Then, use PG_TERMINATE_BACKEND to close any active sessions.

High number of concurrent queries running in WLM

All client connections are processed through the leader node. Before returning data to the client server, Amazon Redshift's leader node parses, optimizes, and then compiles queries. The leader node also distributes tasks to compute nodes, performing final sorting or aggregation. With high query concurrency, CPU usage can increase at the leader node level. Additionally, some database operations can be applied only at the leader node level. For example, a query with a LIMIT clause might consume high CPU because the limit is applied to the leader node before data is redistributed.

To confirm whether there is correlation between the number of concurrent queries and CPU usage, first check the WLMRunningQueries and CPUutilization metrics in Amazon CloudWatch.

Then, check to see which queries are consuming high CPU:

SELECT userid, query, xid, aborted,
ROUND(query_cpu_time::decimal,3),
query_execution_time,
segment_execution_time,
substring(querytxt,1,250)
FROM stl_query
JOIN
(SELECT query,
query_cpu_time,
query_execution_time,
segment_execution_time
FROM svl_query_metrics_summary
ORDER BY 2 DESC) a USING (query)
WHERE userid>1
AND starttime BETWEEN '2019-12-02 22:00:00' and '2019-12-05 23:59:59'
ORDER BY 5 DESC;

Review the output to confirm which queries are processed by the leader node and any other outlier queries that increase CPU usage.

Note: It's a best practice to tune query performance for your queries. Consider increasing your leader node capacity and choosing large node types (rather than adding more compute nodes).

Leader node-only functions and catalog queries

Amazon Redshift implements certain SQL functions supported on the leader node. If there are complex queries with leader node functions and overloading catalog queries, then CPU utilization can spike on a leader node. For more information, see SQL functions supported on the leader node.

To identify steps referencing catalog tables (which are executed only on a leader node), check the EXPLAIN plan:

explain select * from pg_class;
                           QUERY PLAN                          
----------------------------------------------------------------
 LD Seq Scan on pg_class  (cost=0.00..24.57 rows=557 width=243)

Check for the LD prefix in your output. In this example, the LD prefix appears in "LD Seq Scan on pg_class (cost=0.00..24.57 rows=557 width=243)". The LD prefix indicates that a query is running exclusively on a leader node, which can cause a spike in your CPU usage.


AWS OFFICIAL
AWS OFFICIALUpdated a year ago