Why does my Amazon Redshift query keep exceeding the WLM timeout that I set?
4 minute read
I set a workload management (WLM) timeout for an Amazon Redshift query, but the query keeps running after this period expires. Why is this happening?
A WLM timeout applies to queries only during the query running phase. If WLM doesn’t terminate a query when expected, it’s usually because the query spent time in stages other than the execution stage. For example, the query might wait to be parsed or rewritten, wait on a lock, wait for a spot in the WLM queue, hit the return stage, or hop to another queue.
When querying STV_RECENTS, starttime is the time the query entered the cluster, not the time that the query begins to run. When the query is in the Running state in STV_RECENTS, it is live in the system. However, the query doesn't use compute node resources until it enters STV_INFLIGHT status. For more information about query planning, see Query planning and execution workflow.
To view the status of a running query, query STV_INFLIGHT instead of STV_RECENTS:
select * from STV_INFLIGHT where query = your_query_id;
Use this query for more information about query stages:
select * from SVL_QUERY_REPORT where query = your_query_id ORDER BY segment, step, slice;
Use the STV_EXEC_STATE table for the current state of any queries that are actively running on compute nodes:
select * from STV_EXEC_STATE where query = your_query_id ORDER BY segment, step, slice;
Here are some common reasons why a query might appear to run longer than the WLM timeout period:
The query is in the "return" phase
There are two "return" steps. Check STV_EXEC_STATE to see if the query has entered one of these return phases:
The return to the leader node from the compute nodes
If a read query reaches the timeout limit for its current WLM queue, or if there's a query monitoring rule that specifies a hop action, then the query is pushed to the next WLM queue. To confirm whether the query hopped to the next queue:
If an Amazon Redshift server has a problem communicating with your client, then the server might get stuck in the "return to client" state. Check for conflicts with networking components, such as inbound on-premises firewall settings, outbound security group rules, or outbound network access control list (network ACL) rules. For more information, see Connecting from outside of Amazon EC2 —firewall timeout issue.
An issue with the cluster
Issues on the cluster itself, such as hardware issues, might cause the query to freeze. When this happens, the cluster is in "hardware-failure" status. To recover a single-node cluster, restore a snapshot. In multi-node clusters, failed nodes are automatically replaced.