Redshift Spectrum select queries hanging for about 25 minutes


Good day,

I'm writing here because we have tried a lot of things and found no solution so far. We hope there is a solution out there, or that somebody can help us debug this.

We use Redshift with Spectrum and partitioned Parquet data on S3. We have only external Spectrum schemas, no local Redshift data. The client is Tableau Server 2020.4.1, but we observe the issue with desktop clients as well.

The issue is simple to describe. We click some random filter in a dashboard, and sometimes the query does not respond after a few seconds. It hangs and hangs, for up to 25 minutes (1,500 seconds), but then delivers a correct response. If I copy the exact same query into a DB tool, it responds within a few seconds, so it looks like a race condition. After a lot of manual testing we wrote a JMeter plan that executes some dashboard queries, and we can now reproduce the issue most of the time. Often there is only one hanging query, so I don't see where a lock could come from if there is only one. We also have no write operations (we use only Spectrum) that a lock could come from.
We are using two environments with different data sets (dev, live); both show the same issue.

The queries themselves look simple: a select on a data table with some left joins of meta tables, a group by, and an aggregate function. But it doesn't matter which query we use; after some JMeter iterations there are still one, two, or three hanging queries.

By default Tableau uses cursors for all queries, but the issue occurs even if we switch to normal queries.
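For anyone checking the same thing: a hedged sketch of how the cursor angle can be inspected, using the STV_ACTIVE_CURSORS system table (column names taken from the Redshift docs; adjust for your revision):

```sql
-- List currently open cursors and how long they have been alive.
-- Note: STV_ACTIVE_CURSORS shows only the current user's cursors
-- unless you query it as a superuser.
SELECT pid,
       xid,
       name,
       starttime,
       row_count,
       fetched_rows
FROM stv_active_cursors
ORDER BY starttime;
```

If a hanging Tableau session shows a cursor here with few fetched rows long after `starttime`, that points at cursor handling rather than the scan itself.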

So we have looked into many system tables for debugging purposes. For hanging queries there are entries in the STV_LOCKS table with the corresponding queries. STL_TR_CONFLICT and STL_ALERT_EVENT_LOG are empty.

SVV_TRANSACTIONS and pg_stat_activity also show the hanging queries, but give no hint why they are hanging.
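For reference, a sketch of the kind of lock inspection we ran, joining SVV_TRANSACTIONS with STV_RECENTS to see each session's lock requests next to its SQL text (column names per the Redshift docs; may need adjusting on other revisions):

```sql
-- Show lock requests per transaction and whether they were granted,
-- together with the SQL text and runtime of the owning session.
-- An ungranted lock (granted = false) held for a long duration would
-- identify the blocker; in our case nothing obvious showed up.
SELECT t.xid,
       t.pid,
       t.lock_mode,
       t.lockable_object_type,
       t.relation,
       t.granted,
       r.starttime,
       r.duration,
       TRIM(r.query) AS query_text
FROM svv_transactions t
JOIN stv_recents r ON r.pid = t.pid
WHERE r.status = 'Running'
ORDER BY r.duration DESC;
```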

If I look into SVL_S3QUERY_SUMMARY, I can see that all S3 queries respond quickly, so there are probably no issues on that side.
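A sketch of that check, per query (the `<query_id>` placeholder stands for the hanging query's ID from the system tables; column names per the SVL_S3QUERY_SUMMARY docs):

```sql
-- Per-segment Spectrum scan statistics for one query: if the S3 side
-- were the bottleneck, elapsed or max_request_duration would stand out.
SELECT query,
       segment,
       external_table_name,
       s3_scanned_rows,
       s3_scanned_bytes,
       elapsed,                -- microseconds
       avg_request_duration,
       max_request_duration
FROM svl_s3query_summary
WHERE query = <query_id>       -- replace with the hanging query's ID
ORDER BY segment;
```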

We don't know since when the issue occurs; we have been working with Redshift for around nine months without any problem. Possibly it came with one of the recent Redshift revisions (we use auto-update every Sunday).

Here are my questions:

Does somebody have the same issue?
What about the magic 25 minutes: is there a lock timeout or something like that?
Do you have hints on how we can debug where the lock comes from?
Is there a way to switch back to an old revision? I tried it once, which resulted in an unresponsive cluster with state "Hardware Failure".

Kind regards,
Frederik Schreiber

frecong
asked 3 years ago · 642 views
4 Answers

Hello,

I have the same problem with Redshift.

Last Wednesday (May 5, 2021) the cluster upgraded from revision 1.0.25813 to 1.0.26073, and the problem started to happen. Here it affects only one table in a specific external schema that we have. It's a nightmare, because we have some queues in RabbitMQ that query Redshift data, and the consumer dies because the query mysteriously hangs until it is aborted.

We are now discussing whether to revert the update or wait for a solution from AWS.

Hope the AWS devs find the solution quickly.

Edited by: kawe on May 11, 2021 5:00 AM

kawe
answered 3 years ago

Hi frecong and kawe,

Have you looked at your queues in Redshift to see whether you have a queue wait time issue? The queries your SQL clients are waiting for might just be waiting for a queue slot to execute. The Redshift documentation shows how to check this.
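To illustrate the check, a hedged sketch against STL_WLM_QUERY (column names per the Redshift docs; the microsecond unit is documented there):

```sql
-- Compare time spent waiting in the WLM queue vs. actually executing.
-- A total_queue_time close to the ~1,500-second hang, with a small
-- total_exec_time, would point to a queue-slot shortage rather than a lock.
SELECT query,
       service_class,
       total_queue_time / 1000000.0 AS queue_seconds,
       total_exec_time  / 1000000.0 AS exec_seconds
FROM stl_wlm_query
WHERE total_queue_time > 0
ORDER BY total_queue_time DESC
LIMIT 20;
```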

Regards,
-Kurt

klarson
answered 3 years ago

Since revision 1.0.26742 the problem seems to be solved. I cannot reproduce hanging queries with or without cursors. But I cannot find any changelog for that revision; magic things happen.
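In case it helps others tracking this: you can confirm which revision your cluster is currently on from SQL. The exact shape of the returned string is an assumption, but it includes the Redshift version:

```sql
-- Returns a version string that includes the current cluster revision,
-- e.g. "... Redshift 1.0.26742".
SELECT version();
```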

frecong
answered 3 years ago

Hi frecong,

It's worth noting that AWS does not publish a complete changelog for Redshift releases, as you may be accustomed to with internal code bases or open-source projects. What we get in the release announcements here in the forum, or on the doc page that points to the forum, are the release highlights that Redshift determines affect or improve the experience of most customers. So issue resolutions, however severe the issue may be, that affect only a narrow group of customers are often not openly reported or announced. Typically, if you're an Enterprise Support customer, you'll learn of those through your TAM or account team. If you are not an Enterprise Support customer but still have support case access and report the issue in a support case, you may learn of a fix through a post to the case by the support agent, provided the case is still active because the fix lands close enough in time to your last interaction with support. Otherwise, for the most part, you, the customer, are left to your own devices to figure out if and when your issue is addressed.

If it is important to you to know when an issue is fixed, it helps to have a reproducible test case that you can run periodically or on demand to identify when the issue is resolved. As another Redshift customer, I too find it frustrating at times to be left in the dark on an issue's status, even with Enterprise Support and a crack account team that stays on top of most every issue. In my opinion and experience, the only way this will change is, much like anything else in AWS, if enough Redshift customers voice a strong opinion in favor of change; then the "customer obsession" inherent in the AWS value system will kick in and drive a customer-driven change. So please make your voices heard.

Regards,
-Kurt

klarson
answered 3 years ago
