Need more details on an RDS Customer Reachout...


Hi,

I'm evaluating Aurora Global Databases for a new application, and during yesterday's testing I appear to have found a bug/crash in the MySQL write forwarding feature, as I received an "RDS Customer Reachout" email from AWS support stating:

Your Amazon Aurora cluster '[REDACTED]' in the 'ap-southeast-2' Region experienced a restart at 2022-12-15T02:07:56 UTC.

Based on our investigation, the root cause of the issue is a defect in global database write forwarding when a connection's session variables are reset.

To prevent this issue from reoccurring, we advise to not use any api call that may switch the current user, e.g. mysql_change_user(). Detailed documentation is available here [1].

We apologize for the inconvenience caused and continue to strive to improve our customer experience with Amazon RDS.

If you have any further questions or require any guidance, please do not hesitate to contact the AWS Support team. We are available on AWS Developer Community [2] or by contacting AWS Premium Support [3].

[1] https://dev.mysql.com/doc/c-api/8.0/en/mysql-change-user.html

[2] https://repost.aws/

[3] https://console.aws.amazon.com/support/

On the one hand, this level of proactive notification from AWS is impressive and welcome. On the other hand, given that I'm not paying for support yet, the lack of any further detail or follow-up on precisely what I triggered is frustrating. So I'm taking the requested course of action and asking for more information on the nature of the "defect".

In particular, I need to know the specific command/query that triggered the session variable reset. I only have a single database user configured in my test database, so the mysql_change_user() method mentioned in the email is almost certainly not the cause, and I'm not doing anything else that would trigger a user change.

How do I get more info from AWS in this situation? For now, I'm going to try to reproduce the issue on my own... not ideal, but it's the only option I have.

1 Answer

I haven't heard anything further from AWS on any of the (multiple) "reach out" cases opened for this issue, or here, but I have figured out what triggers the bug in Aurora.

Steps to Reproduce:

  1. Configure an Aurora MySQL global database, with write forwarding enabled from the secondary to the primary region.
  2. Create a basic Entity Framework Core app that performs two writes against the secondary region via Dapper direct SQL statements.

Expected Behaviour:

Both writes succeed and are forwarded and committed in the primary region.

Actual Behaviour:

The second write triggers an assertion failure on the secondary instance, causing an instance restart. That write, and all subsequent reads, fail for a period of time until the instance comes back up. With error logging enabled on the secondary cluster, the following assertion failure appears in the secondary instance's error log:

2022-12-19T03:45:38.970437Z 137 [ERROR] [MY-013183] [InnoDB] Assertion failure: fwd_cmd.cc:2266:thd->variables.aurora_using_thread_pool_connection_handler == true thread 70368986431216 (ut0dbg.cc:57)

On the application side, the write fails with a MySQL protocol error:

MySqlConnector.MySqlException (0x80004005): Failed to read the result set. ---> System.IO.EndOfStreamException: Expected to read 4 header bytes but only received 0.

Suspected Issue:

Entity Framework Core uses https://mysqlconnector.net/ under the hood, which provides a connection pool to the application. Crucially, per the documentation at https://mysqlconnector.net/connection-options/, the Pooling and ConnectionReset options both default to true, which triggers the following sequence of events:

  1. The first write requests a connection.
  2. A new connection is opened to Aurora.
  3. The first write sets the aurora_replica_read_consistency session variable successfully and performs its write.
  4. The first write completes and the connection is returned to the idle pool.
  5. The second write requests a connection.
  6. EF/MySqlConnector returns the existing idle connection from the pool, issuing a Reset instruction to Aurora.
  7. This triggers the Aurora bug: the secondary instance asserts, crashes and reboots.
  8. The subsequent write fails as shown above.
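The pooled-connection sequence above can be sketched as a toy model. This is a pure simulation with no real MySQL server involved; ToyConnection and ToyPool are illustrative names, not MySqlConnector APIs, and the reset() method only stands in for the protocol-level reset that clears session variables on the write-forwarding secondary:

```python
# Toy model of the MySqlConnector pooling behaviour described above.
# Nothing here talks to a real database; it just replays steps 1-8.

class ToyConnection:
    def __init__(self):
        self.session_vars = {}

    def execute(self, sql):
        # Crude parsing: only "SET var = value" matters for this simulation.
        if sql.upper().startswith("SET "):
            name, _, value = sql[4:].partition("=")
            self.session_vars[name.strip()] = value.strip()

    def reset(self):
        # Stands in for the connection reset: all session variables are
        # cleared. On the write-forwarding secondary, this is the step
        # that tripped the server-side assertion in fwd_cmd.cc.
        self.session_vars.clear()


class ToyPool:
    """Single-slot pool with Pooling=true, ConnectionReset=true semantics."""

    def __init__(self):
        self._idle = None

    def get(self):
        if self._idle is not None:
            conn, self._idle = self._idle, None
            conn.reset()            # step 6: reset issued on reuse
            return conn
        return ToyConnection()      # step 2: fresh connection

    def put(self, conn):
        self._idle = conn           # step 4: returned to the idle pool


pool = ToyPool()

# First write: fresh connection, session variable survives the write.
c1 = pool.get()
c1.execute("SET aurora_replica_read_consistency = 'session'")
assert c1.session_vars == {"aurora_replica_read_consistency": "'session'"}
pool.put(c1)

# Second write: the same connection comes back, but the reset has wiped
# the forwarding session variable - the state the server no longer expects.
c2 = pool.get()
assert c2 is c1
assert c2.session_vars == {}
```

The point of the model is that the application never asks for the reset explicitly; it falls out of the pool's default reuse behaviour, which is why the crash only appears on the second write.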

The issue can be worked around by disabling either Pooling or ConnectionReset in the connection string; however, both of these options are desirable in general, and even with this bug avoided, we are still hitting the subsequent issues below.
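For reference, either workaround is a single connection-string change. The host, database, and credentials below are placeholders; per the MySqlConnector connection-options documentation, the option can also be written as "Connection Reset":

```
# Keep pooling, but skip the connection reset when a pooled connection is reused:
Server=my-secondary-cluster.example.com;Database=app;User ID=app;Password=...;ConnectionReset=false

# Or disable pooling entirely (a new connection per request):
Server=my-secondary-cluster.example.com;Database=app;User ID=app;Password=...;Pooling=false
```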

It would be great to get confirmation from AWS that you agree this is an Aurora issue, and an ETA for when it will be fixed.

answered a year ago
