
API Gateway Websocket suddenly stops working

We have an API Gateway WebSocket API that forwards messages over HTTP to our HTTP back-end. Everything works fine until, suddenly, it doesn't.

It stops working abruptly, without any reconfiguration on our part. We can fix it by redeploying the latest configuration (with zero changes to it). In CloudWatch we see log messages that say something along the lines of "configuration error", but they do not specify which part of the configuration is faulty.

Right now, we have an alarm on those error messages. When it fires, we manually redeploy the latest configuration. Obviously, this is not ideal.

How do we debug this? We're out of ideas.

[EDIT] To clarify:

  1. When the issue happens, everything stops working entirely. It's not an issue for X number of clients; it's an issue for ALL clients.
  2. We use Terraform and deploy identical infrastructure to dev, staging and prod. Dev and staging work fine; the issue affects only prod. The only difference between the environments is the amount of traffic, and even prod doesn't get much.
1 Answer
    An API Gateway WebSocket API that suddenly stops working can have several causes. Here are some debugging approaches:

    1. Check your custom domain configuration if you're using one. Sometimes the connection between custom domains and API Gateway can become disconnected unexpectedly. Verify that your API mappings are still intact and correctly pointing to the right API and stage.

    2. If you're using CloudFront in front of your API Gateway WebSocket API, ensure that the CloudFront distribution is properly configured. Verify that the origin domain name matches your API Gateway WebSocket endpoint, and that the behavior settings allow the necessary HTTP methods and forward headers such as Upgrade and Connection so that the WebSocket handshake can complete.

    3. Look for concurrent connection issues. API Gateway WebSocket uses a single TCP connection for each WebSocket connection. If multiple concurrent requests are made to the same WebSocket connection, there's potential for data race conditions and interleaving of frames, which can cause the client to receive corrupted frames and disconnect.

    4. Review your timeout settings. If your backend takes longer to respond than the API Gateway integration timeout (29 seconds for WebSocket APIs), those requests will fail and clients may see errors or disconnections.

    5. Set up more detailed CloudWatch logging for your API Gateway to capture more specific error information. Look for patterns in when the failures occur.

    6. Check if there are any service-linked role issues that might be affecting your API Gateway configuration.

    7. Consider implementing an automated solution that monitors for these configuration errors and automatically redeploys the API when needed, rather than requiring manual intervention.
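
    As a rough illustration of item 5, the sketch below builds the arguments for the boto3 apigatewayv2 update_stage call so that a stage writes access logs and INFO-level execution logs. The API ID, stage name, log group ARN and the choice of logged fields are placeholders, and it assumes the account already has a CloudWatch Logs role configured for API Gateway. Since you manage the API with Terraform, the same settings would normally live in the stage resource there, otherwise the next apply may revert them.

```python
import json

# Fields written to the access log for every request. The selection is an
# assumption; check the $context variables documented for WebSocket APIs.
ACCESS_LOG_FIELDS = {
    "requestId": "$context.requestId",
    "connectionId": "$context.connectionId",
    "eventType": "$context.eventType",
    "routeKey": "$context.routeKey",
    "status": "$context.status",
    "errorMessage": "$context.error.message",
    "errorType": "$context.error.responseType",
    "integrationLatency": "$context.integrationLatency",
}


def build_stage_logging_update(api_id, stage_name, log_group_arn):
    """Return keyword arguments for apigatewayv2 update_stage that enable logging."""
    return {
        "ApiId": api_id,
        "StageName": stage_name,
        # Access logs: one JSON line per request in the given log group.
        "AccessLogSettings": {
            "DestinationArn": log_group_arn,
            "Format": json.dumps(ACCESS_LOG_FIELDS),
        },
        # Execution logs for all routes, including request/response data.
        "DefaultRouteSettings": {
            "LoggingLevel": "INFO",
            "DataTraceEnabled": True,
            "DetailedMetricsEnabled": True,
        },
    }


def enable_stage_logging(client, api_id, stage_name, log_group_arn):
    """Apply the logging settings using an apigatewayv2 client passed in by the caller."""
    kwargs = build_stage_logging_update(api_id, stage_name, log_group_arn)
    return client.update_stage(**kwargs)
```

    With a real client this would be called as enable_stage_logging(boto3.client("apigatewayv2"), api_id, stage_name, log_group_arn). Logging the connection ID and error type per request should make it easier to see what changes at the moment the "configuration error" messages start.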

    Since redeploying without changes fixes the issue temporarily, it suggests that something is causing your configuration to become invalid or disconnected. This could be related to internal API Gateway service maintenance or updates that occasionally affect your specific setup.
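
    Until the root cause is found, the manual redeploy could be automated as suggested in item 7. Below is a minimal sketch of a Lambda handler that the existing alarm could trigger (for example through SNS). It assumes the function has permission to create deployments and that the hypothetical environment variables API_ID and STAGE_NAME identify the affected API and stage; it is a stopgap, not a fix.

```python
import os


def redeploy(client, api_id, stage_name,
             reason="automatic redeploy after configuration error alarm"):
    """Create a new deployment of the current configuration and publish it to the stage."""
    response = client.create_deployment(
        ApiId=api_id,
        StageName=stage_name,
        Description=reason,
    )
    return response["DeploymentId"]


def lambda_handler(event, context):
    # boto3 is imported here so that redeploy() can be exercised without AWS access.
    import boto3

    client = boto3.client("apigatewayv2")
    # API_ID and STAGE_NAME are hypothetical variable names chosen for this sketch.
    deployment_id = redeploy(client, os.environ["API_ID"], os.environ["STAGE_NAME"])
    print(f"created deployment {deployment_id}")
    return {"deploymentId": deployment_id}
```

    Note that a deployment created this way is not known to Terraform, so the stage may show drift on the next plan. Logging each automatic redeploy with its timestamp also gives you a record of how often and when the failure occurs, which helps when looking for a pattern.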
  

Sources
Socket connection Cloudfront / ApiGateway | AWS re:Post
API Gateway WebSocket - Invalid Frame Header | AWS re:Post
Custom Domain & API Gateway Connection suddenly disconnected. | AWS re:Post
AWS REST apigateway throws 502 even though integration timeout is set to 59000 ms | AWS re:Post

answered a year ago
