Greeting
Hi Dave,
Thanks for sharing such a detailed description of your issue! I can see you've put significant effort into diagnosing this, and I appreciate your persistence in identifying potential causes. Let’s delve into your problem and see if we can clarify what's happening and how to address it. 😊
Clarifying the Issue
From your explanation, it sounds like you're encountering intermittent failures in EventBridge Pipes after redeploying your stacks, especially when transitioning from development to production. The error, "The security token included in the request is invalid," affects a large proportion of the messages initially but resolves over time. Your observation that modifying resources seems to "reset" things supports the theory of misconfigured internal pollers or a timing issue during deployment. It's intriguing that the issue gradually resolves without manual intervention, suggesting potential retries or restarts of internal processes.
You're on the right track thinking about resource dependencies and timing during deployment. Let’s walk through how you can address this systematically.
Key Terms
- EventBridge Pipes: A service for integrating sources (e.g., SQS queues) with targets (e.g., Lambda functions) via a simple, event-driven pipeline.
- Security Token Invalid Error: A failure indicating the request's authentication or IAM role permissions are not recognized or valid.
- Cold Start: The initialization time required for AWS services (like Lambda or EventBridge pollers) when starting from scratch.
The Solution (Our Recipe)
Steps at a Glance:
- Ensure IAM roles and policies are correct and fully propagated.
- Introduce explicit resource dependencies to delay pipe creation.
- Validate EventBridge and SQS resource configurations post-deployment.
- Consider adding a retry mechanism or interim "warm-up" period after deployment.
- Test resource creation in isolation to identify potential timing issues.
- Investigate and refine debugging configurations for deeper error analysis.
Step-by-Step Guide:
- Verify IAM Roles and Policies:
Double-check that your IAM roles include all necessary permissions for the pipe, SQS, and EventBridge targets. Permissions should include `events:PutEvents` for the event bus and `sqs:ReceiveMessage` for the queue. Ensure propagation time is accounted for in your deployment process.

```json
{
  "Effect": "Allow",
  "Action": [
    "events:PutEvents",
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage"
  ],
  "Resource": "*"
}
```
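If you want to confirm programmatically that the deployed role actually carries these permissions, you can run them through the IAM policy simulator with boto3. A minimal sketch — `ROLE_ARN` is a placeholder to replace with your pipe's execution role, and `REQUIRED_ACTIONS` mirrors the policy above:

```python
# Sketch: verify a role's effective permissions via the IAM policy simulator.
# ROLE_ARN is a placeholder; replace it with your pipe's execution role ARN.
ROLE_ARN = "arn:aws:iam::<ACCOUNT_ID>:role/<PIPE_ROLE_NAME>"
REQUIRED_ACTIONS = ["events:PutEvents", "sqs:ReceiveMessage", "sqs:DeleteMessage"]

def denied_actions(evaluation_results):
    """Return the action names whose simulated decision was not 'allowed'."""
    return [r["EvalActionName"] for r in evaluation_results
            if r["EvalDecision"] != "allowed"]

def check_role_permissions(role_arn, actions):
    import boto3  # imported here so the pure helper above works without AWS access
    iam = boto3.client("iam")
    resp = iam.simulate_principal_policy(PolicySourceArn=role_arn,
                                         ActionNames=actions)
    return denied_actions(resp["EvaluationResults"])

# Usage (requires AWS credentials):
#   print("Denied actions:", check_role_permissions(ROLE_ARN, REQUIRED_ACTIONS))
```

Anything returned by `check_role_permissions` is an action the role cannot perform and a likely source of authorization failures.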
- Add Resource Dependencies:
Modify your deployment to ensure that the EventBridge event bus and SQS queues are fully initialized before the pipes are created. In AWS CDK, you can add dependencies explicitly:

```typescript
const queue = new sqs.Queue(this, 'Queue');
const eventBus = new events.EventBus(this, 'EventBus');

const pipe = new pipes.CfnPipe(this, 'Pipe', {
  source: queue.queueArn,
  target: eventBus.eventBusArn,
  roleArn: pipeRole.roleArn
});

pipe.node.addDependency(queue);
pipe.node.addDependency(eventBus);
```
- Validate Resources Post-Deployment:
After deploying, manually validate that the EventBridge bus, SQS queues, and associated IAM roles are configured correctly. Use the AWS CLI or SDK to confirm:

```shell
aws events describe-event-bus --name DevL8State-3a0a9147
aws sqs get-queue-attributes --queue-url <Queue URL>
```
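If you would rather script these checks, the same validation can be done with boto3, including the pipe itself. A sketch under placeholder names (`PIPE_NAME`, `QUEUE_URL` are assumptions to replace with your own); `pipe_is_healthy` is a pure helper you can reuse against any `DescribePipe` response:

```python
# Sketch: post-deployment validation of a pipe and its source queue via boto3.
# PIPE_NAME and QUEUE_URL are placeholders for your resources.
PIPE_NAME = "<YOUR_PIPE_NAME>"
QUEUE_URL = "<Queue URL>"

def pipe_is_healthy(describe_response):
    """Pure check on a DescribePipe response: the pipe should be running."""
    return (describe_response.get("CurrentState") == "RUNNING"
            and describe_response.get("DesiredState") == "RUNNING")

def validate_resources(pipe_name, queue_url):
    import boto3  # local import keeps the pure helper usable without AWS access
    pipes = boto3.client("pipes")
    sqs = boto3.client("sqs")
    desc = pipes.describe_pipe(Name=pipe_name)
    attrs = sqs.get_queue_attributes(QueueUrl=queue_url,
                                     AttributeNames=["QueueArn"])
    return pipe_is_healthy(desc), attrs["Attributes"]["QueueArn"]

# Usage (requires AWS credentials):
#   healthy, queue_arn = validate_resources(PIPE_NAME, QUEUE_URL)
```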
- Add a Warm-Up Period:
Introduce a delay after deployment to allow services like EventBridge pollers to stabilize. You can use a Lambda function or CloudFormation `WaitCondition` for this purpose.

```typescript
const warmUpFunction = new lambda.Function(this, 'WarmUpFunction', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline(`
    exports.handler = async () => {
      console.log('Warm-up complete');
      return;
    };
  `)
});

warmUpFunction.node.addDependency(pipe);
```
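As an alternative to a fixed delay, one option is to poll the pipe's state after deployment until it reports `RUNNING`. A sketch with exponential backoff; the pipe name is a placeholder:

```python
import time

def next_delay(delay, cap=60.0):
    """Exponential backoff: double the wait, up to a cap."""
    return min(delay * 2, cap)

def wait_for_pipe_running(pipe_name, timeout=300):
    """Poll DescribePipe until CurrentState is RUNNING or the timeout expires."""
    import boto3  # local import so next_delay() stays usable without AWS access
    pipes = boto3.client("pipes")
    deadline = time.time() + timeout
    delay = 2.0
    while time.time() < deadline:
        state = pipes.describe_pipe(Name=pipe_name)["CurrentState"]
        if state == "RUNNING":
            return True
        time.sleep(delay)
        delay = next_delay(delay)
    return False

# Usage (requires AWS credentials):
#   wait_for_pipe_running("<YOUR_PIPE_NAME>")
```

Note that a `RUNNING` state does not guarantee the internal pollers have fresh credentials, so this complements rather than replaces the dependency fixes above.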
- Isolate and Test Deployment Timing:
Break down your stack deployment into smaller steps and monitor the behavior of individual resources. This helps pinpoint where the timing issue originates.
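One way to script that isolation is to deploy the stacks one at a time, pipe-containing stacks last, with a settling pause between them. A rough sketch — the stack names are hypothetical, and `deploy_order` simply pushes any stack with "Pipe" in its name to the end:

```python
import subprocess
import time

STACKS = ["QueueStack", "BusStack", "PipeStack"]  # hypothetical stack names

def deploy_order(stacks):
    """Pure helper: order stacks so pipe stacks deploy last."""
    return sorted(stacks, key=lambda name: "Pipe" in name)

def deploy_in_stages(stacks, pause_seconds=30):
    """Deploy each stack separately, pausing so IAM and pollers can settle."""
    for stack in deploy_order(stacks):
        subprocess.run(["cdk", "deploy", stack, "--require-approval", "never"],
                       check=True)
        time.sleep(pause_seconds)

# Usage (requires the CDK CLI and AWS credentials):
#   deploy_in_stages(STACKS)
```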
- Investigate and Refine Debugging Configurations:
Step-by-Step Enhancements:
- Token Expiry or Invalidity Causes: Temporary credentials provided by IAM roles (e.g., through AssumeRole or using temporary session tokens) might expire during long-running deployments. To mitigate:
  - Use the AWS Security Token Service (STS) to confirm the validity of tokens:

    ```shell
    aws sts get-caller-identity
    ```
  - If deploying through automation (e.g., CI/CD pipelines), ensure tokens are refreshed or use long-lived credentials for deployment.
  - Check the default session duration for roles. Extend it if needed using the `DurationSeconds` parameter when assuming a role:

    ```shell
    aws sts assume-role --role-arn <ROLE_ARN> --role-session-name <SESSION_NAME> --duration-seconds 3600
    ```
- IAM Trust Relationships and Role Permissions: Ensure all roles assumed by EventBridge, SQS, and your pipelines explicitly trust the services and users involved. For instance, the trust relationship for EventBridge should include:

  ```json
  {
    "Effect": "Allow",
    "Principal": {
      "Service": "events.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }
  ```
- Region or Account Misconfiguration: If resources span multiple regions or accounts, ensure the roles and policies grant cross-account or cross-region access explicitly. Use `Resource` ARNs carefully to avoid restricting access unnecessarily.
- Enable Verbose Debugging:
  - Turn on CloudWatch detailed logs for EventBridge Pipes to gain insights into token failures. Use this policy snippet to allow logs:

    ```json
    {
      "Effect": "Allow",
      "Action": "logs:CreateLogStream",
      "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws/events/*"
    }
    ```
  - Analyze logs for specific errors such as `UnrecognizedClientException` or `ExpiredToken`.
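To scan for those errors in bulk, you can filter the log group with boto3. A sketch with a placeholder log group name; CloudWatch's `?term` filter syntax matches events containing any of the listed terms:

```python
TOKEN_ERRORS = ("UnrecognizedClientException", "ExpiredToken")

def token_error_events(log_events, patterns=TOKEN_ERRORS):
    """Pure filter: keep log events whose message mentions a token error."""
    return [e for e in log_events
            if any(p in e.get("message", "") for p in patterns)]

def scan_log_group(log_group_name):
    import boto3  # local import keeps the filter helper testable offline
    logs = boto3.client("logs")
    resp = logs.filter_log_events(
        logGroupName=log_group_name,
        filterPattern="?UnrecognizedClientException ?ExpiredToken",
    )
    return token_error_events(resp.get("events", []))

# Usage (requires AWS credentials):
#   scan_log_group("/aws/events/<YOUR_LOG_GROUP>")
```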
- Test Token Lifecycles:
  - Run targeted tests to simulate token behavior under deployment stress using the following script:

```python
import boto3
import time

# Constants
ROLE_ARN = "arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>"  # Replace with your role ARN
DURATION = 3600   # Session duration in seconds (1 hour)
INTERVAL = 600    # Time between validation checks in seconds (10 minutes)

def assume_role(role_arn, duration):
    """
    Assumes the specified IAM role and returns temporary credentials.
    """
    sts_client = boto3.client('sts')
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="TokenLifecycleTest",
        DurationSeconds=duration
    )
    return response['Credentials']

def validate_token(credentials):
    """
    Uses the temporary credentials to validate the token's validity
    by calling STS GetCallerIdentity.
    """
    session = boto3.Session(
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken']
    )
    sts_client = session.client('sts')
    try:
        # Validate the token by making an authenticated API call
        response = sts_client.get_caller_identity()
        print(f"Token is valid. Caller: {response['Arn']}")
    except Exception as e:
        print(f"Token validation failed: {e}")

if __name__ == "__main__":
    print("Starting token lifecycle diagnostic...")

    # Step 1: Assume the IAM role and retrieve temporary credentials
    credentials = assume_role(ROLE_ARN, DURATION)
    print(f"Token assumed. Validating every {INTERVAL} seconds for {DURATION} seconds.")

    # Step 2: Validate token periodically for the duration of its validity
    for _ in range(DURATION // INTERVAL):
        validate_token(credentials)
        time.sleep(INTERVAL)  # Wait for the specified interval before next validation

    print("Token lifecycle diagnostic complete.")
```
Closing Thoughts
I hope these steps provide a clear path forward, Dave. Timing issues in AWS resource creation can be tricky, but adding explicit dependencies and validating configurations should help mitigate this issue. If the problem persists, consider reaching out to AWS Support for deeper investigation into the behavior of EventBridge Pipes.
Let me know how it goes or if you have more questions! Good luck with your production rollout! 🚀✨
Cheers,
Aaron 😊
Thanks for the response. After quite a protracted business-level support request, AWS confirmed that this behaviour was due to caching of security tokens on their side. In effect, when a pipe was created, the tokens were cached based on the resource name alone, so if a destroy/deploy cycle reused the same name, the outdated security tokens would be used initially, causing the issues. They implemented a fix, which I confirmed to be working.
I would recommend reaching out to AWS Support for this one: https://aws.amazon.com/contact-us/. They can work with the development team to investigate this further.