All Questions


How can I work around spontaneous nvml mismatch errors in AWS ECS gpu image?

We're running g4dn.xlarge instances in a few ECS clusters for some ML services, using the AWS-provided GPU-optimized ECS AMI (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-07dd70259efc9d59b). This morning at around 7-8am PST (12/7/2022), newly provisioned container instances stopped being able to register with our ECS clusters. After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out that we were getting errors in nvml that prevented the ECS init routine from completing:

```
[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch
```

This is the same AMI as some older instances in the cluster that started up fine, and we noticed the issue simultaneously across 4 different clusters. Manually unloading and reloading the nvidia kernel modules on individual hosts resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation):

```
[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm             61440  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_uvm           1142784  0
nvidia              35459072  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   421888  4 drm_kms_helper,nvidia,nvidia_drm
i2c_core               77824  3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi
```

This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch, and how can we work around it in an automated fashion?
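For context, here's a sketch of how we're considering automating the manual fix in instance user data, before the ECS agent starts. The `nvidia_unload_order` helper is made up for this sketch (not part of the AMI), and it assumes the module layout shown in the lsmod output above, where lsmod lists dependent modules before the core `nvidia` module:

```shell
#!/bin/sh
# nvidia_unload_order: read `lsmod` output on stdin and print the nvidia
# modules in unload order -- dependent modules first (preserving lsmod's
# most-recently-loaded-first order), the core "nvidia" module last.
# Hypothetical helper; exact-match on the module name column only.
nvidia_unload_order() {
  awk '$1 ~ /^nvidia/ { if ($1 == "nvidia") core = 1; else print $1 }
       END { if (core) print "nvidia" }'
}

# On an affected instance (as root), before the ECS agent starts:
#   lsmod | nvidia_unload_order | xargs -r -n1 rmmod
#   nvidia-smi    # reloads the modules against the installed driver
```

The idea is that `nvidia-smi` reloads the modules from the driver currently on disk, clearing the mismatch the same way the manual rmmod sequence did.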
0 answers · 0 votes · 4 views · asked 2 hours ago

Is X-Ray on Lambda Compatible with .NET Ahead-of-Time compilation?

When attempting to set up X-Ray in a .NET/C# Lambda function published with AOT, I am getting the following error upon invoking the function:

```
Unhandled Exception: System.TypeInitializationException: A type initializer threw an exception. To determine which type, inspect the InnerException's StackTrace property.
 ---> System.MissingMethodException: No parameterless constructor defined for type 'Amazon.XRay.Recorder.Core.Sampling.Local.SamplingConfiguration'.
   at System.ActivatorImplementation.CreateInstance(Type, Boolean) + 0x120
   at ThirdParty.LitJson.JsonMapper.ReadValue(Type, JsonReader) + 0x483
   at ThirdParty.LitJson.JsonMapper.ToObject[T](TextReader) + 0x4f
   at Amazon.XRay.Recorder.Core.Sampling.Local.LocalizedSamplingStrategy.Init(Stream) + 0x60
   at Amazon.XRay.Recorder.Core.Sampling.Local.LocalizedSamplingStrategy.InitWithDefaultSamplingRules() + 0x53
   at Amazon.XRay.Recorder.Core.AWSXRayRecorder..ctor(ISegmentEmitter) + 0x5e
   at Amazon.XRay.Recorder.Core.AWSXRayRecorder..cctor() + 0xcd
   at System.Runtime.CompilerServices.ClassConstructorRunner.EnsureClassConstructorRun(StaticClassConstructionContext*) + 0xb9
   --- End of inner exception stack trace ---
[...]
```

I've tried adding the following to rd.xml, to no avail:

```xml
<Directives xmlns="http://schemas.microsoft.com/netfx/2013/01/metadata">
  <Application>
    <Assembly Name="bootstrap" Dynamic="Required All"/>
    <Assembly Name="AWSSDK.Core" Dynamic="Required All"/>
    <Assembly Name="AWSSDK.SecretsManager" Dynamic="Required All"/>
    <Assembly Name="AWSSDK.XRay" Dynamic="Required All"/>
    <Assembly Name="AWSXRayRecorder.Core" Dynamic="Required All"/>
    <Assembly Name="AWSXRayRecorder.Handlers.AwsSdk" Dynamic="Required All"/>
    <Assembly Name="System.Configuration.ConfigurationManager">
      <Type Name="System.Configuration.ClientConfigurationHost" Dynamic="Required All" />
      <Type Name="System.Configuration.AppSettingsSection" Dynamic="Required All" />
    </Assembly>
  </Application>
</Directives>
```

My initialization code is as follows:

```c#
AWSSDKHandler.RegisterXRayForAllServices();
AWSXRayRecorder.InitializeInstance(); // pass IConfiguration object that reads appsettings.json file
```

Any ideas?
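One more thing I'm considering: as I understand it, .NET Native AOT uses trimmer root descriptors rather than rd.xml, so the equivalent directive might need to go into an ILLink descriptor file instead. A sketch (I haven't confirmed this fixes the error):

```xml
<!-- ILLink.Descriptors.xml, referenced from the .csproj via
     <TrimmerRootDescriptor Include="ILLink.Descriptors.xml" />.
     preserve="all" roots the entire assembly so LitJson's reflection-based
     deserialization can find SamplingConfiguration's parameterless ctor. -->
<linker>
  <assembly fullname="AWSXRayRecorder.Core" preserve="all" />
</linker>
```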
0 answers · 0 votes · 3 views · Jason T · asked 4 hours ago

Why are ECS Container Instances No Longer Able to Register With ECS Cluster?

We have several EC2 / ECS clusters with ASG capacity providers and managed scaling. Somewhere between 12:30am and ~7am US Pacific time on Dec 7, 2022, new instances provisioned by managed scaling stopped registering with their ECS clusters. We had a successful deploy complete at ~12:30am, performed via a CloudFormation stack update, and it turned over the live container instances successfully, per our capacity provider rules. The ASG provisioned additional EC2 instances mid-morning in response to managed scaling signals, but they never registered with the cluster. Is anyone else experiencing problems with this? What could explain a regression in the absence of any CloudFormation resource changes?

What I tried: I ran the AWSSupport-TroubleshootECSContainerInstance automation in AWS Systems Manager against one of the instances that failed to register, per the troubleshooting docs. It reported that the IAM instance profile lacked all of the required permissions. The same instance profile is attached to the instances that DID successfully register a little after midnight, and there have been no changes to the permissions since. Specifically, the automation reported the following:

```
{"Payload":{"stdout":"The container instance profile <REDACTED> is missing the following required permission(s): ['ecs:RegisterContainerInstance', 'ecs:CreateCluster', 'ecs:DeregisterContainerInstance', 'ecs:Poll', 'ecs:StartTelemetrySession', 'ecs:UpdateContainerInstancesState', 'ecs:SubmitAttachmentStateChange', 'ecs:SubmitContainerStateChange', 'ecs:SubmitTaskStateChange'] Make sure that the container instance has all the recommended permissions. See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#instance-iam-role-permissions ","info_codes":["I002"]}}
```

When I examined the instance profile in question, it had the following permissions, scoped to the correct cluster, as generated by CDK v2:

* ecs:RegisterContainerInstance
* ecs:DeregisterContainerInstance
* ecs:Poll
* ecs:StartTelemerySession
* ecs:Submit*

It lacked the following two permissions altogether:

* ecs:CreateCluster
* ecs:UpdateContainerInstancesState

Eventually I was able to resolve the IAM problems reported by AWSSupport-TroubleshootECSContainerInstance by:

1. Adding the two missing permissions
2. Broadening the scope of the existing ones from the cluster in question to `"Resource": "*"`

This seems like a bug pushed by the ECS team to me: it does not make sense that a previously functioning instance profile stopped working. The new requirement of `ecs:CreateCluster` is especially suspicious -- why would a container instance need to create a cluster? The requirement for Resource access beyond the target cluster is also extremely suspicious. Regardless, new EC2 instances still do not register with their cluster. The AWSSupport-TroubleshootECSContainerInstance automation now comes up blank, stating:

```
{"Payload":{"stdout":"Unable to identify the cause of issue. If you are still experiencing issues while registering <REDACTED> in the cluster named vila-v0-prod, please open a case with Premium Support and attach the logs generated by the ECS Logs Collector script. See: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html If it is not possible to upload the file while case creation, please ask the assigned engineer to provide the instruction to upload the file. ","info_codes":["I000"]}}
```
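For anyone hitting the same thing: here's a quick way I've been sanity-checking a role's granted actions against the required list from the automation output above, without rerunning the automation each time. The `missing_actions` helper is made up for this sketch, and it does exact-match only (it will flag actions covered by a wildcard like `ecs:Submit*` as missing):

```shell
#!/bin/sh
# missing_actions REQUIRED GRANTED: print each space-separated action in
# REQUIRED that does not appear verbatim in GRANTED. Made-up helper;
# exact-match only, wildcards in GRANTED are not expanded.
missing_actions() {
  for action in $1; do
    case " $2 " in
      *" $action "*) ;;               # granted, nothing to report
      *) printf '%s\n' "$action" ;;   # missing
    esac
  done
}

# Required list transcribed from the AWSSupport-TroubleshootECSContainerInstance output:
required='ecs:RegisterContainerInstance ecs:CreateCluster ecs:DeregisterContainerInstance ecs:Poll ecs:StartTelemetrySession ecs:UpdateContainerInstancesState ecs:SubmitAttachmentStateChange ecs:SubmitContainerStateChange ecs:SubmitTaskStateChange'
```

Feed it the `Action` list pulled from the role's policy document (e.g. via `aws iam get-role-policy`) as the second argument and compare against `$required`.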
0 answers · 0 votes · 6 views · asked 7 hours ago