All Questions

Upgrade Elasticsearch cluster size and use 2 subnet IDs.

Hi, we are using HashiCorp Terraform to maintain our AWS resources. Currently we have an Elasticsearch cluster (an OpenSearch Service domain) configured as follows:

* A single subnet
* Version 5.3 (really old)
* Instance count = 1
* Instance type = m4.large.elasticsearch
* Volume size = 512
* Use EBS = true
* EBS volume type = gp2
* Zone awareness = false (if providing one subnet), true (if providing 2 subnets)

We are considering upgrading to 2 instances with a larger instance type, such as m5.large.elasticsearch. Could you please confirm:

* During the upgrade (I believe it is a blue/green deployment), we can still call the API to insert documents, and documents inserted during the upgrade will still be available after the upgrade. Am I correct?
* Within the same subnet, will upgrading to more nodes rebalance part of the indexes and documents onto the 2nd node, improving performance through load balancing? Or will the 2nd node contain only replica shards from the 1st node, so queries improve but inserts do not?
* If we use 2 subnets and 2 nodes, will each subnet get one node? If so, will one node hold only replica shards of the other (to ensure failover), so that query/insert/update performance will not improve?
* If upgrading to 2 nodes, will the EBS volume size be shared by the 2 nodes (each gets 256, or one shared EBS volume between them), or will each node get 512?
* Does version 5.3 even support dedicated master nodes?
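For concreteness, here is a minimal, hedged sketch of the change being asked about, expressed with the AWS CLI. The domain name and subnet IDs are hypothetical placeholders, and this illustrates the target configuration rather than a confirmed upgrade procedure:

```
# Sketch only: move to 2 x m5.large.elasticsearch data nodes across 2 subnets
# with zone awareness enabled. "my-es-domain" and the subnet IDs below are
# placeholders, not real resources.
aws es update-elasticsearch-domain-config \
  --domain-name my-es-domain \
  --elasticsearch-cluster-config '{
    "InstanceType": "m5.large.elasticsearch",
    "InstanceCount": 2,
    "ZoneAwarenessEnabled": true,
    "ZoneAwarenessConfig": { "AvailabilityZoneCount": 2 }
  }' \
  --vpc-options '{ "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"] }'

# Poll the cluster config until the change has finished processing.
aws es describe-elasticsearch-domain-config \
  --domain-name my-es-domain \
  --query 'DomainConfig.ElasticsearchClusterConfig.Status.State'
```

Since the domain is Terraform-managed, the same settings would in practice go into the Terraform configuration (cluster, zone awareness, and VPC subnet arguments) and be rolled out with terraform apply rather than via the CLI.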
1 answer · 0 votes · 18 views · asked a day ago

How can I work around spontaneous NVML mismatch errors in the AWS ECS GPU image?

We're running g4dn.xlarge instances in a few ECS clusters for some ML services, and use the AWS-provided GPU-optimized ECS AMI (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-07dd70259efc9d59b). This morning at around 7-8am PST (12/7/2022), newly-provisioned container instances stopped being able to register with our ECS clusters. After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out that we were getting errors in NVML that prevented the ECS init routine from completing:

```
[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch
```

This is the same AMI as some older instances in the cluster that started up fine. We noticed the issue simultaneously across 4 different clusters. Manually unloading and reloading the NVIDIA kernel modules on individual hosts resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation):

```
[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm             61440  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_uvm           1142784  0
nvidia              35459072  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   421888  4 drm_kms_helper,nvidia,nvidia_drm
i2c_core               77824  3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi
```

This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch, and how can we work around it in an automated fashion?
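One possible way to automate the manual fix is sketched below: a small script run early in the instance lifecycle (e.g. from EC2 user data, before the ECS agent starts) that detects the mismatch and reloads the NVIDIA kernel modules, mirroring the rmmod/nvidia-smi steps above. The assumption that the ECS agent runs as the `ecs` systemd service on this AMI, and that running this at boot is safe, is mine; this is a workaround sketch, not a root-cause fix.

```
#!/usr/bin/env bash
# Workaround sketch: if nvidia-smi reports a driver/library mismatch,
# unload the NVIDIA kernel modules, let nvidia-smi reload them against
# the installed driver, then restart the ECS agent so its GPU manager
# can re-run NVML initialization.
set -euo pipefail

if nvidia-smi > /dev/null 2>&1; then
  echo "nvidia-smi OK; no driver/library mismatch detected"
  exit 0
fi

echo "nvidia-smi failed; reloading NVIDIA kernel modules"

# Unload in dependency order: uvm, drm, and modeset depend on the core
# nvidia module, so it must be removed last.
for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
  if lsmod | grep -q "^${mod} "; then
    rmmod "${mod}"
  fi
done

# Running nvidia-smi loads the modules that match the currently installed
# userspace driver, clearing the mismatch.
nvidia-smi

# Restart the ECS agent so ecs-init's "Nvidia GPU Manager: setup" runs again
# (assumes the agent is managed as the "ecs" systemd service).
systemctl restart ecs
```

Unloading the modules is only safe while nothing is using the GPU, which should hold at instance boot before any tasks have been placed.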
0 answers · 0 votes · 7 views · asked a day ago