Questions tagged with High Performance Compute
Hyperthreading and core cost
Hi all, I have come across a technical point that could benefit from expert discussion. Computational Fluid Dynamics software (CFD, STAR-CCM+ by Siemens) does not use hyperthreading; it actually runs about 40% slower with hyperthreading on. This is because the CFD algorithms are very efficient and keep the CPUs at almost 100% utilization. When I have a limit of, say, 500 vCPUs, I have to turn off hyperthreading and run on only 250 vCPUs. I am of course paying for 500, so effectively I am being charged double for efficient code.

I had more or less accepted this, except it looks like the Siemens cloud GUI solution (new, and using AWS in the background, so a direct comparison of capability) charges only for the actual cores used. Early suggestions (very early) are that costs per core may be about half of the AWS cores (the best compute ones, c6i). This is likely because of hyperthreading. I have asked whether it is possible to look at the costing of AWS compute cores with regard to hyperthreading; maybe there could be a CFD option on core cost? I look forward to comments and thoughts!

Kind Regards
Stephen
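In case it helps anyone else hitting this: a minimal sketch (the AMI ID, region, and instance size are placeholders, assuming boto3 with credentials configured) of launching an instance with only one thread per physical core via EC2 CPU options, so the guest never sees the hyperthreads at all. This does not change the per-instance price, which is the core of the cost question above; it only avoids having to disable SMT inside the OS.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="c6i.32xlarge",       # 64 physical cores, 128 vCPUs by default
    MinCount=1,
    MaxCount=1,
    # Expose one thread per physical core, i.e. no hyperthreading in the guest.
    # Billing is still for the full instance, so the vCPU quota is unchanged.
    CpuOptions={"CoreCount": 64, "ThreadsPerCore": 1},
)
print(response["Instances"][0]["InstanceId"])
```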
Parallel cluster nodes failed
When running a ParallelCluster using a partition with c6g.medium on-demand instances, 19 of them failed during a run and never powered up again. My `sinfo` returns:

```
PARTITION      AVAIL  TIMELIMIT  NODES  STATE  NODELIST
c6gm-ondemand  up     infinite   19     idle%  c6gm-ondemand-dy-c6gmedium-[32-50]
c6gm-ondemand  up     infinite   31     alloc  c6gm-ondemand-dy-c6gmedium-[1-31]
```

while `sacct` contains the following entries:

```
12033  2022_6_39+  c6gm-onde+  1  NODE_FAIL  0:0
12034  2022_6_40+  c6gm-onde+  1  NODE_FAIL  0:0
12037  2022_6_43+  c6gm-onde+  1  NODE_FAIL  0:0
12039  2022_6_45+  c6gm-onde+  1  NODE_FAIL  0:0
12040  2022_6_46+  c6gm-onde+  1  NODE_FAIL  0:0
```

Does anyone know how I can figure out what caused these nodes to fail and never be booted again? The other 31 on-demand nodes have been running similar tasks as the 19 failed nodes without problems. Also, is there any way to restart the 19 failed nodes somehow? I would really like to run 50 nodes in parallel, not 31.

EDIT: my `squeue` contains hundreds more `PENDING` jobs to be run on nodes in this partition, so I'm a bit confused why the `idle%` nodes aren't being powered up again.
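Not an authoritative answer, but a sketch of how I would start debugging this, assuming you can run commands on the head node: `scontrol show node` usually records a `Reason=` for why Slurm marked a node down, and ParallelCluster's `/var/log/parallelcluster/clustermgtd` log on the head node normally explains why a dynamic node was not replaced. The node range below is taken from the `sinfo` output above.

```python
import subprocess

# Node names taken from the sinfo output above (c6gmedium-[32-50]).
nodes = [f"c6gm-ondemand-dy-c6gmedium-{i}" for i in range(32, 51)]

for node in nodes:
    # 'scontrol show node' prints State= and Reason= fields that usually
    # explain why Slurm considers the node failed or kept it powered down.
    out = subprocess.run(
        ["scontrol", "show", "node", node],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "State=" in line or "Reason=" in line:
            print(node, line.strip())
```

In plain Slurm, `scontrol update nodename=<name> state=resume` is the usual way to clear a failed node so it can be powered up again, although I'm not certain how that interacts with ParallelCluster's clustermgtd, so treat it as something to verify rather than a definitive fix.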
Faster processing: EBS vs. S3
What is the best way for an application hosted on an EC2 instance to read its data from S3 rather than EBS? I am currently using an EC2 instance to read data stored in EBS (approx. 2 TB) and to run many transformations in ETL and analytics jobs. But as part of a strict 3-tier architecture, this data needs to move from EBS (application layer) to the data tier (preferably S3). My understanding is that if I move all of this data permanently from EBS to S3 and read 2 TB from S3 daily for my jobs, job performance will suffer badly.

1. Can you please suggest a better approach to achieve this?
2. Instead of S3, can I use any other service?
3. The system is a Linux system, so I can't use FSx.
4. I need lightning-fast performance for my jobs.

Any help in this regard will be appreciated.
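Not a definitive answer, but since the performance concern is mostly about S3 read throughput: S3 can usually feed an EC2 instance at close to its network bandwidth if you read with enough parallelism. Below is a minimal sketch using boto3's managed transfer with concurrent ranged GETs; the bucket name, key, and local path are placeholders.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart, concurrent ranged GETs; a single stream rarely saturates the
# instance's network bandwidth, but many parallel streams usually do.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # 64 MiB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=32,
    use_threads=True,
)

s3.download_file(
    "my-data-tier-bucket",         # placeholder: your data-tier bucket
    "input/part-0001.parquet",     # placeholder: one of the daily input objects
    "/mnt/scratch/part-0001.parquet",
    Config=config,
)
```

Whether this is fast enough depends mainly on the instance's network bandwidth and on how many objects you read in parallel; staging the day's working set onto local NVMe or EBS before the ETL run is another common pattern.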
Amazon SSM Agent causing high CPU load, near 100%
I just started a t3a.nano instance. It is normal at startup but becomes unstable later. The SSM Agent service ran as a cron job and nearly crashed my server; right now I cannot connect over SSH. I have tried stopping and starting the instance as well, but nothing has helped so far. Has anyone run into this issue?

P.S. I have read older topics and waited 1-2 hours for the SSM Agent to update, but nothing changed. For AWS Support: you can check my instance i-0d8bcd6234b2d9ac6.
AWS instance randomly becomes unresponsive
My AWS instance randomly becomes unresponsive every day. I am unable to `ping` it and all ports on the public IP are inaccessible, but the AWS dashboard shows the instance as running. The only way to fix it is by rebooting, but I don't want to have to do this every day. The instance reachability check fails but the system status check doesn't:

![status check](/media/postImages/original/IMrJ6TIc7pSVKt1XkZhJ6iRQ)

The CPU utilization is not even high, so I know it isn't crashing from load:

![cpu utilization](/media/postImages/original/IMG3l0VCteQKypIPRy7h_w1w)

The system log doesn't show anything wrong either (pastebin: https://pastebin.com/Ri1ui8sp). My instance type doesn't support the EC2 serial console, so I can't access that either.
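A small diagnostic sketch that may help narrow this down (the instance ID and region are placeholders, assuming boto3 with credentials): when the system status check passes but the instance reachability check fails, the cause is usually inside the guest (out of memory, kernel panic, exhausted CPU credits on burstable types, broken network config), and the console output sometimes captures the trace even when the serial console isn't available.

```python
import base64
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region
instance_id = "i-0123456789abcdef0"                   # placeholder instance ID

# Compare the system check (host-side) with the instance check (guest-side).
status = ec2.describe_instance_status(
    InstanceIds=[instance_id], IncludeAllInstances=True
)
for s in status["InstanceStatuses"]:
    print("system:", s["SystemStatus"]["Status"],
          "instance:", s["InstanceStatus"]["Status"])

# The console output sometimes shows a kernel panic or OOM trace even when
# the serial console isn't supported for the instance type.
out = ec2.get_console_output(InstanceId=instance_id, Latest=True)
print(base64.b64decode(out.get("Output", "")).decode(errors="replace"))
```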
ElastiCache vertical scale-up strange behaviour
Hi community! In my application, ElastiCache (cluster mode disabled) is used in two daily scenarios:

1. Intense usage, for about 3 hours, in which we need improved network performance; this runs on cache.m6g.2xlarge.
2. Light usage, for the rest of the day, for which a cache.m6g.large would be more than enough.

We currently run the 2xlarge 24/7, but it would be nice to be able to scale vertically up and down around the intensive hours. However, when we scale up (large ⇾ 2xlarge) right before the heavy process, the instance does not behave the same as when we don't scale (keeping the 2xlarge the whole day). For comparison, the first graph shows Network Bytes In when there is a scale-up right before the process, and the second shows the same metric when there isn't:

![With scale up, from cache.m6g.large to cache.m6g.2xlarge, reaching a max of 13Gb per minute](/media/postImages/original/IMox_zRCrAQyqlKEXGQo_Ixg)

With scale-up, from cache.m6g.large to cache.m6g.2xlarge, reaching a max of 13Gb per minute.

![No scale up, instance cache.m6g.2xlarge reaches a max of 24Gb per minute](/media/postImages/original/IMkONubqtMS2SW_dQK_FAOAQ)

No scale-up, reaching a max of 24Gb per minute.

Note that the cache process only starts after the cluster status is set to available. This drop in the Network Bytes In rate shouldn't be happening, and it makes the scaling option impractical for us. What is the point of offering an online scaling feature that does not perform as it should after scaling?

Has anyone experienced something similar, and do you know of any alternatives to accomplish our goal of providing the extra performance only during the cache hours while keeping our costs reasonable? Thanks.
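For reference, the scaling mechanics themselves are just one API call; a minimal sketch is below (the replication group ID and region are placeholders, assuming boto3 with credentials and cluster mode disabled, as in your setup). This doesn't explain the reduced Network Bytes In after scaling, which is the real question here, but it is the call we would schedule before and after the intense window.

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")  # placeholder

REPLICATION_GROUP_ID = "my-cache-rg"  # placeholder replication group ID

def scale(node_type: str) -> None:
    """Request an online vertical scale to the given node type."""
    elasticache.modify_replication_group(
        ReplicationGroupId=REPLICATION_GROUP_ID,
        CacheNodeType=node_type,
        ApplyImmediately=True,
    )

# Before the intense 3-hour window:
scale("cache.m6g.2xlarge")
# ... wait until the replication group status returns to 'available' ...
# After the window:
scale("cache.m6g.large")
```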
Point ParallelCluster OnNodeConfigured option to a script on a locally mounted filesystem
When setting up a ParallelCluster, there is an option to mount an EFS file system on head and compute nodes with something like:

```
SharedStorage:
  - MountDir: /efs
    Name: standard-efs
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-blah
```

There is also a way to run an automatic configuration script on head/compute nodes with, e.g.:

```
HeadNode:
  CustomActions:
    OnNodeConfigured: script-url
```

Currently, `script-url` has to start with `s3://` or `https://`. Is there a way to point a head/compute node configuration script to my mounted EFS dir, e.g., to `file://efs/my-node-setup.sh`?
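I don't think `file://` URLs are supported, but one workaround (a sketch, not an official recommendation) is to keep the real logic on EFS and point `OnNodeConfigured` at a tiny S3-hosted wrapper that just executes the EFS script. This relies on the EFS mount being in place before `OnNodeConfigured` runs, which should be the case since custom bootstrap actions fire after node configuration completes, but it is worth verifying. The bucket and key names below are placeholders.

```python
import boto3

# A minimal wrapper that hands off to the script living on the mounted EFS dir.
WRAPPER = """#!/bin/bash
set -euo pipefail
exec /efs/my-node-setup.sh "$@"
"""

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-cluster-bootstrap",          # placeholder bucket
    Key="wrappers/run-efs-setup.sh",
    Body=WRAPPER.encode(),
)
# Then point OnNodeConfigured at s3://my-cluster-bootstrap/wrappers/run-efs-setup.sh
```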
AWS Inspector Shows Critical Updates Pending But Instance Says Otherwise
Hi Team,

Instance ID: i-0e5934adddc2d8372

I've updated all the packages (see Libcurl-2.png), but Inspector still shows critical updates pending on my instance (see Libcurl-1.png). Requesting help investigating this.

![screenshot 1](/media/postImages/original/IMgUGzjYUXQAOinPmO1hAqZg)
![screenshot 2](/media/postImages/original/IMjXCbaW5ZTuaGmhzz6Nw26g)
![screenshot 3](/media/postImages/original/IMQf0BnF4-RvGK_YroaAC43Q)

This is what Inspector shows for the instance (affected packages):

| Name | Installed version / Fixed version | Package manager |
| --- | --- | --- |
| libcurl | 0:7.79.1-4.amzn2.0.1.X86_64 / 0:7.79.1-6.amzn2.0.1 | OS |
| curl | 0:7.79.1-4.amzn2.0.1.X86_64 / 0:7.79.1-6.amzn2.0.1 | OS |

This is what the instance shows when trying to remediate (i.e. update the package); it says it is already updated:

```
sh-4.2$ sudo yum update libcurl
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
amzn2-core                                    | 3.7 kB  00:00:00
No packages marked for update
sh-4.2$
```
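One thing worth checking, as a sketch (the fixed version string is copied from the finding above): compare what is actually installed against the version Inspector says contains the fix. `yum update` reporting "No packages marked for update" only means the enabled repositories have nothing newer than what is installed, which can happen either because the fix is already installed or because the repo doesn't carry it yet; Inspector findings can also lag until the next scan.

```python
import subprocess

# Fixed version taken from the Inspector finding above.
FIXED = "7.79.1-6.amzn2.0.1"

for pkg in ("curl", "libcurl"):
    # Print the installed version-release so it can be compared to the fix.
    installed = subprocess.run(
        ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}\n", pkg],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(f"{pkg}: installed {installed}, finding says fixed in {FIXED}")
```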
AWS Batch requesting more vCPUs than tasks require
Hi,

We have an AWS Batch compute environment set up to use EC2 Spot Instances, with no limits on instance type, and with the `SPOT_CAPACITY_OPTIMIZED` allocation strategy.

We submitted a task requiring 32 vCPUs and 58000 MB of memory (which is 2 GB below the minimum amount of memory for the smallest 32-vCPU instance size, c3.8xlarge, just to leave a bit of headroom), and this is reflected on the job status page. We expected to receive an instance with 32 vCPUs and >64 GB of memory, but received an `r4.16xlarge` with 64 vCPUs and 488 GB of memory. An `r4.16xlarge` is rather oversized for the single task in the queue, and our task can't take advantage of the extra cores, as we pin the processes to the specified number of cores so multiple tasks scheduled on the same host don't contend over CPU. We have no other tasks in the queue and no currently running compute instances, nor was any desired/minimum capacity set on the compute environment before this task was submitted.

The autoscaling history shows:

`a user request update of AutoScalingGroup constraints to min: 0, max: 36, desired: 36 changing the desired capacity from 0 to provide the desired capacity of 36`

Where did this 36 come from? Surely it should be 32 to match our task? I'm aware that the docs say:

`However, AWS Batch might need to exceed maxvCpus to meet your capacity requirements. In this event, AWS Batch never exceeds maxvCpus by more than a single instance.`

But we're concerned that once we start scaling up, each task will be erroneously requested with 4 extra vCPUs. I'm guessing what happened in this case is due to the `SPOT_CAPACITY_OPTIMIZED` allocation strategy:

* Batch probably queried for the best available host to meet our 32-vCPU requirement and got the answer c4.8xlarge, which has 36 vCPUs.
* Batch then told the Auto Scaling group to scale to 36 vCPUs, expecting to get a c4.8xlarge from the Spot Instance request.
* The Spot allocation strategy is currently set to `SPOT_CAPACITY_OPTIMIZED`, which prefers instances that are less likely to be interrupted (rather than preferring the cheapest or best-fitting ones).
* The Spot request looked at the availability of c4.8xlarge, decided it was too likely to be interrupted under the `SPOT_CAPACITY_OPTIMIZED` allocation strategy, and substituted the most-available host matching the 36-vCPU requirement set by Batch, which turned out to be an oversized 64-vCPU r5 instead of the better-fitting-for-the-task 32- or 48-vCPU r5.

But the above implies that Batch itself doesn't follow the same logic as `SPOT_CAPACITY_OPTIMIZED`, and instead requests the specs of the "best fit" host even if that host will not be provided by the Spot request, resulting in potentially significantly oversized hosts.

Alternatively, the 64-vCPU r5 happened to have better availability than the 48- or 32-vCPU r5, but I don't see how that would be possible, since the 64-vCPU r5 is just 2× the 32-vCPU one, and these are virtualised hosts, so you would expect the availability of the 64-vCPU size to be half that of the 32-vCPU one.

Can anyone confirm whether either of my guesses is correct, whether I'm thinking about this the wrong way, or whether we missed a configuration setting? Thanks!
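One workaround we're aware of, sketched below with placeholder names, subnets, security groups, and roles: restricting the compute environment's `instanceTypes` to sizes that exactly match the job shape (32 vCPUs here) prevents Batch from rounding up to a 36- or 64-vCPU host, at the cost of a smaller Spot pool. This is not a confirmation of either guess above, just a way to bound the damage.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # placeholder region

# Placeholder names/ARNs; the instance types listed are all exactly 32 vCPUs.
batch.create_compute_environment(
    computeEnvironmentName="cfd-spot-32vcpu",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 512,
        "instanceTypes": ["m5.8xlarge", "r5.8xlarge", "c5a.8xlarge"],
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "ecsInstanceRole",
    },
    serviceRole="AWSBatchServiceRole",
)
```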
Right Size EC2 Instance
Hi, I have a physical on-premises server with 2 Intel Xeon Gold processors, running SQL Server 2019. Per the specifications, each Xeon processor has 20 cores and 40 threads. I now want to migrate SQL Server to AWS EC2 and am looking for the right-sized EC2 instance. The database is used quite extensively, with around 70% CPU utilisation in every 5-minute interval. Could someone please suggest the right instance type for EC2?
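Rough arithmetic first: 2 sockets × 20 cores × 2 threads = 80 logical CPUs on-premises, so an 80-vCPU class instance (or smaller, if the 70% utilisation includes headroom you can shed) is the starting point. Below is a small sketch for shortlisting current-generation x86_64 instance types by vCPU count and memory; the region and the 80-vCPU threshold are assumptions you would adjust to your own sizing data.

```python
import boto3

# Rough sizing: 2 sockets x 20 cores x 2 threads = 80 logical CPUs on-premises.
TARGET_VCPUS = 80

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region
paginator = ec2.get_paginator("describe_instance_types")

candidates = []
for page in paginator.paginate(
    Filters=[
        {"Name": "current-generation", "Values": ["true"]},
        {"Name": "processor-info.supported-architecture", "Values": ["x86_64"]},
    ]
):
    for it in page["InstanceTypes"]:
        vcpus = it["VCpuInfo"]["DefaultVCpus"]
        mem_gib = it["MemoryInfo"]["SizeInMiB"] / 1024
        if vcpus >= TARGET_VCPUS:
            candidates.append((it["InstanceType"], vcpus, mem_gib))

for name, vcpus, mem in sorted(candidates, key=lambda c: (c[1], c[2])):
    print(f"{name}: {vcpus} vCPUs, {mem:.0f} GiB")
```

From there, the usual narrowing criteria are the memory-to-vCPU ratio and the EBS/network bandwidth needed for the SQL Server data and log volumes.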
If I enable automated backups for an OpenSearch cluster, what happens if it crashes?
Hi Team, automated snapshots are only for cluster recovery; you can use them to restore your domain in the event of red cluster status or data loss. What happens if my OpenSearch cluster crashes or is deleted by mistake, and how do I restore the automated backup of my OpenSearch domain?
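To make the question concrete: on Amazon OpenSearch Service the automated snapshots live in the `cs-automated` repository (`cs-automated-enc` on domains with encryption at rest), and as far as I know they are only usable while the domain still exists; recovering from an accidental domain deletion needs manual snapshots in your own S3 bucket. Below is a minimal sketch of listing and restoring an automated snapshot with signed requests; the endpoint, region, and snapshot name are placeholders, and it assumes the `requests` and `requests-aws4auth` packages are installed.

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"                                                    # placeholder
host = "https://search-my-domain-abc123.us-east-1.es.amazonaws.com"     # placeholder

creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, "es",
                   session_token=creds.token)

# List snapshots in the automated repository (use cs-automated-enc on
# domains with encryption at rest).
r = requests.get(f"{host}/_snapshot/cs-automated/_all", auth=awsauth)
print(r.json())

# Restore one snapshot (indices with the same names must be closed or
# deleted first, or restored under a rename pattern).
r = requests.post(
    f"{host}/_snapshot/cs-automated/2023-01-01t00-00-00.my-snapshot/_restore",  # placeholder snapshot name
    auth=awsauth,
    json={"indices": "-.kibana*,-.opendistro*"},  # skip system indices
)
print(r.status_code, r.text)
```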