Unanswered Questions tagged with High Performance Compute
Hyperthreading and core cost
Hi all, I have come across a technical point that could benefit from expert discussion. Computational Fluid Dynamics software (STAR-CCM+ by Siemens) does not use hyperthreading; it actually runs about 40% slower with hyperthreading on, because the CFD algorithms are very efficient and keep the CPUs close to 100% busy. When I have a limit of, say, 500 vCPUs, I have to turn off hyperthreading and run on only 250 vCPUs. I am of course paying for 500, so effectively I am being charged double for efficient code. I had more or less accepted this, except it looks like the Siemens cloud GUI solution (new, and using AWS in the background, so a direct comparison of capability) charges only for the actual cores used. Early suggestions (very early) are that costs per core may be about half the cost of AWS cores (the best compute-optimized ones, C6i). This is likely because of hyperthreading. I have asked whether it is possible to look at the pricing of AWS compute cores with regard to hyperthreading. Maybe there could be a CFD option on core cost? Looking forward to comments and thoughts! Kind regards, Stephen
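For what it's worth, the usual way to avoid leaving half the vCPUs idle at the OS level is to disable hyperthreading at launch via EC2 CPU options, though as far as I know this does not change the price: billing is per instance, not per active thread, which is exactly the cost issue raised above. A rough sketch (instance type, AMI, and core count are placeholders):

```shell
# Sketch: launch a c6i instance with hyperthreading disabled.
# ThreadsPerCore=1 exposes one hardware thread per physical core;
# a c6i.32xlarge has 64 physical cores (128 vCPUs with HT on).
# The AMI ID below is a placeholder.
aws ec2 run-instances \
    --instance-type c6i.32xlarge \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --cpu-options "CoreCount=64,ThreadsPerCore=1"

# Verify from inside the instance:
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core)'
```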
Amazon SSM Agent causing high CPU load near 100%
I just started a t3a.nano instance. It is normal at startup, but it is not stable later: the SSM Agent service ran as a cron job and almost crashed my server. Right now I cannot connect over SSH. I have tried stopping and starting the instance, but nothing has helped so far. Has anyone else run into this issue? P.S.: I have read the old topics and waited 1-2 hours for the SSM update, but there was no change. For AWS Support: you can check my instance i-0d8bcd6234b2d9ac6
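Once access is regained (for example via the EC2 Serial Console, or by stopping the instance and fixing it through user data), a quick way to confirm the agent is the culprit is to stop it and watch the load. A sketch, assuming Amazon Linux with systemd:

```shell
# Sketch: check whether the SSM Agent is the source of the CPU load
# (Amazon Linux 2 with systemd assumed).
top -b -n 1 | head -n 15              # is amazon-ssm-agent at the top?
sudo systemctl status amazon-ssm-agent
sudo systemctl stop amazon-ssm-agent  # does the load drop?
# If so, keep it disabled while investigating:
sudo systemctl disable amazon-ssm-agent
```

Note that a t3a.nano is also burstable, so once CPU credits are exhausted the instance will feel unresponsive even under moderate load.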
RE: AWS Inspector Shows Critical Updates Pending But Instance Says Otherwise
Hi Team, Instance ID: i-0e5934adddc2d8372. I've updated all the packages (see Libcurl-2.png), but Inspector still shows critical updates pending on my instance (see Libcurl-1.png). Requesting help in investigating this.

![Enter image description here](/media/postImages/original/IMgUGzjYUXQAOinPmO1hAqZg) ![Enter image description here](/media/postImages/original/IMjXCbaW5ZTuaGmhzz6Nw26g) ![Enter image description here](/media/postImages/original/IMQf0BnF4-RvGK_YroaAC43Q)

This is what Inspector shows for the instance:

| Affected package | Installed version | Fixed version | Package manager |
|---|---|---|---|
| libcurl | 0:7.79.1-4.amzn2.0.1.x86_64 | 0:7.79.1-6.amzn2.0.1 | OS |
| curl | 0:7.79.1-4.amzn2.0.1.x86_64 | 0:7.79.1-6.amzn2.0.1 | OS |

This is what the instance shows when trying to remediate (i.e., update the package; it says it is already updated):

```
sh-4.2$ sudo yum update libcurl
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
amzn2-core                                      | 3.7 kB  00:00:00
No packages marked for update
sh-4.2$
```
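Findings like this are often either a stale yum mirror cache (the mirror hasn't offered `-6` yet) or a stale Inspector scan that hasn't re-run since the update. A small sketch that refreshes metadata and compares the installed version against the fixed version from the finding (the version strings below are copied from the finding; the `rpm`/`yum` commands are shown in comments since they must run on the instance):

```shell
# On the instance, refresh yum metadata first so the mirror
# actually offers the -6 build:
#   sudo yum clean all && sudo yum makecache && sudo yum update curl libcurl

# Compare the installed version against the fixed version reported
# by Inspector. sort -V orders version strings numerically.
installed="7.79.1-4.amzn2.0.1"   # e.g. rpm -q --qf '%{VERSION}-%{RELEASE}\n' libcurl
fixed="7.79.1-6.amzn2.0.1"       # from the Inspector finding
highest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | tail -n1)
if [ "$highest" = "$installed" ] && [ "$installed" != "$fixed" ]; then
    echo "installed version is newer than the fix - finding is likely stale"
elif [ "$installed" = "$fixed" ]; then
    echo "already at the fixed version - a rescan should clear the finding"
else
    echo "update still needed"
fi
```

With the values from the finding this reports that the update is still needed, which suggests the instance's yum mirror simply hasn't picked up the `-6` build yet.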
AWS Batch requesting more vCPUs than tasks require
Hi, we have an AWS Batch compute environment set up to use EC2 Spot Instances, with no limits on instance type and with the `SPOT_CAPACITY_OPTIMIZED` allocation strategy. We submitted a task requiring 32 vCPUs and 58000 MB of memory (2 GB below the minimum amount of memory for the smallest 32-vCPU instance size, c3.8xlarge, just to leave a bit of headroom), which is reflected on the job status page. We expected to receive an instance with 32 vCPUs and >64 GB of memory, but received an `r4.16xlarge` with 64 vCPUs and 488 GB of memory. An `r4.16xlarge` is rather oversized for the single task in the queue, and our task can't take advantage of the extra cores, as we pin processes to the specified number of cores so that multiple tasks scheduled on the same host don't contend over CPU. We had no other tasks in the queue, no currently-running compute instances, and no desired/minimum capacity set on the compute environment before this task was submitted. The autoscaling history shows:

`a user request update of AutoScalingGroup constraints to min: 0, max: 36, desired: 36 changing the desired capacity from 0 to provide the desired capacity of 36`

Where did this 36 come from? Surely this should be 32 to match our task? I'm aware that the docs say:

`However, AWS Batch might need to exceed maxvCpus to meet your capacity requirements. In this event, AWS Batch never exceeds maxvCpus by more than a single instance.`

But we're concerned that once we start scaling up, each task will be erroneously requested with 4 extra vCPUs. My guess at what happened in this case, given the `SPOT_CAPACITY_OPTIMIZED` allocation strategy:

* Batch probably queried for the best available host to meet our 32-vCPU requirement and got the answer c4.8xlarge, which has 36 cores.
* Batch then told the Auto Scaling group to scale to 36 cores, expecting to get a c4.8xlarge from the Spot Instance request.
* The Spot allocation strategy is currently set to `SPOT_CAPACITY_OPTIMIZED`, which prefers instances that are less likely to be reclaimed (rather than preferring the cheapest or best-fitting).
* The Spot Instance request looked at the availability of c4.8xlarge, decided under `SPOT_CAPACITY_OPTIMIZED` that they were too likely to be reclaimed, and substituted the most-available host matching the 36-core requirement set by Batch, which turned out to be an oversized 64-vCPU r5 instead of the better-fitting (for the task) 32- or 48-vCPU r5.

But the above implies that Batch itself doesn't follow the same logic as `SPOT_CAPACITY_OPTIMIZED`, and instead requests the specs of the "best fit" host even if that host will not be provided by the Spot request, resulting in potentially significantly oversized hosts. Alternatively, the 64-vCPU r5 happened to have better availability than the 48- or 32-vCPU r5, but I don't see how that would be possible, since the 64-vCPU r5 is just 2× the 32-vCPU one, and these are virtualized hosts, so you would expect the availability of the 64-vCPU size to be half that of the 32-vCPU one. Can anyone confirm whether either of my guesses is correct, whether I'm thinking about this the wrong way, or whether we missed a configuration setting? Thanks!
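One workaround while waiting for an answer: pin the compute environment to instance types that match the job shape, so neither Batch nor the Spot request can substitute a much larger host. A sketch only; the environment name, role ARN, subnets, and security groups are placeholders (r5.8xlarge is 32 vCPUs / 256 GiB):

```shell
# Sketch: restrict a Batch compute environment to a 32-vCPU size so
# capacity-optimized Spot cannot pick an oversized host.
aws batch create-compute-environment \
    --compute-environment-name spot-32vcpu \
    --type MANAGED \
    --compute-resources '{
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 512,
        "instanceTypes": ["r5.8xlarge"],
        "subnets": ["subnet-xxxxxxxx"],
        "securityGroupIds": ["sg-xxxxxxxx"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole"
    }'
```

The trade-off is less Spot capacity diversity, so interruptions may become more likely.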
Amazon GameLift now supports AWS Local Zones
Hello GameLift Devs, Today, the GameLift team is excited to announce the general availability of AWS Local Zones. With this update, you can seamlessly provide gameplay experiences across 8 new AWS Local Zones in Chicago, Houston, Dallas, Kansas City, Denver, Atlanta, Los Angeles, and Phoenix. Along with the updated support for Local Zones, we are adding new instance types specifically supported in the various Local Zone Regions, including C5d and R5d instance types. Additionally we are adding support for the next generation [C6a](https://aws.amazon.com/ec2/instance-types/c6a/) and [C6i](https://aws.amazon.com/ec2/instance-types/c6i/) instance types. Amazon EC2 C6i instances are powered by 3rd Generation Intel Xeon Scalable processors and deliver up to 15% better price performance compared to C5 instances for a wide variety of workloads and are ideal for highly scalable multiplayer games. You can find updated pricing on the [GameLift pricing page](https://aws.amazon.com/gamelift/pricing/) as well as in the [AWS Pricing Calculator](https://calculator.aws/#/addService/GameLift). For more information, please refer to our [Release Notes](https://docs.aws.amazon.com/gamelift/latest/developerguide/release-notes.html#release-notes-summary) and [What’s New post](https://aws.amazon.com/about-aws/whats-new/2022/08/amazon-gamelift-supports-aws-local-zones/). Mark Choi, GameLift PM
Cloud rendering with AWS + Nvidia for Octane 2021
Hi, In my current job, I work in a studio that specializes in 3D and VFX. Our goal is to render scenes from our pipeline on an AWS virtual machine with the best GPU configuration. We use the following software in our pipeline: Maya Autodesk 2021 (using Octane 2021) , After Effects 2020. 1. Is it possible to use your services in combination with AWS for rendering scenes from our pipeline? 2. Could you please explain how? Can you give us a tutorial or a guide on how to do that? Waiting for your replay, Thank you.
Does using SPOT_CAPACITY_OPTIMIZED launch spot instances into an Auto Scaling group in AWS Batch?
I am trying to run multiple jobs in a compute environment using AWS Batch. From my understanding, when there are multiple jobs in a job queue and the allocation strategy is BEST_FIT, AWS Batch will wait for the job running in the environment to complete and only then launch the next job; hence, it does not auto-scale when there are more jobs. But if I use an allocation strategy of BEST_FIT_PROGRESSIVE or SPOT_CAPACITY_OPTIMIZED (with Spot compute resources), AWS Batch will auto-scale the instances if more jobs become available in the job queue, so the AWS Batch service can add more instances to the ECS cluster in which the compute environment is running. Am I right here? I am referencing the documentation [here](https://docs.aws.amazon.com/batch/latest/userguide/allocation-strategies.html)
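In case it helps anyone checking the same thing: the allocation strategy on an existing environment, and the scaling activity of the Auto Scaling group Batch manages for it, can both be inspected from the CLI. A sketch; the environment and group names are placeholders:

```shell
# Sketch: confirm the allocation strategy on a compute environment.
aws batch describe-compute-environments \
    --compute-environments my-spot-ce \
    --query 'computeEnvironments[0].computeResources.allocationStrategy'

# Watch the scaling activity of the Auto Scaling group that Batch
# manages for that environment as jobs arrive in the queue.
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name my-spot-ce-asg \
    --max-items 5
```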
Nvidia Driver R515 does not work on P3 instance types
I have loaded the Nvidia drivers on a CentOS 7 AMI from Nvidia's RPM repositories. In the past this loaded the R510 drivers, and they worked on P3 instances with V100 GPUs, but earlier this month the repository was updated to R515, and those drivers do not recognize the V100 GPUs in the P3 instances. I have some on-premises nodes with a different model of V100, and the R515 drivers work there, so the problem appears to be specific to AWS instances. I have tried opening a bug report with Nvidia but have not gotten a response yet. Can anybody else reproduce this issue, or have I messed something up? If this is indeed an issue with the new driver, any tips on getting Nvidia to fix it? Or is it a problem specific to AWS instances that AWS needs to fix?
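While waiting on Nvidia, one option is to pin back to the known-good R510 branch so routine updates don't pull R515 in again. A sketch only; the exact package names vary between Nvidia's repositories and are assumptions here:

```shell
# Sketch: confirm the symptom, then pin the driver branch.
nvidia-smi   # should list the V100; fails if the driver
             # does not recognize the GPU

# Remove the R515 packages and install the R510 branch instead
# (package/branch names are repo-dependent and may differ):
sudo yum remove -y 'nvidia-driver*'
sudo yum install -y nvidia-driver-branch-510

# Lock the version so yum does not upgrade back to R515:
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock 'nvidia-driver*'
```

Capturing the `nvidia-smi` and `dmesg | grep -i nvidia` output from both the working R510 and failing R515 installs would also strengthen the Nvidia bug report.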
Deploying a Machine Learning Project with Django and Laravel as a Backend
Hello, I need some help regarding my question title. I am working on a project that needs to take a video stored on a website I created and run it through my machine learning code, which currently runs in PyCharm. It all seems to work locally and on a schedule, but I want to run it in the cloud. What type of instance do I need? The model I am using takes a lot of time on CPU, and a GPU-based instance will certainly help. If you need more details about this, let me know. Thanks.
Why do I see throughput throttled to 5GB for c5.9xlarge and c5n.9xlarge instances with 100% traffic load?
I have VMs configured as c5.9xlarge and c5n.9xlarge EC2 instances and am sending 100% traffic load while running RFC 2544 tests, but I see a maximum throughput of only 5GB, not the maximum bandwidth these instances can reach. How can I get the maximum throughput performance? It looks like it is being throttled.
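If that figure is actually 5 Gbit/s, it matches a documented EC2 behavior: as far as I know, single-flow traffic is limited to about 5 Gbps when instances are not in the same cluster placement group, so a single-stream test will plateau there regardless of instance bandwidth. A quick way to check is to compare one flow against many parallel flows; a sketch with iperf3 (the receiver IP is a placeholder):

```shell
# On the receiver instance:
iperf3 -s

# On the sender: one flow will show the per-flow cap, while several
# parallel flows (-P) should approach the instance's aggregate
# bandwidth if the single-flow limit is what's being hit.
iperf3 -c 10.0.0.10 -P 1 -t 30
iperf3 -c 10.0.0.10 -P 8 -t 30
```

If the multi-flow run scales up, configure the RFC 2544 test with multiple concurrent flows, or place both instances in a cluster placement group to raise the per-flow limit.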