Questions tagged with High Performance Compute
Network bandwidth performance
We have deployed two EC2 instances (r5.2xlarge) in the same Availability Zone. According to the instance specifications, this type offers up to 10 Gigabit network performance, and I'm confused about how that bandwidth is calculated. Can someone clarify the points below? 1. Does "up to 10 Gigabit network performance" mean the instance will deliver a constant speed of 10 Gbps? 2. How can we measure the maximum network bandwidth between two EC2 instances (Linux)? Thanks in advance, as this will help us in the long term.
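For the second question, a dedicated tool such as iperf3 run between the two instances is the usual choice. Purely to illustrate what such a test does, here is a minimal Python sketch that pushes bytes over one TCP stream and reports the send-side rate. The names `run_receiver` and `measure_throughput` and the 1 MiB chunk size are assumptions made for this example, and a single Python stream is unlikely to saturate a 10 Gigabit link, so the figure should be read as a lower bound.

```python
import socket
import threading
import time

CHUNK = 1024 * 1024  # 1 MiB send/receive buffer (arbitrary choice)


def run_receiver(server_sock: socket.socket) -> None:
    """Accept one connection and discard everything the sender transmits."""
    conn, _ = server_sock.accept()
    with conn:
        while conn.recv(CHUNK):
            pass


def measure_throughput(host: str, port: int, total_bytes: int) -> float:
    """Send at least total_bytes over one TCP stream; return Gbit/s.

    The clock stops once the data has been handed to the kernel, so this
    is a send-side approximation rather than an end-to-end measurement.
    """
    payload = b"\0" * CHUNK
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            sock.sendall(payload)
            sent += len(payload)
    elapsed = time.perf_counter() - start
    return sent * 8 / elapsed / 1e9


if __name__ == "__main__":
    # Loopback demo. On real instances, run the receiver on one instance
    # and point measure_throughput at its private IP address instead.
    server = socket.create_server(("127.0.0.1", 0))
    port = server.getsockname()[1]
    threading.Thread(target=run_receiver, args=(server,), daemon=True).start()
    print(f"{measure_throughput('127.0.0.1', port, 64 * CHUNK):.2f} Gbit/s")
```

Running several streams in parallel and testing for longer than a few seconds gives a more representative picture, since an "up to" rating may describe a burst rather than a sustained rate.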
Master and Slave Instances Architecture
Hi all, I have a question about how to build a master EC2 instance that can communicate with other EC2 instances as needed and send tasks to them based on a certain code or group of tasks. The master EC2 instance would take data from an RDS database and send the data, together with a task, to each EC2 instance, based on scheduling and on an acknowledgment from each one. I appreciate your help! Thanks, Basem
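The dispatch-and-acknowledge pattern described in the question can be sketched in a single process, with threads standing in for the worker instances and in-memory queues standing in for whatever transport is used between machines (a managed message queue would be one option). The names `worker` and `dispatch`, the task dictionary layout, and the doubling "work" are all hypothetical placeholders.

```python
import queue
import threading


def worker(name, tasks, acks):
    """Pull tasks until a None sentinel arrives; acknowledge each one."""
    while True:
        task = tasks.get()
        if task is None:
            break
        result = task["payload"] * 2  # placeholder for the real work
        acks.put({"task_id": task["id"], "worker": name, "result": result})


def dispatch(rows, n_workers=3):
    """Send one task per row (rows is a list) and collect every acknowledgment."""
    tasks, acks = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}", tasks, acks))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for task_id, payload in enumerate(rows):
        tasks.put({"id": task_id, "payload": payload})
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    # Every task has been acknowledged once all workers have exited.
    return sorted((acks.get() for _ in rows), key=lambda a: a["task_id"])
```

In this sketch the master learns that a task finished only through its acknowledgment, which is the same contract a multi-instance version would need to keep.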
AWS architecture for Low latency trading system
What architecture would one use to design a low-latency trading application, with respect to: Compute: serverless vs. EC2/Fargate vs. EKS (on EC2 or Fargate); and with respect to databases, caching, streams, Global Accelerator, and Local Zones? Are there any case studies I can reference, or does someone have experience developing a low-latency trading system? Thanks!
Data transfer speeds from S3 bucket -> EC2 SLURM cluster are slower than S3 bucket -> Google SLURM cluster
Hello, I am currently benchmarking big data multi-cloud transfer speeds at a range of parallel reads using a cluster of EC2 instances and similar Google machines. I first detected an issue when using a `c5n.2xlarge` EC2 instance for my worker nodes, reading a 7 GB dataset in multiple formats from an S3 bucket. I have verified that the bucket is in the same region as the EC2 nodes, but the data transfer ran far slower to the EC2 instances than it did to GCP. The data is not written to EBS; it is read into memory, and the data chunks are removed from memory when the process is complete. Here is a list of things I have tried to diagnose the problem:

1. Upgrading to a bigger instance type. I am aware that each instance type has a network bandwidth limit, and I saw a read speed increase when I changed to a `c5n.9xlarge` (according to your documentation, it should have 50 Gbps of bandwidth), but it was still slower than reading from S3 to a Google VM, which is farther away on the network. I also upgraded the instance type again, but there was little to no speed increase. Note that hyperthreading is turned off on each EC2 instance.
2. Changing the S3 parameter `max_concurrent_requests` to `100`. I am using Python to benchmark these speeds, so this parameter was passed in a `storage_options` dictionary that is used by different remote data access APIs (see the [Dask documentation](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html#:~:text=%22config_kwargs%22%3A%20%7B%22s3%22%3A%20%7B%22addressing_style%22%3A%20%22virtual%22%7D%7D%2C) for more info). Editing this parameter had absolutely no effect on the transfer speeds.
3. Verifying that enhanced networking is active on all worker nodes and the controller node.
4. Performing the data transfer directly from a worker node's command line on both AWS and GCP machines. This was done to rule out my testing code being at fault, and the results were the same: S3 -> EC2 was slower than S3 -> GCP.
5. Varying how many cores of each EC2 instance were used in each SLURM job. On the Google machines, each worker node has 4 cores and 16 GB of memory, so each job that I submit there takes up an entire node. However, once I upgraded my EC2 worker node instances, each node had more than 4 cores. To maintain a fair comparison, I configured each SLURM job to access only 8 cores per node in my EC2 cluster (I am performing 40 parallel reads at maximum, so if my understanding is correct, each node will have 8 separate data stream connections, with 5 nodes active at a time with `c5n.9xlarge` instances). I also tried allocating all of a node's resources to my 40 parallel reads to see whether that would speed things up (2 instances with all 18 cores active, and a third worker instance with only 4 cores active), but there was no effect.

I'm fairly confident there is a solution to this, but I am having an extremely difficult time figuring out what it is. I know that setting an endpoint shouldn't be the problem, because GCP is faster than EC2 and there is egress occurring there. Any help would be appreciated, because I want to make sure I get an accurate picture of S3 -> EC2 before presenting my work. Please let me know if more information is needed!
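When comparing runs like these, it can help to set the measured time against the floor that the link rating alone would allow, and to time the parallel reads with the same harness on both clouds. The sketch below is one hedged way to do that: `min_transfer_seconds` and `timed_parallel_reads` are hypothetical helper names, `read_chunk` is a placeholder for whatever function fetches one chunk (for example a ranged read), and the 7 GB and 50 Gbps figures are the ones quoted in the post.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def min_transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Lower bound on transfer time if the link rating were the only limit."""
    return size_gb * 8 / link_gbps


def timed_parallel_reads(read_chunk, n_chunks: int, max_workers: int):
    """Run read_chunk(i) for every chunk index in parallel.

    Returns (total_bytes_read, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        total = sum(pool.map(lambda i: len(read_chunk(i)), range(n_chunks)))
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # 7 GB over a 50 Gbps link cannot finish faster than this many seconds.
    print(min_transfer_seconds(7, 50))

    # Stand-in reader that returns 1 KiB per chunk; replace with a real fetch.
    nbytes, seconds = timed_parallel_reads(lambda i: b"x" * 1024, 40, 8)
    print(nbytes, seconds)
```

A large gap between the floor and the measured time suggests the limit lies somewhere other than the instance's rated bandwidth, such as per-connection throughput or the client library.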
Performance drop on AWS T3 instances - Linux
The performance drop we see with just a couple of hundred connections may be typical of AWS T3 instances. Our server-side Apache modules maintain "sleeping" connections to clients, which is unusual for regular web applications. Whenever a new request comes in for a host PC, the client's server-side Apache instance wakes up, puts the request's data into a queue for the host PC, and signals the host PC's server-side Apache instance, which wakes up and sends the data to the host PC. Hence, every data exchange through the server involves scheduling quite a few server processes. It appears that at some point our T3 instance starts to lag in scheduling, and this is not reflected in CPU usage.
Glue: No autoscaling while option enabled
We are running Glue 3.0 jobs with `--enable-auto-scaling` set, but according to the metrics, the number of active executors is not reduced when it should be, given the maximum number of needed executors. It works sometimes, but clearly not as we would expect. We are seeing this on many jobs, and it has a major cost impact. Has anyone experienced this kind of issue, or could anyone help us understand what might cause it? Here is an example to illustrate our issue: ![Enter image description here](https://repost.aws/media/postImages/original/IMbKNqT5VnT4y3yiVx3Z8SIw) Thanks, Alain
AWS Glue pyspark, AWS OpenSearch and 429 Too Many Requests
Hi forum, I'm on AWS and trying to write ~1.2 million documents from an AWS Glue 2.0 Python/pyspark job to an OpenSearch 1.2 "t3.small.search"/SSD cluster. The issue is that after a while I get "429 Too Many Requests": org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [PUT] on [MY/doc/_bulk] failed; server[https://SOME_ENDPOINT_HERE] returned [429|Too Many Requests:] From what I have read so far, this is mostly a matter of configuration: throttling indexing requests on the client side to give the server more time to process queued requests. That is what I tried, but somehow the configuration of the Hadoop connector does not work for me. I already tried sending smaller batches of documents to Elasticsearch and increasing the retry wait time by setting 'es.batch.size.entries' to 100 and 'es.batch.write.retry.wait' to 30s:
```
df \
    .write \
    .mode('overwrite') \
    .format('org.elasticsearch.spark.sql') \
    .option('es.nodes', 'SOME_ENDPOINT_HERE') \
    .option('es.port', 443) \
    .option('es.net.ssl', 'true') \
    .option('es.net.http.auth.user', 'SOME_USER_NAME_HERE') \
    .option('es.net.http.auth.pass', 'SOME_PASS_HERE') \
    .option('es.nodes.wan.only', 'true') \
    .option('es.nodes.discovery', 'false') \
    .option('es.resource', 'SOME_NAME_HERE') \
    .option('es.index.auto.create', 'true') \
    .option('es.mapping.id', 'SOME_FIELD_HERE') \
    .option('es.write.operation', 'index') \
    .option('es.batch.size.entries', '100') \
    .option('es.batch.write.retry.policy', 'simple') \
    .option('es.batch.write.retry.count', '-1') \
    .option('es.batch.write.retry.limit', '-1') \
    .option('es.batch.write.retry.wait', '30s') \
    .save()
```
I also set the 'org.elasticsearch.hadoop.rest' logger to DEBUG level:
```
Bulk Flush #: Sending batch of  bytes/ entries
Bulk Flush #: Response received
Bulk Flush #: Completed.  Original Entries.  Attempts. [1000/1000] Docs Sent. [0/1000] Docs Skipped. [0/1000] Docs Aborted.
```
From what I understand, the Hadoop connector is sending batches of 1000 documents, not the 100 from my configuration. Furthermore, I cannot see any wait time. My actual setup on AWS is: Spark: 2.4.3 Python: 3.7 OpenSearch: 1.2 Elasticsearch Hadoop: 7.13.4 (elasticsearch-spark-20_2.11-7.13.4.jar) Any hints or ideas on my setup? Many thanks, Matthias
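The client-side throttling idea mentioned in the question (wait, then retry when the server answers 429) can be illustrated independently of the connector. The sketch below is not the Elasticsearch Hadoop connector's implementation; `TooManyRequests`, `send_with_backoff`, and the `send_batch` callable are hypothetical names standing in for whatever actually posts a bulk request.

```python
import random
import time


class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response from the bulk endpoint."""


def send_with_backoff(send_batch, batch, max_attempts=5, base_wait=1.0,
                      sleep=time.sleep):
    """Call send_batch(batch); on TooManyRequests wait and retry.

    The wait doubles on each attempt and includes random jitter so that
    parallel writers do not retry in lockstep. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return send_batch(batch)
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise
            sleep(base_wait * 2 ** attempt + random.uniform(0, base_wait))
```

With many Spark tasks writing at once, the total request rate also depends on the number of concurrent writers, so reducing the number of partitions that write in parallel is another lever worth testing alongside batch size.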
Cloud rendering with AWS + Nvidia for Octane 2021
Hi, I currently work in a studio that specializes in 3D and VFX. Our goal is to render scenes from our pipeline on an AWS virtual machine with the best GPU configuration. We use the following software in our pipeline: Autodesk Maya 2021 (using Octane 2021) and After Effects 2020. 1. Is it possible to use your services in combination with AWS to render scenes from our pipeline? 2. Could you please explain how? Can you give us a tutorial or a guide on how to do that? Waiting for your reply. Thank you.
EC2 Instance stops working after some time
Hello all, I have deployed an AI model within a Django application. It has just one REST API endpoint, which activates the AI model. It works perfectly fine for some time, but the whole instance stops responding after a while. I am using a c5.4xlarge instance, and the CPU utilization in CloudWatch is 11-12% at most. I have tried running both with Docker and without Docker, and the behavior is the same. Please help me with this.
Is it possible to connect the resources of two or more EC2 instances?
Hello, I want to ask whether it's possible to connect or unify the resources of two or more EC2 instances. The background is that we currently run heavy machine learning workloads on premises and are running into hardware limitations. We thought about using EC2, but the specifications of the available instances are not enough, so we would like to combine two or more instances (such as p3.16xlarge). Is something like that possible?
EBS IOPS, Throughput and EBS bandwidth
We need to deploy an EC2 instance with the following EBS configuration, which is a database server requirement: random 4k write: 200,000 IOPS; sequential read: 2000 Mb/s; sequential write: 2000 Mb/s. 1. I am unsure about billing: if I select the **io2** EBS type with 200,000 provisioned IOPS for, say, 24 hours and then change the volume type to gp3, will AWS charge me for the full month or only for the 24 hours? If only the 24 hours are charged, what would the charge be? 2. Which EC2 instance type should I go for? Any suggestions? Or am I making a mistake in selecting the EBS type? Please guide me.
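Two pieces of arithmetic behind this question can be worked through in a short sketch. The first converts the random-write requirement into throughput, since 200,000 IOPS at 4 KiB each also implies a bandwidth figure the instance must sustain. The second shows what a pro-rated charge would look like; it assumes that billing is proportional to the time a configuration is provisioned and uses a 720-hour month, both of which should be checked against the current EBS pricing page. The function names and the `monthly_rate` parameter are hypothetical, and no real price is supplied here.

```python
def iops_to_mib_per_s(iops: int, block_size_kib: int) -> float:
    """Throughput implied by a given IOPS figure at a fixed block size."""
    return iops * block_size_kib / 1024


def prorated_cost(monthly_rate: float, hours: float,
                  hours_in_month: float = 720) -> float:
    """Share of a monthly charge for `hours` of use, ASSUMING pro-rated billing."""
    return monthly_rate * hours / hours_in_month


if __name__ == "__main__":
    # 200,000 IOPS of 4 KiB writes works out to 781.25 MiB/s.
    print(iops_to_mib_per_s(200_000, 4))
    # Fraction of a 720-hour month covered by 24 hours (rate of 1.0 = 100%).
    print(prorated_cost(1.0, 24))
```

The throughput figure matters for the second question as well, because the chosen instance type needs an EBS bandwidth limit at or above what the volume is expected to deliver.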
Get current instance features from within said instance
I've been working on some code that would benefit from some level of awareness of the platform on which it's running. When it runs on bare metal, several options are available (lshw, hwloc, and so on). On EC2 instances this task is not so straightforward, as they run on virtualization (excluding bare metal instances, evidently). Running 'lshw', for instance, lists hardware that does not necessarily correspond to the available resources. As an example, running lshw on a t2.micro instance, which has 1 default core available, gives the actual model of the CPU on which it is running, an Intel Xeon with 12 cores. I understand that I can fetch [instance metadata](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html), find which instance type the code is running on, and use the AWS CLI and/or EC2 API to get [the description of the instance](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/ec2-api.pdf). The issue with that workaround is that it presupposes that the current instance has either the AWS CLI configured with proper credentials or the user credentials available as environment variables, which may or may not be true. I've been looking for a more general solution that could work at least on the most popular Linux distros, such as querying the system about the resources that are actually available (CPU cores, threads, memory, cache, and accelerators), but have so far failed to find one. Is this possible? Or is such a query not possible in these circumstances?
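For the CPU and memory part of the question, the operating system's own view is usually closer to what the process can actually use than a hardware inventory is. The following is a minimal sketch, assuming a Linux guest with `/proc` mounted; `available_cpus` and `meminfo` are hypothetical helper names, and cache and accelerator discovery are deliberately left out.

```python
import os


def available_cpus() -> int:
    """CPUs this process may run on (respects affinity masks where supported)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def meminfo(path: str = "/proc/meminfo") -> dict:
    """Parse /proc/meminfo into {field: integer value} (Linux only).

    Most fields, including MemTotal and MemAvailable, are reported in kB;
    a few, such as the HugePages counters, are plain counts.
    """
    info = {}
    with open(path) as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            fields = rest.split()
            if fields:
                info[key] = int(fields[0])
    return info


if __name__ == "__main__":
    print("usable CPUs:", available_cpus())
    print("MemTotal (kB):", meminfo()["MemTotal"])
```

Neither call needs AWS credentials, which sidesteps the concern raised about the CLI-based workaround, though it reports only what the guest kernel exposes.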