Questions tagged with High Performance Compute

Job on parallel cluster using hpc6a.48xlarge not running

Hello, I want to run a simulation on a ParallelCluster cluster (alinux2) using 2 hpc6a.48xlarge instances (192 CPUs). I created the cluster and submitted the job using Slurm. The problem is that the job stays pending in the queue and never runs (I left it for more than 1 day). I ran the same job on another instance type with the same number of CPUs and it worked perfectly, so the issue seems specific to this instance type (hpc6a.48xlarge). I also tried using only 1 hpc6a.48xlarge instance (96 CPUs), but that did not work either. I have copied the squeue output at the end of this message. It shows 'BeginTime' as the reason several times, although I have not scheduled my job to start later. What could be the reason for this issue? I am creating the cluster on a new company account. Could the issue be related to the account's usage history? I ask because I have already configured the same cluster on a personal account (with significantly more usage than the company account), and there the job starts almost immediately.
I would appreciate any advice on resolving this issue.

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
4 answers, 0 votes, 115 views, asked 8 months ago
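
The question above asks whether a new account could explain why the hpc6a.48xlarge job never starts. One possible cause, which the post does not confirm, is a low per-account vCPU quota for HPC instance families. The following is a minimal sketch for listing such quotas with boto3's Service Quotas client. It assumes that the relevant quota's name contains the string "HPC" (for example, "Running On-Demand HPC instances"); the helper names `hpc_quotas` and `fetch_ec2_quotas` are hypothetical and introduced only for illustration.

```python
def hpc_quotas(quotas):
    """Return {quota name: value} for quotas whose name mentions HPC.

    `quotas` is a list of dicts shaped like the entries in the "Quotas"
    list returned by the Service Quotas `list_service_quotas` call.
    Matching on the substring "hpc" is an assumption about the quota name.
    """
    return {
        q["QuotaName"]: q["Value"]
        for q in quotas
        if "hpc" in q["QuotaName"].lower()
    }


def fetch_ec2_quotas(region_name):
    """Fetch all EC2 service quotas for one region (needs AWS credentials)."""
    import boto3  # imported lazily so hpc_quotas works without boto3 installed

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="ec2"):
        quotas.extend(page["Quotas"])
    return quotas
```

With credentials for the company account, `hpc_quotas(fetch_ec2_quotas(region))` could be compared against the 192 CPUs the job requests, where `region` is the cluster's region. If the quota is not the limiting factor, the ParallelCluster logs on the head node (for example under /var/log/parallelcluster/) may show why the compute nodes failed to launch.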

Trying SageMaker example but getting error: AttributeError: module 'sagemaker' has no attribute 'create_transform_job'

Hi, I keep getting this error: AttributeError: module 'sagemaker' has no attribute 'create_transform_job', when using a batch transform example that AWS graciously provides in the notebook instances. Also, I updated SageMaker to the newest package and it's still not working. Code:

```
%%time

import time
from time import gmtime, strftime

batch_job_name = "Batch-Transform-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file
)  # use input data without ID column
output_location = "s3://{}/{}/output/{}".format(bucket, prefix, batch_job_name)

request = {
    "TransformJobName": batch_job_name,
    "ModelName": 'xgboost-parquet-example-training-2022-03-28-16-02-31-model',
    "TransformOutput": {
        "S3OutputPath": output_location,
        "Accept": "text/csv",
        "AssembleWith": "Line",
    },
    "TransformInput": {
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_location}},
        "ContentType": "text/csv",
        "SplitType": "Line",
        "CompressionType": "None",
    },
    "TransformResources": {"InstanceType": "ml.m4.xlarge", "InstanceCount": 1},
}

sagemaker.create_transform_job(**request)
print("Created Transform job with name: ", batch_job_name)

# Wait until the job finishes
try:
    sagemaker.get_waiter("transform_job_completed_or_stopped").wait(TransformJobName=batch_job_name)
finally:
    response = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = response["TransformJobStatus"]
    print("Transform job ended with status: " + status)
    if status == "Failed":
        message = response["FailureReason"]
        print("Transform failed with the following error: {}".format(message))
        raise Exception("Transform job failed")
```

Everything else is working well. I've had no luck with this on any other forum.
1 answer, 0 votes, 273 views, asked 8 months ago
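
In the batch transform code above, `create_transform_job`, `get_waiter`, and `describe_transform_job` are methods of the low-level boto3 SageMaker client, not of the `sagemaker` Python SDK module, which would explain the AttributeError. The likely cause, though the post does not show the earlier notebook cells, is that the name `sagemaker` was bound to the imported module rather than to a client object. A minimal sketch of the same flow using an explicit client follows; the function name `run_batch_transform` and the `sm_client` parameter are hypothetical, introduced only for illustration.

```python
def run_batch_transform(request, sm_client=None):
    """Start a batch transform job, wait for it, and return its final status.

    `request` is the same dict that the question passes to
    create_transform_job. `sm_client` is a boto3 SageMaker client; one is
    created if none is given (this needs AWS credentials and a region).
    """
    if sm_client is None:
        import boto3  # low-level AWS SDK, distinct from the `sagemaker` package

        sm_client = boto3.client("sagemaker")

    job_name = request["TransformJobName"]
    sm_client.create_transform_job(**request)
    print("Created Transform job with name: ", job_name)

    # Wait until the job finishes, then report how it ended.
    try:
        sm_client.get_waiter("transform_job_completed_or_stopped").wait(
            TransformJobName=job_name
        )
    finally:
        response = sm_client.describe_transform_job(TransformJobName=job_name)
        status = response["TransformJobStatus"]
        print("Transform job ended with status: " + status)
        if status == "Failed":
            message = response["FailureReason"]
            raise RuntimeError("Transform job failed: {}".format(message))
    return status
```

Calling `run_batch_transform(request)` with the `request` dict from the question keeps the rest of the notebook unchanged. Passing the client as a parameter also avoids reusing the name `sagemaker` for two different things.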