"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

Hello all,

I'm using SageMaker Studio to run a training script, and I'm getting the following error:

"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

I also notice that the following INFO messages are logged during the imports:

2023-09-07 08:24:01,052 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-07 08:24:01,084 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-07 08:24:01,118 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)

Is this typical? How can I go about solving this issue?

I checked disk usage with !df -h, which produces:

Filesystem         Size  Used Avail Use% Mounted on
overlay             32G   48K   32G   1% /
tmpfs               64M     0   64M   0% /dev
tmpfs              1.9G     0  1.9G   0% /sys/fs/cgroup
shm                395M     0  395M   0% /dev/shm
127.0.0.1:/200010  8.0E  1.3G  8.0E   1% /root
/dev/nvme0n1p1     160G   11G  150G   7% /opt/.sagemakerinternal
devtmpfs           1.9G     0  1.9G   0% /dev/tty
tmpfs              1.9G     0  1.9G   0% /proc/acpi
tmpfs              1.9G     0  1.9G   0% /sys/firmware
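
For reference, the same disk check can be run from Python instead of the shell. This is only a minimal sketch using the standard library; the helper name `disk_report` is made up for illustration, and like `df -h` it reports disk space, not GPU memory.

```python
import shutil


def disk_report(path="/"):
    """Return (total, used, free) in GiB for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return tuple(round(v / gib, 2) for v in (usage.total, usage.used, usage.free))


if __name__ == "__main__":
    # "/" is just an example mount point; any path from the df output works.
    total, used, free = disk_report("/")
    print(f"/: total={total} GiB, used={used} GiB, free={free} GiB")
```

Passing a different mount point (for example `/root`, which appears in the table above) shows the usage for that filesystem.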

However, I was previously working in a SageMaker notebook, where I accidentally filled /dev/xlda1 by saving datasets to the model_dir folder. I then tried to free space with !rm -rf /dev/xlda1, which caused more issues, so I switched to Studio. Could that full disk also be the cause of this error? If so, how can I free that disk space?
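
To find what is actually taking up space before deleting anything, something like the following might help. It is a rough sketch, not a SageMaker-specific tool: `largest_entries` is a hypothetical helper, and the directory passed to it is an assumption to adjust. It only lists sizes and deletes nothing.

```python
import os


def largest_entries(root, top=5):
    """Return up to `top` (size_bytes, path) pairs for the direct children of `root`, largest first."""
    results = []
    for entry in os.scandir(root):
        if entry.is_symlink():
            continue  # do not follow links out of the directory
        if entry.is_file():
            size = entry.stat().st_size
        elif entry.is_dir():
            size = 0
            # Sum the sizes of all regular files below this subdirectory.
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    file_path = os.path.join(dirpath, name)
                    if os.path.islink(file_path):
                        continue
                    try:
                        size += os.path.getsize(file_path)
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
        else:
            continue
        results.append((size, entry.path))
    return sorted(results, reverse=True)[:top]


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory that holds the saved datasets.
    for size, path in largest_entries("."):
        print(f"{size / 1024 ** 2:10.1f} MiB  {path}")
```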

Many thanks to anyone who takes time to help. I'm figuring this out as I go.

Samuel

Asked by Samuel 8 months ago (35 views)
No Answers
