"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

Hello all,

I'm using SageMaker Studio to run a training script, and I'm getting the following error:

"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

I also notice that the imports log the following messages:

2023-09-07 08:24:01,052 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-07 08:24:01,084 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-07 08:24:01,118 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)

Is this typical? How can I go about solving this issue?

I checked disk usage with !df -h, which produces:

Filesystem         Size  Used Avail Use% Mounted on
overlay             32G   48K   32G   1% /
tmpfs               64M     0   64M   0% /dev
tmpfs              1.9G     0  1.9G   0% /sys/fs/cgroup
shm                395M     0  395M   0% /dev/shm
127.0.0.1:/200010  8.0E  1.3G  8.0E   1% /root
/dev/nvme0n1p1     160G   11G  150G   7% /opt/.sagemakerinternal
devtmpfs           1.9G     0  1.9G   0% /dev/tty
tmpfs              1.9G     0  1.9G   0% /proc/acpi
tmpfs              1.9G     0  1.9G   0% /sys/firmware
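
The same check can be done from Python inside the notebook. This is only a minimal sketch, assuming the standard-library shutil.disk_usage; the helper name disk_report is hypothetical, and the paths / and /root are taken from the output above. It reports disk space only, not GPU memory or RAM.

```python
import shutil


def disk_report(paths):
    """Return {path: (total_gib, used_gib, free_gib)} for each path that exists."""
    gib = 1024 ** 3
    report = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            # Skip paths that are missing or not readable on this instance.
            continue
        report[path] = (usage.total / gib, usage.used / gib, usage.free / gib)
    return report


if __name__ == "__main__":
    # "/" and "/root" are the mount points shown in the df output above.
    for path, (total, used, free) in disk_report(["/", "/root"]).items():
        print(f"{path}: total={total:.1f} GiB used={used:.1f} GiB free={free:.1f} GiB")
```

Paths that do not exist are skipped rather than raising, so the same list can be reused across instances with different mounts.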

However, I was previously working in a SageMaker notebook, where I accidentally filled /dev/xlda1 by saving datasets to the model_dir folder. I then tried to free the space with !rm -rf /dev/xlda1, which caused more issues, so I switched to Studio. Could that full disk also be the reason for this error? If so, how can I free that space?

Many thanks to anyone who takes time to help. I'm figuring this out as I go.

Samuel

Samuel
asked 8 months ago · 35 views
No answers