Hello all,
I'm using SageMaker Studio to run a training script, and it fails with the error
"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"
I also notice that the following messages are logged during the imports:
2023-09-07 08:24:01,052 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-09-07 08:24:01,084 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-09-07 08:24:01,118 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
Is this typical? How can I go about solving this issue?
I checked disk usage with
!df -h
which produces
Filesystem         Size  Used  Avail  Use%  Mounted on
overlay            32G   48K   32G    1%    /
tmpfs              64M   0     64M    0%    /dev
tmpfs              1.9G  0     1.9G   0%    /sys/fs/cgroup
shm                395M  0     395M   0%    /dev/shm
127.0.0.1:/200010  8.0E  1.3G  8.0E   1%    /root
/dev/nvme0n1p1     160G  11G   150G   7%    /opt/.sagemakerinternal
devtmpfs           1.9G  0     1.9G   0%    /dev/tty
tmpfs              1.9G  0     1.9G   0%    /proc/acpi
tmpfs              1.9G  0     1.9G   0%    /sys/firmware
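In case it helps anyone reproduce the check, here is a rough Python sketch of the same disk check using only the standard library. The helper name disk_usage_report is made up for illustration, and the paths are just mount points taken from the df output above. Note that this only reports disk space, not GPU memory, so it may not be related to the cuDNN error at all.

```python
import shutil

GIB = 1024 ** 3


def disk_usage_report(paths):
    """Return {path: (total_gib, used_gib, free_gib, percent_used)}.

    Hypothetical helper for illustration; paths that do not exist or
    cannot be read are skipped rather than raising.
    """
    report = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            # Missing or inaccessible path: leave it out of the report.
            continue
        percent = 100.0 * usage.used / usage.total if usage.total else 0.0
        report[path] = (
            usage.total / GIB,
            usage.used / GIB,
            usage.free / GIB,
            percent,
        )
    return report


if __name__ == "__main__":
    # Assumed mount points, copied from the df output above.
    for path, (total, used, free, pct) in disk_usage_report(
        ["/", "/root", "/dev/shm"]
    ).items():
        print(
            f"{path}: {used:.1f} GiB used of {total:.1f} GiB "
            f"({pct:.0f}%), {free:.1f} GiB free"
        )
```

The percentage here is used / total, which can differ slightly from the Use% column that df prints, since df excludes reserved blocks.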
However, I was previously working on a SageMaker notebook, where I accidentally filled /dev/xlda1 by saving datasets to the model_dir folder. I then tried to free up space with !rm -rf /dev/xlda1, which caused more issues, so I switched to Studio. Could that full disk also be the reason for the error? If so, how can I free that space?
Many thanks to anyone who takes time to help. I'm figuring this out as I go.
Samuel