cuDNN error on ml.g5.24xlarge


I'm training a machine learning model on an ml.g5.24xlarge instance.

The following code:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(encoder, decoder, device).to(device)

when run through the model training code:

model.train()
for epoch in range(10):
    epoch_loss = 0
    for context_batch, question_batch, answer_batch in train_loader:
        padded_contexts = pad_sequences(context_batch, max_seq_len).to(device)
        padded_questions = pad_sequences(question_batch, max_seq_len).to(device)
        padded_answers = pad_sequences(answer_batch, max_seq_len).to(device)
        optimizer.zero_grad()
        output = model(padded_contexts, padded_questions)
        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        padded_answers = padded_answers[:, 1:].reshape(-1)
        loss = criterion(output, padded_answers)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch {epoch+1} loss: {epoch_loss/len(train_loader):.4f}')

returns the following dreaded error:

/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/ nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.


I'm running the conda_pytorch_p39 environment on JupyterLab.

Any ideas welcome!

Thanks :)

  • This error usually occurs when a target label in your dataset falls outside the valid range, 0 to n_classes - 1 inclusive. Can you re-check your data? Also, for future posts, you can use code blocks (via the GUI, or start with ``` to use markdown) when posting code, for improved readability :)
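
    One way to re-check the data is to scan the targets for out-of-range ids before they reach the loss. Below is a minimal sketch, not a definitive fix: `find_bad_targets` is a hypothetical helper (not part of PyTorch), and it assumes `n_classes` equals the last dimension of the model output (`output_dim` in the question). The optional `ignore_index` argument is meant to mirror the parameter of the same name on `nn.CrossEntropyLoss`, in case the criterion was built with one.

```python
def find_bad_targets(targets, n_classes, ignore_index=None):
    """Return the sorted distinct label ids that fall outside [0, n_classes).

    Hypothetical helper for debugging; `targets` may be a nested list or any
    object exposing .tolist() (for example a CPU tensor or a NumPy array).
    """
    if hasattr(targets, "tolist"):
        targets = targets.tolist()

    bad = set()
    stack = [targets]
    while stack:
        item = stack.pop()
        if isinstance(item, (list, tuple)):
            # Descend into nested batches / sequences
            stack.extend(item)
        elif item != ignore_index and not (0 <= item < n_classes):
            bad.add(item)
    return sorted(bad)


# Toy batch with 5 classes: 7 is too large and -1 is negative
print(find_bad_targets([[1, 2, 3], [0, 7, -1]], n_classes=5))  # -> [-1, 7]
```

    In the training loop from the question, this could be called as `find_bad_targets(padded_answers.cpu(), output_dim)` just before `criterion(...)`. A non-empty result would suggest that the padding value or the vocabulary indices do not match the size of the decoder's output layer. Running one batch on the CPU, or setting the environment variable `CUDA_LAUNCH_BLOCKING=1`, also tends to give a clearer traceback for this kind of assertion.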

