I'm training a machine learning model on an ml.g5.24xlarge instance.
The following code:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Seq2Seq(encoder, decoder, device).to(device)
when run through the model training code:
model.train()
for epoch in range(10):
    epoch_loss = 0
    for context_batch, question_batch, answer_batch in train_loader:
        padded_contexts = pad_sequences(context_batch, max_seq_len).to(device)
        padded_questions = pad_sequences(question_batch, max_seq_len).to(device)
        padded_answers = pad_sequences(answer_batch, max_seq_len).to(device)
        optimizer.zero_grad()
        output = model(padded_contexts, padded_questions)
        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        padded_answers = padded_answers[:, 1:].reshape(-1)
        loss = criterion(output, padded_answers)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch {epoch+1} loss: {epoch_loss/len(train_loader):.4f}')
returns the following dreaded error:
/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
I'm running the conda_pytorch_p39 environment on JupyterLab.
Any ideas welcome!
Thanks :)
This error usually occurs when a target value in your dataset falls outside the range [0, n_classes), that is, it is negative or greater than or equal to the number of classes. Can you re-check your data? Also, in future, you can use code blocks (via the GUI, or start with ``` to use markdown) when posting code, for improved readability :)
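One way to re-check the data is to scan the target token IDs before they ever reach the GPU. The snippet below is only a rough sketch in plain Python: `flatten`, `find_bad_targets`, and `toy_batches` are made-up names for illustration, and it assumes the targets can be turned into nested lists of integers (for tensors, something like `padded_answers.tolist()`).

```python
from collections.abc import Iterable


def flatten(values):
    """Yield every scalar from an arbitrarily nested list or tuple."""
    for v in values:
        if isinstance(v, Iterable) and not isinstance(v, (str, bytes)):
            yield from flatten(v)
        else:
            yield v


def find_bad_targets(batches, n_classes, ignore_index=None):
    """Return (batch_idx, value) pairs for targets outside [0, n_classes).

    Values equal to ignore_index are skipped, mirroring a loss that is
    configured to ignore a padding label (hypothetical setup).
    """
    bad = []
    for batch_idx, batch in enumerate(batches):
        for t in flatten(batch):
            if t == ignore_index:
                continue
            if not 0 <= t < n_classes:
                bad.append((batch_idx, t))
    return bad


# Toy data standing in for the answer batches; n_classes=5 means the
# only valid labels are 0, 1, 2, 3 and 4.
toy_batches = [
    [[1, 2, 3], [0, 4, 2]],
    [[5, 1, 0], [2, -1, 3]],
]
print(find_bad_targets(toy_batches, n_classes=5))
```

In your loop, `n_classes` would be the `output_dim` you already compute from `output.shape[-1]`. If this reports anything, likely suspects (guesses on my part, since the vocabulary code isn't shown) are a padding index equal to the vocabulary size, 1-based token IDs, or a decoder output layer smaller than the answer vocabulary.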