cuDNN error on ml.g5.24xlarge


I'm training a machine learning model on ml.g5.24xlarge:

The following code:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(encoder, decoder, device).to(device)

when run through the model training code:

model.train()
for epoch in range(10):
    epoch_loss = 0
    for context_batch, question_batch, answer_batch in train_loader:
        padded_contexts = pad_sequences(context_batch, max_seq_len).to(device)
        padded_questions = pad_sequences(question_batch, max_seq_len).to(device)
        padded_answers = pad_sequences(answer_batch, max_seq_len).to(device)
        optimizer.zero_grad()
        output = model(padded_contexts, padded_questions)
        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        padded_answers = padded_answers[:, 1:].reshape(-1)
        loss = criterion(output, padded_answers)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch {epoch+1} loss: {epoch_loss/len(train_loader):.4f}')

returns the following dreaded error:

/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

I'm running the conda_pytorch_p39 environment in JupyterLab.

Any ideas welcome!

Thanks :)

  • This error usually occurs when your dataset contains a target value outside the range [0, n_classes - 1]. Can you re-check your data? Also, in future, you can use code blocks (via the GUI, or start with ``` to use markdown) when posting code, for better readability :)
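
To make the range check from the comment above concrete, here is a minimal sketch. The helper name `check_targets` and the toy tensors are hypothetical and not part of the original code. The default `ignore_index` of -100 is assumed to match the default of `nn.CrossEntropyLoss`; adjust it if your `criterion` is configured differently.

```python
import torch


def check_targets(targets: torch.Tensor, n_classes: int, ignore_index: int = -100) -> None:
    """Raise ValueError if any target lies outside [0, n_classes - 1].

    Values equal to ignore_index are skipped, since the loss ignores them too.
    (check_targets is a hypothetical helper, not a PyTorch API.)
    """
    valid = targets[targets != ignore_index]
    if valid.numel() == 0:
        return
    lo, hi = int(valid.min()), int(valid.max())
    if lo < 0 or hi >= n_classes:
        raise ValueError(f"target range [{lo}, {hi}] is outside [0, {n_classes - 1}]")


# Toy data standing in for one flattened batch with a vocabulary of 5 classes.
n_classes = 5
good = torch.tensor([0, 1, 4, 2])
bad = torch.tensor([0, 1, 5, 2])  # 5 is not a valid class index for 5 classes

check_targets(good, n_classes)  # passes silently
try:
    check_targets(bad, n_classes)
except ValueError as err:
    print(err)
```

In the training loop from the question, a call such as `check_targets(padded_answers, output_dim)` placed just before `loss = criterion(output, padded_answers)` should point at the offending batch. Typical causes, as far as I can tell, are a padding value that is not a valid class index, or a decoder output size smaller than the answer vocabulary. Because CUDA kernels run asynchronously, the cuDNN error is likely a downstream symptom of the failed assertion; running one batch on the CPU, or setting `CUDA_LAUNCH_BLOCKING=1`, usually gives a more readable traceback.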

asked a year ago · 67 views
No answers
