Roberto,
The problem may be that creating a PyTorchJob does not create a pod with the same name as the job. A PyTorchJob is a custom resource that represents a distributed PyTorch training job and manages the lifecycle of the pods that do the training. Those pods get names derived from the PyTorchJob name, typically with a replica type and index appended (for example, cifar10-train-master-0), so they are not identical to it.
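To make the naming point concrete, here is a minimal sketch that prints the pod names you would typically expect for a given job. It assumes the usual Kubeflow training operator convention of `<job>-master-0` and `<job>-worker-<index>`; the function name `expected_pod_names` is hypothetical, and your operator version may name pods differently.

```shell
# Print the pod names a PyTorchJob would typically produce.
# Assumption: the operator names pods <job>-master-0 and <job>-worker-<i>.
# Usage: expected_pod_names <job-name> [worker-count]
expected_pod_names() {
  local job="$1"
  local workers="${2:-0}"
  local i=0

  # One master replica, always index 0.
  echo "${job}-master-0"

  # Worker replicas are indexed from 0.
  while [ "$i" -lt "$workers" ]; do
    echo "${job}-worker-${i}"
    i=$((i + 1))
  done
}
```

Running expected_pod_names cifar10-train 2 lists one master and two worker names, which you can compare against what kubectl get pods actually shows.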
If kubectl get pods -o wide | grep train shows no pods, they may not have been created or scheduled yet, or they may have already completed and been cleaned up. Check the status of the PyTorchJob itself, for example with kubectl describe pytorchjob cifar10-train.
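The checks above could be bundled roughly as follows. This is only a sketch: the function name `diagnose_pytorchjob` and the `KUBECTL` override are hypothetical, and the label key `training.kubeflow.org/job-name` is an assumption that depends on your training operator version (older releases used different labels).

```shell
# Run a few read-only checks against a PyTorchJob in the current namespace.
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
# Usage: diagnose_pytorchjob <job-name>
diagnose_pytorchjob() {
  local job="$1"

  # Status, conditions and events recorded on the custom resource itself.
  ${KUBECTL:-kubectl} describe pytorchjob "$job"

  # Pods the operator created for this job.
  # Assumption: pods carry the label training.kubeflow.org/job-name=<job>.
  ${KUBECTL:-kubectl} get pods -o wide -l "training.kubeflow.org/job-name=${job}"

  # Recent events, oldest first, to spot scheduling or image-pull problems.
  ${KUBECTL:-kubectl} get events --sort-by=.lastTimestamp
}
```

If the label selector returns nothing, kubectl get pods --show-labels shows which labels your pods actually carry.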
As for kubetail, it is a bash script that aggregates (tails/follows) logs from multiple pods into one stream. If it reports that no pod matches cifar10-train, the pods may follow a different naming pattern, or they may not exist at the time you run it.
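If name matching keeps failing, one alternative (not kubetail itself, but plain kubectl) is to follow logs by label rather than by pod name. The sketch below assumes the same `training.kubeflow.org/job-name` label, which may differ in your operator version; `follow_job_logs` and the `KUBECTL` override are hypothetical names.

```shell
# Follow logs from every pod belonging to a PyTorchJob, selected by label.
# Set KUBECTL="echo kubectl" to print the command instead of running it.
# Usage: follow_job_logs <job-name>
follow_job_logs() {
  local job="$1"

  # --prefix tags each line with its source pod;
  # --max-log-requests caps how many pods are streamed concurrently.
  # Assumption: pods carry the label training.kubeflow.org/job-name=<job>.
  ${KUBECTL:-kubectl} logs -f --prefix --max-log-requests 10 \
    -l "training.kubeflow.org/job-name=${job}"
}
```

For example, follow_job_logs cifar10-train streams the master and worker logs together once the pods exist.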
Hope this helps!
