
Confusion about pods and pytorchjob


After the command

kubectl apply -f train.yaml

the system responds with pytorchjob.kubeflow.org/cifar10-train created. But if I issue the command

kubectl get pods -o wide | grep train

there is no output. And if I issue the command

kubetail cifar10-train

the system answers: No pod exists that matches cifar10-train. What's wrong? It seems there is no pod with the name cifar10-train. Thanks. I attach the train.yaml here:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: cifar10-train
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd-service
    rdzvPort: 2379
    minReplicas: 1
    maxReplicas: 128
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: 374402744818.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu:latest
              imagePullPolicy: IfNotPresent
              env:
              - name: PROCESSOR
                value: "cpu"
              command:
                - python3
                - -m
                - torch.distributed.run
                - /workspace/cifar10-model-train.py
                - "--epochs=10"
                - "--batch-size=128"
                - "--workers=15"
                - "--model-file=/efs-shared/cifar10-model.pth"
                - "/efs-shared/cifar-10-batches-py/"
              volumeMounts:
                - name: efs-pv
                  mountPath: /efs-shared
                # The following enables the worker pods to use increased shared memory 
                # which is required when specifying more than 0 data loader workers
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: efs-pv
              persistentVolumeClaim:
                claimName: efs-pvc
            - name: dshm
              emptyDir:
                medium: Memory
asked 2 years ago · 474 views
1 Answer

Roberto,

The issue is likely that creating a PyTorchJob does not create a pod with the same name as the job. A PyTorchJob is a custom resource that represents a distributed PyTorch training job, and the training operator behind it manages the lifecycle of the pods that actually run the training. Those pods have names derived from the PyTorchJob name, but not identical to it.
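
For example, the Kubeflow training operator labels the pods it creates with the job name, so listing by label selector should show them even when grepping for the exact job name does not. The label key below is an assumption based on the v1 training operator; older pytorch-operator releases used different label keys:

# assumes the v1 training operator's job-name label
kubectl get pods -l training.kubeflow.org/job-name=cifar10-train -o wide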

If you're not seeing any pods when you run kubectl get pods -o wide | grep train, it could be that the pods have not yet been scheduled, or they have already completed their work and exited. You might want to check the status of the PyTorchJob itself with a command like kubectl describe pytorchjob cifar10-train.
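
To see what the controller has (or has not) done with the job, describing it and checking recent events is usually the quickest route. These are standard kubectl commands; the grep pattern is simply the job name from your YAML:

kubectl describe pytorchjob cifar10-train
# recent cluster events mentioning the job, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp | grep cifar10-train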

As for the kubetail command, it's a bash script that enables you to aggregate (tail/follow) logs from multiple pods into one stream. If it's saying that no pod exists that matches cifar10-train, it could be because the pods have a different naming pattern.
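
If the pods do exist, their names typically follow the <job-name>-worker-<index> convention (for example cifar10-train-worker-0), so tailing by that prefix should work. This assumes kubetail's default substring matching on pod names; kubectl logs is the fallback if kubetail still finds nothing:

kubetail cifar10-train-worker
# or tail a single worker directly (pod name assumed from the usual naming convention)
kubectl logs -f cifar10-train-worker-0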

Hope this helps!

answered 2 years ago
