PyTorch Training (PyTorchJob)
This page describes PyTorchJob for training a machine learning model with PyTorch.

PyTorchJob is a Kubernetes custom resource to run PyTorch training jobs on Kubernetes. The Kubeflow implementation of PyTorchJob is in training-operator.
Note: PyTorchJob doesn't work in a user namespace by default because of Istio automatic sidecar injection. To get it running, add the annotation sidecar.istio.io/inject: "false" to either the PyTorchJob pods or the namespace to disable injection. For an example of how to add this annotation to your YAML file, see the TFJob documentation.
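For reference, here is a minimal sketch of where the annotation goes in a PyTorchJob spec. Everything except the annotation follows the replica spec layout shown later on this page; repeat the same annotation under each replica type so every pod in the job skips injection:

spec:
  pytorchReplicaSpecs:
    Master:
      template:
        metadata:
          annotations:
            # Disable Istio automatic sidecar injection for these pods.
            sidecar.istio.io/inject: "false"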
Creating a PyTorch training job
You can create a training job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.
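For orientation, the overall shape of a PyTorchJob config is sketched below. This is an illustrative skeleton, not a copy of simple.yaml: the image is a placeholder you would replace with one containing your training script, while the container name pytorch is what the operator looks for by default.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                 # default container name the operator expects
              image: <your-training-image>  # placeholder: image with your training script
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>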
Deploy the PyTorchJob resource to start training:
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
You should now be able to see the created pods matching the specified number of replicas.
kubectl get pods -l training.kubeflow.org/job-name=pytorch-simple -n kubeflow
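The output should look roughly like the following; the pod names follow the training operator's <job-name>-<replica-type>-<index> convention, and the pod count and ages depend on your config:

NAME                      READY   STATUS    RESTARTS   AGE
pytorch-simple-master-0   1/1     Running   0          27s
pytorch-simple-worker-0   1/1     Running   0          27s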
Training takes 5-10 minutes on a CPU cluster. You can inspect the logs to follow the training progress:
PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=pytorch-simple,training.kubeflow.org/replica-type=master,training.kubeflow.org/replica-index=0 -o name -n kubeflow)
kubectl logs -f ${PODNAME} -n kubeflow
Monitoring a PyTorchJob
kubectl get -o yaml pytorchjobs pytorch-simple -n kubeflow
Check the status section to monitor the job's progress. Here is sample output for a job that completed successfully.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
    Worker: {}
  startTime: 2018-12-16T21:40:45Z
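If you would rather block until the job finishes than poll the full YAML, kubectl wait can watch for the Succeeded condition shown above (the 10-minute timeout here is an arbitrary choice):

kubectl wait --for=condition=Succeeded pytorchjob/pytorch-simple -n kubeflow --timeout=600s

Alternatively, pull just that condition's status with a jsonpath query:

kubectl get pytorchjob pytorch-simple -n kubeflow -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'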
Next steps
Learn about distributed training in Training Operator.
See how to run a job with gang-scheduling.