Efficient data sharing in Kubeflow with Vineyard CSI Driver#

If you are using Kubeflow Pipelines or Argo Workflows to manage your machine learning workflow, you may find that saving and loading data to volumes is slow. To speed this up, we designed the Vineyard CSI Driver, which maps each vineyard object to a volume so that data saving/loading is handled by vineyard. In the following, we will show you how to use the Vineyard CSI Driver to speed up a Kubeflow pipeline.

Prerequisites#

  • A Kubernetes cluster with version >= 1.25.10. If you don’t have one at hand, you can refer to the guide Initialize Kubernetes Cluster to create one.

  • Install vineyardctl by following the official guide.

  • Install the Argo Workflow CLI by following the official guide.

  • Install the kfp package with version < 2.0.0.
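For example, with pip:

$ pip install "kfp<2.0.0"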

Deploy the Vineyard Cluster#

$ vineyardctl deploy vineyard-cluster --create-namespace

This command will create a vineyard cluster in the namespace vineyard-system. You can check as follows:

$ kubectl get pod -n vineyard-system
NAME                                             READY   STATUS    RESTARTS   AGE
vineyard-controller-manager-648fc9b7bf-zwnhd     2/2     Running   0          4d3h
vineyardd-sample-79c8ffb879-6k8mk                1/1     Running   0          4d3h
vineyardd-sample-79c8ffb879-f9kkr                1/1     Running   0          4d3h
vineyardd-sample-79c8ffb879-lzgwz                1/1     Running   0          4d3h
vineyardd-sample-etcd-0                          1/1     Running   0          4d3h

Deploy the Vineyard CSI Driver#

Before deploying the Vineyard CSI Driver, check that the vineyard deployment is ready:

$ kubectl get deployment -n vineyard-system
NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
vineyard-controller-manager      1/1     1            1           4d3h
vineyardd-sample                 3/3     3            3           4d3h

Then deploy the vineyard csi driver, specifying the vineyard cluster to use:

Tip

If you want to inspect the debug logs of the vineyard csi driver, add the --verbose flag to the following command.

$ vineyardctl deploy csidriver --clusters vineyard-system/vineyardd-sample

Then check the status of the Vineyard CSI Driver:

$ kubectl get pod -n vineyard-system
NAME                                             READY   STATUS    RESTARTS   AGE
vineyard-controller-manager-648fc9b7bf-zwnhd     2/2     Running   0          4d3h
vineyard-csi-sample-csi-driver-fb7cb5b5d-nlrxs   4/4     Running   0          4m23s
vineyard-csi-sample-csi-nodes-69j77              3/3     Running   0          4m23s
vineyard-csi-sample-csi-nodes-k85hb              3/3     Running   0          4m23s
vineyard-csi-sample-csi-nodes-zhfz4              3/3     Running   0          4m23s
vineyardd-sample-79c8ffb879-6k8mk                1/1     Running   0          4d3h
vineyardd-sample-79c8ffb879-f9kkr                1/1     Running   0          4d3h
vineyardd-sample-79c8ffb879-lzgwz                1/1     Running   0          4d3h
vineyardd-sample-etcd-0                          1/1     Running   0          4d3h

Deploy Argo Workflows#

Install the argo server on Kubernetes:

$ kubectl create namespace argo
$ kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.8/install.yaml

Then check the status of the argo server:

$ kubectl get pod -n argo
NAME                                  READY   STATUS    RESTARTS   AGE
argo-server-7698c96655-ft6sj          1/1     Running   0          4d1h
workflow-controller-b888f4458-sfrjd   1/1     Running   0          4d1h

Running a Kubeflow Pipeline example#

The example is under the directory k8s/examples/vineyard-csidriver, and pipeline.py in this directory is the original pipeline definition. To use the Vineyard CSI Driver, we need to make two modifications:

  1. Change APIs such as pd.read_pickle/to_pickle to vineyard.csi.read/write in the source code.

  2. Add the vineyard object VolumeOp to the pipeline’s dependencies. The paths changed in the first step will be mapped to volumes. Note that the volume used in any task needs to be explicitly mounted to the corresponding path in the source code, and the storage class name of each VolumeOp must follow the format {vineyard-deployment-namespace}.{vineyard-deployment-name}.csi.

There are two ways to add the vineyard object VolumeOp to the pipeline’s dependencies (see the sketch after this list):

  • Map each path in the source code to its own volume, and mount each volume to the actual path used in the source code. The benefit is that the source paths do not need to be modified.

  • Create a single volume for all paths that share the same prefix in the source code. For example, add the prefix /vineyard to the paths in the source code and mount one volume to the path /vineyard. This way, one volume serves multiple paths/vineyard objects.
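A minimal sketch of these two modifications with the kfp v1 SDK is shown below. The pipeline function name, image names, and paths are illustrative, and the argument order of vineyard.csi.read/write is an assumption rather than the exact contents of pipeline-with-vineyard.py:

import kfp.dsl as dsl

@dsl.pipeline(name="machine-learning-pipeline")
def machine_learning_pipeline(data_multiplier: int):
    # A single PVC backed by the Vineyard CSI Driver for all /vineyard/* paths;
    # the storage class name follows {namespace}.{deployment-name}.csi.
    vineyard_volume = dsl.VolumeOp(
        name="vineyard-objects",
        resource_name="vineyard-objects-pvc",
        storage_class="vineyard-system.vineyardd-sample.csi",
        modes=dsl.VOLUME_MODE_RWM,
        size="1Mi",  # nominal size; the data itself lives in vineyard's shared memory
    )

    # Every task that reads or writes a /vineyard/* path must mount the volume there.
    prepare_data = dsl.ContainerOp(
        name="prepare-data",
        image="prepare-data",  # illustrative image name
        arguments=["--data_multiplier", data_multiplier],
    ).add_pvolumes({"/vineyard": vineyard_volume.volume})

Inside each component, the file I/O then becomes vineyard I/O along these lines:

import vineyard

vineyard.csi.write(df, "/vineyard/df")   # producer task, instead of df.to_pickle(...)
df = vineyard.csi.read("/vineyard/df")   # consumer task, instead of pd.read_pickle(...)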

You may get some insights from the modified pipeline pipeline-with-vineyard.py. Next, we need to compile the pipeline to an argo-workflow YAML. To be compatible with the benchmark test, we updated the generated pipeline.yaml and pipeline-with-vineyard.yaml.
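With the kfp v1 SDK, the compilation step can be done along these lines (the pipeline function name is the placeholder from the sketch above):

import kfp.compiler

# compile the pipeline definition into an Argo Workflow manifest
kfp.compiler.Compiler().compile(machine_learning_pipeline, "pipeline-with-vineyard.yaml")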

Now, we can build the docker images for the pipeline:

$ cd k8s/examples/vineyard-csidriver
$ make docker-build

Check that the images are built successfully:

$ docker images
REPOSITORY               TAG       IMAGE ID       CREATED          SIZE
train-data               latest    5628953ffe08   14 seconds ago   1.47GB
test-data                latest    94c8c75b960a   14 seconds ago   1.47GB
prepare-data             latest    5aab1b120261   15 seconds ago   1.47GB
preprocess-data          latest    5246d09e6f5e   15 seconds ago   1.47GB

Then push the images to a docker registry that your kubernetes cluster can access. As we use a kind cluster in this example, we can load the images into the cluster directly:

$ make load-images

To simulate the data loading/saving of the actual pipeline, we use an nfs volume to store the data. The nfs volume is mounted at the /mnt/data directory of the kind cluster. Apply the data volume as follows:

Tip

If you already have an nfs volume that can be accessed by the kubernetes cluster, you can update prepare-data.yaml to use it.

$ kubectl apply -f prepare-data.yaml

Deploy the RBAC for the pipeline:

$ kubectl apply -f rbac.yaml

Submit the kubeflow example without vineyard to the argo server:

$ for data_multiplier in 3000 4000 5000; do \
    argo submit --watch pipeline.yaml -p data_multiplier=${data_multiplier}; \
done

Clear the previous resources:

$ argo delete --all

Submit the kubeflow example with vineyard to the argo server:

$ for data_multiplier in 3000 4000 5000; do \
    argo submit --watch pipeline-with-vineyard.yaml -p data_multiplier=${data_multiplier}; \
done

Result Analysis#

The data scales are 8500 Mi, 12000 Mi, and 15000 Mi, corresponding to the data_multiplier values 3000, 4000, and 5000 above. The argo workflow durations of the pipeline are as follows:

Argo workflow duration#

data scale    without vineyard    with vineyard
8500 Mi       189s                164s
12000 Mi      234s                199s
15000 Mi      298s                252s

The duration of an argo workflow is affected by many factors, such as the network, the CPU and memory of the cluster, and the data volume, so the measured times are not perfectly stable. Still, the pipeline with vineyard consistently finishes faster than the one without vineyard.

We also record the whole execution time via logs. The results are as follows:

Actual execution time#

data scale    without vineyard    with vineyard
8500 Mi       142.2s              94.3s
12000 Mi      191.2s              123.1s
15000 Mi      253.5s              181.4s

According to the above results, the actual execution time of the pipeline with vineyard is clearly shorter than that without vineyard. To be specific, we also record the write/read time of the individual steps:

Writing time#

data scale    without vineyard    with vineyard
8500 Mi       21.6s               5.5s
12000 Mi      26.6s               6.8s
15000 Mi      32.7s               9.2s

From the above results, we can see that the writing time of the pipeline with vineyard is roughly 4 times shorter than that without vineyard. The reason is that the data is stored in the vineyard cluster, so a write is essentially a memory copy, which is much faster than writing to the nfs volume.

Reading time#

We exclude the time of the initial data loading, and the results are as follows:

data scale    without vineyard    with vineyard
8500 Mi       37.3s               0.04s
12000 Mi      49.5s               0.04s
15000 Mi      61.7s               0.04s

Based on the above results, we can see that the reading time with vineyard is nearly constant and unaffected by the data scale. The reason is that the data is stored in the shared memory of the vineyard cluster, so a read is essentially a pointer copy.
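For intuition, a consumer that resolves a vineyard object does roughly the following (a minimal sketch using the vineyard Python client; the object id here is illustrative, and in the CSI setting it is resolved from the mounted volume path):

import vineyard

# connect to the local vineyardd via its IPC (UNIX domain) socket
client = vineyard.connect()

# fetching an object resolves its metadata and memory-maps blobs that already
# live in vineyard's shared memory, hence the near-constant reading time
obj = client.get(vineyard.ObjectID("o000d19f0d0809ad4"))  # illustrative id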

In summary, with vineyard the argo workflow duration of the pipeline is reduced by 10%~20%, and the actual execution time of the pipeline is reduced by about 30%.

Clean up#

Delete the RBAC for the kubeflow example:

$ kubectl delete -f rbac.yaml

Delete all argo workflows:

$ argo delete --all

Delete the argo server:

$ kubectl delete ns argo

Delete the csi driver:

$ vineyardctl delete csidriver

Delete the vineyard cluster:

$ vineyardctl delete vineyard-cluster

Delete the data volume:

$ kubectl delete -f prepare-data.yaml