Accelerate Data Sharing in Kedro

This tutorial shows how Vineyard accelerates intermediate data sharing between tasks in Kedro pipelines using our vineyard-kedro plugin, as data scales and the pipelines are deployed on Kubernetes.

Note

This tutorial is based on the Developing and Learning MLOps at Home project, a tutorial about orchestrating a machine learning pipeline with Kedro.

Prepare the Kubernetes cluster

To deploy Kedro pipelines on Kubernetes, you must have a Kubernetes cluster.

Note

Vineyard Scheduler is compatible with Kubernetes versions 1.19 to 1.24. Ensure your Kubernetes cluster is within this version range for proper functionality.
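
As a minimal sketch of the range check described in this note, the helper below tests whether a Kubernetes version string falls within 1.19 to 1.24. The function name `is_supported` and the constants are hypothetical and not part of vineyard; the version string is assumed to look like the server version reported by `kubectl version`, for example `v1.19.3`.

```python
import re

# Supported (major, minor) range taken from the note above.
SUPPORTED_MIN = (1, 19)
SUPPORTED_MAX = (1, 24)


def is_supported(version: str) -> bool:
    """Return True if a version such as 'v1.19.3' is within 1.19 to 1.24."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so this checks major first, then minor.
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX
```

The patch component is ignored on purpose, since the note only constrains the minor version.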

We recommend using kind v0.13.0 to create a multi-node Kubernetes cluster on your local machine as follows:

$ cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.19.3
- role: worker
  image: kindest/node:v1.19.3
- role: worker
  image: kindest/node:v1.19.3
- role: worker
  image: kindest/node:v1.19.3
EOF

Deploy Argo Workflows

Install Argo Workflows on Kubernetes:

$ kubectl create namespace argo
$ kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.8/install.yaml

When the deployment becomes ready, you can see the following pods:

$ kubectl get pod -n argo
NAME                                READY   STATUS    RESTARTS   AGE
argo-server-7698c96655-jg2ds        1/1     Running   0          11s
workflow-controller-b888f4458-x4qf2 1/1     Running   0          11s

Deploy Vineyard

Install the vineyard operator:

$ helm repo add vineyard https://vineyard.oss-ap-southeast-1.aliyuncs.com/charts/
$ helm repo update
$ helm install vineyard-operator vineyard/vineyard-operator \
    --namespace vineyard-system \
    --create-namespace

Create a vineyard cluster:

Tip

To handle large data, we set both the memory and the shared memory size of the vineyard cluster to 8Gi.
```
$ python3 -m vineyard.ctl deploy vineyardd --vineyardd.memory=8Gi --vineyardd.size=8Gi
```
Note

The above command tries to create a vineyard cluster with 3 replicas by default. If you are working with Minikube, kind, or another Kubernetes setup that has fewer nodes available, reduce the number of replicas:
```
$ python3 -m vineyard.ctl deploy vineyardd --replicas=1 --vineyardd.memory=8Gi --vineyardd.size=8Gi
```

Prepare the S3 Service

Deploy the Minio cluster:

Tip

If you already have the AWS S3 service, just skip this section and jump to the next section.
```
$ kubectl apply -f python/vineyard/contrib/kedro/benchmark/mlops/minio-dev.yaml
```
Tip

The default access key and secret key of the Minio cluster are both minioadmin.

Create the S3 bucket:

If you are working with AWS S3, you can create a bucket named aws-s3-benchmark-bucket with the following command:
```
$ aws s3api create-bucket --bucket aws-s3-benchmark-bucket --region <Your AWS Region Name>
```

If you are working with Minio, you first need to expose the services and then create the bucket:

Forward the Minio service:

$ kubectl port-forward service/minio -n minio-dev 9000:9000

Install the minio client:

$ wget https://dl.min.io/client/mc/release/linux-amd64/mc
$ chmod +x mc
$ sudo mv mc /usr/local/bin

Configure the minio client:

$ mc alias set minio http://localhost:9000
Enter Access Key: <Your Access Key>
Enter Secret Key: <Your Secret Key>

Finally, create the bucket minio-s3-benchmark-bucket:

$ mc mb minio/minio-s3-benchmark-bucket
Bucket created successfully `minio/minio-s3-benchmark-bucket`.

Prepare the Docker images

Vineyard provides a benchmark project to test Kedro pipelines on Vineyard and S3:
```
$ cd python/vineyard/contrib/kedro/benchmark/mlops
```

Configure the AWS S3 credentials:

$ cat conf/aws-s3/credentials.yml
benchmark_aws_s3:
    client_kwargs:
        aws_access_key_id: Your AWS/Minio Access Key ID
        aws_secret_access_key: Your AWS/Minio Secret Access Key
        region_name: Your AWS Region Name

To deploy pipelines to Kubernetes, you first need to build the Docker image for the benchmark project.

To show how vineyard accelerates data sharing as the dataset scales, Docker images for different data sizes will be generated:
- For running Kedro on vineyard:
```
$ make docker-build
```
  You will see that Docker images for different data sizes have been generated:
```
$ docker images | grep mlops
mlops-benchmark    latest    fceaeb5a6688   17 seconds ago   1.07GB
```
To make those images available to your Kubernetes cluster, push them to your registry (or load them into the kind cluster if you set up your Kubernetes cluster using kind):
- Push to registry:
```
$ docker tag mlops-benchmark:latest <Your Registry>/mlops-benchmark:latest
$ docker push <Your Registry>/mlops-benchmark:latest
```
- Load to kind cluster:
```
$ kind load docker-image mlops-benchmark:latest
```

Deploy the Kedro Pipelines

Deploy the Kedro pipeline with vineyard for intermediate data sharing:

$ kubectl create namespace vineyard
$ for multiplier in 1 10 100 500; do \
     argo submit -n vineyard --watch argo-vineyard-benchmark.yml -p multiplier=${multiplier}; \
  done

Similarly, deploy the pipeline using AWS S3, the Cloudpickle dataset, or Minio for intermediate data sharing:

Using AWS S3:

$ kubectl create namespace aws-s3
# create the aws secrets from your ENV
$ kubectl create secret generic aws-secrets -n aws-s3 \
     --from-literal=access_key_id=$AWS_ACCESS_KEY_ID \
     --from-literal=secret_access_key=$AWS_SECRET_ACCESS_KEY
$ for multiplier in 1 10 100 500 1000 2000; do \
     argo submit -n aws-s3 --watch argo-aws-s3-benchmark.yml -p multiplier=${multiplier}; \
  done

Using Cloudpickle dataset:

$ kubectl create namespace cloudpickle
# create the aws secrets from your ENV
$ kubectl create secret generic aws-secrets -n cloudpickle \
     --from-literal=access_key_id=$AWS_ACCESS_KEY_ID \
     --from-literal=secret_access_key=$AWS_SECRET_ACCESS_KEY
$ for multiplier in 1 10 100 500 1000 2000; do \
     argo submit -n cloudpickle --watch argo-cloudpickle-benchmark.yml -p multiplier=${multiplier}; \
  done

Using Minio:

$ kubectl create namespace minio-s3
$ for multiplier in 1 10 100 500 1000 2000; do \
     argo submit -n minio-s3 --watch argo-minio-s3-benchmark.yml -p multiplier=${multiplier}; \
  done
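
The four submission loops above share one shape and differ only in the namespace, the workflow file, and the list of multipliers. As a minimal sketch, the hypothetical helper below (`build_commands` and the `BENCHMARKS` table are illustrative names, not part of vineyard or Kedro) builds the same `argo submit` argument lists without running them, which can be handy for a dry run:

```python
# Namespace -> (workflow file, multipliers), copied from the loops above.
BENCHMARKS = {
    "vineyard": ("argo-vineyard-benchmark.yml", [1, 10, 100, 500]),
    "aws-s3": ("argo-aws-s3-benchmark.yml", [1, 10, 100, 500, 1000, 2000]),
    "cloudpickle": ("argo-cloudpickle-benchmark.yml", [1, 10, 100, 500, 1000, 2000]),
    "minio-s3": ("argo-minio-s3-benchmark.yml", [1, 10, 100, 500, 1000, 2000]),
}


def build_commands(namespace):
    """Return one `argo submit` argument list per multiplier for a namespace."""
    workflow_file, multipliers = BENCHMARKS[namespace]
    return [
        ["argo", "submit", "-n", namespace, "--watch", workflow_file,
         "-p", f"multiplier={m}"]
        for m in multipliers
    ]


# Dry run: print the commands instead of executing them.
for namespace in BENCHMARKS:
    for command in build_commands(namespace):
        print(" ".join(command))
```

To actually submit, each list could be passed to `subprocess.run(command, check=True)`. The namespaces and secrets still have to be created first, as shown in the shell commands above.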

Performance

After running the benchmark above on Kubernetes, we recorded each node's execution time from the Argo workflow logs and summed these times to obtain the following end-to-end execution time for each data scale:

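| Data Scale | Vineyard | Minio S3 | Cloudpickle S3 | AWS S3 |
|------------|----------|----------|----------------|--------|
| 1 | 4.2s | 4.3s | 22.5s | 16.9s |
| 10 | 4.9s | 5.5s | 28.6s | 23.3s |
| 100 | 13.2s | 20.3s | 64.4s | 74s |
| 500 | 53.6s | 84.5s | 173.2s | 267.9s |
| 1000 | 109.8s | 164.2s | 322.7s | 510.6s |
| 2000 | 231.6s | 335.9s | 632.8s | 1069.7s |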
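
The per-node summation used to produce these numbers can be approximated from the workflow object itself. The sketch below is not the procedure the authors used (they read the times from the workflow logs); it assumes the workflow has been exported as JSON, for example with `argo get <workflow> -n <namespace> -o json`, and that each entry under `status.nodes` carries `type`, `startedAt`, and `finishedAt` fields. The function name `total_pod_seconds` is hypothetical.

```python
from datetime import datetime

# Timestamp layout assumed for startedAt/finishedAt, e.g. "2023-06-01T12:00:05Z".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def total_pod_seconds(workflow: dict) -> float:
    """Sum the wall-clock duration of every finished Pod node in a workflow."""
    total = 0.0
    for node in workflow.get("status", {}).get("nodes", {}).values():
        # Skip DAG/Steps bookkeeping nodes; only Pod nodes ran a Kedro task.
        if node.get("type") != "Pod":
            continue
        started, finished = node.get("startedAt"), node.get("finishedAt")
        if not started or not finished:
            continue  # node has not finished yet
        delta = (datetime.strptime(finished, TIME_FORMAT)
                 - datetime.strptime(started, TIME_FORMAT))
        total += delta.total_seconds()
    return total
```

Because these timestamps have one-second resolution, the result is coarser than the log-based figures in the table and should be treated as a rough cross-check only.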

We make the following observations from the comparison above:

- Vineyard can significantly accelerate the data sharing between tasks in Kedro pipelines, without the need for any intrusive changes to the original Kedro pipelines.
- As data scales, the advantage of Vineyard grows, because the cost of intermediate data sharing becomes more dominant in the end-to-end execution time.
- Even compared with local Minio, Vineyard is faster, and by a large margin at larger data scales, thanks to its ability to avoid (de)serialization, file I/O, and excessive memory copies.
- When using the Cloudpickle dataset (pickle + zstd), performance is better than with AWS S3 at data scales of 100 and above, as the dataset is compressed before being uploaded to S3.
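
To make these observations concrete, the speedup of Vineyard over each alternative can be computed directly from the table. The sketch below copies the measured times from the table; the names `TIMES` and `speedup_over` are illustrative only.

```python
# End-to-end execution time in seconds, copied from the table above.
TIMES = {
    1:    {"Vineyard": 4.2,   "Minio S3": 4.3,   "Cloudpickle S3": 22.5,  "AWS S3": 16.9},
    10:   {"Vineyard": 4.9,   "Minio S3": 5.5,   "Cloudpickle S3": 28.6,  "AWS S3": 23.3},
    100:  {"Vineyard": 13.2,  "Minio S3": 20.3,  "Cloudpickle S3": 64.4,  "AWS S3": 74.0},
    500:  {"Vineyard": 53.6,  "Minio S3": 84.5,  "Cloudpickle S3": 173.2, "AWS S3": 267.9},
    1000: {"Vineyard": 109.8, "Minio S3": 164.2, "Cloudpickle S3": 322.7, "AWS S3": 510.6},
    2000: {"Vineyard": 231.6, "Minio S3": 335.9, "Cloudpickle S3": 632.8, "AWS S3": 1069.7},
}


def speedup_over(baseline: str, scale: int) -> float:
    """Return how many times faster Vineyard is than `baseline` at `scale`."""
    row = TIMES[scale]
    return row[baseline] / row["Vineyard"]


# Print one line of speedups per data scale.
for scale in TIMES:
    ratios = ", ".join(
        f"{name}: {speedup_over(name, scale):.2f}x"
        for name in ("Minio S3", "Cloudpickle S3", "AWS S3")
    )
    print(f"scale {scale}: {ratios}")
```

For example, at data scale 2000 Vineyard is about 1.45 times faster than Minio and about 4.62 times faster than AWS S3, while at data scale 1 the gap to Minio is only about 2%.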