Machine Learning with Vineyard

Vineyard-ML: A Vineyard package that integrates machine learning frameworks with Vineyard.

TensorFlow

Using Numpy Data

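>>> import numpy as np
>>> import tensorflow as tf
>>> import vineyard
>>> from vineyard.contrib.ml import tensorflow
>>> vineyard_client = vineyard.connect()  # assumes a running vineyard server with its IPC socket configured
>>> data, label = np.random.rand(4, 5), np.random.rand(4)  # illustrative inputs, not from the original
>>> dataset = tf.data.Dataset.from_tensor_slices((data, label))
>>> data_id = vineyard_client.put(dataset)
>>> vin_data = vineyard_client.get(data_id)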

Vineyard supports tf.data.Dataset. The vin_data object will be a shared-memory object retrieved from vineyard.

Using Dataframe

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'target': [1.0, 2.0, 3.0, 4.0]})
>>> label = df.pop('target')
>>> dataset = tf.data.Dataset.from_tensor_slices((dict(df), label))
>>> data_id = vineyard_client.put(dataset)
>>> vin_data = vineyard_client.get(data_id)

Wrap the dataframe in a tf.data.Dataset. This allows feature columns to act as a bridge that maps the columns of the pandas DataFrame to features in the Dataset. The dataset should yield a dictionary that maps column names (from the dataframe) to column values, and it should contain only numerical data.
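The two constraints above (a mapping from column names to values, and numerical data only) can be checked before the dataset is built. The following is a minimal sketch in plain Python; the helper name split_features_and_label is hypothetical and is not part of vineyard, pandas, or TensorFlow. It mirrors what df.pop('target') and dict(df) do in the example.

```python
from numbers import Number


def split_features_and_label(columns, label_name):
    """Hypothetical helper: separate the label column and verify that every
    remaining column holds only numerical values."""
    features = dict(columns)          # shallow copy, the input stays untouched
    label = features.pop(label_name)  # mirrors df.pop('target')
    for name, values in features.items():
        for value in values:
            # bool is a Number subclass in Python, so it is rejected explicitly
            if isinstance(value, bool) or not isinstance(value, Number):
                raise TypeError(f"column {name!r} contains non-numerical data")
    return features, label


columns = {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'target': [1.0, 2.0, 3.0, 4.0]}
features, label = split_features_and_label(columns, 'target')
print(sorted(features))  # ['a', 'b']
print(label)             # [1.0, 2.0, 3.0, 4.0]
```

A column containing strings would raise TypeError here, which is a convenient place to catch categorical data that still needs encoding.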

Using RecordBatch of Pyarrow

>>> import pyarrow as pa
>>> arrays = [pa.array([1, 2, 3, 4]), pa.array([3.0, 4.0, 5.0, 6.0]), pa.array([0, 1, 0, 1])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'label'])
>>> data_id = vineyard_client.put(batch)
>>> vin_data = vineyard_client.get(data_id)

Vineyard supports direct integration of RecordBatch. The vin_data object will be a TensorFlow Dataset, i.e. tf.data.Dataset. The label column must be named label.

Using Tables of Pyarrow

>>> arrays = [pa.array([1, 2, 3, 4]), pa.array([3.0, 4.0, 5.0, 6.0]), pa.array([0, 1, 0, 1])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'label'])
>>> batches = [batch]*3
>>> table = pa.Table.from_batches(batches)
>>> data_id = vineyard_client.put(table)
>>> vin_data = vineyard_client.get(data_id)

Vineyard supports direct integration of Tables as well. Here too, the vin_data object will be a TensorFlow Dataset, i.e. tf.data.Dataset, and the label column must be named label.

PyTorch

Using Numpy Data

Vineyard supports custom datasets that inherit from the PyTorch Dataset class.

>>> import torch
>>> from vineyard.contrib.ml import pytorch
>>> data_id = vineyard_client.put(dataset, typename='Tensor')
>>> vin_data = vineyard_client.get(data_id)

The dataset object should be an instance of a CustomDataset class that inherits from torch.utils.data.Dataset. Specifying the typename as Tensor is important. The vin_data object will be of type torch.utils.data.TensorDataset.
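The example above does not show how CustomDataset is defined. A minimal sketch of a map-style dataset follows. The class body and the sample values are assumptions made for illustration, and the sketch does not show what vineyard itself requires of the stored elements. The import is guarded only so that the sketch also runs where PyTorch is not installed.

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so the sketch still runs without PyTorch installed.
    Dataset = object


class CustomDataset(Dataset):
    """Map-style dataset: implements __len__ and __getitem__."""

    def __init__(self, data, label):
        if len(data) != len(label):
            raise ValueError("data and label must have the same length")
        self.data = data
        self.label = label

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Each sample is a (features, label) pair.
        return self.data[index], self.label[index]


# Illustrative values only; real code would typically hold arrays or tensors.
dataset = CustomDataset(
    [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [4.0, 8.0]],
    [0, 1, 0, 1],
)
print(len(dataset))  # 4
print(dataset[2])    # ([3.0, 7.0], 0)
```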

Using Dataframe

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [1.0, 2.0, 3.0, 4.0]})
>>> label = torch.from_numpy(df['c'].values.astype(np.float32))
>>> data = torch.from_numpy(df.drop('c', axis=1).values.astype(np.float32))
>>> dataset = torch.utils.data.TensorDataset(data, label)
>>> data_id = vineyard_client.put(dataset, typename='Dataframe', cols=['a', 'b', 'c'], label='c')
>>> vin_data = vineyard_client.get(data_id, label='c')

When using the PyTorch form of a dataframe with vineyard, it is important to pass the typename as Dataframe, the list of column names as cols, and the label name as label. The vin_data object will be a TensorDataset whose label is the column named by the label argument. If no label is passed to get, vineyard falls back to the default, which is the label that was passed to put.
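The fallback rule for label can be pictured with a small plain-Python sketch. This is not vineyard's implementation; the function split_by_label and its parameters are hypothetical and only illustrate the behaviour described above: the label given at read time wins, otherwise the one recorded at put time is used.

```python
def split_by_label(rows, cols, stored_label, label=None):
    """Hypothetical sketch: pick the label column at read time, falling back
    to the label that was recorded when the data was stored."""
    chosen = label if label is not None else stored_label
    idx = cols.index(chosen)
    features = [[v for i, v in enumerate(row) if i != idx] for row in rows]
    labels = [row[idx] for row in rows]
    return features, labels


cols = ['a', 'b', 'c']
rows = [[1, 5, 1.0], [2, 6, 2.0], [3, 7, 3.0], [4, 8, 4.0]]

# No label given: the stored default 'c' is used.
features, labels = split_by_label(rows, cols, stored_label='c')
print(labels)    # [1.0, 2.0, 3.0, 4.0]

# An explicit label overrides the stored default.
features_a, labels_a = split_by_label(rows, cols, stored_label='c', label='a')
print(labels_a)  # [1, 2, 3, 4]
```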

Using RecordBatch of Pyarrow

>>> import pyarrow as pa
>>> arrays = [pa.array([1, 2, 3, 4]), pa.array([3.0, 4.0, 5.0, 6.0]), pa.array([0, 1, 0, 1])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'f2'])
>>> data_id = vineyard_client.put(batch)
>>> vin_data = vineyard_client.get(data_id, label='f2')

The vin_data object will be a TensorDataset whose label is the column named by the label argument. In this case, it is important to specify the label argument.

Using Tables of Pyarrow

>>> arrays = [pa.array([1, 2, 3, 4]), pa.array([3.0, 4.0, 5.0, 6.0]), pa.array([0, 1, 0, 1])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'f2'])
>>> batches = [batch]*3
>>> table = pa.Table.from_batches(batches)
>>> data_id = vineyard_client.put(table)
>>> vin_data = vineyard_client.get(data_id, label='f2')

The vin_data object will be a TensorDataset whose label is the column named by the label argument. In this case, it is important to specify the label argument.

MXNet

Using Numpy Data

Vineyard supports ArrayDataset objects from the gluon.data module of MXNet.

>>> import mxnet as mx
>>> from vineyard.contrib.ml import mxnet
>>> dataset = mx.gluon.data.ArrayDataset((data, label))
>>> data_id = vineyard_client.put(dataset, typename='Tensor')
>>> vin_data = vineyard_client.get(data_id)

The dataset object should be an instance of ArrayDataset from the mxnet.gluon.data module. Here, specifying the typename as Tensor is important. The vin_data object will be of type mxnet.gluon.data.ArrayDataset.

Using Dataframe

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [1.0, 2.0, 3.0, 4.0]})
>>> label = df['c'].values.astype(np.float32)
>>> data = df.drop('c', axis=1).values.astype(np.float32)
>>> dataset = mx.gluon.data.ArrayDataset((data, label))
>>> data_id = vineyard_client.put(dataset, typename='Dataframe', cols=['a', 'b', 'c'], label='c')
>>> vin_data = vineyard_client.get(data_id, label='c')

When using the MXNet form of a dataframe with vineyard, it is important to pass the typename as Dataframe, the list of column names as cols, and the label name as label. The vin_data object will be an ArrayDataset whose label is the column named by the label argument. If no label is passed to get, vineyard falls back to the default, which is the label that was passed to put.

Using RecordBatch of Pyarrow

>>> import pyarrow as pa
>>> arrays = [pa.array([1, 2, 3, 4]), pa.array([3.0, 4.0, 5.0, 6.0]), pa.array([0, 1, 0, 1])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'f2'])
>>> data_id = vineyard_client.put(batch)
>>> vin_data = vineyard_client.get(data_id, label='f2')

The vin_data object will be an ArrayDataset whose label is the column named by the label argument. In this case, it is important to specify the label argument.

Using Tables of Pyarrow

>>> arrays = [pa.array([1, 2, 3, 4]), pa.array([3.0, 4.0, 5.0, 6.0]), pa.array([0, 1, 0, 1])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'f2'])
>>> batches = [batch]*3
>>> table = pa.Table.from_batches(batches)
>>> data_id = vineyard_client.put(table)
>>> vin_data = vineyard_client.get(data_id, label='f2')

The vin_data object will be an ArrayDataset whose label is the column named by the label argument. In this case, it is important to specify the label argument.

XGBoost

Vineyard supports resolving XGBoost::DMatrix from various kinds of vineyard data types.

From Vineyard::Tensor

>>> import numpy as np
>>> arr = np.random.rand(4, 5)
>>> vin_tensor_id = vineyard_client.put(arr)
>>> dmatrix = vineyard_client.get(vin_tensor_id)

The dmatrix will be a DMatrix instance with the same shape (4, 5) resolved from the Vineyard::Tensor object with the id vin_tensor_id.

From Vineyard::DataFrame

>>> df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [1.0, 2.0, 3.0, 4.0]})
>>> vin_df_id = vineyard_client.put(df)
>>> dmatrix = vineyard_client.get(vin_df_id, label='a')

The dmatrix will be a DMatrix instance with a shape of (4, 2) and feature_names of ['b', 'c'], while its label holds the values of column a.

Sometimes the dataframe is a complex data structure and only one column will be used as the features. We support this case by providing the data kwarg.

>>> df = pd.DataFrame({'a': [1, 2, 3, 4],
...                    'b': [[5, 1.0, 4], [6, 2.0, 3], [7, 3.0, 2], [8, 9.0, 1]]})
>>> vin_df_id = vineyard_client.put(df)
>>> dmatrix = vineyard_client.get(vin_df_id, data='b', label='a')

The dmatrix will have a shape of (4, 3), corresponding to the values of column b, while the label holds the values of column a.
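To see where the shape (4, 3) comes from, the following plain-Python sketch stacks the list-valued column into a feature matrix. The helper features_from_column is hypothetical and only mimics the effect of the data and label keyword arguments; it does not call vineyard or XGBoost.

```python
def features_from_column(columns, data, label):
    """Hypothetical helper: use one list-valued column as the feature matrix
    and another column as the label."""
    rows = [list(row) for row in columns[data]]
    width = len(rows[0]) if rows else 0
    if any(len(row) != width for row in rows):
        raise ValueError("all rows in the data column must have the same length")
    return rows, list(columns[label]), (len(rows), width)


columns = {'a': [1, 2, 3, 4],
           'b': [[5, 1.0, 4], [6, 2.0, 3], [7, 3.0, 2], [8, 9.0, 1]]}
matrix, labels, shape = features_from_column(columns, data='b', label='a')
print(shape)      # (4, 3)
print(labels)     # [1, 2, 3, 4]
print(matrix[3])  # [8, 9.0, 1]
```

Each of the 4 rows contributes its 3 list elements as features, which gives the 4 by 3 matrix.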

From Vineyard::RecordBatch

>>> import pyarrow as pa
>>> arrays = [pa.array([1, 2, 3, 4]), pa.array([3.0, 4.0, 5.0, 6.0]), pa.array([0, 1, 0, 1])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'target'])
>>> vin_rb_id = vineyard_client.put(batch)
>>> dmatrix = vineyard_client.get(vin_rb_id, label='target')

The dmatrix will have a shape of (4, 2) and feature_names of ['f0', 'f1'], while the label holds the values of column target.

From Vineyard::Table

>>> arrays = [pa.array([1, 2]), pa.array([0, 1]), pa.array([0.1, 0.2])]
>>> batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'label', 'f2'])
>>> batches = [batch] * 3
>>> table = pa.Table.from_batches(batches)
>>> vin_tab_id = vineyard_client.put(table)
>>> dmatrix = vineyard_client.get(vin_tab_id, label='label')

The dmatrix will have a shape of (6, 2) and feature_names of ['f0', 'f2'], while the label holds the values of column label.
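The row count of 6 follows from concatenating three batches of two rows each. The sketch below reproduces that arithmetic with plain dictionaries standing in for Arrow batches; concat_batches is a hypothetical helper, not a pyarrow or vineyard function.

```python
def concat_batches(batches):
    """Hypothetical helper: append column-wise batches row by row, similar in
    effect to building a table from a list of record batches."""
    names = list(batches[0])
    return {name: [v for batch in batches for v in batch[name]] for name in names}


batch = {'f0': [1, 2], 'label': [0, 1], 'f2': [0.1, 0.2]}
table = concat_batches([batch] * 3)

# Every column except the label becomes a feature.
feature_names = [name for name in table if name != 'label']
shape = (len(table['label']), len(feature_names))

print(shape)           # (6, 2)
print(feature_names)   # ['f0', 'f2']
print(table['label'])  # [0, 1, 0, 1, 0, 1]
```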

Nvidia-DALI

Vineyard supports integration of DALI pipelines.

>>> from nvidia.dali import pipeline_def
>>> pipeline = pipe(device_id=device_id, num_threads=num_threads, batch_size=batch_size)
>>> pipeline.build()
>>> pipe_out = pipeline.run()
>>> data_id = vineyard_client.put(pipe_out)
>>> vin_pipe = vineyard_client.get(data_id)

In this case, pipe is a function decorated with pipeline_def. The data returned by pipeline.run() can be stored in vineyard. The pipeline should return only two values, namely data and label, and both should be of type TensorList. The vin_pipe object will be the output of a simple built-in pipeline after pipeline.build() and pipeline.run() have been executed. It will simply return two values of type Pipeline.