Getting Started#

Installing vineyard#

Vineyard is distributed as a python package and can be easily installed with pip:

pip3 install vineyard

Installing etcd#

Vineyard is based on etcd, please refer the doc to install it.

Starting vineyard server#

$ python3 -m vineyard

A vineyard daemon server will be launched on the underlying machine with default settings. The default socket is /var/run/vineyard.sock, and it is listened by the server for ipc connections.

Tip

If you encounter errors like cannot launch vineyardd on '/var/run/vineyard.sock': Permission denied,, that means you don’t have the permission to create a UNIX-domain socket at /var/run/vineyard.sock, you could either

  • run vineyard as root, using sudo

  • or change the socket path to a different one, with the --socket command line option, like

    $ python3 -m vineyard --socket /tmp/vineyard.sock
    

A vineyard daemon server is a vineyard instance in a vineyard cluster. Thus, to start a vineyard cluster, we can simply start vineyardd over all the machines in the cluster, and make sure these vineyard instances can register to the same etcd_endpoint. The default value of etcd_endpoint is http://127.0.0.1:2379, and vineyard will launch the etcd_endpoint in case the etcd servers are not started on the cluster.

Tip

Use python3 -m vineyard --help for other parameter settings.

Connecting to vineyard#

Vineyard daemon serves clients via UNIX domain socket:

>>> import vineyard
>>> client = vineyard.connect('/var/run/vineyard.sock')

Here we established a vineyard client connected to the vineyardd instance via the IPC socket /var/run/vineyard.sock.

Getting and putting Python object#

>>> import numpy as np
>>> import vineyard.data.tensor
>>> arr = np.arange(8)
>>> arr_id = client.put(arr)
>>> arr_id
00002ec13bc81226
>>> shared_arr = client.get(arr_id)
>>> shared_arr
array([0, 1, 2, 3, 4, 5, 6, 7])

We first use client.put() to build the vineyard object from the local variable arr, which returns the object_id that is the unique id in vineyard to represent the object.

Then given the object_id, we can obtain a shared-memory object from vineyard with client.get().

Note

Note that shared_arr doesn’t allocate memory in the client process; instead, it shares the memory from the vineyard server.

Creating a dataframe#

>>> import numpy as np
>>> import pandas as pd
>>> import vineyard.data.dataframe
>>> df = pd.DataFrame({'u': [0, 0, 1, 2, 2, 3],
>>>                    'v': [1, 2, 3, 3, 4, 4],
>>>                    'weight': [1.5, 3.2, 4.7, 0.3, 0.8, 2.5]})
>>> df_id = client.put(df)
>>> shared_object = client.get_object(df_id)
>>> shared_object.typename
vineyard::DataFrame
>>> shared_df = client.get(df_id)
>>> shared_df
u  v  weight
0  0  1     1.5
1  0  2     3.2
2  1  3     4.7
3  2  3     0.3
4  2  4     0.8
5  3  4     2.5

We first build the vineyard dataframe object from pandas dataframe variable df, then to further understand the client.get() method, we use client.get_object() to get the vineyard object, and check its typename.

Actually, client.get() works in two steps, it first gets the vineyard object from vineyardd via client.get_object(), and then resolves the vineyard object based on the registered resolver.

In this case, when we import vineyard.dataframe, a resolver that can resolve a vineyard dataframe object to a pandas dataframe is registered to the resolver factory under the vineyard type vineyard::DataFrame, so that the client can automatically resolve the vineyard dataframe object. To further understand the registration design in vineyard, see I/O Drivers.

Shared Memory#

Vineyard supports shared memory interface of SharedMemory and ShareableList like things in multiprocessing.shared_memory.

The shared memory interface can be used in the following way:

>>> from vineyard import shared_memory
>>> value = shared_memory.ShareableList(client, [b"a", "bb", 1234, 56.78, True])
>>> value
ShareableList([b'a', 'bb', 1234, 56.78, True], name='o8000000119aa10c0')
>>> value[4] = False
>>> value
ShareableList([b'a', 'bb', 1234, 56.78, False], name='o8000000119aa10c0')

Caution

Note that the semantic of the vineyard’s shared_memory is slightly different with the shared_memory in python’s multiprocessing module. Shared memory in vineyard cannot be mutable after been visible to other clients.

We have added a freeze method to make such transformation happen:

>>> value.freeze()

After being freezed, the shared memory (aka. the ShareableList in this case) is available for other clients:

>>> value1 = shared_memory.ShareableList(client, name=value.shm.name)
>>> value1
ShareableList([b'a', 'bb', 1234, 56.78, False], name='o8000000119aa10c0')

For more details, see Shared memory.

Using streams#

Vineyard supports streaming to facilitate big data pipelining.

Open a local file as a dataframe stream#

>>> from vineyard.io.stream import open
>>> stream = open('file://twitter.e')
>>> stream.typename
vineyard::DataframeStream

In practice, the file may be stored in an NFS, and we want to read the file in parallel to further speed up the IO process.

To further understand the implementation of the driver open, and the underlying registration mechanism for drivers in vineyard, see also I/O Drivers.