News

DDRaid was demoed at LCA last week, running on a 12 node personal cluster that Peter Anvin of Orion Multisystems hand-carried halfway around the world expressly for the demo.

Bad news and good news. The bad news: ddraid crashed in the demo. The good news: ddraid crashed only because a buggy old ddraid driver that can't handle an Ext3 mkfs was on the machine. Otherwise it worked pretty well and performed to spec (on every try other than the actual demo). Slides for the LCA DDRaid presentation are here. The mpeg of the balloon ride is too big to post here.

Introduction

This is the home page of the Distributed Data Raid block device project. DDRaid is a device mapper extension that lets you run a higher-level software raid array in which each member of the raid is a separate cluster node rather than a local disk. It also works for ordinary software raid, and is actually pretty efficient.

A cluster raid device together with a cluster filesystem like GFS or OCFS2 creates a distributed data cluster (the "dd" in "ddraid") that does not rely on a single shared disk. The cluster raid 3.5 array is redundant at the node level, so one data node can fail without losing any data. The cluster raid device will automatically find another node in the cluster to replace the missing one, in order to restore the safety factor.

Besides extra safety, the cluster raid array offers increased performance, particularly for linear IO loads; random IO loads perform no worse than a single raw disk. In other words, performance is never worse than a single disk, and often far better.

Similar to the existing device mapper mirror device, a ddraid array keeps a persistent record on disk of which regions of the array are currently being written, so that in the event of a system crash only those regions need to be resynced (by recomputing parity blocks).

A cluster may contain both data nodes, which are members of the distributed data array, and ordinary nodes, which have full access to the array data via the cluster filesystem. Data nodes themselves may also access the shared filesystem.

DDRaid is based on the raid 3.5 model, which I investigated a couple of years ago but did not implement. A ddraid array can only have certain numbers of members in practice: 2, 3, 5 or 9, that is, one parity member plus a power-of-two number of data members. However, each of these members can itself be an array, so any multiple of these numbers is possible. DDRaid arrays can also be joined together linearly, and do not have to have the same number of members, so there is considerable flexibility in how a distributed data cluster may be configured.

A ddraid device consists of three components: the device mapper target itself, which runs in the kernel; a user space synchronization server (ddraid-server); and a user space cluster agent that hooks the other two up to whatever cluster infrastructure happens to be running.
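As an aside, the node-level redundancy comes down to ordinary XOR parity, as in raid 3. Here is a minimal sketch of the arithmetic (illustrative code, not ddraid's actual implementation): the parity block of each stripe is the XOR of its data blocks, and any one missing block is rebuilt by XORing the parity with the survivors.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: XOR parity as used by raid 3-style arrays.
     * Each stripe has some data blocks plus one parity block. */

    /* Recompute the parity block of one stripe from its data blocks. */
    static void compute_parity(uint8_t *const data[], unsigned data_members,
                               uint8_t *parity, size_t blocksize)
    {
            for (size_t i = 0; i < blocksize; i++) {
                    uint8_t x = 0;
                    for (unsigned m = 0; m < data_members; m++)
                            x ^= data[m][i];
                    parity[i] = x;
            }
    }

    /* Rebuild one missing data block from the parity block and the
     * surviving data blocks; this is why a single member (node) can
     * fail without losing any data. */
    static void rebuild_block(uint8_t *const survivors[], unsigned count,
                              const uint8_t *parity, uint8_t *missing,
                              size_t blocksize)
    {
            for (size_t i = 0; i < blocksize; i++) {
                    uint8_t x = parity[i];
                    for (unsigned m = 0; m < count; m++)
                            x ^= survivors[m][i];
                    missing[i] = x;
            }
    }

Crash resync is the same computation, applied only to the regions the dirty log says were in flight.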
A consequence of this infrastructure independence is that you can try out the ddraid device without installing any cluster patches. You only need the tarball linked below. [This isn't quite true at the moment, because the non-cluster ddraid agent still has a few gdlm dependencies, which will be fixed soon.]

Project Team

Once again, that would be me!

Status

First benchmarks on a realistic configuration were achieved recently, using an Ext2 filesystem. They show that a five member array can be up to 62% faster than a single, raw disk. Degraded mode operation (one member failed) was recently implemented. This project began life as a cluster mirror, and the server still only knows about mirrors at this point, not how to reconstruct parity blocks. Some work remains to be done on client failure recovery and server failover. Disk errors are not handled yet, though the mechanism for continuing to operate with a failed disk is in place and tested. There are several known bugs. In other words, don't use ddraid for real data.

Hackers and tire kickers are cordially invited to download the tarball and try it out.

Source code

This is ddraid.0.0.5.tgz, the tarball that should have been used last week at LCA; with it, the demo would not have hung while making the Ext3 filesystem.
It has also been (lightly) tested with Ext2, ReiserFS, GFS and
OCFS2. Device bringup is still somewhat manual, as is failover
because ddraid-cman-agent.c hasn't been coded yet. Assignment of
the dirty map partition is also manual (part of the ddraid-server
command line). This ddraid snapshot does exhibit the linear performance acceleration claimed above, achieving close to the aggregate platter speed of the array, less the parity disk (for a five member array, that ceiling is the combined bandwidth of the four data disks). The code lets you turn off global synchronization entirely, or just turn off dirty logging. This makes a modest difference to performance, but of course you then have to do the synchronization yourself, which could get a little tedious (though for an active/passive Ext2/3 configuration, turning off global synchronization might be just the thing to get that last drop of performance).
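To make the dirty logging tradeoff concrete, here is a rough sketch of how a region dirty log of this general kind works (hypothetical names and layout, not ddraid's actual structures): a region's dirty bit must reach disk before the first write to that region proceeds, while clearing bits can happen lazily.

    #include <stdint.h>

    /* Hypothetical sketch of a region dirty log, in the spirit of the
     * device mapper mirror log that ddraid's logging resembles. One
     * bit per region; the bitmap lives on the dirty map partition. */

    #define REGIONS 1024

    struct dirty_log {
            uint8_t bits[REGIONS / 8];
            int flush_needed;       /* bitmap must hit disk before writing */
    };

    static void mark_dirty(struct dirty_log *log, unsigned region)
    {
            log->bits[region >> 3] |= 1 << (region & 7);
            log->flush_needed = 1;  /* sync the log, then let the write go */
    }

    static void clear_dirty(struct dirty_log *log, unsigned region)
    {
            /* May be written back lazily: a stale dirty bit only costs
             * a little extra resync work after a crash. */
            log->bits[region >> 3] &= ~(1 << (region & 7));
    }

    static int is_dirty(const struct dirty_log *log, unsigned region)
    {
            return log->bits[region >> 3] & (1 << (region & 7));
    }

After a crash, only regions with set bits need their parity recomputed. Turning dirty logging off removes the pre-write log flush, but gives up that bounded resync.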
Here is the latest tarball, to be built against kernel 2.6.11.3. This code is suitable for unit testing and experimenting, not live use.
Tracing output is on by default and needs to be turned off for any reasonable performance; follow the directions in the README.
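For what it's worth, this kind of trace switch is usually a compile-time toggle along the following lines (illustrative names; see the source for the real switch):

    #include <stdio.h>

    /* Illustrative compile-time trace toggle: point `trace` at trace_on
     * while debugging, at trace_off for benchmark runs. */
    #define trace_on(args)  do { printf args; } while (0)
    #define trace_off(args) do { } while (0)

    #define trace trace_on  /* change to trace_off for performance */

    int main(void)
    {
            trace(("syncing region %u\n", 42u));
            return 0;
    }

With trace_off selected, the argument list is discarded at compile time, so disabled tracing costs nothing at runtime.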
The ddraid tarball 0.0.6 is around here somewhere and will be posted soon. It's, um... better :-)

Note Note Note! This is still pre-alpha. Do not use it on a filesystem you care about.

To get the latest source from CVS: at the login password prompt, just hit enter. But as of today, there isn't anything in CVS! Soon.

Documentation

Mailing lists

linux-cluster is the mailing list for cluster-related questions and discussion. Whenever the development source code repository is updated, email is sent to the cluster-cvs mailing list.

IRC

Channel #linux-cluster on freenode

Links