Recent changes:
02 Jan 2008 - Revised question: How do I change the time after which a non-responsive node is considered dead?
04 Dec 2007 - Added question: My RHEL5 or similar cluster won't work with my Cisco switch.
09 Nov 2007 - Added question: On Fedora 8, CMAN won't start, complaining about "aisexec not started". How do I fix it?
18 Oct 2007 - Revised question: What's the "right" way to get cman to use a different NIC, say, eth2 rather than eth0?
07 Aug 2007 - Revised question: What is a tie-breaker, and do I need one in two-node clusters?
26 Jul 2007 - Revised question: I want to use GFS for Samba (smb) file serving. Is that okay?
05 Jul 2007 - Added question: I get 'generic error' while trying to start a Xen guest as a service, how do I fix it?
20 Jun 2007 - Revised answer: If my cluster is mission-critical, can I override quorum rules and have a "last-man-standing" cluster that's still functioning?
20 Jun 2007 - Added question: Do I really need a shared disk to use QDisk?
20 Jun 2007 - Revised question: How do I set up a quorum disk/partition?
05 Jun 2007 - Revised question: What ports do I have to enable for the iptables firewall?
14 May 2007 - Revised question: On RHEL5, why do I get "cman not started: Can't bind to local cman socket /usr/sbin/cman_tool"?
11 May 2007 - Revised question: What's the "right way" to propagate the cluster.conf file to a running cluster?
01 May 2007 - Added question: Can I speed up the time it takes to fail over a service?
01 May 2007 - Added question: On RHEL5, why do I get "cman not started: Can't bind to local cman socket /usr/sbin/cman_tool"?
01 May 2007 - Clarified question: In RHEL3, what is the explanation for maximum restarts and maximum false restarts?
If you have corrections, please send them to Bob Peterson: rpeterso@redhat.com
If you have questions, please send them to the mailing list: linux-cluster@redhat.com
To subscribe to the linux-cluster mailing list, please visit the following page:
https://www.redhat.com/mailman/listinfo/linux-cluster
The Cluster Project is a set of components designed to enable clustering: a group of computers sharing resources such as storage devices and services. Clustering ensures data integrity when people work on shared devices from multiple machines (or virtual machines) at the same time.
Red Hat Cluster Suite is a marketing term under which some of this software is promoted. Red Hat has bundled components from the cluster project together and made them available for its various platforms.
Somewhere around 1996, Red Hat developed its first Cluster Suite, which primarily managed cluster-cooperative services. That's the equivalent of rgmanager now.
Between 1997 and 2003, Sistina Software, a spin-off of a project at the University of Minnesota, developed a clustering file system that became the Global File System (GFS), which it sold to customers.
In 2004, Red Hat, Inc. bought Sistina, merged GFS into its Cluster Suite, and open-sourced the whole thing.
Today, the cluster project belongs to the people and is available for free to the public through Red Hat's CVS repository. The open-source community continues to improve and develop the cluster project with new clustering technology and infrastructures, such as OpenAIS.
That depends on what version you are using. Like all active technology, it is constantly evolving. The Cluster Project involves development in many different areas including:
Assuming you have all the necessary pieces and/or RPMs in place, there are four ways to configure a cluster:
The cluster configuration system (ccs) tries to manage the cluster.conf file and keep all the nodes in sync. If you make changes to the cluster.conf file, you have to tell ccs and cman that you did it, so they can update the other nodes. If you don't, your changes are likely to be overwritten with an older version of the cluster.conf file from a different node. See the next question.
The cluster configuration GUIs take care of propagating changes to cluster.conf to your cluster. The system-config-cluster GUI has a big button that says "Send to Cluster". If you're maintaining your cluster.conf file by hand and want to propagate it to the rest of the cluster, do this:
Note: For RHEL5 and similar, the cman_tool version -r step is no longer necessary.
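In command form, the propagation looks like this (a sketch: remember to increment config_version inside cluster.conf before pushing, and replace 42 with your actual new version number):

```shell
# Run on one cluster node after editing /etc/cluster/cluster.conf:
ccs_tool update /etc/cluster/cluster.conf   # push the new file to all nodes
cman_tool version -r 42                     # tell cman about config version 42
```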
A list of options can be found at the following link. I won't guarantee it's complete, but it's pretty close:
http://sources.redhat.com/cluster/doc/cluster_schema.html
Take a look at the man page for cluster.conf (5). There's also a small example in the usage.txt file: cluster/doc/usage.txt
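For the impatient, here is a minimal sketch of a two-node cluster.conf with APC power fencing. All node names, addresses and passwords here are made up; check the schema link above and the man page for the real attribute list:

```xml
<?xml version="1.0"?>
<cluster name="alpha_cluster" config_version="1">
  <clusternodes>
    <clusternode name="node-01" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="apc1" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node-02" votes="1" nodeid="2">
      <fence>
        <method name="single">
          <device name="apc1" port="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="apc1" agent="fence_apc" ipaddr="10.0.0.10"
                 login="apc" passwd="apc"/>
  </fencedevices>
  <rm/>
</cluster>
```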
The GFS 6.0 cluster code runs on the 2.4.xx series kernels (for Red Hat Enterprise Linux 3). The GFS 6.1 code runs on the 2.6.xx series kernels for Red Hat Enterprise Linux 4, Fedora Core and other distributions.
The source code for the current development tree is kept in the Red Hat CVS repository: http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/?cvsroot=cluster
You can check the entire source code tree out from CVS with this command:
cvs -d :ext:sources.redhat.com:/cvs/cluster co cluster
The cluster project code in our CVS head is development code. If you want the stable version, you can check it out from CVS with this command:
cvs -d :ext:sources.redhat.com:/cvs/cluster co -r STABLE cluster
The cluster project was primarily designed to run on Linux. Some of the cluster infrastructure, such as OpenAIS, has been successfully ported to FreeBSD and possibly Darwin.
The project page is: here
It depends on which components you need to use. For a basic cluster, all you need is two or more computers and a network between them. If you want to use GFS, you'll need shared storage.
This is a moving target, so it's hard to give up-to-date information. However, without naming names, the largest single GFS cluster in production that we know of was at an oil and gas company: 152 nodes directly on a SAN (McData switches, QLogic 1Gb HBAs and LSI storage). That customer ran the cluster for almost two years, using GULM locking, but it is no longer in use; the company was acquired by a larger company and the architecture changed.
As of this writing, we haven't tested GFS 6.1 with DLM locking past 31 nodes.
It depends on what you're planning to do. The point of using GFS and CLVM is that you have storage you want to share between machines concurrently. Without shared storage, you have a local filesystem and lvm2, neither of which need the cluster infrastructure. If you want to use the cluster infrastructure for High Availability services, you don't need shared storage.
Yes. They are here:
And, of course, this FAQ.
These ports should be enabled:
Port | Program | Protocol |
41966 | rgmanager/clurgmgrd | tcp |
41967 | rgmanager/clurgmgrd | tcp |
41968 | rgmanager/clurgmgrd | tcp |
41969 | rgmanager/clurgmgrd | tcp |
50006 | ccsd | tcp |
50007 | ccsd | udp |
50008 | ccsd | tcp |
50009 | ccsd | tcp |
21064 | dlm | tcp |
6809 (RHEL4 and under) 5405 (RHEL5 and above) | cman (RHEL4 and under) openais (RHEL5 and above) | udp |
14567 | gnbd | tcp |
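As a sketch, the table above translates into iptables rules like these (run as root; restricting the rules to your cluster subnet with -s is a good idea but omitted here for brevity):

```shell
# Open the cluster ports listed above (RHEL4-style; on RHEL5 use
# udp 5405 for openais instead of udp 6809 for cman).
for p in 41966 41967 41968 41969 50006 50008 50009 21064 14567; do
    iptables -I INPUT -p tcp --dport $p -j ACCEPT
done
iptables -I INPUT -p udp --dport 50007 -j ACCEPT   # ccsd
iptables -I INPUT -p udp --dport 6809  -j ACCEPT   # cman (RHEL4 and under)
service iptables save                              # make the rules persistent
```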
Yes. From time to time, we build the STABLE branch against different kernels and release the tarballs. You'll find them here: ftp://sources.redhat.com/pub/cluster/releases/
The cluster software isn't specific to any Linux distribution or release. However, many of the users are running the software on Red Hat Enterprise Linux (RHEL). Most customers are currently running on RHEL4 (or the RHEL4 equivalent of CentOS, or at least the RHEL4 branch of the source tree in CVS). So they may want to know the differences between the way things work now in RHEL4 and how they'll work in RHEL5.
This list is by no means complete, but these are the differences I know about offhand:
Actually, the plan is to release this in RHEL4 U5.
However, since users can use a cluster without clvmd, gfs or rgmanager, they are still separate init scripts.
The logical volume manager supports a new "locking_type = 3" setting, which selects the appropriate locking for clustered and non-clustered volumes.
See question "What improvements will GFS2 have over GFS(1)?"
Some of the less noticed internal changes:
They were accepted into the 2.6.18 upstream kernel by kernel.org.
It's definitely not a good idea to mix the two within a single cluster. With the introduction of RHEL5, there are now two distinct and separate cluster infrastructures. The older (RHEL4 or STABLE branch in CVS) infrastructure passes cluster messages using a kernel module (cman or the one internal to gulm). The newer infrastructure (RHEL5 or HEAD branch in CVS) passes cluster messages using openais and userland daemons. If you try to mix and match the two, it will not work.
That said, you could probably still fetch the STABLE branch of the cluster code from CVS, compile it on a RHEL5 system, and have it interact properly in a RHEL4 cluster through the old infrastructure. Since the STABLE branch tracks the upstream kernel, you may also need to build a newer kernel from source code as well on the RHEL5 system.
It would be extremely difficult, if not impossible, to go the other way around (i.e. to get the new infrastructure and openais running on a RHEL4 system so it could interact with a RHEL5 cluster).
Yes you can. For example, you could have a single computer, running Xen virtualization, act as a complete cluster consisting of several Xen guests. There are special fencing issues to consider. For example, if you use power fencing, one guest could cause the whole machine to be powered off and never come back (because it wouldn't be alive to tell the power switch to power back on). There is a special fencing agent designed to reboot Xen guests as needed.
You can also create clusters made of several computers, each of which has several virtual Xen guest nodes. This has other fencing complications. For example, a Xen guest can't use a simple Xen fencing agent to reboot a Xen guest that's running on a different physical computer.
As I understand it, the problem is due to the fact that Xen nodes tear down and rebuild the Ethernet NIC after cluster suite has started. We're working on a more permanent solution. In the meantime, here is a workaround:
GFS is the file system that runs on each of the nodes in the cluster. Like all file systems, it is basically a kernel module that runs on top of the vfs (virtual file system) layer of the kernel. It controls how and where the data is stored on a block device or logical volume. In order to make a cluster of computers ("nodes") cooperatively share the data on a SAN, you need GFS's ability to coordinate with a cluster locking protocol. One such cluster locking protocol is dlm, the distributed lock manager, which is also a kernel module. Its job is to ensure that nodes in the cluster that share the data on the SAN don't corrupt each other's data.
Many other file systems, such as ext3, are not cluster-aware; data kept on a volume shared between multiple computers would therefore quickly become corrupt.
You need some form of shared storage - Fibre Channel and iSCSI are typical. If you don't have Fibre Channel or iSCSI, look at GNBD instead. Also, you need two or more computers and a network connection between them.
No. GFS will only allow PCs with shared storage, such as a SAN with a Fibre Channel switch, to work together cooperatively on the same storage. Off-the-shelf PCs don't have shared storage.
GFS 6.1 (on RHEL 4) supports 16TB when any node in the cluster is running 32 bit RHEL. If all nodes in the cluster are 64-bit RHEL (x86-64, ia64) then the theoretical maximum is 8 EB (exabytes). We have field reports of 45 and 50 TB file systems. Testing these configurations is difficult due to our lack of access to very large array systems.
I've seen more than one 45TB GFS file system. If you know of a bigger one, I'd love to hear from you.
Currently, gfs and gfs2 do not use milliseconds for file timestamps; they use seconds. This is to maintain compatibility with the underlying vfs layer of the kernel. If the kernel changes to milliseconds, we will also change.
People don't normally care about millisecond timestamps; they matter mostly to computers doing things like NFS file serving, for example to see whether another computer has changed the data on disk since the last known request. For GFS2, we're planning to implement inode generation numbers to keep track of these things more accurately than a timestamp can.
If I do:
[root@node-01#] gfs_tool setflag inherit_directio my_directory
[root@node-01#] gfs_tool gettune my_directory
It displays:
new_files_directio = 0
Here's what's going on: inherit_directio and new_files_directio are two separate things. If you look at the man page, inherit_directio operates on a single directory whereas new_files_directio is a filesystem-wide "settune" value. If you do:
gfs_tool setflag inherit_directio my_directory
You're telling the fs that ONLY your directory and all new files within that directory should have this attribute, which is why your tests act as expected as long as you're within that directory. It basically sets an attribute on an in-memory inode for the directory.
If instead you were to do:
gfs_tool settune mount-point new_files_directio 1
The new_files_directio value would change for the whole mount point, not just that directory. And of course, gfs_tool gettune my_directory reports that filesystem-wide flag, which is why it still shows 0.
No, it's not true. What it prevents is data corruption as a result of the node waking up and erroneously issuing writes to the disk when it shouldn't.
The simple fact is that no one can guarantee against loss of data when a computer goes down. If a client goes down in the middle of a write, its cached data will be lost. If a server goes down in the middle of a write, cached data will be lost unless the file system is mounted with the "sync" option. Unfortunately, the "sync" option has a huge performance penalty. GFS's journaling should minimize and/or guard against this loss.
With NFS failover, if a server goes down in the middle of an NFS request (which is far more likely), the failed NFS service should be failed over to another GFS server in the cluster. The NFS client should get a timeout on its write request, and that will cause it to retry the request, which should go to the server that has taken over the responsibilities of the failed NFS server. And GFS will ensure the original server having the problem will not corrupt the data.
You probably mistyped the cluster name on mkfs. Use the 'dmesg' command to see what GFS is complaining about. If that's the problem, you can use gfs_tool (or redo the mkfs) to fix it.
Even if this is not your problem, if you have a problem mounting, always use dmesg to view complaints from the kernel.
It depends on whether you're using GULM or DLM locking. If you're using DLM, use this command from a node that has it mounted:
cman_tool services
If you're using GULM, or aren't on a node that has it mounted, here's another way to do it:
for i in `grep "<clusternode name" /etc/cluster/cluster.conf | cut -d '"' -f2` ; do ssh $i "mount | grep gfs" ; done
Unlike ext3, GFS will dynamically allocate inodes as it needs them. Therefore, it's not a problem.
It depends on file size and file system block size. Assuming the file system block size is a standard 4K, let's do the math: A GFS inode is 232 bytes (0xe8) in length. Therefore, the most data you can fit along with an inode is 4096 - 232 = 3864 bytes. By the way, in this case we say the file "height" is 0.
Slightly bigger and the file needs a single level of indirection, also known as height 1. The inode's 3864 bytes are now used to hold a group of block pointers. These pointers are 64 bits (8 bytes) each, so exactly 483 of them fit in the inode block after the disk inode header. With all 483 pointers to 4K blocks, you have at most 1.88MB.
If your file gets over 1.88MB, it will need a second level of indirection (height 2); each indirect block has a 24-byte (0x18) header and 64 bytes of reserved space. That means your inode will have at most 483 pointers to 4K blocks, each of which can hold 501 block pointers. So 483 * 501 = 241983 blocks, or 991162368 bytes of data (945MB).
If your file is bigger than 945MB, you'll need a third level of indirection (height 3), which means your file can grow to have 945MB of pointers, enough for 121233483 pointers. The file can grow to 496572346368 bytes, or 473568MB, also known as 462GB.
Still bigger, at height 4, we get a max file size of 248782745530368, also known as 231696GB or 226TB.
If your file is bigger than 226TB, (egads!) height 5, max file size is 124640155510714368 bytes, also known as 113359TB.
Also, extended attributes like ACLs, if used, take up more blocks.
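The arithmetic above can be reproduced with plain shell arithmetic. This is just a sketch; the 232-byte disk inode and 88-byte indirect-block header sizes are taken from the text above:

```shell
# Recompute the GFS per-height file size maximums, assuming a 4K
# file system block size.
block=4096
inline=$((block - 232))            # 3864 bytes left in the inode block
p0=$((inline / 8))                 # 483 pointers fit after the disk inode
pp=$(((block - 88) / 8))           # 501 pointers per indirect block
h1=$((p0 * block))                 # height 1 max: 1978368 bytes (~1.88MB)
h2=$((p0 * pp * block))            # height 2 max: 991162368 bytes (~945MB)
h3=$((p0 * pp * pp * block))       # height 3 max: 496572346368 bytes (~462GB)
h4=$((p0 * pp * pp * pp * block))  # height 4 max: 248782745530368 bytes (~226TB)
echo "$h1 $h2 $h3 $h4"
```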
Yes you can. Since GFS can manage the contents of a block device (SCSI, logical volume, etc), there is still the underlying logical volume manager, LVM2, that takes care of things like spanning physical volumes, striping, hardware RAID, mirroring and such. For clusters, there is a special version of LVM2 called CLVM that is needed, but not much changes other than the locking protocol specified in /etc/lvm/lvm.conf.
Note that GFS won't work properly in a cluster with software RAID (the MD driver). At the time of this writing, software RAID is not cluster-aware. Since software RAID can only be running on one node in the cluster, the other nodes will not be able to see the data properly, or will likely destroy each other's data. However, if GFS is used as a stand-alone file system on a single-node, software RAID should be okay.
Sometime after 2.6.15, the upstream kernel changed from using the semaphores (i_sem) within the VFS layer to using mutexes (i_mutex). If your Linux distribution is running an older kernel, you may not be able to compile GFS.
Your choices are: (1) upgrade your kernel to a newer one, or (2) downgrade your GFS or change the source code so that it uses semaphores like before. Older versions are available from CVS.
Because this is an open-source project, it's constantly evolving, as is the Linux kernel. Compile problems are to be expected (and usually easily overcome) unless you are compiling against the exact same kernel the developers happen to be using at the time.
Surprisingly, yes. ATIX has a SourceForge project called "Open-Sharedroot" for this purpose.
Visit http://www.open-sharedroot.org/ for more information.
There's a quick how-to at:
http://www.open-sharedroot.org/documentation/the-opensharedroot-mini-howto.
Mark Hlawatschek from Atix gave a presentation about this at the 2006 Red Hat Summit. His slides can be seen here:
http://www.atix.de/downloads/vortrage-und-workshops/ATIX_Shared-Root-Cluster.pdf.
Yes, with the following caveats:
See the following for more information:
Red Hat GFS: Installing and Configuring Oracle9i RAC with GFS:
http://www.redhat.com/docs/manuals/csgfs/oracle-guide/
RAC Technologies Compatibility Matrix for Linux Clusters:
http://www.oracle.com/technology/products/database/clustering/certify/tech_generic_linux.html
RAC Technologies Compatibility Matrix for Linux x86 Clusters:
http://www.oracle.com/technology/products/database/clustering/certify/tech_linux_x86.html
RAC Technologies Compatibility Matrix for Linux x86-64 (AMD64/EM64T) Clusters:
http://www.oracle.com/technology/products/database/clustering/certify/tech_linux_x86_64.html
Oracle Certification Environment Program:
http://www.oracle.com/technology/software/oce/oce_fact_sheet.htm
Not currently. However, playing this song at high volume in your data center has been rumored to introduce entropy into the GFS+RAC configuration. Please consider Mozart or Chopin instead.
Yes, that's a joke, ha ha... Yes and no. Yes, it's possible, and one application will not block the other. No, because only one node can cache the contents of the inode in question at any given time, so performance may be poor. The application should use some kind of locking (for example byte-range locking, i.e. fcntl) to protect the data.
However, GFS does not excuse the application from locking to protect the data. Two processes trying to write data to the same file can still clobber each other's data unless proper locking is in place to prevent it.
Here's a good way to think about it: GFS will make two or more processes on two or more different nodes be treated the same as two or more processes on a single node. So if two processes can share data harmoniously on a single machine, then GFS will ensure they share data harmoniously on two nodes. But if two processes would collide on a single machine, then GFS can't protect you against their lack of locking.
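That cooperation can be sketched from the shell with flock(1). This is ordinary application-level locking, not a GFS-specific facility, but the same idea carries across nodes since GFS flocks are cluster-wide (the file paths here are just examples):

```shell
# Two writers (imagine them on different nodes against a GFS file)
# serialize their appends through an exclusive lock, so their updates
# cannot interleave mid-write.
LOG=/tmp/shared.log
LOCK=/tmp/shared.log.lock
append_locked() {
    (
        flock -x 9              # block until we own the lock
        echo "$1" >> "$LOG"
    ) 9>"$LOCK"
}
: > "$LOG"
append_locked "update from process A"
append_locked "update from process B"
```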
If you have shared storage that you need to mount read/write, then you still need it. Perhaps it's best to explain why with an example.
Suppose you had a fibre-channel linked SAN storage device attached to two computers, and suppose they were running in a cluster, but using EXT3 instead of GFS to access the data. Immediately after they mount, both systems would be able to see the data on the SAN. Everything would be fine as long as the file system was mounted as read-only. But without GFS, as soon as one node writes data, the other node's file system doesn't know what's happened.
Suppose node A creates a file, assigns inode number 4351 to it, and writes 16K of data to it in blocks 3120 and 2240. As far as node B is concerned, there is no inode 4351, and blocks 3120 and 2240 are free. So it is free to create its own inode 4351 and write data to block 2240, still believing block 3120 is free. The file system's maps of used and unused data areas would soon overlap, as would the inode numbers. It wouldn't take long before the whole file system was hopelessly corrupt, along with the files inside it.
With GFS, when node A assigns inode 4351, node B automatically knows about the change, and the data is kept harmoniously on disk. When one data area is allocated, all nodes in the cluster are aware of the allocation, and they don't bump into one another. If node B needs to create another inode, it wouldn't choose 4351, and the file system would not be corrupted.
However, even with GFS, if nodes A and B both decide to operate on a file X, even though they both agree on where the data is located, they can still overwrite the data within the file unless the program doing the writing uses some kind of locking scheme to prevent it.
If you turn set-group-ID on and then turn group-execute off, you mark a file for mandatory locking, and 'ls' shows a capital 'S' in the group-execute position. A file with both the group-execute bit and set-group-ID on (the result of a chmod 2770) is not marked for mandatory locking, and looks like this in 'ls':
-rwxrws--- 1 tangzx2 idev 347785 Jan 17 10:22 temp.txt
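The mode-bit encoding can be demonstrated on any file system, GFS included (a sketch; the file path is an example):

```shell
# setgid with group-execute OFF marks a file for mandatory locking
# and shows a capital 'S'; with group-execute ON it shows 's' instead.
f=/tmp/mlock.demo
touch "$f"
chmod 2640 "$f"                 # setgid on, group-execute off
ls -l "$f" | cut -c1-10         # -rw-r-S---  (marked for mandatory locking)
chmod 2770 "$f"                 # setgid on, group-execute on
ls -l "$f" | cut -c1-10         # -rwxrws---  (NOT mandatory locking)
```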
Not really. The gfs_mkfs command decides exactly where everything should go and you have no choice in the matter. The volume is carved into logical "sections." The first and last sections are for multiple resource groups, based roughly on the rg size specified on the gfs_mkfs commandline. The journals are always placed between the first and last section. Specifying a different number of journals will force gfs_mkfs to carve the section size smaller, thus changing where your journals will end up.
Only insofar as Linux is. Linux isn't 100% POSIX-compliant, but GFS is as compliant as any other file system can be under Linux.
No. GFS and GFS2 do not currently have the ability to shrink, so you cannot reduce the size of your file system.
Mostly due to design constraints. An ls -r * can simply traverse the directory structures, which is very fast. An ls -lr * has to traverse the directory, but also has to stat each file to get more details for the ls. That means it has to acquire and release a cluster lock on each file, which can be slow. We've tried to address these problems with the new GFS2 file system.
It is possible to create GFS on an MD device as long as you are only using it for multipath. Software RAID is not cluster-aware and therefore not supported with GFS. The preferred solution is to use device mapper (DM) multipathing rather than md in these configurations.
Put it in /etc/fstab.
During startup, the "service gfs start" script (/etc/rc.d/init.d/gfs) gets called by init. The script checks /etc/fstab to see if there are any gfs file systems to be mounted. If so, it loads the gfs device driver and appropriate locking module, assuming the rest of the cluster infrastructure has been started.
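For example, an /etc/fstab entry might look like this (the device and mount point are made up; substitute your own):

```
/dev/vg_cluster/lv_gfs  /mnt/gfs  gfs  defaults  0 0
```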
GFS2 will address some of the shortcomings of GFS1:
I don't think anyone has speculated about this, and it's still too early for performance comparisons.
With GFS, the first node to access a file becomes its lock master. Therefore, access to that file will be faster than other nodes.
In the RHEL4 and STABLE branches of the code in CVS, SELinux is not currently supported.
In the development version (HEAD) and in upcoming releases, this support is built in.
That depends highly on the type of hardware that it's running on. File system check (fsck) operations take a long time regardless of the file system, and we'd rather do a thorough job than a fast one.
Running it in verbose mode (-v) will also slow it down considerably.
We recently had a report of a 45TB GFS file system on a dual Opteron 275 (4GB RAM) with 4Gb Fibre Channel to six SATA RAIDs. The 4GB of RAM was not enough to do the fsck; gfs_fsck required about 15GB to do the job, so a large swap drive was added. It took 48 hours for gfs_fsck to run to completion without verbose mode.
Yes it does.
Yes, but you need to be careful.
If you only want one MySQL server running, (Active-Passive) there's no problem. You can use rgmanager to manage a smooth failover to redundant MySQL servers if your MySQL server goes down. However, you should be aware that in some releases, the mysql init script has an easily-fixed problem where it doesn't return the proper return code. That can result in rgmanager problems with starting the service.
If you want multiple MySQL services running on the cluster (Active-Active), that's where things get tricky. You can still use rgmanager to manage your MySQL services for High Availability. However, you need to configure MySQL so that:
If you don't follow these rules, the multiple mysqld servers will not play nice in the cluster and your database will likely be corrupted.
For information on configuring MySQL, visit the mysql web site: http://www.mysql.com
MySQL also sells a clustered version of MySQL called "MySQL Cluster", but that does its own method of clustering, and is completely separate from Cluster Suite and GFS. I'm not sure how it would interact with our cluster software. For more information, see: http://www.mysql.com/products/database/cluster/
It depends on where you keep your databases.
If you keep your databases on shared storage, such as a SAN or iSCSI, you should use a cluster-aware file system like GFS to keep the file system sane with the multiple nodes trying to access the data at the same time. You can easily use rgmanager to manage the servers, since all the nodes will be seeing the same data. Without a cluster file system like GFS, there's likely to be corruption on your shared storage.
If your databases are on storage that is local to the individual nodes (i.e. local hard drives), there are no data corruption issues, since the nodes won't have access to the storage on other nodes where the data is kept. However, if you plan to use rgmanager to provide High Availability (Active-Passive) for each of your database servers, you will probably want to keep a copy of each database on every node, so that any node can serve the database of a node that fails. You may have to copy it often, too, or the backup copy will quickly get out of sync with the original it backs up. Copying these databases between nodes can be tricky, so you may need to follow special instructions on the MySQL web site: http://www.mysql.com
Yes it is, for high-availability only (like MySQL, PostgreSQL is not yet cluster-aware). We even have a RG Manager resource agent for PostgreSQL 8 (only) which we plan to release in RHEL4 update 5. There is a bugzilla to track this work:
It depends on what you want to do with it.
You can serve samba from a single node without a problem.
If you want to use samba to serve the same shared file system from multiple nodes (clustered samba aka samba in active/active mode), you'll have to wait: there are still issues being worked out regarding clustered samba.
If you want to use samba with failover to other nodes (active/passive) it will work but if failover occurs, active connections to samba are severed, so the clients will have to reconnect. Locking states are also lost. Other than that, it works just fine.
When a node fails, cman detects the missing heartbeat and begins the process of fencing the node. The cman and lock manager (e.g. lock_dlm) prevent any new locks from being acquired until the failed node is successfully fenced. That has to be done to ensure the integrity of the file system, in case the failed node, now out of communication with the rest of the cluster, tries to write to the file system after the other nodes have detected the failure.
The fence is considered successful after the fence script completes with a good return code. After the fence completes, the lock manager coordinates the reclaiming of the locks held by the node that had failed. Then the lock manager allows new locks and the GFS file system continues on its way.
If the fence is not successful or does not complete for some reason, new locks will continue to be prevented and therefore the GFS file system will freeze for the nodes that have it mounted and try to get locks. Processes that have already acquired locks will continue to run unimpeded until they try to get another lock.
There may be several reasons why a fence operation is not successful. For example, if there's a communication problem with a network power switch.
There may be several reasons why a fence operation does not complete. For example, if you were foolish enough to use manual fencing and forgot to run the script that informs the cluster that you manually fenced the node.
That pretty much means your file system is corrupt. There are a number of ways that this can happen that can't be blamed on GFS:
I'm guessing that maybe you gave them the same locking table on gfs_mkfs, and they're supposed to be different. When you did mkfs, did you use the same -t cluster:fsname for more than one? You can find this out by doing:
gfs_tool sb <device> table
for each device and see if the same value appears. You can change it after the mkfs has already been done with this command:
gfs_tool sb <device> table cluster_name:new_name
We believe GFS is better than OCFS2 because GFS has several key features that are missing from OCFS2:
GFS | OCFS2 |
Integrated cluster infrastructure. You can even write your own cluster apps if you want. | No cluster infrastructure. Limited lock coordination through a quorum disk. |
Quorum disk optional; easily scales to 32 nodes (soon to scale to 100 or more nodes). Without a quorum disk, GFS already supports more than a hundred nodes. | Quorum disk limits you to 16 or fewer nodes |
Clustered volume manager lvm2-cluster | No clustered volume manager |
Limited support for extended attributes (ACLs currently supported, SELinux support will be available in RHEL4 U5, RHEL5 and going forward.) | No extended attribute support |
Memory mapped IO for interprocess communication | No memory mapped IO |
Quota support | No quota support |
Cluster-wide flocks and POSIX locks | No cluster-aware flock or POSIX locks |
POSIX Access Control Lists (ACLs) | No POSIX ACLs |
Robust fencing mechanism to ensure file system integrity | No fencing |
Integrated support for application failover (high availability) | No integrated application failover |
You shouldn't expect GFS to perform as fast as non-clustered file systems because it needs to do inter-node locking and file system coordination. That said, there are some things you can do to improve GFS performance.
This causes a bit more traffic among the nodes but can sustain a larger number of files.
This value is not persistent so it won't survive a reboot. If you want to make it persistent, you can add it to the gfs init script, /etc/init.d/gfs, after your file systems are mounted.
Daemon | Function | Frequency | Parameter |
gfs_glockd | Reclaim unused glock structures | As needed | Unchangeable |
gfs_inoded | Reclaim unlinked inodes | 15 secs | inoded_secs |
gfs_logd | Journal maintenance | 1 sec | logd_secs |
gfs_quotad | Write cached quota changes to disk | 5 secs | quotad_secs |
gfs_scand | Look for cached glocks and inodes to toss from memory | 5 secs | scand_secs |
gfs_recoverd | Recover dead machines' journals | 60 secs | recoverd_secs |
gfs_tool settune /mnt/bob3 inoded_secs 30
These values are not persistent so they won't survive a reboot. If you want to make them persistent, you can add them to the gfs init script, /etc/init.d/gfs, after your file systems are mounted.
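For example, you might append lines like these to /etc/init.d/gfs, after the point where your file systems are mounted. The mount point and values here are purely illustrative, not recommendations:

```
# Hypothetical additions to /etc/init.d/gfs, placed after the mount step;
# adjust the mount point and values for your own setup.
gfs_tool settune /mnt/bob3 scand_secs 30      # scan for unused glocks less often
gfs_tool settune /mnt/bob3 inoded_secs 30     # reclaim unlinked inodes less often
```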
There's a tool called gfs2_convert whose job is to convert a file system from gfs1 to gfs2. At this time, gfs2_convert will only convert file systems created with the default 4K block size. I recommend following this procedure:
After it gives you some warnings and asks you the all-important "are you sure" question, it converts it to gfs2.
WARNING: At this time, gfs2 is still being worked on, so you should not use it for a production cluster.
The first access after a GFS mount will be slower because GFS needs to read in the resource group index and resource groups (internal GFS data structures) from disk. Once they're in memory, subsequent access to the file system will be faster. This should only happen right after the file system is mounted.
It also takes additional time to read in from disk: (1) the inodes for the root directory, (2) the journal index, (3) the root directory entries and other internal data structures.
You should be aware of this when performance testing GFS. For example, if you want to test the performance of the "df" command, the first "df" after a mount will be a lot slower than subsequent "df" commands.
After a node fails, there is a certain amount of time during which cman waits for a heartbeat. When it doesn't get a heartbeat, it performs fencing and has to wait for the fencing agent to return a good return code, verifying that the node has indeed been fenced. While the node is being fenced, GFS is prevented from taking out new locks (existing locks remain valid, however, so some IO activity may still take place). After the fence succeeds, DLM has to do lock recovery (to reclaim the locks held by the fenced node) and GFS has to replay the fenced node's journals. An additional configuration setting, post_fail_delay, can delay things further. So GFS is delayed by three things:
This varies widely based on the type of fencing you're using. Some network power switches are fast; other agents, such as iLO, are slower.
This varies based on how much activity was happening on the file system. For example, if your application had thousands of locks taken, it will take longer to recover those locks than if your node were idle before the failure.
Again, this varies, based on the activity of the file system before the fence. If there was lots of writing, there might be lots of journal entries to recover, which would take longer than an idle node.
There's not much you can do about the time taken, other than to reduce post_fail_delay to 0 or buy a faster power switch.
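The post_fail_delay setting (along with its companion post_join_delay) lives on the fence_daemon tag in /etc/cluster/cluster.conf. A typical fragment, with illustrative values, looks like this:

```xml
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
```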
The GFS file system is like most other file systems with regard to applications, with one exception: It makes an application running on multiple nodes work as if they are multiple instances of the application running on a single node. GFS will maintain file system integrity when multiple nodes are accessing data on the same shared storage. However, the application is free to corrupt data within its own files unless it is cluster-aware.
For example, if you were to run multiple copies of the regular MySQL database on a single computer, you're going to get into trouble. That's because right now, MySQL doesn't do record-level locking on its database, and therefore a second instance would overwrite data from the first instance. Of course, there are safeguards within MySQL to prevent you from running two instances on a single computer. But if you ran MySQL from two clustered nodes on a GFS file system, it would be just like both instances are running on the same computer, except that there are no safeguards: Data corruption is likely. (Note, however, that there is a special version of MySQL that is more cluster friendly.)
The same holds true for other applications. If you can safely run multiple instances on the same computer, then you should be able to run multiple instances within your cluster safely on GFS.
If a GFS file system detects corruption due to an operation it has just performed, it will withdraw itself. Withdrawing from GFS is just slightly nicer than a kernel panic. It means that the node feels it can no longer operate safely on that file system because one of its assumptions has turned out to be wrong. Instead of panicking the kernel, it gives you an opportunity to reboot the node "nicely".
No. The withdrawn node should be rebooted.
Corruption in GFS is extremely rare and almost always indicates a hardware problem with your storage or SAN. The problem might be in the SAN itself, the motherboards, fibre channel cards (HBAs) or memory of the nodes, although that's still not guaranteed. Many things can cause data corruption, such as rogue machines that have access to the SAN that you're not aware of.
I recommend you:
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/mnt/backup/sanbackup
[root@node-01#] diff /dev/my_vg/lvol0 /mnt/backup/sanbackup
(assuming of course that
/dev/my_vg/lvol0 is the logical volume you have your GFS partition on, and
/mnt/backup/ is some scratch area big enough to hold that much data.)
The idea here is simply to test that reading from the SAN gives you
the same data twice. If that works successfully on one node, try it
on the other nodes.
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/tmp/sanbackup2 bs=1M count=4096
[root@node-01#] dd if=/dev/urandom of=/tmp/randomjunk bs=1M count=4096
[root@node-01#] dd if=/tmp/randomjunk of=/dev/my_vg/lvol0 bs=1M count=4096
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/tmp/junkverify bs=1M count=4096
[root@node-01#] diff /tmp/randomjunk /tmp/junkverify
[root@node-01#] dd if=/tmp/sanbackup2 of=/dev/my_vg/lvol0 bs=1M count=4096
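The write/read-back steps above can be wrapped in a small helper. This is a hypothetical sketch (the sanverify name is made up): it temporarily overwrites the first part of the device before restoring the backup, so run it only on an unmounted volume whose contents you can afford to risk.

```shell
# Hypothetical wrapper for the dd/diff round-trip test shown above.
# WARNING: temporarily overwrites the first COUNT MB of DEV, then
# restores it. Use only on an unmounted, expendable or backed-up volume.
sanverify() {
    DEV=$1
    COUNT=${2:-4096}
    TMP=$(mktemp -d) || return 1

    # Save the original data, generate a random test pattern, write it,
    # then read it back and compare.
    dd if="$DEV" of="$TMP/backup" bs=1M count="$COUNT" 2>/dev/null
    dd if=/dev/urandom of="$TMP/random" bs=1M count="$COUNT" iflag=fullblock 2>/dev/null
    dd if="$TMP/random" of="$DEV" bs=1M count="$COUNT" conv=notrunc 2>/dev/null
    dd if="$DEV" of="$TMP/verify" bs=1M count="$COUNT" 2>/dev/null

    if cmp -s "$TMP/random" "$TMP/verify"; then
        echo "readback matches: storage looks OK"
    else
        echo "READBACK MISMATCH: suspect hardware"
    fi

    # Restore the original contents and clean up.
    dd if="$TMP/backup" of="$DEV" bs=1M count="$COUNT" conv=notrunc 2>/dev/null
    rm -rf "$TMP"
}
```

As with the manual commands, if this reports a mismatch on one node, repeat it on the others to narrow down whether the problem follows a node or the SAN.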
Perhaps someone else (the SAN manufacturer?) can recommend hardware tests you can run to verify the data integrity.
I realize these kinds of tests take a long time to do, but if it's a hardware problem, you really need to know. If you know it's not hardware and can recreate this kind of corruption with some kind of test using GFS, please let us know how and open a bugzilla.
It depends on which version of the code you are running. Basically, the cluster manager (cman) is the component of the cluster project that handles communications between nodes in the cluster.
In the latest cluster code, cman is just a userland program that interfaces with the OpenAIS membership and messaging system.
In the previous versions, cman was a kernel module whose job was to keep a "heartbeat" message moving throughout the cluster, letting all the nodes know that the others are alive.
It also handles cluster membership messages, determining when a node enters or leaves the cluster.
Quorum is a voting algorithm used by the cluster manager.
A cluster can only function correctly if there is general agreement between the members about things. We say a cluster has 'quorum' if a majority of nodes are alive, communicating, and agree on the active cluster members. So in a thirteen-node cluster, quorum is only reached if seven or more nodes are communicating. If the seventh node dies, the cluster loses quorum and can no longer function.
It's necessary for a cluster to maintain quorum to prevent 'split-brain' problems. If we didn't enforce quorum, a communication error on that same thirteen-node cluster might cause a situation where six nodes are operating on the shared disk while another six operate on it independently. Because of the communication error, the two partial clusters would overwrite areas of the disk and corrupt the file system. With quorum rules enforced, only one of the partial clusters can use the shared storage, thus protecting data integrity.
Quorum doesn't prevent split-brain situations, but it does decide who is dominant and allowed to function in the cluster. Should split-brain occur, quorum prevents more than one cluster group from doing anything.
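The majority rule described above can be sketched as a tiny function. This is only an illustration of the arithmetic, not the actual cman code: a partition is quorate when its votes form a strict majority of the expected votes.

```shell
# Hypothetical illustration of the quorum arithmetic (not cman's code):
# a partition is quorate when its votes are a strict majority of the
# expected votes for the whole cluster.
has_quorum() {
    votes=$1
    expected=$2
    [ $((votes * 2)) -gt "$expected" ]
}

# Thirteen one-vote nodes: seven alive is a majority, six is not.
has_quorum 7 13 && echo "7 of 13: quorate"
has_quorum 6 13 || echo "6 of 13: inquorate"
```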
We had to allow two-node clusters, so we made a special exception to the quorum rules. There is a special setting "two_node" in the /etc/cluster.conf file that looks like this:
<cman expected_votes="1" two_node="1"/>
This will allow one node to be considered enough to establish a quorum. Note that if you configure a quorum disk/partition, you don't want two_node="1".
Tie-breakers are additional heuristics that allow a cluster partition to decide whether or not it is quorate in the event of an even-split - prior to fencing. A typical tie-breaker construct is an IP tie-breaker, sometimes called a ping node. With such a tie-breaker, nodes not only monitor each other, but also an upstream router that is on the same path as cluster communications. If the two nodes lose contact with each other, the one that wins is the one that can still ping the upstream router. Of course, there are cases, such as a switch-loop, where it is possible for two nodes to see the upstream router - but not each other - causing what is called a split brain. This is why fencing is required in cases where tie-breakers are used.
Other types of tie-breakers include disk tie-breakers where a shared partition, often called a quorum disk, provides additional details. clumanager 1.2.x (Red Hat Cluster Suite 3) had a disk tie-breaker that allowed safe split brain operation if the network went down as long as both nodes were still communicating over the shared partition.
More complex tie-breaker schemes exist, such as QDisk (part of linux-cluster). QDisk allows arbitrary heuristics to be specified. These allow each node to determine its own fitness for participation in the cluster. It is often used as a simple IP tie-breaker, however. See the qdisk(5) manual page for more information.
CMAN has no internal tie-breakers for various reasons. However, tie-breakers can be implemented using the libcman API. This API allows quorum device registration and updating. For an example, look at the QDisk source code.
You might need a tie-breaker if you:
They do. When each node recognizes that the other has stopped responding, it will try to fence the other. It can be like a gunfight at the O.K. Corral, and the node that's quickest on the draw (first to fence the other) wins. Unfortunately, both nodes can end up going down simultaneously, losing the whole cluster.
It's possible to avoid this by using a network power switch that serializes the two fencing operations. That ensures that one node is rebooted and the second never fences the first. For other configurations, see below.
In a two-node cluster (where you are using two_node="1" in the cluster configuration, and without QDisk), there are several considerations you need to be aware of:
If you can not meet the above requirements, you can use quorum disk or partition.
The two_node cluster.conf option allows one node to have quorum by itself. A network partition between the nodes won't result in a corrupt fs because each node will try to fence the other when it comes up prior to mounting gfs.
Unfortunately, if you have a persistent network problem and the fencing device is still accessible to both nodes, this can result in an "A reboots B, B reboots A" fencing loop.
This problem can be worked around by using a quorum disk or partition to break the tie, or using a specific network & fencing configuration.
It's still possible to write to a GFS volume without quorum, but ONLY if the three nodes that left the cluster didn't have the GFS volume mounted. It's not a problem because if a partitioned cluster ever forms and gains quorum, it will fence the nodes in the inquorate partition before doing anything.
If, on the other hand, nodes failed while they had gfs mounted and quorum was lost, then gfs activity on the remaining nodes will be mostly blocked. If it's not then it may be a bug.
You can't mix RHEL4 U1 and U2 systems in a cluster because there were changes between U1 and U2 that changed the format of internal messages that are sent around the cluster.
Since U2, we now require these messages to be backward-compatible, so mixing U2 and U3 or U3 and U4 shouldn't be a problem.
Unfortunately, two-node clusters are a special case. A two-node cluster needs two nodes to establish quorum, but only one node to maintain quorum. This special status is set by a special "two_node" option in the cman section of cluster.conf. Unfortunately, this setting can only be reset by shutting down the cluster. Therefore, the only way to add a third node is to:
The system-config-cluster gui gets rid of the two_node option automatically when you add a third node. Also, note that this does not apply to two-node clusters with a quorum disk/partition. If you have a quorum disk/partition defined, you don't want to use the two_node option to begin with.
Adding subsequent nodes to a three-or-more node cluster is easy and the cluster does not need to be stopped to do it.
You're supposed to stop the node before removing it from the cluster.conf.
Here's the procedure:
Halting a single node in the cluster will seem like a communication failure to the other nodes. Errors will be logged and the fencing code will get called, etc. So there's a procedure for properly shutting down a cluster. Here's what you should do:
Use the "cman_tool leave remove" command before shutting down each node. That will force the remaining nodes to adjust quorum to accommodate the missing node and not treat it as an error.
Additional info: When I try to start cman, I see these messages in /var/log/messages:
Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5
Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate
Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing connections.
Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect...
I see these messages in dmesg:
CMAN: forming a new cluster
CMAN: quorum regained, resuming activity
CMAN: sendmsg failed: -13
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -13
CMAN: we are leaving the cluster.
CMAN: Waiting to join or form a Linux-cluster
CMAN: sendmsg failed: -13
This is almost always caused by a mismatch between the kernel and user space CMAN code. Update the CMAN user tools to fix the problem.
No, it's not true. There is only one special case: two node clusters have special rules for determining quorum. See question 3 above.
A quorum disk or partition is a section of a disk that's set up for use with components of the cluster project. It has a couple of purposes. Again, I'll explain with an example.
Suppose you have nodes A and B, and node A fails to get several of cluster manager's "heartbeat" packets from node B. Node A doesn't know why it hasn't received the packets, but there are several possibilities: node B has failed, the network switch or hub has failed, node A's network adapter has failed, or node B was simply too busy to send the packet. That can happen if your cluster is extremely large, your systems are extremely busy, or your network is flaky.
Node A doesn't know which is the case, and it doesn't know whether the problem lies within itself or with node B. This is especially problematic in a two-node cluster because both nodes, out of touch with one another, can try to fence the other.
So before fencing a node, it would be nice to have another way to check if the other node is really alive, even though we can't seem to contact it. A quorum disk gives you the ability to do just that. Before fencing a node that's out of touch, the cluster software can check whether the node is still alive based on whether it has written data to the quorum partition.
In the case of two-node systems, the quorum disk also acts as a tie-breaker. If a node has access to the quorum disk and the network, that counts as two votes.
A node that has lost contact with the network or the quorum disk has lost a vote, and therefore may safely be fenced.
In older versions of the Cluster Project, a quorum disk was needed to break ties in a two-node cluster. Early versions of Red Hat Enterprise Linux 4 (RHEL4) did not have quorum disks, but it was added back as an optional feature in RHEL4U4.
In RHCS 4 update 4 and beyond, see the man page for qdisk for more information. As of September 2006, you need to edit your configuration file by hand to add quorum disk support. The system-config-cluster gui does not currently support adding or editing quorum disk properties.
Whether or not a quorum disk is needed is up to you. It is possible to configure a two-node cluster in such a manner that no tie-breaker (or quorum disk) is required. Here are some reasons you might want/need a quorum disk:
The best way to start is to do "man qdisk" and read the qdisk.5 man page. This has good information about the setup of quorum disks.
Note that if you configure a quorum disk/partition, you don't want two_node="1" or expected_votes="2" since the quorum disk solves the voting imbalance. You want two_node="0" and expected_votes="3" (or nodes + 1 if it's not a two-node cluster). However, since 0 is the default value for two_node, you don't need to specify it at all. If this is an existing two-node cluster and you're changing the two_node value from "1" to "0", you'll have to stop the entire cluster and restart it after the configuration is changed (normally, the cluster doesn't have to be stopped and restarted for configuration changes, but two_node is a special case.) Basically, you want something like this in your /etc/cluster/cluster.conf:
<cman two_node="0" expected_votes="3" .../>
<clusternodes>
<clusternode name="node1" votes="1" .../>
<clusternode name="node2" votes="1" .../>
</clusternodes>
<quorumd device="/dev/mapper/lun01" votes="1"/>
Note: You don't have to use a disk or partition to prevent two-node fence-cycles; you can also set your cluster up this way. You can set up a number of different heuristics for the qdisk daemon. For example, you can set up a redundant NIC with a crossover cable and use ping operations to the local router/switch to break the tie (this is typical, actually, and is called an IP tie breaker). A heuristic can be made to check anything, as long as it is a shared resource.
Currently, yes. There have been suggestions to make qdiskd operate in a 'diskless' mode in order to help prevent a fence-race (i.e. prevent a node from attempting to fence another node), but no work has been done in this area (yet).
Yes. If the quorum disk is registered correctly with cman, you should see the votes it contributes, and also its "node name", in the output of cman_tool nodes.
The official answer is 10MB. The real number is something like 100KB, but we'd like to reserve 10MB for possible future expansion and features.
Currently a quorum disk/partition may be used in clusters of up to 16 nodes.
First of all, no, they don't cause split-brain. As soon as heartbeat contact is lost, both nodes will realize something is wrong and lock GFS until it gets resolved and someone is fenced.
What actually happens depends on the configuration and the heuristics you build. The qdisk code allows you to build non-cluster heuristics to determine the fitness of each node beyond the heartbeat. With the heuristics in place, you can, for example, allow the node running a specific service to have priority over the other node. It's a way of saying "This node should win any tie" in case of a heartbeat failure. The winner fences the loser.
If both nodes still have a majority score according to their heuristics, then both nodes will try to fence each other, and the fastest node kills the other. Showdown at the Cluster Corral. The remaining node will have quorum along with the qdisk, and GFS will run normally under that node. When the "loser" reboots, unlike with a cman operation, it will not become quorate with just the quorum disk/partition, so it cannot cause split-brain that way either.
At this point (4-Apr-2007), if there are no heuristics defined whatsoever, the QDisk master node wins (and fences the non-master node). [This functionality will appear in Update 5 of Red Hat Cluster Suite for Red Hat Enterprise Linux 4, but is already available in CVS]
This may not be a good idea in most cases because of the dangers of split-brain, but there is a way to do it: you can set the "votes" for the quorum disk equal to the number of nodes in the cluster, minus one.
For example, if you have a four-node cluster, you can set the quorum disk votes to 3, and expected_votes to 7. That way, even if three of the four nodes die, the remaining node may still function. That's because the quorum disk's 3 votes plus the remaining node's 1 vote makes a total of 4 votes out of 7, which is enough to establish quorum. Additionally, all of the nodes can be online - but not the qdiskd (which you might need to take down for maintenance or reconfiguration).
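Mirroring the earlier cluster.conf fragment, the four-node example might look something like this sketch (the quorumd device name is reused from the earlier example; adjust it for your own storage):

```xml
<cman two_node="0" expected_votes="7" .../>
<clusternodes>
<clusternode name="node1" votes="1" .../>
<clusternode name="node2" votes="1" .../>
<clusternode name="node3" votes="1" .../>
<clusternode name="node4" votes="1" .../>
</clusternodes>
<quorumd device="/dev/mapper/lun01" votes="3"/>
```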
One or more of the nodes in your cluster is rejecting the membership of this node. Check the syslog (/var/log/messages) on all remaining nodes in the cluster for messages regarding why the membership was rejected.
This message will only appear when another node is rejecting the node in question and it WILL tell syslog (/var/log/messages) why unless you have kernel logging switched off for some reason. There are several reasons your node may be rejected:
Something else you might like to try is changing the port number that this cluster is using, or changing the cluster name to something totally different.
If you find that things work after doing this then you can be sure there is another cluster with that name or number on the network. If not, then you need to double/triple check that the config files really do all match on all nodes.
I've seen this message happen when I've accidentally done something like this:
Guess what? None of the nodes come up in a cluster. Can you guess why?
It's because node E still thinks it's part of the cluster and still has a claim on the cluster name. You still need to shut down the cluster software on E, or else reboot it before the correct nodes can form a cluster.
No, this isn't a problem and can be ignored. Some nodes may report [1 2 3 4 5] while others report a different order, like [4 3 5 2 1]. This merely has to do with the order in which cman join messages are received.
This message indicates that you tried to leave the cluster from a node that still has active cluster resources, such as mounted GFS file systems.
A node cannot leave the cluster if there are subsystems (e.g. DLM, GFS, rgmanager) active. You should unmount all GFS filesystems, stop the rgmanager service, stop the clvmd service, stop fenced and anything else using the cluster manager before using cman_tool leave. You can use cman_tool status and cman_tool services to see how many (and which) services are running.
Although this may be an over-simplification, you can think of the services as a big membership roster for different special interest groups or clubs. Each "service-name" pair corresponds to access to a unique resource, and each node corresponds to a voting member in the club.
So let's weave an inane piece of fiction around this concept: let's pretend that a journalist named Sam wants to write an article for her newspaper, "The National Conspiracy Theorist." To write her article, she needs access to secret knowledge kept hidden for centuries by a secret society known only as "The Group." The only way she can become a member is to petition the existing members to join, and the decision must be unanimously in her favor. But The Group is so secretive, they don't even know each other's names; every member is assigned a unique ID number. Their only means of communication is through a chat room, and they won't even speak to you unless you're a member or you know how to become one.
So she logs into the chat room and joins the channel #default. In the chat room, she can see there are seven members of The Group. They're not listed in order, but they're all there.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5]
She finds a blog (called "cluster.conf") and reads from it that her own ID number is 8. So she sends them a message: "Node 8 wants to join the default group".
Secretly, the other members take attendance to make sure all the members are present and accounted for. Then they take a vote. If all of them vote yes, she's allowed into the group and she becomes the next member. Her ID number is added to the list of members.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5 8]
Now that she's a member of the Group, she is told that the secrets of the order are not given to ordinary newbies; they're kept in a locked space. They are stored in an office building owned by the order, that they oddly call "clvmd." Since she's a newbie, she has to petition the other members to get a key to the clvmd office building. After a similar vote, they agree to give her a key, and they keep track of everyone who has a key.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5 8]
DLM Lock Space: "clvmd" 7 3 run -
[7 6 1 2 3 4 5 8]
Eager to write her article, she drives to the clvmd office building, unlocks the door, and goes inside. She's heard rumors that the secrets are kept in a suite labeled "secret". She goes from room to room until she finds a door marked "secret." Then she discovers that the door is locked and her key doesn't fit. Again, she has to petition the others for a key. They tell her that there are actually two adjacent rooms inside the suite, the "DLM" room and the "GFS" room, each holding a different set of secrets.
Four of the members (3, 4, 6 and 7) never really cared what was in those rooms, so they never bothered to learn the grueling rituals, and consequently, they were never issued keys to the two secret rooms. So after months of training, Sam once again petitions the other members to join the "secret rooms" group. She writes "Node 8 wants to join the 'secret' DLM group" and sends it to the members who have a key: #1, #2 and #5. She sends them a similar message for the other room as well: "Node 8 wants to join the 'secret' GFS group". Having performed all the necessary rituals, they agree, and she's issued a duplicate key for both secret rooms.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5 8]
DLM Lock Space: "clvmd" 7 3 run -
[7 6 1 2 3 4 5 8]
DLM Lock Space: "secret" 12 8 run -
[1 2 5 8]
GFS Mount Group: "secret" 13 9 run -
[1 2 5 8]
Then something shocking rocks the secret society: member 2 went into cardiac arrest and died on the operating table. Clearly, something must be done to recover the keys held by member 2. In order to secure the contents of both rooms, no one is allowed to touch the information in the secret rooms until they've verified member 2 was really dead and recovered his keys. The members decide to leave that task to the most senior member, member 7.
That night, when no one is watching, member 7 breaks into the morgue, verifies that #2 is really dead, and steals back the key from his pocket. Then #7 drives to the office building and returns all the secrets he had borrowed from the secret room. (They call it "recovery".) He also informs the other members that #2 is truly dead, and #2 is taken off the group membership lists. Relieved that their secrets are safe, the others are now allowed access to the secret rooms.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 3 4 5 8]
DLM Lock Space: "clvmd" 7 3 run -
[7 6 1 3 4 5 8]
DLM Lock Space: "secret" 12 8 run -
[1 5 8]
GFS Mount Group: "secret" 13 9 run -
[1 5 8]
You get the picture...Each of these "services" keeps a list of members who are allowed access, and that's how the cluster software on each node knows which others to contact for locking purposes. Each GFS file system has two groups that are joined when the file system is mounted; one for GFS and one for DLM.
The "state" of each service corresponds to its status in the group: "run" means it's a normal member. There are also states corresponding to joining the group, leaving the group, recovering its locks, etc.
A node may leave the cluster for many reasons. Among them:
Just add hello_timer="value" to the cman section in your cluster.conf file. For example:
<cman hello_timer="5">
The default value is 5 seconds.
Just add deadnode_timeout="value" to the cman section in your cluster.conf file. For example:
<cman deadnode_timeout="21">
The default value is 21 seconds.
"Split brain" is a condition whereby two or more computers or groups of computers lose contact with one another but still act as if the cluster were intact. This is like having two governments trying to rule the same country. If multiple computers are allowed to write to the same file system without knowledge of what the other nodes are doing, it will quickly lead to data corruption and other serious problems.
Split-brain is prevented by enforcing quorum rules (which say that no group of nodes may operate unless they are in contact with a majority of all nodes) and fencing (which makes sure nodes outside of the quorum are prevented from interfering with the cluster).
There are several reasons for doing this. First, you may want the cman heartbeat messages on a dedicated network so that a heavily used network doesn't cause heartbeat messages to be missed (and nodes in your cluster to be fenced). Second, you may have security reasons for keeping these messages off of an Internet-facing network.
First, you want to configure your alternate NIC to have its own IP address, and the settings that go with that (subnet, etc).
Next, add an entry into /etc/hosts (on all nodes) for the ip address associated with the NIC you want to use. In this case, eth2. One way to do this is to append a suffix to the original host name. For example, if your node is "node-01" you could give it the name "node-01-p" (-p for private network). For example, your /etc/hosts file might look like this:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.0.0.1 node-01
192.168.0.1 node-01-p
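Then, in /etc/cluster/cluster.conf, use the private name for the node so that cman binds to the interface carrying that address. A sketch following the /etc/hosts example above:

```xml
<clusternodes>
<clusternode name="node-01-p" votes="1" .../>
</clusternodes>
```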
If you're using RHEL4.4 or above, or 5.1 or above, that's all you need to do. There is code in cman to look at all the active network interfaces on the node and find the one that corresponds to the entry in cluster.conf. Note that this only works on ipv4 interfaces.
By default, the older cluster infrastructure (RHEL4, STABLE and so on) uses broadcast. By default, the newer cluster infrastructure with openais (RHEL5, HEAD and so on) uses multicast. You can configure a RHEL4 cluster to use multicast rather than broadcast. However, you can't switch openais to use broadcast.
Yes, it is. If you configure the cluster to use multicast rather than broadcast (there is an option for this in system-config-cluster) then the nodes can be on different subnets.
Be careful that any switches and/or routers between the nodes are of good specification and are set to pass multicast traffic through.
Put something like this in your cluster.conf file:
<clusternode name="nd1">
<multicast addr="224.0.0.1" interface="eth0"/>
</clusternode>
There is currently a known problem with RHEL5 whereby system-config-cluster tries to access cman_tool at /usr/sbin/cman_tool (cman_tool currently resides in /sbin). We'll correct the problem, but in the meantime you can work around it by creating a symlink to /sbin/cman_tool in /usr/sbin. For example:
[root@node-01 ~]# ln -s /sbin/cman_tool /usr/sbin/cman_tool
If this is not your problem, read on:
Ordinarily, this message would mean that cman could not create the local socket in /var/run for communication with the cluster clients.
cman tries to create /var/run/cman_client and /var/run/cman_admin. Tools like cman_tool, groupd and ccsd talk to cman over these sockets. If they can't be created, you'll get this error.
Check that /var/run is writable and able to hold Unix domain sockets.
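As a quick check, a sketch like this probes whether a directory is writable at all. Run it against /var/run on the failing node; with no argument it probes a throwaway temporary directory, so it's safe to try anywhere (the probe file name is arbitrary, not the real cman socket):

```shell
#!/bin/sh
# Probe whether a directory is writable (a rough stand-in for checking
# that cman can create its sockets there). The directory argument is
# optional; the default is a throwaway temp directory.
dir=${1:-$(mktemp -d)}
if touch "$dir/cman_probe" 2>/dev/null; then
    echo "writable: $dir"
    rm -f "$dir/cman_probe"
else
    echo "NOT writable: $dir"
fi
```

On a healthy node, running it with /var/run as the argument (as root) should normally report the directory as writable.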
On Fedora 8 and other distributions where the core supports multiple architectures (ex: x86, x86_64), you must have a matched set of packages installed. A cman package for x86_64 will not work with an x86 (i386/i686) openais package, and vice-versa. To see if you have a mixed set, run:
WRONG:
[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
RIGHT:
[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
You need to use the same architecture as your kernel for running the userland parts of the cluster packages; on x86_64, this generally means you should only have the x86_64 versions of the cluster packages installed.
rpm -e cman.i386 openais.i386 rgmanager.i386 ...
yum install -y cman.x86_64 openais.x86_64 rgmanager.x86_64 ...
Note: If you were having trouble getting things up, there's a chance that an old aisexec process might be running on one of the nodes; make sure you kill it before trying to start again!
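A sketch of that check, assuming the daemon's process name is simply "aisexec":

```shell
#!/bin/sh
# Look for a leftover aisexec process and kill it before restarting
# the cluster software.
if pgrep -x aisexec >/dev/null 2>&1; then
    echo "stale aisexec found, killing it"
    pkill -x aisexec
else
    echo "no stale aisexec running"
fi
```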
Some Cisco switches do not support IP multicast in their default configuration. Since openais uses multicast for cluster communications, you may have to enable it in the switch in order to use the cluster software.
Before making any changes to your Cisco switches it is advisable to contact your Cisco TAC to ensure the changes will have no negative consequences in your network.
Please visit this page for more information: http://www.openais.org/doku.php?id=faq:cisco_switches
Please see: http://www.openais.org/
The cluster manager (cman) locking scheme uses kernel modules to communicate cluster status and changes between nodes. OpenAIS uses userspace programs to accomplish the same thing. Moving this function to userspace made more sense: it is easier to monitor and debug, a crash is non-fatal, and it meshes better with the communications layers of the operating system.
Fencing is the component of cluster project that cuts off access to a resource (hard disk, etc.) from a node in your cluster if it loses contact with the rest of the nodes in the cluster.
The most effective way to do this is commonly known as STONITH, an acronym for "Shoot The Other Node In The Head." In other words, it forces the system to power off or reboot. That might seem harsh to the uninitiated, but it's really a good thing: a node that is not cooperating with the rest of the cluster can seriously damage the data unless it's forced off. So by fencing an errant node, we're actually protecting the data.
Fencing is often accomplished with a network power switch, which is a power switch that can be controlled through the network. This is known as power fencing.
Fencing can also be accomplished by cutting off access to the resource, such as using SCSI reservations. This is known as fabric fencing.
This is constantly changing. Manufacturers come out with new models and new microcode all the time, forcing us to change our fence agents. Your best bet is to look at the source code in CVS and see if your device is mentioned:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/?cvsroot=cluster
We are looking into ways to improve this.
No. Fencing is absolutely required in all production environments. That's right. We do not support people using only watchdog timers anymore.
Manual fencing is absolutely not supported in any production environment, ever, under any circumstances.
Both do the job. Both methods guarantee the victim can't write to the file system, thereby ensuring file system integrity.
However, we recommend that customers use power-cycle fencing anyway, for a number of reasons, although there are cases where fabric-level fencing is useful. The common "fabric fencing" arguments go something like this:
"What if the node has a reproducible failure that keeps happening over and over if we reset it each time?" and "What if I have non-clustered, but mission-critical tasks running on the node, and it is evicted from the cluster but is not actually dead (say, the cluster software crashed)? Power-cycling the machine would kill the Mission Critical tasks running on it..."
However, once a node is fabric fenced, you need to reboot it before it can rejoin the cluster.
Killing fenced, or having it otherwise exit while the node is using GFS, isn't good: if the node then fails without fenced running, it won't be fenced. If fenced exits somehow, it can simply be restarted, which is what you should do if you find it's been killed. I don't think we can really prevent it from being intentionally killed, though.
The first step is to try fencing it from a command line that looks something like this:
/sbin/fence_ilo -a myilo -l login -p passwd -o off -v
Second, check the version of RIBCL you are using. You may want to consider upgrading your firmware. Also, you may want to scan bugzilla to see if there are any issues regarding your level of firmware.
A node can have multiple fence methods and each fence method can have multiple fence devices.
Multiple fence methods are set up for redundancy/insurance. For example, you may be using a baseboard management fencing method for a node in your cluster such as IPMI, or iLO, or RSA, or DRAC. All of these depend on a network connection. If this connection would fail, fencing could not occur, so as a backup fence method you could declare a second method of fencing that used a power switch or somesuch to fence the node. If the first method failed to fence the node, the second fence method would be employed.
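As a sketch, a cluster.conf fragment with a baseboard-management method backed up by a power-switch method might look like this (the device names, and the fencedevice entries they would refer to, are hypothetical):

```xml
<clusternode name="node-01" votes="1">
    <fence>
        <!-- Method 1: baseboard management (e.g. iLO) over the network -->
        <method name="1">
            <device name="node01-ilo"/>
        </method>
        <!-- Method 2: fallback power switch, tried only if method 1 fails -->
        <method name="2">
            <device name="pwr01" switch="1" port="1"/>
        </method>
    </fence>
</clusternode>
```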
Multiple fence devices per method are used, for example, if a node has dual power supplies and power fencing is the fence method of choice. If only one power supply were fenced, the node would not reboot - as the other power supply would keep it up and running. In this case you would want two fence devices in one method: one for power supply A and one for power supply B.
All fence devices within a fence method must succeed in order for the method to succeed.
If someone refers to fence "levels", they are the same thing as methods. The term "method" used to refer to "power" versus "fabric" fencing, but the technology has outgrown that while the config file has not. So the term "fencing level" might be more accurate, but we still refer to them as "fencing methods" because "method" is how you specify it in the config file.
There can be multiple causes for nodes that repeatedly get fenced, but the bottom line is that one of the nodes in your cluster isn't seeing enough "heartbeat" network messages from the node that's getting fenced.
Most of the time, these come down to flaky or faulty hardware, such as bad cables and bad ports on the network hub or switch.
Test your communications paths thoroughly without the cluster software running to make sure your hardware is okay.
If your network is busy, your cluster may decide it's not getting enough heartbeat packets, but that may be due to other activities that happen when a node joins a cluster. You may have to increase the post_join_delay setting in your cluster.conf. It's basically a grace period to give the node more time to join the cluster. For example:
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="600"/>
No. No. A thousand times no. Oh sure. You can use it. But don't complain when a node needs to be fenced and the cluster locks up, and services don't fail over.
Because we can't be responsible when this happens:
When a node can't talk to the rest of the cluster through its normal heartbeat packets, it will be fenced by another node.
If a GFS file system detects corruption due to an operation it has just performed, it will withdraw itself. Withdrawing from GFS is just slightly nicer than a kernel panic. It means that the node feels it can no longer operate safely on that file system because it found out that one of its assumptions is wrong. Instead of panicking the kernel, it gives you an opportunity to reboot the node "nicely".
You have to be careful when configuring fencing for redundant power supplies. If you configure it wrong, each power supply will be fenced separately while the other keeps the system up and running, so the node is never actually fenced. What you really want is for both power supplies to be shut off so the system is taken completely down. To do that, configure a set of two fencing devices inside a single fencing method.
If you're using dual power supplies, both of which are plugged into the same power switch, using ports 1 and 2, you can do something like this:
<clusternode name="node-01" votes="1">
<fence>
<method name="1">
<device name="pwr01" option="off" switch="1" port="1"/>
<device name="pwr01" option="off" switch="1" port="2"/>
<device name="pwr01" option="on" switch="1" port="1"/>
<device name="pwr01" option="on" switch="1" port="2"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.0.101"
login="admin" name="pwr01" passwd="XXXXXXXXXXX"/>
</fencedevices>
The intrinsic problem with this, of course, is that if your UPS fails or needs to be swapped out, your system will lose power to both power supplies and you have down time. This is unacceptable in a High Availability (HA) cluster. To solve that problem, you'd really want redundant power switches and UPSes for the dual power supplies.
For example, let's say you have two APC network power switches (pwr01 and pwr02), each of which runs on its own separate UPS and has its own unique IP address. Let's assume that the first power supply of node 1 is plugged into port 1 of pwr01, and the second power supply is plugged into port 1 of pwr02. That way, port 1 on both switches is reserved for node 1, port 2 for node 2, etc. In your cluster.conf you can do something like this:
<clusternode name="node-01" votes="1">
<fence>
<method name="1">
<device name="pwr01" option="off" switch="1" port="1"/>
<device name="pwr02" option="off" switch="1" port="1"/>
<device name="pwr01" option="on" switch="1" port="1"/>
<device name="pwr02" option="on" switch="1" port="1"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.0.101"
login="admin" name="pwr01" passwd="XXXXXXXXXXX"/>
<fencedevice agent="fence_apc" ipaddr="192.168.1.101"
login="admin" name="pwr02" passwd="XXXXXXXXXXX"/>
</fencedevices>
We have some. For WTI please visit this link:
http://people.redhat.com/lhh/wti_devices.html
[root@taft-04 ~]# pvcreate /dev/sdb1
  Physical volume "/dev/sdb1" successfully created
[root@taft-04 ~]# pvscan
  No matching physical volumes found
Filters can cause this to happen. pvscan respects the filters and scans everything, but if pvcreate finds the device you request immediately, it applies the filter only to the name given on the command line, i.e. it doesn't scan everything.
This can give a different result to the filter matching. Internally, lvm2 only knows about device numbers - major/minor. Names are just a means to finding the device number required. Device numbers can have multiple names in the file system and the rules for applying filters can give a different answer if only applied to a subset of names in the filesystem. But scanning everything every time is slow, so it takes short cuts - at the price of occasional inconsistency.
Try running "pvscan -vvvv | grep sdb" to make sure it's not filtered out.
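If the device is being filtered out, look at the filter line in /etc/lvm/lvm.conf. A hedged example that accepts /dev/sdb* and rejects everything else (adjust the patterns to your own devices):

```
# devices section of /etc/lvm/lvm.conf (example only)
filter = [ "a|^/dev/sdb.*|", "r|.*|" ]
```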
When I try to start clvmd, I get this message:
[root@node001 ~]# clvmd
clvmd could not connect to cluster manager
Consult syslog for more information
My syslog says: "clvmd: Unable to create lockspace for CLVM: No such file or directory"
Make sure that your dlm kernel module is loaded by using lsmod. If it isn't, do "modprobe dlm" to insert the module. Also, make sure the failing node can physically see the shared storage in /proc/partitions. I've seen some weird things like this happen when a cluster comes up but some of the nodes can't physically access the storage.
No you can't. Without some kind of cluster infrastructure, there's nothing to stop the computers attached to your shared storage from corrupting and overwriting each other's data. In fact, each of the nodes will let you use the 'ae' option and each will be convinced it has exclusive access.
There's a little-known "clustering" flag for volume groups that should be set on when a cluster uses a shared volume. If that bit is not set, you can see strange lvm problems on your cluster. For example, if you extend a volume with lvresize and gfs_grow, the other nodes in the cluster will not be informed of the resize, and will likely crash when they try to access the volume.
To check if the clustering flag is on for a volume group, use the "vgs" command and see if the "Attr" column shows a "c". If the attr column shows something like "wz--n-" the clustering flag is off for the volume group. If the "Attr" column shows something like "wz--nc" the clustering flag is on.
To set the clustering flag on, use this command:
vgchange -cy <volume group name>
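The last character of the attr string is what matters here. A small sketch of decoding it, using the sample attr values from the text above (on a live system you would feed it the output of `vgs --noheadings -o vg_attr <vg>` instead):

```shell
#!/bin/sh
# Decode the vgs "Attr" string: a trailing "c" means the clustered
# flag is on. Sample values are hard-coded for illustration.
for attr in "wz--n-" "wz--nc"; do
    case "$attr" in
        *c) echo "$attr: clustered" ;;
        *)  echo "$attr: not clustered" ;;
    esac
done
```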
A lock manager is a traffic cop who controls access to resources in the cluster, such as access to a GFS file system. You need it because without a lock manager, there would be no control over access to your shared storage, and the nodes in the cluster would corrupt each other's data.
The GFS file system was written to interface with different lock managers. Today, there are three lock managers:
It depends on how many nodes in the cluster and what you're going to be using your cluster for.
For almost all cases, the DLM protocol is preferred. It's more modern and more efficient.
The first thing to consider is the number of computers in your cluster. DLM has known problems when you have more than 32 nodes in your cluster. We're working to resolve those issues, but until then use GULM if you have more than 32 nodes.
The GULM locking manager, on the other hand, requires more machines. GULM requires three or more independent computers outside the cluster that act as lock servers. That means that the minimum GULM configuration is five computers: A two-node GULM cluster with three independent GULM lock servers. So if you've got fewer than five computers, you'll have to use DLM.
The second thing to consider is the software that will be accessing the storage. Right now, Oracle and Oracle RAC are only Oracle-certified to work with the GULM locking manager.
Oracle RAC should work just fine with DLM locking; it just won't be a configuration that has passed Oracle certification. That means you can still run Oracle RAC in a two-node cluster without the additional lock servers, but you'll have to use DLM, and Oracle won't support your configuration. (Red Hat still will.) If you have a problem with Oracle, you should be able to temporarily introduce three lock servers and switch to GULM long enough to get their tech support. But please make sure the problem is still there before contacting them.
We're in the process of phasing out the GULM locking manager for future development, such as Fedora Core 6 and Red Hat Enterprise Linux 5.
You specify the locking protocol when you make your file system with gfs_mkfs or mkfs.gfs2. For example:
gfs_mkfs -t smoke_cluster:my_gfs -p lock_dlm -j 3 /dev/bobs_vg/lvol0
mkfs.gfs2 -t bob_cluster2:bobs_gfs2 -p lock_gulm -j 5 /dev/bobs_vg/lvol1
It's easy to change the locking protocol for a GFS file system:
gfs_tool sb <device> proto <locking protocol>
For example:
gfs_tool sb /dev/bobs_vg/lvol0 proto lock_dlm
See the man page for gfs_tool for more information and the full range of options.
Absolutely. Check out the source tree from CVS or download the source files from sources.redhat.com. There's documentation in dlm/doc/ and also several example programs (several of which might do exactly what you are looking for) in dlm/tests/usertest/
Testing a lock without blocking is available in the normal locking API (flag LKF_NOQUEUE). The only way of receiving notification of a lock being released is to queue another lock that is incompatible with it - so that lock will be granted when the previous one is released. That's also how you would do it on VMS.
The GULM locking protocol will be supported for Red Hat Enterprise Linux 3 & 4, but we are dropping GULM after that and don't have any plans to support it in future software. In the future, users will be required to switch to DLM locking protocol, which is easy to do.
Yes. For future releases of RHEL, we will go through the Oracle certification process again, this time using the newer DLM locking protocol.
The node's locks should be freed up.
Yes. On RHEL4 and equivalent, do this command:
gfs_tool lockdump /mnt/bob
Unfortunately, the output won't make much sense, but some of the numbers correspond to inode numbers. Atix did a fairly good analysis of what these numbers mean, and you can find it here:
http://www.open-sharedroot.org/documentation/gfs-lockdump-analysis
Right now, there isn't a GFS2 equivalent, but we plan to add it. There's a bugzilla record to track the progress, and it includes a patch to add the functionality:
Yes you can, but only on a per-lockspace basis, so you have to choose a lock space to dump. What you need to do is to echo the lockspace name into /proc/cluster/dlm_locks, then dump that file to get the results. You can get the lockspace names with the "cman_tool services" command. For example:
# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2]
DLM Lock Space: "clvmd" 2 3 run -
[1 2]
# echo "clvmd" > /proc/cluster/dlm_locks
# cat /proc/cluster/dlm_locks
This shows locks held by clvmd. If you want to look at another lockspace just echo the other name into the /proc file and repeat.
Again, the output won't make much sense, but some of the numbers may correspond to inode numbers.
Yes you can.
This is an excellent description of a dlm and the general ideas/logic reflect very well our own dlm:
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
You can get this when you build the Cluster Project by hand (i.e. compiling it rather than installing it with RPMs, up2date, etc.) and something went wrong during the build.
The solution I've used to fix it is:
cd cluster; make uninstall; make distclean; ./configure; make install
(Assuming the cluster suite source resides in directory "cluster").
As long as the clusters have different names, you don't need to place them on separate subnets.
GNBD is a kernel module that lets you export any block device from one node to another. You don't need it for normal cluster operations, but you can do some cool things with it.
Run modprobe gnbd, then gnbd_import -i <server_name>. The GNBD export name must be unique within the cluster, so you cannot export a GNBD named "foo" from both serverA and serverB. You need to have both gnbd devices imported.
Yes. When you have Fibre Channel attached storage, you get a new sd* device for every path to the storage device. Multipath takes all these devices and makes one multipathed device that routes I/O over all of them. It works exactly the same with GNBD; it just takes a couple more steps to get the paths to appear (gnbd_export, gnbd_import).
The other important thing is that you have to specify the -U option when you export the gnbd device. Without this, the device cannot be multipathed. The -U option gives the device a UUID. If you are using SCSI devices, it should work fine.
For multipathing to work, you really do need two paths to the same actual physical device. Otherwise half your data will end up on one device, and half your data will end up on another device.
No. GNBD devices should work correctly by using defaults.
See the gnbd_export man page for more information.
You need to enable port 14567 (tcp).
In almost all cases, you should use -U (capital U) not -u. The -U option specifies a uuid callout command. You can specify -U<command> but if you don't specify a command, it uses a script that makes sure it deals with partitions correctly. Specifying -U with no command should work correctly for almost every type of shared storage device. If you specify -u and get it wrong with multipathing, you can cause data corruption.
The only real advantage of using GNBD is that it has built-in fencing. With iSCSI, you still need something to fence all the machines (unless your iSCSI target supports SCSI-3 persistent reservations). Theoretically, GNBD could run faster, since it doesn't need to do the work to imitate a SCSI device, but there's a lot of work that needs to be done for GNBD to reach its full speed potential. Since there isn't much active development of GNBD and iSCSI has an active community of developers, if iSCSI isn't already faster, it eventually will be. Using iSCSI also allows a much more seamless transition to a hardware shared-storage solution later on.
If you don't have any fencing hardware, and your iSCSI target doesn't support SCSI-3 persistent reservations, then you should probably go with GNBD. Otherwise it's up to you.
The rgmanager program manages cluster resources defined by the user. It allows you to define services for high-availability on your cluster. Basically, you can define cluster services, for example an NFS server, that is available to computers on the network (in or out of the cluster). Rgmanager monitors the services, and if a node fails, it will relocate the service to another node in the cluster. So if your NFS server fails, the service can be automatically moved to another node in the cluster and the NFS clients on the network probably won't even know it failed. They should continue running seamlessly without knowing or caring about the failure.
The rgmanager program is complex. A service monitor checks the services defined in the cluster to make sure they're running. A service may be configured to run on a subset of nodes in the cluster, which I call a service group. A cluster may have multiple service groups, so even if your cluster has lots of nodes, you can restrict each service to run on only the nodes you want. For example, you can define an NFS service to run on one group of nodes and an Apache httpd service to run on a different group of nodes. If a service fails, a script is called to automatically restart it. If a node fails, the service may be relocated to a different node in the service group.
/usr/share/cluster/*
Yes. The rgmanager is flexible enough to allow you to define your own services with their own scripts. We encourage you to share them so that others may use them as well.
Some people have found "active-active" vs. "active-passive" described differently in different places, so you may be wondering if you're using the terms correctly. If both nodes of a two-node cluster are running their own service, and that service has the ability to failover to the other node, does that make this an active-active cluster, or a doubly active-passive cluster?
Cluster Suite is an active-active cold failover cluster, though many services might not be. RHCS certainly can't make a service "active-active". For example, RHCS can not transform Oracle 10g CFC into a multiple instance Oracle 10g RAC database, or make ext3 into a file system that you can mount on multiple nodes safely. Nothing can do these things.
It's open to interpretation and linguistic changes over time, of course...
Historically, active-passive in the context of a failover cluster meant that only one node can serve *any* of the services at a time, because the underlying device topology or the way the cluster uses it requires it.
Examples:
(1) Device topology requirements: DRBD 0.7 or similar technologies (block-journal NBD, such as used by Steeleye Lifekeeper): only one node can have the shared device open read-write at any one time due to the way the design works (replication over network, in these cases).
(2) Cluster use restriction: SCSI reservations: only one node may talk to a given SCSI device because of the way SCSI reservations work. Requires multiple-initiator buses (IIRC), which get messy very quickly. Note that this might be considered a form of "fencing", but in a negative sense: The one node who has the reservation may access the data on that device.
Now, you can, for example, use the same GFS mount point to construct a multiple-NFS server on RHCS, because GFS does not have the limitation that ext3 does WRT one-node-at-a-time. You might call this service an 'active-active NFS service'... (In this case, there are multiple services which share resources, though - RHCS doesn't let you start the same service multiple times; I can elaborate on the 'why' of this if you would like).
Here's the thing with "active-active" services: most internally active-active services have internal clustering to begin with. Back to a previous example: Oracle 10g RAC probably will not benefit from something like RHCS managing instances at all, where a 10g infrastructure database in CFC configuration will benefit a great deal.
Manual intervention always overrides configured rules. If you want a service to start on a specific node, use:
clusvcadm -e < service > -n < node >
Not specifying is the same as "Start on the node I'm running clusvcadm on..."
With GFS, you mount the file system on all the nodes and keep /etc/exports in sync across the cluster; then you can move IPs around and the NFS clients should just "do the right thing." With ext3, you can't mount the file system on multiple nodes, and you can't have /etc/exports export a file system that's not mounted. So you have to make the whole thing a cluster service, ext3 mount point and all, so that the file system is mounted on only one node at a time, and then use the cluster to bring up the exports. Alternatively, you could just have the cluster start/stop nfsd after mounting the ext3 file system, but then you can only have one NFS daemon safely running in the cluster at a time (because you can't run two instances of nfsd).
In other words, what's the difference between:
<resources>
<clusterfs device="/dev/bob_vg/lvol0" force_unmount="0" fstype="gfs"
mountpoint="/mnt/bob" name="bobfs" options="acl"/>
<nfsexport name="NFSexports"/>
<nfsclient name="trin-16" options="rw" target="trin-16.lab.msp.redhat.com"/>
</resources>
<service autostart="1" domain="nfsdomain" name="nfssvc">
<ip address="10.15.84.250" monitor_link="1"/>
<clusterfs ref="bobfs">
<nfsexport name="bobfs">
<nfsclient ref="trin-16"/>
</nfsexport>
</clusterfs>
</service>
and...
<service autostart="1" domain="bobdmn" name="nfssvc">
<clusterfs device="/dev/bob_vg/lvol0" force_unmount="0" fsid="51084"
fstype="gfs" mountpoint="/mnt/bob" name="bobfs" options="acl"/>
<nfsexport name="NFSexports"/>
<nfsclient name="trin-16" options="rw" target="trin-16.lab.msp.redhat.com"/>
</service>
The difference is primarily architectural. Resources in the <resources> block can be used multiple times; resources declared inside a <service> block may only be used in that one place. You can also detach a resource from one service and reattach it to another service if it's in the <resources> block; if it was privately declared, you must recreate it. The global section is primarily for <nfsclient>, <clusterfs> and <nfsexport> resources.
Yes, you can. Starting with U3, you can have rgmanager log to a different place and at a different level by changing the cluster/rm tag. For example:
<rm log_facility="local4" log_level="7">
Then you can add "local4.* /var/log/foo" to your /etc/syslog.conf file to send daemon output to file foo.
Note: The default log level for rgmanager is 5 (LOG_NOTICE).
Channel bonding.
No, not really. The channel bonding driver is designed for this purpose. When used with a good, internally redundant switch that supports trunking, you end up with higher bandwidth and increased availability.
The rgmanager script 'depth' indicates how intensive the check is. There used to be 0, 10 and 20.
0 was "Is the IP still there?"
10 was "Can I ping it, and is the ethernet link up?"
20 used to be (but was removed) - "Attempt to ping the router."
There is a parent/child inheritance relationship with nfs exports. Your problem might be that you don't have your nfs client as a child of the nfs export. For example:
Wrong:
<service autostart="1" domain="nfs" name="nfs">
<fs device="/dev/nfsvol/lvol01" force_fsck="0" force_unmount="1" fsid="8508" fstype="ext3" mountpoint="/export" name="/export" options="" self_fence="1"/>
<nfsexport name="/export"/>
<nfsclient name="/export" options="rw" target="81.19.179.*"/>
<ip address="192.168.1.77" monitor_link="1"/>
</service>
Right:
<service name="nfstest" nfslock="1">
<fs ref="NFS Mount">
<nfsexport name="exports">
<nfsclient ref="world-rw"/>
</nfsexport>
</fs>
<ip address="192.168.1.77/22"/>
</service>
The GUI (system-config-cluster) will tell you where the services are running.
From the command line, the clustat command will tell you as well.
The interval is in the script for each service, in /usr/share/cluster/
It's easier to just change the script.sh file to use whatever value you want (<5 is not supported, though). Checking is per-resource-type, not per-service, because it takes more system time to check one resource type vs. another resource type.
That is, a check on a "script" might happen only every 30 seconds, while a check on an "ip" might happen every 10 seconds.
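The interval lives in the <action> entries of the resource agent's metadata. A hedged sketch of what that section might look like in /usr/share/cluster/script.sh (the values shown are illustrative, not the shipped defaults):

```xml
<actions>
    <!-- shallow status check (depth 0) every 30 seconds -->
    <action name="status" depth="0" interval="30"/>
</actions>
```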
The status checks are not supposed to consume system resources. Historically, people have done one of two things which generate support calls:
(a) they do not set a status check interval at all (why is my service not being checked?), or
(b) they set the status check interval to something way too low, like 10 seconds for an Oracle service (why is the cluster acting strange/running slowly?).
If the status check interval is lower than the actual amount of time it takes to check the status of a service, you end up with endless status-checking, which is a pure waste of resources.
A false start is a start where the first status check fails.
A restart occurs after a status check fails.
If either of those values are exceeded, the service is relocated rather than restarted locally.
Note: These values pertained only to clumanager and were phased out after RHEL3. These values don't exist for the current cluster suite (and it would be difficult to add them).
If you relocate one service by hand, the other one will not automatically follow. However, if the node running the two services fails, both services should be relocated to a failover node automatically.
WARNING: You should never reference the same ext3 file system from two services. Two services may reference the same GFS file system, but not the same ext3 file system.
There are a couple of possibilities. First, you could be a victim of the "resource scripts not returning 0 when they should" bug described in the next question. Otherwise, you might have a "resource collision" which is a little bit more complicated.
To determine if you have a resource collision, run this command:
# rg_test test /etc/cluster/cluster.conf
You have a resource collision if the output looks something like this:
Unique/primary not unique type clusterfs, name=WWWData
Error storing clusterfs resource
This can happen, for example, if you cut and paste a service section in your cluster.conf file and forget to change the name. For example, check out this invalid snippet:
<service autostart="1" domain="apache25" name="apache25">
<clusterfs device="/dev/emcpowerd1" force_unmount="0"
fsid="41106" fstype="gfs" mountpoint="/opt/www" name="WWWData" options="">
<script ref="vsftpd"/>
</clusterfs>
<clusterfs device="/dev/emcpowera1" force_unmount="0"
fsid="30342" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft" options="">
<script ref="apache start-stop"/>
</clusterfs>
</service>
<service autostart="1" domain="apache26" name="apache26">
<clusterfs device="/dev/emcpowerd1" force_unmount="0"
fsid="41107" fstype="gfs" mountpoint="/opt/www" name="WWWData" options="">
<script ref="vsftpd"/>
</clusterfs>
<clusterfs device="/dev/emcpowerb1" force_unmount="0"
fsid="30343" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft" options="">
<script ref="apache start-stop"/>
</clusterfs>
</service>
In the example above, the apache26 service has two resource collisions with the apache25 service:
1. The "WWWData" clusterfs resource appears in both services with the same name, the same device, and the same mount point. You should put this one in your <resources> block and pass it by reference.
2. The "WWWSoft" clusterfs resource uses the same name for two different devices. You need to rename one to something else to resolve the naming collision. The mount point is also the same, and that must be unique.
When rgmanager detects collisions between attributes of a resource type which are required to be unique across the resource type, it stops parsing that branch of the tree. So references to scripts in the apache26 service are largely ignored in the example above.
If the collisions are fixed, rgmanager should start the service.
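To illustrate the first fix, here is a hedged sketch (attribute values taken from the example above, structure simplified) that defines the shared GFS file system once in the <resources> block and passes it to both services by reference:

```xml
<rm>
    <resources>
        <!-- Defined exactly once; both services reference it by name -->
        <clusterfs device="/dev/emcpowerd1" force_unmount="0" fsid="41106"
                   fstype="gfs" mountpoint="/opt/www" name="WWWData" options=""/>
    </resources>
    <service autostart="1" domain="apache25" name="apache25">
        <clusterfs ref="WWWData"/>
    </service>
    <service autostart="1" domain="apache26" name="apache26">
        <clusterfs ref="WWWData"/>
    </service>
</rm>
```

Sharing one file system between two services this way is only legitimate for GFS; as noted earlier, two services must never reference the same ext3 file system.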
According to the LSB specification, an init script should always return 0 if everything is running correctly, and should return a non-zero code only if the service is not running, even if the service was stopped "cleanly." Unfortunately, many of the stock Red Hat init scripts have not adhered to this rule in various releases, and that causes these kinds of rgmanager symptoms. A lot of people just edit the init scripts by hand, but there are various patches available. For example, here's a patch to fix httpd in RHEL4:
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111998
For more information, see this bugzilla:
Here are some rules for service script writing:
Always return "0" if the status is non-fatal. Rgmanager follows the LSB + OCF RA API draft specifications. That means that 0 is "OK" from running a "status" operation and non-zero is "not OK". Some scripts do check-restarts of non-critical components from within the "status" check.
If you have a "recover" action in a resource agent (resource agent > plain script), rgmanager will try a "recover" operation first (and will fall back to full service restart if recovery fails). A "recover" action is by definition *not allowed* to interfere with other parts of the service. So, if a component X fails, and recovery succeeds, the rest of the service continues running uninterrupted.
None of our agents except the nfsclient have recover actions because they're all considered critical (file systems, IPs, etc.).
You could add a "critical" parameter to resource agents and always return 0 if "$OCF_RESKEY_critical" is not set in the script(s), or not allow recover actions if "critical" is set... etc.
A failover domain is an ordered subset of members to which a service may be bound. The following semantics govern how the different configuration options affect the behavior of a failover domain:
Preferred node or preferred member: This is a notion which is no longer present in rgmanager. In older versions, the preferred node was the member designated to run a given service if that member was online. In most cases, it was used with the "Relocate on Preferred Node Boot" service option (as it was generally thought to be useless without it!). In newer rgmanagers, this behavior can be emulated by specifying an unordered, unrestricted failover domain of exactly one member. There is no equivalent to the "Relocate on Preferred Node Boot" option in Cluster Manager 1.0.x.
Restricted domain: Services bound to the domain may only run on cluster members which are also members of the failover domain. If no members of the failover domain are available, the service is placed in the stopped state.
Unrestricted domain: Services bound to this domain may run on all cluster members, but will run on a member of the domain whenever one is available. This means that if a service is running outside of the domain and a member of the domain comes online, the service will migrate to that member.
Ordered domain: The order specified in the configuration dictates the order of preference of members within the domain; the highest-ranking online member of the domain will run the service. This means that if member A has a higher rank than member B and the service is running on B, the service will migrate to A when A transitions from offline to online.
Unordered domain: Members of the domain have no order of preference; any member may run the service. In an unordered domain, however, services will still always migrate to members of their failover domain whenever possible.
Ordering and restriction are flags and may be combined in any way (i.e. ordered+restricted, unordered+unrestricted, etc.). These combinations affect both where services start after initial quorum formation and which cluster members will take over a service when it fails.
You can have multiple nodes per ordered level in the failover domains with RHEL4 and RHEL5, but not with RHEL3.
Examples:
Given a cluster comprised of this set of members: {A, B, C, D, E, F, G}
Ordered, restricted failover domain {A, B, C}: With a quorum, service 'S' will always run on member 'A' whenever member 'A' is online. If all members of {A, B, C} are offline, the service will not run. If the service is running on 'C' and 'A' transitions online, the service will migrate to 'A'.
Unordered, restricted failover domain {A, B, C}: A service 'S' will only run if there is a quorum and at least one member of {A, B, C} is online. If another member of the domain transitions online, the service does not relocate.
Ordered, unrestricted failover domain {A, B, C}: A service 'S' will run whenever there is a quorum. If a member of the failover domain is online, the service will run on the highest-ordered online member. That is, if 'A' is online, the service will run on 'A'.
Unordered, unrestricted failover domain {A, B, C}: This is also called a "Set of Preferred Members". When one or more members of the failover domain are online, the service will run on a nonspecific online member of the failover domain. If another member of the failover domain transitions online, the service does not relocate.
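As a hedged illustration, an ordered, restricted domain like the first example above might be written in cluster.conf as follows (the domain name and node names are invented for this sketch):

```xml
<rm>
    <failoverdomains>
        <!-- Ordered + restricted: service S runs only on A, B, or C,
             preferring A (priority 1 is the highest rank). -->
        <failoverdomain name="prefer_A" ordered="1" restricted="1">
            <failoverdomainnode name="A" priority="1"/>
            <failoverdomainnode name="B" priority="2"/>
            <failoverdomainnode name="C" priority="3"/>
        </failoverdomain>
    </failoverdomains>
    <service autostart="1" domain="prefer_A" name="S"/>
</rm>
```

Dropping ordered="1" or restricted="1" (or both) from the <failoverdomain> tag yields the other three combinations described above.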
Try adding nfslock="1" to the <service> tag:
<service autostart="1" domain="nfsdomain" name="nfssvc" nfslock="1">
Also, don't forget to enable force-unmount of the file system:
<fs device="/dev/testing/test01" force_unmount="1" fstype="ext3" mountpoint="/test" name="test01" options="">
We've tried to optimize the failover time as much as we can, so there isn't much room for improvement. However, if you're not doing NFS as part of your service (e.g. with the "nfsexport/nfsclient setup"), you can delete the "sleep 10" in the /usr/share/cluster/ip.sh script. That will speed things up a bit.
This is usually caused by incorrect use of the 'path' attribute for the VM resource. The path attribute is like the environment path in a shell: it is a colon-separated list of directories, and is *NOT* a path to an individual file. Example of an exec search path:
PATH=/sbin:/bin:/usr/sbin:/usr/bin
Example 'path' as a vm attribute in cluster.conf:
<vm name="foo" path="/etc/xen" ... />
<vm name="foo2" path="/etc/xen:/usr/etc/xen" ... />
Example of an incorrect 'path' as a vm attribute in cluster.conf (assuming /etc/xen/foo is a Xen domain config file):
<vm name="foo" path="/etc/xen/foo" ... />
It's just an overactive XML validity checker. You should be fine ignoring this error.
Yes, but it's new and only available for RHEL5 and Fedora Core 6. It's called Conga. More information can be found here:
http://sourceware.org/cluster/conga/
Yes, but there isn't much documentation to support them.
When integrating into the latest Cluster Project (HEAD branch in CVS), use the cman api, dlm api and openais api. When integrating into the Cluster Project (RHEL4 or STABLE branches in CVS), use the Magma api. For GFS and GFS2 disk tools that require the file system NOT be mounted, use the libgfs and libgfs2 apis respectively.
Magma was the cluster API we used for RHEL4, with minimal documentation in magma/doc/magma.txt. It uses a plugin infrastructure to translate very simple APIs to cluster-specific APIs. For example, it allows rgmanager in the RHEL4 branch to operate almost identically when either CMAN+DLM are in use or GuLM is in use.
Due to the move towards Open AIS, which implements standards-based SAF AIS APIs, further development of the Magma API has ceased. Applications utilizing the Magma API should either be ported to the SAF AIS CLM+CLK APIs, or use the CMAN and DLM APIs. If you have written a plugin which implements the current set of Magma APIs for your infrastructure, you can submit it to the linux-cluster mailing list for inclusion in the RHEL4 / STABLE branches.
Load Balancing is a mechanism that tries to distribute the workload evenly throughout a cluster. For example, if your cluster does ftp serving, and 400 people all try to download the latest file you're serving up, your server may choke under the pressure of 400 requests. Load balancing can help you distribute that workload evenly throughout your cluster so that 400 FTP requests can be evenly distributed among twenty nodes with 20 requests each. Still, the clients only need to go to one FTP site to get the data.
This is achieved with LVS and Network Address Translation (NAT) routing which translates your world-viewable FTP address into any number of real IP addresses in your cluster.
You only need it if you need true active/active services with a distributed workload.
LVS stands for Linux Virtual Server. It uses a mechanism called Network Address Translation (NAT) to route requests from one IP address to another and that's what achieves true load-balancing. With LVS, you have a second "layer" of servers (called LVS routers) whose job is to equally distribute the requests. Only one router is active at a time; additional routers are needed to provide failover capabilities in case the first router fails.
With LVS, all requests come in to a central LVS server known as the Active Router. The router decides which server to give the request to, based on one of several selectable methods. A second router (called the backup router) monitors the network and takes over if the active router fails.
Piranha is a graphical LVS configuration tool. You only need it if you're planning to do Load Balancing.
http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/pt-lvs.html
Question: I want disaster recovery load balancing. In other words, I want two load balancers in two locations, 50 miles apart, on different subnets. For example, I want to have www.mycompany.com with a public-facing virtual ip address, with two servers: a primary server (192.168.0.5) and a failover server (172.31.0.5). I know this isn't best practice but it's still what I want. Is it possible?
Not with Cluster Suite alone. However, there's a concept called Global Server Load Balancing (GSLB). There are a few different GSLB solutions. For example, Foundry Networks sells a kit that will do this. For more information, see:
Disclaimer: The author has no first-hand knowledge of gslb, so this should NOT be considered an endorsement.