Recent changes:
02 Jan 2008 - Revised question: How do I change the time after which a non-responsive node is considered dead?
04 Dec 2007 - Added question: My RHEL5 or similar cluster won't work with my Cisco switch.
09 Nov 2007 - Added question: On Fedora 8, CMAN won't start, complaining about "aisexec not started". How do I fix it?
18 Oct 2007 - Revised question: What's the "right" way to get cman to use a different NIC, say, eth2 rather than eth0?
07 Aug 2007 - Revised question: What is a tie-breaker, and do I need one in two-node clusters?
26 Jul 2007 - Revised question: I want to use GFS for Samba (smb) file serving. Is that okay?
05 Jul 2007 - Added question: I get 'generic error' while trying to start a Xen guest as a service, how do I fix it?
20 Jun 2007 - Revised answer: If my cluster is mission-critical, can I override quorum rules and have a "last-man-standing" cluster that's still functioning?
20 Jun 2007 - Added question: Do I really need a shared disk to use QDisk?
20 Jun 2007 - Revised question: How do I set up a quorum disk/partition?
05 Jun 2007 - Revised question: What ports do I have to enable for the iptables firewall?
14 May 2007 - Revised question: On RHEL5, why do I get "cman not started: Can't bind to local cman socket /usr/sbin/cman_tool"?
11 May 2007 - Revised question: What's the "right way" to propagate the cluster.conf file to a running cluster?
01 May 2007 - Added question: Can I speed up the time it takes to fail over a service?
01 May 2007 - Added question: On RHEL5, why do I get "cman not started: Can't bind to local cman socket /usr/sbin/cman_tool"?
01 May 2007 - Clarified question: In RHEL3, what is the explanation for maximum restarts and maximum false restarts?
If you have corrections, please send them to Bob Peterson: rpeterso@redhat.com
If you have questions, please send them to the mailing list: linux-cluster@redhat.com
To subscribe to the linux-cluster mailing list, please visit the following page:
https://www.redhat.com/mailman/listinfo/linux-cluster
The Cluster Project is a set of components designed to enable clustering: a group of computers sharing resources such as storage devices and services. Clustering ensures data integrity when people work on shared devices from multiple machines (or virtual machines) at the same time.
Red Hat Cluster Suite is a marketing term under which some of this software is promoted. Red Hat has bundled components from the cluster project together and made them available for its various platforms.
Somewhere around 1996, Red Hat developed its first Cluster Suite, which primarily managed cluster-cooperative services. That's the equivalent of rgmanager now.
Between 1997 and 2003, Sistina Software, a spin-off of a project at the University of Minnesota, developed a clustering file system that became the Global File System (GFS), which it sold to customers.
In 2004, Red Hat, Inc. bought Sistina, merged GFS into its Cluster Suite, and open-sourced the whole thing.
Today, the cluster project belongs to the people and is available for free to the public through Red Hat's CVS repository. The open-source community continues to improve and develop the cluster project with new clustering technology and infrastructures, such as OpenAIS.
That depends on what version you are using. Like all active technology, it is constantly evolving. The Cluster Project involves development in many different areas including:
Assuming you have all the necessary pieces and/or RPMs in place, there are four ways to configure a cluster:
The cluster configuration system (ccs) tries to manage the cluster.conf file and keep all the nodes in sync. If you make changes to the cluster.conf file, you have to tell ccs and cman that you did it, so they can update the other nodes. If you don't, your changes are likely to be overwritten with an older version of the cluster.conf file from a different node. See the next question.
The cluster configuration GUIs take care of propagating changes to cluster.conf to your cluster. The system-config-cluster GUI has a big button that says "Send to Cluster". If you're maintaining your cluster.conf file by hand and want to propagate it to the rest of the cluster, do this:
Note: For RHEL5 and similar, the cman_tool version -r step is no longer necessary.
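In command form, the propagation looks like this (a sketch: remember to increment config_version inside cluster.conf before pushing, and replace 42 with your actual new version number):

```shell
# Run on one cluster node after editing /etc/cluster/cluster.conf:
ccs_tool update /etc/cluster/cluster.conf   # push the new file to all nodes
cman_tool version -r 42                     # tell cman about config version 42
```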
A list of options can be found at the following link. I won't guarantee it's complete, but it's pretty close:
http://sources.redhat.com/cluster/doc/cluster_schema.html
Take a look at the man page for cluster.conf (5). There's also a small example in the usage.txt file: cluster/doc/usage.txt
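For the impatient, here is a minimal sketch of a two-node cluster.conf with APC power fencing. All node names, addresses and passwords here are made up; check the schema link above and the man page for the real attribute list:

```xml
<?xml version="1.0"?>
<cluster name="alpha_cluster" config_version="1">
  <clusternodes>
    <clusternode name="node-01" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="apc1" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node-02" votes="1" nodeid="2">
      <fence>
        <method name="single">
          <device name="apc1" port="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="apc1" agent="fence_apc" ipaddr="10.0.0.10"
                 login="apc" passwd="apc"/>
  </fencedevices>
  <rm/>
</cluster>
```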
The GFS 6.0 cluster code runs on the 2.4.xx series kernels (for Red Hat Enterprise Linux 3). The GFS 6.1 code runs on the 2.6.xx series kernels for Red Hat Enterprise Linux 4, Fedora Core and other distributions.
The source code for the current development tree is kept in the Red Hat CVS repository: http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/?cvsroot=cluster
You can check the entire source code tree out from CVS with this command:
cvs -d :ext:sources.redhat.com:/cvs/cluster co cluster
The cluster project code in our CVS head is development code. If you want the stable version, you can check it out from CVS with this command:
cvs -d :ext:sources.redhat.com:/cvs/cluster co -r STABLE cluster
The cluster project was primarily designed to run on Linux. Some of the cluster infrastructure, such as OpenAIS, has been successfully ported to FreeBSD and possibly Darwin.
The project page is: here
It depends on which components you need to use. For a basic cluster, all you need is two or more computers and a network between them. If you want to use GFS, you'll need shared storage.
This is a moving target, so it's hard to give up-to-date information. However, without naming names, the largest single GFS cluster in production that we know of was at an oil and gas company: 152 nodes directly on a SAN (McData switches, QLogic 1Gb HBAs and LSI storage). That customer ran the cluster for almost two years, using GULM locking, but it is no longer in use; the company was acquired by a larger company and the architecture changed.
As of this writing, we haven't tested GFS 6.1 with DLM locking past 31 nodes.
It depends on what you're planning to do. The point of using GFS and CLVM is that you have storage you want to share between machines concurrently. Without shared storage, you have a local filesystem and lvm2, neither of which need the cluster infrastructure. If you want to use the cluster infrastructure for High Availability services, you don't need shared storage.
Yes. They are here:
And, of course, this FAQ.
These ports should be enabled:
Port | Program | Protocol |
41966 | rgmanager/clurgmgrd | tcp |
41967 | rgmanager/clurgmgrd | tcp |
41968 | rgmanager/clurgmgrd | tcp |
41969 | rgmanager/clurgmgrd | tcp |
50006 | ccsd | tcp |
50007 | ccsd | udp |
50008 | ccsd | tcp |
50009 | ccsd | tcp |
21064 | dlm | tcp |
6809 (RHEL4 and under) 5405 (RHEL5 and above) | cman (RHEL4 and under) openais (RHEL5 and above) | udp |
14567 | gnbd | tcp |
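As a sketch, the table above translates into iptables rules like these (run as root; restricting the rules to your cluster subnet with -s is a good idea but omitted here for brevity):

```shell
# Open the cluster ports listed above (RHEL4-style; on RHEL5 use
# udp 5405 for openais instead of udp 6809 for cman).
for p in 41966 41967 41968 41969 50006 50008 50009 21064 14567; do
    iptables -I INPUT -p tcp --dport $p -j ACCEPT
done
iptables -I INPUT -p udp --dport 50007 -j ACCEPT   # ccsd
iptables -I INPUT -p udp --dport 6809  -j ACCEPT   # cman (RHEL4 and under)
service iptables save                              # make the rules persistent
```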
Yes. From time to time, we build the STABLE branch against different kernels and release the tarballs. You'll find them here: ftp://sources.redhat.com/pub/cluster/releases/
The cluster software isn't specific to any Linux distribution or release. However, many of the users are running the software on Red Hat Enterprise Linux (RHEL). Most customers are currently running on RHEL4 (or the RHEL4 equivalent of CentOS, or at least the RHEL4 branch of the source tree in CVS). So they may want to know the differences between the way things work now in RHEL4 and how they'll work in RHEL5.
This list is by no means complete, but these are the differences I know about offhand:
Actually, the plan is to release this in RHEL4 U5.
However, since users can use a cluster without clvmd, gfs or rgmanager, they are still separate init scripts.
The logical volume manager supports a new "locking_type = 3" setting, which selects the appropriate locking for clustered and non-clustered volumes.
See question "What improvements will GFS2 have over GFS(1)?"
Some of the less noticed internal changes:
They were accepted into the 2.6.18 upstream kernel by kernel.org.
It's definitely not a good idea to mix the two within a single cluster. With the introduction of RHEL5, there are now two distinct and separate cluster infrastructures. The older (RHEL4 or STABLE branch in CVS) infrastructure passes cluster messages using a kernel module (cman or the one internal to gulm). The newer infrastructure (RHEL5 or HEAD branch in CVS) passes cluster messages using openais and userland daemons. If you try to mix and match the two, it will not work.
That said, you could probably still fetch the STABLE branch of the cluster code from CVS, compile it on a RHEL5 system, and have it interact properly in a RHEL4 cluster through the old infrastructure. Since the STABLE branch tracks the upstream kernel, you may also need to build a newer kernel from source code as well on the RHEL5 system.
It would be extremely difficult, if not impossible, to go the other way around (i.e. to get the new infrastructure and openais running on a RHEL4 system so it could interact with a RHEL5 cluster).
Yes you can. For example, you could have a single computer, running Xen virtualization, act as a complete cluster consisting of several Xen guests. There are special fencing issues to consider. For example, if you use power fencing, one guest could cause the whole machine to be powered off and never come back (because it wouldn't be alive to tell the power switch to power back on). There is a special fencing agent designed to reboot Xen guests as needed.
You can also create clusters made of several computers, each of which has several virtual Xen guest nodes. This has other fencing complications. For example, a Xen guest can't use a simple Xen fencing agent to reboot a Xen guest that's running on a different physical computer.
As I understand it, the problem is due to the fact that Xen nodes tear down and rebuild the Ethernet NIC after cluster suite has started. We're working on a more permanent solution. In the meantime, here is a workaround:
GFS is the file system that runs on each of the nodes in the cluster. Like all file systems, it is basically a kernel module that runs on top of the vfs (virtual file system) layer of the kernel. It controls how and where the data is stored on a block device or logical volume. In order to make a cluster of computers ("nodes") cooperatively share the data on a SAN, you need GFS's ability to coordinate with a cluster locking protocol. One such cluster locking protocol is dlm, the distributed lock manager, which is also a kernel module. Its job is to ensure that nodes in the cluster that share the data on the SAN don't corrupt each other's data.
Many other file systems, such as ext3, are not cluster-aware; data kept on a volume shared between multiple computers would therefore quickly become corrupt.
You need some form of shared storage - Fibre Channel and iSCSI are typical. If you don't have Fibre Channel or iSCSI, look at GNBD instead. Also, you need two or more computers and a network connection between them.
No. GFS will only allow PCs with shared storage, such as a SAN with a Fibre Channel switch, to work together cooperatively on the same storage. Off-the-shelf PCs don't have shared storage.
GFS 6.1 (on RHEL 4) supports 16TB when any node in the cluster is running 32 bit RHEL. If all nodes in the cluster are 64-bit RHEL (x86-64, ia64) then the theoretical maximum is 8 EB (exabytes). We have field reports of 45 and 50 TB file systems. Testing these configurations is difficult due to our lack of access to very large array systems.
I've seen more than one 45TB GFS file system. If you know of a bigger one, I'd love to hear from you.
Currently, gfs and gfs2 do not use milliseconds for file timestamps; they use seconds. This is to maintain compatibility with the underlying vfs layer of the kernel. If the kernel changes to milliseconds, we will also change.
People don't normally care about millisecond timestamps; they matter mostly to computers doing things like NFS file serving, for example to see whether another computer has changed the data on disk since the last known request. For GFS2, we're planning to implement inode generation numbers to keep track of these things more accurately than a timestamp can.
If I do:
[root@node-01#] gfs_tool setflag inherit_directio my_directory
[root@node-01#] gfs_tool gettune my_directory
It displays:
new_files_directio = 0
Here's what's going on: inherit_directio and new_files_directio are two separate things. If you look at the man page, inherit_directio operates on a single directory whereas new_files_directio is a filesystem-wide "settune" value. If you do:
gfs_tool setflag inherit_directio my_directory
You're telling the fs that ONLY your directory and all new files within that directory should have this attribute, which is why your tests act as expected as long as you're within that directory. It basically sets an attribute on an in-memory inode for the directory.
If instead you were to do:
gfs_tool settune mount-point new_files_directio 1
The new_files_directio value would change for the whole mount point, not just that directory. And of course, gfs_tool gettune my_directory reports that filesystem-wide flag, which is why it still shows 0.
No, it's not true. What it prevents is data corruption as a result of the node waking up and erroneously issuing writes to the disk when it shouldn't.
The simple fact is that no one can guarantee against loss of data when a computer goes down. If a client goes down in the middle of a write, its cached data will be lost. If a server goes down in the middle of a write, cached data will be lost unless the file system is mounted with the "sync" option. Unfortunately, the "sync" option has a huge performance penalty. GFS's journaling should minimize and/or guard against this loss.
With NFS failover, if a server goes down in the middle of an NFS request (which is far more likely), the failed NFS service should be failed over to another GFS server in the cluster. The NFS client should get a timeout on its write request, and that will cause it to retry the request, which should go to the server that has taken over the responsibilities of the failed NFS server. And GFS will ensure the original server having the problem will not corrupt the data.
You probably mistyped the cluster name on mkfs. Use the 'dmesg' command to see what GFS is complaining about. If that's the problem, you can use gfs_tool (or redo the mkfs) to fix it.
Even if this is not your problem, if you have a problem mounting, always use dmesg to view complaints from the kernel.
It depends on whether you're using GULM or DLM locking. If you're using DLM, use this command from a node that has it mounted:
cman_tool services
If you're using GULM, or aren't on a node that has it mounted, here's another way to do it:
for i in `grep "<clusternode name" /etc/cluster/cluster.conf | cut -d '"' -f2` ; do ssh $i "mount | grep gfs" ; done
Unlike ext3, GFS will dynamically allocate inodes as it needs them. Therefore, it's not a problem.
It depends on file size and file system block size. Assuming the file system block size is a standard 4K, let's do the math: A GFS inode is 232 bytes (0xe8) in length. Therefore, the most data you can fit along with an inode is 4096 - 232 = 3864 bytes. By the way, in this case we say the file "height" is 0.
Slightly bigger and the file needs a single level of indirection, also known as height 1. The inode's 3864 bytes are now used to hold a group of block pointers. These pointers are 64 bits (8 bytes) each, so exactly 483 of them fit in the inode block after the disk inode header. With all 483 pointers to 4K blocks, you have at most 1.88MB.
If your file gets over 1.88MB, it will need a second level of indirection (height 2); each indirect block has a 24-byte (0x18) header and 64 bytes of reserved space. That means your inode will have at most 483 pointers to 4K blocks, each of which can hold 501 block pointers. So 483 * 501 = 241983 blocks, or 991162368 bytes of data (945MB).
If your file is bigger than 945MB, you'll need a third level of indirection (height 3), which means your file can grow to have 945MB of pointers, enough for 121233483 pointers. The file can grow to 496572346368 bytes, or 473568MB, also known as 462GB.
Still bigger, at height 4, we get a max file size of 248782745530368, also known as 231696GB or 226TB.
If your file is bigger than 226TB, (egads!) height 5, max file size is 124640155510714368 bytes, also known as 113359TB.
Also, extended attributes like ACLs, if used, take up more blocks.
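The arithmetic above can be reproduced with plain shell arithmetic. This is just a sketch; the 232-byte disk inode and 88-byte indirect-block header sizes are taken from the text above:

```shell
# Recompute the GFS per-height file size maximums, assuming a 4K
# file system block size.
block=4096
inline=$((block - 232))            # 3864 bytes left in the inode block
p0=$((inline / 8))                 # 483 pointers fit after the disk inode
pp=$(((block - 88) / 8))           # 501 pointers per indirect block
h1=$((p0 * block))                 # height 1 max: 1978368 bytes (~1.88MB)
h2=$((p0 * pp * block))            # height 2 max: 991162368 bytes (~945MB)
h3=$((p0 * pp * pp * block))       # height 3 max: 496572346368 bytes (~462GB)
h4=$((p0 * pp * pp * pp * block))  # height 4 max: 248782745530368 bytes (~226TB)
echo "$h1 $h2 $h3 $h4"
```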
Yes you can. Since GFS can manage the contents of a block device (SCSI, logical volume, etc), there is still the underlying logical volume manager, LVM2, that takes care of things like spanning physical volumes, striping, hardware RAID, mirroring and such. For clusters, there is a special version of LVM2 called CLVM that is needed, but not much changes other than the locking protocol specified in /etc/lvm/lvm.conf.
Note that GFS won't work properly in a cluster with software RAID (the MD driver). At the time of this writing, software RAID is not cluster-aware. Since software RAID can only be running on one node in the cluster, the other nodes will not be able to see the data properly, or will likely destroy each other's data. However, if GFS is used as a stand-alone file system on a single-node, software RAID should be okay.
Sometime after 2.6.15, the upstream kernel changed from using the semaphores (i_sem) within the VFS layer to using mutexes (i_mutex). If your Linux distribution is running an older kernel, you may not be able to compile GFS.
Your choices are: (1) upgrade your kernel to a newer one, or (2) downgrade your GFS or change the source code so that it uses semaphores like before. Older versions are available from CVS.
Because this is an open-source project, it's constantly evolving, as is the Linux kernel. Compile problems are to be expected (and usually easily overcome) unless you are compiling against the exact same kernel the developers happen to be using at the time.
Surprisingly, yes. ATIX has a SourceForge project called "Open-Sharedroot" for this purpose.
Visit http://www.open-sharedroot.org/ for more information.
There's a quick how-to at:
http://www.open-sharedroot.org/documentation/the-opensharedroot-mini-howto.
Mark Hlawatschek from Atix gave a presentation about this at the 2006 Red Hat Summit. His slides can be seen here:
http://www.atix.de/downloads/vortrage-und-workshops/ATIX_Shared-Root-Cluster.pdf.
Yes, with the following caveats:
See the following for more information:
Red Hat GFS: Installing and Configuring Oracle9i RAC with GFS:
http://www.redhat.com/docs/manuals/csgfs/oracle-guide/
RAC Technologies Compatibility Matrix for Linux Clusters:
http://www.oracle.com/technology/products/database/clustering/certify/tech_generic_linux.html
RAC Technologies Compatibility Matrix for Linux x86 Clusters:
http://www.oracle.com/technology/products/database/clustering/certify/tech_linux_x86.html
RAC Technologies Compatibility Matrix for Linux x86-64 (AMD64/EM64T) Clusters:
http://www.oracle.com/technology/products/database/clustering/certify/tech_linux_x86_64.html
Oracle Certification Environment Program:
http://www.oracle.com/technology/software/oce/oce_fact_sheet.htm
Not currently. However, playing this song at high volume in your data center has been rumored to introduce entropy into the GFS+RAC configuration. Please consider Mozart or Chopin instead.
Yes, that's a joke, ha ha... Yes and no. Yes, it's possible, and one application will not block the other. No, because only one node can cache the contents of the inode in question at any given time, so performance may be poor. The application should use some kind of locking (for example byte-range locking, i.e. fcntl) to protect the data.
However, GFS does not excuse the application from locking to protect the data. Two processes trying to write data to the same file can still clobber each other's data unless proper locking is in place to prevent it.
Here's a good way to think about it: GFS will make two or more processes on two or more different nodes be treated the same as two or more processes on a single node. So if two processes can share data harmoniously on a single machine, then GFS will ensure they share data harmoniously on two nodes. But if two processes would collide on a single machine, then GFS can't protect you against their lack of locking.
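That cooperation can be sketched from the shell with flock(1). This is ordinary application-level locking, not a GFS-specific facility, but the same idea carries across nodes since GFS flocks are cluster-wide (the file paths here are just examples):

```shell
# Two writers (imagine them on different nodes against a GFS file)
# serialize their appends through an exclusive lock, so their updates
# cannot interleave mid-write.
LOG=/tmp/shared.log
LOCK=/tmp/shared.log.lock
append_locked() {
    (
        flock -x 9              # block until we own the lock
        echo "$1" >> "$LOG"
    ) 9>"$LOCK"
}
: > "$LOG"
append_locked "update from process A"
append_locked "update from process B"
```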
If you have shared storage that you need to mount read/write, then you still need it. Perhaps it's best to explain why with an example.
Suppose you had a fibre-channel linked SAN storage device attached to two computers, and suppose they were running in a cluster, but using EXT3 instead of GFS to access the data. Immediately after they mount, both systems would be able to see the data on the SAN. Everything would be fine as long as the file system was mounted as read-only. But without GFS, as soon as one node writes data, the other node's file system doesn't know what's happened.
Suppose node A creates a file, assigns inode number 4351 to it, and writes 16K of data to it in blocks 3120 and 2240. As far as node B is concerned, there is no inode 4351, and blocks 3120 and 2240 are free. So it is free to create its own inode 4351 and write data to block 2240, still believing block 3120 is free. The file system's maps of used and unused data areas would soon overlap, as would the inode numbers. It wouldn't take long before the whole file system was hopelessly corrupt, along with the files inside it.
With GFS, when node A assigns inode 4351, node B automatically knows about the change, and the data is kept harmoniously on disk. When one data area is allocated, all nodes in the cluster are aware of the allocation, and they don't bump into one another. If node B needs to create another inode, it wouldn't choose 4351, and the file system would not be corrupted.
However, even with GFS, if nodes A and B both decide to operate on a file X, even though they both agree on where the data is located, they can still overwrite the data within the file unless the program doing the writing uses some kind of locking scheme to prevent it.
If you turn set-group-ID on and then turn group-execute off, you mark a file for mandatory locking, and 'ls' shows a capital 'S' in the group-execute position. A file with both the group-execute bit and set-group-ID on (the result of a chmod 2770) is not marked for mandatory locking, and looks like this in 'ls':
-rwxrws--- 1 tangzx2 idev 347785 Jan 17 10:22 temp.txt
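The mode-bit encoding can be demonstrated on any file system, GFS included (a sketch; the file path is an example):

```shell
# setgid with group-execute OFF marks a file for mandatory locking
# and shows a capital 'S'; with group-execute ON it shows 's' instead.
f=/tmp/mlock.demo
touch "$f"
chmod 2640 "$f"                 # setgid on, group-execute off
ls -l "$f" | cut -c1-10         # -rw-r-S---  (marked for mandatory locking)
chmod 2770 "$f"                 # setgid on, group-execute on
ls -l "$f" | cut -c1-10         # -rwxrws---  (NOT mandatory locking)
```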
Not really. The gfs_mkfs command decides exactly where everything should go and you have no choice in the matter. The volume is carved into logical "sections." The first and last sections are for multiple resource groups, based roughly on the rg size specified on the gfs_mkfs commandline. The journals are always placed between the first and last section. Specifying a different number of journals will force gfs_mkfs to carve the section size smaller, thus changing where your journals will end up.
Only insofar as Linux is. Linux isn't 100% POSIX-compliant, but GFS is as compliant as any other file system can be under Linux.
No. GFS and GFS2 do not currently have the ability to shrink, so you cannot reduce the size of your file system.
Mostly due to design constraints. An ls -r * can simply traverse the directory structures, which is very fast. An ls -lr * has to traverse the directory, but also has to stat each file to get more details for the ls. That means it has to acquire and release a cluster lock on each file, which can be slow. We've tried to address these problems with the new GFS2 file system.
It is possible to create GFS on an MD device as long as you are only using it for multipath. Software RAID is not cluster-aware and therefore not supported with GFS. The preferred solution is to use device mapper (DM) multipathing rather than md in these configurations.
Put it in /etc/fstab.
During startup, the "service gfs start" script (/etc/rc.d/init.d/gfs) gets called by init. The script checks /etc/fstab to see if there are any gfs file systems to be mounted. If so, it loads the gfs device driver and appropriate locking module, assuming the rest of the cluster infrastructure has been started.
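For example, an /etc/fstab entry might look like this (the device and mount point are made up; substitute your own):

```
/dev/vg_cluster/lv_gfs  /mnt/gfs  gfs  defaults  0 0
```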
GFS2 will address some of the shortcomings of GFS1:
I don't think anyone has speculated about this, and it's still too early for performance comparisons.
With GFS, the first node to access a file becomes its lock master. Therefore, access to that file will be faster than other nodes.
In the RHEL4 and STABLE branches of the code in CVS, SELinux is not currently supported.
In the development version (HEAD) and in upcoming releases, this support is built in.
That depends highly on the type of hardware that it's running on. File system check (fsck) operations take a long time regardless of the file system, and we'd rather do a thorough job than a fast one.
Running it in verbose mode (-v) will also slow it down considerably.
We recently had a report of a 45TB GFS file system on a dual Opteron 275 (4GB RAM) with 4Gb Fibre Channel to six SATA RAIDs. The 4GB of RAM was not enough to do the fsck; gfs_fsck required about 15GB to do the job, so a large swap drive was added. It took 48 hours for gfs_fsck to run to completion without verbose mode.
Yes it does.
Yes, but you need to be careful.
If you only want one MySQL server running, (Active-Passive) there's no problem. You can use rgmanager to manage a smooth failover to redundant MySQL servers if your MySQL server goes down. However, you should be aware that in some releases, the mysql init script has an easily-fixed problem where it doesn't return the proper return code. That can result in rgmanager problems with starting the service.
If you want multiple MySQL services running on the cluster (Active-Active), that's where things get tricky. You can still use rgmanager to manage your MySQL services for High Availability. However, you need to configure MySQL so that:
If you don't follow these rules, the multiple mysqld servers will not play nice in the cluster and your database will likely be corrupted.
For information on configuring MySQL, visit the mysql web site: http://www.mysql.com
MySQL also sells a clustered version of MySQL called "MySQL Cluster", but that does its own method of clustering, and is completely separate from Cluster Suite and GFS. I'm not sure how it would interact with our cluster software. For more information, see: http://www.mysql.com/products/database/cluster/
It depends on where you keep your databases.
If you keep your databases on shared storage, such as a SAN or iSCSI, you should use a cluster-aware file system like GFS to keep the file system sane with the multiple nodes trying to access the data at the same time. You can easily use rgmanager to manage the servers, since all the nodes will be seeing the same data. Without a cluster file system like GFS, there's likely to be corruption on your shared storage.
If your databases are on storage that is local to the individual nodes (i.e. local hard drives), there are no data corruption issues, since the nodes won't have access to the storage on other nodes where the data is kept. However, if you plan to use rgmanager to provide High Availability (Active-Passive) for each of your database servers, you will probably want to keep a copy of each database on every node, so that any node can serve the database of a node that fails. You may have to copy it often, too, or the backup copy will quickly get out of sync with the original it backs up. Copying these databases between nodes can be tricky, so you may need to follow special instructions on the MySQL web site: http://www.mysql.com
Yes it is, for high-availability only (like MySQL, PostgreSQL is not yet cluster-aware). We even have a RG Manager resource agent for PostgreSQL 8 (only) which we plan to release in RHEL4 update 5. There is a bugzilla to track this work:
It depends on what you want to do with it.
You can serve samba from a single node without a problem.
If you want to use samba to serve the same shared file system from multiple nodes (clustered samba aka samba in active/active mode), you'll have to wait: there are still issues being worked out regarding clustered samba.
If you want to use samba with failover to other nodes (active/passive) it will work but if failover occurs, active connections to samba are severed, so the clients will have to reconnect. Locking states are also lost. Other than that, it works just fine.
When a node fails, cman detects the missing heartbeat and begins the process of fencing the node. The cman and lock manager (e.g. lock_dlm) prevent any new locks from being acquired until the failed node is successfully fenced. That has to be done to ensure the integrity of the file system, in case the failed node, now out of communication with the rest of the cluster, tries to write to the file system after the other nodes have detected the failure.
The fence is considered successful after the fence script completes with a good return code. After the fence completes, the lock manager coordinates the reclaiming of the locks held by the node that had failed. Then the lock manager allows new locks and the GFS file system continues on its way.
If the fence is not successful or does not complete for some reason, new locks will continue to be prevented and therefore the GFS file system will freeze for the nodes that have it mounted and try to get locks. Processes that have already acquired locks will continue to run unimpeded until they try to get another lock.
There may be several reasons why a fence operation is not successful. For example, if there's a communication problem with a network power switch.
There may be several reasons why a fence operation does not complete. For example, if you were foolish enough to use manual fencing and forgot to run the script that informs the cluster that you manually fenced the node.
That pretty much means your file system is corrupt. There are a number of ways that this can happen that can't be blamed on GFS:
I'm guessing that maybe you gave them the same locking table on gfs_mkfs, and they're supposed to be different. When you did mkfs, did you use the same -t cluster:fsname for more than one? You can find this out by doing:
gfs_tool sb <device> table
for each device and see if the same value appears. You can change it after the mkfs has already been done with this command:
gfs_tool sb <device> table cluster_name:new_name
We believe GFS is better than OCFS2 because GFS has several key features that are missing from OCFS2:
GFS | OCFS2 |
Integrated cluster infrastructure. You can even write your own cluster apps if you want. | No cluster infrastructure. Limited lock coordination through a quorum disk. |
Quorum disk optional; easily scales to 32 nodes (soon to scale to 100 or more nodes). Without a quorum disk, GFS already supports more than a hundred nodes. | Quorum disk limits you to 16 or fewer nodes |
Clustered volume manager lvm2-cluster | No clustered volume manager |
Limited support for extended attributes (ACLs currently supported, SELinux support will be available in RHEL4 U5, RHEL5 and going forward.) | No extended attribute support |
Memory mapped IO for interprocess communication | No memory mapped IO |
Quota support | No quota support |
Cluster-wide flocks and POSIX locks | No cluster-aware flock or POSIX locks |
POSIX Access Control Lists (ACLs) | No POSIX ACLs |
Robust fencing mechanism to ensure file system integrity | No fencing |
Integrated support for application failover (high availability) | No integrated application failover |
You shouldn't expect GFS to perform as fast as non-clustered file systems because it needs to do inter-node locking and file system coordination. That said, there are some things you can do to improve GFS performance.
This causes a bit more traffic among the nodes but can sustain a larger number of files.
This value is not persistent so it won't survive a reboot. If you want to make it persistent, you can add it to the gfs init script, /etc/init.d/gfs, after your file systems are mounted.
Daemon | Function | Frequency | Parameter |
gfs_glockd | Reclaim unused glock structures | As needed | Unchangeable |
gfs_inoded | Reclaim unlinked inodes | 15 secs | inoded_secs |
gfs_logd | Journal maintenance | 1 sec | logd_secs |
gfs_quotad | Write cached quota changes to disk | 5 secs | quotad_secs |
gfs_scand | Look for cached glocks and inodes to toss from memory | 5 secs | scand_secs |
gfs_recoverd | Recover dead machines' journals | 60 secs | recoverd_secs |
gfs_tool settune /mnt/bob3 inoded_secs 30
These values are not persistent so they won't survive a reboot. If you want to make them persistent, you can add them to the gfs init script, /etc/init.d/gfs, after your file systems are mounted.
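For example, you might append lines like these to /etc/init.d/gfs, after the point where your file systems are mounted. The mount point and values here are purely illustrative, not recommendations:

```
# Hypothetical additions to /etc/init.d/gfs, placed after the mount step;
# adjust the mount point and values for your own setup.
gfs_tool settune /mnt/bob3 scand_secs 30      # scan for unused glocks less often
gfs_tool settune /mnt/bob3 inoded_secs 30     # reclaim unlinked inodes less often
```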
There's a tool called gfs2_convert whose job is to convert a file system from gfs1 to gfs2. At this time, gfs2_convert will only convert file systems created with the default 4K block size. I recommend following this procedure:
After it gives you some warnings and asks you the all-important "are you sure" question, it converts it to gfs2.
WARNING: At this time, gfs2 is still being worked on, so you should not use it for a production cluster.
The first access after a GFS mount will be slower because GFS needs to read in the resource group index and resource groups (internal GFS data structures) from disk. Once they're in memory, subsequent access to the file system will be faster. This should only happen right after the file system is mounted.
It also takes additional time to read in from disk: (1) the inodes for the root directory, (2) the journal index, (3) the root directory entries and other internal data structures.
You should be aware of this when performance testing GFS. For example, if you want to test the performance of the "df" command, the first "df" after a mount will be a lot slower than subsequent "df" commands.
After a node fails, there is a certain amount of time during which cman waits for a heartbeat. When it doesn't get a heartbeat, it performs fencing and has to wait for the fencing agent to return a good return code, verifying that the node has indeed been fenced. While the node is being fenced, GFS is prevented from taking out new locks (existing locks remain valid, however, so some IO activity may still take place). After the fence succeeds, DLM has to do lock recovery (to reclaim the locks held by the fenced node) and GFS has to replay the fenced node's journals. An additional configuration setting, post_fail_delay, can delay things further. So GFS is delayed by three things:
This varies widely based on the type of fencing you're using. Some network power switches are fast; other agents, such as iLO, are slower.
This varies based on how much activity was happening on the file system. For example, if your application had thousands of locks taken, it will take longer to recover those locks than if your node were idle before the failure.
Again, this varies, based on the activity of the file system before the fence. If there was lots of writing, there might be lots of journal entries to recover, which would take longer than an idle node.
There's not much you can do about the time taken, other than to reduce post_fail_delay to 0 or buy a faster power switch.
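The post_fail_delay setting (along with its companion post_join_delay) lives on the fence_daemon tag in /etc/cluster/cluster.conf. A typical fragment, with illustrative values, looks like this:

```xml
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
```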
The GFS file system is like most other file systems with regard to applications, with one exception: It makes an application running on multiple nodes work as if they are multiple instances of the application running on a single node. GFS will maintain file system integrity when multiple nodes are accessing data on the same shared storage. However, the application is free to corrupt data within its own files unless it is cluster-aware.
For example, if you were to run multiple copies of the regular MySQL database on a single computer, you're going to get into trouble. That's because right now, MySQL doesn't do record-level locking on its database, and therefore a second instance would overwrite data from the first instance. Of course, there are safeguards within MySQL to prevent you from running two instances on a single computer. But if you ran MySQL from two clustered nodes on a GFS file system, it would be just like both instances are running on the same computer, except that there are no safeguards: Data corruption is likely. (Note, however, that there is a special version of MySQL that is more cluster friendly.)
The same holds true for other applications. If you can safely run multiple instances on the same computer, then you should be able to run multiple instances within your cluster safely on GFS.
If a GFS file system detects corruption due to an operation it has just performed, it will withdraw itself. Withdrawing from GFS is just slightly nicer than a kernel panic. It means that the node feels it can no longer operate safely on that file system because one of its assumptions has turned out to be wrong. Instead of panicking the kernel, it gives you an opportunity to reboot the node "nicely".
No. The withdrawn node should be rebooted.
Corruption in GFS is extremely rare and almost always indicates a hardware problem with your storage or SAN. The problem might be in the SAN itself, the motherboards, fibre channel cards (HBAs) or memory of the nodes, although that's still not guaranteed. Many things can cause data corruption, such as rogue machines that have access to the SAN that you're not aware of.
I recommend you:
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/mnt/backup/sanbackup
[root@node-01#] diff /dev/my_vg/lvol0 /mnt/backup/sanbackup
(assuming of course that
/dev/my_vg/lvol0 is the logical volume you have your GFS partition on, and
/mnt/backup/ is some scratch area big enough to hold that much data.)
The idea here is simply to test that reading from the SAN gives you
the same data twice. If that works successfully on one node, try it
on the other nodes.
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/tmp/sanbackup2 bs=1M count=4096
[root@node-01#] dd if=/dev/urandom of=/tmp/randomjunk bs=1M count=4096
[root@node-01#] dd if=/tmp/randomjunk of=/dev/my_vg/lvol0 bs=1M count=4096
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/tmp/junkverify bs=1M count=4096
[root@node-01#] diff /tmp/randomjunk /tmp/junkverify
[root@node-01#] dd if=/tmp/sanbackup2 of=/dev/my_vg/lvol0 bs=1M count=4096
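The write/read-back steps above can be wrapped in a small helper. This is a hypothetical sketch (the sanverify name is made up): it temporarily overwrites the first part of the device before restoring the backup, so run it only on an unmounted volume whose contents you can afford to risk.

```shell
# Hypothetical wrapper for the dd/diff round-trip test shown above.
# WARNING: temporarily overwrites the first COUNT MB of DEV, then
# restores it. Use only on an unmounted, expendable or backed-up volume.
sanverify() {
    DEV=$1
    COUNT=${2:-4096}
    TMP=$(mktemp -d) || return 1

    # Save the original data, generate a random test pattern, write it,
    # then read it back and compare.
    dd if="$DEV" of="$TMP/backup" bs=1M count="$COUNT" 2>/dev/null
    dd if=/dev/urandom of="$TMP/random" bs=1M count="$COUNT" iflag=fullblock 2>/dev/null
    dd if="$TMP/random" of="$DEV" bs=1M count="$COUNT" conv=notrunc 2>/dev/null
    dd if="$DEV" of="$TMP/verify" bs=1M count="$COUNT" 2>/dev/null

    if cmp -s "$TMP/random" "$TMP/verify"; then
        echo "readback matches: storage looks OK"
    else
        echo "READBACK MISMATCH: suspect hardware"
    fi

    # Restore the original contents and clean up.
    dd if="$TMP/backup" of="$DEV" bs=1M count="$COUNT" conv=notrunc 2>/dev/null
    rm -rf "$TMP"
}
```

As with the manual commands, if this reports a mismatch on one node, repeat it on the others to narrow down whether the problem follows a node or the SAN.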
Perhaps someone else (the SAN manufacturer?) can recommend hardware tests you can run to verify the data integrity.
I realize these kinds of tests take a long time to do, but if it's a hardware problem, you really need to know. If you know it's not hardware and can recreate this kind of corruption with some kind of test using GFS, please let us know how and open a bugzilla.
It depends on which version of the code you are running. Basically, the cluster manager (cman) is the component of the cluster project that handles communications between nodes in the cluster.
In the latest cluster code, cman is just a userland program that interfaces with the OpenAIS membership and messaging system.
In the previous versions, cman was a kernel module whose job was to keep a "heartbeat" message moving throughout the cluster, letting all the nodes know that the others are alive.
It also handles cluster membership messages, determining when a node enters or leaves the cluster.
Quorum is a voting algorithm used by the cluster manager.
A cluster can only function correctly if there is general agreement between the members about things. We say a cluster has 'quorum' if a majority of nodes are alive, communicating, and agree on the active cluster members. So in a thirteen-node cluster, quorum is only reached if seven or more nodes are communicating. If the seventh node dies, the cluster loses quorum and can no longer function.
It's necessary for a cluster to maintain quorum to prevent 'split-brain' problems. If we didn't enforce quorum, a communication error on that same thirteen-node cluster might cause a situation where six nodes are operating on the shared disk while another six operate on it independently. Because of the communication error, the two partial clusters would overwrite areas of the disk and corrupt the file system. With quorum rules enforced, only one of the partial clusters can use the shared storage, thus protecting data integrity.
Quorum doesn't prevent split-brain situations, but it does decide who is dominant and allowed to function in the cluster. Should split-brain occur, quorum prevents more than one cluster group from doing anything.
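The majority rule described above can be sketched as a tiny function. This is only an illustration of the arithmetic, not the actual cman code: a partition is quorate when its votes form a strict majority of the expected votes.

```shell
# Hypothetical illustration of the quorum arithmetic (not cman's code):
# a partition is quorate when its votes are a strict majority of the
# expected votes for the whole cluster.
has_quorum() {
    votes=$1
    expected=$2
    [ $((votes * 2)) -gt "$expected" ]
}

# Thirteen one-vote nodes: seven alive is a majority, six is not.
has_quorum 7 13 && echo "7 of 13: quorate"
has_quorum 6 13 || echo "6 of 13: inquorate"
```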
We had to allow two-node clusters, so we made a special exception to the quorum rules. There is a special setting "two_node" in the /etc/cluster.conf file that looks like this:
<cman expected_votes="1" two_node="1"/>
This will allow one node to be considered enough to establish a quorum. Note that if you configure a quorum disk/partition, you don't want two_node="1".
Tie-breakers are additional heuristics that allow a cluster partition to decide whether or not it is quorate in the event of an even-split - prior to fencing. A typical tie-breaker construct is an IP tie-breaker, sometimes called a ping node. With such a tie-breaker, nodes not only monitor each other, but also an upstream router that is on the same path as cluster communications. If the two nodes lose contact with each other, the one that wins is the one that can still ping the upstream router. Of course, there are cases, such as a switch-loop, where it is possible for two nodes to see the upstream router - but not each other - causing what is called a split brain. This is why fencing is required in cases where tie-breakers are used.
Other types of tie-breakers include disk tie-breakers where a shared partition, often called a quorum disk, provides additional details. clumanager 1.2.x (Red Hat Cluster Suite 3) had a disk tie-breaker that allowed safe split brain operation if the network went down as long as both nodes were still communicating over the shared partition.
More complex tie-breaker schemes exist, such as QDisk (part of linux-cluster). QDisk allows arbitrary heuristics to be specified. These allow each node to determine its own fitness for participation in the cluster. It is often used as a simple IP tie-breaker, however. See the qdisk(5) manual page for more information.
CMAN has no internal tie-breakers for various reasons. However, tie-breakers can be implemented using the libcman API. This API allows quorum device registration and updating. For an example, look at the QDisk source code.
You might need a tie-breaker if you:
They do. When each node recognizes that the other has stopped responding, it will try to fence the other. It can be like a gunfight at the O.K. Corral, and the node that's quickest on the draw (first to fence the other) wins. Unfortunately, both nodes can end up going down simultaneously, losing the whole cluster.
It's possible to avoid this by using a network power switch that serializes the two fencing operations. That ensures that one node is rebooted and the second never fences the first. For other configurations, see below.
In a two-node cluster (where you are using two_node="1" in the cluster configuration, and without QDisk), there are several considerations you need to be aware of:
If you can not meet the above requirements, you can use quorum disk or partition.
The two_node cluster.conf option allows one node to have quorum by itself. A network partition between the nodes won't result in a corrupt fs because each node will try to fence the other when it comes up prior to mounting gfs.
Unfortunately, if you have a persistent network problem and the fencing device is still accessible to both nodes, this can result in an "A reboots B, B reboots A" fencing loop.
This problem can be worked around by using a quorum disk or partition to break the tie, or using a specific network & fencing configuration.
It's still possible to write to a GFS volume without quorum, but ONLY if the three nodes that left the cluster didn't have the GFS volume mounted. It's not a problem because if a partitioned cluster ever forms and gains quorum, it will fence the nodes in the inquorate partition before doing anything.
If, on the other hand, nodes failed while they had gfs mounted and quorum was lost, then gfs activity on the remaining nodes will be mostly blocked. If it's not then it may be a bug.
You can't mix RHEL4 U1 and U2 systems in a cluster because there were changes between U1 and U2 that changed the format of internal messages that are sent around the cluster.
Since U2, we now require these messages to be backward-compatible, so mixing U2 and U3 or U3 and U4 shouldn't be a problem.
Unfortunately, two-node clusters are a special case. A two-node cluster needs two nodes to establish quorum, but only one node to maintain quorum. This special status is set by a special "two_node" option in the cman section of cluster.conf. Unfortunately, this setting can only be reset by shutting down the cluster. Therefore, the only way to add a third node is to:
The system-config-cluster gui gets rid of the two_node option automatically when you add a third node. Also, note that this does not apply to two-node clusters with a quorum disk/partition. If you have a quorum disk/partition defined, you don't want to use the two_node option to begin with.
Adding subsequent nodes to a three-or-more node cluster is easy and the cluster does not need to be stopped to do it.
You're supposed to stop the node before removing it from the cluster.conf.
Here's the procedure:
Halting a single node in the cluster will seem like a communication failure to the other nodes. Errors will be logged and the fencing code will get called, etc. So there's a procedure for properly shutting down a cluster. Here's what you should do:
Use the "cman_tool leave remove" command before shutting down each node. That will force the remaining nodes to adjust quorum to accommodate the missing node and not treat it as an error.
Additional info: When I try to start cman, I see these messages in /var/log/messages:
Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5
Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate
Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing connections.
Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect...
I see these messages in dmesg:
CMAN: forming a new cluster
CMAN: quorum regained, resuming activity
CMAN: sendmsg failed: -13
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -13
CMAN: we are leaving the cluster.
CMAN: Waiting to join or form a Linux-cluster
CMAN: sendmsg failed: -13
This is almost always caused by a mismatch between the kernel and user space CMAN code. Update the CMAN user tools to fix the problem.
No, it's not true. There is only one special case: two node clusters have special rules for determining quorum. See question 3 above.
A quorum disk or partition is a section of a disk that's set up for use with components of the cluster project. It has a couple of purposes. Again, I'll explain with an example.
Suppose you have nodes A and B, and node A fails to get several of cluster manager's "heartbeat" packets from node B. Node A doesn't know why it hasn't received the packets, but there are several possibilities: node B has failed, the network switch or hub has failed, node A's network adapter has failed, or node B was simply too busy to send the packet. That can happen if your cluster is extremely large, your systems are extremely busy, or your network is flaky.
Node A doesn't know which is the case, and it doesn't know whether the problem lies within itself or with node B. This is especially problematic in a two-node cluster because both nodes, out of touch with one another, can try to fence the other.
So before fencing a node, it would be nice to have another way to check if the other node is really alive, even though we can't seem to contact it. A quorum disk gives you the ability to do just that. Before fencing a node that's out of touch, the cluster software can check whether the node is still alive based on whether it has written data to the quorum partition.
In the case of two-node systems, the quorum disk also acts as a tie-breaker. If a node has access to the quorum disk and the network, that counts as two votes.
A node that has lost contact with the network or the quorum disk has lost a vote, and therefore may safely be fenced.
In older versions of the Cluster Project, a quorum disk was needed to break ties in a two-node cluster. Early versions of Red Hat Enterprise Linux 4 (RHEL4) did not have quorum disks, but it was added back as an optional feature in RHEL4U4.
In RHCS 4 update 4 and beyond, see the man page for qdisk for more information. As of September 2006, you need to edit your configuration file by hand to add quorum disk support. The system-config-cluster gui does not currently support adding or editing quorum disk properties.
Whether or not a quorum disk is needed is up to you. It is possible to configure a two-node cluster in such a manner that no tie-breaker (or quorum disk) is required. Here are some reasons you might want/need a quorum disk:
The best way to start is to do "man qdisk" and read the qdisk.5 man page. This has good information about the setup of quorum disks.
Note that if you configure a quorum disk/partition, you don't want two_node="1" or expected_votes="2" since the quorum disk solves the voting imbalance. You want two_node="0" and expected_votes="3" (or nodes + 1 if it's not a two-node cluster). However, since 0 is the default value for two_node, you don't need to specify it at all. If this is an existing two-node cluster and you're changing the two_node value from "1" to "0", you'll have to stop the entire cluster and restart it after the configuration is changed (normally, the cluster doesn't have to be stopped and restarted for configuration changes, but two_node is a special case.) Basically, you want something like this in your /etc/cluster/cluster.conf:
<cman two_node="0" expected_votes="3" .../>
<clusternodes>
<clusternode name="node1" votes="1" .../>
<clusternode name="node2" votes="1" .../>
</clusternodes>
<quorumd device="/dev/mapper/lun01" votes="1"/>
Note: You don't have to use a disk or partition to prevent two-node fence-cycles; you can also set your cluster up this way. You can set up a number of different heuristics for the qdisk daemon. For example, you can set up a redundant NIC with a crossover cable and use ping operations to the local router/switch to break the tie (this is typical, actually, and is called an IP tie breaker). A heuristic can be made to check anything, as long as it is a shared resource.
Currently, yes. There have been suggestions to make qdiskd operate in a 'diskless' mode in order to help prevent a fence-race (i.e. prevent a node from attempting to fence another node), but no work has been done in this area (yet).
Yes. If the quorum disk is registered correctly with cman, you should see the votes it contributes, and also its "node name", in the output of cman_tool nodes.
The official answer is 10MB. The real number is something like 100KB, but we'd like to reserve 10MB for possible future expansion and features.
Currently a quorum disk/partition may be used in clusters of up to 16 nodes.
First of all, no, they don't cause split-brain. As soon as heartbeat contact is lost, both nodes will realize something is wrong and lock GFS until it gets resolved and someone is fenced.
What actually happens depends on the configuration and the heuristics you build. The qdisk code allows you to build non-cluster heuristics to determine the fitness of each node beyond the heartbeat. With the heuristics in place, you can, for example, allow the node running a specific service to have priority over the other node. It's a way of saying "This node should win any tie" in case of a heartbeat failure. The winner fences the loser.
If both nodes still have a majority score according to their heuristics, then both nodes will try to fence each other, and the fastest node kills the other. Showdown at the Cluster Corral. The remaining node will have quorum along with the qdisk, and GFS will run normally under that node. When the "loser" reboots, unlike with a cman operation, it will not become quorate with just the quorum disk/partition, so it cannot cause split-brain that way either.
At this point (4-Apr-2007), if there are no heuristics defined whatsoever, the QDisk master node wins (and fences the non-master node). [This functionality will appear in Update 5 of Red Hat Cluster Suite for Red Hat Enterprise Linux 4, but is already available in CVS]
This may not be a good idea in most cases because of the dangers of split-brain, but there is a way to do it: you can set the "votes" for the quorum disk equal to the number of nodes in the cluster, minus one.
For example, if you have a four-node cluster, you can set the quorum disk votes to 3, and expected_votes to 7. That way, even if three of the four nodes die, the remaining node may still function. That's because the quorum disk's 3 votes plus the remaining node's 1 vote makes a total of 4 votes out of 7, which is enough to establish quorum. Additionally, all of the nodes can be online - but not the qdiskd (which you might need to take down for maintenance or reconfiguration).
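Mirroring the earlier cluster.conf fragment, the four-node example might look something like this sketch (the quorumd device name is reused from the earlier example; adjust it for your own storage):

```xml
<cman two_node="0" expected_votes="7" .../>
<clusternodes>
<clusternode name="node1" votes="1" .../>
<clusternode name="node2" votes="1" .../>
<clusternode name="node3" votes="1" .../>
<clusternode name="node4" votes="1" .../>
</clusternodes>
<quorumd device="/dev/mapper/lun01" votes="3"/>
```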
One or more of the nodes in your cluster is rejecting the membership of this node. Check the syslog (/var/log/messages) on all remaining nodes in the cluster for messages regarding why the membership was rejected.
This message will only appear when another node is rejecting the node in question and it WILL tell syslog (/var/log/messages) why unless you have kernel logging switched off for some reason. There are several reasons your node may be rejected:
Something else you might like to try is changing the port number that this cluster is using, or changing the cluster name to something totally different.
If you find that things work after doing this then you can be sure there is another cluster with that name or number on the network. If not, then you need to double/triple check that the config files really do all match on all nodes.
I've seen this message happen when I've accidentally done something like this:
Guess what? None of the nodes come up in a cluster. Can you guess why?
It's because node E still thinks it's part of the cluster and still has a claim on the cluster name. You still need to shut down the cluster software on E, or else reboot it before the correct nodes can form a cluster.
No, this isn't a problem and can be ignored. Some nodes may report [1 2 3 4 5] while others report a different order, like [4 3 5 2 1]. This merely has to do with the order in which cman join messages are received.
This message indicates that you tried to leave the cluster from a node that still has active cluster resources, such as mounted GFS file systems.
A node cannot leave the cluster if there are subsystems (e.g. DLM, GFS, rgmanager) active. You should unmount all GFS filesystems, stop the rgmanager service, stop the clvmd service, stop fenced and anything else using the cluster manager before using cman_tool leave. You can use cman_tool status and cman_tool services to see how many (and which) services are running.
Although this may be an over-simplification, you can think of the services as a big membership roster for different special interest groups or clubs. Each "service-name" pair corresponds to access to a unique resource, and each node corresponds to a voting member in the club.
So let's weave an inane piece of fiction around this concept: let's pretend that a journalist named Sam wants to write an article for her newspaper, "The National Conspiracy Theorist." To write her article, she needs access to secret knowledge kept hidden for centuries by a secret society known only as "The Group." The only way she can become a member is to petition the existing members to join, and the decision must be unanimously in her favor. But The Group is so secretive, they don't even know each other's names; every member is assigned a unique ID number. Their only means of communication is through a chat room, and they won't even speak to you unless you're a member or you know how to become one.
So she logs into the chat room and joins the channel #default. In the chat room, she can see there are seven members of The Group. They're not listed in order, but they're all there.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5]
She finds a blog (called "cluster.conf") and reads from it that her own ID number is 8. So she sends them a message: "Node 8 wants to join the default group".
Secretly, the other members take attendance to make sure all the members are present and accounted for. Then they take a vote. If all of them vote yes, she's allowed into the group and she becomes the next member. Her ID number is added to the list of members.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5 8]
Now that she's a member of the Group, she is told that the secrets of the order are not given to ordinary newbies; they're kept in a locked space. They are stored in an office building owned by the order, that they oddly call "clvmd." Since she's a newbie, she has to petition the other members to get a key to the clvmd office building. After a similar vote, they agree to give her a key, and they keep track of everyone who has a key.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5 8]
DLM Lock Space: "clvmd" 7 3 run -
[7 6 1 2 3 4 5 8]
Eager to write her article, she drives to the clvmd office building, unlocks the door, and goes inside. She's heard rumors that the secrets are kept in a suite labeled "secret". She goes from room to room until she finds a door marked "secret." Then she discovers that the door is locked and her key doesn't fit. Again, she has to petition the others for a key. They tell her that there are actually two adjacent rooms inside the suite, the "DLM" room and the "GFS" room, each holding a different set of secrets.
Four of the members (3, 4, 6 and 7) never really cared what was in those rooms, so they never bothered to learn the grueling rituals, and consequently, they were never issued keys to the two secret rooms. So after months of training, Sam once again petitions the other members to join the "secret rooms" group. She writes "Node 8 wants to join the 'secret' DLM group" and sends it to the members who have a key: #1, #2 and #5. She sends them a similar message for the other room as well: "Node 8 wants to join the 'secret' GFS group". Having performed all the necessary rituals, they agree, and she's issued a duplicate key for both secret rooms.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 2 3 4 5 8]
DLM Lock Space: "clvmd" 7 3 run -
[7 6 1 2 3 4 5 8]
DLM Lock Space: "secret" 12 8 run -
[1 2 5 8]
GFS Mount Group: "secret" 13 9 run -
[1 2 5 8]
Then something shocking rocks the secret society: member 2 went into cardiac arrest and died on the operating table. Clearly, something must be done to recover the keys held by member 2. In order to secure the contents of both rooms, no one is allowed to touch the information in the secret rooms until they've verified member 2 was really dead and recovered his keys. The members decide to leave that task to the most senior member, member 7.
That night, when no one is watching, member 7 breaks into the morgue, verifies that #2 is really dead, and steals back the key from his pocket. Then #7 drives to the office building and returns all the secrets he had borrowed from the secret room. (They call it "recovery".) He also informs the other members that #2 is truly dead, and #2 is taken off the group membership lists. Relieved that their secrets are safe, the others are now allowed access to the secret rooms.
[root@roth-02 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[7 6 1 3 4 5 8]
DLM Lock Space: "clvmd" 7 3 run -
[7 6 1 3 4 5 8]
DLM Lock Space: "secret" 12 8 run -
[1 5 8]
GFS Mount Group: "secret" 13 9 run -
[1 5 8]
You get the picture...Each of these "services" keeps a list of members who are allowed access, and that's how the cluster software on each node knows which others to contact for locking purposes. Each GFS file system has two groups that are joined when the file system is mounted; one for GFS and one for DLM.
The "state" of each service corresponds to its status in the group: "run" means it's a normal member. There are also states corresponding to joining the group, leaving the group, recovering its locks, etc.
A node may leave the cluster for many reasons. Among them:
Just add hello_timer="value" to the cman section in your cluster.conf file. For example:
<cman hello_timer="5">
The default value is 5 seconds.
Just add deadnode_timeout="value" to the cman section in your cluster.conf file. For example:
<cman deadnode_timeout="21">
The default value is 21 seconds.
"Split brain" is a condition whereby two or more computers or groups of computers lose contact with one another but still act as if the cluster were intact. This is like having two governments trying to rule the same country. If multiple computers are allowed to write to the same file system without knowledge of what the other nodes are doing, it will quickly lead to data corruption and other serious problems.
Split-brain is prevented by enforcing quorum rules (which say that no group of nodes may operate unless they are in contact with a majority of all nodes) and fencing (which makes sure nodes outside of the quorum are prevented from interfering with the cluster).
There are several reasons for doing this. First, you may want the cman heartbeat messages on a dedicated network so that a heavily used network doesn't cause heartbeat messages to be missed (and nodes in your cluster to be fenced). Second, you may have security reasons for keeping these messages off of an Internet-facing network.
First, you want to configure your alternate NIC to have its own IP address, and the settings that go with that (subnet, etc).
Next, add an entry into /etc/hosts (on all nodes) for the ip address associated with the NIC you want to use. In this case, eth2. One way to do this is to append a suffix to the original host name. For example, if your node is "node-01" you could give it the name "node-01-p" (-p for private network). For example, your /etc/hosts file might look like this:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.0.0.1 node-01
192.168.0.1 node-01-p
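Then, in /etc/cluster/cluster.conf, use the private name for the node so that cman binds to the interface carrying that address. A sketch following the /etc/hosts example above:

```xml
<clusternodes>
<clusternode name="node-01-p" votes="1" .../>
</clusternodes>
```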
If you're using RHEL4.4 or above, or 5.1 or above, that's all you need to do. There is code in cman to look at all the active network interfaces on the node and find the one that corresponds to the entry in cluster.conf. Note that this only works on ipv4 interfaces.
By default, the older cluster infrastructure (RHEL4, STABLE and so on) uses broadcast. By default, the newer cluster infrastructure with openais (RHEL5, HEAD and so on) uses multicast. You can configure a RHEL4 cluster to use multicast rather than broadcast. However, you can't switch openais to use broadcast.
Yes, it is. If you configure the cluster to use multicast rather than broadcast (there is an option for this in system-config-cluster) then the nodes can be on different subnets.
Be careful that any switches and/or routers between the nodes are of good specification and are set to pass multicast traffic through.
Put something like this in your cluster.conf file:
<clusternode name="nd1">
<multicast addr="224.0.0.1" interface="eth0"/>
</clusternode>
There is currently a known problem with RHEL5 whereby system-config-cluster tries to access cman_tool at /usr/sbin/cman_tool (cman_tool currently resides in /sbin). We'll correct the problem, but in the meantime you can work around it by creating a symlink to /sbin/cman_tool in /usr/sbin. For example:
[root@node-01 ~]# ln -s /sbin/cman_tool /usr/sbin/cman_tool
If this is not your problem, read on:
Ordinarily, this message would mean that cman could not create the local socket in /var/run for communication with the cluster clients.
cman tries to create /var/run/cman_client and /var/run/cman_admin. Tools like cman_tool, groupd and ccsd talk to cman over these sockets. If they can't be created, you'll get this error.
Check that /var/run is writable and able to hold Unix domain sockets.
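As a quick check, a sketch like this probes whether a directory is writable at all. Run it against /var/run on the failing node; with no argument it probes a throwaway temporary directory, so it's safe to try anywhere (the probe file name is arbitrary, not the real cman socket):

```shell
#!/bin/sh
# Probe whether a directory is writable (a rough stand-in for checking
# that cman can create its sockets there). The directory argument is
# optional; the default is a throwaway temp directory.
dir=${1:-$(mktemp -d)}
if touch "$dir/cman_probe" 2>/dev/null; then
    echo "writable: $dir"
    rm -f "$dir/cman_probe"
else
    echo "NOT writable: $dir"
fi
```

On a healthy node, running it with /var/run as the argument (as root) should normally report the directory as writable.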
On Fedora 8 and other distributions where the core supports multiple architectures (ex: x86, x86_64), you must have a matched set of packages installed. A cman package for x86_64 will not work with an x86 (i386/i686) openais package, and vice-versa. To see if you have a mixed set, run:
WRONG:
[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
RIGHT:
[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
You need to use the same architecture as your kernel for running the userland parts of the cluster packages; on x86_64, this generally means you should only have the x86_64 versions of the cluster packages installed.
rpm -e cman.i386 openais.i386 rgmanager.i386 ...
yum install -y cman.x86_64 openais.x86_64 rgmanager.x86_64 ...
Note: If you were having trouble getting things up, there's a chance that an old aisexec process might be running on one of the nodes; make sure you kill it before trying to start again!
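A sketch of that check, assuming the daemon's process name is simply "aisexec":

```shell
#!/bin/sh
# Look for a leftover aisexec process and kill it before restarting
# the cluster software.
if pgrep -x aisexec >/dev/null 2>&1; then
    echo "stale aisexec found, killing it"
    pkill -x aisexec
else
    echo "no stale aisexec running"
fi
```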
Some Cisco switches do not support IP multicast in their default configuration. Since openais uses multicast for cluster communications, you may have to enable it in the switch in order to use the cluster software.
Before making any changes to your Cisco switches it is advisable to contact your Cisco TAC to ensure the changes will have no negative consequences in your network.
Please visit this page for more information: http://www.openais.org/doku.php?id=faq:cisco_switches
Please see: http://www.openais.org/
The cluster manager (cman) locking scheme uses kernel modules to communicate cluster status and changes between nodes. OpenAIS uses userspace programs to accomplish the same thing. Moving this function to userspace made more sense: it is easier to monitor and debug, a crash is non-fatal, and it meshes better with the communications layers of the operating system.
Fencing is the component of cluster project that cuts off access to a resource (hard disk, etc.) from a node in your cluster if it loses contact with the rest of the nodes in the cluster.
The most effective way to do this is commonly known as STONITH, an acronym for "Shoot The Other Node In The Head." In other words, it forces the system to power off or reboot. That might seem harsh to the uninitiated, but it's really a good thing: a node that is not cooperating with the rest of the cluster can seriously damage the data unless it's forced off. So by fencing an errant node, we're actually protecting the data.
Fencing is often accomplished with a network power switch, which is a power switch that can be controlled through the network. This is known as power fencing.
Fencing can also be accomplished by cutting off access to the resource, such as using SCSI reservations. This is known as fabric fencing.
This is constantly changing. Manufacturers come out with new models and new microcode all the time, forcing us to change our fence agents. Your best bet is to look at the source code in CVS and see if your device is mentioned:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/?cvsroot=cluster
We are looking into ways to improve this.
No. Fencing is absolutely required in all production environments. That's right. We do not support people using only watchdog timers anymore.
Manual fencing is absolutely not supported in any production environment, ever, under any circumstances.
Both do the job. Both methods guarantee the victim can't write to the file system, thereby ensuring file system integrity.
However, we recommend that customers use power-cycle fencing anyway, for a number of reasons, although there are cases where fabric-level fencing is useful. The common "fabric fencing" arguments go something like this:
"What if the node has a reproducible failure that keeps happening over and over if we reset it each time?" and "What if I have non-clustered, but mission-critical tasks running on the node, and it is evicted from the cluster but is not actually dead (say, the cluster software crashed)? Power-cycling the machine would kill the Mission Critical tasks running on it..."
However, once a node is fabric fenced, you need to reboot it before it can rejoin the cluster.
Killing fenced, or having it otherwise exit while the node is using GFS, isn't good: if the node then fails without fenced running, it won't be fenced. If fenced exits somehow, it can simply be restarted, which is what you should do if you find it's been killed. I don't think we can really prevent it from being intentionally killed, though.
The first step is to try fencing it from a command line that looks something like this:
/sbin/fence_ilo -a myilo -l login -p passwd -o off -v
Second, check the version of RIBCL you are using. You may want to consider upgrading your firmware. Also, you may want to scan bugzilla to see if there are any issues regarding your level of firmware.
A node can have multiple fence methods and each fence method can have multiple fence devices.
Multiple fence methods are set up for redundancy/insurance. For example, you may be using a baseboard management fencing method for a node in your cluster such as IPMI, or iLO, or RSA, or DRAC. All of these depend on a network connection. If this connection would fail, fencing could not occur, so as a backup fence method you could declare a second method of fencing that used a power switch or somesuch to fence the node. If the first method failed to fence the node, the second fence method would be employed.
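As a sketch, a cluster.conf fragment with a baseboard-management method backed up by a power-switch method might look like this (the device names, and the fencedevice entries they would refer to, are hypothetical):

```xml
<clusternode name="node-01" votes="1">
    <fence>
        <!-- Method 1: baseboard management (e.g. iLO) over the network -->
        <method name="1">
            <device name="node01-ilo"/>
        </method>
        <!-- Method 2: fallback power switch, tried only if method 1 fails -->
        <method name="2">
            <device name="pwr01" switch="1" port="1"/>
        </method>
    </fence>
</clusternode>
```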
Multiple fence devices per method are used, for example, if a node has dual power supplies and power fencing is the fence method of choice. If only one power supply were fenced, the node would not reboot - as the other power supply would keep it up and running. In this case you would want two fence devices in one method: one for power supply A and one for power supply B.
All fence devices within a fence method must succeed in order for the method to succeed.
If someone refers to fence "levels", they are the same thing as methods. The term "method" used to refer to "power" versus "fabric" fencing, but the technology has outgrown that while the config file has not. So the term "fencing level" might be more accurate, but we still refer to them as "fencing methods" because "method" is how you specify it in the config file.
There can be multiple causes for nodes that repeatedly get fenced, but the bottom line is that one of the nodes in your cluster isn't seeing enough "heartbeat" network messages from the node that's getting fenced.
Most of the time, these come down to flaky or faulty hardware, such as bad cables and bad ports on the network hub or switch.
Test your communications paths thoroughly without the cluster software running to make sure your hardware is okay.
If your network is busy, your cluster may decide it's not getting enough heartbeat packets, but that may be due to other activities that happen when a node joins a cluster. You may have to increase the post_join_delay setting in your cluster.conf. It's basically a grace period to give the node more time to join the cluster. For example:
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="600"/>
No. No. A thousand times no. Oh sure. You can use it. But don't complain when a node needs to be fenced and the cluster locks up, and services don't fail over.
Because we can't be responsible when this happens:
When a node can't talk to the rest of the cluster through its normal heartbeat packets, it will be fenced by another node.
If a GFS file system detects corruption due to an operation it has just performed, it will withdraw itself. Withdrawing from GFS is just slightly nicer than a kernel panic. It means that the node feels it can no longer operate safely on that file system because it found out that one of its assumptions is wrong. Instead of panicking the kernel, it gives you an opportunity to reboot the node "nicely".
You have to be careful when configuring fencing for redundant power supplies. If you configure it wrong, each power supply will be fenced separately while the other keeps the system up and running, so the node is never actually fenced. What you really want is for both power supplies to be shut off so the system is taken completely down. To do that, configure a set of two fencing devices inside a single fencing method.
If you're using dual power supplies, both of which are plugged into the same power switch, using ports 1 and 2, you can do something like this:
<clusternode name="node-01" votes="1">
<fence>
<method name="1">
<device name="pwr01" option="off" switch="1" port="1"/>
<device name="pwr01" option="off" switch="1" port="2"/>
<device name="pwr01" option="on" switch="1" port="1"/>
<device name="pwr01" option="on" switch="1" port="2"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.0.101"
login="admin" name="pwr01" passwd="XXXXXXXXXXX"/>
</fencedevices>
The intrinsic problem with this, of course, is that if your UPS fails or needs to be swapped out, your system will lose power to both power supplies and you have down time. This is unacceptable in a High Availability (HA) cluster. To solve that problem, you'd really want redundant power switches and UPSes for the dual power supplies.
For example, let's say you have two APC network power switches (pwr01 and pwr02), each of which runs on its own separate UPS and has its own unique IP address. Let's assume that the first power supply of node 1 is plugged into port 1 of pwr01, and the second power supply is plugged into port 1 of pwr02. That way, port 1 on both switches is reserved for node 1, port 2 for node 2, etc. In your cluster.conf you can do something like this:
<clusternode name="node-01" votes="1">
<fence>
<method name="1">
<device name="pwr01" option="off" switch="1" port="1"/>
<device name="pwr02" option="off" switch="1" port="1"/>
<device name="pwr01" option="on" switch="1" port="1"/>
<device name="pwr02" option="on" switch="1" port="1"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.0.101"
login="admin" name="pwr01" passwd="XXXXXXXXXXX"/>
<fencedevice agent="fence_apc" ipaddr="192.168.1.101"
login="admin" name="pwr02" passwd="XXXXXXXXXXX"/>
</fencedevices>
We have some. For WTI please visit this link:
http://people.redhat.com/lhh/wti_devices.html
[root@taft-04 ~]# pvcreate /dev/sdb1
  Physical volume "/dev/sdb1" successfully created
[root@taft-04 ~]# pvscan
  No matching physical volumes found
Filters can cause this to happen. pvscan respects the filters and scans everything, but if pvcreate finds the device you request immediately, it applies the filter only to the name given on the command line, i.e. it doesn't scan everything.
This can give a different result to the filter matching. Internally, lvm2 only knows about device numbers - major/minor. Names are just a means to finding the device number required. Device numbers can have multiple names in the file system and the rules for applying filters can give a different answer if only applied to a subset of names in the filesystem. But scanning everything every time is slow, so it takes short cuts - at the price of occasional inconsistency.
Try running "pvscan -vvvv | grep sdb" to make sure it's not filtered out.
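If the device is being filtered out, look at the filter line in /etc/lvm/lvm.conf. A hedged example that accepts /dev/sdb* and rejects everything else (adjust the patterns to your own devices):

```
# devices section of /etc/lvm/lvm.conf (example only)
filter = [ "a|^/dev/sdb.*|", "r|.*|" ]
```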
When I try to start clvmd, I get this message:
[root@node001 ~]# clvmd
clvmd could not connect to cluster manager
Consult syslog for more information
My syslog says: "clvmd: Unable to create lockspace for CLVM: No such file or directory"
Make sure that your dlm kernel module is loaded by using lsmod. If it isn't, do "modprobe dlm" to insert the module. Also, make sure the failing node can physically see the shared storage in /proc/partitions. I've seen some weird things like this happen when a cluster comes up but some of the nodes can't physically access the storage.
No you can't. Without some kind of cluster infrastructure, there's nothing to stop the computers attached to your shared storage from corrupting and overwriting each other's data. In fact, each of the nodes will let you use the 'ae' option and each will be convinced it has exclusive access.
There's a little-known "clustering" flag for volume groups that should be set on when a cluster uses a shared volume. If that bit is not set, you can see strange lvm problems on your cluster. For example, if you extend a volume with lvresize and gfs_grow, the other nodes in the cluster will not be informed of the resize, and will likely crash when they try to access the volume.
To check if the clustering flag is on for a volume group, use the "vgs" command and see if the "Attr" column shows a "c". If the attr column shows something like "wz--n-" the clustering flag is off for the volume group. If the "Attr" column shows something like "wz--nc" the clustering flag is on.
To set the clustering flag on, use this command:
vgchange -cy <volume group name>
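The last character of the attr string is what matters here. A small sketch of decoding it, using the sample attr values from the text above (on a live system you would feed it the output of `vgs --noheadings -o vg_attr <vg>` instead):

```shell
#!/bin/sh
# Decode the vgs "Attr" string: a trailing "c" means the clustered
# flag is on. Sample values are hard-coded for illustration.
for attr in "wz--n-" "wz--nc"; do
    case "$attr" in
        *c) echo "$attr: clustered" ;;
        *)  echo "$attr: not clustered" ;;
    esac
done
```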
A lock manager is a traffic cop who controls access to resources in the cluster, such as access to a GFS file system. You need it because without a lock manager, there would be no control over access to your shared storage, and the nodes in the cluster would corrupt each other's data.
The GFS file system was written to interface with different lock managers. Today, there are three lock managers:
It depends on how many nodes in the cluster and what you're going to be using your cluster for.
For almost all cases, the DLM protocol is preferred. It's more modern and more efficient.
The first thing to consider is the number of computers in your cluster. DLM has known problems when you have more than 32 nodes in your cluster. We're working to resolve those issues, but until then use GULM if you have more than 32 nodes.
The GULM locking manager, on the other hand, requires more machines. GULM requires three or more independent computers outside the cluster that act as lock servers. That means that the minimum GULM configuration is five computers: A two-node GULM cluster with three independent GULM lock servers. So if you've got fewer than five computers, you'll have to use DLM.
The second thing to consider is the software that will be accessing the storage. Right now, Oracle and Oracle RAC are only Oracle-certified to work with the GULM locking manager.
Oracle RAC should work just fine with DLM locking; it just won't be a configuration that has passed Oracle certification. That means you can still run Oracle RAC in a two-node cluster without the additional lock servers, but you'll have to use DLM, and Oracle won't support your configuration. (Red Hat still will.) If you have a problem with Oracle, you should be able to temporarily introduce three lock servers and switch to GULM long enough to get their tech support. But please make sure the problem is still there before contacting them.
We're in the process of phasing out the GULM locking manager for future development, such as Fedora Core 6 and Red Hat Enterprise Linux 5.
You specify the locking protocol when you make your file system with gfs_mkfs or mkfs.gfs2. For example:
gfs_mkfs -t smoke_cluster:my_gfs -p lock_dlm -j 3 /dev/bobs_vg/lvol0
mkfs.gfs2 -t bob_cluster2:bobs_gfs2 -p lock_gulm -j 5 /dev/bobs_vg/lvol1
It's easy to change the locking protocol for a GFS file system:
gfs_tool sb <device> proto <locking protocol>
For example:
gfs_tool sb /dev/bobs_vg/lvol0 proto lock_dlm
See the man page for gfs_tool for more information and the full range of options.
Absolutely. Check out the source tree from CVS or download the source files from sources.redhat.com. There's documentation in dlm/doc/ and also several example programs (several of which might do exactly what you are looking for) in dlm/tests/usertest/
Testing a lock without blocking is available in the normal locking API (flag LKF_NOQUEUE). The only way of receiving notification of a lock being released is to queue another lock that is incompatible with it - so that lock will be granted when the previous one is released. That's also how you would do it on VMS.
The GULM locking protocol will be supported for Red Hat Enterprise Linux 3 & 4, but we are dropping GULM after that and don't have any plans to support it in future software. In the future, users will be required to switch to DLM locking protocol, which is easy to do.
Yes. For future releases of RHEL, we will go through the Oracle certification process again, this time using the newer DLM locking protocol.
The node's locks should be freed up.
Yes. On RHEL4 and equivalent, do this command:
gfs_tool lockdump /mnt/bob
Unfortunately, the output won't make much sense, but some of the numbers correspond to inode numbers. Atix did a fairly good analysis of what these numbers mean, and you can find it here:
http://www.open-sharedroot.org/documentation/gfs-lockdump-analysis
Right now, there isn't a GFS2 equivalent, but we plan to add it. There's a bugzilla record to track the progress, and it includes a patch to add the functionality:
Yes you can, but only on a per-lockspace basis, so you have to choose a lock space to dump. What you need to do is to echo the lockspace name into /proc/cluster/dlm_locks, then dump that file to get the results. You can get the lockspace names with the "cman_tool services" command. For example:
# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2]
DLM Lock Space: "clvmd" 2 3 run -
[1 2]
# echo "clvmd" > /proc/cluster/dlm_locks
# cat /proc/cluster/dlm_locks
This shows locks held by clvmd. If you want to look at another lockspace just echo the other name into the /proc file and repeat.
Again, the output won't make much sense, but some of the numbers may correspond to inode numbers.
Yes you can.
This is an excellent description of a dlm and the general ideas/logic reflect very well our own dlm:
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
You can get this when you build the Cluster Project by hand (i.e. compiling it rather than installing it with RPMs, up2date, etc.) and something went wrong during the build.
The solution I've used to fix it is:
cd cluster; make uninstall; make distclean; ./configure; make install
(Assuming the cluster suite source resides in directory "cluster").
As long as the clusters have different names, you don't need to place them on separate subnets.
GNBD is a kernel module that lets you export any block device from one node to another. You don't need it for normal cluster operations, but you can do some cool things with it.
Run modprobe gnbd, then gnbd_import -i <server_name>. The GNBD export name must be unique within the cluster, so you cannot export a GNBD named "foo" from both serverA and serverB. You need to have both gnbd devices imported.
Yes. When you have Fibre Channel attached storage, you get a new sd* device for every path to the storage device. Multipath takes all these devices and makes one multipathed device that routes I/O over all of them. It works exactly the same with GNBD; it just takes a couple more steps to get the paths to appear (gnbd_export, gnbd_import).
The other important thing is that you have to specify the -U option when you export the gnbd device. Without this, the device cannot be multipathed. The -U option gives the device a UUID. If you are using SCSI devices, it should work fine.
For multipathing to work, you really do need two paths to the same actual physical device. Otherwise half your data will end up on one device, and half your data will end up on another device.
No. GNBD devices should work correctly by using defaults.
See the gnbd_export man page for more information.
You need to enable port 14567 (tcp).
In almost all cases, you should use -U (capital U) not -u. The -U option specifies a uuid callout command. You can specify -U<command> but if you don't specify a command, it uses a script that makes sure it deals with partitions correctly. Specifying -U with no command should work correctly for almost every type of shared storage device. If you specify -u and get it wrong with multipathing, you can cause data corruption.
The only real advantage of using GNBD is that it has built-in fencing. With iSCSI, you still need something to fence all the machines (unless your iSCSI target supports SCSI-3 persistent reservations). Theoretically, GNBD could run faster, since it doesn't need to do the work to imitate a SCSI device, but there's a lot of work that needs to be done for GNBD to reach its full speed potential. Since there isn't much active development of GNBD and iSCSI has an active community of developers, if iSCSI isn't already faster, it eventually will be. Using iSCSI also allows a much more seamless transition to a hardware shared-storage solution later on.
If you don't have any fencing hardware, and your iSCSI target doesn't support SCSI-3 persistent reservations, then you should probably go with GNBD. Otherwise it's up to you.
The rgmanager program manages cluster resources defined by the user. It allows you to define services for high-availability on your cluster. Basically, you can define cluster services, for example an NFS server, that is available to computers on the network (in or out of the cluster). Rgmanager monitors the services, and if a node fails, it will relocate the service to another node in the cluster. So if your NFS server fails, the service can be automatically moved to another node in the cluster and the NFS clients on the network probably won't even know it failed. They should continue running seamlessly without knowing or caring about the failure.
The rgmanager program is complex. A service monitor checks the services defined in the cluster to make sure they're running. A service may be configured to run on a subset of nodes in the cluster, which I call a service group. A cluster may have multiple service groups, so even if your cluster has lots of nodes, you can restrict each service to run on only the nodes you want. For example, you can define an NFS service to run on one group of nodes and an Apache httpd service to run on a different group of nodes. If a service fails, a script is called to automatically restart it. If a node fails, the service may be relocated to a different node in the service group.
/usr/share/cluster/*
Yes. The rgmanager is flexible enough to allow you to define your own services with their own scripts. We encourage you to share them so that others may use them as well.
Some people have found "active-active" vs. "active-passive" described differently in different places, so you may be wondering if you're using the terms correctly. If both nodes of a two-node cluster are running their own service, and that service has the ability to failover to the other node, does that make this an active-active cluster, or a doubly active-passive cluster?
Cluster Suite is an active-active cold failover cluster, though many services might not be. RHCS certainly can't make a service "active-active". For example, RHCS can not transform Oracle 10g CFC into a multiple instance Oracle 10g RAC database, or make ext3 into a file system that you can mount on multiple nodes safely. Nothing can do these things.
It's open to interpretation and linguistic changes over time, of course...
Historically, active-passive in the context of a failover cluster meant that only one node can serve *any* of the services at a time, because the underlying device topology or the way the cluster uses it requires it.
Examples:
(1) Device topology requirements: DRBD 0.7 or similar technologies (block-journal NBD, such as used by Steeleye Lifekeeper): only one node can have the shared device open read-write at any one time due to the way the design works (replication over network, in these cases).
(2) Cluster use restriction: SCSI reservations: only one node may talk to a given SCSI device because of the way SCSI reservations work. Requires multiple-initiator buses (IIRC), which get messy very quickly. Note that this might be considered a form of "fencing", but in a negative sense: The one node who has the reservation may access the data on that device.
Now, you can, for example, use the same GFS mount point to construct a multiple-NFS server on RHCS, because GFS does not have the limitation that ext3 does WRT one-node-at-a-time. You might call this service an 'active-active NFS service'... (In this case, there are multiple services which share resources, though - RHCS doesn't let you start the same service multiple times; I can elaborate on the 'why' of this if you would like).
Here's the thing with "active-active" services: most internally active-active services have internal clustering to begin with. Back to a previous example: Oracle 10g RAC probably will not benefit from something like RHCS managing instances at all, where a 10g infrastructure database in CFC configuration will benefit a great deal.
Manual intervention always overrides configured rules. If you want a service to start on a specific node, use:
clusvcadm -e < service > -n < node >
Not specifying is the same as "Start on the node I'm running clusvcadm on..."
With GFS, you mount the file system on all the nodes and keep /etc/exports in sync across the cluster; then you can move IPs around and the NFS clients should just "do the right thing." With ext3, you can't mount the file system on multiple nodes, and you can't have /etc/exports export a file system that's not mounted. So you have to make the whole thing a cluster service, ext3 mount point and all, so that the file system is mounted on only one node at a time, and then use the cluster to bring up the exports. Alternatively, you could just have the cluster start/stop nfsd after mounting the ext3 file system, but then you can only have one NFS daemon safely running in the cluster at a time (because you can't run two instances of nfsd).
In other words, what's the difference between:
<resources>
<clusterfs device="/dev/bob_vg/lvol0" force_unmount="0" fstype="gfs"
mountpoint="/mnt/bob" name="bobfs" options="acl"/>
<nfsexport name="NFSexports"/>
<nfsclient name="trin-16" options="rw" target="trin-16.lab.msp.redhat.com"/>
</resources>
<service autostart="1" domain="nfsdomain" name="nfssvc">
<ip address="10.15.84.250" monitor_link="1"/>
<clusterfs ref="bobfs">
<nfsexport name="bobfs">
<nfsclient ref="trin-16"/>
</nfsexport>
</clusterfs>
</service>
and...
<service autostart="1" domain="bobdmn" name="nfssvc">
<clusterfs device="/dev/bob_vg/lvol0" force_unmount="0" fsid="51084"
fstype="gfs" mountpoint="/mnt/bob" name="bobfs" options="acl"/>
<nfsexport name="NFSexports"/>
<nfsclient name="trin-16" options="rw" target="trin-16.lab.msp.redhat.com"/>
</service>
The difference is primarily architectural. Resources in the <resources> block can be used multiple times; resources declared inside a <service> block may only be used in that one place. You can also detach a resource from one service and reattach it to another service if it's in the <resources> block; if it was privately declared, you must recreate it. The global section is primarily for <nfsclient>, <clusterfs> and <nfsexport> resources.
Yes, you can. Starting with U3, you can have rgmanager log to a different place and at a different level by changing the cluster/rm tag. For example:
<rm log_facility="local4" log_level="7">
Then you can add "local4.* /var/log/foo" to your /etc/syslog.conf file to send daemon output to file foo.
Note: The default log level for rgmanager is 5 (LOG_NOTICE).
Channel bonding.
No, not really. The channel bonding driver is designed for this purpose. When used with a good, internally redundant switch that supports trunking, you end up with higher bandwidth and increased availability.
The rgmanager script 'depth' indicates how intensive the check is. There used to be 0, 10 and 20.
0 was "Is the IP still there?"
10 was "Can I ping it, and is the ethernet link up?"
20 used to be (but was removed) - "Attempt to ping the router."
There is a parent/child inheritance relationship with nfs exports. Your problem might be that you don't have your nfs client as a child of the nfs export. For example:
Wrong:
<service autostart="1" domain="nfs" name="nfs">
<fs device="/dev/nfsvol/lvol01" force_fsck="0" force_unmount="1" fsid="8508" fstype="ext3" mountpoint="/export" name="/export" options="" self_fence="1"/>
<nfsexport name="/export"/>
<nfsclient name="/export" options="rw" target="81.19.179.*"/>
<ip address="192.168.1.77" monitor_link="1"/>
</service>
Right:
<service name="nfstest" nfslock="1">
<fs ref="NFS Mount">
<nfsexport name="exports">
<nfsclient ref="world-rw"/>
</nfsexport>
</fs>
<ip address="192.168.1.77/22"/>
</service>
The GUI (system-config-cluster) will tell you where the services are running.
From the command line, the clustat command will tell you as well.
The interval is in the script for each service, in /usr/share/cluster/
It's easier to just change the script.sh file to use whatever value you want (<5 is not supported, though). Checking is per-resource-type, not per-service, because it takes more system time to check one resource type vs. another resource type.
That is, a check on a "script" might happen only every 30 seconds, while a check on an "ip" might happen every 10 seconds.
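The interval lives in the <action> entries of the resource agent's metadata. A hedged sketch of what that section might look like in /usr/share/cluster/script.sh (the values shown are illustrative, not the shipped defaults):

```xml
<actions>
    <!-- shallow status check (depth 0) every 30 seconds -->
    <action name="status" depth="0" interval="30"/>
</actions>
```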
The status checks are not supposed to consume system resources. Historically, people have done one of two things which generate support calls:
(a) they do not set a status check interval at all (why is my service not being checked?), or
(b) they set the status check interval to something way too low, like 10 seconds for an Oracle service (why is the cluster acting strange/running slowly?).
If the status check interval is lower than the actual amount of time it takes to check the status of a service, you end up with endless status-checking, which is a pure waste of resources.
A false start is a start where the first status check fails.
A restart occurs after a status check fails.
If either of those values are exceeded, the service is relocated rather than restarted locally.
Note: These values pertained only to clumanager and were phased out after RHEL3. These values don't exist for the current cluster suite (and it would be difficult to add them).
If you relocate one service by hand, the other one will not automatically follow. However, if the node running the two services fails, both services should be relocated to a failover node automatically.
WARNING: You should never reference the same ext3 file system from two services. Two services may reference the same GFS file system, but not the same ext3 file system.
There are a couple of possibilities. First, you could be a victim of the "resource scripts not returning 0 when they should" bug described in the next question. Otherwise, you might have a "resource collision" which is a little bit more complicated.
To determine if you have a resource collision, run this command:
# rg_test test /etc/cluster/cluster.conf
You have a resource collision if the output looks something like this:
Unique/primary not unique type clusterfs, name=WWWData
Error storing clusterfs resource
This can happen, for example, if you cut and paste a service section in your cluster.conf file and forget to change the name. For example, check out this invalid snippet:
<service autostart="1" domain="apache25" name="apache25">
<clusterfs device="/dev/emcpowerd1" force_unmount="0"
fsid="41106" fstype="gfs" mountpoint="/opt/www" name="WWWData" options="">
<script ref="vsftpd"/>
</clusterfs>
<clusterfs device="/dev/emcpowera1" force_unmount="0"
fsid="30342" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft" options="">
<script ref="apache start-stop"/>
</clusterfs>
</service>
<service autostart="1" domain="apache26" name="apache26">
<clusterfs device="/dev/emcpowerd1" force_unmount="0"
fsid="41107" fstype="gfs" mountpoint="/opt/www" name="WWWData" options="">
<script ref="vsftpd"/>
</clusterfs>
<clusterfs device="/dev/emcpowerb1" force_unmount="0"
fsid="30343" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft" options="">
<script ref="apache start-stop"/>
</clusterfs>
</service>
In the example above, the apache26 service has two resource collisions with the apache25 service:
1. The "WWWData" clusterfs resource appears in both services with the same name, the same device, and the same mount point. You should put this one in your <resources> block and pass it by reference.
2. The "WWWSoft" clusterfs resource uses the same name for two different devices. You need to rename one to something else to resolve the naming collision. The mount point is also the same, and that must be unique.
When rgmanager detects collisions between attributes of a resource type which are required to be unique across the resource type, it stops parsing that branch of the tree. So references to scripts in the apache26 service are largely ignored in the example above.
If the collisions are fixed, rgmanager should start the service.
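To illustrate the first fix, here is a hedged sketch (attribute values taken from the example above, structure simplified) that defines the shared GFS file system once in the <resources> block and passes it to both services by reference:

```xml
<rm>
    <resources>
        <!-- Defined exactly once; both services reference it by name -->
        <clusterfs device="/dev/emcpowerd1" force_unmount="0" fsid="41106"
                   fstype="gfs" mountpoint="/opt/www" name="WWWData" options=""/>
    </resources>
    <service autostart="1" domain="apache25" name="apache25">
        <clusterfs ref="WWWData"/>
    </service>
    <service autostart="1" domain="apache26" name="apache26">
        <clusterfs ref="WWWData"/>
    </service>
</rm>
```

Sharing one file system between two services this way is only legitimate for GFS; as noted earlier, two services must never reference the same ext3 file system.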
According to the LSB specification, an init script should always return 0 if everything is running correctly, and should return a non-zero code only if the service is not running, even if the service was stopped "cleanly." Unfortunately, many of the stock Red Hat init scripts have not adhered to this rule in various releases, and that causes these kinds of rgmanager symptoms. A lot of people just edit the init scripts by hand, but there are various patches available. For example, here's a patch to fix httpd in RHEL4:
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111998
For more information, see this bugzilla:
Here are some rules for service script writing:
Always return "0" if the status is non-fatal. Rgmanager follows the LSB + OCF RA API draft specifications. That means that 0 is "OK" from running a "status" operation and non-zero is "not OK". Some scripts do check-restarts of non-critical components from within the "status" check.
If you have a "recover" action in a resource agent (resource agent > plain script), rgmanager will try a "recover" operation first (and will fall back to full service restart if recovery fails). A "recover" action is by definition *not allowed* to interfere with other parts of the service. So, if a component X fails, and recovery succeeds, the rest of the service continues running uninterrupted.
None of our agents except the nfsclient have recover actions because they're all considered critical (file systems, IPs, etc.).
You could add a "critical" parameter to resource agents and always return 0 if "$OCF_RESKEY_critical" is not set in the script(s), or not allow recover actions if "critical" is set... etc.
A failover domain is an ordered subset of members to which a service may be bound. The following semantics govern how the different configuration options affect the behavior of a failover domain:
Preferred node or preferred member: This is a notion which is no longer present in rgmanager. In older versions, the preferred node was the member designated to run a given service if that member was online. In most cases, it was used with the "Relocate on Preferred Node Boot" service option (as it was generally thought to be useless without it!). In newer rgmanagers, this behavior can be emulated by specifying an unordered, unrestricted failover domain of exactly one member. There is no equivalent to the "Relocate on Preferred Node Boot" option in Cluster Manager 1.0.x.
Restricted domain: Services bound to the domain may only run on cluster members which are also members of the failover domain. If no members of the failover domain are available, the service is placed in the stopped state.
Unrestricted domain: Services bound to this domain may run on all cluster members, but will run on a member of the domain whenever one is available. This means that if a service is running outside of the domain and a member of the domain comes online, the service will migrate to that member.
Ordered domain: The order specified in the configuration dictates the order of preference of members within the domain; the highest-ranking online member of the domain will run the service. This means that if member A has a higher rank than member B and the service is running on B, the service will migrate to A when A transitions from offline to online.
Unordered domain: Members of the domain have no order of preference; any member may run the service. In an unordered domain, however, services will still always migrate to members of their failover domain whenever possible.
Ordering and restriction are flags and may be combined in any way (i.e. ordered+restricted, unordered+unrestricted, etc.). These combinations affect both where services start after initial quorum formation and which cluster members will take over a service when it fails.
You can have multiple nodes per ordered level in the failover domains with RHEL4 and RHEL5, but not with RHEL3.
Examples:
Given a cluster comprised of this set of members: {A, B, C, D, E, F, G}
Ordered, restricted failover domain {A, B, C}: With a quorum, service 'S' will always run on member 'A' whenever member 'A' is online. If all members of {A, B, C} are offline, the service will not run. If the service is running on 'C' and 'A' transitions online, the service will migrate to 'A'.
Unordered, restricted failover domain {A, B, C}: A service 'S' will only run if there is a quorum and at least one member of {A, B, C} is online. If another member of the domain transitions online, the service does not relocate.
Ordered, unrestricted failover domain {A, B, C}: A service 'S' will run whenever there is a quorum. If a member of the failover domain is online, the service will run on the highest-ordered online member. That is, if 'A' is online, the service will run on 'A'.
Unordered, unrestricted failover domain {A, B, C}: This is also called a "Set of Preferred Members". When one or more members of the failover domain are online, the service will run on a nonspecific online member of the failover domain. If another member of the failover domain transitions online, the service does not relocate.
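As a hedged illustration, an ordered, restricted domain like the first example above might be written in cluster.conf as follows (the domain name and node names are invented for this sketch):

```xml
<rm>
    <failoverdomains>
        <!-- Ordered + restricted: service S runs only on A, B, or C,
             preferring A (priority 1 is the highest rank). -->
        <failoverdomain name="prefer_A" ordered="1" restricted="1">
            <failoverdomainnode name="A" priority="1"/>
            <failoverdomainnode name="B" priority="2"/>
            <failoverdomainnode name="C" priority="3"/>
        </failoverdomain>
    </failoverdomains>
    <service autostart="1" domain="prefer_A" name="S"/>
</rm>
```

Dropping ordered="1" or restricted="1" (or both) from the <failoverdomain> tag yields the other three combinations described above.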
Try adding nfslock="1" to the <service> tag:
<service autostart="1" domain="nfsdomain" name="nfssvc" nfslock="1">
Also, don't forget to enable force-unmount of the file system:
<fs device="/dev/testing/test01" force_unmount="1" fstype="ext3" mountpoint="/test" name="test01" options="">
We've tried to optimize the failover time as much as we can, so there isn't much room for improvement. However, if you're not doing NFS as part of your service (e.g. with the "nfsexport/nfsclient setup"), you can delete the "sleep 10" in the /usr/share/cluster/ip.sh script. That will speed things up a bit.
This is usually caused by incorrect use of the 'path' attribute for the VM resource. The path attribute is like the environment path in a shell: it is a colon-separated list of directories, and is *NOT* a path to an individual file. Example of an exec search path:
PATH=/sbin:/bin:/usr/sbin:/usr/bin
Example 'path' as a vm attribute in cluster.conf:
<vm name="foo" path="/etc/xen" ... />
<vm name="foo2" path="/etc/xen:/usr/etc/xen" ... />
Example of an incorrect 'path' as a vm attribute in cluster.conf (assuming /etc/xen/foo is a Xen domain config file):
<vm name="foo" path="/etc/xen/foo" ... />
It's just an overactive XML validity checker. You should be fine ignoring this error.
Yes, but it's new and only available for RHEL5 and Fedora Core 6. It's called Conga. More information can be found here:
http://sourceware.org/cluster/conga/
Yes, but there isn't much documentation to support them.
When integrating into the latest Cluster Project (HEAD branch in CVS), use the cman api, dlm api and openais api. When integrating into the Cluster Project (RHEL4 or STABLE branches in CVS), use the Magma api. For GFS and GFS2 disk tools that require the file system NOT be mounted, use the libgfs and libgfs2 apis respectively.
Magma was the cluster API we used for RHEL4, with minimal documentation in magma/doc/magma.txt. It uses a plugin infrastructure to translate very simple APIs to cluster-specific APIs. For example, it allows rgmanager in the RHEL4 branch to operate almost identically when either CMAN+DLM are in use or GuLM is in use.
Due to the move towards Open AIS, which implements standards-based SAF AIS APIs, further development of the Magma API has ceased. Applications utilizing the Magma API should either be ported to the SAF AIS CLM+CLK APIs, or use the CMAN and DLM APIs. If you have written a plugin which implements the current set of Magma APIs for your infrastructure, you can submit it to the linux-cluster mailing list for inclusion in the RHEL4 / STABLE branches.
Load Balancing is a mechanism that tries to distribute the workload evenly throughout a cluster. For example, if your cluster does ftp serving, and 400 people all try to download the latest file you're serving up, your server may choke under the pressure of 400 requests. Load balancing can help you distribute that workload evenly throughout your cluster so that 400 FTP requests can be evenly distributed among twenty nodes with 20 requests each. Still, the clients only need to go to one FTP site to get the data.
This is achieved with LVS and Network Address Translation (NAT) routing which translates your world-viewable FTP address into any number of real IP addresses in your cluster.
You only need it if you need true active/active services with a distributed workload.
LVS stands for Linux Virtual Server. It uses a mechanism called Network Address Translation (NAT) to route requests from one IP address to another and that's what achieves true load-balancing. With LVS, you have a second "layer" of servers (called LVS routers) whose job is to equally distribute the requests. Only one router is active at a time; additional routers are needed to provide failover capabilities in case the first router fails.
With LVS, all requests come in to a central LVS server known as the Active Router. The router decides which server to give the request to, based on one of several selectable methods. A second router (called the backup router) monitors the network and takes over if the active router fails.
Piranha is a graphical LVS configuration tool. You only need it if you're planning to do Load Balancing.
http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/pt-lvs.html
Question: I want disaster recovery load balancing. In other words, I want two load balancers in two locations, 50 miles apart, on different subnets. For example, I want to have www.mycompany.com with a public-facing virtual ip address, with two servers: a primary server (192.168.0.5) and a failover server (172.31.0.5). I know this isn't best practice but it's still what I want. Is it possible?
Not with Cluster Suite alone. However, there's a concept called Global Server Load Balancing (GSLB). There are a few different GSLB solutions. For example, Foundry Networks sells a kit that will do this. For more information, see:
Disclaimer: The author has no first-hand knowledge of gslb, so this should NOT be considered an endorsement.