--- /dev/null
+LVM device fault handling
+=========================
+
+Introduction
+------------
+This document is to serve as the definitive source for information
+regarding the policies and procedures surrounding device failures
+in LVM. It codifies LVM's responses to device failures as well as
+the responsibilities of administrators.
+
+Device failures can be permanent or transient. A permanent failure
+is one where a device becomes inaccessible and will never be
+revived. A transient failure is a failure that can be recovered
+from (e.g. a power failure, intermittent network outage, block
+relocation, etc). The policies for handling both types of failures
+is described herein.
+
+Available Operations During a Device Failure
+--------------------------------------------
+When there is a device failure, LVM behaves somewhat differently because
+only a subset of the available devices will be found for the particular
+volume group. The number of operations available to the administrator
+is diminished. It is not possible to create new logical volumes while
+PVs cannot be accessed, for example. Operations that create, convert, or
+resize logical volumes are disallowed, such as:
+- lvcreate
+- lvresize
+- lvreduce
+- lvextend
+- lvconvert (unless '--repair' is used)
+Operations that activate, deactivate, remove, report, or repair logical
+volumes are allowed, such as:
+- lvremove
+- vgremove (will remove all LVs, but not the VG until consistent)
+- pvs
+- vgs
+- lvs
+- lvchange -a [yn]
+- vgchange -a [yn]
+Operations specific to the handling of failed devices are allowed and
+are as follows:
+
+- 'vgreduce --removemissing <VG>': This action is designed to remove
+ the reference of a failed device from the LVM metadata stored on the
+ remaining devices. If there are (portions of) logical volumes on the
+ failed devices, the ability of the operation to proceed will depend
+ on the type of logical volumes found. If an image (i.e leg or side)
+ of a mirror is located on the device, that image/leg of the mirror
+ is eliminated along with the failed device. The result of such a
+ mirror reduction could be a no-longer-redundant linear device. If
+ a linear, stripe, or snapshot device is located on the failed device
+ the command will not proceed without a '--force' option. The result
+ of using the '--force' option is the entire removal and complete
+ loss of the non-redundant logical volume. Once this operation is
+ complete, the volume group will again have a complete and consistent
+ view of the devices it contains. Thus, all operations will be
+ permitted - including creation, conversion, and resizing operations.
+
+- 'lvconvert --repair <VG/LV>': This action is designed specifically
+ to operate on mirrored logical volumes. It is used on logical volumes
+ individually and does not remove the faulty device from the volume
+ group. If, for example, a failed device happened to contain the
+ images of four distinct mirrors, it would be necessary to run
+ 'lvconvert --repair' on each of them. The ultimate result is to leave
+ the faulty device in the volume group, but have no logical volumes
+ referencing it. In addition to removing mirror images that reside
+ on failed devices, 'lvconvert --repair' can also replace the failed
+ device if there are spare devices available in the volume group. The
+ user is prompted whether to simply remove the failed portions of the
+ mirror or to also allocate a replacement, if run from the command-line.
+ Optionally, the '--use-policies' flag can be specified which will
+ cause the operation not to prompt the user, but instead respect
+ the policies outlined in the LVM configuration file - usually,
+ /etc/lvm/lvm.conf. Once this operation is complete, mirrored logical
+ volumes will be consistent and I/O will be allowed to continue.
+ However, the volume group will still be inconsistent - due to the
+ refernced-but-missing device/PV - and operations will still be
+ restricted to the aformentioned actions until either the device is
+ restored or 'vgreduce --removemissing' is run.
+
+Device Revival (transient failures):
+------------------------------------
+During a device failure, the above section describes what limitations
+a user can expect. However, if the device returns after a period of
+time, what to expect will depend on what has happened during the time
+period when the device was failed. If no automated actions (described
+below) or user actions were necessary or performed, then no change in
+operations or logical volume layout will occur. However, if an
+automated action or one of the aforementioned repair commands was
+manually run, the returning device will be perceived as having stale
+LVM metadata. In this case, the user can expect to see a warning
+concerning inconsistent metadata. The metadata on the returning
+device will be automatically replaced with the latest copy of the
+LVM metadata - restoring consistency. Note, while most LVM commands
+will automatically update the metadata on a restored devices, the
+following possible exceptions exist:
+- pvs (when it does not read/update VG metadata)
+
+Automated Target Response to Failures:
+--------------------------------------
+The only LVM target type (i.e. "personality") that has an automated
+response to failures is a mirrored logical volume. The other target
+types (linear, stripe, snapshot, etc) will simply propagate the failure.
+[A snapshot becomes invalid if its underlying device fails, but the
+origin will remain valid - presuming the origin device has not failed.]
+There are three types of errors that a mirror can suffer - read, write,
+and resynchronization errors. Each is described in depth below.
+
+Mirror read failures:
+If a mirror is 'in-sync' (i.e. all images have been initialized and
+are identical), a read failure will only produce a warning. Data is
+simply pulled from one of the other images and the fault is recorded.
+Sometimes - like in the case of bad block relocation - read errors can
+be recovered from by the storage hardware. Therefore, it is up to the
+user to decide whether to reconfigure the mirror and remove the device
+that caused the error. Managing the composition of a mirror is done with
+'lvconvert' and removing a device from a volume group can be done with
+'vgreduce'.
+
+If a mirror is not 'in-sync', a read failure will produce an I/O error.
+This error will propagate all the way up to the applications above the
+logical volume (e.g. the file system). No automatic intervention will
+take place in this case either. It is up to the user to decide what
+can be done/salvaged in this senario. If the user is confident that the
+images of the mirror are the same (or they are willing to simply attempt
+to retreive whatever data they can), 'lvconvert' can be used to eliminate
+the failed image and proceed.
+
+Mirror resynchronization errors:
+A resynchronization error is one that occurs when trying to initialize
+all mirror images to be the same. It can happen due to a failure to
+read the primary image (the image considered to have the 'good' data), or
+due to a failure to write the secondary images. This type of failure
+only produces a warning, and it is up to the user to take action in this
+case. If the error is transient, the user can simply reactivate the
+mirrored logical volume to make another attempt at resynchronization.
+If attempts to finish resynchronization fail, 'lvconvert' can be used to
+remove the faulty device from the mirror.
+
+TODO...
+Some sort of response to this type of error could be automated.
+Since this document is the definitive source for how to handle device
+failures, the process should be defined here. If the process is defined
+but not implemented, it should be noted as such. One idea might be to
+make a single attempt to suspend/resume the mirror in an attempt to
+redo the sync operation that failed. On the other hand, if there is
+a permanent failure, it may simply be best to wait for the user or the
+automated response that is sure to follow from a write failure.
+...TODO
+
+Mirror write failures:
+When a write error occurs on a mirror constituent device, an attempt
+to handle the failure is automatically made. This is done by calling
+'lvconvert --repair --use-policies'. The policies implied by this
+command are set in the LVM configuration file. They are:
+- mirror_log_fault_policy: This defines what action should be taken
+ if the device containing the log fails. The available options are
+ "remove" and "allocate". Either of these options will cause the
+ faulty log device to be removed from the mirror. The "allocate"
+ policy will attempt the further action of trying to replace the
+ failed disk log by using space that might be available in the
+ volume group. If the allocation fails (or the "remove" policy
+ is specified), the mirror log will be maintained in memory. Should
+ the machine be rebooted or the logical volume deactivated, a
+ complete resynchronization of the mirror will be necessary upon
+ the follow activation - such is the nature of a mirror with a 'core'
+ log. The default policy for handling log failures is "allocate".
+ The service disruption incurred by replacing the failed log is
+ negligible, while the benefits of having persistent log is
+ pronounced.
+- mirror_image_fault_policy: This defines what action should be taken
+ if a device containing an image fails. Again, the available options
+ are "remove" and "allocate". Both of these options will cause the
+ faulty image device to be removed - adjusting the logical volume
+ accordingly. For example, if one image of a 2-way mirror fails, the
+ mirror will be converted to a linear device. If one image of a
+ 3-way mirror fails, the mirror will be converted to a 2-way mirror.
+ The "allocate" policy takes the further action of trying to replace
+ the failed image using space that is available in the volume group.
+ Replacing a failed mirror image will incure the cost of
+ resynchronizing - degrading the performance of the mirror. The
+ default policy for handling an image failure is "remove". This
+ allows the mirror to still function, but gives the administrator the
+ choice of when to incure the extra performance costs of replacing
+ the failed image.
+
+TODO...
+The appropriate time to take permanent corrective action on a mirror
+should be driven by policy. There should be a directive that takes
+a time or percentage argument. Something like the following:
+- mirror_fault_policy_WHEN = "10sec"/"10%"
+A time value would signal the amount of time to wait for transient
+failures to resolve themselves. The percentage value would signal the
+amount a mirror could become out-of-sync before the faulty device is
+removed.
+
+A mirror cannot be used unless /some/ corrective action is taken,
+however. One option is to replace the failed mirror image with an
+error target, forgo the use of 'handle_errors', and simply let the
+out-of-sync regions accumulate and be tracked by the log. Mirrors
+that have more than 2 images would have to "stack" to perform the
+tracking, as each failed image would have to be associated with a
+log. If the failure is transient, the device would replace the
+error target that was holding its spot and the log that was tracking
+the deltas would be used to quickly restore the portions that changed.
+
+One unresolved issue with the above scheme is how to know which
+regions of the mirror are out-of-sync when a problem occurs. When
+a write failure occurs in the kernel, the log will contain those
+regions that are not in-sync. If the log is a disk log, that log
+could continue to be used to track differences. However, if the
+log was a core log - or if the log device failed at the same time
+as an image device - there would be no way to determine which
+regions are out-of-sync to begin with as we start to track the
+deltas for the failed image. I don't have a solution for this
+problem other than to only be able to handle errors in this way
+if conditions are right. These issues will have to be ironed out
+before proceeding. This could be another case, where it is better
+to handle failures in the kernel by allowing the kernel to store
+updates in various metadata areas.
+...TODO