LVM device fault handling
=========================

Introduction
------------
This document is to serve as the definitive source for information
regarding the policies and procedures surrounding device failures
in LVM.  It codifies LVM's responses to device failures as well as
the responsibilities of administrators.

Device failures can be permanent or transient.  A permanent failure
is one where a device becomes inaccessible and will never be
revived.  A transient failure is a failure that can be recovered
from (e.g. a power failure, intermittent network outage, block
relocation, etc).  The policies for handling both types of failures
are described herein.

Users need to be aware that there are two implementations of RAID1 in LVM.
The first is defined by the "mirror" segment type.  The second is defined by
the "raid1" segment type.  The characteristics of each of these are defined
in lvm.conf under 'mirror_segtype_default' - the configuration setting used to
identify the default RAID1 implementation used for LVM operations.

Available Operations During a Device Failure
--------------------------------------------
When there is a device failure, LVM behaves somewhat differently because
only a subset of the available devices will be found for the particular
volume group.  The number of operations available to the administrator
is diminished.  It is not possible to create new logical volumes while
PVs cannot be accessed, for example.  Operations that create, convert, or
resize logical volumes are disallowed, such as:
- lvcreate
- lvresize
- lvreduce
- lvextend
- lvconvert (unless '--repair' is used)
Operations that activate, deactivate, remove, report, or repair logical
volumes are allowed, such as:
- lvremove
- vgremove (will remove all LVs, but not the VG until consistent)
- pvs
- vgs
- lvs
- lvchange -a [yn]
- vgchange -a [yn]
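
The two lists above can be read as a simple lookup.  The following
Python sketch is only an illustrative model of that classification; it
is not how LVM itself decides what to permit, and the function name is
hypothetical.  Commands that do not appear in the lists are reported as
unknown rather than guessed at.

```python
# Illustrative model of the lists above; not LVM's own logic.
BLOCKED = {"lvcreate", "lvresize", "lvreduce", "lvextend"}
ALLOWED = {"lvremove", "vgremove", "pvs", "vgs", "lvs",
           "lvchange", "vgchange"}


def allowed_during_failure(argv):
    """Classify a command line while a PV is missing.

    Returns True (allowed), False (disallowed), or None when the
    command is not covered by the lists in this document.
    """
    cmd, args = argv[0], argv[1:]
    if cmd == "lvconvert":
        # lvconvert is only permitted in its repair form.
        return "--repair" in args
    if cmd == "vgreduce":
        # Only the failed-device form is described in this document.
        return True if "--removemissing" in args else None
    if cmd in BLOCKED:
        return False
    if cmd in ALLOWED:
        return True
    return None
```

For example, allowed_during_failure(["lvconvert", "--repair", "vg0/lv0"])
is True, while the same call without '--repair' is False.  The names
"vg0" and "lv0" are placeholders.
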
Operations specific to the handling of failed devices are allowed and
are as follows:

- 'vgreduce --removemissing <VG>':  This action is designed to remove
  the reference of a failed device from the LVM metadata stored on the
  remaining devices.  If there are (portions of) logical volumes on the
  failed devices, the ability of the operation to proceed will depend
  on the type of logical volumes found.  If an image (i.e. leg or side)
  of a mirror is located on the device, that image/leg of the mirror
  is eliminated along with the failed device.  The result of such a
  mirror reduction could be a no-longer-redundant linear device.  If
  a linear, stripe, or snapshot device is located on the failed device
  the command will not proceed without a '--force' option.  The result
  of using the '--force' option is the entire removal and complete
  loss of the non-redundant logical volume.  If an image or metadata area
  of a RAID logical volume is on the failed device, the sub-LV affected is
  replaced with an error target device - appearing as <unknown> in 'lvs'
  output.  RAID logical volumes cannot be completely repaired by vgreduce -
  'lvconvert --repair' (listed below) must be used.  Once this operation is
  complete on volume groups not containing RAID logical volumes, the volume
  group will again have a complete and consistent view of the devices it
  contains.  Thus, all operations will be permitted - including creation,
  conversion, and resizing operations.  The currently preferred method is
  to call 'lvconvert --repair' on the individual logical volumes to repair
  them, followed by 'vgreduce --removemissing' to remove the physical
  volume's representation from the volume group.

- 'lvconvert --repair <VG/LV>':  This action is designed specifically
  to operate on individual logical volumes.  If, for example, a failed
  device happened to contain the images of four distinct mirrors, it would
  be necessary to run 'lvconvert --repair' on each of them.  The ultimate
  result is to leave the faulty device in the volume group, but have no logical
  volumes referencing it.  (This allows for 'vgreduce --removemissing' to
  remove the physical volumes cleanly.)  In addition to removing mirror or
  RAID images that reside on failed devices, 'lvconvert --repair' can also
  replace the failed device if there are spare devices available in the
  volume group.  The user is prompted whether to simply remove the failed
  portions of the mirror or to also allocate a replacement, if run from the
  command-line.  Optionally, the '--use-policies' flag can be specified which
  will cause the operation not to prompt the user, but instead respect
  the policies outlined in the LVM configuration file - usually,
  /etc/lvm/lvm.conf.  Once this operation is complete, the logical volumes
  will be consistent.  However, the volume group will still be inconsistent -
  due to the referenced-but-missing device/PV - and operations will still be
  restricted to the aforementioned actions until either the device is
  restored or 'vgreduce --removemissing' is run.
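
The preferred ordering described above (repair each affected logical
volume, then remove the missing PV from the volume group) can be
sketched as a small Python helper that only builds the command lines and
does not execute anything.  The helper name and the VG/LV names are
hypothetical; the only options used are the ones named in this document.

```python
def repair_plan(vg, lvs, use_policies=False):
    """Return the command lines, in order, for the preferred repair flow.

    vg  -- volume group name
    lvs -- names of the logical volumes that had images on the failed PV
    """
    plan = []
    for lv in lvs:
        cmd = ["lvconvert", "--repair"]
        if use_policies:
            # Follow lvm.conf policies instead of prompting the user.
            cmd.append("--use-policies")
        cmd.append(f"{vg}/{lv}")
        plan.append(cmd)
    # Only once no LV references the failed PV is it dropped from the VG.
    plan.append(["vgreduce", "--removemissing", vg])
    return plan


if __name__ == "__main__":
    for cmd in repair_plan("vg0", ["mirror1", "mirror2"]):
        print(" ".join(cmd))
```

Keeping 'vgreduce --removemissing' as the final step mirrors the text:
after the per-LV repairs the volume group is still inconsistent until
the missing PV is removed or the device returns.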

Device Revival (transient failures):
------------------------------------
During a device failure, the above section describes what limitations
a user can expect.  However, if the device returns after a period of
time, what to expect will depend on what has happened during the time
period when the device was failed.  If no automated actions (described
below) or user actions were necessary or performed, then no change in
operations or logical volume layout will occur.  However, if an
automated action or one of the aforementioned repair commands was
manually run, the returning device will be perceived as having stale
LVM metadata.  In this case, the user can expect to see a warning
concerning inconsistent metadata.  The metadata on the returning
device will be automatically replaced with the latest copy of the
LVM metadata - restoring consistency.  Note that while most LVM commands
will automatically update the metadata on a restored device, the
following possible exceptions exist:
- pvs (when it does not read/update VG metadata)
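
As a rough illustration of "stale metadata", the Python sketch below
assumes each PV holds a metadata copy tagged with a sequence number that
increases with every metadata change.  That representation, the function
name, and the PV names are assumptions made for the example; the real
detection and rewrite logic in LVM is more involved.

```python
def refresh_stale_copies(copies):
    """Replace out-of-date metadata copies with the newest one.

    copies -- dict mapping PV name to (sequence_number, metadata_text)
    Returns (updated_copies, names_of_pvs_that_were_stale).
    """
    latest = max(copies.values(), key=lambda copy: copy[0])
    stale = sorted(pv for pv, copy in copies.items()
                   if copy[0] < latest[0])
    # Every PV ends up holding the newest copy.
    updated = {pv: latest for pv in copies}
    return updated, stale
```

A device that was absent while a repair command ran would show up here
with a lower sequence number than its peers and be listed as stale.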

Automated Target Response to Failures:
--------------------------------------
The only LVM target types (i.e. "personalities") that have an automated
response to failures are the mirror and RAID logical volumes.  The other target
types (linear, stripe, snapshot, etc) will simply propagate the failure.
[A snapshot becomes invalid if its underlying device fails, but the
origin will remain valid - presuming the origin device has not failed.]

Starting with the "mirror" segment type, there are three types of errors that
a mirror can suffer - read, write, and resynchronization errors.  Each is
described in depth below.

Mirror read failures:
If a mirror is 'in-sync' (i.e. all images have been initialized and
are identical), a read failure will only produce a warning.  Data is
simply pulled from one of the other images and the fault is recorded.
Sometimes - like in the case of bad block relocation - read errors can
be recovered from by the storage hardware.  Therefore, it is up to the
user to decide whether to reconfigure the mirror and remove the device
that caused the error.  Managing the composition of a mirror is done with
'lvconvert' and removing a device from a volume group can be done with
'vgreduce'.

If a mirror is not 'in-sync', a read failure will produce an I/O error.
This error will propagate all the way up to the applications above the
logical volume (e.g. the file system).  No automatic intervention will
take place in this case either.  It is up to the user to decide what
can be done/salvaged in this scenario.  If the user is confident that the
images of the mirror are the same (or they are willing to simply attempt
to retrieve whatever data they can), 'lvconvert' can be used to eliminate
the failed image and proceed.
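
The two read-failure cases above can be summarized in a short Python
sketch.  It is a model of the documented behavior only, with
hypothetical names; it does not reflect the kernel implementation.

```python
from collections import namedtuple

# Outcome of a read failure on one mirror image, as described above.
ReadOutcome = namedtuple(
    "ReadOutcome", ["io_error", "warning", "automatic_repair"])


def mirror_read_failure(in_sync):
    """Model what a read failure produces for a "mirror" segment type LV."""
    if in_sync:
        # Data is served from another image; only a warning is produced.
        return ReadOutcome(io_error=False, warning=True,
                           automatic_repair=False)
    # Not in-sync: the I/O error propagates up to the application.
    return ReadOutcome(io_error=True, warning=False,
                       automatic_repair=False)
```

In both cases automatic_repair is False, which reflects that any
reconfiguration of the mirror after a read failure is left to the user.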

Mirror resynchronization errors:
A resynchronization error is one that occurs when trying to initialize
all mirror images to be the same.  It can happen due to a failure to
read the primary image (the image considered to have the 'good' data), or
due to a failure to write the secondary images.  This type of failure
only produces a warning, and it is up to the user to take action in this
case.  If the error is transient, the user can simply reactivate the
mirrored logical volume to make another attempt at resynchronization.
If attempts to finish resynchronization fail, 'lvconvert' can be used to
remove the faulty device from the mirror.

TODO...
Some sort of response to this type of error could be automated.
Since this document is the definitive source for how to handle device
failures, the process should be defined here.  If the process is defined
but not implemented, it should be noted as such.  One idea might be to
make a single attempt to suspend/resume the mirror in an attempt to
redo the sync operation that failed.  On the other hand, if there is
a permanent failure, it may simply be best to wait for the user or the
automated response that is sure to follow from a write failure.
...TODO

Mirror write failures:
When a write error occurs on a mirror constituent device, an attempt
to handle the failure is automatically made.  This is done by calling
'lvconvert --repair --use-policies'.  The policies implied by this
command are set in the LVM configuration file.  They are:
- mirror_log_fault_policy:  This defines what action should be taken
  if the device containing the log fails.  The available options are
  "remove" and "allocate".  Either of these options will cause the
  faulty log device to be removed from the mirror.  The "allocate"
  policy will attempt the further action of trying to replace the
  failed disk log by using space that might be available in the
  volume group.  If the allocation fails (or the "remove" policy
  is specified), the mirror log will be maintained in memory.  Should
  the machine be rebooted or the logical volume deactivated, a
  complete resynchronization of the mirror will be necessary upon
  the following activation - such is the nature of a mirror with a 'core'
  log.  The default policy for handling log failures is "allocate".
  The service disruption incurred by replacing the failed log is
  negligible, while the benefits of having a persistent log are
  pronounced.
- mirror_image_fault_policy:  This defines what action should be taken
  if a device containing an image fails.  Again, the available options
  are "remove" and "allocate".  Both of these options will cause the
  faulty image device to be removed - adjusting the logical volume
  accordingly.  For example, if one image of a 2-way mirror fails, the
  mirror will be converted to a linear device.  If one image of a
  3-way mirror fails, the mirror will be converted to a 2-way mirror.
  The "allocate" policy takes the further action of trying to replace
  the failed image using space that is available in the volume group.
  Replacing a failed mirror image will incur the cost of
  resynchronizing - degrading the performance of the mirror.  The
  default policy for handling an image failure is "remove".  This
  allows the mirror to still function, but gives the administrator the
  choice of when to incur the extra performance costs of replacing
  the failed image.
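
The effect of the two policies can be modeled with a short Python
sketch.  The function names and the can_allocate flag (standing in for
"enough suitable free space exists in the volume group") are
assumptions made for illustration; this is not LVM code.

```python
POLICIES = ("remove", "allocate")


def apply_image_policy(images, policy, can_allocate):
    """Model the result of one image failing in an N-way mirror.

    Returns (layout, resync_needed).
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    remaining = images - 1            # the faulty image is always removed
    replaced = policy == "allocate" and can_allocate
    if replaced:
        remaining += 1                # the replacement must be resynchronized
    layout = "linear" if remaining == 1 else f"{remaining}-way mirror"
    return layout, replaced


def apply_log_policy(policy, can_allocate):
    """Model where the mirror log lives after the log device fails."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "allocate" and can_allocate:
        return "disk"
    # "remove", or a failed allocation: the log is kept in memory.
    return "core"
```

With the default image policy, apply_image_policy(2, "remove", True)
gives ("linear", False): the mirror keeps working without redundancy and
no resynchronization cost is incurred until the administrator chooses to
replace the image.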

RAID logical volume device failures are handled differently from the "mirror"
segment type.  Discussion of this can be found in lvm2-raid.txt.
d0981401 JEB	202	segment type. Discussion of this can be found in lvm2-raid.txt.