Commit | Line | Data |
---|---|---|
b5097c84 JEB |
1 | LVM device fault handling |
2 | ========================= | |
3 | ||
4 | Introduction | |
5 | ------------ | |
6 | This document is to serve as the definitive source for information | |
7 | regarding the policies and procedures surrounding device failures | |
8 | in LVM. It codifies LVM's responses to device failures as well as | |
9 | the responsibilities of administrators. | |
10 | ||
11 | Device failures can be permanent or transient. A permanent failure | |
12 | is one where a device becomes inaccessible and will never be | |
13 | revived. A transient failure is a failure that can be recovered | |
14 | from (e.g. a power failure, intermittent network outage, block | |
15 | relocation, etc.). The policies for handling both types of failures | |
16 | are described herein. | |
17 | ||
d0981401 JEB |
18 | Users need to be aware that there are two implementations of RAID1 in LVM. |
19 | The first is defined by the "mirror" segment type. The second is defined by | |
20 | the "raid1" segment type. The characteristics of each of these are defined | |
21 | in lvm.conf under 'mirror_segtype_default' - the configuration setting used to | |
22 | identify the default RAID1 implementation used for LVM operations. | |
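
As a rough sketch, the configured default can be inspected, and either implementation can be requested explicitly when creating a volume. The 'global/' section path, the volume group name 'vg0', the LV names, and the size are assumptions made for illustration. The 'run' helper is hypothetical: it only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show the configured default RAID1 implementation.
# The 'global/' section path is an assumption; check your lvm.conf.
show_default_segtype() {
    run lvmconfig global/mirror_segtype_default
}

# Create a 2-way RAID1 LV with an explicit segment type,
# so the result does not depend on the configured default.
# $1 = segment type ("mirror" or "raid1"), $2 = VG, $3 = LV name, $4 = size
create_raid1_lv() {
    run lvcreate --type "$1" -m 1 -L "$4" -n "$3" "$2"
}

show_default_segtype
create_raid1_lv raid1 vg0 lv_r1 1G
create_raid1_lv mirror vg0 lv_m1 1G
```

Naming the segment type on the command line keeps scripts predictable on hosts whose lvm.conf differs.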
23 | ||
b5097c84 JEB |
24 | Available Operations During a Device Failure |
25 | -------------------------------------------- | |
26 | When there is a device failure, LVM behaves somewhat differently because | |
27 | only a subset of the available devices will be found for the particular | |
28 | volume group. The number of operations available to the administrator | |
29 | is diminished. It is not possible to create new logical volumes while | |
30 | PVs cannot be accessed, for example. Operations that create, convert, or | |
31 | resize logical volumes are disallowed, such as: | |
32 | - lvcreate | |
33 | - lvresize | |
34 | - lvreduce | |
35 | - lvextend | |
36 | - lvconvert (unless '--repair' is used) | |
37 | Operations that activate, deactivate, remove, report, or repair logical | |
38 | volumes are allowed, such as: | |
39 | - lvremove | |
40 | - vgremove (will remove all LVs, but not the VG until consistent) | |
41 | - pvs | |
42 | - vgs | |
43 | - lvs | |
44 | - lvchange -a [yn] | |
45 | - vgchange -a [yn] | |
46 | Operations specific to the handling of failed devices are allowed and | |
47 | are as follows: | |
48 | ||
49 | - 'vgreduce --removemissing <VG>': This action is designed to remove | |
50 | the reference of a failed device from the LVM metadata stored on the | |
51 | remaining devices. If there are (portions of) logical volumes on the | |
52 | failed devices, the ability of the operation to proceed will depend | |
53 | on the type of logical volumes found. If an image (i.e. leg or side) | |
54 | of a mirror is located on the device, that image/leg of the mirror | |
55 | is eliminated along with the failed device. The result of such a | |
56 | mirror reduction could be a no-longer-redundant linear device. If | |
57 | a linear, stripe, or snapshot device is located on the failed device | |
58 | the command will not proceed without a '--force' option. The result | |
59 | of using the '--force' option is the entire removal and complete | |
d0981401 JEB |
60 | loss of the non-redundant logical volume. If an image or metadata area |
61 | of a RAID logical volume is on the failed device, the sub-LV affected is | |
62 | replaced with an error target device - appearing as <unknown> in 'lvs' | |
63 | output. RAID logical volumes cannot be completely repaired by vgreduce - | |
64 | 'lvconvert --repair' (listed below) must be used. Once this operation is | |
65 | complete on volume groups not containing RAID logical volumes, the volume | |
66 | group will again have a complete and consistent view of the devices it | |
67 | contains. Thus, all operations will be permitted - including creation, | |
68 | conversion, and resizing operations. It is currently the preferred method | |
69 | to call 'lvconvert --repair' on the individual logical volumes to repair | |
70 | them followed by 'vgreduce --removemissing' to extract the physical volume's | |
71 | representation in the volume group. | |
b5097c84 JEB |
72 | |
73 | - 'lvconvert --repair <VG/LV>': This action is designed specifically | |
d0981401 JEB |
74 | to operate on individual logical volumes. If, for example, a failed |
75 | device happened to contain the images of four distinct mirrors, it would | |
76 | be necessary to run 'lvconvert --repair' on each of them. The ultimate | |
77 | result is to leave the faulty device in the volume group, but have no logical | |
78 | volumes referencing it. (This allows for 'vgreduce --removemissing' to | |
79 | remove the physical volumes cleanly.) In addition to removing mirror or | |
80 | RAID images that reside on failed devices, 'lvconvert --repair' can also | |
81 | replace the failed device if there are spare devices available in the | |
82 | volume group. The user is prompted whether to simply remove the failed | |
83 | portions of the mirror or to also allocate a replacement, if run from the | |
84 | command-line. Optionally, the '--use-policies' flag can be specified which | |
85 | will cause the operation not to prompt the user, but instead respect | |
b5097c84 | 86 | the policies outlined in the LVM configuration file - usually, |
d0981401 JEB |
87 | /etc/lvm/lvm.conf. Once this operation is complete, the logical volumes |
88 | will be consistent. However, the volume group will still be inconsistent - | |
26a6c69a | 89 | due to the referenced-but-missing device/PV - and operations will still be |
aec5e573 | 90 | restricted to the aforementioned actions until either the device is |
b5097c84 JEB |
91 | restored or 'vgreduce --removemissing' is run. |
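
The preferred sequence described above (repair each affected logical volume, then remove the missing physical volume from the volume group) might be scripted roughly as follows. The names 'vg0', 'mirror_a', and 'mirror_b' are placeholders, and the 'run' helper is hypothetical: it only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Repair every affected LV, then drop the missing PV from the VG.
# $1 = VG name; remaining arguments = LVs that had images on the failed device
repair_then_reduce() {
    vg=$1
    shift
    for lv in "$@"; do
        # --use-policies suppresses the prompt and follows lvm.conf instead.
        run lvconvert --repair --use-policies "$vg/$lv" || return 1
    done
    # No LV references the failed device any more, so this can proceed cleanly.
    run vgreduce --removemissing "$vg"
}

repair_then_reduce vg0 mirror_a mirror_b
```

Stopping at the first failed repair keeps 'vgreduce --removemissing' from running while a logical volume still references the missing device.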
92 | ||
93 | Device Revival (transient failures): | |
94 | ------------------------------------ | |
95 | During a device failure, the above section describes what limitations | |
96 | a user can expect. However, if the device returns after a period of | |
97 | time, what to expect will depend on what has happened during the time | |
98 | period when the device was failed. If no automated actions (described | |
99 | below) or user actions were necessary or performed, then no change in | |
100 | operations or logical volume layout will occur. However, if an | |
101 | automated action or one of the aforementioned repair commands was | |
102 | manually run, the returning device will be perceived as having stale | |
103 | LVM metadata. In this case, the user can expect to see a warning | |
104 | concerning inconsistent metadata. The metadata on the returning | |
105 | device will be automatically replaced with the latest copy of the | |
106 | LVM metadata - restoring consistency. Note, while most LVM commands | |
107 | will automatically update the metadata on a restored device, the | |
108 | following possible exceptions exist: | |
109 | - pvs (when it does not read/update VG metadata) | |
110 | ||
111 | Automated Target Response to Failures: | |
112 | -------------------------------------- | |
d0981401 JEB |
113 | The only LVM target types (i.e. "personalities") that have an automated |
114 | response to failures are the mirror and RAID logical volumes. The other target | |
b5097c84 JEB |
115 | types (linear, stripe, snapshot, etc.) will simply propagate the failure. |
116 | [A snapshot becomes invalid if its underlying device fails, but the | |
117 | origin will remain valid - presuming the origin device has not failed.] | |
d0981401 JEB |
118 | |
119 | Starting with the "mirror" segment type, there are three types of errors that | |
120 | a mirror can suffer - read, write, and resynchronization errors. Each is | |
121 | described in depth below. | |
b5097c84 JEB |
122 | |
123 | Mirror read failures: | |
124 | If a mirror is 'in-sync' (i.e. all images have been initialized and | |
125 | are identical), a read failure will only produce a warning. Data is | |
126 | simply pulled from one of the other images and the fault is recorded. | |
127 | Sometimes - like in the case of bad block relocation - read errors can | |
128 | be recovered from by the storage hardware. Therefore, it is up to the | |
129 | user to decide whether to reconfigure the mirror and remove the device | |
130 | that caused the error. Managing the composition of a mirror is done with | |
131 | 'lvconvert' and removing a device from a volume group can be done with | |
132 | 'vgreduce'. | |
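
If the administrator decides to drop the device that produced the read error, the two steps could look like the sketch below. It assumes a 2-way mirror, so '-m 0' leaves a linear volume. The names 'vg0', 'lv_m1', and '/dev/sdb1' are placeholders, and the 'run' helper is hypothetical: it only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Remove the mirror image that lives on a suspect PV, then remove the PV.
# $1 = VG, $2 = LV (assumed to be a 2-way mirror), $3 = PV to drop
drop_mirror_image() {
    # Naming the PV tells lvconvert which image to remove.
    run lvconvert -m 0 "$1/$2" "$3" || return 1
    # The PV no longer holds any extents of this LV and can leave the VG,
    # provided no other LV uses it.
    run vgreduce "$1" "$3"
}

drop_mirror_image vg0 lv_m1 /dev/sdb1
```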
133 | ||
134 | If a mirror is not 'in-sync', a read failure will produce an I/O error. | |
135 | This error will propagate all the way up to the applications above the | |
136 | logical volume (e.g. the file system). No automatic intervention will | |
137 | take place in this case either. It is up to the user to decide what | |
aec5e573 | 138 | can be done/salvaged in this scenario. If the user is confident that the |
b5097c84 | 139 | images of the mirror are the same (or they are willing to simply attempt |
aec5e573 | 140 | to retrieve whatever data they can), 'lvconvert' can be used to eliminate |
b5097c84 JEB |
141 | the failed image and proceed. |
142 | ||
143 | Mirror resynchronization errors: | |
144 | A resynchronization error is one that occurs when trying to initialize | |
145 | all mirror images to be the same. It can happen due to a failure to | |
146 | read the primary image (the image considered to have the 'good' data), or | |
147 | due to a failure to write the secondary images. This type of failure | |
148 | only produces a warning, and it is up to the user to take action in this | |
149 | case. If the error is transient, the user can simply reactivate the | |
150 | mirrored logical volume to make another attempt at resynchronization. | |
151 | If attempts to finish resynchronization fail, 'lvconvert' can be used to | |
152 | remove the faulty device from the mirror. | |
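
For a transient error, the reactivation described above might be sketched as follows. The names 'vg0' and 'lv_m1' are placeholders, the logical volume is assumed not to be in use (deactivation would otherwise fail), and the 'run' helper is hypothetical: it only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Deactivate and reactivate a mirrored LV to retry resynchronization.
# $1 = VG, $2 = LV
retry_resync() {
    run lvchange -an "$1/$2" || return 1
    run lvchange -ay "$1/$2" || return 1
    # copy_percent reports how far synchronization has progressed.
    run lvs -o lv_name,copy_percent "$1/$2"
}

retry_resync vg0 lv_m1
```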
153 | ||
154 | TODO... | |
155 | Some sort of response to this type of error could be automated. | |
156 | Since this document is the definitive source for how to handle device | |
157 | failures, the process should be defined here. If the process is defined | |
158 | but not implemented, it should be noted as such. One idea might be to | |
159 | make a single attempt to suspend/resume the mirror in an attempt to | |
160 | redo the sync operation that failed. On the other hand, if there is | |
161 | a permanent failure, it may simply be best to wait for the user or the | |
162 | automated response that is sure to follow from a write failure. | |
163 | ...TODO | |
164 | ||
165 | Mirror write failures: | |
166 | When a write error occurs on a mirror constituent device, an attempt | |
167 | to handle the failure is automatically made. This is done by calling | |
168 | 'lvconvert --repair --use-policies'. The policies implied by this | |
169 | command are set in the LVM configuration file. They are: | |
170 | - mirror_log_fault_policy: This defines what action should be taken | |
171 | if the device containing the log fails. The available options are | |
172 | "remove" and "allocate". Either of these options will cause the | |
173 | faulty log device to be removed from the mirror. The "allocate" | |
174 | policy will attempt the further action of trying to replace the | |
175 | failed disk log by using space that might be available in the | |
176 | volume group. If the allocation fails (or the "remove" policy | |
177 | is specified), the mirror log will be maintained in memory. Should | |
178 | the machine be rebooted or the logical volume deactivated, a | |
179 | complete resynchronization of the mirror will be necessary upon | |
180 | the following activation - such is the nature of a mirror with a 'core' | |
181 | log. The default policy for handling log failures is "allocate". | |
182 | The service disruption incurred by replacing the failed log is | |
183 | negligible, while the benefits of having a persistent log are | |
184 | pronounced. | |
185 | - mirror_image_fault_policy: This defines what action should be taken | |
186 | if a device containing an image fails. Again, the available options | |
187 | are "remove" and "allocate". Both of these options will cause the | |
188 | faulty image device to be removed - adjusting the logical volume | |
189 | accordingly. For example, if one image of a 2-way mirror fails, the | |
190 | mirror will be converted to a linear device. If one image of a | |
191 | 3-way mirror fails, the mirror will be converted to a 2-way mirror. | |
192 | The "allocate" policy takes the further action of trying to replace | |
193 | the failed image using space that is available in the volume group. | |
aec5e573 | 194 | Replacing a failed mirror image will incur the cost of |
b5097c84 JEB |
195 | resynchronizing - degrading the performance of the mirror. The |
196 | default policy for handling an image failure is "remove". This | |
197 | allows the mirror to still function, but gives the administrator the | |
aec5e573 | 198 | choice of when to incur the extra performance costs of replacing |
b5097c84 JEB |
199 | the failed image. |
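
The effect of the two policies can be summarized with a small decision model. This is an illustration of the rules as described above, not LVM code: the function names and the yes/no "allocation succeeded" argument are invented for the example, and the fallback for a failed image allocation (the mirror simply stays reduced) is an assumption drawn from the description of the "remove" step.

```shell
# Illustrative model only; these functions are not part of LVM.

# Outcome of a mirror log failure.
# $1 = policy ("remove" or "allocate")
# $2 = "yes" if replacement space could be allocated in the VG
log_fault_outcome() {
    if [ "$1" = "allocate" ] && [ "$2" = "yes" ]; then
        echo "disk log replaced"
    else
        echo "core (in-memory) log; full resync on next activation"
    fi
}

# Outcome of a mirror image failure.
# $1 = policy, $2 = number of images before the failure
# $3 = "yes" if replacement space could be allocated in the VG
image_fault_outcome() {
    remaining=$(( $2 - 1 ))
    if [ "$1" = "allocate" ] && [ "$3" = "yes" ]; then
        echo "${2}-way mirror (replacement image resynchronizing)"
    elif [ "$remaining" -le 1 ]; then
        echo "linear"
    else
        echo "${remaining}-way mirror"
    fi
}

log_fault_outcome allocate yes
log_fault_outcome remove no
image_fault_outcome remove 2 no
image_fault_outcome remove 3 no
image_fault_outcome allocate 3 yes
```

With the default policies ("allocate" for the log, "remove" for images), a failed log is replaced when space allows, while a failed image shrinks the mirror until the administrator chooses to add a replacement.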
200 | ||
d0981401 JEB |
201 | RAID logical volume device failures are handled differently from the "mirror" |
202 | segment type. Discussion of this can be found in lvm2-raid.txt. |