=======================
= LVM RAID Design Doc =
=======================

#############################
# Chapter 1: User-Interface #
#############################

***************** CREATING A RAID DEVICE ******************

01: lvcreate --type <RAID type> \
02:          [--regionsize <size>] \
03:          [-i/--stripes <#>] [-I/--stripesize <size>] \
04:          [-m/--mirrors <#>] \
05:          [--[min|max]recoveryrate <kB/sec/disk>] \
06:          [--stripecache <size>] \
07:          [--writemostly <devices>] \
08:          [--maxwritebehind <size>] \
09:          [[no]sync] \
10:          <Other normal args, like: -L 5G -n lv vg> \
11:          [devices]

Line 01:
I don't intend for there to be shorthand options for specifying the
segment type. The available RAID types are:
    "raid0"    - Stripe [NOT IMPLEMENTED]
    "raid1"    - should replace DM Mirroring
    "raid10"   - striped mirrors, [NOT IMPLEMENTED]
    "raid4"    - RAID4
    "raid5"    - Same as "raid5_ls" (Same default as MD)
    "raid5_la" - RAID5 Rotating parity 0 with data continuation
    "raid5_ra" - RAID5 Rotating parity N with data continuation
    "raid5_ls" - RAID5 Rotating parity 0 with data restart
    "raid5_rs" - RAID5 Rotating parity N with data restart
    "raid6"    - Same as "raid6_zr"
    "raid6_zr" - RAID6 Rotating parity 0 with data restart
    "raid6_nr" - RAID6 Rotating parity N with data restart
    "raid6_nc" - RAID6 Rotating parity N with data continuation
The exception to 'no shorthand options' will be where the RAID implementations
can displace traditional targets. This is the case with 'mirror' and 'raid1'.
In this case, "mirror_segtype_default" - found under the "global" section in
lvm.conf - can be set to "mirror" or "raid1". The segment type inferred when
the '-m' option is used will be taken from this setting. The default segment
types can be overridden on the command line by using the '--type' argument.
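
The selection rule just described can be sketched as follows. This is an
illustrative Python fragment, not LVM code: the function name
'resolve_segtype' and its parameters are hypothetical, and returning None
when neither '--type' nor '-m' is given is an assumption made only to keep
the sketch complete.

```python
def resolve_segtype(type_arg, mirrors_given, mirror_segtype_default):
    """Pick the segment type for a command (hypothetical helper).

    type_arg               -- value of '--type', or None if not given
    mirrors_given          -- True if '-m/--mirrors' was used
    mirror_segtype_default -- lvm.conf global/mirror_segtype_default
    """
    if mirror_segtype_default not in ("mirror", "raid1"):
        raise ValueError("mirror_segtype_default must be 'mirror' or 'raid1'")
    if type_arg is not None:
        # An explicit '--type' always overrides the configured default.
        return type_arg
    if mirrors_given:
        # '-m' without '--type' infers the type from lvm.conf.
        return mirror_segtype_default
    # Assumption: nothing in this document applies, so no type is implied.
    return None
```

For example, with mirror_segtype_default set to "raid1", a bare '-m 1' would
resolve to "raid1", while '--type mirror -m 1' would still resolve to "mirror".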

Line 02:
Region size is relevant for all RAID types. It defines the granularity for
which the bitmap will track the active areas of disk. The default is currently
4MiB. I see no reason to change this unless it is a problem for MD performance.
MD does impose a restriction of 2^21 regions for a given device, however. This
means two things: 1) we should never need a metadata area larger than
8KiB+sizeof(superblock)+bitmap_offset (IOW, pretty small) and 2) the region
size will have to be upwardly revised if the device is larger than 8TiB
(assuming defaults).
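
The 8TiB figure follows from 4MiB * 2^21 regions. A minimal sketch of the
"upward revision" is shown below. It is not LVM code: the function name is
hypothetical, and doubling the region size (keeping it a power of two) is an
assumption about how the revision would be done.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4
MAX_REGIONS = 2 ** 21          # MD limit quoted in the text
DEFAULT_REGION_SIZE = 4 * MIB  # current default quoted in the text


def min_region_size(device_size, region_size=DEFAULT_REGION_SIZE):
    """Return the smallest region size (bytes) that covers device_size.

    Assumption: the region size is revised by doubling until the device
    fits within MAX_REGIONS regions.
    """
    while region_size * MAX_REGIONS < device_size:
        region_size *= 2
    return region_size
```

With the defaults, a device of exactly 8TiB keeps the 4MiB region size, while
anything larger is bumped to 8MiB or more.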

Line 03/04:
The '-m/--mirrors' option is only relevant to RAID1 and will be used just like
it is today for DM mirroring. For all other RAID types, -i/--stripes and
-I/--stripesize are relevant. The former will specify the number of data
devices that will be used for striping. For example, if the user specifies
'--type raid0 -i 3', then 3 devices are needed. If the user specifies
'--type raid6 -i 3', then 5 devices are needed. The -I/--stripesize may be
confusing to MD users, as they use the term "chunksize". I think they will
adapt without issue and I don't wish to create a conflict with the term
"chunksize" that we use for snapshots.

Line 05/06/07/08:
I'm still not clear on how to specify these options. Some are easier than
others. '--writemostly' is particularly hard because it involves specifying
which devices shall be 'write-mostly' and thus, also have 'max-write-behind'
applied to them. It has been suggested that a '--readmostly'/'--readfavored'
or similar option could be introduced as a way to specify a primary disk vs.
specifying all the non-primary disks via '--writemostly'. I like this idea,
but haven't come up with a good name yet. Thus, these will remain
unimplemented until future specification.

Line 09/10/11:
These are familiar.

Further creation related ideas:
Today, you can specify '--type mirror' without needing an '-m/--mirrors'
argument. The number of devices defaults to two (and the log defaults to
'disk'). A similar thing should happen with the RAID types. All of them
should default to having two data devices unless otherwise specified. This
would mean a total number of 2 devices for RAID 0/1, 3 devices for RAID 4/5,
and 4 devices for RAID 6/10.


***************** CONVERTING A RAID DEVICE ******************

01: lvconvert [--type <RAID type>] \
02:           [-R/--regionsize <size>] \
03:           [-i/--stripes <#>] [-I/--stripesize <size>] \
04:           [-m/--mirrors <#>] \
05:           [--merge] \
06:           [--splitmirrors <#> [--trackchanges]] \
07:           [--replace <sub_lv|device>] \
08:           [--[min|max]recoveryrate <kB/sec/disk>] \
09:           [--stripecache <size>] \
10:           [--writemostly <devices>] \
11:           [--maxwritebehind <size>] \
12:           vg/lv \
13:           [devices]

lvconvert should work exactly as it does now when dealing with mirrors -
even if (when) we switch to MD RAID1. Of course, there are no plans to
allow the presence of the metadata area to be configurable (e.g. --corelog).
It will be simple enough to detect if the LV being up/down-converted is
new or old-style mirroring.

If we choose to use MD RAID0 as well, it will be possible to change the
number of stripes and the stripesize. It is therefore conceivable to see
something like, 'lvconvert -i +1 vg/lv'.

Line 01:
It is possible to change the RAID type of an LV - even if that LV is already
a RAID device of a different type. For example, you could change from
RAID4 to RAID5 or RAID5 to RAID6.

Line 02/03/04:
These are familiar options - all of which would now be available as options
for change. (However, it'd be nice if we didn't have regionsize in there.
It's simple on the kernel side, but is just an extra - often unnecessary -
parameter to many functions in the LVM codebase.)

Line 05:
This option is used to merge an LV back into a RAID1 array - provided it was
split for temporary read-only use by '--splitmirrors 1 --trackchanges'.

Line 06:
The '--splitmirrors <#>' argument should be familiar from the "mirror" segment
type. It allows RAID1 images to be split from the array to form a new LV.
Either the original LV or the split LV - or both - could become a linear LV as
a result. If the '--trackchanges' argument is specified in addition to
'--splitmirrors', an LV will be split from the array. It will be read-only.
This operation does not change the original array - except that it uses an empty
slot to hold the position of the split LV which it expects to return in the
future (see the '--merge' argument). It tracks any changes that occur to the
array while the slot is kept in reserve. If the LV is merged back into the
array, only the changes are resync'ed to the returning image. Repeating the
'lvconvert' operation without the '--trackchanges' option will complete the
split of the LV permanently.

Line 07:
This option allows the user to specify a sub_lv (e.g. a mirror image) or
a particular device for replacement. The device (or all the devices in
the sub_lv) will be removed and replaced with different devices from the
VG.

Line 08/09/10/11:
It should be possible to alter these parameters of a RAID device. As with
lvcreate, however, I'm not entirely certain how to best define some of these.
We don't need all the capabilities at once though, so it isn't a pressing
issue.

Line 12:
The LV to operate on.

Line 13:
Devices that are to be used to satisfy the conversion request. If the
operation removes devices or splits a mirror, then the devices specified
form the list of candidates for removal. If the operation adds or replaces
devices, then the devices specified form the list of candidates for allocation.



###############################################
# Chapter 2: LVM RAID internal representation #
###############################################

The internal representation is somewhat like mirroring, but with alterations
for the different metadata components. LVM mirroring has a single log LV,
but RAID will have one for each data device. Because of this, I've added a
new 'areas' list to the 'struct lv_segment' - 'meta_areas'. There is exactly
a one-to-one relationship between 'areas' and 'meta_areas'. The 'areas' array
still holds the data sub-lv's (similar to mirroring), while the 'meta_areas'
array holds the metadata sub-lv's (akin to the mirroring log device).

The sub_lvs will be named '%s_rimage_%d' instead of '%s_mimage_%d' as it is
for mirroring, and '%s_rmeta_%d' instead of '%s_mlog'. Thus, you can imagine
an LV named 'foo' with the following layout:
foo
[foo's lv_segment]
|
|-> foo_rimage_0 (areas[0])
|   [foo_rimage_0's lv_segment]
|-> foo_rimage_1 (areas[1])
|   [foo_rimage_1's lv_segment]
|
|-> foo_rmeta_0 (meta_areas[0])
|   [foo_rmeta_0's lv_segment]
|-> foo_rmeta_1 (meta_areas[1])
|   [foo_rmeta_1's lv_segment]
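
The naming scheme and the one-to-one relationship between 'areas' and
'meta_areas' can be illustrated with a small Python sketch. It only models
the names; the function 'sub_lv_names' is hypothetical and is not part of LVM.

```python
def sub_lv_names(lv_name, device_count):
    """Return (areas, meta_areas) name lists for a RAID LV.

    areas[i] and meta_areas[i] always describe the same array slot,
    mirroring the one-to-one relationship described in the text.
    """
    areas = [f"{lv_name}_rimage_{i}" for i in range(device_count)]
    meta_areas = [f"{lv_name}_rmeta_{i}" for i in range(device_count)]
    return areas, meta_areas
```

Calling sub_lv_names("foo", 2) yields the four sub-LV names shown in the
layout for 'foo'.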

LVM Meta-data format
====================
The RAID format will need to be able to store parameters that are unique to
RAID and unique to specific RAID sub-devices. It will be modeled after that
of mirroring.

Here is an example of the mirroring layout:
lv {
    id = "agL1vP-1B8Z-5vnB-41cS-lhBJ-Gcvz-dh3L3H"
    status = ["READ", "WRITE", "VISIBLE"]
    flags = []
    segment_count = 1

    segment1 {
        start_extent = 0
        extent_count = 125      # 500 Megabytes

        type = "mirror"
        mirror_count = 2
        mirror_log = "lv_mlog"
        region_size = 1024

        mirrors = [
            "lv_mimage_0", 0,
            "lv_mimage_1", 0
        ]
    }
}

The real trick is dealing with the metadata devices. Mirroring has an entry,
'mirror_log', in the top-level segment. This won't work for RAID because there
is a one-to-one mapping between the data devices and the metadata devices. The
mirror devices are laid out in sub-device/le pairs. The 'le' parameter is
redundant since it will always be zero. So for RAID, I have simply put the
metadata and data devices in pairs without the 'le' parameter.

RAID metadata:
lv {
    id = "EnpqAM-5PEg-i9wB-5amn-P116-1T8k-nS3GfD"
    status = ["READ", "WRITE", "VISIBLE"]
    flags = []
    segment_count = 1

    segment1 {
        start_extent = 0
        extent_count = 125      # 500 Megabytes

        type = "raid1"
        device_count = 2
        region_size = 1024

        raids = [
            "lv_rmeta_0", "lv_rimage_0",
            "lv_rmeta_1", "lv_rimage_1",
        ]
    }
}
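
To make the pairing explicit, here is a rough sketch of how a reader of this
metadata might regroup the flat 'raids' list into metadata/data pairs. It is
an illustration only: 'pair_raids' is a hypothetical name, and the
consistency check against 'device_count' is an assumption about what a parser
would want to verify.

```python
def pair_raids(raids, device_count):
    """Group a flat 'raids' list into (metadata_lv, data_lv) tuples."""
    if len(raids) != 2 * device_count:
        raise ValueError("expected one metadata/data pair per device")
    # Even positions hold metadata sub-LVs, odd positions hold data sub-LVs.
    return list(zip(raids[0::2], raids[1::2]))
```

For the example segment, this gives [("lv_rmeta_0", "lv_rimage_0"),
("lv_rmeta_1", "lv_rimage_1")].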

The metadata also must be capable of representing the various tunables. We
already have a good example for one from mirroring, region_size.
'max_write_behind', 'stripe_cache', and '[min|max]_recovery_rate' could also
be handled in this way. However, 'write_mostly' cannot be handled in this
way, because it is a characteristic associated with the sub_lvs, not the
array as a whole. In these cases, the status field of the sub-lv's themselves
will hold these flags - the meaning being only useful in the larger context.


##############################################
# Chapter 3: LVM RAID implementation details #
##############################################

New Segment Type(s)
===================
I've created a new file 'lib/raid/raid.c' that will handle the various different
RAID types. While there will be a unique segment type for each RAID variant,
they will all share a common backend - segtype_handler functions and
segtype->flags = SEG_RAID.

I'm also adding a new field to 'struct segment_type', parity_devs. For every
segment_type except RAID4/5/6, this will be 0. This field facilitates
allocation and size calculations. For example, the lvcreate for RAID5 would
look something like:
~> lvcreate --type raid5 -L 30G -i 3 -n my_raid5 my_vg
or
~> lvcreate --type raid5 -n my_raid5 my_vg /dev/sd[bcdef]1

In the former case, the stripe count (3) and device size are computed, and
then 'segtype->parity_devs' extra devices are allocated of the same size. In
the latter case, the number of PVs is determined and 'segtype->parity_devs' is
subtracted off to determine the number of stripes.
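
Both directions of that calculation can be sketched in a few lines of Python.
This is not the LVM implementation. The parity counts (1 for RAID4/5, 2 for
RAID6) are taken from the device totals quoted in Chapter 1; the function
names and the table layout are hypothetical.

```python
# Parity devices per base RAID type; every other segment type has 0.
PARITY_DEVS = {"raid4": 1, "raid5": 1, "raid6": 2}


def parity_devs(segtype):
    """Model of 'segtype->parity_devs' ("raid5_ls" counts as "raid5")."""
    base = segtype.split("_")[0]
    return PARITY_DEVS.get(base, 0)


def devices_from_stripes(segtype, stripes):
    """'-i <stripes>' given: total devices = stripes + parity devices."""
    return stripes + parity_devs(segtype)


def stripes_from_pvs(segtype, pv_count):
    """PV list given: stripes = PVs - parity devices."""
    return pv_count - parity_devs(segtype)
```

For the two commands above, '--type raid5 -i 3' needs 4 devices, and
'--type raid5 ... /dev/sd[bcdef]1' (5 PVs) results in 4 stripes.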

This should also work in the case of RAID10 and doing things in this manner
should not affect the way size is calculated via the area_multiple.

Allocation
==========
When a RAID device is created, metadata LVs must be created along with the
data LVs that will ultimately compose the top-level RAID array. For the
foreseeable future, the metadata LVs must reside on the same device as (or
at least one of the devices that compose) the data LV. We use this property
to simplify the allocation process. Rather than allocating for the data LVs
and then asking for a small chunk of space on the same device (or the other
way around), we simply ask for the aggregate size of the data LV plus the
metadata LV. Once we have the space allocated, we divide it between the
metadata and data LVs. This also greatly simplifies the process of finding
parallel space for all the data LVs that will compose the RAID array. When
a RAID device is resized, we will not need to take the metadata LV into
account, because it will already be present.

Apart from the metadata areas, the other unique characteristic of RAID
devices is the parity device count. The number of parity devices does nothing
to the calculation of size-per-device. The 'area_multiple' means nothing
here. The parity devices will simply be the same size as all the other devices
and will also require a metadata LV (i.e. it is treated no differently than
the other devices).

Therefore, to allocate space for RAID devices, we need to know two things:
1) how many parity devices are required and 2) does an allocated area need to
be split out for the metadata LVs after finding the space to fill the request.
We simply add these two fields to the 'alloc_handle' data structure as
'parity_count' and 'alloc_and_split_meta'. These two fields get set in
'_alloc_init'. The 'segtype->parity_devs' holds the number of parity
drives and can be directly copied to 'ah->parity_count' and
'alloc_and_split_meta' is set when a RAID segtype is detected and
'metadata_area_count' has been specified. With these two variables set, we
can calculate how many allocated areas we need. Also, in the routines that
find the actual space, they stop not when they have found ah->area_count but
when they have found (ah->area_count + ah->parity_count).
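
A simplified model of this scheme is sketched below in Python. The field
names follow the text, but the class and helper functions are hypothetical
and ignore everything else the real 'alloc_handle' carries. Sizes are in
extents, and the metadata size is passed in rather than computed.

```python
from dataclasses import dataclass


@dataclass
class AllocHandle:
    area_count: int             # data areas (stripes or mirror images)
    parity_count: int           # copied from segtype->parity_devs
    alloc_and_split_meta: bool  # split metadata out after allocation?


def areas_needed(ah):
    """Number of parallel areas the search must find before it stops."""
    return ah.area_count + ah.parity_count


def request_extents(ah, data_extents, meta_extents):
    """Extents to ask for per area: data plus metadata as one unit."""
    extra = meta_extents if ah.alloc_and_split_meta else 0
    return data_extents + extra


def split_area(ah, allocated_extents, meta_extents):
    """Divide one allocated area into (metadata, data) extents."""
    if not ah.alloc_and_split_meta:
        return 0, allocated_extents
    return meta_extents, allocated_extents - meta_extents
```

Because every area is requested as a single unit and split afterwards, the
metadata LV and its data LV necessarily end up on the same device.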

Conversion
==========
RAID -> RAID, adding images
---------------------------
When adding images to a RAID array, metadata and data components must be added
as a pair. It is best to perform as many operations as possible before writing
new LVM metadata. This allows us to error-out without having to unwind any
changes. It also makes things easier if the machine should crash during a
conversion operation. Thus, the actions performed when adding a new image are:
 1) Allocate the required number of metadata/data pairs using the method
    described above in 'Allocation' (i.e. find the metadata/data space
    as one unit and split the space between them after found - this keeps
    them together on the same device).
 2) Form the metadata/data LVs from the allocated space (leave them
    visible) - setting required RAID_[IMAGE | META] flags as appropriate.
 3) Write the LVM metadata
 4) Activate and clear the metadata LVs. The clearing of the metadata
    requires the LVM metadata be written (step 3) and is a requirement
    before adding the new metadata LVs to the array. If the metadata
    is not cleared, it may carry residual superblock state from a previous
    array the device may have been part of.
 5) Deactivate new sub-LVs and set them "hidden".
 6) Expand the 'first_seg(raid_lv)->areas' and '->meta_areas' array
    for inclusion of the new sub-LVs
 7) Add new sub-LVs and update 'first_seg(raid_lv)->area_count'
 8) Commit new LVM metadata
Failure during any of these steps will not affect the original RAID array. In
the worst scenario, the user may have to remove the new sub-LVs that did not
yet make it into the array.

RAID -> RAID, removing images
-----------------------------
To remove images from a RAID, the metadata/data LV pairs must be removed
together. This is pretty straightforward, but one place where RAID really
differs from the "mirror" segment type is how the resulting "holes" are filled.
When a device is removed from a "mirror" segment type, it is identified, moved
to the end of the 'mirrored_seg->areas' array, and then removed. This action
causes the other images to shift down and fill the position of the device which
was removed. While "raid1" could be handled in this way, the other RAID types
could not be - it would corrupt the ordering of the data on the array. Thus,
when a device is removed from a RAID array, the corresponding metadata/data
sub-LVs are removed from the 'raid_seg->meta_areas' and 'raid_seg->areas' arrays.
The slots in these 'lv_segment_area' arrays are set to 'AREA_UNASSIGNED'. RAID
is perfectly happy to construct a DM table mapping with '- -' if it comes across
an area assigned in such a way. The pair of dashes is a valid way to tell the
RAID kernel target that the slot should be considered empty. So, we can remove
devices from a RAID array without affecting the correct operation of the RAID.
(It also becomes easy to replace the empty slots properly if a spare device is
available.) In the case of RAID1 device removal, the empty slot can be safely
eliminated. This is done by shifting the higher indexed devices down to fill
the slot. Even the names of the images will be renamed to properly reflect
their index in the array. Unlike the "mirror" segment type, you will never have
an image named "*_rimage_1" occupying the index position 0.
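
The two behaviours can be contrasted with a short Python sketch. It is a
model only: sub-LV names stand in for the device references a real DM table
line would carry, None stands in for 'AREA_UNASSIGNED', and both function
names are hypothetical.

```python
AREA_UNASSIGNED = None  # stand-in for LVM's AREA_UNASSIGNED marker (assumption)


def _assigned(meta, data):
    return meta is not AREA_UNASSIGNED and data is not AREA_UNASSIGNED


def table_slots(meta_areas, areas):
    """Build the per-slot part of a table line; empty slots become '- -'."""
    slots = []
    for meta, data in zip(meta_areas, areas):
        slots.append(f"{meta} {data}" if _assigned(meta, data) else "- -")
    return " ".join(slots)


def compact_raid1(lv_name, meta_areas, areas):
    """RAID1 only: drop empty slots and rename survivors by new index."""
    kept = sum(1 for meta, data in zip(meta_areas, areas) if _assigned(meta, data))
    new_meta = [f"{lv_name}_rmeta_{i}" for i in range(kept)]
    new_data = [f"{lv_name}_rimage_{i}" for i in range(kept)]
    return new_meta, new_data
```

Removing the middle image of a three-device array leaves a '- -' hole for
RAID4/5/6, whereas a RAID1 array is compacted to images 0 and 1.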

As with adding images, removing images holds off on committing LVM metadata
until all possible changes have been made. This reduces the likelihood of bad
intermediate stages being left due to a failure of operation or machine crash.

RAID1 '--splitmirrors', '--trackchanges', and '--merge' operations
------------------------------------------------------------------
This suite of operations is only available to the "raid1" segment type.

Splitting an image from a RAID1 array is almost identical to the removal of
an image described above. However, the metadata LV associated with the split
image is removed and the data LV is kept and promoted to a top-level device.
(i.e. It is made visible and stripped of its RAID_IMAGE status flags.)

When the '--trackchanges' option is given along with the '--splitmirrors'
argument, the metadata LV is left as part of the original array. The data LV
is set as 'VISIBLE' and read-only (~LVM_WRITE). When the array DM table is
being created, it notices the read-only, VISIBLE nature of the sub-LV and puts
in the '- -' sentinel. Only a single image can be split from the mirror and
the name of the sub-LV cannot be changed. Unlike '--splitmirrors' on its own,
the '--name' argument must not be specified. Therefore, the name of the newly
split LV will remain the same '<lv>_rimage_<N>', where 'N' is the index of the
slot in the array with which it is associated.

When an LV which was split from a RAID1 array with the '--trackchanges' option
is merged back into the array, its read/write status is restored and it is
set as "hidden" again. Recycling the array (suspend/resume) restores the sub-LV
to its position in the array and begins the process of sync'ing the changes that
were made since the time it was split from the array.

RAID device replacement with '--replace'
----------------------------------------
This option is available to all RAID segment types.

The '--replace' option can be used to remove a particular device from a RAID
logical volume and replace it with a different one in one action (CLI command).
The device to be removed is specified as the argument to the '--replace'
option. This option can be specified more than once in a single command,
allowing multiple devices to be replaced at the same time - provided the RAID
logical volume has the necessary redundancy to allow the action. The devices
to be used as replacements can also be specified in the command, similar to the
way allocatable devices are specified during an up-convert.

Example> lvconvert --replace /dev/sdd1 --replace /dev/sde1 vg/lv /dev/sd[bc]1

RAID '--repair'
---------------
This 'lvconvert' option is available to all RAID segment types and is described
under "RAID Fault Handling".


RAID Fault Handling
===================
RAID is not like traditional LVM mirroring (i.e. the "mirror" segment type).
LVM mirroring required failed devices to be removed or the logical volume would
simply hang. RAID arrays can keep on running with failed devices. In fact, for
RAID types other than RAID1 removing a device would mean substituting an error
target or converting to a lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to
RAID0). Therefore, rather than removing a failed device unconditionally, the
user has a couple of options to choose from.

The automated response to a device failure is handled according to the user's
preference defined in lvm.conf:activation.raid_fault_policy. The options are:
  # "warn"     - Use the system log to warn the user that a device in the RAID
  #              logical volume has failed. It is left to the user to run
  #              'lvconvert --repair' manually to remove or replace the failed
  #              device. As long as the number of failed devices does not
  #              exceed the redundancy of the logical volume (1 device for
  #              raid4/5, 2 for raid6, etc) the logical volume will remain
  #              usable.
  #
  # "remove"   - NOT CURRENTLY IMPLEMENTED OR DOCUMENTED IN example.conf.in.
  #              Remove the failed device and reduce the RAID logical volume
  #              accordingly. If a single device dies in a 3-way mirror,
  #              remove it and reduce the mirror to 2-way. If a single device
  #              dies in a RAID 4/5 logical volume, reshape it to a striped
  #              volume, etc - RAID 6 -> RAID 4/5 -> RAID 0. If devices
  #              cannot be removed for lack of redundancy, fail.
  #              THIS OPTION CANNOT YET BE IMPLEMENTED BECAUSE RESHAPE IS NOT
  #              YET SUPPORTED IN linux/drivers/md/dm-raid.c. The superblock
  #              does not yet hold enough information to support reshaping.
  #
  # "allocate" - Attempt to use any extra physical volumes in the volume
  #              group as spares and replace faulty devices.
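
The redundancy rule quoted under "warn" can be written down as a small check.
The sketch below is illustrative Python, not LVM code. The RAID4/5/6 values
come from the text; treating RAID1 as tolerating all but one image, and
everything else as having no redundancy, is an assumption filling in the
"etc". The function names are hypothetical.

```python
def redundancy(segtype, device_count):
    """Number of device failures the array can tolerate."""
    base = segtype.split("_")[0]  # "raid5_ls" -> "raid5"
    if base == "raid1":
        # Assumption: a mirror survives while one image remains.
        return device_count - 1
    return {"raid4": 1, "raid5": 1, "raid6": 2}.get(base, 0)


def still_usable(segtype, device_count, failed):
    """True while failures do not exceed the array's redundancy."""
    return failed <= redundancy(segtype, device_count)
```

For instance, a 4-device raid5 volume stays usable with one failed device but
not with two.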

If manual intervention is taken, either in response to the automated solution's
"warn" mode or simply because dmeventd hadn't run, then the user can call
'lvconvert --repair vg/lv' and follow the prompts. They will be prompted
whether or not to replace the device and cause a full recovery of the failed
device.

If replacement is chosen via the manual method or "allocate" is the policy taken
by the automated response, then 'lvconvert --replace' is the mechanism used to
attempt the replacement of the failed device.

'vgreduce --removemissing' is ineffectual at repairing RAID logical volumes. It
will remove the failed device, but the RAID logical volume will simply continue
to operate with an <unknown> sub-LV. The user should clear the failed device
with 'lvconvert --repair'.