=======================
= LVM RAID Design Doc =
=======================

#############################
# Chapter 1: User-Interface #
#############################

***************** CREATING A RAID DEVICE ******************

01: lvcreate --type <RAID type> \
02:          [--regionsize <size>] \
03:          [-i/--stripes <#>] [-I,--stripesize <size>] \
04:          [-m/--mirrors <#>] \
05:          [--[min|max]recoveryrate <kB/sec/disk>] \
06:          [--stripecache <size>] \
07:          [--writemostly <devices>] \
08:          [--maxwritebehind <size>] \
09:          [[no]sync] \
10:          <Other normal args, like: -L 5G -n lv vg> \
11:          [devices]

Line 01:
I don't intend for there to be shorthand options for specifying the
segment type. The available RAID types are:
        "raid0"    - Stripe [NOT IMPLEMENTED]
        "raid1"    - should replace DM Mirroring
        "raid10"   - striped mirrors [NOT IMPLEMENTED]
        "raid4"    - RAID4
        "raid5"    - Same as "raid5_ls" (same default as MD)
        "raid5_la" - RAID5 Rotating parity 0 with data continuation
        "raid5_ra" - RAID5 Rotating parity N with data continuation
        "raid5_ls" - RAID5 Rotating parity 0 with data restart
        "raid5_rs" - RAID5 Rotating parity N with data restart
        "raid6"    - Same as "raid6_zr"
        "raid6_zr" - RAID6 Rotating parity 0 with data restart
        "raid6_nr" - RAID6 Rotating parity N with data restart
        "raid6_nc" - RAID6 Rotating parity N with data continuation
The exception to 'no shorthand options' will be where the RAID implementations
can displace traditional targets. This is the case with 'mirror' and 'raid1'.
In this case, "mirror_segtype_default" - found under the "global" section in
lvm.conf - can be set to "mirror" or "raid1". The segment type inferred when
the '-m' option is used will be taken from this setting. The default segment
types can be overridden on the command line by using the '--type' argument.
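
For example, a sketch of how the default and the override would interact (the
sizes and names here are illustrative only):
        # lvm.conf
        global {
                mirror_segtype_default = "raid1"
        }

        ~> lvcreate -m 1 -L 1G -n lv vg                  # creates a "raid1" LV
        ~> lvcreate --type mirror -m 1 -L 1G -n lv vg    # forces old-style "mirror"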

Line 02:
Region size is relevant for all RAID types. It defines the granularity at
which the bitmap will track the active areas of disk. The default is currently
4MiB. I see no reason to change this unless it is a problem for MD performance.
MD does impose a restriction of 2^21 regions for a given device, however. This
means two things: 1) we should never need a metadata area larger than
8kiB+sizeof(superblock)+bitmap_offset (IOW, pretty small) and 2) the region
size will have to be upwardly revised if the device is larger than 8TiB
(assuming defaults).
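
To make the 8TiB figure concrete, a quick back-of-the-envelope check using the
defaults above:
        max trackable size = 2^21 regions * 4MiB/region = 8TiB
so, for example, a 16TiB device would need the region size raised to at least
8MiB to stay within MD's 2^21-region limit.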

Line 03/04:
The '-m/--mirrors' option is only relevant to RAID1 and will be used just like
it is today for DM mirroring. For all other RAID types, -i/--stripes and
-I/--stripesize are relevant. The former will specify the number of data
devices that will be used for striping. For example, if the user specifies
'--type raid0 -i 3', then 3 devices are needed. If the user specifies
'--type raid6 -i 3', then 5 devices are needed. The -I/--stripesize may be
confusing to MD users, as they use the term "chunksize". I think they will
adapt without issue and I don't wish to create a conflict with the term
"chunksize" that we use for snapshots.

Line 05/06/07:
I'm still not clear on how to specify these options. Some are easier than
others. '--writemostly' is particularly hard because it involves specifying
which devices shall be 'write-mostly' and thus also have 'max-write-behind'
applied to them. It has been suggested that a '--readmostly'/'--readfavored'
or similar option could be introduced as a way to specify a primary disk vs.
specifying all the non-primary disks via '--writemostly'. I like this idea,
but haven't come up with a good name yet. Thus, these will remain
unimplemented until future specification.

Line 09/10/11:
These are familiar.

Further creation related ideas:
Today, you can specify '--type mirror' without an '-m/--mirrors' argument
being necessary. The number of devices defaults to two (and the log defaults
to 'disk'). A similar thing should happen with the RAID types. All of them
should default to having two data devices unless otherwise specified. This
would mean a total of 2 devices for RAID 0/1, 3 devices for RAID 4/5, and
4 devices for RAID 6/10.
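
If that default were adopted, a command with no '-i' or '-m' argument, e.g.:
        ~> lvcreate --type raid6 -L 5G -n lv vg
would allocate 2 data devices plus 2 parity devices - 4 devices in total.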


***************** CONVERTING A RAID DEVICE ******************

01: lvconvert [--type <RAID type>] \
02:           [-R/--regionsize <size>] \
03:           [-i/--stripes <#>] [-I,--stripesize <size>] \
04:           [-m/--mirrors <#>] \
05:           [--merge] \
06:           [--splitmirrors <#> [--trackchanges]] \
07:           [--replace <sub_lv|device>] \
08:           [--[min|max]recoveryrate <kB/sec/disk>] \
09:           [--stripecache <size>] \
10:           [--writemostly <devices>] \
11:           [--maxwritebehind <size>] \
12:           vg/lv
13:           [devices]

lvconvert should work exactly as it does now when dealing with mirrors -
even if (when) we switch to MD RAID1. Of course, there are no plans to
allow the presence of the metadata area to be configurable (e.g. --corelog).
It will be simple enough to detect if the LV being up/down-converted is
new or old-style mirroring.

If we choose to use MD RAID0 as well, it will be possible to change the
number of stripes and the stripesize. It is therefore conceivable to see
something like 'lvconvert -i +1 vg/lv'.

Line 01:
It is possible to change the RAID type of an LV - even if that LV is already
a RAID device of a different type. For example, you could change from
RAID4 to RAID5, or from RAID5 to RAID6.

Line 02/03/04:
These are familiar options - all of which would now be available as options
for change. (However, it'd be nice if we didn't have regionsize in there.
It's simple on the kernel side, but is just an extra - often unnecessary -
parameter to many functions in the LVM codebase.)

Line 05:
This option is used to merge an LV back into a RAID1 array - provided it was
split for temporary read-only use by '--splitmirrors 1 --trackchanges'.

Line 06:
The '--splitmirrors <#>' argument should be familiar from the "mirror" segment
type. It allows RAID1 images to be split from the array to form a new LV.
Either the original LV or the split LV - or both - could become a linear LV as
a result. If the '--trackchanges' argument is specified in addition to
'--splitmirrors', an LV will be split from the array. It will be read-only.
This operation does not change the original array - except that it uses an
empty slot to hold the position of the split LV, which it expects to return in
the future (see the '--merge' argument). It tracks any changes that occur to
the array while the slot is kept in reserve. If the LV is merged back into the
array, only the changes are resync'ed to the returning image. Repeating the
'lvconvert' operation without the '--trackchanges' option will complete the
split of the LV permanently.
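
As a sketch of the intended workflow (the '_rimage_1' name assumes the second
image is the one chosen for the split):
        ~> lvconvert --splitmirrors 1 --trackchanges vg/lv
           # vg/lv_rimage_1 becomes a visible, read-only LV; its slot is reserved
        ~> lvconvert --merge vg/lv_rimage_1
           # the image rejoins the array; only the changed regions are resync'ed
Running the '--splitmirrors' command again without '--trackchanges' would
instead complete the split permanently, as described above.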

Line 07:
This option allows the user to specify a sub_lv (e.g. a mirror image) or
a particular device for replacement. The device (or all the devices in
the sub_lv) will be removed and replaced with different devices from the
VG.

Line 08/09/10/11:
It should be possible to alter these parameters of a RAID device. As with
lvcreate, however, I'm not entirely certain how to best define some of these.
We don't need all the capabilities at once though, so it isn't a pressing
issue.

Line 12:
The LV to operate on.

Line 13:
Devices that are to be used to satisfy the conversion request. If the
operation removes devices or splits a mirror, then the devices specified
form the list of candidates for removal. If the operation adds or replaces
devices, then the devices specified form the list of candidates for allocation.
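
For example (device names hypothetical):
        ~> lvconvert --splitmirrors 1 --name split_lv vg/lv /dev/sdd1
           # /dev/sdd1 is the candidate from which the image is split
        ~> lvconvert -m 2 vg/lv /dev/sdf1
           # /dev/sdf1 is the candidate from which the new image is allocated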


###############################################
# Chapter 2: LVM RAID internal representation #
###############################################

The internal representation is somewhat like mirroring, but with alterations
for the different metadata components. LVM mirroring has a single log LV,
but RAID will have one for each data device. Because of this, I've added a
new 'areas' list to the 'struct lv_segment' - 'meta_areas'. There is exactly
a one-to-one relationship between 'areas' and 'meta_areas'. The 'areas' array
still holds the data sub-lv's (similar to mirroring), while the 'meta_areas'
array holds the metadata sub-lv's (akin to the mirroring log device).

The sub_lvs will be named '%s_rimage_%d' instead of '%s_mimage_%d' as it is
for mirroring, and '%s_rmeta_%d' instead of '%s_mlog'. Thus, you can imagine
an LV named 'foo' with the following layout:
        foo
        [foo's lv_segment]
        |
        |-> foo_rimage_0 (areas[0])
        |       [foo_rimage_0's lv_segment]
        |-> foo_rimage_1 (areas[1])
        |       [foo_rimage_1's lv_segment]
        |
        |-> foo_rmeta_0 (meta_areas[0])
        |       [foo_rmeta_0's lv_segment]
        |-> foo_rmeta_1 (meta_areas[1])
        |       [foo_rmeta_1's lv_segment]

LVM Meta-data format
====================
The RAID format will need to be able to store parameters that are unique to
RAID and unique to specific RAID sub-devices. It will be modeled after that
of mirroring.

Here is an example of the mirroring layout:
lv {
        id = "agL1vP-1B8Z-5vnB-41cS-lhBJ-Gcvz-dh3L3H"
        status = ["READ", "WRITE", "VISIBLE"]
        flags = []
        segment_count = 1

        segment1 {
                start_extent = 0
                extent_count = 125      # 500 Megabytes

                type = "mirror"
                mirror_count = 2
                mirror_log = "lv_mlog"
                region_size = 1024

                mirrors = [
                        "lv_mimage_0", 0,
                        "lv_mimage_1", 0
                ]
        }
}

The real trick is dealing with the metadata devices. Mirroring has an entry,
'mirror_log', in the top-level segment. This won't work for RAID because there
is a one-to-one mapping between the data devices and the metadata devices. The
mirror devices are laid out in sub-device/le pairs. The 'le' parameter is
redundant since it will always be zero. So for RAID, I have simply put the
metadata and data devices in pairs without the 'le' parameter.

RAID metadata:
lv {
        id = "EnpqAM-5PEg-i9wB-5amn-P116-1T8k-nS3GfD"
        status = ["READ", "WRITE", "VISIBLE"]
        flags = []
        segment_count = 1

        segment1 {
                start_extent = 0
                extent_count = 125      # 500 Megabytes

                type = "raid1"
                device_count = 2
                region_size = 1024

                raids = [
                        "lv_rmeta_0", "lv_rimage_0",
                        "lv_rmeta_1", "lv_rimage_1",
                ]
        }
}

The metadata also must be capable of representing the various tunables. We
already have a good example of one from mirroring: region_size.
'max_write_behind', 'stripe_cache', and '[min|max]_recovery_rate' could also
be handled in this way. However, 'write_mostly' cannot be handled in this
way, because it is a characteristic associated with the sub_lvs, not the
array as a whole. In these cases, the status field of the sub-lv's themselves
will hold these flags - their meaning being only useful in the larger context.
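
A rough sketch of how this might be laid out in the metadata (the tunable key
and the sub-LV status flag shown here are illustrative placements, not settled
syntax):
        segment1 {
                ...
                type = "raid1"
                device_count = 2
                region_size = 1024
                max_write_behind = 256    # array-wide tunable kept in the segment
        }
        ...
        lv_rimage_1 {
                # per-device characteristic carried as a status flag on the sub-LV
                status = ["READ", "WRITE", "WRITE_MOSTLY"]
        }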


##############################################
# Chapter 3: LVM RAID implementation details #
##############################################

New Segment Type(s)
===================
I've created a new file, 'lib/raid/raid.c', that will handle the various
RAID types. While there will be a unique segment type for each RAID variant,
they will all share a common backend - segtype_handler functions and
segtype->flags = SEG_RAID.

I'm also adding a new field to 'struct segment_type', parity_devs. For every
segment_type except RAID4/5/6, this will be 0. This field facilitates
allocation and size calculations. For example, the lvcreate for RAID5 would
look something like:
        ~> lvcreate --type raid5 -L 30G -i 3 -n my_raid5 my_vg
or
        ~> lvcreate --type raid5 -n my_raid5 my_vg /dev/sd[bcdef]1

In the former case, the stripe count (3) and device size are computed, and
then 'segtype->parity_devs' extra devices are allocated of the same size. In
the latter case, the number of PVs is determined and 'segtype->parity_devs' is
subtracted off to determine the number of stripes.
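
Working those two examples through (raid5 has parity_devs = 1):
        -L 30G -i 3       ->  30G / 3 stripes = 10G per device;
                              3 data + 1 parity = 4 devices of 10G each
        /dev/sd[bcdef]1   ->  5 PVs - 1 parity device = 4 stripes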

This should also work in the case of RAID10, and doing things in this manner
should not affect the way size is calculated via the area_multiple.

Allocation
==========
When a RAID device is created, metadata LVs must be created along with the
data LVs that will ultimately compose the top-level RAID array. For the
foreseeable future, the metadata LVs must reside on the same device as (or
at least one of the devices that compose) the data LV. We use this property
to simplify the allocation process. Rather than allocating for the data LVs
and then asking for a small chunk of space on the same device (or the other
way around), we simply ask for the aggregate size of the data LV plus the
metadata LV. Once we have the space allocated, we divide it between the
metadata and data LVs. This also greatly simplifies the process of finding
parallel space for all the data LVs that will compose the RAID array. When
a RAID device is resized, we will not need to take the metadata LV into
account, because it will already be present.

Apart from the metadata areas, the other unique characteristic of RAID
devices is the parity device count. The number of parity devices does not
change the calculation of size-per-device. The 'area_multiple' means nothing
here. The parity devices will simply be the same size as all the other
devices and will also require a metadata LV (i.e. they are treated no
differently than the other devices).

Therefore, to allocate space for RAID devices, we need to know two things:
1) how many parity devices are required and 2) whether an allocated area needs
to be split out for the metadata LVs after finding the space to fill the
request. We simply add these two fields to the 'alloc_handle' data structure
as 'parity_count' and 'alloc_and_split_meta'. These two fields get set in
'_alloc_init'. 'segtype->parity_devs' holds the number of parity drives and
can be directly copied to 'ah->parity_count', and 'alloc_and_split_meta' is
set when a RAID segtype is detected and 'metadata_area_count' has been
specified. With these two variables set, we can calculate how many allocated
areas we need. Also, the routines that find the actual space stop not when
they have found 'ah->area_count' areas, but when they have found
'(ah->area_count + ah->parity_count)' areas.
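
As a concrete illustration (the numbers follow from the rules above): for
'lvcreate --type raid6 -i 3 -L 30G', area_count = 3 and parity_count = 2, so
the allocation code must find 5 parallel areas; with 'alloc_and_split_meta'
set, each of those 5 areas is then divided into a small rmeta LV and the
rimage LV that shares its device.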

Conversion
==========
RAID -> RAID, adding images
---------------------------
When adding images to a RAID array, metadata and data components must be added
as a pair. It is best to perform as many operations as possible before writing
new LVM metadata. This allows us to error-out without having to unwind any
changes. It also makes things easier if the machine should crash during a
conversion operation. Thus, the actions performed when adding a new image are:
        1) Allocate the required number of metadata/data pairs using the method
           described above in 'Allocation' (i.e. find the metadata/data space
           as one unit and split the space between them after it is found -
           this keeps them together on the same device).
        2) Form the metadata/data LVs from the allocated space (leave them
           visible) - setting required RAID_[IMAGE | META] flags as appropriate.
        3) Write the LVM metadata.
        4) Activate and clear the metadata LVs. The clearing of the metadata
           requires that the LVM metadata be written (step 3) and is a
           requirement before adding the new metadata LVs to the array. If the
           metadata is not cleared, it may carry residual superblock state from
           a previous array the device was part of.
        5) Deactivate the new sub-LVs and set them "hidden".
        6) Expand the 'first_seg(raid_lv)->areas' and '->meta_areas' arrays
           for inclusion of the new sub-LVs.
        7) Add the new sub-LVs and update 'first_seg(raid_lv)->area_count'.
        8) Commit the new LVM metadata.
Failure during any of these steps will not affect the original RAID array. In
the worst case, the user may have to remove the new sub-LVs that did not yet
make it into the array.
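
For instance, a hypothetical up-convert of the 'foo' LV from the diagram in
Chapter 2 (growing it from 2-way to 3-way):
        ~> lvconvert -m 2 vg/foo
would walk through steps 1-8 above and, once the metadata is committed, leave
a new 'foo_rmeta_2'/'foo_rimage_2' pair allocated together on a single PV.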

RAID -> RAID, removing images
-----------------------------
To remove images from a RAID, the metadata/data LV pairs must be removed
together. This is pretty straightforward, but one place where RAID really
differs from the "mirror" segment type is how the resulting "holes" are filled.
When a device is removed from a "mirror" segment type, it is identified, moved
to the end of the 'mirrored_seg->areas' array, and then removed. This action
causes the other images to shift down and fill the position of the device which
was removed. While "raid1" could be handled in this way, the other RAID types
could not be - it would corrupt the ordering of the data on the array. Thus,
when a device is removed from a RAID array, the corresponding metadata/data
sub-LVs are removed from the 'raid_seg->meta_areas' and 'raid_seg->areas'
arrays. The slots in these 'lv_segment_area' arrays are set to
'AREA_UNASSIGNED'. RAID is perfectly happy to construct a DM table mapping
with '- -' if it comes across an area assigned in such a way. The pair of
dashes is a valid way to tell the RAID kernel target that the slot should be
considered empty. So, we can remove devices from a RAID array without
affecting the correct operation of the RAID. (It also becomes easy to fill
the empty slots properly if a spare device is available.) In the case of
RAID1 device removal, the empty slot can be safely eliminated. This is done
by shifting the higher-indexed devices down to fill the slot. The images will
even be renamed to properly reflect their index in the array. Unlike the
"mirror" segment type, you will never have an image named "*_rimage_1"
occupying index position 0.

As with adding images, removing images holds off on committing LVM metadata
until all possible changes have been made. This reduces the likelihood of bad
intermediate stages being left due to a failure of operation or machine crash.

RAID1 '--splitmirrors', '--trackchanges', and '--merge' operations
------------------------------------------------------------------
This suite of operations is only available to the "raid1" segment type.

Splitting an image from a RAID1 array is almost identical to the removal of
an image described above. However, the metadata LV associated with the split
image is removed and the data LV is kept and promoted to a top-level device.
(i.e. It is made visible and stripped of its RAID_IMAGE status flags.)

When the '--trackchanges' option is given along with the '--splitmirrors'
argument, the metadata LV is left as part of the original array. The data LV
is set as 'VISIBLE' and read-only (~LVM_WRITE). When the array DM table is
being created, it notices the read-only, VISIBLE nature of the sub-LV and puts
in the '- -' sentinel. Only a single image can be split from the mirror and
the name of the sub-LV cannot be changed. Unlike '--splitmirrors' on its own,
the '--name' argument must not be specified. Therefore, the name of the newly
split LV will remain the same - '<lv>_rimage_<N>', where 'N' is the index of
the slot in the array with which it is associated.

When an LV which was split from a RAID1 array with the '--trackchanges' option
is merged back into the array, its read/write status is restored and it is
set as "hidden" again. Recycling the array (suspend/resume) restores the
sub-LV to its position in the array and begins the process of sync'ing the
changes that were made since the time it was split from the array.

RAID device replacement with '--replace'
----------------------------------------
This option is available to all RAID segment types.

The '--replace' option can be used to remove a particular device from a RAID
logical volume and replace it with a different one in one action (CLI command).
The device to be removed is specified as the argument to the '--replace'
option. This option can be specified more than once in a single command,
allowing multiple devices to be replaced at the same time - provided the RAID
logical volume has the necessary redundancy to allow the action. The devices
to be used as replacements can also be specified in the command, similar to the
way allocatable devices are specified during an up-convert.

Example> lvconvert --replace /dev/sdd1 --replace /dev/sde1 vg/lv /dev/sd[bc]1

RAID '--repair'
---------------
This 'lvconvert' option is available to all RAID segment types and is described
under "RAID Fault Handling".

RAID Fault Handling
===================
RAID is not like traditional LVM mirroring (i.e. the "mirror" segment type).
LVM mirroring required failed devices to be removed or the logical volume would
simply hang. RAID arrays can keep on running with failed devices. In fact, for
RAID types other than RAID1, removing a device would mean substituting an error
target or converting to a lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to
RAID0). Therefore, rather than removing a failed device unconditionally, the
user has a couple of options to choose from.

The automated response to a device failure is handled according to the user's
preference defined in lvm.conf:activation.raid_fault_policy. The options are:
# "warn"     - Use the system log to warn the user that a device in the RAID
#              logical volume has failed. It is left to the user to run
#              'lvconvert --repair' manually to remove or replace the failed
#              device. As long as the number of failed devices does not
#              exceed the redundancy of the logical volume (1 device for
#              raid4/5, 2 for raid6, etc) the logical volume will remain
#              usable.
#
# "remove"   - NOT CURRENTLY IMPLEMENTED OR DOCUMENTED IN example.conf.in.
#              Remove the failed device and reduce the RAID logical volume
#              accordingly. If a single device dies in a 3-way mirror,
#              remove it and reduce the mirror to 2-way. If a single device
#              dies in a RAID 4/5 logical volume, reshape it to a striped
#              volume, etc - RAID 6 -> RAID 4/5 -> RAID 0. If devices
#              cannot be removed for lack of redundancy, fail.
#              THIS OPTION CANNOT YET BE IMPLEMENTED BECAUSE RESHAPE IS NOT
#              YET SUPPORTED IN linux/drivers/md/dm-raid.c. The superblock
#              does not yet hold enough information to support reshaping.
#
# "allocate" - Attempt to use any extra physical volumes in the volume
#              group as spares and replace faulty devices.

If manual intervention is taken, either in response to the automated solution's
"warn" mode or simply because dmeventd was not running, then the user can call
'lvconvert --repair vg/lv' and follow the prompts. They will be prompted
whether or not to replace the device and thereby cause a full recovery of the
failed device.

If replacement is chosen via the manual method, or "allocate" is the policy
taken by the automated response, then 'lvconvert --replace' is the mechanism
used to attempt the replacement of the failed device.

'vgreduce --removemissing' is ineffectual at repairing RAID logical volumes. It
will remove the failed device, but the RAID logical volume will simply continue
to operate with an <unknown> sub-LV. The user should clear the failed device
with 'lvconvert --repair' instead.