Peter Rajnoha [Fri, 3 May 2013 11:20:07 +0000 (13:20 +0200)]
udev: fire pvscan --cache properly on CHANGE event for MD devices
Commit 756bcabbfe297688ba240a880bc2b55265ad33f0 restricted the
situations at which the LVM autoactivation fires - only on ADD
event for devices other than DM. However, this caused a problem
for MD devices...
MD devices are activated in a very similar way as DM devices:
the MD dev is created on first appeareance of MD array member
(ADD event) and stays *inactive* until the array is complete.
Just then the MD dev turns to active state and this is reported
to userspace by CHANGE event.
Unfortunately, we can't differentiate between the CHANGE event
coming from udev trigger/WATCH rule and CHANGE event coming from
the transition to active state - MD would need to add similar logic
we already use to detect this in DM world. For now, we just have
to enable pvscan --cache on *all* CHANGE events for MD so the
autoactivation of the LVM volumes on top of MD works.
A downside of this is that a spurious CHANGE event for MD dev
can cause the LVM volumes on top of it to be automatically activated.
However, one should not open/change the device underneath until
the device above in the stack is removed! So this situation should
only happen if one opens the MD dev for read-write by mistake
(and hence firing the CHANGE event because of the WATCH udev rule),
or if calling udev trigger manually for the MD dev.
(No WHATS_NEW here as this fixes the commit mentioned
above and which has not been released yet.)
Support for exclusive activation of snapshots revealed some problems.
When snapshot is created, COW LV is activated first (for clearing) and
then it's transformed into snapshot's COW LV, but it has left the lock
for such LV active in cluster and this lock could not have been removed
from dlm, unless snapshot has been removed within same dlm session.
If the user tried to remove snapshot after rebooting node, the lock was
missing, and COW LV could not have been detached.
Patch modifes the approach in this way:
Always deactivate COW LV for clustered vg after clearing (so it's
activated again via imlicit snapshot activation rule when snapshot is activated).
When snapshot is removed, activate COW LV as independend LV, so the lock
will exist for such LV, but only when the snapshot is active.
Also add test case for testing snapshot removal after cluster reboot.
Patch fixes hidden problem with lvm metadata caching.
When the pretest was made, only the commited data have been cached back
since the call lv_info_by_lvid() triggers mda read operation.
However call of lv_suspend_if_active() also reads precommited metadata.
The problem is visible in this sequence of calls:
which may end with leaving outdated mda in lvm cache, since vg_write()
drops cached metadata and vg_commit() only transforms precommited
to commited metadata, but in the case of pretesting we have
no precommited mda available so the cache will continue to use
old metadata. This happens, when suspend LV is inactive.
Merging multiple config files together needs to know newest (highest)
timestamp of merged files. Persistent cache file is being used
only in case, the config file is older then .cache file.
Add verbose message when we will not obtain devices from udev
(i.e. testing is using different udev dir, and the log was
giving misleading info about using udev)
Add proper error message if zalloc from pull would have failed.
Assign fid as the last step before returning VG.
Make the format reader for 'lvm1' and 'pool' equal to 'lvm2' format reader.
It has caused memory corruption to lvmetad as it later calls
destroy_instance() to allocated fid. This patch should fix problems
with crashing test lvmetad-lvm1.sh.
Sharing char* with field has a problem in error path,
when we allocate event, but fail to allocate timeout string.
Instead of creating complicated error paths to resolve
it individually stop using unions, and let the resource
to be released in a simple _free_message().
Peter Rajnoha [Fri, 19 Apr 2013 10:17:53 +0000 (12:17 +0200)]
udev: also autoactivate on coldplug
Commit 756bcabbfe297688ba240a880bc2b55265ad33f0 fixed autoactivation
to not trigger on each uevent for a PV that appeared in the system
most notably the events that are triggered artificially (udevadm
trigger or as the result of the WATCH udev rule being applied that
consequently generates CHANGE uevents). This fixed a situation in
which VGs/LVs were activated when they should not.
BUT we still need to care about the coldplug used at boot to
retrigger the ADD events - the "udevadm trigger --action=add"!
For non-DM-based PVs, this is already covered as for these we
run the autoactivation on ADD event only.
However, for DM-based PVs, we still need to run the
autoactivation even for the artificial ADD event, reusing
the udev DB content from previous proper CHANGE event that
came with the DM device activation.
Simply, this patch fixes a situation in which we run extra
"udevadm trigger --action=add" (or echo add > /sys/block/<dev>/uevent)
for DM-based PVs (cryptsetup devices, multipath devices, any
other DM devices...).
Without this patch, while using lvmetad + autoactivation,
any VG/LV that has a DM-based PV and for which we do not
call the activation directly, the VG/LV is not activated.
For example a VG with an LV with root FS on it which is directly
activated in initrd and then missing activation of the rest
of the LVs in the VG because of unhandled uevent retrigger on
boot after switching to root FS (the "coldplug").
(No WHATS_NEW here as this fixes the commit mentioned
above and which was not released yet.)
Tony Asleson [Tue, 16 Apr 2013 08:42:03 +0000 (10:42 +0200)]
config_def_check: fix memory leak
There is no need to strdup a key when inserting into
the hash table as the table allocates memory and copies
the string. This was causing memory to be lost.
Jonathan Brassow [Mon, 15 Apr 2013 18:59:46 +0000 (13:59 -0500)]
RAID: Add writemostly/writebehind support for RAID1
'lvchange' is used to alter a RAID 1 logical volume's write-mostly and
write-behind characteristics. The '--writemostly' parameter takes a
PV as an argument with an optional trailing character to specify whether
to set ('y'), unset ('n'), or toggle ('t') the value. If no trailing
character is given, it will set the flag.
Synopsis:
lvchange [--writemostly <PV>:{t|y|n}] [--writebehind <count>] vg/lv
Example:
lvchange --writemostly /dev/sdb1:y --writebehind 512 vg/raid1_lv
The last character in the 'lv_attr' field is used to show whether a device
has the WriteMostly flag set. It is signified with a 'w'. If the device
has failed, the 'p'artial flag has priority.
Example ("nosync" raid1 with mismatch_cnt and writemostly):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg Rwi---r-m 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-w 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-- 1 linear 4.00m
Example (raid1 with mismatch_cnt, writemostly - but failed drive):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg rwi---r-p 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-p 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-p 1 linear 4.00m
A new reportable field has been added for writebehind as well. If
write-behind has not been set or the LV is not RAID1, the field will
be blank.
Example (writebehind is set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r-- 512
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
Example (writebehind is not set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r--
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
The original code also handled len==1, which the new code doesn't.
Press <TAB> in the lvm shell to get a list of the possible
flag completions for a single hyphen.
Jonathan Brassow [Fri, 12 Apr 2013 16:30:04 +0000 (11:30 -0500)]
CLEAN-UP: Better string checking to avoid substring matches
Commit 9fd7ac7d035f0b2f8dcc3cb19935eb181816bd76 introduced a way a
method of avoiding reading from mirrors with a device failure. If
a device was found to be dead, the mapping table was checked for
'handle_errors' or 'block_on_error'. These strings were checked for
in the table string via 'strstr', which could also match on strings
like, 'no_handle_errors' or 'no_block_on_error'. No such strings
exist, but we don't want to have problems in the future if they do.
So, we check for ' <string>{'\0'|' '}'.
Move common code for changing activation state from
vgchange and lvchange to one function.
Fix the order of checks - so we always implicitelly
activate snapshots and thin volumes in exclusive mode,
and we do not allow local deactivation for them.
Jonathan Brassow [Thu, 11 Apr 2013 20:57:14 +0000 (15:57 -0500)]
RAID: Revert previous commit that allowed identical table loads.
Revert commit 31c24dd9f2ad7b5f7913a18c9f11a00d7b3474a1. This commit
was used to force a RAID device-mapper table to be loaded into the
kernel despite the fact that it was identical to the one already
loaded. The effect allowed a RAID array with a transiently failed
device to refresh and reintegrate the failed device. This operation
is better done in the kernel on a 'resume'. Since,
'lvchange --refresh' already performs a suspend/resume cycle, the
above commit is not needed once the kernel change is made. Reverting
the commit removes an unnecessary (at least for now) change to the
device-mapper interface.
Jonathan Brassow [Thu, 11 Apr 2013 20:33:59 +0000 (15:33 -0500)]
RAID: Add scrubbing support for RAID LVs
New options to 'lvchange' allow users to scrub their RAID LVs.
Synopsis:
lvchange --syncaction {check|repair} vg/raid_lv
RAID scrubbing is the process of reading all the data and parity blocks in
an array and checking to see whether they are coherent. 'lvchange' can
now initaite the two scrubbing operations: "check" and "repair". "check"
will go over the array and recored the number of discrepancies but not
repair them. "repair" will correct the discrepancies as it finds them.
'lvchange --syncaction repair vg/raid_lv' is not to be confused with
'lvconvert --repair vg/raid_lv'. The former initiates a background
synchronization operation on the array, while the latter is designed to
repair/replace failed devices in a mirror or RAID logical volume.
Additional reporting has been added for 'lvs' to support the new
operations. Two new printable fields (which are not printed by
default) have been added: "syncaction" and "mismatches". These
can be accessed using the '-o' option to 'lvs', like:
lvs -o +syncaction,mismatches vg/lv
"syncaction" will print the current synchronization operation that the
RAID volume is performing. It can be one of the following:
- idle: All sync operations complete (doing nothing)
- resync: Initializing an array or recovering after a machine failure
- recover: Replacing a device in the array
- check: Looking for array inconsistencies
- repair: Looking for and repairing inconsistencies
The "mismatches" field with print the number of descrepancies found during
a check or repair operation.
The 'Cpy%Sync' field already available to 'lvs' will print the progress
of any of the above syncactions, including check and repair.
Finally, the lv_attr field has changed to accomadate the scrubbing operations
as well. The role of the 'p'artial character in the lv_attr report field
as expanded. "Partial" is really an indicator for the health of a
logical volume and it makes sense to extend this include other health
indicators as well, specifically:
'm'ismatches: Indicates that there are discrepancies in a RAID
LV. This character is shown after a scrubbing
operation has detected that portions of the RAID
are not coherent.
'r'efresh : Indicates that a device in a RAID array has suffered
a failure and the kernel regards it as failed -
even though LVM can read the device label and
considers the device to be ok. The LV should be
'r'efreshed to notify the kernel that the device is
now available, or the device should be 'r'eplaced
if it is suspected of failing.
Jonathan Brassow [Wed, 10 Apr 2013 21:47:04 +0000 (16:47 -0500)]
mirror: Fix overly-concerning warning on mirror up-convert failure.
Attempting to up-convert an inactive mirror when there is insufficient
space leads to the following message:
Unable to allocate extents for mirror(s).
ABORTING: Failed to remove temporary mirror layer inactive_mimagetmp_3.
Manual cleanup with vgcfgrestore and dmsetup may be required.
This is caused by a failure to execute the 'deactivate_lv' function in
the error condition. The deactivate returns an error because the LV is
already inactive. This patch checks if the LV is activate and calls
deactivate_lv only if it is. This allows the error cleanup code to work
properly in this condition.
It wasn't that big of a deal anyway, since there was no previous vg_commit
that needed to be reverted. IOW, no harm was done if the allocation failed.
The message was scary and useless.
RAID: Capture new RAID kernel sync_action status fields
I've updated the dm_status_raid structure and dm_get_status_raid()
function to make it handle the new kernel status fields that will
be coming in dm-raid v1.5.0. It is backwards compatible with the
old status line - initializing the new fields to '0'. The new
structure is also more amenable to future changes. It includes a
'reserved' field that is currently initialized to zero but could
be used to hold flags describing new features. It also now uses
pointers for the character strings instead of attempting to allocate
their space along with the structure (causing the size of the
structure to be variable). This allows future fields to be appended.
The new fields that are available are:
- sync_action : shows what the sync thread in the kernel is doing
(idle, frozen, resync, recover, check, repair, or
reshape)
- mismatch_count: shows the number of discrepancies which were
found or repaired by a "check" or "repair"
process, respectively.
Peter Rajnoha [Mon, 25 Mar 2013 13:30:40 +0000 (14:30 +0100)]
pv_write: clean up non-orphan format1 PV write
...to not pollute the common and format-independent code in the
abstraction layer above.
The format1 pv_write has common code for writing metadata and
PV header by calling the "write_disks" fn and when rewriting
the header itself only (e.g. just for the purpose of changing
the PV UUID) during the pvchange operation, we had to tweak
this functionality for the format1 case and we had to assign
the PV the orphan state temporarily.
This patch removes the need for this format1 tweak and it calls
the write_disks with appropriate flag indicating whether this is
a PV write call or a VG write call, allowing for metatada update
for the latter one.
Also, a side effect of the former tweak was that it effectively
invalidated the cache (even for the non-format1 PVs) as we
assigned it the orphan state temporarily just for the format1
PV write to pass.
Also, that tweak made it difficult to directly detect whether
a PV was part of a VG or not because the state was incorrect.
Also, it's not necessary to backup and restore some PV fields
when doing a PV write:
Peter Rajnoha [Tue, 19 Mar 2013 13:17:53 +0000 (14:17 +0100)]
cleanup: remove unused 'pv_by_path' fn
The pv_by_path might be also dangerous to use as it does not
count with any other metadata areas but the ones found on the PV
itself. If metadata was not found on the PV referenced by the path,
it returned no PV though it might have been referenced by metadata
elsewhere (on other PVs...).
Peter Rajnoha [Tue, 19 Mar 2013 13:13:44 +0000 (14:13 +0100)]
vgextend: do not allow PV with 0 MDAs to be added while already in a VG
If extending a VG and including a PV with 0 MDAs that was already
a part of a VG, the vgextend allowed that PV to be added and we
ended up *with one PV in two VGs*!
The vgextend code used the 'pv_by_path' fn that returned a PV for
a given path. However, when the PV did not have any metadata areas,
the fn just returned a PV without any reference to existing VG.
Consequently, any checks for the existing VG failed.
[0] raw/~ # vgcreate vg2 /dev/sdc
Volume group "vg2" successfully created
Before this patch (incorrect):
[0] raw/~ # vgextend vg2 /dev/sda
Volume group "vg2" successfully extended
With this patch (correct):
[0] raw/~ # vgextend vg2 /dev/sda
Physical volume '/dev/sda' is already in volume group 'vg1'
Unable to add physical volume '/dev/sda' to volume group 'vg2'.
Peter Rajnoha [Tue, 19 Mar 2013 13:05:19 +0000 (14:05 +0100)]
metadata: add 'allow_orphan' arg to find_pv_by_name fn
Before, the find_pv_by_name call always failed if the PV found was orphan.
However, we might use this function even for a PV that is not part of any VG.
This patch adds 'allow_orphan' arg to find_pv_by_name fn that allows that.
The only callers of the underscored variants were their wrappers
without the underscore. No other part of the code referenced the
underscored variants.
Zdenek Kabelac [Mon, 11 Mar 2013 08:49:10 +0000 (09:49 +0100)]
thin: rework lvconvert
Usage of layer was not the best plan here - for proper devices stack
we have to keep correct reference in volume_group structure and
make the new thin pool LV appear as a new volume.
Peter Rajnoha [Tue, 12 Mar 2013 11:37:24 +0000 (12:37 +0100)]
dmsetup: fix 'splitname -o' to not fail if used without '-c'
This was a regression introduced with e33fd978a8a56228677c14612fce99b3b3588125
(libdm v1.02.68/lvm2 v2.02.89) with the introduction of new output
fields blkdevname and blkdevs_used for ls and deps dmsetup commands.
A new common '_process_options' fn was added with that commit, but the
fn was called prematurely which then broke processing of
'dmsetup splitname -o' which should implicitly use '-c' option
and this was failing after the commit:
alatyr/~ $ dmsetup splitname -o lv_name /dev/mapper/vg_data-test
Option not recognised: lv_name
Couldn't process command line.
The '-c' had to be used for correct operation:
alatyr/~ $ dmsetup splitname -c -o lv_name /dev/mapper/vg_data-test
LV
test
Now fixed to work as it did before:
alatyr/~ $ dmsetup splitname -o lv_name /dev/mapper/vg_data-test
LV
test