Peter Rajnoha [Wed, 28 Aug 2013 14:06:51 +0000 (16:06 +0200)]
systemd: lvm2-activation-generator: remove default dir if args not specified and require all args to be given
Remove default "/tmp" as destination directory if no args
specified for lvm2-activation-generator. Require all the
args to be specified directly for proper functionality.
Petr Rockai [Fri, 23 Aug 2013 08:09:23 +0000 (10:09 +0200)]
test: Add a new "check_full" target, which also tests with real /dev.
The original "check" target stays confined to a local device directory, while
check_full does 6 flavours, 3 with a local device directory and 3 with the
global /dev directory (the latter are prefixed with "s" for
"system"). I.e.: normal, cluster, lvmetad, snormal, scluster, slvmetad.
Jonathan Brassow [Tue, 27 Aug 2013 21:46:40 +0000 (16:46 -0500)]
TEST: Add tests for lvchange actions of RAID under thin
Patch includes RAID1,4,5,6,10 tests for:
- setting writemostly/writebehind
* syncaction changes (i.e. scrubbing operations)
- refresh (i.e. reviving devices after transient failures)
- setting recovery rate (sync I/O throttling)
while the RAID LVs are under a thin-pool (both data and metadata)
* not fully tested because I haven't found a way to force bad
blocks to be noticed in the testsuite yet. Works just fine
when dealing with "real" devices.
Jonathan Brassow [Mon, 26 Aug 2013 21:38:54 +0000 (16:38 -0500)]
test: pvmove tests for all the different segment types.
Test moving linear, mirror, snapshot, RAID1,5,10, thinpool, thin
and thin on RAID. Perform the moves along with a dummy LV and
also without the dummy LV by specifying a logical volume name as
an argument to pvmove.
Jonathan Brassow [Mon, 26 Aug 2013 21:36:30 +0000 (16:36 -0500)]
pvmove: Allow moving snapshot/origin. Disallow converting and merging LVs
The patch allows the user to also pvmove snapshots and origin logical
volumes. This means pvmove should be able to move all segment types.
I have, however, disallowed moving converting or merging logical volumes.
Jonathan Brassow [Mon, 26 Aug 2013 19:12:31 +0000 (14:12 -0500)]
pvmove: Fix inability to specify LV name when moving RAID, mirror, or thin LV
Top-level LVs (like RAID, mirror or thin) are ignored when determining which
portions of an LV to pvmove. If the user specified the name of an LV to
move and it was one of the above types, it would be skipped. The code would
never move on to check whether its sub-LVs needed moving because their names
did not match what the user specified.
The solution is to check whether a sub-LV is part of the LV whose name was
specified by the user - not just if there was a name match.
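A minimal sketch of the difference between the two checks, written as a Python model rather than the actual pvmove C code. The `LV` class, `matches_by_name`, `is_part_of` and the example LV names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LV:
    """Hypothetical stand-in for a logical volume and its sub-LVs."""
    name: str
    sub_lvs: list = field(default_factory=list)


def matches_by_name(lv, requested_name):
    """Old behaviour (as described): only an exact name match selects an LV."""
    return lv.name == requested_name


def is_part_of(candidate, top):
    """New behaviour (as described): select the LV itself or any nested sub-LV."""
    if candidate is top:
        return True
    return any(is_part_of(candidate, sub) for sub in top.sub_lvs)


# Illustrative layout: a RAID1-style top-level LV with two image sub-LVs.
image0 = LV("r1_rimage_0")
image1 = LV("r1_rimage_1")
raid = LV("r1", [image0, image1])
unrelated = LV("lin")
```

With the name-only check, `image0` is never selected when the user asks for `r1`; with the membership check it is, while `unrelated` is still left alone.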
Peter Rajnoha [Mon, 26 Aug 2013 13:27:00 +0000 (15:27 +0200)]
udev: inform lvmetad about lost PV label
In a stacked environment where a PV is layered on top of a
snapshot LV and the LV is then removed, lvmetad still keeps information
about the PV:
[0] raw/~ $ pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
[0] raw/~ $ vgcreate vg /dev/sda
Volume group "vg" successfully created
[0] raw/~ $ lvcreate -L32m vg
Logical volume "lvol0" created
[0] raw/~ $ lvcreate -L32m -s vg/lvol0
Logical volume "lvol1" created
[0] raw/~ $ pvcreate /dev/vg/lvol1
Physical volume "/dev/vg/lvol1" successfully created
[0] raw/~ $ lvremove -ff vg/lvol1
Logical volume "lvol1" successfully removed
[0] raw/~ $ pvs
No device found for PV BdNlu2-7bHV-XcIp-mFFC-PPuR-ef6K-yffdzO.
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 92.00m
[0] raw/~ $ pvscan --cache --major 253 --minor 3
Device 253:3 not found. Cleared from lvmetad cache.
This is because of the reactivation that is done just before
snapshot removal as part of the process (vg/lvol1 from the example above).
This causes a CHANGE event to be generated, but any scan done
on the LV does not see the original data anymore (in this case
the stacked PV label on top) and consequently the ID_FS_TYPE="LVM2_member"
(provided by blkid scan) is not stored in udev db anymore for the LV.
Consequently, pvscan --cache is no longer run because the device is not
identified as an LVM PV by the "LVM2_member" id, so lvmetad is never told
about the change and still keeps records about the PV.
We can run into a very similar problem with erasing the PV label directly:
[0] raw/~ $ lvcreate -L32m vg
Logical volume "lvol0" created
[0] raw/~ $ pvcreate /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[0] raw/~ $ dd if=/dev/zero of=/dev/vg/lvol0 bs=1M
dd: error writing '/dev/vg/lvol0': No space left on device
33+0 records in
32+0 records out
33554432 bytes (34 MB) copied, 0.380921 s, 88.1 MB/s
[0] raw/~ $ pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 92.00m
/dev/vg/lvol0 lvm2 a-- 32.00m 32.00m
[0] raw/~ $ pvscan --cache --major 253 --minor 2
No PV label found on /dev/vg/lvol0.
This patch adds detection of this change from ID_FS_TYPE="LVM2_member"
to ID_FS_TYPE="<whatever_else>" and hence informs lvmetad
that the PV is gone.
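The detection can be pictured as a comparison of the previous and current blkid-provided filesystem type of the device. This is only a Python model of the decision, not the udev rule itself; the function name `pv_label_lost` and its arguments are hypothetical.

```python
LVM_MEMBER = "LVM2_member"


def pv_label_lost(previous_fs_type, current_fs_type):
    """Return True when a device that used to be an LVM PV no longer is one.

    Both arguments are the filesystem type recorded for the device before
    and after the event (None when nothing was detected).
    """
    return previous_fs_type == LVM_MEMBER and current_fs_type != LVM_MEMBER
```

Only the transition away from "LVM2_member" counts; a device that never carried a PV label, or still carries one, does not trigger a notification.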
David Teigland [Fri, 23 Aug 2013 19:38:48 +0000 (14:38 -0500)]
test: add process-each-vg and process-each-lv
These test the toollib functions that select
vgs/lvs to process based on command line args:
empty, vg name(s), lv names(s), vg tag(s),
lv tags(s), and combinations of all.
Jonathan Brassow [Fri, 23 Aug 2013 13:57:16 +0000 (08:57 -0500)]
pvmove: Add support for RAID, mirror, and thin
This patch allows pvmove to operate on RAID, mirror and thin LVs.
The key component is the ability to avoid moving a RAID or mirror
sub-LV onto a PV that already has another RAID sub-LV on it.
(e.g. Avoid placing both images of a RAID1 LV on the same PV.)
Top-level LVs are processed to determine which PVs to avoid for
the sake of redundancy, while bottom-level LVs are processed
to determine which segments/extents to move.
This approach does have some drawbacks. By eliminating whole PVs
from the allocation list, we might miss the opportunity to perform
pvmove in some scenarios. For example, if we have 3 devices and
a linear uses half of the first, a RAID1 uses half of the first and
half of the second, and a linear uses half of the third (FIGURE 1);
we should be able to pvmove the first device (FIGURE 2).
FIGURE 1:
[ linear ] [ -RAID- ] [ linear ]
[ -RAID- ] [ ] [ ]
FIGURE 2:
[ moved ] [ -RAID- ] [ linear ]
[ moved ] [ linear ] [ -RAID- ]
However, the approach we are using would eliminate the second
device from consideration and would leave us with too little space
for allocation. In these situations, the user does have the ability
to specify LVs and move them one at a time.
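The drawback can be reproduced with a small Python model of FIGURE 1. This is a sketch under simplifying assumptions, not the allocator's code: each PV holds two equal units, and the helper names `pvs_to_avoid` and `free_space_for_move` are hypothetical.

```python
def pvs_to_avoid(raid_lvs, source_pv):
    """PVs holding other images of any RAID LV that has an image on source_pv."""
    avoid = set()
    for image_pvs in raid_lvs.values():
        if source_pv in image_pvs:
            avoid.update(pv for pv in image_pvs if pv != source_pv)
    return avoid


def free_space_for_move(free, source_pv, avoid):
    """Free units left once the source PV and the avoided PVs are excluded."""
    return sum(units for pv, units in free.items()
               if pv != source_pv and pv not in avoid)


# FIGURE 1: pv1 is full (linear + RAID image), pv2 and pv3 are half full.
raid_lvs = {"raid1_lv": ["pv1", "pv2"]}
free = {"pv1": 0, "pv2": 1, "pv3": 1}

avoid = pvs_to_avoid(raid_lvs, "pv1")
usable = free_space_for_move(free, "pv1", avoid)
```

Moving everything off pv1 needs two units, but with pv2 excluded for the whole operation only one unit remains, even though the layout in FIGURE 2 exists. Moving the linear LV on its own has no redundancy constraint, so both free units are available to it.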
Jonathan Brassow [Fri, 23 Aug 2013 13:49:16 +0000 (08:49 -0500)]
Thin: Make 'lv_is_on_pv(s)' work with thin types
The pool metadata LV must be accounted for when determining what PVs
are in a thin-pool. The pool LV must also be accounted for when
checking thin volumes.
This is a prerequisite for pvmove working with thin types.
Jonathan Brassow [Fri, 23 Aug 2013 13:40:13 +0000 (08:40 -0500)]
Misc: Make get_pv_list_for_lv() available to more than just RAID
The function 'get_pv_list_for_lv' will assemble all the PVs that are
used by the specified LV. It uses 'for_each_sub_lv' to traverse all
of the sub-lvs which may compose it.
Though this information is quite useful during boot, it may
be confusing for users if it happens anytime later and it
actually happens if systemd reloads. This is usually on package
update to update the systemd state and load any new units that are
newly installed in the system. The systemd reload is global and
so any existing generators are rerun at that moment too.
Peter Rajnoha [Wed, 21 Aug 2013 12:07:01 +0000 (14:07 +0200)]
filter-mpath: remove superfluous error message about mpath major not equal to dm major
This is a regression caused by commit 3bd90488545a4ad5374b4e0f1daba6cf16ae6ae8.
The error message added with that commit "mpath major %d is not dm major %d" is
superfluous.
When scanning for mpath components, we're looking for a parent device.
But this parent device is not necessarily an mpath device (so the dm device)
if it exists - it can be any other device layered on top (e.g. an MD RAID device).
Jonathan Brassow [Tue, 20 Aug 2013 18:21:09 +0000 (13:21 -0500)]
cmirrord: Prevent secondary checkpoints from corrupting bitmaps
The bug addressed by this patch manifested itself during testing
by showing a mirror that never became 'in-sync' after creation.
The bug is isolated to distributions that do not have support
for openAIS checkpointing (i.e. > RHEL6, > F16).
When a node joins a group that is managing a mirror log, the other
machines in the group send it a checkpoint representing the current
state of the bitmap. More than one machine can send a checkpoint,
but only the initial one should be imported. Once the bitmap state
has been imported from the initial checkpoint, operations (such
as resync, mark, and clear operations) can begin. When subsequent
checkpoints are allowed to be imported, it has the effect of erasing
all the log operations between the initial checkpoint and the ones
that follow.
When cmirrord was updated to handle the absence of openAIS
checkpointing (commit 62e38da133d9801cdf36b0f2aaec615ce14b9000),
the new import_checkpoint() function failed to honor the 'no_read'
parameter. This parameter was designed to avoid reading all but
the initial checkpoint. Honoring this parameter has solved the
issue of corrupting bitmap data with secondary checkpoints.
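The effect of honoring 'no_read' can be shown with a toy Python model. The `MirrorLog` class and the way the caller decides to pass `no_read` are assumptions made for this sketch; only the names import_checkpoint and no_read come from the description above.

```python
class MirrorLog:
    """Toy model of a node's view of the mirror log bitmap."""

    def __init__(self, regions):
        self.bitmap = [0] * regions

    def import_checkpoint(self, checkpoint_bitmap, no_read):
        """Load the bitmap from a checkpoint unless no_read is set."""
        if no_read:
            # A later checkpoint: keep the state we already have.
            return False
        self.bitmap = list(checkpoint_bitmap)
        return True

    def mark(self, region):
        """Record a log operation that happened after the initial import."""
        self.bitmap[region] = 1


log = MirrorLog(4)
log.import_checkpoint([0, 0, 0, 0], no_read=False)  # initial checkpoint
log.mark(2)                                         # operation after import
log.import_checkpoint([0, 0, 0, 0], no_read=True)   # secondary checkpoint
```

If the secondary checkpoint were imported as well, the mark on region 2 would be erased, which is the bitmap corruption described above.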
Recent kernels allow messages to respond with a string.
Add dm_task_get_message_response() to libdevmapper to perform some
basic sanity checks and return this.
Have 'dmsetup message' display any response.
Peter Rajnoha [Fri, 16 Aug 2013 13:45:00 +0000 (15:45 +0200)]
udev: fix lvmetad rules to not ignore loop device configuration
If a loop device is first configured on a system where /dev/loop-control
is used to dynamically create the loop device itself, an
ADD+CHANGE event is generated. But next time the existing /dev/loop[0-9]*
is reused, there's only a CHANGE event since the device representing
it is already present in kernel (so no ADD event in this case).
Peter Rajnoha [Thu, 15 Aug 2013 10:23:49 +0000 (12:23 +0200)]
fix: some issues reported by coverity
- null_fd resource leak on error path in _reopen_fd_null fn
- dead code in verify_message in clvmd code
- dead code in _init_filter_components in toolcontext code
- null dereference in dm_prepare_selinux_context on error path if
setfscreatecon fails while resetting SELinux context
Peter Rajnoha [Wed, 14 Aug 2013 12:04:58 +0000 (14:04 +0200)]
autoactivation: refresh existing VG before autoactivation
When autoactivating a VG, there could be an existing VG with exactly
the same PV UUIDs. The PVs could have reappeared after previous
loss/disconnect (for example disconnecting and reconnecting iscsi).
Since there's no "autodeactivation" yet, the mappings for the LVs
from the VG were left in the system even if the device was disconnected.
These mappings also hold the major:minor of the underlying device.
So if the device reappears, it is assigned a different major:minor
pair (...and kernel name). We need to cope with this during
autoactivation so any existing mappings are corrected for any changes.
The VG refresh does that (the vgchange --refresh functionality) -
call this before VG autoactivation.
(If the VG does not exist yet, the VG refresh is NOP)
Split out the partitioned device filter that needs to open the device
and move the multipath filter in front of it.
When a device is multipathed, sending I/O to the underlying paths may
cause problems, the most obvious being I/O errors visible to lvm if a
path is down.
Revert the incorrect <backtrace> messages added when a device doesn't
pass a filter.
Peter Rajnoha [Tue, 13 Aug 2013 15:26:36 +0000 (17:26 +0200)]
blkdeactivate: add support for bind mounts
Recent versions of util-linux/umount (v2.23+) provide
umount --all-targets, which can unmount all the mount targets of
the same device (the bind mounts). Use this if available when
blkdeactivate calls umount.
Otherwise, for older versions of util-linux, use findmnt
(that is also a part of the util-linux) to iterate over all
mount targets of the same device - this is the manual way.
Peter Rajnoha [Tue, 13 Aug 2013 15:17:25 +0000 (17:17 +0200)]
blkdeactivate: change the way blkdeactivate reports status
The blkdeactivate now suppresses error messages from external
tools that are called. Instead, only a summary message "done"
or "skipped" is issued by blkdeactivate as any error in calling
the external tool (e.g. unmounting or deactivating a device) causes
the device to be skipped and the blkdeactivate continues with the
next device in the tree.
Add new -e/--errors switch to display any error messages from
external tools.
Also, suppress any output given by the external tools and add
new -v/--verbose switch to display it including the verbose
output of the tools called (this will enable error reporting
as well).
Also add blkdeactivate -vv for even more debug (the script's debug).
Also note:
md raid replaces dm mirroring as the default implementation.
Can call out to thin_repair to fix thin metadata.
Improved clvmd error detection/debugging information.
Jonathan Brassow [Mon, 12 Aug 2013 18:56:47 +0000 (13:56 -0500)]
Mirror: Fix inability to remove VG's cluster flag if it contains a mirror
According to bug 995193, if a volume group
1) contains a mirror
2) is clustered
3) 'locking_type' = 0 is used
then it is not possible to remove the 'c'luster flag from the VG. This
is due to the way _lv_is_active behaves.
We shouldn't allow the cluster flag to be flipped unless the mirrors in
the cluster are not active. This is because different kernel modules
are used depending on whether a mirror is clustered or not. When we
attempt to see if the mirror is active, we first check locally. If it
is not, then we attempt to check for remotely active instances if the VG
is clustered. Since the no_lock locking type is LCK_CLUSTERED, but does
not implement 'query_resource', remote_lock_held will always return an
error in this case. An error from remote_lock_held is treated as though
the lock _is_ held (i.e. the LV is active remotely). This blocks the
cluster flag from changing.
The solution is to implement 'query_resource' for the no_lock type. It
will report a message and return 1. This will allow _lv_is_active to
function properly. The LV would be considered not active remotely and
the VG can change its flag.
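The control flow can be modelled in a few lines of Python. This is a sketch of the reasoning only: the function signatures and the `(ok, held)` return convention are hypothetical and do not mirror the C interfaces.

```python
def remote_lock_held(query_resource):
    """Return True if the lock is considered held remotely.

    query_resource is the locking type's query hook, or None when the
    locking type does not implement it. Any error counts as "held".
    """
    if query_resource is None:
        return True  # missing hook -> error -> treated as held
    ok, held = query_resource()
    if not ok:
        return True  # query failed -> treated as held
    return held


def no_lock_query_resource():
    """Assumed behaviour of the new hook: succeed and report 'not held'."""
    return (True, False)
```

Before the fix the no_lock type corresponds to the `None` case, so the mirror always looks active remotely; with the hook in place the query succeeds and the cluster flag can be changed.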
Jonathan Brassow [Mon, 12 Aug 2013 17:40:52 +0000 (12:40 -0500)]
RAID: Fix bug making lvchange unable to change recovery rate for RAID
Commit ID 8615234c0fa331852a11e1bf595bf1d4b858f4bc failed to include
the actual code changes that were made to fix the bug. Instead, all
tests went in to validate the bug fix. This patch adds the missing
code changes.
RAID: Fix bug making lvchange unable to change recovery rate for RAID
1) Since the min|maxrecoveryrate args are size_kb_ARGs and they
are recorded (and sent to the kernel) in terms of kB/sec/disk,
we must back out the factor multiple done by size_kb_arg. This
is already performed by 'lvcreate' for these arguments.
2) Allow all RAID types, not just RAID1, to change these values.
3) Add min|maxrecoveryrate_ARG to the list of 'update_partial_unsafe'
commands so that lvchange will not complain about needing at
least one of a certain set of arguments and failing.
4) Add tests that check that these values can be set via lvchange
and lvcreate and that 'lvs' reports back the proper results.
Breakpoint 1, config_def_check (cmd=0x819b050, handle=0x81a04f8) at config/config.c:775
(gdb) p vp
$1 = 0x8189ee0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb)
_config_def_check_tree (handle=0x81a04f8, vp=0x8189ee0 <_cfg_path>
"config", pvp=0x8189ee6 <_cfg_path+6> "", rp=0xbfffe1e8 "config",
prp=0xbfffe1ee "", buf_size=58, root=0x81a2568, ht=0x81a6548)
at config/config.c:680
(gdb) p vp
$4 = 0x8189ee0 <_cfg_path> "config"
(gdb) p pvp
$5 = 0x8189ee6 <_cfg_path+6> ""
If compiled with -O2 (incorrect):
Breakpoint 1, config_def_check (cmd=cmd@entry=0x8183050, handle=0x81884f8) at config/config.c:775
(gdb) p vp
$1 = 0x8172fc0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb) p vp + strlen(vp)
$3 = 0x8172fc6 <_cfg_path+6> ""
(gdb)
_config_def_check_tree (handle=handle@entry=0x81884f8, pvp=0x8172fc7
<_cfg_path+7> "host_list", rp=rp@entry=0xbffff190 "config",
prp=prp@entry=0xbffff196 "", buf_size=buf_size@entry=58, ht=0x818e548,
root=0x818a568, vp=0x8172fc0 <_cfg_path> "config") at
config/config.c:674
(gdb) p pvp
$4 = 0x8172fc7 <_cfg_path+7> "host_list"
The difference is in passing the "pvp" arg for _config_def_check_tree.
While in the correct case, the value of _cfg_path+6 is passed
(the result of vp + strlen(vp) - see the snippet of the code above),
in the incorrect case, this value is increased by 1 to _cfg_path+7,
hence totally malforming the string that is being processed.
This ends up with incorrect validation check and incorrect warning
messages are issued like:
"Configuration setting "config/checks" has invalid type. Found integer, expected section."
To work around this issue, remove the "static" qualifier from the
"static char _cfg_path[CFG_PATH_MAX_LEN]". This causes the optimizer
to be less aggressive (also shuffling the arg list for
_config_def_check_tree call helps).
Mirror: Fix issue preventing PV creation on mirror LVs
Commit b248ba0a396d7fc9a459eea02cfdc70b33ce3441 attempted to
prevent mirror devices which had a failed device in their
mirrored log from being usable/readable by LVM. This was to
protect against circular dependencies where one LVM command
could be blocked trying to read one of these affected mirrors
while the LVM command to fix/unblock that mirror was stuck
behind the currently running command.
The above commit went wrong when it used 'device_is_usable()' to
recurse on the mirrored log device to check if it was suspended
or blocked. The 'device_is_usable' function also contains a check
for reserved names - like *_mlog, etc. This last check always
triggered when checking a mirror's log simply because of the name,
not because it was suspended or blocked - a false positive.
The solution is to create a new function like 'device_is_usable',
but without the check for reserved names. Using this new function
(device_is_suspended_or_blocked), we can check the status of a
mirror's log device properly.
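The false positive and the split into two checks can be sketched as a Python model. The `Dev` class and the simplified reserved-name test are assumptions for illustration; the real functions are C code with different signatures.

```python
from dataclasses import dataclass


@dataclass
class Dev:
    """Hypothetical stand-in for a device-mapper device."""
    name: str
    suspended: bool = False
    blocked: bool = False


def device_is_suspended_or_blocked(dev):
    """State-only check: no reserved-name test."""
    return dev.suspended or dev.blocked


def device_is_usable(dev):
    """State check plus a reserved-name check (simplified to '_mlog')."""
    if device_is_suspended_or_blocked(dev):
        return False
    if "_mlog" in dev.name:
        return False
    return True


healthy_log = Dev("vg-lv_mlog")
suspended_log = Dev("vg-lv_mlog", suspended=True)
```

A healthy mirror log fails `device_is_usable` purely because of its name, while `device_is_suspended_or_blocked` reports a problem only when the log really is suspended or blocked.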
Mirror/RAID1: When up|down-converting default to segtype of current LV
If there is no RAID support in the kernel but the default mirror
segtype is "raid1", converting legacy mirrors can be problematic.
For example, changing the log type or converting a mirror to a linear
LV does not require the RAID modules to be present. However, because
lp->segtype is set to be RAID1 by the configuration file, the command
fails.
We should only be setting lp->segtype when converting mirrors if it is
going to change (e.g. to linear or between mirror types).
TEST: Be explicit about which mirror segment type to use.
In those places where mirrors were being created while assuming
a default segment type of "mirror", we include the '--type mirror'
argument to explicitly set the segment type. This will preserve
the mirror testing that is performed even though the default
mirroring segment type is now "raid1".
RAID: Make "raid10" the default striped + mirror segment type
When both the '-i' and '-m' arguments are specified on the command
line, use the "raid10" segment type. This way, the native RAID10
personality is used through dm-raid rather than layering a mirror
on striped LVs. If the old behavior is desired, the '--type'
argument to use would be "mirror" rather than "raid10".
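The selection rule reads as follows when written out as a small Python function. It is a sketch of the rule as stated, not the lvcreate argument parser; the function name and parameters are hypothetical, and cases other than '-i' plus '-m' are deliberately left undecided.

```python
def striped_mirror_segtype(stripes_given, mirrors_given, explicit_type=None):
    """Pick the segment type for the '-i' and '-m' combination.

    Returns None when this rule does not apply, leaving the choice to
    whatever other defaults are in effect.
    """
    if explicit_type is not None:
        return explicit_type  # '--type' always wins
    if stripes_given and mirrors_given:
        return "raid10"
    return None
```

Passing `explicit_type="mirror"` models the old behaviour of layering a mirror on striped LVs.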
Peter Rajnoha [Tue, 6 Aug 2013 11:37:42 +0000 (13:37 +0200)]
lvmetad: fix mda offset/size overflow if >= 4g (32bit)
When reading info about MDAs from lvmetad, we need to use a 64-bit
int to read the value of the offset/size, otherwise the value
overflows and the wrong value is then used throughout!
This is dangerous if we're trying to write such metadata area then,
mostly visible if we're using 2 mdas where the 2nd one is at the end
of the underlying device and hence the value of the mda offset is
high enough to cause problems:
(the offset is trimmed to a value of 0 instead of 4096m, so we write
at the very start of the disk, or elsewhere if the offset has
some other value!)
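The arithmetic behind the 4096m case can be checked directly. The Python below only models truncation to an unsigned 32-bit value with a bit mask; `truncate_to_u32` is a hypothetical helper, not a function from the lvmetad client code.

```python
MIB = 1024 * 1024


def truncate_to_u32(value):
    """Keep only the low 32 bits, as a 32-bit unsigned read would."""
    return value & 0xFFFFFFFF


offset = 4096 * MIB                    # 4 GiB, i.e. exactly 2**32 bytes
truncated = truncate_to_u32(offset)    # low 32 bits are all zero
shifted = truncate_to_u32(offset + 4096)
```

An offset of exactly 4096m wraps to 0, the very start of the device, and any larger offset wraps to some other small value, such as 4096 for the second case here.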
David Teigland [Tue, 30 Jul 2013 19:12:33 +0000 (14:12 -0500)]
clvmd: verify messages before processing
Check that fields in clvm_header are valid when
local or remote messages are received. If not,
log an error, dump the message data and ignore
the message.
Jonathan Brassow [Wed, 31 Jul 2013 20:23:13 +0000 (15:23 -0500)]
dmeventd: Fix memory leak
When creating a timeout thread for snapshots, the thread is not
tracked and thus never joined. This means that the exit status
of the timeout thread is held indefinitely. Saves a bit of
memory to set PTHREAD_CREATE_DETACHED when creating this thread.
I've also added pthread_attr_init|destroy to setup the creation
pthread_attr_t.
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Initial basic support for repair.
It currently takes pool metadata spare volume, which
is used for recovery. New spare is created if the volume
is successfully repaired.
After the operation the previous _tmeta volume is moved
into _tmeta%d volume and if everything is ok, this volume
could be removed.
New _tmeta needs to be pvmoved to proper place and also
converted to e.g. mirror if it should be mirrored.
Later version will try to automate some steps here.
The PREFERRED allocation mechanism requires the number of areas in the
previous LV segment to match the number in the new segment being
allocated. If they do not match, the code may crash.
E.g. https://bugzilla.redhat.com/989347
Introduce A_AREA_COUNT_MATCHES and when not set avoid referring
to the previous segment with the contiguous and cling policies.
Tony Asleson [Thu, 25 Jul 2013 19:54:57 +0000 (15:54 -0400)]
python-lvm: Correct parsing arguments for integers
There were a few places where the code was incorrectly
using parse arguments for the supplied variable type &
size. Changing the variables to be declared exactly
like python is expecting so if we build on an arch
where the size of type is different than typically
expected we will continue to match. In addition the
parse character needed to be corrected in a few spots
too.
In the example above a closing '|' character is missing at the end
of the regex. The segfault itself was caused by trying to destroy
the same filter twice in _init_filters fn within the error path
(the "bad" goto target):
bad:
if (f3)
f3->destroy(f3);
if (f4)
f4->destroy(f4);
Where f3 is the composite filter (sysfs + regex + type + md + mpath filter)
and f4 is the persistent filter which encompasses this composite filter
within persistent filter's 'real' field in 'struct pfilter'.
So in the end, we need to destroy the persistent filter only as
this will also destroy any 'real' filter attached to it.
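The ownership problem can be reproduced with a small Python model in which a second destroy raises instead of crashing. The `Filter` class and both cleanup functions are hypothetical; in particular, falling back to destroying f3 when no persistent filter exists is an assumption of this sketch.

```python
class Filter:
    """Toy filter that owns an optional wrapped ('real') filter."""

    def __init__(self, name, real=None):
        self.name = name
        self.real = real
        self.destroyed = False

    def destroy(self):
        if self.destroyed:
            raise RuntimeError(f"{self.name} destroyed twice")
        if self.real is not None:
            self.real.destroy()  # the wrapper destroys what it wraps
        self.destroyed = True


def cleanup_buggy(f3, f4):
    """Error path as in the snippet above: both filters are destroyed."""
    if f3 is not None:
        f3.destroy()
    if f4 is not None:
        f4.destroy()


def cleanup_fixed(f3, f4):
    """Destroy only the outermost filter that exists."""
    if f4 is not None:
        f4.destroy()
    elif f3 is not None:
        f3.destroy()


def run(cleanup):
    """Build composite (f3) wrapped by persistent (f4) and clean up."""
    f3 = Filter("composite")
    f4 = Filter("persistent", real=f3)
    try:
        cleanup(f3, f4)
    except RuntimeError as err:
        return str(err)
    return "ok"
```

`run(cleanup_buggy)` reports that the composite filter is destroyed twice, while `run(cleanup_fixed)` completes cleanly.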
Jonathan Brassow [Wed, 24 Jul 2013 19:18:07 +0000 (14:18 -0500)]
Revert a previous change
commit d00d45a8b609d50302c94a0fff20849f0cc13a48 introduced changes
that are causing cluster mirror tests to fail. Ultimately, I think
the change was right, but a proper clean-up will have to wait.
The portion of the commit we are reverting correlates to the
following commit comment:
2) lib/metadata/mirror.c:_delete_lv() - should have been calling
_activate_lv_like_model() with 'mirror_lv'. This is because
'mirror_lv' is the LV that the overall operation is being
performed on. We need to use this LV as the basis for
determining whether to activate locally, or across the
cluster, etc.
It appears that when legs or logs are removed from a mirror, they
are being activated before they are deleted in order to make them
top-level LVs that can be acted upon. When doing this, it appears
they are not activated based on the characteristics of the mirror
from which they came. IOW, if the mirror was exclusively active,
the sub-LVs are activated globally. This is a no-no. This then
made it impossible to activate_lv_like_model if the model was
"mirror_lv" instead of "lv" in _delete_lv(). Thus, at some point
this change should probably be put back and those locations where
the sub-LVs are being improperly activated "shared" instead of
EX should be corrected.