Zdenek Kabelac [Thu, 7 Dec 2017 19:28:03 +0000 (20:28 +0100)]
libdm: use delay_resume_if_extended
Update the logic towards more explicit logic.
Preload tree normally does not want to resume, only
in certain cases of extension or new loaded nodes can be
resumed. So introduce new internal variable delay_resume_if_extended
controlable by target.
Patch itself is not changing current existing behaviour,
and rather documents existing problem in more readable way.
lvm2 needs to introduce explicit mechanism how to support more
fain-grained (and safe) logic to i.e. resize thin-pool which
can be sitting on cached raid volume.
Zdenek Kabelac [Thu, 7 Dec 2017 16:52:01 +0000 (17:52 +0100)]
libdm: avoid checking status on activation
Variable props.send_messages has 3 states and was not used properly
here. Activation in this moment does not need to verify thin-pool status
as that has been already checked on preload.
So only if there are some real messages (value 2) call function
for sending them.
device: Tag I/O for each mda on a device separately in log messages.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
Zdenek Kabelac [Mon, 4 Dec 2017 14:45:49 +0000 (15:45 +0100)]
cleanup: drop unneeded check
Code already has dereferenced UUID before this point,
and its already given we require name & uuid when ading new node
(although uuid could be empty string).
Zdenek Kabelac [Mon, 4 Dec 2017 14:05:44 +0000 (15:05 +0100)]
libdm: watch for failing _info_by_dev
Separate handling of error code from _info_by_dev.
This error can only happeng when we are running out of memory.
In such case there is urgent need to stop any futher proceeding
of command and run to error ASAP.
Avoiding "$(get first_extent_sector "$d")" in the loop
allows the test to succeed in the cluster. Further cluster
analysis needed to get to the core reason.
The lvm2 test suite aims at small test resource footprints
(few PVs, small PV sizes) to run on tmpfs backed loop device.
OTOH, lvconvert-reshape-raid.sh aims to test the maxima of
supported total stripes of 64. This patch adds a prerequisite
conditional to skip tests using more than 14 stripes.
It requires the target version 1.13.1 to avoid deadlocks.
If the recovery of the repleced leg(s) of a RaidLV created without
initial resynchronization (i.e. "lvcreate --nosync ...") got
interrupted, it can't be extended because of the < 100% sync rate.
raid: ignore --stripesize on raid4/5 conversion to 1 stripe
In case caller passes in changed stripe size when reshaping raid4/5
to 1 stripe aiming to convert to raid1 and optionally to linear,
ignore it to prevent data corruption.
Zdenek Kabelac [Thu, 30 Nov 2017 12:26:44 +0000 (13:26 +0100)]
activation: avoid rechecking pvmove node
Use new 3rd. state of trace_pvmove_deps == 2.
In this state we know, we have already seen the node and can skip futher
testing. Remainging value 1 signals we want to track, and value 0
is for ignoring tracking, but node is still checking in this case.
Zdenek Kabelac [Tue, 28 Nov 2017 22:11:20 +0000 (23:11 +0100)]
activation: extend resume validation
Check also all snapshosts when resume is requested,
the origin volume is already resume, but possibly
some subLV or snapshot LV could be suspended if
we are still in critical_section.
Zdenek Kabelac [Fri, 1 Dec 2017 10:50:14 +0000 (11:50 +0100)]
activation: split priority from memory locking
When entering any critical section, lvm2 used to lock process memory
and raised task priority to avoid problem with page swapping and minimize
time of having non-resumed devices in table.
With this patch, memory locking which which is expensive is only used when
entering 'suspending' section as only in this section there is risk
lvm could be suspending a device which later can be needed for paging.
Raised priority is still kept for all section entrances as this is
low-cost operation and may accelerate table resumes - although the real
impact can be still considered later.
Zdenek Kabelac [Wed, 29 Nov 2017 21:19:46 +0000 (22:19 +0100)]
pvmove: add missing segment merging
When pvmove is finished and metadata are updated, the code missed
to merge possible mergable segments - so add explicit merging
call after pvmoved volumes are unlocked.
This avoids weird results where i.e. lvs could have been reporting
non-matching segments as lvs upon metadata read is doing silent segment
merging while dm table left after pvmove was still preserving
non-merged segments.
Zdenek Kabelac [Mon, 27 Nov 2017 09:21:21 +0000 (10:21 +0100)]
cmdline: avoid overrun on very large numbers.
When large size number (>2^31) is given on command line it could be
misdetected and in certain cases lead to wrongly casted number.
So make sure all cases always do set _MAX number in case the value would
not fit within the supported range instead of getting some random value
within the range.
In most cases this was not a problem to detect, but i.e. stripesize
parameter might have been fooled by certain large numbers.
Zdenek Kabelac [Mon, 27 Nov 2017 09:26:35 +0000 (10:26 +0100)]
toollib: improve stripes args reading
Rewrite validation of stripes and stripe_size args into more readable
sequential code.
Extend reading of stripes & stripes_size args so it better knows
defaults for types like striped raid.
TODO: this should really be a value obtained for segtype structure and
all the weird conditions and modification of stripes and stripe_size
around lvm2 code should be dropped.
Zdenek Kabelac [Sat, 25 Nov 2017 23:28:33 +0000 (00:28 +0100)]
pvmove: enhance delayed_resume logic
ATM we want to support delayed resume purely in pvmove case.
So have libdm logic internal to recognize difference beween
pvmove and other targets that do use delayed resume.
This fixes problem introduced with commit aa68b898ff9c51dcb
for mirror-on-mirror or snapshot-on-mirror problem.
TODO: likely added new API call and let libdm user select
delayed nodes explicitely.
Zdenek Kabelac [Fri, 24 Nov 2017 12:55:42 +0000 (13:55 +0100)]
activation: automaticaly discover pvmove holders
When pvmove is finished and does 'suspend/resume' on PVMOVE LV,
on resume path committed metadata are already showing 'standalone'
pvmove LV prepared just for removal.
However code should be able to 'resume' preloaded LV there were
participating in pvmove operation.
Previously this was all done in the 'tools' part of lvm2 code.
So the lvconvert upon pvmove finish had to explicitely call 'resume' on every such LV.
Now 'smarted' activation code is able to deduce and combine all information from
the active dm table and committed metadata so single call resolves
it all in one go.
Internally holders are detected by reading sysfs directory to capture
all needed UUID which are then looked in lvm2 metadata and all such
LVs are automatically collected into dmtree.
Zdenek Kabelac [Fri, 24 Nov 2017 12:58:23 +0000 (13:58 +0100)]
mirror: use lv_update_and_reload_origin
Replace complex code with standard lv_update_and_reload_origin().
Extra suspend should not be necessary.
(If they would be - dependency tree would have bug for fixing).
Zdenek Kabelac [Fri, 24 Nov 2017 12:57:22 +0000 (13:57 +0100)]
libdm: preload propagates delayed resume
Propagate delayed resume at least for preload case in a simple way.
Currently PVMOVE depends on internal logic where 'mirror' with
corelog is 'possible' PVMOVE. In such case resume of 'created'
node is 'delayed'.
This is mostly an ugly internal hack - but for the moment being when we
add propagation for preload - it does work reasonable.
TODO: provide standard API and avoid this internal 'guessing'.
Zdenek Kabelac [Fri, 24 Nov 2017 12:53:02 +0000 (13:53 +0100)]
resume: secure critical section
Only thin-pool with origin_only suspend is allowed to be not suspending anything.
In such case pairing resume will 'decrement' critical section counter.
When sanlock or dlm lock managers return an error number
that we don't recognize, replace it with a generic -ELMERR
which is defined in the set of special lvmlockd error
numbers. Otherwise, an unknown lock manager error number
could be misinterpreted for something else if it happened
to overlap another set of error numbers (which they have
not thus far.)
David Teigland [Wed, 15 Nov 2017 21:34:42 +0000 (15:34 -0600)]
lvmlockd: retry on other sanlock errors
These less common errors returned from sanlock should
also cause sanlock to retry the lock acquire:
- i/o timeout occurs during sanlock_acquire().
other i/o on the same disk as the leases can cause
sanlock i/o timeouts.
- low level disk paxos contention between hosts naturally
causes one host to not acquire the lease. There are a
couple special error numbers associated with these cases
that should just be recognized as a normal failure to
acquire the lease.
Zdenek Kabelac [Wed, 15 Nov 2017 11:08:33 +0000 (12:08 +0100)]
activation: suspend pvmove using lv.
Whenever pvmove tree is going to be generated for suspend
and such LV has a user - use this 'using LV' to generate
correct dm tree holding all components.
Zdenek Kabelac [Fri, 10 Nov 2017 20:15:50 +0000 (21:15 +0100)]
activation: check subLV before skipping resume
LV is asked for resume, and its already resume and tool
is inside 'critical_section()' check if there is any suspended sub LV.
In that case 'resume' operation will not be skipped.
Only lv_committed() now uses vg->vg_committed and it appears redundant
if its contents match the enclosing VG so don't waste cycles creating it
when that's known to be true when no write lock is held so the struct
won't get modified.