Zdenek Kabelac [Thu, 30 Nov 2017 12:26:44 +0000 (13:26 +0100)]
activation: avoid rechecking pvmove node
Use a new third state of trace_pvmove_deps == 2.
In this state we know we have already seen the node and can skip further
testing. The remaining value 1 signals we want to track, and value 0
means tracking is ignored, although the node is still checked in this case.
Zdenek Kabelac [Tue, 28 Nov 2017 22:11:20 +0000 (23:11 +0100)]
activation: extend resume validation
Also check all snapshots when resume is requested:
the origin volume may already be resumed, but possibly
some subLV or snapshot LV could still be suspended if
we are still in critical_section.
Zdenek Kabelac [Fri, 1 Dec 2017 10:50:14 +0000 (11:50 +0100)]
activation: split priority from memory locking
When entering any critical section, lvm2 used to lock process memory
and raise task priority to avoid problems with page swapping and to minimize
the time spent with non-resumed devices in the table.
With this patch, memory locking, which is expensive, is only used when
entering a 'suspending' section, as only in this section is there a risk
that lvm could be suspending a device which might later be needed for paging.
The raised priority is still kept for all section entrances as this is a
low-cost operation and may accelerate table resumes - although the real
impact can still be evaluated later.
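The split can be illustrated with a minimal standalone sketch (not the
actual lvm2 memlock code; the function names here are made up): priority
is raised on every critical-section entry, while mlockall() is reserved
for the suspending case.

    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <stdio.h>

    /* Raise priority on every entry - cheap.  Lock memory only when the
     * section may suspend a device that paging could later depend on. */
    static void enter_critical_section(int suspending)
    {
        if (setpriority(PRIO_PROCESS, 0, -18))
            perror("setpriority");

        if (suspending && mlockall(MCL_CURRENT | MCL_FUTURE))
            perror("mlockall");
    }

    static void leave_critical_section(int suspending)
    {
        if (suspending)
            munlockall();

        if (setpriority(PRIO_PROCESS, 0, 0))
            perror("setpriority");
    }

    int main(void)
    {
        enter_critical_section(1);      /* suspending section */
        /* ... suspend + resume of dm devices would happen here ... */
        leave_critical_section(1);
        return 0;
    }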
Zdenek Kabelac [Wed, 29 Nov 2017 21:19:46 +0000 (22:19 +0100)]
pvmove: add missing segment merging
When pvmove is finished and metadata are updated, the code failed
to merge possibly mergeable segments - so add an explicit merging
call after the pvmoved volumes are unlocked.
This avoids odd results where e.g. lvs could have been reporting
non-matching segments, since lvs does silent segment merging upon
metadata read while the dm table left after pvmove still preserved
the non-merged segments.
Zdenek Kabelac [Mon, 27 Nov 2017 09:21:21 +0000 (10:21 +0100)]
cmdline: avoid overrun on very large numbers.
When a large size number (>2^31) is given on the command line it could be
misdetected and in certain cases lead to a wrongly cast number.
So make sure all cases set the _MAX value whenever the number would
not fit within the supported range, instead of getting some random value
within the range.
In most cases this was not hard to detect, but e.g. the stripesize
parameter might have been fooled by certain large numbers.
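A minimal sketch of the clamping idea (a standalone example, not the actual
lvm2 argument parsing code): values that overflow the target type are pinned
to its _MAX instead of being silently wrapped by a cast.

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Parse a numeric argument into uint32_t, returning UINT32_MAX when
     * the value does not fit, rather than a truncated wrapped value. */
    static uint32_t parse_u32_arg(const char *str)
    {
        unsigned long long v;

        errno = 0;
        v = strtoull(str, NULL, 10);
        if (errno == ERANGE || v > UINT32_MAX)
            return UINT32_MAX;

        return (uint32_t) v;
    }

    int main(void)
    {
        /* 2^32 would previously wrap to 0 through a plain cast */
        printf("%u\n", parse_u32_arg("4294967296"));
        return 0;
    }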
Zdenek Kabelac [Mon, 27 Nov 2017 09:26:35 +0000 (10:26 +0100)]
toollib: improve stripes args reading
Rewrite validation of stripes and stripe_size args into more readable
sequential code.
Extend reading of the stripes & stripe_size args so it better knows
the defaults for types like striped raid.
TODO: this should really be a value obtained from the segtype structure,
and all the odd conditions and modifications of stripes and stripe_size
around the lvm2 code should be dropped.
Zdenek Kabelac [Sat, 25 Nov 2017 23:28:33 +0000 (00:28 +0100)]
pvmove: enhance delayed_resume logic
ATM we want to support delayed resume purely in the pvmove case.
So keep the logic internal to libdm to recognize the difference between
pvmove and other targets that use delayed resume.
This fixes a problem introduced with commit aa68b898ff9c51dcb
for the mirror-on-mirror or snapshot-on-mirror case.
TODO: likely add a new API call and let the libdm user select
delayed nodes explicitly.
Zdenek Kabelac [Fri, 24 Nov 2017 12:55:42 +0000 (13:55 +0100)]
activation: automatically discover pvmove holders
When pvmove is finished and does 'suspend/resume' on the PVMOVE LV,
on the resume path the committed metadata already show a 'standalone'
pvmove LV prepared just for removal.
However, the code should be able to 'resume' the preloaded LVs that were
participating in the pvmove operation.
Previously this was all done in the 'tools' part of the lvm2 code,
so lvconvert, upon pvmove finish, had to explicitly call 'resume' on every such LV.
Now the 'smarter' activation code is able to deduce and combine all the information
from the active dm table and the committed metadata, so a single call resolves
it all in one go.
Internally, holders are detected by reading a sysfs directory to capture
all the needed UUIDs, which are then looked up in the lvm2 metadata and all such
LVs are automatically collected into the dmtree.
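Roughly how such holder discovery over sysfs can look (a simplified
standalone sketch, not the lvm2 implementation): each holder of a dm device
exposes its dm UUID under /sys/block/<holder>/dm/uuid, which can then be
matched against metadata.

    #include <dirent.h>
    #include <limits.h>
    #include <stdio.h>

    /* Print the dm UUID of every holder of the given dm device, e.g. "dm-3",
     * by walking /sys/block/dm-3/holders/<holder>/dm/uuid. */
    static void print_holder_uuids(const char *dm_name)
    {
        char path[PATH_MAX], uuid[128];
        struct dirent *de;
        DIR *dir;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/holders", dm_name);
        if (!(dir = opendir(path)))
            return;

        while ((de = readdir(dir))) {
            if (de->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path),
                     "/sys/block/%s/holders/%s/dm/uuid",
                     dm_name, de->d_name);
            if (!(f = fopen(path, "r")))
                continue;
            if (fgets(uuid, sizeof(uuid), f))
                /* here the UUID would be looked up in lvm2 metadata */
                printf("%s -> %s", de->d_name, uuid);
            fclose(f);
        }
        closedir(dir);
    }

    int main(void)
    {
        print_holder_uuids("dm-0");
        return 0;
    }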
Zdenek Kabelac [Fri, 24 Nov 2017 12:58:23 +0000 (13:58 +0100)]
mirror: use lv_update_and_reload_origin
Replace complex code with the standard lv_update_and_reload_origin().
An extra suspend should not be necessary.
(If it were, the dependency tree would have a bug needing a fix.)
Zdenek Kabelac [Fri, 24 Nov 2017 12:57:22 +0000 (13:57 +0100)]
libdm: preload propagates delayed resume
Propagate delayed resume, at least for the preload case, in a simple way.
Currently PVMOVE depends on internal logic where a 'mirror' with
corelog is a 'possible' PVMOVE. In such a case the resume of the 'created'
node is 'delayed'.
This is mostly an ugly internal hack - but for the time being, when we
add propagation for preload, it works reasonably well.
TODO: provide a standard API and avoid this internal 'guessing'.
Zdenek Kabelac [Fri, 24 Nov 2017 12:53:02 +0000 (13:53 +0100)]
resume: secure critical section
Only a thin-pool with an origin_only suspend is allowed to not actually suspend anything.
In such a case the pairing resume will 'decrement' the critical section counter.
When sanlock or dlm lock managers return an error number
that we don't recognize, replace it with a generic -ELMERR
which is defined in the set of special lvmlockd error
numbers. Otherwise, an unknown lock manager error number
could be misinterpreted as something else if it happened
to overlap another set of error numbers (which they have
not thus far.)
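A small hedged sketch of the idea (the ELMERR value and the recognized
codes below are placeholders, not the real lvmlockd tables): anything the
lock manager returns that is not explicitly known is collapsed into one
generic error number.

    #include <errno.h>
    #include <stdio.h>

    #define ELMERR 210      /* placeholder for the generic lock manager error */

    /* Collapse unknown sanlock/dlm return codes into -ELMERR so they can
     * never be confused with error numbers used elsewhere by the callers. */
    static int sanitize_lm_error(int lm_rv)
    {
        switch (lm_rv) {
        case 0:
        case -ENOENT:
        case -EAGAIN:
        case -EEXIST:
            return lm_rv;           /* recognized: pass through */
        default:
            return -ELMERR;         /* unknown: make it generic */
        }
    }

    int main(void)
    {
        printf("%d\n", sanitize_lm_error(-217));        /* -> -210 */
        return 0;
    }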
David Teigland [Wed, 15 Nov 2017 21:34:42 +0000 (15:34 -0600)]
lvmlockd: retry on other sanlock errors
These less common errors returned from sanlock should
also cause lvmlockd to retry the lock acquire:
- i/o timeout occurs during sanlock_acquire().
other i/o on the same disk as the leases can cause
sanlock i/o timeouts.
- low level disk paxos contention between hosts naturally
causes one host to not acquire the lease. There are a
couple special error numbers associated with these cases
that should just be recognized as a normal failure to
acquire the lease.
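A hedged sketch of that retry shape (the error constants and the
try_acquire() stub are illustrative, not sanlock's real API): a bounded
number of retries is attempted only for the errors considered transient.

    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative error codes - sanlock defines its own set. */
    #define ERR_IO_TIMEOUT  (-202)
    #define ERR_PAXOS_LOST  (-223)

    static int attempts;

    /* Stub standing in for sanlock_acquire(): time out twice, then succeed. */
    static int try_acquire(void)
    {
        return (attempts++ < 2) ? ERR_IO_TIMEOUT : 0;
    }

    static int acquire_with_retry(int max_retries)
    {
        int i, rv = 0;

        for (i = 0; i <= max_retries; i++) {
            rv = try_acquire();
            if (rv != ERR_IO_TIMEOUT && rv != ERR_PAXOS_LOST)
                return rv;      /* success or a hard error */
            sleep(1);           /* transient: back off and retry */
        }
        return rv;
    }

    int main(void)
    {
        printf("acquire result: %d\n", acquire_with_retry(3));
        return 0;
    }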
Zdenek Kabelac [Wed, 15 Nov 2017 11:08:33 +0000 (12:08 +0100)]
activation: suspend pvmove using lv.
Whenever a pvmove tree is going to be generated for suspend
and such an LV has a user, use this 'using LV' to generate
the correct dm tree holding all components.
Zdenek Kabelac [Fri, 10 Nov 2017 20:15:50 +0000 (21:15 +0100)]
activation: check subLV before skipping resume
When an LV is asked for resume, is already resumed, and the tool
is inside critical_section(), check whether there is any suspended subLV.
In that case the 'resume' operation will not be skipped.
Only lv_committed() now uses vg->vg_committed, and it appears redundant
if its contents match the enclosing VG, so don't waste cycles creating it
when that is known to be true: when no write lock is held, the struct
won't get modified.
- Use 'lvmcache' consistently instead of 'metadata cache'
- Always use 5 characters for source line number
- Remember to convert uuids into printable form
- Use <no name> rather than (null) when VG has no name.
The persistent filter should not be imported by any command that doesn't
use it, so take additional note of REQUIRES_FULL_LABEL_SCAN (for vgrename)
and introduce IGNORE_PERSISTENT_FILTER for vgscan and pvscan.
Zdenek Kabelac [Fri, 10 Nov 2017 12:28:33 +0000 (13:28 +0100)]
reload: do not take backup with suspended devices
If the suspend/resume sequence would leave some device suspended
for a possible later resume, a backup cannot be taken (the fs holding
backups could still be frozen in critical_section()).
Zdenek Kabelac [Thu, 9 Nov 2017 12:00:28 +0000 (13:00 +0100)]
raid: protect raid4 activation
Move the check for the presence of raid4 into the right place
so there is no way to hit activation of any LV
with raid4 on a kernel which does not support it.
raid: reject message for 2-legged raid4/5 -> striped
Commit 763db8aab02d7df551a3e8500d261ef6c9651bdb rejects 2-legged
conversions to striped/raid0 but different messages are displayed
for raid0 or striped. This commit provides the same rejection messages.
raid: reject conversion request to striped/raid0 on 2-legged raid4/5
raid4/5 LVs may only be converted to striped or raid0/raid0_meta
in case they have at least 3 legs. 2-legged raid4/5 are a result
of either converting a raid1 to raid4/5 (takeover) or converting
a raid4/5 with more than 2 legs to raid1 with 2 legs (reshape).
The raid4/5 personalities map those as raid1,
thus reject conversion to striped/raid0.
Eric Ren [Mon, 30 Oct 2017 12:53:20 +0000 (20:53 +0800)]
clvmd: suppress ENOENT error on testing connection
In an HA cluster, we have a "clvm" resource agent to manage the clvmd daemon.
The agent invokes clvmd like: "clvmd -T90 -d0", which always prints
a scary error message:
"""
local socket: connect failed: No such file or directory
"""
When specified with the "-d" option, clvmd tries to check whether an instance
of the clvmd daemon is already running through a testing connection.
The connect() will fail with ENOENT in that case, so suppress
the error message for it.
TODO: add the missing error reaction code - after log_error() the program
is not supposed to continue running (log_error() is for reporting
problems that stop the program).
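The probing behaviour can be sketched with a small standalone example
(not clvmd's actual code; the socket path is made up): a failed connect()
with errno == ENOENT only means no daemon is listening yet, so nothing
is logged for it.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Return 1 if a daemon is already listening on the local socket. */
    static int daemon_running(const char *sock_path)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int fd, ret = 0;

        strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);

        if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
            return 0;

        if (!connect(fd, (struct sockaddr *) &addr, sizeof(addr)))
            ret = 1;
        else if (errno != ENOENT)
            /* unexpected failures are still worth reporting */
            fprintf(stderr, "local socket: connect failed: %s\n",
                    strerror(errno));

        close(fd);
        return ret;
    }

    int main(void)
    {
        return daemon_running("/run/example_clvmd.sock") ? 0 : 1;
    }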
testsuite: Fix problem when checking RAID4/5/6 for mismatches.
The lvchange-raid[456].sh test checks that mismatches can be detected
properly. It does this by writing garbage to the back half of one of
the legs directly. When performing a "check" or "repair" of mismatches,
MD does a good job going directly to disk and bypassing any buffers that
may prevent it from seeing mismatches. However, in the case of RAID4/5/6
we have the stripe cache to contend with and this is not bypassed. Thus,
mismatches which have /just/ happened to an area that now populates the
stripe cache may be overlooked. This isn't a serious issue, however,
because the stripe cache is short-lived and reasonably small. So, while
there may be a small window of time between the disk changing underneath
the RAID array and when you run a "check"/"repair" - causing a mismatch
to be missed - that would be no worse than if a user had simply run a
"check" a few seconds before the disk changed. IOW, it simply isn't worth
making a fuss over dropping the stripe cache before beginning a "check" or
"repair" (which we actually did attempt to do a while back).
So, to get the test running smoothly, we simply deactivate and reactivate
the LV to force the stripe cache to be dropped and then proceed. We could
just as easily wait a few seconds for the stripe cache to empty also.
testsuite: Add and document a 'should' for "idle" -> "recover" RAID test
When a "recover" is just starting for a RAID LV, it is possible to get
"idle" for the sync action if the status is issued quickly enough. This
is fine, the MD thread just hasn't gotten things going yet. However,
the /need/ for a "recover" should be marked in md->recovery and it would
be simple enough to fix the kernel so this doesn't happen. May eventually
want a separate bug for this, but for now it fits with RHBZ 1507719.
Bastian Blank [Wed, 1 Nov 2017 14:25:54 +0000 (15:25 +0100)]
systemd: remove Install sections from socket-activated services
We have always preferred and recommended socket activation for our services,
so remove the Install section from the related .service units, where it is
unused in this case, and keep the Install section only in the associated
.socket units.
Zdenek Kabelac [Tue, 31 Oct 2017 23:51:39 +0000 (00:51 +0100)]
lv_manip: hide layered LV temporarily
Since vg_validate() now rejects LVs without segments and
insert_layer_for_segments_on_pv() gets a just-created
'layer_lv' without a segment, it needs to be hidden
from vg->lvs during the processing of _align_segment_boundary_to_pe_range(),
as this function calls lv_validate() and now requires
the vg to be consistent. The LV is then put back into vg->lvs.
Jonathan Brassow [Tue, 31 Oct 2017 02:58:38 +0000 (21:58 -0500)]
test: clean-up failing test case and document 'should' cases
There are two known bugs in the lvconvert-raid-status-validation.sh
test. The first one I consider to be more of an annoyance (1507719).
The second one I consider to be more serious (1507729).
RHBZ 1507719 simply documents the fact that the three RAID status
fields may not always be coherent due to the way they are set and
unset when the MD thread is shutting down and starting up. For
example, the sync ratio may be 100% but the sync action may not
yet have switched to "idle" and the health characters may not yet
all be 'A's (i.e. the devices set to InSync).
RHBZ 1507729 is more serious. The sync ratio can be 100% for a
short period of time after upconverting linear -> RAID1. It is
reset to 0 once the MD sync thread gets to work on it. It does
this because, technically, the array /is/ in-sync if the new
devices are excluded - i.e. the data is 100% available and
consistent. I'm not sure what to do about this problem, but we'd
much rather not have this state that looks exactly like the
end of the process when the sync ratio is 100% because the
"recover" process finished, but the sync action and health
characters haven't been updated yet. Put simply, the problem
is that we can't tell if a sync is starting or finished based
on the status output.
Since 4fa5add6b1bd4d7f7313f2950021a09e4130ad08 ("pvcreate: Wipe cached
bootloaderarea when wiping label.") label_remove is responsible
for the lvmcache_del. (toollib and liblvm need fixing to share
the code.)