device: Move dev_read memory allocation into device layer.
Rename dev_read() to dev_read_buf() - the function that reads data
into a supplied buffer.
Introduce a new dev_read() that allocates the buffer it returns and
switch the important users over to this. No caller may change the
returned data. (For now, callers are responsible for freeing it after
use, but later the device layer will take full ownership.)
dev_read_buf() should only be used for tiny buffers or unimportant code
(such as the old disk formats).
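For illustration, the split could look roughly like this (a hedged
sketch; the exact prototypes are assumptions, not the lvm2
declarations):

    #include <stddef.h>
    #include <stdint.h>

    struct device;

    /* Reads 'len' bytes at 'offset' into a caller-supplied buffer. */
    int dev_read_buf(struct device *dev, uint64_t offset, size_t len, void *buf);

    /* Allocates the buffer itself and returns read-only data; the caller
     * must not modify it and, for now, must free it after use. */
    const void *dev_read(struct device *dev, uint64_t offset, size_t len);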
format_text: Separate out code paths for buffer wraparound
The creation of wrapped-around metadata - where the start of metadata is
written up to the end of the buffer and the remainder follows back at
the start of the buffer - is now restricted to cases where writing the
metadata in one piece wouldn't fit. This shouldn't happen in 'normal'
usage so let's begin treating the code for this as a special case that
can be ignored when optimising 'normal' cases.
format_text: Change metadata alignment from 512 to 4096.
If there is sufficient space in the metadata area, align the next
metadata to a disk offset that is a multiple of 4096 bytes and
don't write it circularly. If it doesn't all fit at the end
of the metadata area, go back to the start and write it all there
contiguously.
If there is insufficient space to use the new stricter rules, revert to
the original behaviour, aligning on 512-byte boundaries wrapping around
the circular buffer as required.
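In outline, the placement rule reads like the following sketch (a
self-contained illustration under assumed names, not the lvm2 code):

    #include <stdint.h>

    /* Round 'off' (relative to the area) up so that the absolute disk
     * position (area_start + off) is a multiple of 'align'. */
    static uint64_t align_abs(uint64_t off, uint64_t area_start, uint64_t align)
    {
            return (area_start + off + align - 1) / align * align - area_start;
    }

    /* 'area_size' is the size of the metadata area, 'header' the space
     * reserved for the mda header, 'old_start'/'old_end' the previous
     * metadata copy and 'len' the size of the new one. */
    static uint64_t next_metadata_start(uint64_t area_start, uint64_t area_size,
                                        uint64_t header, uint64_t old_start,
                                        uint64_t old_end, uint64_t len)
    {
            uint64_t next = align_abs(old_end, area_start, 4096);

            if (next + len <= area_size)
                    return next;                  /* contiguous, 4096-aligned */

            next = align_abs(header, area_start, 4096);
            if (next + len <= old_start)
                    return next;                  /* restart at the front */

            return align_abs(old_end, area_start, 512); /* wrap circularly as before */
    }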
metadata: Consistently skip metadata areas that failed.
Even after problems were encountered while writing some metadata, some
commands continue (rightly or wrongly) and attempt to make further changes.
Once an mda is marked MDA_FAILED, don't try to use it again.
This also applies when reverting, where one loop already skips
failed mdas but the other doesn't.
This fixes some device open_count warnings on relevant failure paths.
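A hedged sketch of the rule (the loop and field names are illustrative;
only MDA_FAILED comes from the code):

    /* Skip any metadata area already marked as failed, both when writing
     * further changes and in the revert loop. */
    for (i = 0; i < mda_count; i++) {
            if (mdas[i]->status & MDA_FAILED)
                    continue;       /* never touch a failed mda again */
            process_mda(mdas[i]);   /* write, commit or revert */
    }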
Marian Csontos [Tue, 12 Dec 2017 11:14:15 +0000 (12:14 +0100)]
test: use lvmdbusd as the process name
lvmdbusd was started, but the process was not recognized by pgrep.
- configure does not make the script executable, so set the flag
explicitly when running make check,
- the process name changed to lvmdbusd; the previous python3 value
originated from the use of /usr/bin/env.
Use new ALIGN_ABSOLUTE macro when calculating the start location
of new metadata and adjust the end-of-buffer detection so that
there is no longer an imposed gap between old and new metadata.
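A plausible shape for the macro (an assumption, not the actual
definition) rounds a relative offset up until the absolute disk position
hits the requested boundary:

    /* (start + offset) is rounded up to a multiple of 'alignment';
     * the result is returned relative to 'start' again. */
    #define ALIGN_ABSOLUTE(offset, start, alignment) \
            (((start) + (offset) + (alignment) - 1) / (alignment) * (alignment) - (start))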
format_text: Use absolute alignment to calculate metadata usage
Currently both start and offset should always be divisible by alignment,
so this should have no effect, but a later patch will increase alignment
so these variables can no longer be optimised out.
Marian Csontos [Mon, 11 Dec 2017 11:36:54 +0000 (12:36 +0100)]
lvmdbusd: Fix path to python3
The lvmdbusd executable script must use the python3 interpreter detected
by the configure script, as the site-packages directory used for the
library is only usable by that interpreter.
Zdenek Kabelac [Thu, 7 Dec 2017 19:28:03 +0000 (20:28 +0100)]
libdm: use delay_resume_if_extended
Make the resume logic more explicit.
Preloading a tree normally does not resume nodes; only in certain cases
of extension, or for newly loaded nodes, may a resume happen. So
introduce a new internal variable, delay_resume_if_extended, which the
target can control.
The patch itself does not change existing behaviour; rather it documents
the existing problem in a more readable way.
lvm2 needs to introduce an explicit mechanism to support more
fine-grained (and safe) logic, e.g. to resize a thin-pool which
can be sitting on a cached raid volume.
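A minimal sketch of the intended gating (only delay_resume_if_extended
comes from this patch; everything else is assumed):

    /* During preload, resume a node only if it was newly loaded, or if it
     * was extended and its target did not ask for the resume to be delayed. */
    if (node_is_new || (node_was_extended && !delay_resume_if_extended))
            resume_node(node);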
Zdenek Kabelac [Thu, 7 Dec 2017 16:52:01 +0000 (17:52 +0100)]
libdm: avoid checking status on activation
The variable props.send_messages has 3 states and was not used properly
here. Activation at this point does not need to verify the thin-pool
status, as that has already been checked during preload.
So only if there are some real messages queued (value 2) call the
function that sends them.
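Sketched out (the field name comes from the patch; the surrounding call
is assumed):

    /* props.send_messages has three states; only the value 2 means real
     * messages are queued, so only then take the sending path here. */
    if (props->send_messages == 2)
            send_pool_messages(node);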
device: Tag I/O for each mda on a device separately in log messages.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
Zdenek Kabelac [Mon, 4 Dec 2017 14:45:49 +0000 (15:45 +0100)]
cleanup: drop unneeded check
The code has already dereferenced the UUID before this point, and it is
already a given that we require both name & uuid when adding a new node
(although the uuid could be an empty string).
Zdenek Kabelac [Mon, 4 Dec 2017 14:05:44 +0000 (15:05 +0100)]
libdm: watch for failing _info_by_dev
Separate the handling of the error code from _info_by_dev.
This error can only happen when we are running out of memory.
In such a case there is an urgent need to stop any further processing
of the command and fail with an error ASAP.
Avoiding "$(get first_extent_sector "$d")" in the loop
allows the test to succeed in the cluster. Further cluster
analysis is needed to find the root cause.
The lvm2 test suite aims at small test resource footprints
(few PVs, small PV sizes) so it can run on tmpfs-backed loop devices.
OTOH, lvconvert-reshape-raid.sh aims to test the maximum of 64
supported total stripes. This patch adds a prerequisite conditional
to skip tests using more than 14 stripes.
It requires target version 1.13.1 to avoid deadlocks.
If the recovery of the replaced leg(s) of a RaidLV created without
initial resynchronization (i.e. "lvcreate --nosync ...") got
interrupted, it can't be extended because of the < 100% sync rate.
raid: ignore --stripesize on raid4/5 conversion to 1 stripe
In case the caller passes in a changed stripe size when reshaping
raid4/5 to 1 stripe (aiming to convert to raid1 and optionally to
linear), ignore it to prevent data corruption.
Zdenek Kabelac [Thu, 30 Nov 2017 12:26:44 +0000 (13:26 +0100)]
activation: avoid rechecking pvmove node
Use the new 3rd state of trace_pvmove_deps == 2.
In this state we know we have already seen the node and can skip further
testing. The remaining value 1 signals we want to track, and value 0
ignores tracking, although the node is still checked in this case.
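In sketch form (only trace_pvmove_deps and its three values come from
the patch):

    /* 2 = node already seen, skip further testing;
     * 1 = track dependencies;
     * 0 = do not track, but the node is still checked. */
    if (trace_pvmove_deps == 2)
            return 1;       /* nothing more to do for this node */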
Zdenek Kabelac [Tue, 28 Nov 2017 22:11:20 +0000 (23:11 +0100)]
activation: extend resume validation
Also check all snapshots when a resume is requested;
the origin volume is already resumed, but possibly
some subLV or snapshot LV could still be suspended if
we are still in a critical_section.
Zdenek Kabelac [Fri, 1 Dec 2017 10:50:14 +0000 (11:50 +0100)]
activation: split priority from memory locking
When entering any critical section, lvm2 used to lock process memory
and raise task priority to avoid problems with page swapping and to
minimize the time devices are left non-resumed in the table.
With this patch, memory locking, which is expensive, is only used when
entering the 'suspending' section, as only in this section is there a
risk that lvm could be suspending a device which is later needed for
paging.
Raised priority is still kept for all section entrances as this is a
low-cost operation and may accelerate table resumes - although the real
impact can still be evaluated later.
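For illustration, the split amounts to something like this (a hedged
sketch using plain POSIX calls, not the lvm2 implementation):

    #include <sys/mman.h>
    #include <sys/resource.h>

    static void critical_section_enter(int will_suspend)
    {
            /* Raising priority is cheap, so do it for every critical
             * section to help accelerate table resumes. */
            setpriority(PRIO_PROCESS, 0, -18);

            /* Locking memory is expensive; only do it when a device may
             * be suspended and later needed for paging. */
            if (will_suspend)
                    mlockall(MCL_CURRENT | MCL_FUTURE);
    }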
Zdenek Kabelac [Wed, 29 Nov 2017 21:19:46 +0000 (22:19 +0100)]
pvmove: add missing segment merging
When pvmove finished and the metadata were updated, the code failed
to merge possibly mergeable segments - so add an explicit merging
call after the pvmoved volumes are unlocked.
This avoids odd results where e.g. lvs could report non-matching
segments: reading the metadata performs silent segment merging, while
the dm table left behind after pvmove was still preserving the
non-merged segments.
Zdenek Kabelac [Mon, 27 Nov 2017 09:21:21 +0000 (10:21 +0100)]
cmdline: avoid overrun on very large numbers.
When a large size number (>2^31) is given on the command line it could
be misdetected and in certain cases lead to a wrongly cast number.
So make sure all cases set the _MAX value whenever the given value does
not fit within the supported range, instead of ending up with some
random value inside the range.
In most cases this was not a problem to detect, but e.g. the stripesize
parameter might have been fooled by certain large numbers.
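The fix boils down to saturating instead of wrapping; a minimal sketch
(assumed helper, not the lvm2 parser):

    #include <errno.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Values outside the supported range saturate at the maximum instead
     * of being silently truncated by a narrowing cast. */
    static uint32_t parse_size_arg(const char *s)
    {
            unsigned long long v;

            errno = 0;
            v = strtoull(s, NULL, 10);
            if (errno == ERANGE || v > UINT32_MAX)
                    return UINT32_MAX;
            return (uint32_t)v;
    }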
Zdenek Kabelac [Mon, 27 Nov 2017 09:26:35 +0000 (10:26 +0100)]
toollib: improve stripes args reading
Rewrite the validation of the stripes and stripe_size args into more
readable sequential code.
Extend the reading of the stripes & stripe_size args so it better knows
the defaults for types like striped raid.
TODO: this should really be a value obtained from the segtype structure,
and all the odd conditions and modifications of stripes and stripe_size
around the lvm2 code should be dropped.
Zdenek Kabelac [Sat, 25 Nov 2017 23:28:33 +0000 (00:28 +0100)]
pvmove: enhance delayed_resume logic
ATM we want to support delayed resume purely in the pvmove case.
So keep the logic internal to libdm to recognize the difference between
pvmove and other targets that use delayed resume.
This fixes a problem introduced with commit aa68b898ff9c51dcb
for the mirror-on-mirror or snapshot-on-mirror case.
TODO: likely add a new API call and let the libdm user select
delayed nodes explicitly.