lvchange: allow a transiently failed RaidLV to be refreshed
Add to commits 87117c2b2546 and 0b8bf73a63d8 to avoid refreshing two
times altogether, thus avoiding issues related to clustered, remotely
activated RaidLV. Avoid need to repeat "lvchange --refresh RaidLV"
two times as a workaround to refresh a RaidLV. Fix handles removal
of temporary *-missing-* devices created for any missing segments
in RAID SubLVs during activation.
Because the kernel dm-raid target isn't able to handle transiently
failing devices properly we need
"[dm-devel][PATCH] dm raid: fix transient device failure processing"
as well.
test: add lvchange-raid-transient-failures.sh
and enhance lvconvert-raid.sh
Zdenek Kabelac [Thu, 22 Dec 2016 22:28:04 +0000 (23:28 +0100)]
thin: refresh status when error processing fails
When thin-pool processes event and 'lvextend --use-policies' fails
rather capture up-to-date new info as the fullness percentage may
have jumped noticable. This way we could use 'more' correct numbers
when checking for thresholds.
Zdenek Kabelac [Thu, 22 Dec 2016 18:51:35 +0000 (19:51 +0100)]
report: show proper info for merging origin
When there is 'merging' of an origin in progress, but metadata stil
do provide both origin and snapshot, we should show data from merged
snapshot. This is important mainly for thin case, where there was
a window, where i.e. 'lvs -o+device_id' would report information
about 'already gone' origin thin LV.
This race window is usually hard to trigger but can be ocasionally hit.
Usually shortly after activation, but before polling process manages
to update metadata after merge.
Zdenek Kabelac [Tue, 20 Dec 2016 14:59:11 +0000 (15:59 +0100)]
validation: rework segment validation
Move individual segment validation to a separate function
executed for 'complete_vg'.
Move some 'extra' validation bits from 'raid' validation to global
segtype validation (so extending existing normal validation)
TODO: still some test are left to be moved.
Reduce some duplication in validation process - there are still
some left thought so still room for improving overal speed.
Tony Asleson [Wed, 14 Dec 2016 21:30:01 +0000 (15:30 -0600)]
lvmdbusd: Use timeout_add instead
The function timeout_add_seconds has quite a bit of variability. Using
timeout_add which specifies the timeout in ms instead of seconds. Testing
shows that this is much more consistent which should improve clients that
are using shorter timeouts for the API and the connection.
Zdenek Kabelac [Mon, 19 Dec 2016 13:06:55 +0000 (14:06 +0100)]
lvconvert: fix shown lv name for snapshot split
We can't keep 'display_lvname' for too long - it's using
ringbuffer and keeps limited number of names. So it's
safe only per few simple tests, but can't be used anymore
after some function calls..
(Fixes 00e641ef37a977129acc503f3fa1b67f556ac5eb)
Bryn M. Reeves [Thu, 15 Dec 2016 19:03:42 +0000 (19:03 +0000)]
libdm: clear region table in dm_stats_list()
Call _stats_regions_destroy() from dm_stats_list() if dms->regions
is non-NULL. This avoids leaking any pool allocations and ensures
the handle is in a known state: if an error occurs during the list,
dms->regions will be NULL and the handle will appear empty.
Zdenek Kabelac [Sat, 17 Dec 2016 21:41:27 +0000 (22:41 +0100)]
lvconvert: support cache to external origin conversion
Add this functionality to lvconvert:
'lvconvert --thin cachedLV --thinpool vg/poll'
Converts cachedLV to external origin (which will be read-only).
New thin volume is created in thinpool LV and it's using external
origin as source for unprovisioned chunks.
This conversion happens online (while volume is in use).
Thin LV remains fully writable.
Cached external origin no longer could be written so cache will be used
ONLY for read operations. For this limitation we require cache mode
to be writethrough (as writeback cannot write to read-only volumes).
When thinLV is later removed cached external origin is again
fully usable, just note, LV remain in 'read-only' mode.
When read-write is needed, 'lvchange -prw' has to be used.
Single external origin could be user by multiple thinLV in
multiple differen thin pool.
Zdenek Kabelac [Sun, 18 Dec 2016 14:05:31 +0000 (15:05 +0100)]
cache: improve activation with -real
When cache volume may be converted from normal to -real layer LV
we need to improve logic for call cache_check.
With this patch, we register call for cache_check only when metadata LV
is not yet present in active table slot (should match initial table
load).
This avoids unwanted checking when cache would become layer device
online.
Zdenek Kabelac [Sat, 17 Dec 2016 21:40:59 +0000 (22:40 +0100)]
libdm: drop callback on revert path
The system is likely in some very inconsisten state.
Do not try to make it even more problematic with trying
to invoke tools like thin_check via callback.
Zdenek Kabelac [Sat, 17 Dec 2016 21:40:14 +0000 (22:40 +0100)]
lv: fix lock holder for external origin
External origin could be reloaded via more locks.
It's actually even more complex then thin-pool,
as it may be active on more nodes for linear LVs
(and maybe even more types).
External origin is always read-only thus unmodifiable
device so there should not be a problem accesing it
through multiple nodes.
Also for thin-pool check first presence of active thin-pool.
FIXME:
It's not easy to detect on which nodes this device is active
Thus manipulation with such device may require checking every
node and it active state and refresh.
But since such setup is quite complex to prepare and use,
hopefully there are not user trying to 'explore' this usage yet.
Zdenek Kabelac [Sat, 17 Dec 2016 20:52:27 +0000 (21:52 +0100)]
cache: prepare status checking for layer
To be ready to show status of cache volume, call the status
with layer. Layer is automatically detected in this case when
cache volume is used in 'layered' form (needs -real suffix).
Zdenek Kabelac [Sun, 18 Dec 2016 14:05:57 +0000 (15:05 +0100)]
cache: improve wait for cache clear
Avoid printing misleading message about single dirty block.
Instead properly detect condition where the 'cleaner' policy
needs to be installed without 'overloading' dirty variable.
Also print warning if we would be clearing read-only volume.
(it really shouldn't happen).
Zdenek Kabelac [Sat, 17 Dec 2016 20:55:02 +0000 (21:55 +0100)]
thin: reload external origin with last thin
External origin could be activated as stand-alone device.
When the last thin LV is removed, external origin is no longer
the external origin and it's layer property was dropped.
Ensure dm table is correct by reloading external origin
(when it's active).
Zdenek Kabelac [Sat, 17 Dec 2016 20:54:51 +0000 (21:54 +0100)]
lvs: show status for layer
When LV is external origin, show info for LV but
status for -layer. So we expose more info to a user
as otherwise active external origin is only linear
mapping of -real layer.
We also need to handle --count=0 specially when calculating the
interval number: since intervals count from #1, this must account
for the implicit "minus one" when converting from zero to the
UINT64_MAX value used (which is too large to store in _int_args).
Bryn M. Reeves [Sun, 18 Dec 2016 12:42:47 +0000 (12:42 +0000)]
dmstats: separate TIMERFD and useleep() exit conditions
The time management code mixes tests of the _timer_fd value with
code that should be timer agnostic: this causes problems for users
of the usleep() timer, since it cannot properly detect the start
of a new interval:
Separate these out, by defining a _timer_running() call that each
timer implements, and only define _timer_fd if we are compiling
with TIMERFD enabled.
Bryn M. Reeves [Sun, 18 Dec 2016 12:39:26 +0000 (12:39 +0000)]
dmstats: use better interval estimate for usleep() timer
Although the usleep() interval timer is not used if the Linux
TIMERFD interface is available it should still provide reasonably
good timing.
Instead of trying to estimate the error from the duration of the
last sleep, peg it to the start time of the program, and use the
value of ((start_time - now) % interval) to correct the current
interval duration.
This always pulls us back into sync at the end of each interval,
rather than relying on trying to incrementally adjust the time
duration at each interval start.
This greatly reduces drift when the usleep() clock is used.
This is safe if the destination bitset is smaller than the source,
or if the two bitsets are of the same size.
With a destination that is larger (e.g. when resizing a bitmap to
add more capacity), the memcpy will overrun the source bitset and
set garbage bits in the destination.
There are nine uses of the macro currently (8 in libdm/regex, and
1 in daemons/cmirrord): in each case the two bitsets are always of
equal size so the behaviour is unchanged.
Fix the macro to use bs2's size to simplify resizing bitsets and
avoid the need for another copy macro.
We do two operations in one table reload:
- raid leg/image extraction
- rename remaining raid legs
This should be made in separate steps. Otherwise we do an
uncorrectable table change on error path (leaving tables
for admin and dmsetup).
As a hotfix - restore the previous logic and use a single
new function _lv_update_and_reload_list which activates exclusively
extracted LVs on the list before resuming suspended raid LV.
This restore 'rename' functionality upon resume.
Also still preserve the 'origin_only' logic - although we know
it's not working properly for cluster and LV stacking.
Zdenek Kabelac [Tue, 13 Dec 2016 11:28:12 +0000 (12:28 +0100)]
cleanup: remove wrapping function
backup is not 'tested' for success and also it should
actually happen just when command is finished.
We do not target to make backups with each inter-step
metadata change.
Zdenek Kabelac [Tue, 13 Dec 2016 11:31:28 +0000 (12:31 +0100)]
raid: improve table reload sequence
This is another place for 'common' use pattern or
reload and activation of deleted devices.
(Moving the exclusive activation to _deactivate_and_remove_lvs()).
TODO: looks like halve of raid function is reloading
just 'origin' - and the other full LV.
Bryn M. Reeves [Sun, 11 Dec 2016 22:41:45 +0000 (22:41 +0000)]
libdm: add min_num_bits to dm_bitset_parse_list()
It's useful to be able to specify a minimum number of bits for a
new bitmap parsed from a list, for e.g. to allow for expansing a
group without needing to copy/reallocate the bitmap.
Add a backwards compatible symbol for programs linked against old
versions of the library.
Bryn M. Reeves [Sun, 11 Dec 2016 12:44:45 +0000 (12:44 +0000)]
libdm: add dm_bit_get_last()/dm_bit_get_prev()
It is sometimes convenient to iterate over the set bits in a dm
bitset in reverse order (from the highest set bit toward zero), or
to quickly find the last set bit.
Add dm_bit_get_last() and dm_bit_get_prev(), mirroring the existing
dm_bit_get_first() and dm_bit_get_next().
dm_bit_get_prev() uses __builtin_clz when available to efficiently
test the bitset in reverse.
Bryn M. Reeves [Tue, 13 Dec 2016 12:03:00 +0000 (12:03 +0000)]
libdm: break up _stats_get_extents_for_file()
Split out the loop that iterates over each batch of FIEMAP
extent data from the function that sets up and calls the ioctl
to reduce nesting and simplify local variable use:
The _stats_map_extents() function is responsible for detecting
eof and extent boundaries and adding whole, allocated extents
to the file extent table for region creation.
Bryn M. Reeves [Tue, 13 Dec 2016 14:31:39 +0000 (14:31 +0000)]
libdm: check for non-existent region_id values in groups
Check that all region_id values specified in a group bitmap are
actually present: although this should not normally happen when
using the dmstats tool, it is possible as a result of manual
changes (or bugs) for a group descriptor to contain one or more
group_id values that do not exist.
Check for this situation when reading group descriptors, warn
the user the user, and clear these bits in the bitmap when
formatting it for output.
The crash will occur in some arbitrary dm_stats_get_* property
method - this happens while processing the 1st region_id in the
bitset, because the region is marked as grouped, but there is
no group bitmap present at dms->groups[2]->regions.
Fix this by detecting a mismatch between the expected region_id
and dm_bit_get_first() for the parsed bitset during
_parse_aux_data_group().
Bryn M. Reeves [Mon, 12 Dec 2016 20:40:27 +0000 (20:40 +0000)]
libdm: fix _stats_get_extents_for_file()
Handle files that contain multiple logical extents in a single
physical extent properly:
- In FIEMAP terms a logical extent is a contiguous range of
sectors in the file's address space.
- One or more physically adjacent logical extents comprise a
physical extent: these are the disk areas that will be mapped
to regions.
- An extent boundary occurs when the start sector of extent
n+1 is not equal to (n.start + n.length).
This requires that we accumulate the length values of extents
returned by FIEMAP until a discontinuity is found (since each
struct fiemap_extent returned by FIEMAP only represents a single
logical extent, which may be contiguous with other logical
extents on-disk).
This avoids creating large numbers of regions for physically
adjacent (logical) extents and fixes the earlier behaviour which
would only map the first logical extent of the physical extent,
leaving gaps in the region table for these files.
Zdenek Kabelac [Sun, 11 Dec 2016 09:18:44 +0000 (10:18 +0100)]
tests: test should clean devices it has created
To be able to detect lvm2 command is not leaking some
'unexpected' device - remove all devices before
test exits by its own command so test teardown
now can check what was 'left' unexpectedly.
Zdenek Kabelac [Sat, 10 Dec 2016 19:01:05 +0000 (20:01 +0100)]
raid: fix raid1 to mirror conversion
Fix order of operation when converting raid1 into old mirror.
Before any later metadata modification are initiated prepare
mirror_log device with all clearing.
Then directly convert raid1 into mirror with mirror_log.
This convertion now properly see as precommitted metadata
new 'mirror' and committed old 'raid' and is able to
preload all LVs.
Bryn M. Reeves [Sat, 10 Dec 2016 13:07:03 +0000 (13:07 +0000)]
libdm: use a private pool for filemap extent table
When mapping regions to a file descriptor, a temporary table of
extent descriptors is built using the dm_pool object building
interface.
Previously this use borrowed the dms->mem region and counter
table pool (since nothing can interleave with the allocation
while the caller is still in dm_stats_create_regions_from_fd()).
This turns out to be problematic for error recovery. When a
region creation operation fails partway through file mapping,
we need to roll back the set of already created regions and
this requires a listed handle: the dm_stats_list() will then
allocate from the same pool as the extents; we either have
to throw away valid list data, or leak the extent table, to
return the handle in a valid state.
Avoid this problem by creating a new, temporary mem pool in
_stats_create_file_regions() to hold the extent data, and
discarding it on exit from the function.
Bryn M. Reeves [Fri, 9 Dec 2016 23:59:51 +0000 (23:59 +0000)]
libdm: fix performance of failed filemap cleanup
While cleaning up the table of already created regions during a
failed dm_stats_create_regions_from_fd(), list the handle once,
and call _stats_delete_region() directly. This avoids sending a
@stats_list message for each region deleted, reducing runtime
from 6s to 0.7s when cleaning up ~250 out of ~10000 regions:
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
<< pauses here >>
Command failed
real 0m6.267s
user 0m3.770s
sys 0m2.487s
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
Command failed
real 0m0.716s
user 0m0.034s
sys 0m0.581s
Testing the error path requires region creation to start to
fail part way through the operation (in order to have regions
to clean up): the simplest way is to ensure the system is
close to the kernel limit of 1/4 RAM or 1/2 vmalloc space
consumed by dmstats data.
Bryn M. Reeves [Fri, 9 Dec 2016 22:58:25 +0000 (22:58 +0000)]
libdm: split off internal _stats_delete_region()
Split dm_stats_delete_region() so that internal callers can manage
the handle state themselves.
dm_stats_delete_region() now just handles checking the state of the
handle, reporting validation errors, and calling dm_stats_list() if
necessary, before calling _stats_delete_region().
The new _stats_delete_region() function performs the actual group
member removal and region deletion, and requires a fully listed
handle to operate.
Callers that repeatedly delete regions can use a single listed
handle for many operations on the same device, avoiding one
message ioctl per region deleted: since @stats_list with many
regions is expensive, this yields large runtime improvements.
Bryn M. Reeves [Fri, 9 Dec 2016 15:50:41 +0000 (15:50 +0000)]
libdm: use correct region_id when cleaning up a failed filemap
If we fail to create a region during dm_stats_create_regions_from_fd(),
we must remove all regions that were created to do this to date. This
needs to loop over the table of region_id values that were populated
by _stats_create_file_regions() before the error.
The code for this failure case in the out_remove branch incorrectly
uses the table index as the region_id:
for (--i; i != DM_STATS_REGION_NOT_PRESENT; i--) {
if (!dm_stats_delete_region(dms, i))
log_error("Could not delete region " FMTu64 ".", i);
}
This causes the cleanup code to delete a completely unrelated set
of regions (since the index here will always be nr_regions..0).
Fix it to pass the actual region_id stored in regions[i] instead.