Zdenek Kabelac [Fri, 20 Jan 2017 22:07:05 +0000 (23:07 +0100)]
dmeventd_thin: new logic for calling commands
For more advanced support we need to ensure better logic for calling
external much more advanced script for maintanance of thin-pool.
So this new code ensures:
When thin-pool data or metadata is bigger then 50%,
then with each 5% increment, action is called.
This is independent from autoextend_threshold.
This action always happens when thin-pool is over threshold,
(so no action when it's exactly i.e. 60%).
The only exception is 100% full thin-pool - which invokes 'last'
action.
Since thin-pool occupancy may change also downward, code needs
to also handle possibly reduction of occupancy of thin-pool.
So when usage drop from 90% to 50%, thin-pool will start to call
again action when it will pass 55% threshold.
This give external commands lot of option i.e. to call 'fstrim'
before actual resize is needed.
Zdenek Kabelac [Fri, 20 Jan 2017 22:06:45 +0000 (23:06 +0100)]
dmeventd_thin: drop umounting on error path
Default internal logic will stop trying to do any 'rescue' action
when executed command fails.
This will be now fully in hands of external script if such
behaviour is needed.
Zdenek Kabelac [Fri, 20 Jan 2017 21:53:55 +0000 (22:53 +0100)]
dmeventd_thin: rework failure handling
Instead of stopping monitoring after couple failing retries,
keep monitoring forever, just make larger delays between command
retries (ATM upto ~42 minutes).
So syslog is not spammed too often, yet commands have a chance to
be retried and succeed eventually...
Zdenek Kabelac [Fri, 20 Jan 2017 20:50:23 +0000 (21:50 +0100)]
dmeventd_thin: init command
When dmeventd configured command does not start with 'lvm ' prefix,
it's going to be an 'external' command.
In this case we split command by spaces to argv strings.
Zdenek Kabelac [Mon, 9 Jan 2017 15:30:49 +0000 (16:30 +0100)]
libdm: add human R|readable units
When showing sizes with 'H|human' units we do use standard rounding.
This however is confusing users from time to time,
when the printed number uses some biger units i.e. GiB and there is just
tiny fraction of space missing.
So here is some real-life example with new 'r' unit.
Meaning is - lvol1 has 'slightly' less then 2.00g - from sign '<' user
can be aware the LV doesn't have full 2.00GiB in size so he
will be less surpriced allocation of 2G volume will not succeed.
Zdenek Kabelac [Fri, 6 Jan 2017 11:41:38 +0000 (12:41 +0100)]
mirror: relax internal error for a while
With recent commit d6a74025df1afb3d76bec435bc6a40d649217b42 using
INTERNAL_ERROR while cheking layer LV - it's been noticed mirror
logic currently doesn't do a correct thing during upconversion and
does a full-try instead of checking only allocator capabilities.
This leads to invalid usage of layer.
To keep existing code running before providing a fix, relax
INTERNAL_ERROR just an error and keep the 'code' running.
Once mirror code is fixed, these all check should be switched
to internal errors.
Zdenek Kabelac [Thu, 5 Jan 2017 14:52:00 +0000 (15:52 +0100)]
report: report merged state for inactive LV
This was missing piece in 77997c7673bfca56f51ae4eb55a50bc76e40fe79.
When merging origin is inactive (while driver is loaded) we
could already report merge in progress values as there is
no way to activate 'old state' now.
Zdenek Kabelac [Thu, 5 Jan 2017 14:49:07 +0000 (15:49 +0100)]
debug: show proper error message for layer mismatch
Show proper internal error for failing command when there are some
inconsitencies in sizes of LV and its layer instead of rather
meaningless error code 5.
(Could be hit i.e. if user tried to 'resize' cached LV and then
uncache such LV.)
Zdenek Kabelac [Thu, 5 Jan 2017 14:32:25 +0000 (15:32 +0100)]
cache: resize is still unsupported
During rework of resize code this validation check
has been lost (in my resize branch). Upstream
is still not supporting resize of any cache type LV
so needs to be prevented.
Zdenek Kabelac [Tue, 3 Jan 2017 13:47:46 +0000 (14:47 +0100)]
cache: add missing udev wait
When we need to clear dirty cache content of cached LV, there
is table reload which usually is shortly followed by next metadata
change. However udev can't (as of now) process udev event
while device is 'suspended'.
So whenever sequence of 'suspend/resume/suspend' is needed,
we need to wait first for finishing of 'resume' processing before
starting next 'suspend'. Otherwise there is 'race' danger of triggering
unwantend umount by systemd as such event will trigger
SYSTEMD_READY=0 state for a moment for such changed device.
Such race is pretty ugly to trace so we may need to review more
sequencies for missing 'sync'.
(Other option is to enhnace 'udev' rules processing to avoid
such dramatic actions to be happening for suspended devices).
The only reasonable behaviour here is to error on
any number out of accepted range (i.e. now numbers
wrapping around with some hidden logic).
As this is plain bug there is no support for
backward compatibility since noone should
set numbers >UINT32_MAX and expect 0 or error
depending on how big number was used....
Zdenek Kabelac [Tue, 3 Jan 2017 12:02:52 +0000 (13:02 +0100)]
lvmcmdline: support uint32
Add simple function to wrap usage for only uint32 numbers.
Unlike 'int_arg' which accepts full range of 64bit number
this function will error on numbers out of this range:
lvchange: allow a transiently failed RaidLV to be refreshed
Add to commits 87117c2b2546 and 0b8bf73a63d8 to avoid refreshing two
times altogether, thus avoiding issues related to clustered, remotely
activated RaidLV. Avoid need to repeat "lvchange --refresh RaidLV"
two times as a workaround to refresh a RaidLV. Fix handles removal
of temporary *-missing-* devices created for any missing segments
in RAID SubLVs during activation.
Because the kernel dm-raid target isn't able to handle transiently
failing devices properly we need
"[dm-devel][PATCH] dm raid: fix transient device failure processing"
as well.
test: add lvchange-raid-transient-failures.sh
and enhance lvconvert-raid.sh
Zdenek Kabelac [Thu, 22 Dec 2016 22:28:04 +0000 (23:28 +0100)]
thin: refresh status when error processing fails
When thin-pool processes event and 'lvextend --use-policies' fails
rather capture up-to-date new info as the fullness percentage may
have jumped noticable. This way we could use 'more' correct numbers
when checking for thresholds.
Zdenek Kabelac [Thu, 22 Dec 2016 18:51:35 +0000 (19:51 +0100)]
report: show proper info for merging origin
When there is 'merging' of an origin in progress, but metadata stil
do provide both origin and snapshot, we should show data from merged
snapshot. This is important mainly for thin case, where there was
a window, where i.e. 'lvs -o+device_id' would report information
about 'already gone' origin thin LV.
This race window is usually hard to trigger but can be ocasionally hit.
Usually shortly after activation, but before polling process manages
to update metadata after merge.
Zdenek Kabelac [Tue, 20 Dec 2016 14:59:11 +0000 (15:59 +0100)]
validation: rework segment validation
Move individual segment validation to a separate function
executed for 'complete_vg'.
Move some 'extra' validation bits from 'raid' validation to global
segtype validation (so extending existing normal validation)
TODO: still some test are left to be moved.
Reduce some duplication in validation process - there are still
some left thought so still room for improving overal speed.
Tony Asleson [Wed, 14 Dec 2016 21:30:01 +0000 (15:30 -0600)]
lvmdbusd: Use timeout_add instead
The function timeout_add_seconds has quite a bit of variability. Using
timeout_add which specifies the timeout in ms instead of seconds. Testing
shows that this is much more consistent which should improve clients that
are using shorter timeouts for the API and the connection.
Zdenek Kabelac [Mon, 19 Dec 2016 13:06:55 +0000 (14:06 +0100)]
lvconvert: fix shown lv name for snapshot split
We can't keep 'display_lvname' for too long - it's using
ringbuffer and keeps limited number of names. So it's
safe only per few simple tests, but can't be used anymore
after some function calls..
(Fixes 00e641ef37a977129acc503f3fa1b67f556ac5eb)
Bryn M. Reeves [Thu, 15 Dec 2016 19:03:42 +0000 (19:03 +0000)]
libdm: clear region table in dm_stats_list()
Call _stats_regions_destroy() from dm_stats_list() if dms->regions
is non-NULL. This avoids leaking any pool allocations and ensures
the handle is in a known state: if an error occurs during the list,
dms->regions will be NULL and the handle will appear empty.
Zdenek Kabelac [Sat, 17 Dec 2016 21:41:27 +0000 (22:41 +0100)]
lvconvert: support cache to external origin conversion
Add this functionality to lvconvert:
'lvconvert --thin cachedLV --thinpool vg/poll'
Converts cachedLV to external origin (which will be read-only).
New thin volume is created in thinpool LV and it's using external
origin as source for unprovisioned chunks.
This conversion happens online (while volume is in use).
Thin LV remains fully writable.
Cached external origin no longer could be written so cache will be used
ONLY for read operations. For this limitation we require cache mode
to be writethrough (as writeback cannot write to read-only volumes).
When thinLV is later removed cached external origin is again
fully usable, just note, LV remain in 'read-only' mode.
When read-write is needed, 'lvchange -prw' has to be used.
Single external origin could be user by multiple thinLV in
multiple differen thin pool.
Zdenek Kabelac [Sun, 18 Dec 2016 14:05:31 +0000 (15:05 +0100)]
cache: improve activation with -real
When cache volume may be converted from normal to -real layer LV
we need to improve logic for call cache_check.
With this patch, we register call for cache_check only when metadata LV
is not yet present in active table slot (should match initial table
load).
This avoids unwanted checking when cache would become layer device
online.
Zdenek Kabelac [Sat, 17 Dec 2016 21:40:59 +0000 (22:40 +0100)]
libdm: drop callback on revert path
The system is likely in some very inconsisten state.
Do not try to make it even more problematic with trying
to invoke tools like thin_check via callback.
Zdenek Kabelac [Sat, 17 Dec 2016 21:40:14 +0000 (22:40 +0100)]
lv: fix lock holder for external origin
External origin could be reloaded via more locks.
It's actually even more complex then thin-pool,
as it may be active on more nodes for linear LVs
(and maybe even more types).
External origin is always read-only thus unmodifiable
device so there should not be a problem accesing it
through multiple nodes.
Also for thin-pool check first presence of active thin-pool.
FIXME:
It's not easy to detect on which nodes this device is active
Thus manipulation with such device may require checking every
node and it active state and refresh.
But since such setup is quite complex to prepare and use,
hopefully there are not user trying to 'explore' this usage yet.
Zdenek Kabelac [Sat, 17 Dec 2016 20:52:27 +0000 (21:52 +0100)]
cache: prepare status checking for layer
To be ready to show status of cache volume, call the status
with layer. Layer is automatically detected in this case when
cache volume is used in 'layered' form (needs -real suffix).
Zdenek Kabelac [Sun, 18 Dec 2016 14:05:57 +0000 (15:05 +0100)]
cache: improve wait for cache clear
Avoid printing misleading message about single dirty block.
Instead properly detect condition where the 'cleaner' policy
needs to be installed without 'overloading' dirty variable.
Also print warning if we would be clearing read-only volume.
(it really shouldn't happen).
Zdenek Kabelac [Sat, 17 Dec 2016 20:55:02 +0000 (21:55 +0100)]
thin: reload external origin with last thin
External origin could be activated as stand-alone device.
When the last thin LV is removed, external origin is no longer
the external origin and it's layer property was dropped.
Ensure dm table is correct by reloading external origin
(when it's active).
Zdenek Kabelac [Sat, 17 Dec 2016 20:54:51 +0000 (21:54 +0100)]
lvs: show status for layer
When LV is external origin, show info for LV but
status for -layer. So we expose more info to a user
as otherwise active external origin is only linear
mapping of -real layer.
We also need to handle --count=0 specially when calculating the
interval number: since intervals count from #1, this must account
for the implicit "minus one" when converting from zero to the
UINT64_MAX value used (which is too large to store in _int_args).
Bryn M. Reeves [Sun, 18 Dec 2016 12:42:47 +0000 (12:42 +0000)]
dmstats: separate TIMERFD and useleep() exit conditions
The time management code mixes tests of the _timer_fd value with
code that should be timer agnostic: this causes problems for users
of the usleep() timer, since it cannot properly detect the start
of a new interval:
Separate these out, by defining a _timer_running() call that each
timer implements, and only define _timer_fd if we are compiling
with TIMERFD enabled.
Bryn M. Reeves [Sun, 18 Dec 2016 12:39:26 +0000 (12:39 +0000)]
dmstats: use better interval estimate for usleep() timer
Although the usleep() interval timer is not used if the Linux
TIMERFD interface is available it should still provide reasonably
good timing.
Instead of trying to estimate the error from the duration of the
last sleep, peg it to the start time of the program, and use the
value of ((start_time - now) % interval) to correct the current
interval duration.
This always pulls us back into sync at the end of each interval,
rather than relying on trying to incrementally adjust the time
duration at each interval start.
This greatly reduces drift when the usleep() clock is used.
This is safe if the destination bitset is smaller than the source,
or if the two bitsets are of the same size.
With a destination that is larger (e.g. when resizing a bitmap to
add more capacity), the memcpy will overrun the source bitset and
set garbage bits in the destination.
There are nine uses of the macro currently (8 in libdm/regex, and
1 in daemons/cmirrord): in each case the two bitsets are always of
equal size so the behaviour is unchanged.
Fix the macro to use bs2's size to simplify resizing bitsets and
avoid the need for another copy macro.
We do two operations in one table reload:
- raid leg/image extraction
- rename remaining raid legs
This should be made in separate steps. Otherwise we do an
uncorrectable table change on error path (leaving tables
for admin and dmsetup).
As a hotfix - restore the previous logic and use a single
new function _lv_update_and_reload_list which activates exclusively
extracted LVs on the list before resuming suspended raid LV.
This restore 'rename' functionality upon resume.
Also still preserve the 'origin_only' logic - although we know
it's not working properly for cluster and LV stacking.
Zdenek Kabelac [Tue, 13 Dec 2016 11:28:12 +0000 (12:28 +0100)]
cleanup: remove wrapping function
backup is not 'tested' for success and also it should
actually happen just when command is finished.
We do not target to make backups with each inter-step
metadata change.
Zdenek Kabelac [Tue, 13 Dec 2016 11:31:28 +0000 (12:31 +0100)]
raid: improve table reload sequence
This is another place for 'common' use pattern or
reload and activation of deleted devices.
(Moving the exclusive activation to _deactivate_and_remove_lvs()).
TODO: looks like halve of raid function is reloading
just 'origin' - and the other full LV.
Bryn M. Reeves [Sun, 11 Dec 2016 22:41:45 +0000 (22:41 +0000)]
libdm: add min_num_bits to dm_bitset_parse_list()
It's useful to be able to specify a minimum number of bits for a
new bitmap parsed from a list, for e.g. to allow for expansing a
group without needing to copy/reallocate the bitmap.
Add a backwards compatible symbol for programs linked against old
versions of the library.
Bryn M. Reeves [Sun, 11 Dec 2016 12:44:45 +0000 (12:44 +0000)]
libdm: add dm_bit_get_last()/dm_bit_get_prev()
It is sometimes convenient to iterate over the set bits in a dm
bitset in reverse order (from the highest set bit toward zero), or
to quickly find the last set bit.
Add dm_bit_get_last() and dm_bit_get_prev(), mirroring the existing
dm_bit_get_first() and dm_bit_get_next().
dm_bit_get_prev() uses __builtin_clz when available to efficiently
test the bitset in reverse.
Bryn M. Reeves [Tue, 13 Dec 2016 12:03:00 +0000 (12:03 +0000)]
libdm: break up _stats_get_extents_for_file()
Split out the loop that iterates over each batch of FIEMAP
extent data from the function that sets up and calls the ioctl
to reduce nesting and simplify local variable use:
The _stats_map_extents() function is responsible for detecting
eof and extent boundaries and adding whole, allocated extents
to the file extent table for region creation.
Bryn M. Reeves [Tue, 13 Dec 2016 14:31:39 +0000 (14:31 +0000)]
libdm: check for non-existent region_id values in groups
Check that all region_id values specified in a group bitmap are
actually present: although this should not normally happen when
using the dmstats tool, it is possible as a result of manual
changes (or bugs) for a group descriptor to contain one or more
group_id values that do not exist.
Check for this situation when reading group descriptors, warn
the user the user, and clear these bits in the bitmap when
formatting it for output.