David Teigland [Fri, 18 Sep 2020 19:42:23 +0000 (14:42 -0500)]
metadata: open rw fd before closing ro fd
lvm opens devices readonly to scan them, but
needs to open them readwrite to update the metadata.
Previously, the ro fd was closed before the rw fd
was opened, leaving a small gap where the dev was
not held open, and during which the dev could
possibly change which storage it referred to.
With the bcache_change_fd() interface, lvm opens a
rw fd on a device to be written, tells bcache to
change to the new rw fd, and closes the ro fd.
. open dev ro
. read dev with the ro fd (label_scan)
. lock vg (ex for writing)
. open dev rw
. close ro fd
. rescan dev to check if the metadata changed
between the scan and the lock
. if the metadata did change, reread in full
. write the metadata
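A minimal sketch of that ordering (not lvm2 source: the bcache_change_fd()
prototype is assumed from the description above, and error handling is
simplified):
    #include <fcntl.h>
    #include <stdbool.h>
    #include <unistd.h>

    /* assumed prototype, per the description above */
    bool bcache_change_fd(int di, int fd);

    int swap_to_rw_fd(const char *path, int ro_fd, int di)
    {
            /* open the rw fd first, so the device is never left unheld */
            int rw_fd = open(path, O_RDWR);
            if (rw_fd < 0)
                    return 0;

            /* point bcache's di at the new fd, keeping cached blocks */
            bcache_change_fd(di, rw_fd);

            /* only now is it safe to drop the ro fd used for scanning */
            close(ro_fd);
            return 1;
    }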
David Teigland [Thu, 17 Sep 2020 14:40:18 +0000 (09:40 -0500)]
bcache: use indirection table for fd
Add a "device index" (di) for each device, and use this
in the bcache api to the rest of lvm. This replaces the
file descriptor (fd) in the api. The rest of lvm uses
new functions bcache_set_fd(), bcache_clear_fd(), and
bcache_change_fd() to control which fd bcache uses for
io to a particular device.
. lvm opens a dev and gets an fd.
fd = open(dev);
. lvm passes fd to the bcache layer and gets a di
to use in the bcache api for the dev.
di = bcache_set_fd(fd);
. lvm uses bcache functions, passing di for the dev.
bcache_write_bytes(di, ...), etc.
. bcache translates di to fd to do io.
. lvm closes the device and clears the di/fd bcache state.
close(fd);
bcache_clear_fd(di);
In the bcache layer, a di-to-fd translation table
(int *_fd_table) is added. When bcache needs to
perform io on a di, it uses _fd_table[di].
In the following commit, lvm will make use of the new
bcache_change_fd() function to change the fd that
bcache uses for the dev, without dropping cached blocks.
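A minimal sketch of the di-to-fd indirection (illustrative only; the real
code lives in lib/device/bcache.c and also handles table growth and
invalidating cached blocks):
    /* _fd_table[di] holds the fd bcache uses for that device index */
    static int *_fd_table;
    static int _fd_table_size;

    static int _set_fd(int fd)
    {
            for (int di = 0; di < _fd_table_size; di++)
                    if (_fd_table[di] == -1) {
                            _fd_table[di] = fd;
                            return di;
                    }
            return -1;      /* table full; the real code grows the table */
    }

    static void _change_fd(int di, int fd)
    {
            /* swap the fd behind di without dropping cached blocks */
            _fd_table[di] = fd;
    }

    static void _clear_fd(int di)
    {
            _fd_table[di] = -1;
    }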
Use the new SKIP_WITH_LOW_SPACE and set a higher requirement for free space.
But still this test can't run on the system's tmpfs directories -
as they typically provide less than 2G of space, and when the test
runs there it also provisions all READ pages.
A BRD (ramdisk) device should work.
Extend the _wait_recalc() loop for slower hw.
When creating large raids which do not need to be fully synchronized, put
them on delay devices - so even less data needs to be read/written.
Remove unneeded lvchange as lvcreate already leaves the LV inactive.
Replace printf with awk as the generator.
A test can individually set a higher value for the required free space on
storage.
Note: it is not fully reliable, since when a 'brd' (ramdisk) device is used
this free space value is rather meaningless, but it might help
in cases where a real filesystem is the back-end for the test devices.
When the test exhausts all the available free space on the storage device,
then during the failure we cannot write anything either - yet
the teardown needs to finish its work - otherwise we leave
a basically overfilled filesystem for all the remaining tests.
In cases where internal functions like zero_dev or delay_dev are given an
invalid parameter so the resulting table can't work, resume at least the
previous table line before failing out - so the cleanup process
later on is not stuck waiting on a suspended device.
While the previous commit c9b40083fc34b5e2a1bfc7b251b38c0b8737483f
decreased the version to 1.19 for using bigger datasets, it was not
quite right - from our bb machine it looks like the
bigger metadata consumption started with 1.19 and kernel 4.18
(fc27).
Use a bigger volume and slow down writing to the cache device.
This makes it simpler to reach the 'dirty' state.
Also document that exactly 1 SIGINT has to fire to abort the flushing.
locking: restore blocking signal for VG_GLOBAL lck
During removal of a lot of locking code the signal blocking got lost
and signal processing got broken, leading to unpredictable
behavior of e.g. activation code that can get interrupted in the
middle of DM table processing.
lvm2 code always expects signals to be blocked while a lock is held,
unless it is explicitly placed into a section of:
sigint_allow(); ...; sigint_restore();
For checking a caught interrupt there is sigint_caught();
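A sketch of the expected pattern (the sigint_* prototypes are assumed from
the helpers named above; the work function is illustrative):
    void sigint_allow(void);
    void sigint_restore(void);
    int sigint_caught(void);
    void do_interruptible_work(void);       /* illustrative long-running step */

    static void locked_section(void)
    {
            /* the lock is held here, so signals are expected to be blocked */

            sigint_allow();         /* explicitly allow SIGINT around slow work */
            do_interruptible_work();
            sigint_restore();       /* block signals again before touching DM state */

            if (sigint_caught())
                    return;         /* caller unwinds and releases the lock */
    }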
Metadata size was calculated correctly only for raids.
Fixes a crash during lvcreate when a thin-pool was created
on a VG where the remaining free space was only big enough to fit a single
metadata LV and not also its _pmspare.
The return value of top_level_lv_name() may be NULL, so we should
check the return value of top_level_lv_name() before calling
strcmp(lv->name, top_level_lv_name(vg, lv_name)).
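A minimal sketch of the shape of the fix, with a stub standing in for
top_level_lv_name() and the surrounding lvm2 structures:
    #include <string.h>

    /* stub standing in for top_level_lv_name(), which may return NULL */
    static const char *top_level_lv_name_stub(const char *lv_name)
    {
            return (lv_name && *lv_name) ? lv_name : NULL;
    }

    static int is_top_level(const char *name, const char *lv_name)
    {
            const char *top = top_level_lv_name_stub(lv_name);

            /* check the return value before handing it to strcmp() */
            return top && !strcmp(name, top);
    }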
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
When using cscope to read the code, it generates the 3 files below to speed up
cross-referencing: cscope.files, cscope.in.out, cscope.po.out
The .gitignore only contains "/cscope.out". It is a little bit messy when
executing 'git status' and other git commands.
This patch adds all cscope-generated files to .gitignore.
When using --use-policy for automatic extension of a thin-pool,
the extension of the thin-pool's metadata itself can actually take
some extra space.
Since I'm not aware of an exact compensation formula, add just
1% extra to the calculated amount and hope it fits.
The wanted target is to always have a usable thin-pool that fits
below pool_metadata_min_threshold().
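As a sketch, the compensation amounts to something like this (illustrative
names, not the actual policy code):
    #include <stdint.h>

    static uint64_t add_metadata_headroom(uint64_t calculated)
    {
            /* add ~1% extra so the metadata growth itself still fits */
            return calculated + calculated / 100;
    }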
Since regular code paths query these:
lv_raid_has_integrity()
lv_has_integrity_recalculate_metadata()
without first checking for lv_is_raid() - these 'return 0' paths should
not use <stacktrace> as they are expected.
raid: do not enforce flushing of raids when it is not required
This is probably a somewhat experimental patch - but when e.g. a raid device
is just extended, there should not be a technical need for a flush,
unless the target would strictly need it. It should allow faster
processing of lvm commands by not being blocked by a possibly longer flush.
Since we do not support rimage & rmeta for snapshots - we can
avoid querying for -cow devices and add them as origin_only -
since their snapshots (-cow) could have never existed.
This reduces several ioctl operations during table preloading.
lvconvert: flip return value of _raid_split_image_conversion
Use '0' for error and '1' for success.
Also drop INTERNAL_ERROR from this path - as this error
is ATM used for invalid devices.
(i.e. test lvconvert-raid1-split-trackchanges.sh)
Just like we have 'writeerror_dev' supporting creation of a device
with a 'readable' segment and segments where writes will fail, we
now have support for delayed zero mappings.
This is useful if we want to 'fake' large writing areas where we do
not really care about the actual 'disk' content - since we test
operation logic and it doesn't matter that we read and write zeroes.
In combination with the 'delay' target we can create specific mappings
and avoid using large memory areas of ramdisk.
On a test system with the 'default' filter (aka accept all), a test
can suffer from automatic system activation after enabling a device -
so for created LVs set up skipping of this automatic
activation. This should prevent LVs getting into the table
via the pvscan service.
Since we declare dev_name in lib/device/device.h
and pvs in commands.h,
rename the local dev_name to device_name
and pvs to pvs_list to prevent shadowing warnings.
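An illustration of the kind of shadowing the rename avoids (stand-in
declarations only; the real ones are in lib/device/device.h and commands.h):
    const char *dev_name(const void *dev);   /* stand-in global declaration */

    static void report_device(void)
    {
            /* a local named 'dev_name' would shadow the declaration above
             * and trigger the build's shadowing warning, hence the rename */
            const char *device_name = "/dev/sda";
            (void)device_name;
    }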