Peter Rajnoha [Wed, 11 Mar 2015 15:18:42 +0000 (16:18 +0100)]
metadata: _vg_read: check if PV_EXT_USED flag is set correctly for orphan PVs and do a repair if needed
If we know that the PV is orphan, meaning there's at least one MDA on
that PV which does not reference any VG and at the same time there's
PV_EXT_USED flag set, we're certainly in an inconsistent state and we
need to fix this.
For example, such situation can happen during vgremove/vgreduce if we
removed/reduced the VG, but we haven't written PV headers yet because
vgremove stopped abruptly for whatever reason just before writing new
PV headers with updated state, including PV extension flags (and so the
PV_EXT_USED flag).
However, in case the PV has no MDAs at all, we can't double-check
whether the PV_EXT_USED is correct or not - if that PV is marked
as used, it's either:
- really used (but other disks with MDAs are missing)
- or the error state as described above is hit
User needs to overwrite the PV header directly if it's really clear
the PV having no MDAs does not belong to any VG and at the same time
it's still marked as being in use (pvcreate -ff <dev_name> will fix this).
For example - /dev/sda here has 1 MDA, orphan and is incorrectly marked
with PV_EXT_USED flag:
Peter Rajnoha [Tue, 10 Mar 2015 10:25:14 +0000 (11:25 +0100)]
pv: check for the PV_EXT_USED flag and deny pvcreate/pvchange/pvremove/vgcreate on such PV (unless forced)
Make sure we won't use a PV that is already marked as used. Normally,
VG metadata would stop us from doing that, but we can run into a
situation where such metadata is missing because PVs with MDAs
are missing and the PVs left are the ones with 0 MDAs.
(/dev/sda in this example has 0 MDAs and it belongs to a VG,
but other PVs with MDA are missing)
$ pvcreate /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't initialize PV '/dev/sda' without -ff.
$ pvchange -u /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't change PV '/dev/sda' without -ff.
Physical volume /dev/sda not changed
0 physical volumes changed / 1 physical volume not changed
$ pvremove /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
(If you are certain you need pvremove, then confirm by using --force twice.)
$ vgcreate vg /dev/sda
Physical volume '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Unable to add physical volume '/dev/sda' to volume group 'vg'.
We'll use this struct in subsequent patches for PVs which should
be rewritten, not just created. So rename struct pv_to_create to
struct pv_to_write for clarity.
The bug description: First we allocate memory for
processing handle (at an address 1) then we
allocate some memory on the same pool for later use
in pvmove_poll function inside the process_each_pv
function (at an address 2). After we jump out of
process_each_pv we called destroy_processing_handle.
As a result of destroying the handle memory pool could
deallocate all memory at address 1 or higher. The
pvmove_poll function tried to copy a memory allocated
at address 2 that could be returned to the system.
If it was so it led to segfault.
We need to rethink proper fix but in the same time
cmd->mem pool is recreated per each lvm command so
this should not cause problems even when we run
multiple commands in lvm shell.
A valgrind snapshot of the corruption:
Invalid read of size 1
at 0x4C29F92: strlen (mc_replace_strmem.c:403)
by 0x5495F2E: dm_pool_strdup (pool.c:51)
by 0x1592A7: _create_id (pvmove.c:774)
by 0x159409: pvmove_poll (pvmove.c:796)
by 0x1599E3: pvmove (pvmove.c:931)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Address 0xf15df8a is 138 bytes inside a block of size 8,192 free'd
at 0x4C28430: free (vg_replace_malloc.c:446)
by 0x5494E73: dm_free_wrapper (dbg_malloc.c:357)
by 0x5495DE2: _free_chunk (pool-fast.c:318)
by 0x549561C: dm_pool_free (pool-fast.c:151)
by 0x164451: destroy_processing_handle (toollib.c:1837)
by 0x1598C1: pvmove (pvmove.c:903)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Zdenek Kabelac [Thu, 11 Feb 2016 11:00:28 +0000 (12:00 +0100)]
dm: alloc always 8byte aligned
Fixing regression caused by 197b5e6dc7dd8ec161ebe43c97fd2ac8384b3433.
So the 'TODO' part now finally know the answer - there is 'sparc64'
architecture which imposes limitation to read 64b words only through
64b aligned address.
Since we never could know how is the user going to use the returned
pointer and the userusually expects it's aligned on the highest CPU
required alignement, preserve it also for char*.
M.H. Tsai [Wed, 27 Jan 2016 11:11:03 +0000 (19:11 +0800)]
config: fix verbose type to int
'verbose' was marked as a boolean option while it
takes integer args - so it has limited usage to 0 or 1,
but we supported 0-4 at least.
Fix it by switching to corrent int type.
(Hopefully noone was trying to use this variable as true/yes/false/no
way - as the would be unsupported/undocumented).
Peter Rajnoha [Wed, 10 Feb 2016 13:53:10 +0000 (14:53 +0100)]
conf: use use_blkid_wiping=0 if not defined in lvm.conf and support not compiled in
Normally, we generate and provide lvm.conf file where use_blkid_wiping
is set based on whether support for this is compiled in or not. This was
generated properly based on configure.
However, if lvm.conf is not used at all (someone deletes it) or the value
in lvm.conf is commented out (user edited it), we still need to use
proper default value that is based on DEFAULT_USE_BLKID_WIPING taken
from configure script - we used hardcoded value of "1" in this case
by mistake.
Peter Rajnoha [Wed, 3 Feb 2016 13:40:52 +0000 (14:40 +0100)]
filter: do not check for suspended devs in filter-usable in lvmetad mode
We already do check for suspended devs within udev rules where
the pvscan is to update lvmetad. So the check for suspended devs
in "pre-lvmetad" chain is not useful here - remove it - it may
be a source of hardly to detect races anyway (if udev rule detects
the device is not suspended and then the pvscan instance sees the
dev as suspended, we may end up not reacting to the event properly).
Peter Rajnoha [Tue, 2 Feb 2016 12:54:19 +0000 (13:54 +0100)]
metadata: format: also delete bootloader areas from lvmcache when reading lvm1 and pool label
lvm1 and pool format do not support bootloader areas and we need to
remove any existing associated bootloader areas when we read lvm1 and
pool labels.
This has its importance if we're converting from one format to another
and we're reusing lvmcache in long-running commands (e.g. clvmd or lvm
shell) and we need to make lvmcache consistent and valid for current format.
Peter Rajnoha [Tue, 2 Feb 2016 12:28:11 +0000 (13:28 +0100)]
filters: partitioned: fix partition table filter with external_device_info_source="udev" and blkid<2.20
Non-dm devices have ID_PART_TABLE_TYPE variable exported in
udev db from blkid scan for *both* whole devices and partitions.
We used ID_PART_ENTRY_DISK in addition to decide whether this
is the whole device or partition and then we filtered out only
whole devices where the partition table really is.
However, ID_PART_ENTRY_DISK was added in blkid 2.20 so we need
to use a different set of variables to decide on whole devices
and partitions on systems where older blkid is still used.
Now, we use ID_PART_TABLE_TYPE to detect that there's something
related to partitioning with this device and we use DEVTYPE variable
instead to decide between whole device (DEVTYPE="disk") and partition
(DEVTYPE="partition").
For dm devices it's simpler, we have ID_PART_TABLE_TYPE variable\
set in udev db for whole devices. It's not set for partitions,
hence we don't need more variable in addition to make the decision
on whole device vs. partition (dm devices do not have regular
partitions, hence DEVTYPE can't be used anyway, it's always set
to "disk" for whole disks and partitions).
Peter Rajnoha [Fri, 15 Jan 2016 15:41:27 +0000 (16:41 +0100)]
device: also cache device size
Add "size" and "size_seqno" to struct device to cache device's size
and also to control its lifetime - the cached value is valid as long
as the global _dev_size_seqno is equal to the device's size_seqno,
otherwise we need to get the size again and cache the new value.
This patch also adds new dev_size_seqno_inc() fn for the appropriate
parts of the code to increment current global value of _dev_size_seqno
and hence to cause all currently cached values for device sizes to
be invalidated.
The device size is now cached because we're planning to reuse this
information for further checks and we want to avoid checking it more
than necessary to save resources.
Zdenek Kabelac [Thu, 21 Jan 2016 12:19:27 +0000 (13:19 +0100)]
toollib: restore command break support
Fix regression caused by c9f021de0b4d2c873ef5b97d17c602d0380359fd.
This commit actually transfered real-action (e.g. device removal)
into the next loop which has however missed to check for break.
So add check for break also there.
Peter Rajnoha [Thu, 21 Jan 2016 08:56:07 +0000 (09:56 +0100)]
configure: fix configure to set proper use_blkid_wiping if autodetected as disabled
If not using explicit --enable-blkid-wiping/--disable-blkid-wiping
configure option, the configure script tries to enable/disable blkid
wiping feature automatically based on blkid library version found.
The script incorrectly set default value for lvm.conf's
allocation/use_blkid_wiping" setting to "1" (enabled) if proper
blkid library version was not found or the version found was less
than the minimum required. It should be set to "0" in this case.
Zdenek Kabelac [Tue, 19 Jan 2016 15:07:39 +0000 (16:07 +0100)]
cleanup: reformat sentence about max sizes
The extent size must fits all blocks in 4294967295 sectors
(in 512b units) this is 1/2 KiB less then 2TiB.
So while previous statement 'suggested' 2TiB is still acceptable value,
make it clear it's not.
As now we support any multiples of 128KB as extent size -
values like 2047G will still 'flow-in' otherwise the largest power-of-2
supported value is 1TiB.
With 1TiB user needs 8388608 extents for 8EiB device.
(FYI such device is already unusable with todays glibc-2.22.90-27)
4GiB extent size is currently the smallest extent size which allows
a user to create 8EiB devices (with 2GiB it's less then 8EiB).
TODO: lvm2 may possibly print amount of 'lost/unused space' on a PV,
since using such ridiculously sized extent size may result in huge
space being left unaccessible.
Since commit 2fc126b00d83991c6a2f3ed9aa61457294a4c45e, the library
code requires udev to be initialised for device scanning and
clvmd can fail to find VGs if devices/external_device_info_source
is set to "udev".
Peter Rajnoha [Tue, 19 Jan 2016 11:26:01 +0000 (12:26 +0100)]
report: make devices, metadata_devices, seg_pe_ranges and seg_metadata_le_ranges fields consistent
There are two basic groups of fields for LV segment device reporting:
- related to LV segment's devices: devices and seg_pe_ranges
- related to LV segment's metadata devices: metadata_devices and seg_metadata_le_ranges
The devices and metadata_devices report devices in this format:
"device_name(extent_start)"
The seg_pe_ranges and seg_metadata_le_ranges report devices in
this format:
"device_name:extent_start-extent_end"
This patch reverts partly what commit 7f74a995029caa41ee3cf9aec0bd024a34bfd89a
(v 2.02.140) introduced in this area - it added [] for
hidden devices to mark them for all four fields mentioned above.
We won't be marking hidden devices in devices and metadata_devices
fields.
The seg_metadata_le_ranges field will have hidden devices marked -
it's new enough that we don't need to care about compatibility much
yet.
The seg_pe_ranges is old enough that we shouldn't be changing this
one - so we're reverting to not marking hidden devices here.
Instead, there's going to be a new field "seg_le_ranges" which
is going to replace the seg_pe_ranges and it will mark hidden devices -
this is going to be introduced in a patch later.
So in the end we'll end up with:
(LV segment's devices)
devices field with "device_name(extent_start)" format, not marking hidden devices
seg_pe_ranges field with "device_name:extent_start-extent_end" format, not marking hidden devices (deprecated, new seg_le_ranges should be used instead for standardized format)
seg_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
(LV segment's metadata devices)
metadata_devices field with "device_name:extent_start-extent_end" format, not marking hidden devices
seg_metadata_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
Also, both seg_le_ranges and seg_metadata_le_ranges will honour the
report/list_item_separator setting which can be used to configure
the delimiter used for list items.
So, to sum it up, we will recommend using the new seg_le_ranges and
seg_metadata_le_ranges fields because they display devices with
standard extent range format, they can mark hidden devices and they
honour the report/list_item_separator setting.
We'll be keeping devices,seg_pe_ranges and metadata_devices fields
for compatibility.
Peter Rajnoha [Tue, 19 Jan 2016 11:24:02 +0000 (12:24 +0100)]
report: change _format_pvsegs to return list instead of plain string, change associated report fields from STR to STR_LIST
The associated devices,metadata_devices,seg_pe_ranges and
seg_metadata_le_ranges are reported as genuine string lists now.
This allows for using the items separately in -S|--select
(so searching for subsets etc.) and also it allows for
configuring the separator using report/list_item_separator
which may be useful in scripts (however, we'll enable this
only for seg_le_metadata_ranges and not for devices,seg_pe_ranges
and seg_metadata_devices for compatibility reasons - see following
patch).
David Teigland [Mon, 18 Jan 2016 21:40:42 +0000 (15:40 -0600)]
toollib: add comment about missing device
Add a comment in _process_pvs_in_vg() to document the
place where there have been problems with processing
PVs twice.
For a while we had a hacky workaround here where we'd
skip processing a PV if its device wasn't found in
all_devices (and !is_missing_pv since we want to
process PVs with missing devices.). That workaround
was removed in commit 5cd4d46f because it was no
longer needed.
The workaround had originally been needed to prevent
a device from being processed twice when the PV had
no MDAs -- it would be processed once in its real VG
and then the workaround would prevent it from being
processed a second time in the orphan VG.
Wrongly appearing as an orphan likely happened because
lvmcache would consider the no-MDA PV an orphan unless
the real VG holding that PV was also in lvmcache.
This issue is also mentioned in pvchange where holding
the global lock allows VGs to remain in lvmcache so
PVs with 0 mdas are not considered orphans.
The workaround in _process_pvs_in_vg() was originally
intended for reporting commands, not for pvchange.
But, it was accidentally helping pvchange also because
the method described by the pvchange global lock
comment had been subverted by commit 80f4b4b8.
Commit 80f4b4b8 was found to be unnecessary, and was
reverted in commit e710bac0. This restored the
intended global lock lvmcache effect to pvchange, and
it no longer relied on the workaround in toollib.
When reporting on LVs, take the end of the range from the size of the
underlying (hidden) LV rather than the logical size of the current
segment (that PVs use).
/* equivalent to process_each_pv */
dm_list_iterate_items(vgids)
vg = vg_read_internal()
dm_list_iterate_items(&vg->pvs)
With the found 'pv', it would do vg_read() on pv_vg_name(pv),
and then do the actual pvmove processing.
This commit simplifies by using process_each_pv() and putting
the actual pvmove processing into the "single" function.
This eliminates both find_pv_by_name() and the vg_read().
The processing code that followed vg_read remains the same.
The return code for the pvmove command is not based on the
process_each_pv return code, but is based on the success/fail
conditions in the existing code.
David Teigland [Fri, 15 Jan 2016 21:31:13 +0000 (15:31 -0600)]
lvmlockd: fix lvb validation for conversion
Make the lvb validation rules for convert match
those for unlock (even though it would be very
unlikely or impossible for convert to deal with
zero lvb.)
David Teigland [Fri, 15 Jan 2016 21:18:25 +0000 (15:18 -0600)]
pvchange, pvresize: fix lockd_gl() usage
When an orphan PV is changed/resized, the
lvmlockd global lock is converted from sh
to ex. If the command is changing two
orphan PVs, the conversion to ex should
be done only once.
Peter Rajnoha [Mon, 18 Jan 2016 13:32:17 +0000 (14:32 +0100)]
report: add kernel_cache_settings field
Existing cache_settings field displays the settings which are
saved in metadata. Add new kernel_cache_settings fields to display
the settings which are currently used by kernel, including fields
for which default values are used.
This way users have complete view of the set of cache settings
supported (and which they can set) and their values which are used
at the moment by kernel.
Peter Rajnoha [Fri, 15 Jan 2016 13:39:43 +0000 (14:39 +0100)]
lvm2app: fix lvm2app to return either 0 or 1 for lvm_vg_is_{clustered,exported}
Fix lvm2app to return either 0 or 1 for lvm_vg_is_{clustered,exported},
including internal functions pvseg_is_allocated and vg_is_resizeable
which are not yet exposed in lvm2app but make them consistent with the
rest.
David Teigland [Thu, 14 Jan 2016 19:26:15 +0000 (13:26 -0600)]
Revert "lvmcache: skip drop when vg_write lock is not held"
This reverts e28e22b9e1e4f7243608aa24ddf43ec63afd1751
The problem that that commit was fixing (pytest failure)
no longer appears with the current code, so the commit is
not needed.
That commit is a problem for pvchange, because it prevents
lvmcache from retaining VG metadata even while the global
lock is held. pvchange holds the global lock to ensure
that VG metadata is kept in lvmcache throughout processing.
If the cache is not kept, a PV with zero MDAs will appear
first in its actual VG and then appear again in the orphan VG.
It wrongly appears a second time in the orphan VG only if
the actual VG is dropped from lvmcache.
Peter Rajnoha [Thu, 14 Jan 2016 15:54:12 +0000 (16:54 +0100)]
report: add kernel_discards report field to display thin pool discard used in kernel
Thin pool discard mode set in metadata can be different from the one
actually used if any device underneath does not support that mode. Add
kernel_discard report field to make it possible to see this difference.