sourceware.org Git - lvm2.git/log

config: add support for enhanced config node output

There's a possibility to interconnect the dm_config_node with an
ID, which in our case is used to reference the configuration
definition ID from config_settings.h. So simply interconnecting
struct dm_config_node with struct cfg_def_item.

This patch also adds support for enhanced config node output besides
existing "output line by line". This patch adds a possibility to
register a callback that gets called *before* the config node is
processed line by line (for example to include any headers on output)
and *after* the config node is processed line by line (to include any
footers on output). Also, it adds the config node reference itself
as the callback arg in addition to have a possibility to extract more
information from the config node itself if needed when processing the
output callback (e.g. the key name, the id, or whether this is a
section or a value etc...).

If the config node from lvm.conf/--config tree is recognized and valid,
it's always coupled with the config node definition ID from
config_settings.h:

struct dm_config_node {
   int id;
   const char *key;
   struct dm_config_node *parent, *sib, *child;
   struct dm_config_value *v;
}

For example if the dm_config_node *cn holds "devices/dev" configuration,
then the cn->id holds "devices_dev_CFG" ID from config_settings.h, -1 if
not found in config_settings.h and 0 if matching has not yet been done.

To support the enhanced config node output, a new structure has been
defined in libdevmapper to register it:

  struct dm_config_node_out_spec {
    dm_config_node_out_fn prefix_fn; /* called before processing config node lines */
    dm_config_node_out_fn line_fn; /* called for each config node line */
    dm_config_node_out_fn suffix_fn; /* called after processing config node lines */
  };

Where dm_config_node_out_fn is:

  typedef int (*dm_config_node_out_fn)(const struct dm_config_node *cn, const char *line, void *baton);

(so in comparison to existing callbacks for config node output, it has
an extra dm_config_node *cn arg in addition)

This patch also adds these functions to libdevmapper:
  - dm_config_write_node_out
  - dm_config_write_one_node_out

...which have exactly the same functionality as their counterparts
without the "out" suffix. The "*_out" functions adds the extra hooks
for enhanced config output (prefix_fn and suffix_fn mentioned above).

One can still use the old interface for config node output, this is
just an enhancement for those who'd like to modify the output more
extensively.

dumpconfig: add --type, --atversion and --validate arg

lvm dumpconfig [--type {current|default|missing|new}] [--atversion] [--validate]

This patch adds above-mentioned args to lvm dumpconfig and it maps them
to creation and writing out a configuration tree of a specific type
(see also previous commit):

  - current maps to CFG_TYPE_CURRENT
  - default maps to CFG_TYPE_DEFAULT
  - missing maps to CFG_TYPE_MISSING
  - new maps to CFG_TYPE_NEW

If --type is not defined, dumpconfig defaults to "--type current"
which is the original behaviour of dumpconfig before all these changes.

The --validate option just validates current configuration tree
(lvm.conf/--config) and it writes a simple status message:

  "LVM configuration valid" or "LVM configuration invalid"

config: use config checks and add support for creating trees from config definition (config_def_create_tree fn)

Configuration checking is initiated during config load/processing
(_process_config fn) which is part of the command context
creation/refresh.

This patch also defines 5 types of trees that could be created from
the configuration definition (config_settings.h), the cfg_def_tree_t:

  - CFG_DEF_TREE_CURRENT that denotes a tree of all the configuration
    nodes that are explicitly defined in lvm.conf/--config

  - CFG_DEF_TREE_MISSING that denotes a tree of all missing
    configuration nodes for which default valus are used since they're
    not explicitly used in lvm.conf/--config

  - CFG_DEF_TREE_DEFAULT that denotes a tree of all possible
    configuration nodes with default values assigned, no matter what
    the actual lvm.conf/--config is

  - CFG_DEF_TREE_NEW that denotes a tree of all new configuration nodes
    that appeared in given version

  - CFG_DEF_TREE_COMPLETE that denotes a tree of the whole configuration
    tree that is used in LVM2 (a combination of CFG_DEF_TREE_CURRENT +
    CFG_DEF_TREE_MISSING). This is not implemented yet, it will be added
    later...

The function that creates the definition tree of given type:

  struct dm_config_tree *config_def_create_tree(struct config_def_tree_spec *spec);

Where the "spec" specifies the tree type to be created:

  struct config_def_tree_spec {
    cfg_def_tree_t type; /* tree type */
    uint16_t version; /* tree at this LVM2 version */
    int ignoreadvanced; /* do not include advanced configs */
    int ignoreunsupported; /* do not include unsupported configs */
  };

This tree can be passed to already existing functions that write
the tree on output (like we already do with cmd->cft).

There is a new lvm.conf section called "config" with two new options:

  - config/checks which enables/disables checking (enabled by default)

  - config/abort_on_errors which enables/disables aborts on any type of
    mismatch found in the config (disabled by default)

config: add support for configuration check (config_def_check fn)

Add support for configuration checking - type checking and recognition
of registered configuration settings that LVM2 understands and also
check the structure of the configuration. Log error on any mismatch
found.

A hash over all allowed configuration paths is created which helps
with matching the exact configuration (lvm.conf/--config tree) with
the configuration item definition from config_settings.h in an
efficient and one-step way.

Two more helper flags are introduced for each configuration definition
item:

  - CFG_USED which marks the item as being used (lvm.conf/--config)
    This helps with identifying missing configuration options
    (and for which defaults were used) when traversing the tree later.

  - CFG_VALID which denotes that the item has already been checked and
    it was found valid. This improves performance, so if the check
    is called once again on the same tree which was not reloaded, we
    can just return the state from previous check (with a possibility
    to force the check if needed).

The new function that config.h exports and which is going to be used
to perform the configuration checking is:

  int config_def_check(struct cmd_context *cmd, int force, int skip, int suppress_messages)

...which is exported internally via config.h.

libdevmapper: add dm_config_value_is_bool function

Export this functionality from libdevmapper just for
convenience and general use when reading boolean values
which could be defined either in a numeric way with 0/1
or by using strings with "true"/"false", "yes"/"no",
"on"/"off", "y"/"n".

config: refer to config nodes using assigned IDs

For example, the old call and reference:

find_config_tree_str(cmd, "devices/dir", DEFAULT_DEV_DIR)

...now becomes:

find_config_tree_str(cmd, devices_dir_CFG)

So we're referring to the named configuration ID instead
of passing the configuration path and the default value
is taken from central config definition in config_settings.h
automatically.

config: add structs to represent config definition and register config_settings.h content

This patch adds basic structures that encapsulate the config_settings.h
content - it takes each item and puts it in structures:

  - cfg_def_type_t to define config item type

  - cfg_def_value_t to define config item (default) value

  - flags used to define the nature and use of the config item:
      - CFG_NAME_VARIABLE for items with variable names (e.g. tags)
      - CFG_ALLOW_EMPTY for items where empty value is allowed
      - CFG_ADVANCED for items which are considered as "advanced settings"
      - CFG_UNSUPPORTED for items which are not officially supported
        (config options mostly for internal use and testing/debugging)

  - cfg_def_item_t to encapsulate the whole definition of the config
    definition itself

Each config item is referenced by named ID, e.g. "devices_dir_CFG"
instead of directly typing the path "devices/dir" as it was before.

This patch also adds cfg_def_get_path helper function to get the
config setting path up to the root for given config ID
(it returns the path in form of "abc/def/.../xyz" where the "abc"
is the topmost element).

config: add config_settings.h

This file centrally defines all recognized LVM2 configuration
sections and settings. Each item here has its parent, set of
allowed types, default value, brief comment, version the setting
first appeared in and flags that further define the nature of
the configuration setting and its use.

config: add vsn macro

The 'vsn' macro encodes the LVM2 version major, minor
and patchlevel number in a packed form using 16 bits.

config: fix config node lookup inside empty sections to not return the section itself

When a section was empty in a configuration tree (no children - this is
allowed) and we were looking for a config node inside that section, the
_find_config_node function incorrectly returned the section itself if
the node inside that section was not found.

For example the configuration below:

The config:
    abc {
    }

And a function call to get the "def" node inside "abc" section:
     _find_config_node(..., "abc/def")

...returned the "abc" node instead of NULL ("def" not found).

This in turn caused segfaults in the code using lookups in such
a configuration tree as we (correctly) expected that the node
returned was always the one we were looking for or NULL if not
found. But if incorrect node was returned instead, we processed
that as if this was the node we were looking for and so we
processed its value as well. But sections don't have values => segfault.

cleanup: remove struct pv_header_extension reference from struct pv_header

Just to prevent accidental and improper use when reading the layout
from disk because of the already existing disk_areas_xl[0] lists
that are variable in size. We can read pv_header_extension only
after we know exactly where the lists end...

lvmetad: fix to properly process embedding area

WHATS_NEW: add lines for embedding area support

man: update pvcreate/pvs/vgconvert man page to cover embedding area support

report: add reporting fields for Embedding Area start and size

There are new reporting fields for Embedding Area: ea_start and ea_size.

An example of 1m Embedding Area and relevant reporting fields:
raw/~ # pvs -o pv_name,pe_start,ea_start,ea_size
PV 1st PE EA start EA size
/dev/sda 2.00m 1.00m 1.00m

tools: add embeddingareasize arg to pvcreate and vgconvert

To create an Embedding Area during PV creation (pvcreate or as part of
the vgconvert operation), we need to define the Embedding Area size.
The Embedding Area start will be calculated automatically by the tools.

This patch adds --embeddingareasize argument to pvcreate and vgconvert.

pv_header_extension: add support for writing PV header extension (flags & Embedding Area)

The PV header extension information (PV header extension version, flags
and list of Embedding Area locations) is stored just beyond the PV header base.

When calculating the Embedding Area start value (ea_start), the same logic is
used as when calculating the pe_start value for Data Area - the value must
follow exactly the same alignment restrictions for its start value
(the alignment detected automatically or provided via command line using
the --dataalignment and --dataalignmentoffset arguments).

The Embedding Area is placed at the very start of the PV, starting at
ea_start. The Data Area starting at pe_start is placed next. The pe_start is
still properly aligned. Due to the pe_start alignment, it's possible that the
resulting Embedding Area size (ea_size) ends up bigger in size than requested
(but never less than requested).

pv_header_extension: add support for reading PV header extension (flags & Embedding Area)

New tools with PV header extension support will read the extension
if it exists and it's not an error if it does not exist (so old PVs
will still work seamlessly with new tools).

Old tools without PV header extension support will just ignore any
extension.

As for the Embedding Area location information (its start and size),
there are actually two places where this is stored:
  - PV header extension
  - VG metadata

The VG metadata contains a copy of what's written in the PV header
extension about the Embedding Area location (NULL value is not copied):

    physical_volumes {
        pv0 {
          id = "AkSSRf-difg-fCCZ-NjAN-qP49-1zzg-S0Fd4T"
          device = "/dev/sda"     # Hint only

          status = ["ALLOCATABLE"]
          flags = []
          dev_size = 262144       # 128 Megabytes
          pe_start = 67584
          pe_count = 23   # 92 Megabytes
          ea_start = 2048
          ea_size = 65536 # 32 Megabytes
        }
    }

The new metadata fields are "ea_start" and "ea_size".
This is mostly useful when restoring the PV by using existing
metadata backups (e.g. pvcreate --restorefile ...).

New tools does not require these two fields to exist in VG metadata,
they're not compulsory. Therefore, reading old VG metadata which doesn't
contain any Embedding Area information will not end up with any kind
of error but only a debug message that the ea_start and ea_size values
were not found.

Old tools just ignore these extra fields in VG metadata.

pv_header_extension: add supporting infrastructure for PV header extension (flags & Embedding Area)

PV header extension comes just beyond the existing PV header base:

PV header base (existing):
- uuid
- device size
- null-terminated list of Data Areas
- null-terminater list of MetaData Areas

PV header extension:
- extension version
- flags
- null-terminated list of Embedding Areas

This patch also adds "eas" (Embedding Areas) list to lvmcache (lvmcache_info)
and it also adds support for common operations on the list (just like for
already existing "das" - Data Areas list):
- lvmcache_add_ea
- lvmcache_update_eas
- lvmcache_foreach_ea
- lvmcache_del_eas

Also, add ea_start and ea_size to struct physical_volume for processing
PV Embedding Area location throughout the code (currently only one
Embedding Area is supported, though the definition on disk allows for
more if needed in the future...).

Also, define FMT_EAS format flag to mark that the format actually
supports Embedding Areas (currently format-text only).

cleanup: use struct pvcreate_restorable_params throughout

cleanup: add struct pvcreate_restorable_params and move relevant items from pvcreate_params

Extract restorable PV creation parameters from struct pvcreate_params into
a separate struct pvcreate_restorable_params for clarity and also for better
maintainability when adding any new items later.

activation: fix pvmove partial tree creation

Do not try to add LV again into the partial tree, if it's been
already added. Otherwise we may end in endless loop.

thin: lvconvert support for external origin

Add basic support for converting LV into an external origin volume.

Syntax:

lvconvert --thinpool vg/pool --originname renamed_origin -T origin

It will convert volume 'origin' into a thin volume, which will
use 'renamed_origin' as an external read-only origin.
All read/write into origin will go via 'pool'.

renamed_origin volume is read-only volume, that could be activated
only in read-only mode, and cannot be modified.

thin: removal of external_origin

thin: report external origin

Use the field 'origin' for reporting external origin lv name.

For thin volumes with external origin, report the size of
external origin size via:

lvs -o+origin_size

thin: external origin cannot be changed

Do not allow conversion of external origin into writeable LV,
and prohibit changing the external origin size.

If the snapshot origin is also external origin, merge is prohibited.

thin: add support for external origin

Add internal support for thin volume's external origin.

lvremove: easier removal of dependent lvs

Add function to remove lvs which are depending on removed lv
prior the lv is removed.

User is asked for confirmation.

activation: simplify activation code

Reorder activation code to look similar for preload tree and
activation tree.

Its also give much better suppport for device stacking,
since now we also support activation of snapshot which might
be then used for other devices.

activation: add _add_layer_target_to_dtree

Add function for creation of simple linear mapping over layer device.

thin: replace _thin_layer with lv_layer()

Use consitently lv_layer function internally for thin pool layer name.

activation: extend _cached_info

Add layer string to support check of layered devices.

configure: Respect pkg-config --cflags for udev.

conf: add missing '=' for raid10_segtype_default

RAID:  Make 'lvchange --refresh' restore transiently failed RAID PVs

A new function (dm_tree_node_force_identical_table_reload) was added to
avoid the suppression of identical table reloads.  This allows RAID LVs
to reload the on-disk superblock information that contains which devices
have failed and the bitmaps.  If the failed device has returned, this has
the effect of restoring the device and initiating recovery.  Without this
patch, the user had to completely deactivate their RAID LV and re-activate
it in order to restore the failed device.  Now they simply need to
suspend and resume (which is done by 'lvchange --refresh').

The identical table suppression is only avoided if the LV is not PARTAIL
(i.e. all of it's devices can be seen and read by LVM) and the kernel
status of the array contains failed devices.  In other words, the function
will only be called in the case where we may have success in restoring
a failed device in the array.

vgimport:  Allow '--force' to import VGs with missing PVs.

When there are missing PVs in a volume group, most operations that alter
the LVM metadata are disallowed.  It turns out that 'vgimport' is one of
those disallowed operations.  This is bad because it creates a circular
dependency.  'vgimport' will complain that the VG is inconsistent and that
'vgreduce --removemissing' must be run.  However, 'vgreduce' cannot be run
because it has not been imported.  Therefore, 'vgimport' must be one of
the operations allowed to change the metadata when PVs are missing.  The
'--force' option is the way to make 'vgimport' happen in spite of the
missing PVs.

pvcreate: fix alignment to incorporate alignment offset if PV has 0 MDAs

If zero metadata copies are used, there's no further recalculation of
PV alignment that happens when adding metadata areas to the PV and
which actually calculates the alignment correctly as a matter of fact.
So fix this for "PV without MDA" case as well.

Before this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
  Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
  PV         1st PE
  /dev/sda    12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
  Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
  PV         1st PE
  /dev/sda     8.00m

After this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
  Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
  PV         1st PE
  /dev/sda    12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
  Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
  PV         1st PE
  /dev/sda    12.00m

Also, remove a superfluous condition "pv->pe_start < pv->pe_align" in:
  if (pe_start == PV_PE_START_CALC && pv->pe_start < pv->pe_align)
    pv->pe_start = pv->pe_align ...
This part of the condition is not reachable as with the PV_PE_START_CALC,
we always have pv->pe_start set to 0 from the PV struct initialisation
(...the pv->pe_start value is just being calculated).

RAID: Add new 'raid10_segtype_default' setting in lvm.conf

If '--mirrors/-m' and '--stripes/-i' are used together when creating
a logical volume, mirrors-over-stripes is currently chosen. The user
can override this by using the '--type raid10' option on creation.
However, we want a place where we can set the default behavior to
'raid10' explicitly - similar to the "mirror" and "raid1" tunable,
mirror_segtype_default.

A follow-on patch should use this new setting to change the default
from "mirror" to "raid10", as this is the preferred segment type.

clean-up: Remove a FIXME question that has been settled

It is ok for us to use the shorthand 'lv_is_virtual' to detect error
targets in a RAID LV when searching for candidates for device replacement.

RAID:  Allow remove/replace of sub-LVs composed of error segments.

When a device fails, we may wish to replace those segments with an
error segment.  (Like when a 'vgreduce --removemissing' removes a
failed device that happens to be a RAID image/meta.)  We are then left
with images that we will eventually want to remove or replace.

This patch allows us to pull out these virtual "error" sub-LVs.  This
allows a user to 'lvconvert -m -1 vg/lv' to extract the bad sub-LVs.
Sub-LVs with error segments are considered for extraction before other
possible devices so that good devices are not accidentally removed.

This patch also adds the ability to replace RAID images that contain error
segments.  The user will still be unable to run 'lvconvert --replace'
because there is no way to address the 'error' segment (i.e. no PV
that it is associated with).  However, 'lvconvert --repair' can be
used to replace the image's error segment with a new PV.  This is also
the most appropriate way to do it, since the LV will continue to be
reported as 'partial'.

RAID: Make 'vgreduce --removemissing' work with RAID LVs

Currently it is impossible to remove a failed PV which has a RAID LV
on it. This patch fixes the issue by replacing the failed PV with an
'error' segment within the affected sub-LVs. Once there is no longer
a RAID LV using the PV, it can be removed.

Most often, it is better to replace a failed RAID device with a spare.
(You can use 'lvconvert --repair <vg>/<LV>' to accomplish that.)
However, if there are no spares in the volume group and none will be
added, it is useful to be able to removed the failed device.

Following patches address the ability to perform 'lvconvert' operations
on RAID LVs that contain sub-LVs composed of 'error' segments.

clean-up:  Rename lvm.conf setting 'mirror_region_size' to 'raid_region_size'

We have been using 'mirror_region_size' in lvm.conf as the default region
size for RAID logical volumes as well as mirror logical volumes.  Since,
"raid" is more inclusive and representative than "mirror", I have changed
the name of this setting.  We must still check for the old setting and warn
the user if we are overriding it with the new setting if both happen to be
present.

fix: 'Couldn't read extent size' --> '... extent start'

report: fix pvs -o pv_free reporting for PVs with 0 PEs

[0] raw/~ # lsblk -o NAME,SIZE /dev/sda
NAME  SIZE
sda   128M

[0] raw/~ # pvcreate --dataalignment 128m /dev/sda
  Physical volume "/dev/sda" successfully created

[0] raw/~ # vgcreate vg /dev/sda
  Volume group "vg" successfully created

[0] raw/~ # lvcreate -l1 vg
  Volume group "vg" has insufficient free space (0 extents): 1 required.

Before this patch:
[0] raw/~ # pvs -o pv_name,pv_free
  PV         PFree
  /dev/sda   128.00m

After this patch:
[0] raw/~ # pvs -o pv_name,pv_free
  PV         PFree
  /dev/sda      0

cleanup: old style gcc

cleanup: preserve signesss and type size on return values

thin: update pool_is_active

Change it to take LV and move it to exported header - seems
to be a better fit for usability from tools/ directory.

thin: properly unmark volume after detach

When the volume is detached form thin pool,
unmask THIN_VOLUME flag and reset related pointers.

test: update using exclusive activation

For testing in cluster exclusive activation of origin is needed.

thin: fix forbidden discards checks

Instead of check for lv_is_active() for thin pool LV,
query the whole pool via new pool_is_active().

Fixes a problem when we cannot change discards settings
for active pool device where the actual layer for pool
device was inactive, but thin volumes using thin pool
have been active.

thin: add function pool_is_active

This internal function check for active pool device.
For cluster it checks every thin volume,
On the non-clustered VG we need to check just
for presence of -tpool device.

report: leave empty report field for 0

Since we do not support LVs with 0 size, use this value
as 'error' value for devices without origin, and leave this
field blank as in other cases.

lvconvert: update error path

Update the error path after problems with suspend_lv or vg_commit.
It's not exactly well defined what should happen, and this
code seems to appear in many different instancies<F2> in the
whole source code tree - we should probably pick the best version.

lvconvert: fix accepting second lv name

Do not allow to accept second LV name on lvconvert --thinpool
command line.

headers: add headers for musl libc

On glibc, those are erroneously (namespace pollution) pulled in via
other headers. this doesn't work with conformant libcs (musl libc in
this case), we simply need to include all needed headers.

Signed-Off-By: John Spencer <maillist-lvm@barfooze.de>

cleanup: avoid overflow warning

Even we do not expect to support chunks bigger then 2GB.

cleanup: remove extra braces

cleanup: malloc attributes

cleanup: add internal error check

Check if 'is_removable' is defined and report internal error,
if it's missing.

clean-up: Another functiont that can use 'lv_layer'

lib/activate/dev_manager.c:dev_manager_raid_status() can also use
the new 'lv_layer' function.

thin: use noflush for obtaining transaction_id

Do not flush thin pool data, when reading transation_id status.

cleanup: comment update

Just update code comment and use single line if().

cleanup: indent line

libdm: support newer thin pool status parameters

Support read_only and discards information.

test: regression test for lvmetad error

thin: replace is_active with send_messages

Since is_active is only used for thinp
replace struct member with more meaningful
send_messages flag

use lv_layer

activate: add lv_layer function

Add function to return layer name for LV.

libdm: fix segault for truncated string token.

This patch fixes problem reported here:

https://www.redhat.com/archives/dm-devel/2013-January/msg00311.html

Fixing it by separating function for duplicating string token.

---
When /etc/lvm/lvm.conf is truncated at the first '"' of a line, all LVM
utilities crash with a segfault.

The segfault only seems to occur if the last character is the first '"'
(double quote) of a line. If you truncate it at any other point, lvm
detects the error and report parse error

lvm.conf ends like this.

$hexdump -C lvm.conf
....
69 72 20 3d 20 22 2f 64  65 76 22 0a 0a 0a 20 20  |ir = "/dev"...  |
20 20 23 20 41 6e 20 61  72 72 61 79 20 6f 66 20  |  # An array of |
64 69 72 65 63 74 6f 72  69 65 73 20 74 68 61 74  |directories that|
20 63 6f 6e 74 61 69 6e  20 74 68 65 20 64 65 76  | contain the dev|
69 63 65 20 6e 6f 64 65  73 20 79 6f 75 20 77 69  |ice nodes you wi|
73 68 0a 20 20 20 20 23  20 74 6f 20 75 73 65 20  |sh.    # to use |
77 69 74 68 20 4c 56 4d  32 2e 0a 20 20 20 20 73  |with LVM2..    s|
63 61 6e 20 3d 20 5b 20  22 2f 78 22 2c 0a 20 20  |can = [ "/x",.  |
20 20 20 20 20 20 20 20  20 20 20 22              | "|
...

Reported-by: dongmao zhang <dmzhang suse com>

cleanup: postpone lv_is_thin_volume check

Code move to make it easier to follow and
call _add_dev_to_dtree() in the separate if() branch
for thin volumes.

WHATS_NEW: Better description of previous change

RAID:  Improve 'lvs' attribute reporting of RAID LVs and sub-LVs

There are currently a few issues with the reporting done on RAID LVs and
sub-LVs.  The most concerning is that 'lvs' does not always report the
correct failure status of individual RAID sub-LVs (devices).  This can
occur when a device fails and is restored after the failure has been
detected by the kernel.  In this case, 'lvs' would report all devices are
fine because it can read the labels on each device just fine.
Example:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
  LV            VG   Attr      Cpy%Sync Devices
  lv            vg   rwi-a-r--   100.00 lv_rimage_0(0),lv_rimage_1(0)
  [lv_rimage_0] vg   iwi-aor--          /dev/sda1(1)
  [lv_rimage_1] vg   iwi-aor--          /dev/sdb1(1)
  [lv_rmeta_0]  vg   ewi-aor--          /dev/sda1(0)
  [lv_rmeta_1]  vg   ewi-aor--          /dev/sdb1(0)

However, 'dmsetup status' on the device tells us a different story:
  [root@bp-01 lvm2]# dmsetup status vg-lv
  0 1024000 raid raid1 2 DA 1024000/1024000

In this case, we must also be sure to check the RAID LVs kernel status
in order to get the proper information.  Here is an example of the correct
output that is displayed after this patch is applied:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
  LV            VG   Attr      Cpy%Sync Devices
  lv            vg   rwi-a-r-p   100.00 lv_rimage_0(0),lv_rimage_1(0)
  [lv_rimage_0] vg   iwi-aor-p          /dev/sda1(1)
  [lv_rimage_1] vg   iwi-aor--          /dev/sdb1(1)
  [lv_rmeta_0]  vg   ewi-aor-p          /dev/sda1(0)
  [lv_rmeta_1]  vg   ewi-aor--          /dev/sdb1(0)

The other case where 'lvs' gives incomplete or improper output is when a
device is replaced or added to a RAID LV.  It should display that the RAID
LV is in the process of sync'ing and that the new device is the only one
that is not-in-sync - as indicated by a leading 'I' in the Attr column.
(Remember that 'i' indicates an (i)mage that is in-sync and 'I' indicates
an (I)mage that is not in sync.)  Here's an example of the old incorrect
behaviour:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
  LV            VG   Attr      Cpy%Sync Devices
  lv            vg   rwi-a-r--   100.00 lv_rimage_0(0),lv_rimage_1(0)
  [lv_rimage_0] vg   iwi-aor--          /dev/sda1(1)
  [lv_rimage_1] vg   iwi-aor--          /dev/sdb1(1)
  [lv_rmeta_0]  vg   ewi-aor--          /dev/sda1(0)
  [lv_rmeta_1]  vg   ewi-aor--          /dev/sdb1(0)
[root@bp-01 lvm2]# lvconvert -m +1 vg/lv; lvs -a -o name,vg_name,attr,copy_percent,devices vg
  LV            VG   Attr      Cpy%Sync Devices
  lv            vg   rwi-a-r--     0.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
  [lv_rimage_0] vg   Iwi-aor--          /dev/sda1(1)
  [lv_rimage_1] vg   Iwi-aor--          /dev/sdb1(1)
  [lv_rimage_2] vg   Iwi-aor--          /dev/sdc1(1)
  [lv_rmeta_0]  vg   ewi-aor--          /dev/sda1(0)
  [lv_rmeta_1]  vg   ewi-aor--          /dev/sdb1(0)
  [lv_rmeta_2]  vg   ewi-aor--          /dev/sdc1(0)                            ** Note that all the images currently are marked as 'I' even though it is
   only the last device that has been added that should be marked.

Here is an example of the correct output after this patch is applied:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
  LV            VG   Attr      Cpy%Sync Devices
  lv            vg   rwi-a-r--   100.00 lv_rimage_0(0),lv_rimage_1(0)
  [lv_rimage_0] vg   iwi-aor--          /dev/sda1(1)
  [lv_rimage_1] vg   iwi-aor--          /dev/sdb1(1)
  [lv_rmeta_0]  vg   ewi-aor--          /dev/sda1(0)
  [lv_rmeta_1]  vg   ewi-aor--          /dev/sdb1(0)
[root@bp-01 lvm2]# lvconvert -m +1 vg/lv; lvs -a -o name,vg_name,attr,copy_percent,devices vg
  LV            VG   Attr      Cpy%Sync Devices
  lv            vg   rwi-a-r--     0.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
  [lv_rimage_0] vg   iwi-aor--          /dev/sda1(1)
  [lv_rimage_1] vg   iwi-aor--          /dev/sdb1(1)
  [lv_rimage_2] vg   Iwi-aor--          /dev/sdc1(1)
  [lv_rmeta_0]  vg   ewi-aor--          /dev/sda1(0)
  [lv_rmeta_1]  vg   ewi-aor--          /dev/sdb1(0)
  [lv_rmeta_2]  vg   ewi-aor--          /dev/sdc1(0)
** Note only the last image is marked with an 'I'.  This is correct and we can
   tell that it isn't the whole array that is sync'ing, but just the new
   device.

It also works under snapshots...
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
  LV            VG   Attr      Cpy%Sync Devices
  lv            vg   owi-a-r-p    33.47 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
  [lv_rimage_0] vg   iwi-aor--          /dev/sda1(1)
  [lv_rimage_1] vg   Iwi-aor-p          /dev/sdb1(1)
  [lv_rimage_2] vg   Iwi-aor--          /dev/sdc1(1)
  [lv_rmeta_0]  vg   ewi-aor--          /dev/sda1(0)
  [lv_rmeta_1]  vg   ewi-aor-p          /dev/sdb1(0)
  [lv_rmeta_2]  vg   ewi-aor--          /dev/sdc1(0)
  snap          vg   swi-a-s--          /dev/sda1(51201)

RAID:  Cache previous results of lv_raid_dev_health for future use

We can avoid many dev_manager (ioctl) calls by caching the results of
previous calls to lv_raid_dev_health.  Just considering the case where
'lvs -a' is called to get the attributes of a RAID LV and its sub-lvs,
this function would be called many times.  (It would be called at least
7 times for a 3-way RAID1 - once for the health of each sub-LV and once
for the health of the top-level LV.)  This is a good idea because the
sub-LVs are processed in groups along with their parent RAID LV and in
each case, it is the parent LV whose status will be queried.  Therefore,
there only needs to be one trip through dev_manager for each time the
group is processed.

RAID:  Add RAID status accessibility functions

Similar to the way thin* accesses its kernel status, we add a method
for RAID to grab the various values in its status output without the
higher levels (LVM) having to understand how to parse the output.
Added functions include:
        - lib/activate/dev_manager.c:dev_manager_raid_status()
          Pulls the status line from the kernel

        - libdm/libdm-deptree.c:dm_get_status_raid()
          Parses status line and puts components into dm_status_raid struct

        - lib/activate/activate.c:lv_raid_dev_health()
          Accesses dm_status_raid to deliver raid dev_health string

The new structure and functions can provide a more unified way to access
status information.  ('lv_raid_percent' could switch to using these
functions, for example.)

Test (RAID): Test for RAID10 activations when devices are missing

Test the fix for bug 889358. RAID10 had been failing to activate when
there were devices that had failed in more than one mirror set.

blkdeactivate: prevent trying to unmount the same mountpoint more times

An addendum to previous commit 1052863a1b35f7488758c78b3a9ebef5c63392bc.

blkdeactivate: fix handling of nested mountpoints and mangled mount paths.

If there was a nested mountpoint inside an existing mount path,
blkdeactivate could fail to unmount such a mountpoint as it
needs to deactivate the deepest path first and continue upwards.

For example the simplest reproducer:

[root@rhel6-a ~]# lsblk
NAME                        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                           8:0    0    4G  0 disk
|-vg-lvol0 (dm-2)           253:2    0   32M  0 lvm  /mnt/a
`-vg-lvol1 (dm-3)           253:3    0   32M  0 lvm  /mnt/a/b

Before this patch:

[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
  UMOUNT: unmounting vg-lvol0 (dm-2) mounted on /mnt/a
umount: /mnt/a: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
  UMOUNT: unmounting vg-lvol1 (dm-3) mounted on /mnt/a/b
  LVM: deactivating Logical Volume vg/lvol1

(deactivation of vg/lvol0 is skipped as /mnt/a that is on lvol0
can't be unmounted - it still has /mnt/a/b as nested mountpoint!)

With this patch applied:

[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
  UMOUNT: unmounting vg-lvol1 (dm-3) mounted on /mnt/a/b
  UMOUNT: unmounting vg-lvol0 (dm-2) mounted on /mnt/a
  LVM: deactivating Logical Volume vg/lvol0
  LVM: deactivating Logical Volume vg/lvol1

===

Also, this patch contains a fix for processing mangled mount paths:

[root@rhel6-a ~]# lsblk
NAME                        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                           8:0    0    4G  0 disk
`-vg-lvol0 (dm-2)           253:2    0   32M  0 lvm  /mnt/x y z

[root@rhel6-a ~]# lsblk -r
vg-lvol0 253:2 0 32M 0 lvm /mnt/x\x20y\x20z

(the mount path is mangled with \xNN that is visible in raw
lsblk output only and which is used in blkdeactive as well)

Before this patch:

[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
  umount: /mnt/x\x20y\x20z: not found

After this patch applied:

[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
  UMOUNT: unmounting vg-lvol0 (dm-2) mounted on /mnt/x\x20y\x20z
  LVM: deactivating Logical Volume vg/lvol0

locales: use higher prio LC_ALL variable

For reseting locale environment into significantly less memory
consuming version 'C' - use LC_ALL instead of LANG since it has
higher priority in locale settings.

Otherwise we may observe whole locale-archive which might be
over 100MB on i.e. Fedora systems locked in memory with
some daemons.

Update WHATS_NEW.

lvmetad: Call _lvmetad_handle_reply in lvmetad_vg_lookup.

lvmetad: Fix a race in metadata update.

The idea is to avoid a period when an existing VG is not mapped to either the
old or the new name. (Note that the brief "blackout" was present even if the
name did not actually change.) We instead allow a brief overlap of a VG existing
under both names, i.e. a query for a VG might succeed but before a lock is
acquired the VG disappears.

dmeventd: close dmeventd FIFO FDs on exec (add FD_CLOEXEC).

whatsnew

filters: add scm devices

Fix this:
pvcreate /dev/scma
Device /dev/scma not found (or ignored by filtering).

Reported-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>

logging: classify log_debug messages

Place most log_debug() messages into a class.

logging: add debug classes

Add log/debug_classes to lvm.conf to allow debug messages to be
classified and filtered at runtime.

The dm_errno field is only used by log_error(), so I've redefined it
for log_debug() messages to hold the message class.

By default, all existing messages appear, but we can add categories that
generate high volumes of data, such as logging all traffic to/from
lvmetad.

lvmetad: fix format1 updates

fmt1 doesn't have a separate commit function: updates take effect
immediately vg_write is called, so we must update lvmetad at this
point if we're going to go on and ask lvmetad for the VG metadata
again before calling the commit function (though that's probably an
unsupported and pointless thing to do anyway as the client must
already have that data and it cannot have changed because it's locked
and with devs suspended we shouldn't be communicating with lvmetad;
so when that's fixed properly, this fix here can be reverted).

This problem showed up as an internal error when lvremoving an LVM1
snapshot.

> Internal error: LV snap1 (00000000000000000000000000000001) missing from preload metadata

https://bugzilla.redhat.com/891855

lvmetad: add basic client-side debug logging

First attempt at showing precisely what use any command is making of
lvmetad in the -vvvv trace information.

lvmetad: rename device vars and move _token_update

Move _token_update() to avoid the need for _lvmetad_send prototype.

Use 'dev' consistently for a struct device * variable.
Use 'devno' for a dev_t.

libdaemon: add logging to daemon_open

Log all conditions encountered in daemon_open().

Only store errno when known to be set.

lvmetad: improve client logging when connecting

Rename lvmetad_warning() to lvmetad_connect_or_warn().

Log all connection attempts on the client side, whether successful or not.

Reduce some nesting and remove a redundant assertion.

lvmetad: lvm depends on libdaemonclient.a

Rebuild lvm binary if libdaemonclient.a changes.

pvscan: synchronize with udev if pvscan --cache is used.

We need to call sync_local_dev_names directly as pvscan uses
VG_GLOBAL lock and this one *does not* cause the synchronization
(sync_dev_names) to be called on unlock (VG_GLOBAL is not a real VG):

define unlock_vg(cmd, vol)
  do { \
    if (is_real_vg(vol)) \
      sync_dev_names(cmd); \
    (void) lock_vol(cmd, vol, LCK_VG_UNLOCK); \
  } while (0)

Without this fix, we end up without udev synchronization for the
pvscan --cache (mainly for -aay that causes the VGs/LVs to be
autoactivated) and also udev synchronization cookies are then left
in the system since they're not managed properly (code before sets
up udev sync cookies, but we have to call dm_udev_wait at least once
after that to do the wait and cleanup).

activation: fix autoactivation to not trigger on each PV change

Before, the pvscan --cache -aay was called on each ADD and CHANGE
uevent (for a device that is not a device-mapper device) and each CHANGE
event (for a PV that is a device-mapper device).

This causes troubles with autoactivation in some cases as CHANGE event
may originate from using the OPTION+="watch" udev rule that is defined
in 60-persistent-storage.rules (part of the rules provided by udev
directly) and it's used for all block devices
(except fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md* devices). For example, the
following sequence incorrectly activates the rest of LVs in a VG if one
of the LVs in the VG is being removed:

[root@rhel6-a ~]# pvcreate /dev/sda
  Physical volume "/dev/sda" successfully created

[root@rhel6-a ~]# vgcreate vg /dev/sda
  Volume group "vg" successfully created

[root@rhel6-a ~]# lvcreate -l1 vg
  Logical volume "lvol0" created

[root@rhel6-a ~]# lvcreate -l1 vg
  Logical volume "lvol1" created

[root@rhel6-a ~]# vgchange -an vg
  0 logical volume(s) in volume group "vg" now active

[root@rhel6-a ~]# lvs
  LV      VG        Attr      LSize   Pool Origin Data%  Move Log
Cpy%Sync Convert
  lvol0   vg        -wi------   4.00m
  lvol1   vg        -wi------   4.00m

[root@rhel6-a ~]# lvremove -ff vg/lvol1
  Logical volume "lvol1" successfully removed

[root@rhel6-a ~]# lvs
  LV      VG        Attr      LSize   Pool Origin Data%  Move Log
Cpy%Sync Convert
  lvol0   vg        -wi-a----   4.00m

...so the vg was deactivated, then lvol1 removed, and we end up with
lvol1 removed (which is ok) BUT with lvol0 activated (which is wrong)!!!
This is because after lvol1 removal, we need to write metadata to the
underlying device /dev/sda and that causes the CHANGE event to be
generated (because of the WATCH udev rule set on this device) and this
causes the pvscan --cache -aay to be reevaluated.

We have to limit this and call pvscan --cache -aay to autoactivate
VGs/LVs only in these cases:

--> if the *PV is not a dm device*, scan only after proper device
addition (ADD event) and not with any other changes (CHANGE event)

--> if the *PV is a dm device*, scan only after proper mapping
activation (CHANGE event + the underlying PV in a state "just
activated")

RAID:  Limit replacement of devices when array is not in-sync.

If a RAID array is not in-sync, replacing devices should not be allowed
as a general rule.  This is because the contents used to populate the
incoming device may be undefined because the devices being read where
not in-sync.  The kernel enforces this rule unless overridden by not
allowing the creation of an array that is not in-sync and includes a
devices that needs to be rebuilt.

Since we cannot know the sync state of an LV if it is inactive, we must
also enforce the rule that an array must be active to replace devices.

That leaves us with the following conditions:
1) never allow replacement or repair of devices if the LV is in-active
2) never allow replacement if the LV is not in-sync
3) allow repair if the LV is not in-sync, but warn that contents may
   not be recoverable.

In the case where a user is performing the repair on the command line via
'lvconvert --repair', the warning is printed before the user is prompted
if they would like to replace the device(s).  If the repair is automated
(i.e. via dmeventd and policy is "allocate"), then the device is replaced
if possible and the warning is printed.

WHATS_NEW: changelog for fae1a611d2 and 5294a6f77a

lvm2app: No special behavior for 0 for max_snap_size in lvm_lv_snapshot()

It isn't possible to choose a sane default for snapshot size, so just
play it straight and use the passed size instead of adding special
behavior for 0.

Also revert change to Python lib, size parameter must be supplied.

Signed-off-by: Andy Grover <agrover@redhat.com>

Revert "lvmetad: simplify pvid memory allocation."

This reverts commit ed23da95b63308e11f8d680b189686a5d2d380d0.

Hash table device_to_pvid seems to contain references to
already deleted pvids and so revert to the older
behaviour using allocated memory.

lvmetad: Fix a possible race in remove_metadata.

All operations on shared hash tables need to be protected by mutexes. Moreover,
lookup and subsequent key removal need to happen atomically, to avoid races (and
possible double free-ing) between multiple threads trying to manipulate the same
VG.

lvmetad: Fix a possible deadlock.

If an update and a query were running in parallel, there was a slim but non-zero
chance of a deadlock due to (unnecessary) mutex nesting.