Petr Rockai [Wed, 31 Aug 2011 15:19:19 +0000 (15:19 +0000)]
Replace const usage of dm_config_find_node with more appropriate value-lookup
functionality. A number of bugs (copied and pasted all over the code) should
disappear:
- most string lookup based on dm_config_find_node would segfault when
encountering a non-zero integer (the intention there was to print an
error message instead)
- check for required sections in metadata would have been satisfied by
values as well (i.e. not sections)
- encountering a section in place of expected flag value would have
segfaulted (due to assumed but unchecked cn->v != NULL)
Petr Rockai [Wed, 31 Aug 2011 12:39:58 +0000 (12:39 +0000)]
Fix warnings and constness handling in lvmetad-core (adjusting the
dm_config_find_node to give non-const node pointer, since that better reflects
the contract of that function).
Petr Rockai [Wed, 31 Aug 2011 11:31:57 +0000 (11:31 +0000)]
A compromise integration of LVMetaD into the build: I have kept all the
daemon/common code in a single libdaemon.a, which is completely private. This
is currently linked into the lvmetad binary, and will be linked into LVM (the
client part, since static linking only picks up only symbols that are actually
used). I have also added --enable/disable-lvmetad to ./configure; although the
current default is off, I expect this to be flipped to on shortly. There's no
LVM-side support yet, but when there is, even when built, it'll still need to
be enabled by an lvm.conf option.
Petr Rockai [Tue, 30 Aug 2011 14:55:15 +0000 (14:55 +0000)]
Move the core of the lib/config/config.c functionality into libdevmapper,
leaving behind the LVM-specific parts of the code (convenience wrappers that
handle `struct device` and `struct cmd_context`, basically). A number of
functions have been renamed (in addition to getting a dm_ prefix) -- namely,
all of the config interface now has a dm_config_ prefix.
Peter Rajnoha [Mon, 29 Aug 2011 13:37:36 +0000 (13:37 +0000)]
Directly allocate buffer memory in a pvck scan instead of using a mempool.
There's a very high memory usage when calling _pv_analyse_mda_raw (e.g. while
executing pvck) that can end up with "out of memory".
_pv_analyse_mda_raw scans for metadata in the MDA, iteratively increasing the
size to scan with SECTOR_SIZE until we find a probable config section or we're
at the edge of the metadata area. However, when using a memory pool, we're also
iteratively chasing for bigger and bigger mempool chunk which can't be found
and so we're always allocating a new one, consuming more and more memory...
This patch just changes the mempool to direct memory allocation in this
problematic part of the code.
Move RAID convert tests to new file, t-lvconvert-raid.sh
There is duplicate code in t-lvconvert-raid.sh and t-lvcreate-raid.sh.
This should be moved into a common file which is then sourced by these two
files. I'll wait to move the duplicate code until I can talk to mornfall.
Add support for m-way to n-way up-convert in RAID1 (no linear to n-way yet)
This patch adds the ability to upconvert a raid1 array - say from 2-way to
3-way. It does not yet support upconverting linear to n-way.
The 'raid' device-mapper target allows for individual components (images) of
an array to be specified for rebuild. This mechanism is used when adding
new images to the array so that the new images can be resync'ed while the
rest of the images in the array can remain 'in-sync'. (There is no
mirror-on-mirror layering required.)
Add the ability to split an image from the mirror and track changes.
~> lvconvert --splitmirrors 1 --trackchanges vg/lv
The '--trackchanges' option allows a user the ability to use an image of
a RAID1 array for the purposes of temporary read-only access. The image
can be merged back into the array at a later time and only the blocks that
have changed in the array since the split will be resync'ed. This
operation can be thought of as a partial split. The image is never completely
extracted from the array, in that the array reserves the position the device
occupied and tracks the differences between the array and the split image via
a bitmap. The image itself is rendered read-only and the name (<LV>_rimage_*)
cannot be changed. The user can complete the split (permanently splitting the
image from the array) by re-issuing the 'lvconvert' command without the
'--trackchanges' argument and specifying the '--name' argument.
~> lvconvert --splitmirrors 1 --name my_split vg/lv
Merging the tracked image back into the array is done with the '--merge'
option (included in a follow-on patch).
~> lvconvert --merge vg/lv_rimage_<n>
The internal mechanics of this are relatively simple. The 'raid' device-
mapper target allows for the specification of an empty slot in an array
via '- -'. This is what will be used if a partial activation of an array
is ever required. (It would also be possible to use 'error' targets in
place of the '- -'.) If a RAID image is found to be both read-only and
visible, then it is considered separate from the array and '- -' is used
to hold it's position in the array. So, all that needs to be done to
temporarily split an image from the array /and/ cause the kernel target's
bitmap to track (aka "mark") changes made is to make the specified image
visible and read-only. To merge the device back into the array, the image
needs to be returned to the read/write state of the top-level LV and made
invisible.
Add --splitmirrors support for RAID1 (1 image only)
Users already have the ability to split an image from an LV of "mirror"
segtype. This patch extends that ability to LVs of "raid1" segtype.
This patch only allows a single image to be split off, however. (The
"mirror" segtype allows an arbitrary number of images to be split off.
e.g. 4-way => 3-way/linear, 2-way/2-way, linear,3-way)
When down-converting RAID1, don't activate sub-lvs between suspend/resume
of top-level LV.
We can't activate sub-lv's that are being removed from a RAID1 LV while it
is suspended. However, this is what was being used to have them show-up
so we could remove them. 'sync_local_dev_names' is a sufficient and
proper replacement and can be done after the top-level LV is resumed.
Compiler warning fixes, better error messaging, and cosmetic changes.
1) add new function 'raid_remove_top_layer' which will be useful
to other conversion functions later (also cleans up code)
2) Add error messages if raid_[extract|add]_images fails
3) Add function prototypes to prevent compiler warnings when
compiling with '--with-raid=shared'
Various code clean-ups (s/malloc/zalloc/, new msgs, etc)
Fix a couple more issues that kabi found.
- Add some error messages in failure cases
- s/malloc/zalloc/
- use vg->vgmem for lv names instead of vg->cmd->mem
Zdenek Kabelac [Thu, 11 Aug 2011 17:34:30 +0000 (17:34 +0000)]
Lock memory for shared VG
Use debug pool locking functionality. So the command could check,
whether the memory in the pool has not been modified.
For lv_postoder() instead of unlocking and locking for every changed
struct status member do it once when entering and leaving function.
(mprotect would trap each such memory access).
Currently lv_postoder() does not modify other part of vg structure
then status flags of each LV with flags that are reverted back to
its original state after function exit.
Zdenek Kabelac [Thu, 11 Aug 2011 17:29:04 +0000 (17:29 +0000)]
Add memory pool locking functions
Adding debuging functionality to lock and unlock memory pool.
2 ways to debug code:
crc - is default checksum/hash of the locked pool.
It gets slower when the pool is larger - so the check is only
made when VG is finaly released and it has been used more then
once.Thus the result is rather informative.
mprotect - quite fast all the time - but requires more memory and
currently it is using posix_memalign() - this could be
later modified to use dm_malloc() and align internally.
Tool segfaults when locked memory is modified and core
could be examined for faulty code section (backtrace).
Only fast memory pools could use mprotect for now -
so such debug builds cannot be combined with DEBUG_POOL.
Zdenek Kabelac [Thu, 11 Aug 2011 17:24:23 +0000 (17:24 +0000)]
Cache and share generated VG structs
Extend vginfo cache with cached VG structure. So if the same metadata
are use, skip mda decoding in the case, the same data are in use.
This helps for operations like activation of all LVs in one VG,
where same data were decoded giving the same output result.
Patch adds 1-to-1 connection between volume_group and lvmcache_vginfo.
Compiler complaining that meta_lv could be used uninitialized. (Not true
because it is protected by 'clear_metadata'.) I switched to using 'lv->vg',
as it makes no difference to vg_[write|commit].
Milan Broz [Wed, 10 Aug 2011 16:07:53 +0000 (16:07 +0000)]
If anything bad happens and unlocking fails
(here clvmd crashed in the middle of operation),
lock is not removed from cache - here is one example:
locking/cluster_locking.c:497 Locking VG V_vg_test UN (VG) (0x6)
locking/cluster_locking.c:113 Error writing data to clvmd: Broken pipe
locking/locking.c:399 <backtrace>
locking/locking.c:461 <backtrace>
Internal error: Volume Group vg_test was not unlocked
Code should always remove lock info from lvmcache and update counters
on unlock, even if unlock fails.
Peter Rajnoha [Tue, 9 Aug 2011 11:44:57 +0000 (11:44 +0000)]
Suppress low-level locking errors and warnings while using --sysinit.
Today, we use "suppress_messages" flag (set internally in init_locking fn based
on 'ignorelockingfailure() && getenv("LVM_SUPPRESS_LOCKING_FAILURE_MESSAGES")'.
This way, we can suppress high level messages like "File-based locking
initialisation failed" or "Internal cluster locking initialisation failed".
However, each locking has its own sequence of initialization steps and these
could log some errors as well. It's quite misleading for the user to see such
errors and warnings if the "--sysinit" is used (and so the ignorelockingfailure
&& LVM_SUPPRESS_LOCKING_FAILURE_MESSAGES environment variable). Errors and
warnings from these intermediary steps should be suppressed as well if requested.
This patch propagates the "suppress_messages" flag deeper into locking init
functions. I've also added these flags for other locking types for consistency,
though it's not actually used for no_locking and readonly_locking.
Zdenek Kabelac [Thu, 4 Aug 2011 15:18:10 +0000 (15:18 +0000)]
Remove unused inconsistent_seqno
Last usage was removed in Petr's commit related to VG mda repair fix
where relaxed check starts to ignore inconsistencies coming from
PVs that are marked MISSING - thus removing unused variable.
Zdenek Kabelac [Thu, 4 Aug 2011 12:40:24 +0000 (12:40 +0000)]
Minor memory leak fix
Defer the test of the function return value after the string memory is released.
Otherwise in this error path the string would present memory leak.
(Thought in this case we are already out of memory...)
Basic support includes:
- ability to create RAID 1/4/5/6 arrays
- ability to delete RAID arrays
- ability to display RAID arrays
Notable missing features (not included in this patch):
- ability to clean-up/repair failures
- ability to convert RAID segment types
- ability to monitor RAID segment types
Peter Rajnoha [Tue, 2 Aug 2011 10:49:57 +0000 (10:49 +0000)]
Change DEFAULT_UDEV_SYNC to 1 so udev_sync is used even without any config.
This should be set by default! Normally we have "activation/udev_sync = 1"
in lvm.conf (example.conf.in). But if we use lvm2 without any config file
(or without a definition within '--config' option) the DEFAULT_UDEV_SYNC
is used instead. Together with verify_udev_operations=0 (when we rely on
udev fully), this can cause races as the node could be missing when needed.
(See also https://bugzilla.redhat.com/show_bug.cgi?id=723144)
Peter Rajnoha [Thu, 28 Jul 2011 13:06:50 +0000 (13:06 +0000)]
Add support for systemd file descriptor handover in dmeventd.
Systemd preloads file descriptors for us and passes them in for
newly spawned daemon when using on-demand fifo (or socket)
based activation.
This patch adds checks for file descriptors preloaded by
systemd and uses them instead of opening the FIFOs again
to properly support on-demand FIFO-based activation.
(We'll change FIFOs to sockets soon - but still this
part of the code will stay almost the same.)
Peter Rajnoha [Thu, 28 Jul 2011 13:03:37 +0000 (13:03 +0000)]
Add support for new oom killer adjustment interface (oom_score_adj).
The filename to adjust the oom score was changed in 2.6.36.
We should use oom_score_adj instead of oom_adj (which is still
there under /proc, but it's scheduled for removal in August 2012).
New oom_score_adj uses a range from -1000 (OOM_SCORE_ADJ_MIN,
disable oom killing) to 1000 (OOM_SCORE_ADJ_MAX).