Dave Wysochanski [Thu, 30 Sep 2010 14:07:47 +0000 (14:07 +0000)]
Add pv_uuid_dup, vg_uuid_dup, and lv_uuid_dup, and call id_format_and_copy.
Add supporting functions for pv_uuid, vg_uuid, and lv_uuid.
Call new function id_format_and_copy. Use 'const' where appropriate.
Add "_dup" suffix to indicate memory is being allocated.
Call {pv|vg|lv}_uuid_dup from lvm2app uuid functions.
Dave Wysochanski [Thu, 30 Sep 2010 14:07:19 +0000 (14:07 +0000)]
Simplify logic to create 'attr' strings.
This patch addresses code review request to simplify creation of 'attr'
strings. The simplification is done in this separate patch to more
easily review and ensure the simplification is done without error.
Dave Wysochanski [Thu, 30 Sep 2010 13:52:55 +0000 (13:52 +0000)]
Add {pv|vg|lv}_attr_dup() functions and refactor 'disp' functions.
Move the creating of the 'attr' strings into a common function so
they can be called from the 'disp' functions as well as the new
'get' property functions.
Add "_dup" suffix to indicate memory is allocated.
Refactor pvstatus_disp to take pv argument and call pv_attr_dup().
Dave Wysochanski [Thu, 30 Sep 2010 13:05:45 +0000 (13:05 +0000)]
Refactor metadata.[ch] into lv.[ch] for lv functions.
This patch is similar to the other patches for pv and vg
functionality, and separates lv functionality into separate
files, concentrating on reporting fields and simple functions.
Dave Wysochanski [Thu, 30 Sep 2010 13:05:20 +0000 (13:05 +0000)]
Refactor metadata.[ch] into pv.[ch] for pv functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out pv functions and data, particularly
related to the reporting fields calculations.
More code could be moved here but for now I'm stopping at reporting
functions 'get' / 'set' functions.
Dave Wysochanski [Thu, 30 Sep 2010 13:04:55 +0000 (13:04 +0000)]
Refactor metadata.[ch] into vg.[ch] for vg functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out vg functions and data, particularly
related to the reporting fields calculations.
Read complete content of /proc/self/maps into one buffer without
realocation in the middle of reading and before doing any m/unlock
operation with these lines - as some of them gets change.
With previous implementation we've read some mappings twice ([stack])
Milan Broz [Wed, 22 Sep 2010 13:45:21 +0000 (13:45 +0000)]
Fix handling of partial VG for lvm1 format metadata
If some lvm1 device is missing, lvm fails on all operations
# vgcfgbackup -f bck -P vg_test
Partial mode. Incomplete volume groups will be activated read-only.
3 PV(s) found for VG vg_test: expected 4
PV segment VG free_count mismatch: 152599 != 228909
PV segment VG extent_count mismatch: 152600 != 228910
Internal error: PV segments corrupted in vg_test.
Volume group "vg_test" not found
Allow loading of lvm1 partial VG by allocating "new" missing PV,
which covers lost space. Also this fake mising PV inform code
that it is partial VG.
Peter Rajnoha [Mon, 20 Sep 2010 14:25:27 +0000 (14:25 +0000)]
Revert to old glibc behaviour for vsnprintf used in emit_to_buffer function.
Revert to old glibc behaviour for vsnprintf used in emit_to_buffer fn.
Otherwise, the check that follows would be wrong for new glibc versions.
This caused the rh bug #633033 to be undetected and pass throught the check,
corrupting the metadata!
Peter Rajnoha [Thu, 9 Sep 2010 13:13:12 +0000 (13:13 +0000)]
Add random suffix to archive file names to prevent races when being created.
In certain configurations, we're not under a VG rw lock while trying to write
a new archive file with VG metadata. A common example is using "vgs" while
having the content of backup and archive directories empty. The code scans the
content of these directories and tries to determine the final index that should
be used in archive name. Since we're not under a lock, we can get into a race
while choosing the index which could end up showing errors about not being able
to rename to final archive name. Let's add random number suffix to these archive
file names so we can avoid the race.
Peter Rajnoha [Thu, 9 Sep 2010 13:07:13 +0000 (13:07 +0000)]
Reinitialize archive and backup handling on toolcontext refresh.
For example, when using '--config "backup { ... }"' line, the values from
lvm.conf (or default values) should be overridden. This patch adds
reinitialisation of archive and backup handling on toolcontext refresh
which makes these settings to be applied.
This patch fixes an issue where cluster mirror write I/O
can be opprobriously slow if created with '--nosync'.
One of the ways cluster mirrors coordinate I/O and recovery
amoung the different machines is by the use of the log
function 'is_remote_recovering()' which lets nodes know if
a region they wish to perform a write on is currently being
recovered on another node. If the region is being recovered,
the I/O is delayed.
The 'is_remote_recovering' routine has been optimized to
avoid the deluge of requests that would be issued to the
userspace log server by maintaining a marker of how far
the recovery has gotten. It can then immediately return
'not recovering' if the region being inquired about is
less than this mark. Additionally, if the region of
concern is greater than the mark, the function will
limit the number of transmissions to userspace by assuming
the region /is/ being recovered when skipping the
transmission. This limits the amount of processing
and updates the mark in 1/4 sec time steps.
This patch fixes a problem where 'the mark' is not being
updated because of faulty logic in the userspace log
daemon. When '--nosync' is used to create a cluster
mirror, the userspace log daemon never has a chance
to update the mark in the normal way. The fix is to set
the mark to "complete" if the mirror was created with
the --nosync flag.
This patch fixes a problem where the mirror polling process
may never complete.
If you convert from a linear to a mirror and then convert that
mirror back to linear /while/ the previous (up)convert is
taking place, the mirror polling process will never complete.
This is because the function that polls the mirror for
completion doesn't check if it is still polling a mirror and
the copy_percent that it gets back from the linear device is
certainly never 100%.
The fix is simply to check if the daemon is still looking at
a mirror device - if not, return PROGRESS_CHECK_FAILED.
The user sees the following output from the first (up)convert
if someone else sneaks in and does a down-convert shortly
after their convert:
[root@bp-01 ~]# lvconvert -m1 vg/lv
vg/lv: Converted: 43.4%
ABORTING: Mirror percentage check failed.
This patch fixes a potential for I/O to hang and LVM commands
to block when a mirror under a snapshot suffers a failure.
The problem has to do with label scanning. When a mirror suffers
a failure, the kernel blocks I/O to prevent corruption. When
LVM attempts to repair the mirror, it scans the devices on the
system for LVM labels. While mirrors are skipped during this
scanning process, snapshot-origins are not. When the origin is
scanned, it kicks up I/O to the mirror (which is blocked)
underneath - causing the label scan (an thus the repair operation)
to hang.
This patch simply bypasses snapshot-origin devices when doing
labels scans (while ignore_suspended_devices() is set). This
fixes the issue.
Milan Broz [Mon, 23 Aug 2010 11:34:10 +0000 (11:34 +0000)]
Fix pvmove --abort to work even for empty pvmove LV
If pvmove crashed and metadata contains pvmove LV
but without miorrored segments, pvmove --abort
will not repair the situation (and finish wth success!).
Fix it by allowing metadata update if aborting
(thus removing pvmove LV) even if no moved LVs detected.
(Tested on real metadata provided by an lvm user:-)
Mike Snitzer [Sat, 21 Aug 2010 15:43:45 +0000 (15:43 +0000)]
Verify that pvcreate --dataalignment really does override the topology
detected alignment.
NOTE: lvm2 doesn't detect MD 1.2 metadata (now the default on RHEL6) so
for now I'm forcing 1.0 metadata. This was needed to be able to reuse
the existing loop devices but recreate the md device with different
raid0 striping.
Mike Snitzer [Fri, 20 Aug 2010 20:59:05 +0000 (20:59 +0000)]
Update heuristic used for default and detected data alignment.
Add "devices/default_data_alignment" to lvm.conf to control the internal
default that LVM2 uses: 0==64k, 1==1MB, 2==2MB, etc.
If --dataalignment (or lvm.conf's "devices/data_alignment") is specified
then it is always used to align the start of the data area. This means
the md_chunk_alignment and data_alignment_detection are disabled if set.
(Same now applies to pvcreate --dataalignmentoffset, the specified value
will be used instead of the result from data_alignment_offset_detection)
set_pe_align() still looks to use the determined default alignment
(based on lvm.conf's default_data_alignment) if the default is a
multiple of the MD or topology detected values.
Dave Wysochanski [Fri, 20 Aug 2010 12:44:47 +0000 (12:44 +0000)]
Add properties.[ch] to lib/report, defined based on columns.h.
Extend the existing reporting infrastructure definitions and structures
to include a 'get' and 'set' function for each field. We will provide
a 'get' and 'set' function for each of these fields, which will be utilized
by exported lvm2app functions.
Define a default _not_implemented 'get' and 'set' function that just sets
an errno and returns 0. Future patches will actually implement the
specific 'get' and 'set' functions for each property. For read-only
properties, only the 'get' function will be implemented.
Define vg_get_property() function to query a property. We will call
this from a lvm2app function.
Dave Wysochanski [Fri, 20 Aug 2010 12:44:30 +0000 (12:44 +0000)]
Add macro definitions to report infrastructure for character array length.
Rather than hard code the size of the field, use a #define, so we can re-use.
The #define will be needed in a future patch when we extend the reporting
infrastructure to have 'get' and 'set' functions for each field, allowing
lvm2app functions which query any report field. In order to provide a
generic lookup based on the field id, we will define a type containing this
field id, and thus, we will need to re-use the length of this string as
it's defined inside libdevmapper.h.
Dave Wysochanski [Fri, 20 Aug 2010 12:44:17 +0000 (12:44 +0000)]
Remove explicit double quotes from columns.h 'id' entries.
The 'id' entries in columns.h are the report field names. Since these are
unique, we'd like to use them in generation of 'get' / 'set' functions.
As a step towards using them for this purpose, remove the explicit double
quotes and use the macro '#' character to add the double quotes back when
placing them into the '_fields' array 'id' member.
Dave Wysochanski [Fri, 20 Aug 2010 12:44:03 +0000 (12:44 +0000)]
Add 'flags' field to columns.h and define FIELD_MODIFIABLE.
Add a 'flags' field to columns.h, and set it to 0 by default.
Define FIELD_MODIFIABLE flag to indicate whether a 'set' function exists
to change the field's value.
Milan Broz [Thu, 19 Aug 2010 23:03:34 +0000 (23:03 +0000)]
Change the pvcreate swap/md logic
pvcreate detects MD and swap signature.
The logic hidden there is not only documented but it is also
user unfriendly. Who invented this logic should run pvcreate
on its own critical MD device to see why;-)
This patch
- creates one function instead of duplication code
- asks if user want to overwrite signature
- allows aborting (!)
(Please note that writing LVM signatute without wiping old
is wrong, it confuses blkid, MD will not work anyway and
swap and LUKS is broken too.)
Peter Rajnoha [Wed, 18 Aug 2010 13:11:56 +0000 (13:11 +0000)]
Fix dm-mod autoloading logic to not assume control node is set correctly.
We can't rely on the fact that udev should prepare the node with right major
and minor number to trigger the module autoloading. We have to take into
account that the node could be missing or it could exist with improper
major and minor number assigned (e.g. from previous kernel versions in
an environment with static nodes and without udev). Make any corrections
if needed!
Fix for bug 596453: multiple mirror image failures cause lvm repair...
The lvm repair issues I believe are the superficial symptoms of this
bug - there are worse issues that are not as clearly seen. From my
inline comments:
* If the mirror was successfully recovered, we want to always
* force every machine to write to all devices - otherwise,
* corruption will occur. Here's how:
* Node1 suffers a failure and marks a region out-of-sync
* Node2 attempts a write, gets by is_remote_recovering,
* and queries the sync status of the region - finding
* it out-of-sync.
* Node2 thinks the write should be a nosync write, but it
* hasn't suffered the drive failure that Node1 has yet.
* It then issues a generic_make_request directly to
* the primary image only - which is exactly the device
* that has suffered the failure.
* Node2 suffers a lost write - which completely bypasses the
* mirror layer because it had gone through generic_m_r.
* The file system will likely explode at this point due to
* I/O errors. If it wasn't the primary that failed, it is
* easily possible in this case to issue writes to just one
* of the remaining images - also leaving the mirror inconsistent.
*
* We let in_sync() return 1 in a cluster regardless of what is
* in the bitmap once recovery has successfully completed on a
* mirror. This ensures the mirroring code will continue to
* attempt to write to all mirror images. The worst that can
* happen for reads is that additional read attempts may be
* taken.
Alasdair Kergon [Tue, 17 Aug 2010 19:25:05 +0000 (19:25 +0000)]
Use 'SINGLENODE' instead of 'dead' in clvmd singlenode messages.
Ignore snapshots when performing mirror recovery beneath an origin.
Pass LCK_ORIGIN_ONLY flag around cluster.
Add suspend_lv_origin and resume_lv_origin using LCK_ORIGIN_ONLY.
Fix for bug 612291: dm devices of split off mirror images are not removed
DM devices were not handled properly on nodes in a cluster that were not
where the splitmirrors command was issued. This was happening because
suspend_lv/resume_lv were being used in a place where activate_lv should
have been used.
When the suspend/resume are issued on (effectively) new LVs, their
'resource' (UUID) is not located in the lv_hash. Thus, both operations
turn into no-ops. You can see this from the output of clvmd from one
of the remote nodes:
<snip>
do_suspend_lv, lock not already held
<snip>
do_resume_lv, lock not already held
'activate_lv' enjoins the other nodes in the cluster to process the lock
and activate the new LV. clvmd output from remote node as follows:
do_lock_lv: resource 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S', cmd = 0x19 LCK_LV_ACTIVATE (READ|LV|NONBLOCK), flags = 0x84 (DMEVENTD_MONITOR ), memlock = 1
sync_lock: 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S' mode:1 flags=1
sync_lock: returning lkid 27b0001
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Petr Rockai <prockai@redhat.com>
Peter Rajnoha [Thu, 12 Aug 2010 13:41:18 +0000 (13:41 +0000)]
Fix udev rules to support udev database content generated by older rules.
This can happen with older rules (without support for synthesized events)
that are still part of initrd while using new udev rules in the system itself.
The consequence was that new udev rules incorrectly assumed that not having
DM_UDEV_PRIMARY_SOURCE_FLAG set always means the uevent is synthesized and
inappropriate (device is still not properly activated) and so it should be
ignored. However, initrd is not updated automatically while updating the
libdevmapper/udev rules in the system and so we end up with the rules not
detecting and setting crucial parts in the initrd environment and the rules
in the system that rely on the information that should have been stored in
udev db (which is incorrect in this configuration, of course).
The overall consequence is that the update of libdevmapper/lvm2 without
regenerating the initrd could end up with a boot failure! Ignoring the event
means removing any existing symlinks in /dev!
To fix this, increase udev rules version to make a difference. So from now on,
mark rules without proper support for synthesized events as
DM_UDEV_RULES_VSN="1" and 2 (or higher) if that support is included.
Peter Rajnoha [Thu, 12 Aug 2010 13:07:08 +0000 (13:07 +0000)]
Reinstate detection of inappropriate uevent with DISK_RO set and suppress it.
We still need to detect this one! We're not so strict with CHANGE events as
with the ADD events while applying filters in the rules so this one would
pass and it would process the rules prematurely (because it appears *before*
the actual CHANGE event used when resuming a DM device while setting read-only
state at the same time).
Fix clvmd init script return code when executed as non-root user.
clvmd daemon itself does the right thing when invoked as non-root, by
returning 4.
The patch removes the use daemon function from
/etc/rc.d/init.d/functions that´s unnecessary and has th bad habit to
mask the return codes from the real daemon.
Add a simple and generic check to see if clvmd is executed by root or not.
Our stop/reload/restart paths in the init script are complex and not all
the tools involved in the process are guaranteed to return 4 if executed
by non-root against a process that´s running as root (for example kill
-TERM will return -1 and parsing the output to catch the error is
suboptimal at best).
Mike Snitzer [Thu, 12 Aug 2010 04:56:05 +0000 (04:56 +0000)]
fix t-pvcreate-operation-md.sh to require kernel.org Linux >= 2.6.33 for
the final alignment_offset check. In the future, might look to check
for the RHEL6 kernel too.
Mike Snitzer [Thu, 12 Aug 2010 04:11:48 +0000 (04:11 +0000)]
Change default alignment of pe_start to 1MB.
The new standard in the storage industry is to default alignment of data
areas to 1MB. fdisk, parted, and mdadm have all been updated to this
default.
Update LVM to align the PV's data area start (pe_start) to 1MB. This
provides a more useful default than the previous default of 64K (which
generally ended up being a 192K pe_start once the first metadata area
was created).
Before this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 188.00k 192.00k
After this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 1020.00k 1.00m
The heuristic for setting the default alignment for LVM data areas is:
- If the default value (1MB) is a multiple of the detected alignment
then just use the default.
- Otherwise, use the detected value.
In practice this means we'll almost always use 1MB -- that is unless:
- the alignment was explicitly specified with --dataalignment
- or MD's full stripe width, or the {minimum,optimal}_io_size exceeds
1MB
- or the specified/detected value is not a power-of-2
Peter Rajnoha [Wed, 11 Aug 2010 12:14:23 +0000 (12:14 +0000)]
Recognise and give preference to md device partitions (blkext major).
We can already detect MD devices internally. But when using MD partitions,
these have "block extended major" (blkext) assigned (259). Blkext major
is also used in general, so we need to check whether the original device
is an MD device actually.
Fix for bug 619221 - log device splitting regression
An incorrect fix on July 13, 2010 for an annoyance has caused a regression.
The offending check-in was part of the 2.02.71 release of LVM. That
check-in caused any PVs specified on the command line to be ignored when
performing a mirror split.
This patch reverses the aforementioned check-in (solving the regressions)
and posits a new solution to the list reversal problem. The original
problem was that we would always take the lowest mimage LVs from a mirror
when performing a split, but what we really want is to take the highest
mimage LVs. This patch accomplishes that by working through the list in
reverse order - choosing the higher numbered mimages first. (This also
reduces the amount of processing necessary.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Takahiro Yasui <takahiro.yasui@hds.com>
A misunderstanding of the return value of 'dm_bit' has been causing a data
corruption bug in cmirror. 'dm_bit' is only ever used as a boolean operation
within LVM, but it can return a range of values. If the bit is set, a power of
2 is returned. If the bit is unset, 0 is returned.
'log_test_bit' (a function in the cluster mirror log daemon code) has switched
to using the dm bit operations in rhel6. There are two places in the daemon
code where 'log_test_bit' is not used merely as a boolean, but rather the
return value is used as the return value for the log functions 'is_clean' and
'in_sync' - having assumed that 'dm_bit' was returning 0 or 1 only.
One place the 'in_sync' function is utilized is in 'dm_rh_get_state' - a
function that informs the mirroring code how to treat I/O and which devices to
read/write from. 'dm_rh_get_state' was checking if the return value of
'in_sync' was 1 to determine if the region was DM_RH_CLEAN. Since 'dm_bit'
(and by extension 'log_test_bit' and 'in_sync') was returning powers of 2,
DM_RH_CLEAN was rarely being reported as it should have been. Thinking the
region was out-of-sync, the mirroring code would write only to the primary
device. When the primary device was failed, all of those writes were lost -
leaving the entire mirror corrupted.