Fix for bug 612291: dm devices of split off mirror images are not removed
DM devices were not handled properly on nodes in a cluster that were not
where the splitmirrors command was issued. This was happening because
suspend_lv/resume_lv were being used in a place where activate_lv should
have been used.
When the suspend/resume are issued on (effectively) new LVs, their
'resource' (UUID) is not located in the lv_hash. Thus, both operations
turn into no-ops. You can see this from the output of clvmd from one
of the remote nodes:
<snip>
do_suspend_lv, lock not already held
<snip>
do_resume_lv, lock not already held
'activate_lv' enjoins the other nodes in the cluster to process the lock
and activate the new LV. clvmd output from remote node as follows:
do_lock_lv: resource 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S', cmd = 0x19 LCK_LV_ACTIVATE (READ|LV|NONBLOCK), flags = 0x84 (DMEVENTD_MONITOR ), memlock = 1
sync_lock: 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S' mode:1 flags=1
sync_lock: returning lkid 27b0001
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Petr Rockai <prockai@redhat.com>
Peter Rajnoha [Thu, 12 Aug 2010 13:41:18 +0000 (13:41 +0000)]
Fix udev rules to support udev database content generated by older rules.
This can happen with older rules (without support for synthesized events)
that are still part of initrd while using new udev rules in the system itself.
The consequence was that new udev rules incorrectly assumed that not having
DM_UDEV_PRIMARY_SOURCE_FLAG set always means the uevent is synthesized and
inappropriate (device is still not properly activated) and so it should be
ignored. However, initrd is not updated automatically while updating the
libdevmapper/udev rules in the system and so we end up with the rules not
detecting and setting crucial parts in the initrd environment and the rules
in the system that rely on the information that should have been stored in
udev db (which is incorrect in this configuration, of course).
The overall consequence is that the update of libdevmapper/lvm2 without
regenerating the initrd could end up with a boot failure! Ignoring the event
means removing any existing symlinks in /dev!
To fix this, increase udev rules version to make a difference. So from now on,
mark rules without proper support for synthesized events as
DM_UDEV_RULES_VSN="1" and 2 (or higher) if that support is included.
Peter Rajnoha [Thu, 12 Aug 2010 13:07:08 +0000 (13:07 +0000)]
Reinstate detection of inappropriate uevent with DISK_RO set and suppress it.
We still need to detect this one! We're not so strict with CHANGE events as
with the ADD events while applying filters in the rules so this one would
pass and it would process the rules prematurely (because it appears *before*
the actual CHANGE event used when resuming a DM device while setting read-only
state at the same time).
Fix clvmd init script return code when executed as non-root user.
clvmd daemon itself does the right thing when invoked as non-root, by
returning 4.
The patch removes the use daemon function from
/etc/rc.d/init.d/functions that´s unnecessary and has th bad habit to
mask the return codes from the real daemon.
Add a simple and generic check to see if clvmd is executed by root or not.
Our stop/reload/restart paths in the init script are complex and not all
the tools involved in the process are guaranteed to return 4 if executed
by non-root against a process that´s running as root (for example kill
-TERM will return -1 and parsing the output to catch the error is
suboptimal at best).
Mike Snitzer [Thu, 12 Aug 2010 04:56:05 +0000 (04:56 +0000)]
fix t-pvcreate-operation-md.sh to require kernel.org Linux >= 2.6.33 for
the final alignment_offset check. In the future, might look to check
for the RHEL6 kernel too.
Mike Snitzer [Thu, 12 Aug 2010 04:11:48 +0000 (04:11 +0000)]
Change default alignment of pe_start to 1MB.
The new standard in the storage industry is to default alignment of data
areas to 1MB. fdisk, parted, and mdadm have all been updated to this
default.
Update LVM to align the PV's data area start (pe_start) to 1MB. This
provides a more useful default than the previous default of 64K (which
generally ended up being a 192K pe_start once the first metadata area
was created).
Before this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 188.00k 192.00k
After this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 1020.00k 1.00m
The heuristic for setting the default alignment for LVM data areas is:
- If the default value (1MB) is a multiple of the detected alignment
then just use the default.
- Otherwise, use the detected value.
In practice this means we'll almost always use 1MB -- that is unless:
- the alignment was explicitly specified with --dataalignment
- or MD's full stripe width, or the {minimum,optimal}_io_size exceeds
1MB
- or the specified/detected value is not a power-of-2
Peter Rajnoha [Wed, 11 Aug 2010 12:14:23 +0000 (12:14 +0000)]
Recognise and give preference to md device partitions (blkext major).
We can already detect MD devices internally. But when using MD partitions,
these have "block extended major" (blkext) assigned (259). Blkext major
is also used in general, so we need to check whether the original device
is an MD device actually.
Fix for bug 619221 - log device splitting regression
An incorrect fix on July 13, 2010 for an annoyance has caused a regression.
The offending check-in was part of the 2.02.71 release of LVM. That
check-in caused any PVs specified on the command line to be ignored when
performing a mirror split.
This patch reverses the aforementioned check-in (solving the regressions)
and posits a new solution to the list reversal problem. The original
problem was that we would always take the lowest mimage LVs from a mirror
when performing a split, but what we really want is to take the highest
mimage LVs. This patch accomplishes that by working through the list in
reverse order - choosing the higher numbered mimages first. (This also
reduces the amount of processing necessary.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Takahiro Yasui <takahiro.yasui@hds.com>
A misunderstanding of the return value of 'dm_bit' has been causing a data
corruption bug in cmirror. 'dm_bit' is only ever used as a boolean operation
within LVM, but it can return a range of values. If the bit is set, a power of
2 is returned. If the bit is unset, 0 is returned.
'log_test_bit' (a function in the cluster mirror log daemon code) has switched
to using the dm bit operations in rhel6. There are two places in the daemon
code where 'log_test_bit' is not used merely as a boolean, but rather the
return value is used as the return value for the log functions 'is_clean' and
'in_sync' - having assumed that 'dm_bit' was returning 0 or 1 only.
One place the 'in_sync' function is utilized is in 'dm_rh_get_state' - a
function that informs the mirroring code how to treat I/O and which devices to
read/write from. 'dm_rh_get_state' was checking if the return value of
'in_sync' was 1 to determine if the region was DM_RH_CLEAN. Since 'dm_bit'
(and by extension 'log_test_bit' and 'in_sync') was returning powers of 2,
DM_RH_CLEAN was rarely being reported as it should have been. Thinking the
region was out-of-sync, the mirroring code would write only to the primary
device. When the primary device was failed, all of those writes were lost -
leaving the entire mirror corrupted.
Peter Rajnoha [Tue, 3 Aug 2010 07:56:03 +0000 (07:56 +0000)]
Add check for kernel semaphore support and disable udev_sync if not available.
udev_sync feature requires semaphores (part of System V IPC) to be configured
in kernel (CONFIG_SYSVIPC). Check whether it is supported and if not, give
a warning message and disable udev synchronisation code automatically to
avoid any further error states and associated problems.
One should use the kernel with System V IPC support enabled or libdevmapper
with udev_sync feature disabled.
Taka's fix for handling failure of all mirrored log devices and
all but one mirror leg.
<patch header>
To handle a double failure of a mirrored log, Jon's two patches are
commited, however, lvconvert command can't still handle an error
when mirror leg and mirrored log got failure at the same time.
[Patch]: Handle both devices of a mirrored log failing (bug 607347)
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00009.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00027.html
[Patch]: Handle both devices of a mirrored log failing (bug 607347) -
additional fix
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00093.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00101.html
In the second patch, the target type of mirrored log is replaced with
error target when remove_log is set to 1, but this procedure should be
also used in other cases such as the number of mirror leg is 1. This
patch relocates the procedure to the main path.
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
- With Jon's second patch, we think that we don't have to call
remove_mirror_log() in _lv_update_mirrored_log() because will be
handled remove_mirror_images() in _lvconvert_mirrors_repaire().
</patch header>
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com> Reviewed-by: Petr Rockai <prockai@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
The cluster log daemon (cmirrord) is not multi-threaded and
can handle only one request at a time. When a log is stacked
on top of a mirror (which itself contains a 'core' log), it
creates a situation that cannot be solved without threading.
When the top level mirror issues a "resume", the log daemon
attempts to read from the log device to retrieve the log
state. However, the log is a mirror which, before issuing
the read, attempts to determine the 'sync' status of the
region of the mirror which is to be read. This sync status
request cannot be completed by the daemon because it is
blocked on a read I/O to the very mirror requesting the
sync status.
With mirror_log_fault_policy of 'remove' and mirror_image_fault_policy
of 'allocate', the log type of the mirror volume is converted from
'disk' or 'mirrored' to 'core' when all mirror legs but one in a mirror
volume broke.
Keep new_log_count as a number of valid log devices by using log_count
variable for a temporary usage in the first phase of error recovery
in _lvconvert_mirrors_repair().
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com> Reviewed-by: Petr Rockai <prockai@redhat.com>
Peter Rajnoha [Wed, 28 Jul 2010 10:30:28 +0000 (10:30 +0000)]
Revert unsuccessful table load preparation in combined "create, load and resume" scenario.
There was missing "revert" call in _create_and_load_v4 fn while the preparation
for table load ends up with failure in create/load/resume sequence. Otherwise
we could end up with a device being created, but not table-loaded nor resumed.
Even though the table is not loaded and the device is not resumed at this
stage, we still need to synchronize with udev when calling the revert
"remove" ioctl - there's still a remove uevent generated! The "revert"
code does exactly that.
Building without the '--enable-cmirrord' option means that
CMIRRORD_PIDFILE is not defined. This makes the build fail.
Therefore, we need to conditionalize the check for cmirrord
based on if CMIRRORD_PIDFILE is defined.
It's not enough to check for the kernel module in the case of cluster
mirrors, we must also check that the log daemon (cmirrord) is running.
The log module can be auto-loaded, but the daemon cannot be
"auto-started". Failing to check for the daemon produces cryptic
messages that customers have a hard time deciphering. (The system
messages do report that the log daemon is not running, but people
don't seem to find this message easily.)
Here are examples of what is printed when the module is available,
but the log daemon has not been started.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg
Shared cluster mirrors are not available.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg -v
Setting logging type to disk
Finding volume group "vg"
Archiving volume group "vg" metadata (seqno 3).
Creating logical volume lv
Executing: /sbin/modprobe dm-log-userspace
Cluster mirror log daemon is not running
Shared cluster mirrors are not available.
Creating volume group backup "/etc/lvm/backup/vg" (seqno 4).
Fix reversal of LV list before performing a split mirror.
When splitting off mirror images from a mirror, we always take
LVs from the end of a list. For example, if the mirror sub-devices
are lv_mimage_[012], we should select lv_mimage_2 if splitting off
one image. However, lv_mimage_0 was being selected instead.
The problem came from calling '_move_removable_mimages_to_end'
when it was unnecessary to do so. When the user /does/ specify
specific devices to be removed, this function properly moved the
appropriate LVs to the end of the list for extraction. However,
if the user /doesn't/ give any specific PVs, the function should
do nothing. '_move_removable_mimages_to_end' was keying off of
whether 'removable_pvs' was NULL or not and this value was
improperly being populated with the set of all available PVs.
This was causing '_move_removable_mimages_to_end' to completely
reverse the list, which in turn caused us to extract the
hithertofore front-of-the-list LVs.
Fix for bug 612311: Split of linear provides no error msg
An unhandled condition allowed the command to terminate
cleanly without a warning. Added a check for the
'--splitmirrors' argument to allow execution to the lower
level function that has the check to see if the user is
trying to split a linear device. You should now see a
message if you try to use --splitmirrors on a linear device.
The main problem with these bugs was that the newly split
off LV was not being suspended properly. This meant that
the memlock count was not being balanced, the DM devices
were not being renamed, and some DM devices which should
have been removed were not.
I've also renamed some of the variables and added comments
to make things clearer as to what is going on. (I can break
this patch in two if it means easier review.)
Dave Wysochanski [Tue, 13 Jul 2010 15:04:23 +0000 (15:04 +0000)]
Minor man page updates related to metadataignore and vgmetadatacopies.
pvchange: Add --metadataignore description
vgchange: Fix minor formatting
pvcreate: Update metadataignore description to refer to pvchange
lvm.conf: Refer to pvcreate and pvchange for metadata options.
Add dm_create_lockfile to libdm to handle pidfiles for all daemons.
Switch dmeventd to use dm_create_lockfile and drop duplicate code.
Allow clvmd pidfile to be configurable.
Switch cmirrord and clvmd to use dm_create_lockfile.
Peter Rajnoha [Mon, 12 Jul 2010 11:37:49 +0000 (11:37 +0000)]
Add more verbose messages while checking volume_list and hosttags settings.
This should bring less confusion when there are some settings left and
people just forgot about it and then they run into problems. These messages
should give them a hint of what's really going on.
Failed to test for the case where a log was requested to be removed
even though there was no log. A simple run through the in-tree test
suite would have caught this. :(
- if (lv_is_mirrored(detached_log_lv) &&
+ if (detached_log_lv && lv_is_mirrored(detached_log_lv) &&
Also, made some cosmetic changes suggested by kabi after my last check-in
(e.g. s/return 0/return_0/ and adding an error message).