This is the mail archive of the
mailing list for the glibc project.
Identifying when collations change
- From: Craig Ringer <craig at 2ndquadrant dot com>
- To: libc-help at sourceware dot org
- Date: Fri, 3 Jul 2015 15:16:36 +0800
- Subject: Identifying when collations change
- Authentication-results: sourceware.org; auth=none
The PostgreSQL database relies on the collation support of the
underlying platform, which in GNU/Linux is glibc. This works very well
for most purposes, but a problem arises when the collation rules are
updated by the platform due to bug fixes or changes in accepted
PostgreSQL builds persistent on-disk b-tree indexes by executing the
system C library collation functions - strcoll or strcoll_l. Correct
searching of these indexes requires that the C library collation
function behaviour be pure and immutable, i.e. that any two calls over
any time period will return the same result for any given input.
Collation updates break that assumption, and indexes must be rebuilt
(REINDEXed) to ensure correct queries.
If PostgreSQL had a way to detect when the collation definition an
index was built with differed from the current collation definition it
would be very helpful, as we could then alert users to the situation,
or even repair the index if we could tell *what* changed, not just
that something changed.
This isn't only an issue with collation updates on one machine. It
also applies when a database is binary-replicated to another host with
a different glibc version. Queries on the replica may produce
incorrect results if the collations differ, and currently we have no
way to detect this situation.
The alternative to detecting and reporting issues with platform
collation changes is dropping the use of operating system collation
support in favour of a portable library like ICU. That's undesirable
for a number of reasons: ICU uses UTF-16 internally while PostgreSQL
uses UTF-8, so there'd be ugly conversion overheads, and that's just
one of the issues. It'd also potentially cause PostgreSQL's collation
results to differ from that of the platform it runs on. I'd rather
avoid that, so I'm really interested in a way to find out when glibc
collations change, or even better a portable way to do it and possibly
even derive what changed.
Do you have any advice for how PostgreSQL (or any other project with
similar issues) might detect when glibc collation definitions change?
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services