This is the mail archive of the mailing list for the glibc project.

Re: Improved check-localedef script

On Thu, Aug 3, 2017 at 5:17 PM, Zack Weinberg <> wrote:
> Here is an improved version of the check-localedef script I posted the
> other week.

Here is another revision which uses the SUPPORTED file to learn the
legacy encodings for each locale, rather than looking at %Charset:
annotations in the source files.  You run it like this now (from the
top level of the source tree):

$ ./scripts/ -p localedata/locales -f localedata/SUPPORTED localedata/locales/*

The final "localedata/locales/*" part is not _required_; it only
enables the script to tell you about any locales that are missing from
the SUPPORTED file.
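
For reference, each non-comment entry in SUPPORTED has the form
"locale/charset" (e.g. "sr_RS.UTF-8/UTF-8" or "sr_RS/ISO-8859-5"); the
script strips the ".encoding" part of the locale name, keeping any
"@variant", so that every charset listed for one base locale ends up in
a single set.  Here is a minimal, self-contained sketch of that mapping,
separate from the script itself; the sample entries and names are only
illustrative:

import codecs, re

sample = [
    "sr_RS.UTF-8/UTF-8 \\",
    "sr_RS/ISO-8859-5 \\",
    "sr_RS@latin/ISO-8859-2 \\",
]
split_xlocale = re.compile(r"^([^.]*)[^@]*(.*)$")  # drop ".enc", keep "@var"
charsets = {}
for line in sample:
    locale_code = line.split()[0]                  # discard the trailing " \"
    xlocale, _, charset = locale_code.partition('/')
    locale = split_xlocale.sub(r"\1\2", xlocale)
    charsets.setdefault(locale, set()).add(codecs.lookup(charset).name)
print(charsets)  # sr_RS -> {'utf-8', 'iso8859-5'}, sr_RS@latin -> {'iso8859-2'}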

(Also, still more bugs have been fixed; in particular the
"inappropriate character" errors have been restored.  Doh.)

It's possible that Python isn't going to work out as the
implementation language for this script.  I used it because its
standard library provides Unicode normalization and many codecs for
legacy encodings, but it doesn't know all of the encodings mentioned
in localedata/SUPPORTED (ARMSCII-8, GEORGIAN-PS, and EUC-TW are
missing) and I don't think it knows how to do transliteration, either.
And it's still a solid order of magnitude slower than it should be.
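
If you want to see exactly which charsets named in SUPPORTED fall
outside Python's codec registry, a probe along these lines will list
them (a quick sketch, not part of the script, assuming it is run from
the top of the source tree):

import codecs

unknown = set()
with open("localedata/SUPPORTED") as fp:
    for line in fp:
        fields = line.split()
        # Skip blank lines, comments, and the "SUPPORTED-LOCALES=" prologue.
        if not fields or fields[0].startswith("#") or "/" not in fields[0]:
            continue
        charset = fields[0].partition("/")[2]
        try:
            codecs.lookup(charset)
        except LookupError:
            unknown.add(charset)
print(" ".join(sorted(unknown)))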


Attachment: check-localedef.errs
Description: Binary data

# Validate locale definitions.
# Copyright (C) 2017 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
# The GNU C Library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
# The GNU C Library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
# You should have received a copy of the GNU Lesser General Public
# License along with the GNU C Library; if not, see
# <http://www.gnu.org/licenses/>.

"""Validate locale definition files in ways that are too complicated
or too expensive to code into localedef.  This script is run over all
locale definitions as part of 'make check', when Python 3 is available.

Currently this performs two checks on each string within each file on
the command line: it must be in either Unicode NFC or NFD (we don't
care which), and it must be representable in the legacy character
set(s) listed for that locale in the localedata/SUPPORTED file
(e.g. ISO-8859-5, KOI8-R).

It also performs several checks on the overall syntax of the file:

Outside of comments, the only characters allowed are the ASCII graphic
characters (U+0021 through U+007E inclusive), U+0020 SPACE, U+0009
HORIZONTAL TAB, and U+000A NEW LINE; in particular, the other
characters counted as "whitespace" in the POSIX locale are NOT
allowed.  Inside comments, this rule is relaxed to permit most Unicode
characters (see INAPPROPRIATE_UNICHARS); we might in the future start
allowing "raw" Unicode text in strings as well.

Byte escapes (/xxx, where / is the escape character) are only to be
used to escape newline, ", <, >, and the escape character itself. All
other characters that can't be written directly should be written as
<Unnnn> instead.

The escape_char and comment_char directives' arguments are
sanity-checked: both take a single character, which must be an ASCII
graphic character and may not be any of , ; < > ".  Finally, the
escape character and the comment character may not be the same.

"..." strings and <...> symbols must be properly closed before the end
of the line.  Hard tabs are not permitted inside strings (write
<U0009> if you really mean to put a tab inside a string) and if
escape-newline is used to continue a string onto the next line, the
first character on the next line may not be a space (write <U0020> if
you really mean to do that).
"""

import argparse
import codecs
import contextlib
import functools
import itertools
import os.path
import re
import sys
import unicodedata

class ErrorLogger:
    """Object responsible for all error message output; keeps track of
       things like the file currently being processed, and whether any
       errors have so far been encountered."""
    def __init__(self, ofp, verbose):
        self.ofp     = ofp
        self.verbose = verbose
        self.status  = 0
        self.fname   = None
        self.fstatus = 0
        self.tblib   = None
        self.twlib   = None

    def begin_file(self, fname):
        self.fname   = fname
        self.fstatus = 0
        if self.verbose:
            # Print the file name now; errors (or " OK") will follow.
            self.ofp.write(self.fname + ":")

    def end_file(self):
        if self.fstatus:
            self.status = 1
        elif self.verbose:
            self.ofp.write(" OK\n")

    def error(self, lno, message, *args):
        if self.verbose:
            if self.fstatus == 0:
                # First error for this file: end the "filename:" line
                # written by begin_file in verbose mode.
                self.ofp.write("\n")
            self.ofp.write("  ")
        if args:
            message = message.format(*args)
        self.ofp.write("{}:{}: {}\n".format(self.fname, lno, message))

        self.fstatus = 1

    def oserror(self, filename, errmsg):
        # If all these things are true, the last thing printed was the
        # filename that provoked an OS error (e.g. we failed to open the
        # file we're logging for) so just print the error message.
        if self.verbose and self.fname == filename and self.fstatus == 0:
            self.ofp.write(" " + errmsg + "\n")
        else:
            if self.verbose:
                if self.fstatus == 0:
                    self.ofp.write("\n")
                self.ofp.write("  ")
            self.ofp.write("{}: {}\n".format(filename, errmsg))

        self.fstatus = 1

    def exception(self):
        exi = sys.exc_info()

        # The traceback module is lazily loaded since this method should
        # only need to be called if there's a bug in this program.
        if self.tblib is None:
            import traceback
            self.tblib = traceback

        if self.verbose:
            if self.fstatus == 0:
                self.ofp.write("\n")
            prefix = "  "
        else:
            prefix = ""
        self.ofp.write(prefix + "{}: error:\n".format(self.fname))

        for msg in self.tblib.format_exception(*exi):
            for m in msg.split("\n"):
                if m:
                    self.ofp.write(prefix + m + "\n")

        self.fstatus = 1

    def dump_codepoints(self, label, s):

        # The textwrap module is lazily loaded since this method should
        # only need to be called if there's a problem with the locale data.
        if self.twlib is None:
            import textwrap
            self.twlib = textwrap

        codepoints = [ord(c) for c in s]
        if any(c > 0xFFFF for c in codepoints):
            form = "06X"
        else:
            form = "04X"
        dumped = " ".join(format(c, form) for c in codepoints)
        if self.verbose:
            label = "  " + label
        self.ofp.write(self.twlib.fill(dumped, width=78,
                                       initial_indent=label,
                                       subsequent_indent=" "*len(label)))
        self.ofp.write("\n")

@contextlib.contextmanager
def logging_for_file(log, fname):
    # Bracket the processing of one file in the error log; report, and
    # swallow, any exception that escapes the body (including the OSError
    # from a failed open() in the same 'with' statement).
    log.begin_file(fname)
    try:
        yield
    except OSError as e:
        log.oserror(e.filename, e.strerror)
    except Exception:
        log.exception()
    finally:
        log.end_file()

# Regular expressions used by the parser.
def re_escape_for_cc(x):
    return (x if x not in '-\\^]' else '\\' + x)
def make_cc(chars, inverse=False):
    chars = ''.join(re_escape_for_cc(x) for x in sorted(chars))
    if inverse:
        return '[^' + chars + ']'
    else:
        return '[' + chars + ']'

graphic_chars = set(chr(c) for c in range(0x21, 0x7F))

# A strict definition of 'inappropriate character', currently used
# everywhere except comments: all characters _except_ the ASCII
# graphic characters, space, tab, and newline.
inappropriate_ascii = re.compile(
    make_cc(graphic_chars | set(' \t\n'), inverse=True))

# A relaxed definition of 'inappropriate character', currently used in
# comments only: arbitrary Unicode characters are allowed, but not
# the legacy control characters (except TAB), nor the Unicode NIH
# line-breaking characters, nor bare surrogates, nor noncharacters.
# Private-use, not-yet-assigned, and format controls (Cf) are fine,
# except that BYTE ORDER MARK (U+FEFF) is not allowed.  OBJECT
# REPLACEMENT CHARACTER (U+FFFC) and REPLACEMENT CHARACTER (U+FFFD)
# are officially "symbols", but we weed them out as well, because
# their presence in a locale file means something has gone wrong
# somewhere.
inappropriate_unicode = re.compile(make_cc(chr(c) for c in itertools.chain(
    range(0x0000, 0x0009),
    range(0x000A, 0x0020),
    range(0x007F, 0x00A0),
    range(0xD800, 0xE000),
    range(0xFDD0, 0xFDF0),
    (i * 0x10000 + 0xFFFE for i in range(0x11)),
    (i * 0x10000 + 0xFFFF for i in range(0x11)),
    (0x2028, 0x2029, 0xFEFF, 0xFFFC, 0xFFFD))))

def compile_token_re(escape_char, comment_char):

    special_chars = { escape_char, comment_char, ',', ';', '<', '>', '"' }
    wordchars = make_cc(graphic_chars - special_chars)

    # Note: POSIX specifically says that comments are _not_ continued
    # onto the next line by the escape_char.
    abstract_token_re = r"""(?msx)
             (?P<COMMA>    ,                        )
      |      (?P<SEMI>     ;                        )
      |      (?P<NEWLINE>  \n                       )
      |      (?P<WHITE>    [ \t]+                   )
      |      (?P<WORD>     (?:{wordchars}|{ec}.)+  )
      | "    (?P<STRING>   (?:[^"\n{ec}]|{ec}.)*    ) (?:"|$)
      | <    (?P<SYMBOL>   (?:[^>\n{ec}]|{ec}.)*    ) (?:>|$)
      | {cc} (?P<COMMENT>  [^\n]*                   )
      |      (?P<BAD>      .                        )
    """

    return re.compile(abstract_token_re.format(
        wordchars = wordchars,
        ec        = re.escape(escape_char),
        cc        = re.escape(comment_char)))

def compile_esc_re(escape_char):
    return re.compile(r"""(?six)
        {ec} (?P<ESC> [0-7]{{1,3}} | d[0-9]{{1,3}} | x[0-9a-f]{{1,2}} | . )
      |   <U (?P<UNI> [0-9a-f]{{1,8}} ) >
    """.format(ec = re.escape(escape_char)))

directive_re = re.compile(
    r"[ \t]*(comment|escape)_char[ \t]*([^\n\t ]*)(?:[ \t][^\n]*)?(\n|\Z)")

def scan_localedef(fp, log):
    """Scan through a locale definition file, FP.  Returns a list of
       all strings appearing in the file, as 2-tuples (lno, string).
       May also emit error messages.
       Assumes that log.begin_file() has been called for the file FP."""
    strings = []
    escape_char = '\\'
    comment_char = '#'
    lno = 1
    data = fp.read()

    def decode_and_diagnose_esc(m):
        g = m.lastgroup
        c = m.group(g)
        if g == "UNI":
            try:
                return chr(int(c, 16))
            except (UnicodeError, ValueError):
                log.error(lno, "invalid token '<U{}>' in string", c)
                return ''
        else:
            if c == '\n':
                # Look one past the end of the match.  Is it whitespace?
                loc = m.end(g)
                if len(m.string) > loc and m.string[loc] in " \t":
                    log.error(lno, "leading whitespace in string "
                              "after escaped newline")
                return ''
            if c not in '<>"' and c != escape_char:
                log.error(lno, "inappropriate escape sequence '{}'",
                          escape_char + c)

            if len(c) == 1 and c not in "01234567":
                return c

            p = c[0]
            if p in ('d', 'D'):
                base = 10
                digits = c[1:]
            elif p in ('x', 'X'):
                base = 16
                digits = c[1:]
            else:
                base = 8
                digits = c
            try:
                return chr(int(digits, base))
            except ValueError:
                log.error(lno, "invalid escape sequence '{!r}' in string",
                          escape_char + c)
                return ''

    def diagnose_esc(m):
        g = m.lastgroup
        if g == "ESC":
            c = m.group(g)
            if c == '\n':
                # Look one past the end of the match.  Is it whitespace?
                loc = m.end(g)
                if len(m.string) > loc and m.string[loc] in " \t":
                    log.error(lno, "leading whitespace in string "
                              "after escaped newline")
            elif c not in '<>"' and c != escape_char:
                log.error(lno, "inappropriate escape sequence '{}'",

    # We only recognize the 'escape_char' and 'comment_char' directives
    # if they appear (in either order) on the very first one or two lines
    # in the file.
    for _ in range(2):
        m = directive_re.match(data)
        if not m: break

        if m.group(3) == '\n': lno += 1
        data = data[m.end():]
        which = m.group(1)
        arg = m.group(2)
        if len(arg) == 0:
            log.error(lno, "missing argument to {}_char directive", which)
        elif len(arg) != 1:
            log.error(lno, "argument to {}_char must be a single character",
        elif not ("!" <= arg <= "~" and arg not in ',;<>"'):
            log.error(lno, "{}_char may not be set to {!r}", which, arg)
        elif which == "comment":
            comment_char = arg
        else:
            escape_char = arg

    if comment_char == escape_char:
        log.error("comment_char and escape_char both set to {}", comment_char)
        escape_char = '\\'
        comment_char = '#'

    token_re = compile_token_re(escape_char, comment_char)
    esc_re = compile_esc_re(escape_char)

    for m in token_re.finditer(data):
        kind = m.lastgroup

        if kind == "NEWLINE":
            lno += 1

        elif kind == "BAD":
            log.error(lno, "inappropriate character {!r}",

        elif kind == "COMMENT":
            for c in inappropriate_unicode.finditer(m.group(kind)):
                log.error(lno, "inappropriate character {!r}",
                          c.group(0))

        elif kind == "WORD":
            value = m.group(kind)

            for c in inappropriate_ascii.finditer(value):
                log.error(lno, "inappropriate character {!r}",
                          c.group(0))

            for xm in esc_re.finditer(value):
                diagnose_esc(xm)

            if value == "comment_char" or value == "escape_char":
                          "{} directive must be at the top of the file",

            lno += value.count('\n')

        elif kind == "SYMBOL":
            value = m.group(kind)

            for c in inappropriate_ascii.finditer(value):
                log.error(lno, "inappropriate character {!r}",
                          c.group(0))

            # Check for close quote.
            end = m.end(kind)
            if len(data) == end or data[end] != '>':
                log.error(lno, "missing close '>' character")

            for xm in esc_re.finditer(value):
                diagnose_esc(xm)

            lno += value.count('\n')

        elif kind == "STRING":
            value = m.group(kind)

            for c in inappropriate_ascii.finditer(value):
                log.error(lno, "inappropriate character {!r}",
                          c.group(0))

            # Check for close quote.
            end = m.end(kind)
            if len(data) == end or data[end] != '"':
                log.error(lno, "missing close '\"' character")

            s = esc_re.sub(decode_and_diagnose_esc, value)
            if s:
                strings.append((lno, s))

            lno += value.count('\n')

        #else: other token types are currently ignored

    return strings

def process(fp, log, charsets):
    strings = scan_localedef(fp, log)

    for lno, s in strings:
        nfc_s = unicodedata.normalize("NFC", s)
        nfd_s = unicodedata.normalize("NFD", s)
        if s != nfd_s and s != nfc_s:
            log.error(lno, "string not normalized:")
            log.dump_codepoints("  source: ", s)
            if nfc_s == nfd_s:
                log.dump_codepoints("  nf[cd]: ", nfc_s)
                log.dump_codepoints("     nfc: ", nfc_s)
                log.dump_codepoints("     nfd: ", nfd_s)

        for charset, codec in charsets:
            # It's not necessary to do this test for UTF-8.
            if charset != "utf-8":
                    _ = codec.encode(s)
                except UnicodeEncodeError:
                    log.error(lno, "string not representable in {}:", charset)
                    log.dump_codepoints("    ", s)

def scan_supported_locales(fp, log):
    charsets = {}
    split_xlocale = re.compile(r"^([^.]*)[^@]*(.*)$")

    for z_lno, line in enumerate(fp):
        if not line: continue
        if line[0] == "#": continue
        if line == "SUPPORTED-LOCALES=\\\n": continue

        locale_code = line.split()[0]

        # Everything after the first slash names the character set.
        xlocale, _, charset = locale_code.partition('/')

        # 'xlocale' is in three pieces, of which two are optional:
        # base_locale [.encoding] [@variation]
        # The [.encoding] part needs to be removed, but the [@variation]
        # part should remain.
        locale = split_xlocale.sub(r"\1\2", xlocale)

        if locale not in charsets:
            charsets[locale] = set()

        try:
            co = codecs.lookup(charset)
            charsets[locale].add((co.name, co))

        except LookupError:
            log.error(z_lno + 1, "unknown charset {!r} for {}", charset, locale)

    return charsets

def process_files(args):
    logger = ErrorLogger(sys.stderr, args.verbose)

    charsets = {}
    if args.supported:
        with logging_for_file(logger, args.supported), \
             open(args.supported, "rt", encoding=args.encoding) as fp:
            charsets = scan_supported_locales(fp, logger)

    if args.files:
        files = set(args.files)
    else:
        files = set(charsets.keys())

    unsupported = []
    for f in sorted(set(files)):
        cs = charsets.get(os.path.basename(f), [])
        if args.supported and not cs:
            unsupported.append(os.path.basename(f))

        if args.locales_path and "/" not in f:
            f = os.path.join(args.locales_path, f)

        with logging_for_file(logger, f), \
             open(f, "rt", encoding=args.encoding) as fp:
            process(fp, logger, cs)

    if unsupported:
        sys.stderr.write("note: locales not in {}: {}\n"
                         .format(args.supported, " ".join(unsupported)))

    return logger.status

def main():
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("-v", "--verbose", action="store_true")
    ap.add_argument("-e", "--source-encoding", default="utf-8", dest="encoding")
    ap.add_argument("-f", "--supported-locales-file", dest="supported")
    ap.add_argument("-p", "--locales-path")
    ap.add_argument("files", nargs="*")
    args = ap.parse_args()

    if not args.files and not args.supported:
        ap.error("must provide either -f or locale definitions")

    sys.exit(process_files(args))

if __name__ == '__main__':
    main()

# Local Variables:
# indent-tabs-mode: nil
# End:
