This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
In the thread about using fewer <Uxxxx> escapes in the locale source
files, Carlos was concerned that, if we went over to UTF-8 for
everything, not just decoded the escapes that represent ASCII, it
would be easy for people to miss incorrectly encoded character
sequences - text that isn't normalized, for instance, or homograph
characters that *look* OK but are incorrect for the locale's language.

It seems to me that this sort of check is not something that humans
should have to do by eye; rather, it's a job for a linter. So I wrote
one. :) It currently looks for:

- "inappropriate" escape sequences and characters, using a quite
  strict notion of "inappropriate";

- strings that are not in Unicode Normalization Form C;

- strings that cannot be transcoded to the legacy charset for the
  locale (as defined by a "% Charset: xxx" annotation in the file -
  note that not all the files have such annotations).

It is not ready for prime time; it is very slow (Python isn't really
designed to go character-by-character through a file; it can probably
be sped up with a cleverer lexer) and it finds a whole bunch of
existing errors, some of which may not actually be _problems_, if you
see what I mean. I've attached the script and the result of running
it over all of the files in localedata/locales/. But it's ready for
people to poke at.

Some notes on what I found with it:

- Many of the existing locale files have non-ASCII text in their
  comments. This text is _invariably_ encoded in UTF-8. That no one
  has complained about this is a weak argument in favor of it being
  safe to go ahead with UTF-8 - there might be localedef
  implementations that accept non-ASCII in comments but not elsewhere,
  I suppose.

- A few of the existing locale files have non-ASCII text in strings
  already. Again, this is invariably encoded in UTF-8, and I think so
  far it's limited to LC_IDENTIFICATION (accented characters in the
  author's name, that sort of thing).

- Quite a few of the existing locale files have strings outside
  LC_IDENTIFICATION that contain "raw" ASCII already. (This is why I
  had to write a full-on lexer for the format; existing files contain
  both % inside "" strings, and " characters inside % comments. That
  was a step beyond what I felt like doing with regexes.)

- There are quite a few strings that aren't NFC, and I suspect it's
  going to take expert knowledge of the languages involved to tell
  whether that's desirable.

- A significantly cleverer homograph checker is wanted, one that keys
  off of the ISO language code, rather than the legacy charset. (The
  legacy-charset check is already done by localedef, AFAIK, and
  localedef has more complete information when it does that.)

- The complaints about "inappropriate character '\t'" are all caused
  by _unintentional_ tabs inside strings. If you write

      message "xyz/
               abc"

  the whitespace on the second line gets included in the string, which
  is not what you want. The linter currently only detects this when
  that indentation is done with tabs, but I think it should probably
  detect spaces as well. If you _mean_ to put a tab in a string, write
  <U0009>. :-)

- All of the complaints about "inappropriate escape sequences" boil
  down to people forgetting that / is an escape character in these
  files, and writing strings with slashes in them. This is limited to
  LC_IDENTIFICATION, so it's cosmetic, but it's still wrong and
  IMNSHO justifies the linter insisting that you're only supposed to
  use / to escape <>/".

- Speaking of, why is it that every single locale source file uses %
  for comments and / for escapes, instead of the default # for
  comments and \ for escapes? It seems gratuitous and it made the
  linter harder to write.

- Suggestions for additional checks are welcome.

zw
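For readers who want to see the two string checks described above in isolation, here is a minimal sketch. The helper name check_string and the sample strings are hypothetical and do not come from the attached script; it only assumes that the legacy charset is one Python's codecs module knows about.

```python
import codecs
import unicodedata


def check_string(s, charset):
    """Return a list of problems found in S (hypothetical helper)."""
    problems = []
    # A string is in NFC exactly when NFC normalization leaves it unchanged.
    if unicodedata.normalize("NFC", s) != s:
        problems.append("not NFC")
    # CodecInfo.encode is strict by default and raises on unmappable text.
    try:
        codecs.lookup(charset).encode(s)
    except UnicodeEncodeError:
        problems.append("not representable in " + charset)
    return problems


# Precomposed c-cedilla: already NFC and present in ISO-8859-1.
print(check_string("Fran\u00e7ais", "iso-8859-1"))
# 'c' followed by COMBINING CEDILLA: fails both checks.
print(check_string("Franc\u0327ais", "iso-8859-1"))
# Cyrillic text: NFC, but outside ISO-8859-1.
print(check_string("\u0420\u0443\u0441\u0441\u043a\u0438\u0439", "iso-8859-1"))
```

Note that neither check can tell a homograph from the intended letter; that would need per-language knowledge, as the notes above say.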
Attachment:
locale-errs.txt
Description: Text document
#!/usr/bin/python3
# Validate locale definitions.
# Copyright (C) 2017 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
#
# The GNU C Library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# The GNU C Library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with the GNU C Library; if not, see
# <http://www.gnu.org/licenses/>.

"""Validate locale definition files in ways that are too complicated
or too expensive to code into localedef.  This script is run over all
locale definitions as part of 'make check', when Python 3 is available.

Currently this performs two checks on each string within each file on
the command line: it must be unchanged by Unicode NFC normalization,
and it must be representable in the legacy character set(s) declared
in an annotation (e.g. % Charset: ISO-8859-5, KOI8-R).
"""

import argparse
import codecs
import contextlib
import functools
import os
import re
import sys
import textwrap
import traceback
import unicodedata

from curses.ascii import isgraph


class ErrorLogger:
    def __init__(self, ofp, verbose):
        self.ofp = ofp
        self.verbose = verbose
        self.status = 0
        self.fname = None
        self.fstatus = 0

    def begin_file(self, fname):
        self.fname = fname
        self.fstatus = 0
        if self.verbose:
            self.ofp.write(self.fname)
            self.ofp.write("... ")

    def end_file(self):
        if self.fstatus:
            self.status = 1
        elif self.verbose:
            self.ofp.write("OK\n")

    def error(self, lineno, message, *args):
        if self.verbose:
            if self.fstatus == 0:
                self.ofp.write("\n")
            self.ofp.write(" ")
        if args:
            message = message.format(*args)
        self.ofp.write("{}:{}: {}\n".format(self.fname, lineno, message))
        self.fstatus = 1

    def oserror(self, filename, errmsg):
        # If all these things are true, the last thing printed was the
        # filename that provoked an OS error (e.g. we failed to open the
        # file we're logging for) so just print the error message.
        if self.verbose and self.fname == filename and self.fstatus == 0:
            self.ofp.write(errmsg)
            self.ofp.write("\n")
        else:
            if self.verbose:
                if self.fstatus == 0:
                    self.ofp.write("\n")
                self.ofp.write(" ")
            self.ofp.write("{}: {}\n".format(filename, errmsg))
        self.fstatus = 1

    def exception(self):
        if self.verbose:
            if self.fstatus == 0:
                self.ofp.write("\n")
            prefix = " "
        else:
            prefix = ""
            self.ofp.write("{}: error:\n".format(self.fname))
        for msg in traceback.format_exc().split("\n"):
            self.ofp.write(prefix)
            self.ofp.write(msg)
            self.ofp.write("\n")
        self.fstatus = 1

    def dump_codepoints(self, label, s):
        codepoints = [ord(c) for c in s]
        if any(c > 0xFFFF for c in codepoints):
            form = "06X"
        else:
            form = "04X"
        dumped = " ".join(format(c, form) for c in codepoints)
        if self.verbose:
            label = " " + label
        self.ofp.write(textwrap.fill(dumped, width=78,
                                     initial_indent=label,
                                     subsequent_indent=" "*len(label)))
        self.ofp.write("\n")


@contextlib.contextmanager
def logging_for_file(log, fname):
    try:
        log.begin_file(fname)
        yield
    except OSError as e:
        log.oserror(e.filename, e.strerror)
    except Exception:
        log.exception()
    finally:
        log.end_file()


class PushbackWrapper:
    """Wrap around a file-like object and provide a pushback stack.
    Also counts line numbers for you, so that you don't double-count
    pushed-back newlines.

    This is not itself a file-like object; its only methods are get(),
    which returns a single character, and pushback().

    Also, although calling get() without pushback() will eventually
    _consume_ all of the underlying stream, this object does _not_ own
    the underlying stream; in particular it will not close the
    underlying stream for you.
    """
    def __init__(self, fp):
        self.lineno = 1
        self._fp = fp
        self._pushback = []

    def get(self):
        if self._pushback:
            return self._pushback.pop()
        c = self._fp.read(1)
        if c == '\n':
            self.lineno += 1
        return c

    def pushback(self, c):
        self._pushback.append(c)


def inappropriate_unichar(c):
    """A relaxed definition of 'inappropriate character', currently
    used in comments only: arbitrary Unicode characters are allowed,
    but not the legacy control characters (except TAB), nor the
    Unicode NIH line-breaking characters, nor bare surrogates, nor
    noncharacters.  Private-use, not-yet-assigned, and format controls
    (Cf) are fine, except that BYTE ORDER MARK (U+FEFF) is not
    allowed.  OBJECT REPLACEMENT CHARACTER (U+FFFC) and REPLACEMENT
    CHARACTER (U+FFFD) are officially "symbols", but we weed them out
    as well, because their presence in a locale file means something
    has gone wrong somewhere.
    """
    cat = unicodedata.category(c)
    if cat == 'So' and (c == '\uFFFC' or c == '\uFFFD'):
        return True
    if cat == 'Zl' or cat == 'Zp' or cat == 'Cs':
        return True
    if cat == 'Cc' and c != '\t':
        return True
    if cat == 'Cf' and c == '\uFEFF':
        return True
    if cat == 'Cn' and ord(c) in {
            0x00FDD0, 0x00FDD1, 0x00FDD2, 0x00FDD3,
            0x00FDD4, 0x00FDD5, 0x00FDD6, 0x00FDD7,
            0x00FDD8, 0x00FDD9, 0x00FDDA, 0x00FDDB,
            0x00FDDC, 0x00FDDD, 0x00FDDE, 0x00FDDF,
            0x00FDE0, 0x00FDE1, 0x00FDE2, 0x00FDE3,
            0x00FDE4, 0x00FDE5, 0x00FDE6, 0x00FDE7,
            0x00FDE8, 0x00FDE9, 0x00FDEA, 0x00FDEB,
            0x00FDEC, 0x00FDED, 0x00FDEE, 0x00FDEF,
            0x00FFFE, 0x00FFFF, 0x01FFFE, 0x01FFFF,
            0x02FFFE, 0x02FFFF, 0x03FFFE, 0x03FFFF,
            0x04FFFE, 0x04FFFF, 0x05FFFE, 0x05FFFF,
            0x06FFFE, 0x06FFFF, 0x07FFFE, 0x07FFFF,
            0x08FFFE, 0x08FFFF, 0x09FFFE, 0x09FFFF,
            0x0AFFFE, 0x0AFFFF, 0x0BFFFE, 0x0BFFFF,
            0x0CFFFE, 0x0CFFFF, 0x0DFFFE, 0x0DFFFF,
            0x0EFFFE, 0x0EFFFF, 0x0FFFFE, 0x0FFFFF,
            0x10FFFE, 0x10FFFF,
    }:
        return True
    return False


def tok_escape(fp, log, escape_char):
    """Consume an escape sequence from FP and return its value.
    If the character escaped is not the escape_char, a newline,
    '"', '<', or '>', issue an error -- we want only <Uxxxx> used
    for anything else -- but do properly crunch the escape
    regardless."""
    c = fp.get()
    if c == 'x':
        # \x consumes one or two hexadecimal digits.
        maxchars = 2
        base = 16
        ok = "0123456789abcdef"
        digits = []
        prefix = escape_char + c
    elif c == 'd':
        # \d consumes one, two, or three decimal digits.
        maxchars = 3
        base = 10
        ok = "0123456789"
        digits = []
        prefix = escape_char + c
    elif c in "01234567":
        # \0 consumes one, two, or three octal digits.
        maxchars = 3
        base = 8
        ok = "01234567"
        digits = [c]
        prefix = escape_char
    else:
        # Not a numeric escape.
        if c not in ('\n', '"', '<', '>', escape_char):
            log.error(fp.lineno, "inappropriate escape sequence '{}{}'",
                      escape_char, c)
        return c

    while len(digits) < maxchars:
        d = fp.get()
        # The explicit EOF test is needed because '' is "in" every string.
        if d == '' or d not in ok:
            fp.pushback(d)
            break
        digits.append(d)

    if not digits:
        # \x or \d with no digits after it: report it and return the
        # letter itself, instead of crashing in int() below.
        log.error(fp.lineno, "inappropriate escape sequence '{}'", prefix)
        return c

    s = "".join(digits)
    log.error(fp.lineno, "inappropriate escape sequence '{}{}'", prefix, s)
    return chr(int(s, base))


def tokenize(fp, log):
    """Tokenize a locale definition file.  Yields a sequence of pairs
    (lineno, string).  May also emit error messages.
    """

    # Tokenizer state codes
    S_START = 0    # in between tokens
    S_WORD = 1     # foo, 123
    S_STRING = 2   # "foo" or <foo>
    S_COMMENT = 3  # comment_char to EOL

    tbuf = []
    tline = None
    comment_char = '#'
    escape_char = '\\'
    end_char = None
    state = S_START
    fp = PushbackWrapper(fp)

    while True:
        c = fp.get()
        if state == S_START:
            if c == '':  # end of file
                break
            if c == ' ' or c == '\t' or c == '\n':
                pass
            elif c == ',' or c == ';':
                yield (fp.lineno, c)
            elif c == comment_char:
                state = S_COMMENT
                tline = fp.lineno
                tbuf.append(c)
            elif c == '<':
                state = S_STRING
                end_char = '>'
                tline = fp.lineno
                tbuf.append(c)
            elif c == '"':
                state = S_STRING
                end_char = '"'
                tline = fp.lineno
                tbuf.append(c)
            elif c == escape_char:
                c = tok_escape(fp, log, escape_char)
                state = S_WORD
                tline = fp.lineno
                tbuf.append(c)
            elif isgraph(c):
                state = S_WORD
                tline = fp.lineno
                tbuf.append(c)
            else:
                log.error(fp.lineno, "inappropriate character {!r}", c)

        elif state == S_WORD:
            if c == escape_char:
                c = tok_escape(fp, log, escape_char)
                if c != '\n':
                    tbuf.append(c)
            # isgraph('') raises TypeError, so test for EOF first.
            elif (c != '' and isgraph(c) and c != comment_char
                  and c not in ',;<"'):
                tbuf.append(c)
            else:
                fp.pushback(c)
                state = S_START
                word = ''.join(tbuf)
                tbuf.clear()
                if word == "escape_char" or word == "comment_char":
                    c = fp.get()
                    while c == ' ' or c == '\t':
                        c = fp.get()
                    if c == '\n':
                        log.error(fp.lineno - 1, "empty {} directive", word)
                    elif c in ',;<"' or not isgraph(c):
                        log.error(fp.lineno, "{} may not be set to {!r}",
                                  word, c)
                    elif word == "escape_char":
                        if c == comment_char:
                            log.error(fp.lineno,
                                      "escape_char and comment_char "
                                      "may not be the same")
                        else:
                            escape_char = c
                    else:
                        if c == escape_char:
                            log.error(fp.lineno,
                                      "escape_char and comment_char "
                                      "may not be the same")
                        else:
                            comment_char = c
                else:
                    yield (tline, word)

        elif state == S_STRING:
            if c == escape_char:
                c = tok_escape(fp, log, escape_char)
                if c != '\n':
                    tbuf.append(c)
            elif c == '\n' or c == '' or c == end_char:
                if c != end_char:
                    log.error(fp.lineno - (0 if c == '' else 1),
                              "end of {} in {}",
                              "file" if c == '' else "line",
                              "string" if end_char == '"' else "symbol")
                state = S_START
                yield (tline, ''.join(tbuf))
                tbuf.clear()
                end_char = None
            else:
                # We don't accept tab here; inside a string, tab
                # should be <U0009> to make clear that it is
                # intentional.
                if c != ' ' and not isgraph(c):
                    log.error(fp.lineno,
                              "inappropriate character {!r} in {}", c,
                              "string" if end_char == '"' else "symbol")
                else:
                    tbuf.append(c)

        elif state == S_COMMENT:
            # POSIX specifically says that comments are _not_ continued
            # onto the next line by the escape_char.
            if c == '\n' or c == '':
                state = S_START
                yield (tline, ''.join(tbuf))
                tbuf.clear()
            else:
                # In comments, we relax the definition of "inappropriate
                # character"; arbitrary Unicode is allowed.
                if inappropriate_unichar(c):
                    log.error(fp.lineno, "inappropriate character {!r}", c)
                else:
                    tbuf.append(c)


# Raw string, so that \b is a regex word boundary, not a backspace.
charset_re = re.compile(r"(?i)\bcharset: (.+)$")
charset_split_re = re.compile("[,; \t][ \t]*")


def add_charsets(line, lno, charsets, log):
    m = charset_re.search(line)
    if not m:
        return
    for cs in charset_split_re.split(m.group(1)):
        try:
            co = codecs.lookup(cs)
            if co.name not in charsets:
                charsets[co.name] = co
        except LookupError:
            log.error(lno, "unknown charset {!r}", cs)


unicode_symbol_re = re.compile("(?i)<U([0-9a-f]+)>")


def decode_unicode_symbols(s, lineno, log):
    """Convert <Uxxxx> tokens to the corresponding characters.
    Other symbolic names are left untouched."""
    try:
        return unicode_symbol_re.sub(lambda c: chr(int(c.group(1), 16)), s)
    except (UnicodeError, ValueError) as e:
        log.error(lineno, "invalid <Uxxxx> token in string: {}", str(e))
        # Return the undecoded text so the caller still gets a string.
        return s


def process(fp, log):
    strings = []
    charsets = {}
    for lno, tok in tokenize(fp, log):
        # startswith() rather than tok[0], in case a token is empty.
        if tok.startswith('"'):
            s = decode_unicode_symbols(tok[1:], lno, log)
            canon_s = unicodedata.normalize("NFC", s)
            if canon_s != s:
                log.error(lno, "string not normalized:")
                log.dump_codepoints(" source: ", s)
                log.dump_codepoints(" nfc: ", canon_s)
            strings.append((lno, canon_s))
        elif tok.startswith('%'):
            add_charsets(tok[1:], lno, charsets, log)
        else:
            pass  # ignore all other tokens for now

    for charset, codec in sorted(charsets.items()):
        for lno, s in strings:
            try:
                _ = codec.encode(s)
            except UnicodeEncodeError:
                log.error(lno, "string not representable in {}:", charset)
                log.dump_codepoints(" ", s)


def process_files(args):
    logger = ErrorLogger(sys.stderr, args.verbose)
    for f in args.files:
        with logging_for_file(logger, f), \
             open(f, "rt", encoding=args.encoding) as fp:
            process(fp, logger)
    return logger.status


def main():
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("-v", "--verbose", action="store_true")
    ap.add_argument("-e", "--encoding", default="utf-8")
    ap.add_argument("files", nargs="+")
    args = ap.parse_args()
    sys.exit(process_files(args))


if __name__ == "__main__":
    main()
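As an aside on the noncharacter test in inappropriate_unichar: the long literal set could probably be replaced by arithmetic, since the Unicode noncharacters are exactly U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The sketch below is only an illustration; is_noncharacter is a hypothetical name that does not appear in the script.

```python
def is_noncharacter(cp):
    """True if code point CP (an int) is a Unicode noncharacter.

    Hypothetical helper, not part of the attached script.
    """
    # The contiguous block U+FDD0..U+FDEF ...
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # ... plus every code point whose low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate them all; there should be 32 + 2*17 = 66.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))
```

Whether this is clearer than the explicit list is a matter of taste; the list has the advantage of being trivially checkable against the standard.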