This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC][PATCH v9 2/6] Implement alternative month names (bug 10871).


27.10.2017 18:29 Zack Weinberg <zackw@panix.com> wrote:
> [...]
> I do not see any new tests anywhere in this patch series, and I found
> a bug in the support for %OB in strptime (see below). To test this
> feature we need at least one locale that actually uses it, so rather
> than gate *these* patches on the addition of tests, I am asking you to
> write comprehensive tests and include them with the first patch you
> submit that adds altmon names to a locale. (This locale will have to
> be one of the locales currently listed in localedata/Makefile as
> available to tests (the test-input variable, I think); if none of them
> need this feature, you may add one.)

I've browsed some test and I've found that a similar test already
exists: time/tst-strptime.c. [1] All we need is to add some more tests
for some more locales.  Of course we need to add more LOCALES to the
corresponding Makefile because now it contains only German, US English,
and Japanese.  I think I should add more than one locale.  Please confirm
that I'm thinking correctly.

Actually we don't need the updated locale data containing the actual
alternative month names.  In most cases the %B/%OB format specifiers
should work correctly in strptime() no matter if the input string contains
the nominative or genitive case because the algorithm searches for the
best match rather than for the equal string.

A longer explanation: The algorithm selects the month name which has
the longest matching initial substring with the input string, including
the terminating '\0' character.  Since in eastern European languages the
grammatical cases are made by appending or changing suffixes while the
stems (usually) remain the same, this algorithm is still able to
recognize the month name correctly.  Though, there are cases where
this will not work:

* Romance languages using prepositions e.g., Catalan "de novembre"
  and "d’octubre" will not be recognized correctly; their best match
  without the preposition will be always "desembre" (December)
  which is incorrect.
* Russian and Belarusian "мая" (of May) will match both "май" (May)
  and "март" (March), therefore it will be ambiguous and (if I think
  correctly) it will be recognized as March which is incorrect.

Of course it's worth to add such tests once the genitive month names
are actually added.

>
> [...]
> A note on writing ChangeLog entries: we really do want you to list
> every single new symbol that's part of the public API, not an
> abbreviated list -- that is:
>
> * locale/langinfo.h [__USE_GNU]: New public symbols ALTMON_1,
> ALTMON_2, ALTMON_3, ALTMON_4, ALTMON_5, ALTMON_6, ALTMON_7,
> ALTMON_8, ALTMON_9, ALTMON_10, ALTMON_11, ALTMON_12,
> _NL_WALTMON_1, _NL_WALTMON_2, _NL_WALTMON_3, _NL_WALTMON_4,
> _NL_WALTMON_5, _NL_WALTMON_6, _NL_WALTMON_7, _NL_WALTMON_8,
> _NL_WALTMON_9, _NL_WALTMON_10, _NL_WALTMON_11, _NL_WALTMON_12.

Applied locally.  Although I have some doubts.  Only ALTMON_x
symbols are defined conditionally (#ifdef __USE_GNU) while
_NL_WALTMON_x are defined unconditionally.  Additionally, there
are __ALTMON_x symbols, also unconditional, defined only because
ALTMON_x are not (yet) defined by POSIX. [2] [3] Should the
ChangeLog be more detailed?

> [ cut longer explanation about ChangeLog ]

Another longer explanation which I find worth copying to wiki,
again if you don't mind, Zack.

> [...]
> Another minor procedural thing: don't include diffs to the file
> ChangeLog in commits sent for review, because if a reviewer wants to
> apply the patch and tinker with it, they don't want to have to clean
> up the inevitable merge botch in the ChangeLog.

OK, again worth copying to wiki.

> > @@ -402,6 +404,20 @@ __strptime_internal (const char *rp, const char *fmt,
> > struct tm *tmp,
> > if (s.decided !=raw)
> > {
> > trp = rp;
> > +#ifdef _LIBC
> > + /* First check the alt month. */
>
> Here you are checking the alt-month name before the month name ...
>
> > @@ -428,6 +444,10 @@ __strptime_internal (const char *rp, const char *fmt,
> > struct tm *tmp,
> > if (s.decided != loc
> > && (((trp = rp, match_string (month_name[cnt], trp))
> > && trp > rp_longest)
> > +#ifdef _LIBC
> > + || ((trp = rp, match_string (alt_month_name[cnt], trp))
> > + && trp > rp_longest)
> > +#endif
> > || ((trp = rp, match_string (ab_month_name[cnt], trp))
> > && trp > rp_longest)))
>
> ... and here you are checking the alt-month name after the month name.
> Please make this consistent one way or the other. (I *think* the
> order of checks would only ever matter if the alt-month name for month
> X were the same as the month name for month Y, which would be a nasty
> ambiguity in the relevant natural language -- but nasty ambiguities do
> actually happen in natural languages, so we need to be careful.)

Indeed, it should not matter because the algorithm searches for the
best matching string which should be only one.  If there are two best
matching strings then the input string is ambiguous and cannot be
determined, probably the reason is that the input string is wrong.
If a language has a string which means two different months depending
on the context then the native speakers have a trouble understanding it.
That means, it's unlikely that a language has such a feature.  But
I agree, "unlikely" is not "impossible".  Please note that so far
I've identified only about 20 languages (out of about 200 supported
by glibc) which need the nominative/genitive difference in month
names, they are in my github repo. [4] It is possible to verify them
manually and I think that I did it in the past although I'm not
sure now.

But I will apply your remark and introduce the consistency to make
the code more readable.  So far I was focused on making the *patches*
more readable.

Note that the question "what should be matched first: the basic or
the regular month name?" does not have a good answer.  The real good
answer should be "alternative first if the O modifier is present,
basic first otherwise" but since the algorithm drops the modifier
and treats all specifiers: B, b, and h identically it is difficult
to tell.  Is it worth to store this information and provide different
paths for the specifiers with the O modifier and without it?  I don't
think so.  Please, note above, most probably that case does not
actually exist.

While at this, I'm somehow concerned by the waste of the CPU cycles
to check both basic and alternative month names while in 90% languages
these strings will always be the same.  But I think that checking
whether the current locale actually requires the alternative month
names and providing separate paths for them would be only worse.

> I wonder whether we actually need the #ifdef _LIBC here -- but that's
> an independent question (is this code _really_ still shared with
> gnulib, or have they diverged to the point where we should stop trying
> to keep the files in sync?) and you are not on the hook to answer it.

OK, I'm unable to answer this reliably but I think that gnulib should
follow this change as well.  The reason is that the command line utility
date uses gnulib rather than glibc.  So if we implement the change in
glibc but do not implement it in other libraries and programs the change
will be incomplete.

> > @@ -1015,6 +1035,10 @@ __strptime_internal (const char *rp, const char *fmt,
> > struct tm *tmp,
> > case 'O':
> > switch (*fmt++)
> > {
> > + case 'B':
> > + /* Undo the increment and continue. */
> > + fmt--;
> > + break;
>
> This is subtly wrong, and I had to read a big chunk of the entire
> function to understand what it was meant to do and how it doesn't
> actually work.

Oops... :-(

Thank you for spotting, analyzing and fixing this.

> Please change to
>
> ] case 'O':
> ] switch (*fmt++)
> ] {
> ] + case 'B':
> ] + /* Match month name. Reprocess as plain 'B'. */
> ] + fmt--;
> ] + goto start_over;
>
> You will also need to remove the `#ifndef _NL_CURRENT` wrapping the
> definition of the `start_over` label, and change its comment:
>
> ] -#ifndef _NL_CURRENT
> ] - /* We need this for handling the `E' modifier. */
> ] + /* In some cases, modifiers are handled by adjusting state and
> ] + then restarting the switch statement below. */
> ] start_over:
> ] -#endif
> ]
> ] /* Make back up of current processing pointer. */

I will analyze it more thoroughly but at the first sight your fixes
are correct and do not require any further changes.

> (Am I seriously telling you to use `goto`? Yes, I am. [...]

I'm OK with goto if there are good reasons to use it, especially
in the core libraries.  Most of the time there are better solutions
but this time your reasons are good.

Thank you again, regards,

Rafal


[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=time/tst-strptime.c;hb=HEAD
[2] https://sourceware.org/ml/libc-alpha/2016-10/msg00303.html
[3] https://sourceware.org/ml/libc-alpha/2016-12/msg01065.html
[4] https://github.com/rluzynski/glibc


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]