[PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668)

Carlos O'Donell carlos@redhat.com
Mon Jun 29 19:42:09 GMT 2020

On 6/29/20 4:13 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
>> During ellipsis processing the collation cursor was not correctly
>> moved to the end of the ellipsis after processing.  This meant that
>> the cursor was left, usually, at the second to last entry.
>> Subsequent operations end up unlinking the ellipsis end entry or
>> just leaving it in the list dangling from the end.  This kind of
>> dangling is immediately visible in C.UTF-8 with the following
>> sorting from strcoll:
>> <U0010FFFF>
>> <U0000FFFF>
>> <U000007FF>
>> <U0000007F>
>> With the cursor correctly adjusted the end entry is correctly given
>> the right location and thus the right weight.
>> No regressions on x86_64 and i686.
>> Co-authored-by: Carlos O'Donell <carlos@redhat.com>
>> ---
>>  locale/programs/ld-collate.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
>> index feb1a11258..a8ba2f07f0 100644
>> --- a/locale/programs/ld-collate.c
>> +++ b/locale/programs/ld-collate.c
>> @@ -1483,6 +1483,9 @@ order for `%.*s' already defined at %s:%Zu"),
>>  	    }
>>  	}
>>      }
>> +  /* Move the cursor to the last entry in the ellipsis.
>> +     Subsequent operations need to start from the last entry.  */
>> +  collate->cursor = endp;
>>  }
> Can't endp be NULL at this point?

Yes, but it's an error condition.

If it's NULL and symstr != NULL, then we are in an error condition
and will trigger the "symbolic range ellipsis must not be directly 
followed by `order_end'", and we will set the cursor to NULL.

It is common to set the cursor to NULL after an error to avoid any
subsequent unlink_element() from working due to the error. In practice
it looks like we'd crash if you called unlink_element() with the cursor
set to NULL e.g. NULL deref in unlink_element(). You shouldn't be doing
that anyway, you should be shutting down.

I could clarify the error condition in a comment?
> Besides that, why does the change make a difference at all?  There
> already is an assignment to collate->cursor in both “Enqueue the new
> element” cases.

When we get called we have the following:

... element_t <-> element_t <-> element_t
                  "<U0000>"     "..."

The cursor points to a double-linked list of collation elements.
We enter with the pseudo-entry pointing at the ellipsis. Next we
unlink the ellipsis, and the cursor is back at <U0000>.

... element_t <-> element_t

The value of symstr is the ending symbol of the ellipsis so it's <U007F>.
We add that onto the list:

... element_t <-> element_t <-> element_t
                  "<U0000>"     "<U007F>"

However, then we set the cursor on purpose because we want to insert
symbols between the start and end elements:

1092   /* Reset the cursor.  */
1093   collate->cursor = startp;

... element_t <-> element_t <-> element_t
                  "<U0000>"     "<U007F>"
                  startp        endp

Then we proceed to add elements *after* the cursor. We iterate from
U0001 to U007E, adding entries.

1451		      /* Enqueue the new element.  */
1452		      elem->last = collate->cursor;
1453		      elem->next = collate->cursor->next;
1454		      elem->last->next = elem;
1455		      if (elem->next != NULL)
1456			elem->next->last = elem;
1457		      collate->cursor = elem;

This code inserts the new entry after the cursor, but before the
real end of the ellipsis:

... element_t <-> element_t <-> element_t <-> element_t
                  "<U0000>"     "<U0001>"     "<U007F>"
                  startp                      endp

At the end of the function we have:

... element_t <-> element_t <-> element_t
                  "<U007E>"     "<U007F>"

The cursor should be pointing at endp, the last element in the
double linked list, otherwise when we come back to the caller we
will start inserting the next line after <U007E>.

The stack at this point looks like this:
#0  handle_ellipsis (ldfile=<optimized out>, ldfile@entry=0x5555555a2e80, symstr=<optimized out>, 
    symstr@entry=0x7fffffffd080 "U0000007F", symlen=<optimized out>, symlen@entry=9, ellipsis=<optimized out>, 
    ellipsis@entry=tok_ellipsis2, charmap=<optimized out>, charmap@entry=0x5555555a2fd0, repertoire=<optimized out>, 
    repertoire@entry=0x0, result=<optimized out>) at programs/ld-collate.c:1488
#1  0x000055555557d424 in collate_read (ldfile=<optimized out>, result=0x7fffffffd170, charmap=0x5555555a2fd0, repertoire_name=0x0, 
    ignore_content=0) at programs/ld-collate.c:3670
#2  0x000055555558436d in locfile_read (result=0x7fffffffd170, charmap=0x5555555a2fd0) at programs/locfile.c:180
#3  0x000055555555da3f in main (argc=<optimized out>, argv=0x7fffffffd3f8) at programs/localedef.c:263

We return to collate_read, break, and read more data:
3963       /* Prepare for the next round.  */
3964       now = lr_token (ldfile, charmap, result, NULL, verbose);
3965       nowtok = now->tok;

We read the *next* ellipsis start symbol, remember that.
We read the *next* ellipsis "..." (tok_ellipsis2), remember that.
We read the *next* ellipsis end symbol, and this triggers handle_ellipsis.
All this time insert_value()->insert_weights() is using collate->cursor to insert
values immediately after <U007E>, and the <U007F> (end of ellipsis) is still at
the end of the doubly-linked list.

When I finish parsing C.UTF-8 I have a list that ends like this:

<U0000> <-> ... <-> <U10FFFF> <-> <UFFFF> <-> <U07FF> <-> <U007F>

With each tok_ellipsis2 failing to reset the cursor, and so leaving the
trailing end of the last ellipsis in the doubly-linked list to get assigned
a weight in that order.

Does that clarify the fix?

Shall I add more comments about the cursor handling?


More information about the Libc-alpha mailing list