[PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers

Simon Marchi simark@simark.ca
Mon Nov 20 01:38:00 GMT 2017


On 2017-11-19 07:41 PM, Pedro Alves wrote:
> The find-upper-bound-for-completion algorithm in the name components
> accelerator table in dwarf2read.c increments a char in a string, and
> asserts that it's not incrementing a 0xff char, but that's incorrect.
> 
> First, we shouldn't be calling gdb_assert on input.
> 
> Then, if "char" is signed, comparing a caracther with "0xff" will
> never yield true, which is caught by Clang with:
> 
>   error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
> 	    gdb_assert (after.back () != 0xff);
> 			~~~~~~~~~~~~~ ^  ~~~~
> 
> And then, 0xff is a valid character on non-UTF-8/ASCII character sets.
> E.g., it's 'ÿ' in Latin1.  While GCC nor Clang support !ASCII &&
> !UTF-8 characters in identifiers (GCC supports UTF-8 characters only
> via UCNs, see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html),
> but other compilers might (Visual Studio?), so it doesn't hurt to
> handle it correctly.  Testing is covered by extending the
> dw2_expand_symtabs_matching unit tests with relevant cases.
> 
> However, without further changes, the unit tests still fail...  The
> problem is that cp-name-parser.y assumes that identifiers are ASCII
> (via ISALPHA/ISALNUM).  This commit fixes that too, so that we can
> unit test the dwarf2read.c changes.  (The regular C/C++ lexer in
> c-lang.y needs a similar treatment, but I'm leaving that for another
> patch.)
> 
> While doing this, I noticed a thinko in the computation of the upper
> bound for completion in dw2_expand_symtabs_matching_symbol.  We're
> using std::upper_bound but we should use std::lower_bound.  I extended
> the unit test with a case that I thought would expose it, this one:
> 
>  +  /* These are used to check that the increment-last-char in the
>  +     matching algorithm for completion doesn't match "t1_fund" when
>  +     completing "t1_func".  */
>  +  "t1_func",
>  +  "t1_func1",
>  +  "t1_fund",
>  +  "t1_fund1",
> 
> The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
> matches incorrectly.  But turns out the problem is masked because
> later here:
> 
>   for (;lower != upper; ++lower)
>     {
>       const char *qualified = index.symbol_name_at (lower->idx);
> 
>       if (!lookup_name_matcher.matches (qualified)
> 
> the lookup_name_matcher.matches check above filters out "t1_fund"
> because that doesn't start with "t1_func".
> 
> I'll fix the latent bug in follow up patches, after factoring things
> out a bit in a way that allows unit testing the relevant code more
> directly.

Everything you said makes sense to me, the patch looks good to me.  I noted
one comment and a typo below.

> gdb/ChangeLog:
> yyyy-mm-dd  Pedro Alves  <palves@redhat.com>
> 
> 	* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
> 	(symbol_end): Use cp_ident_is_alnum.
> 	(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
> 	* dwarf2read.c (make_sort_after_prefix_name): New function.
> 	(dw2_expand_symtabs_matching_symbol): Use it.
> 	(test_symbols): Add more symbols.
> 	(run_test): Add tests.
> ---
>  gdb/cp-name-parser.y |  28 ++++++++++--
>  gdb/dwarf2read.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 129 insertions(+), 18 deletions(-)
> 
> diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
> index 33ecf13..fdfbf15 100644
> --- a/gdb/cp-name-parser.y
> +++ b/gdb/cp-name-parser.y
> @@ -1304,6 +1304,28 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
>  		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
>  }
>  
> +/* Like ISALPHA, but also returns true for the union of all UTF-8
> +   multi-byte sequence bytes and non-ASCII characters in
> +   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
> +   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
> +   identifiers, but we don't need to be pedantic so for simplicity we
> +   ignore that here.  Plus this avoids the complication of actually
> +   knowing what was the right encoding.  */
> +
> +static inline bool
> +cp_ident_is_alpha (unsigned char ch)
> +{
> +  return ISALPHA (ch) || ch >= 0x80;
> +}
> +
> +/* Similarly, but Like ISALNUM.  */
> +
> +static inline bool
> +cp_ident_is_alnum (unsigned char ch)
> +{
> +  return ISALNUM (ch) || ch >= 0x80;
> +}
> +
>  /* Find the end of a symbol name starting at LEXPTR.  */
>  
>  static const char *
> @@ -1311,7 +1333,7 @@ symbol_end (const char *lexptr)
>  {
>    const char *p = lexptr;
>  
> -  while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
> +  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
>      p++;
>  
>    return p;
> @@ -1791,7 +1813,7 @@ yylex (void)
>        return ERROR;
>      }
>  
> -  if (!(c == '_' || c == '$' || ISALPHA (c)))
> +  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
>      {
>        /* We must have come across a bad character (e.g. ';').  */
>        yyerror (_("invalid character"));
> @@ -1802,7 +1824,7 @@ yylex (void)
>    namelen = 0;
>    do
>      c = tokstart[++namelen];
> -  while (ISALNUM (c) || c == '_' || c == '$');
> +  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
>  
>    lexptr += namelen;
>  
> diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
> index 5437d21..b08e81c 100644
> --- a/gdb/dwarf2read.c
> +++ b/gdb/dwarf2read.c
> @@ -4188,6 +4188,60 @@ gdb_index_symbol_name_matcher::matches (const char *symbol_name)
>    return false;
>  }
>  
> +/* Starting from a search name, return the string that finds the upper
> +   bound of all strings that start with SEARCH_NAME in a sorted name
> +   list.  Returns the empty string to indicate that the upper bound is
> +   the end of the list.  */
> +
> +static std::string
> +make_sort_after_prefix_name (const char *search_name)
> +{
> +  /* When looking to complete "func", we find the upper bound of all
> +     symbols that start with "func" by looking for where we'd insert
> +     "func"-with-last-character-incremented, i.e. "fund".  */
> +  std::string after = search_name;
> +
> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
> +     allowing it in identifiers.  If we run into it, increment the
> +     previous character instead and shorten the string.  If the very
> +     first character turns out to be 0xff, then the upper bound is the
> +     end of the list.

It's a bit of a nit, but I think this explanation could be a bit more
precise, and maybe simpler.  Maybe you could just say that you strip all
trailing 0xff characters, and increment the last non-0xff character of
the string.  If the string is composed only of 0xff characters, then the
upper bound is the end of the list.

The "If the very first character turns out to be 0xff" threw me off a bit,
because if you have the string "\xffa\xff", the upper bound will be "\xffb",
not the end of the list, despite the very first character being 0xff.

> +
> +     E.g., with these symbols:
> +
> +      func
> +      func1
> +      fund
> +
> +     completing "func" looks for symbols between "func" and
> +     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
> +     which finds "func" and "func1", but not "fund".
> +
> +     And with:
> +
> +      funcÿ     (Latin1 'ÿ' [0xff])
> +      funcÿ1
> +      fund
> +
> +     completing "funcÿ", look for symbols between "funcÿ" and "fund"

looks

Simon



More information about the Gdb-patches mailing list