[PATCH,PE] Allow .DEF file parser to handle 'foreign' language symbols.

Tue May 26 16:22:00 GMT 2009

    Hi all,

  Unless anyone objects, I intend to apply the attached patch within the next
24 hours or so, subject to some further testing on a number of targets
(including Cygwin, MinGW and CeGCC).  I think I'm competent with Bison but
wouldn't mind if someone even more experienced cast an eye over that part of
the patch; I think I did right by using left-recursion and there aren't any
new shift-reduce conflicts, but there could be deeper subtleties I'm not aware of.

  The purpose of the patch is to allow the "aligncomm" .drectve-section
command to parse the kind of non-C-language-family symbols emitted by the
gfortran compiler (and maybe other languages too), so that those languages can
use the PE aligned common extension.  The requirement for this support emerged
during testing of the GCC changes to enable this feature in the compiler;
being a backend change, it's there for all languages so we should aim to
support them all.

  The complication is caused by the presence of '.' in non-C symbols.  In .DEF
file syntax, the period is used primarily as a separator when specifying a
"fully-qualified" ex/import symbol in "MODULE-NAME.EXTERNAL-NAME" format.
Other 'unusual' characters, "$:-_?/@", are allowed in identifiers, but a
period delimits the ID token.

  We could almost but not quite use the dot_name production:

dot_name: ID
	| dot_name '.' ID
	;

... except that the character immediately after the period may be a digit,
which isn't allowed as the first character of an ID token and indeed forces
the lexer to produce a NUMBER token.  That is a second problem: we then get
only a numeric value for the digit(s), and so couldn't distinguish e.g.
"_symbol.1_" from "_symbol.001_".
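
  To illustrate the second problem, here is a minimal sketch (not code from
the patch) of what is lost when only the numeric value of a digit string
survives lexing.  The helper names digits_value and value_collides are
hypothetical, and base 10 is assumed purely for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: mimics a lexer that keeps only the numeric value
   of a run of digits and throws the original spelling away.  */
static unsigned long
digits_value (const char *digits)
{
  assert (digits != NULL);
  return strtoul (digits, NULL, 10);
}

/* Returns 1 when two digit strings are spelled differently but become
   indistinguishable once reduced to their numeric value.  */
static int
value_collides (const char *a, const char *b)
{
  return digits_value (a) == digits_value (b) && strcmp (a, b) != 0;
}
```

  Here value_collides ("1", "001") is true: both strings reduce to the value
1, so a parser that sees only the number cannot tell which spelling appeared
in the symbol.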

  So there are two changes in the attached patch.  First, the lexer now
returns a string of digits in verbatim char* form, as a DIGITS token, and
there is an elementary production from DIGITS to NUMBER (which is now a type,
not a token), effectively just hoisting the strtoul call out of the lexer and
into the grammar, but thereby exposing the raw DIGITS token string to rules
that want it.  Secondly, I added a production "anylang_id", to compose the
various tokens into which a non-C symbol will be broken down.
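
  The idea of keeping the raw digit text and gluing the pieces back together
can be sketched in plain C as follows.  This is an illustrative stand-in,
not the actual lexer or the anylang_id production from the patch: the token
kinds, next_tok and compose_symbol are all hypothetical names, and the ID
rule is simplified to "anything up to the next period":

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum tok_kind { TOK_ID, TOK_DIGITS, TOK_DOT, TOK_END };

/* Hypothetical scanner: copy the text of the next token into OUT,
   advance *P past it, and return the token kind.  Digits are returned
   verbatim rather than being converted to a number.  */
static enum tok_kind
next_tok (const char **p, char *out, size_t outsz)
{
  const char *s = *p;
  size_t n = 0;
  enum tok_kind kind;

  assert (outsz > 1);
  if (*s == '\0')
    {
      out[0] = '\0';
      return TOK_END;
    }
  if (*s == '.')
    {
      kind = TOK_DOT;
      out[n++] = *s++;
    }
  else if (isdigit ((unsigned char) *s))
    {
      kind = TOK_DIGITS;
      while (isdigit ((unsigned char) *s) && n + 1 < outsz)
        out[n++] = *s++;
    }
  else
    {
      /* Simplified ID: runs until the next period or end of input.  */
      kind = TOK_ID;
      while (*s != '\0' && *s != '.' && n + 1 < outsz)
        out[n++] = *s++;
    }
  out[n] = '\0';
  *p = s;
  return kind;
}

/* Rebuild the symbol by concatenating the token texts in order.
   Returns the number of tokens consumed, or -1 if OUT is too small.  */
static int
compose_symbol (const char *sym, char *out, size_t outsz)
{
  char piece[64];
  size_t used = 0;
  int count = 0;

  assert (outsz > 0);
  out[0] = '\0';
  while (next_tok (&sym, piece, sizeof piece) != TOK_END)
    {
      size_t len = strlen (piece);
      if (used + len + 1 > outsz)
        return -1;
      memcpy (out + used, piece, len + 1);
      used += len;
      count++;
    }
  return count;
}
```

  With this sketch, "_symbol.001_" breaks into four pieces (an ID, a period,
the digits "001" and a trailing ID) and is reassembled with the leading
zeros intact, which is the property the DIGITS token is meant to provide.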

  This doesn't yet allow the use of foreign symbols in IMPORT or EXPORT
directives; that's a whole other can of worms for another day.  But it
provides the infrastructure we'll need if and when we do decide to add
that support.

ld/ChangeLog

	* deffilep.y (%union):  Add new string-type semantic value 'digits'.
	(%token):  Remove NUMBER as token, add DIGITS.
	(%type):  Add NUMBER as type.  Add new id types anylang_id, opt_id.
	(ALIGNCOMM):  Parse an anylang_id instead of a plain ID.
	(anylang_id):  New production.
	(opt_digits):  Likewise.
	(opt_id):  Likewise.
	(NUMBER):  Likewise.
	(def_lex):  Return strings of digits in raw string form as DIGITS
	token, instead of converting to numeric integer type.

ld/testsuite/ChangeLog

	* ld-pe/non-c-lang-syms.c:  New dump test source file.
	* ld-pe/non-c-lang-syms.d:  New dump test pattern file.
	* ld-pe/pe.exp:  Run new "foreign symbol" test.

  Please shout if this isn't ok by all concerned!

    cheers,
      DaveK
-------------- next part --------------
A non-text attachment was scrubbed...
Name: deffile-commalign-parse-all-lang-syms.diff
Type: text/x-c
Size: 4118 bytes
Desc: not available
URL: <https://sourceware.org/pipermail/binutils/attachments/20090526/f871b9b1/attachment.bin>