Bug 10774 - Bogus documentation
Summary: Bogus documentation
Status: ASSIGNED
Alias: None
Product: binutils
Classification: Unclassified
Component: ld
Version: 2.19
Importance: P2 normal
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-14 10:04 UTC by Konrad Schwarz
Modified: 2015-09-16 08:54 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Description Konrad Schwarz 2009-10-14 10:04:18 UTC
Chapter 3.5.4, "Source Code Reference", of the ld Manual is so inaccurate and
inconsistent in its use of vocabulary with the rest of the manual that it should
be replaced.  A detailed critique is below; I suggest the following replacement:

3.5.4 Accessing Symbols defined in Linker Scripts in Source Code
----------------------------------------------------------------
The value of a symbol is its address.  Thus, to access a symbol's
value, declare it an external variable and use its address.

Note that in most cases, symbols defined by linker scripts do *not*
have any associated storage assigned to them, so it is typically an
error to read from or write to such an external variable!

For example, the Unix System V documentation traditionally
uses the following C declarations for the end of the text segment, the
end of the data segment, and the end of the BSS segment, which System V
marks with the symbols ``etext'', ``edata'', and ``end'':

extern etext;
extern edata;
extern end;

Note that these declarations implicitly use a type of ``int''.

One can choose the type most appropriate to the application, because type
checking is not done during link editing.  E.g., declaring such symbols
as incomplete arrays of const char enables the C compiler to diagnose writes,
reads (without array dereference) and use of the sizeof operator as errors:

extern char const end [];
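
A minimal sketch of such array-style declarations in use.  It assumes an
ELF target whose default linker script provides ``etext'', ``edata'' and
``end'' without an underscore prefix, as GNU ld does on GNU/Linux; the
helper names span, bss_size and bytes_after_text are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Defined by the linker, not by any C source file.  Because the types
   are arrays, the bare name yields the symbol's address.  */
extern char const etext [];
extern char const edata [];
extern char const end [];

/* Hypothetical helper: number of bytes between two link-time addresses.
   The addresses are handled as integers because the symbols do not
   belong to one C object.  */
static size_t
span (char const *from, char const *to)
{
  assert ((uintptr_t) from <= (uintptr_t) to);
  return (size_t) ((uintptr_t) to - (uintptr_t) from);
}

/* Bytes between the end of initialized data and the end of BSS.  */
size_t
bss_size (void)
{
  return span (edata, end);
}

/* Bytes between the end of text and the end of BSS.  */
size_t
bytes_after_text (void)
{
  return span (etext, end);
}
```

With these declarations ``sizeof end'' is rejected because the array type
is incomplete, and ``end[0] = 0'' is rejected because the elements are
const-qualified.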

Finally, note that some systems perform a
transformation between variable names as used in a high-level language and
symbol names as seen by the linker.  The transformation is part of the ABI.
E.g., a.out and COFF(?)-based systems prepend an underscore
to variable names to arrive at the symbol name---this is done to create
separate name spaces for high-level language modules and assembly language
modules.  Symbol names must take this transformation into account: e.g.,
the above symbols would be named ``_etext'', ``_edata'', and ``_end'' on
such systems.
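
Whether such a prefix is in effect can be checked from C.  The sketch
below assumes GCC or Clang, which predefine the macro
__USER_LABEL_PREFIX__ with the prefix the target adds to C names; other
compilers may not provide it, and the helper name user_label_prefix is
hypothetical:

```c
/* Two-step stringification so that the macro is expanded first.  */
#define STRINGIFY_1(x) #x
#define STRINGIFY(x) STRINGIFY_1 (x)

/* Returns the prefix prepended to C names to form symbol names:
   "" on typical ELF targets, "_" on targets that prepend an
   underscore.  Falls back to "" if the macro is not available.  */
const char *
user_label_prefix (void)
{
#ifdef __USER_LABEL_PREFIX__
  return STRINGIFY (__USER_LABEL_PREFIX__);
#else
  return "";
#endif
}
```

If the returned string is "_", a C declaration of ``end'' refers to the
linker symbol ``_end''.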

In C++, the ``extern "C"'' modifier can be used to suppress the additional
"mangling" of variable names done by that language.

CRITIQUE OF CURRENT TEXT

File: ld.info,  Node: Source Code Reference,  Prev: PROVIDE_HIDDEN,  Up: Assignments

3.5.4 Source Code Reference
---------------------------

Accessing a linker script defined variable  from source code is not
>>                                symbol
intuitive.  In particular a linker script symbol is not equivalent to a
variable declaration in a high level language, it is instead a symbol
that does not have a value.
>>                     ??? It has a value, it just might not have storage
>> associated with it.  This node's parent is titled "Assigning values to 
>> Symbols"!

   Before going further, it is important to note that compilers often
transform names in the source code into different names when they are
stored in the symbol table.  For example, Fortran compilers commonly
>>      That mangling is defined by the ABI should be mentioned
prepend or append an underscore, and C++ performs extensive `name
mangling'.  Therefore there might be a discrepancy between the name of
a variable as it is used in source code and the name of the same
variable as it is defined in a linker script.  For example in C a
linker script variable might be referred to as:

       extern int foo;

   But in the linker script it might be defined as:

       _foo = 1000;

   In the remaining examples however it is assumed that no name
transformation has taken place.

   When a symbol is declared in a high level language such as C, two
things happen.  The first is that the compiler reserves enough space in
the program's memory to hold the _value_ of the symbol.  The second is
>>                               data of the variable
that the compiler creates an entry in the program's symbol table which
>>       technically, for gcc, the assembler
>>                                        object file's
holds the symbol's _address_.  ie the symbol table contains the address
of the block of memory holding the symbol's value.  So for example the
following C declaration, at file scope:

       int foo = 1000;

   creates an entry called `foo' in the symbol table.  This entry holds
the address of an `int' sized block of memory where the number 1000 is
initially stored.

   When a program references a symbol the compiler generates code that
first accesses the symbol table to find the address of the symbol's
>>     Utter nonsense!
memory block and then code to read the value from that memory block.
So:

       foo = 1;

   looks up the symbol `foo' in the symbol table, gets the address
associated with this symbol and then writes the value 1 into that
address.  Whereas:

       int * a = & foo;

   looks up the symbol `foo' in the symbol table, gets its address and
then copies this address into the block of memory associated with the
variable `a'.

   Linker scripts symbol declarations, by contrast, create an entry in
the symbol table but do not assign any memory to them.  Thus they are
an address without a value.  So for example the linker script
>>  Again, this is completely at variance to how the rest of the manual
>> defines the "value" of a symbol, namely as its address for normal symbols
>> or [sic] its value for absolute symbols.
definition:

       foo = 1000;

   creates an entry in the symbol table called `foo' which holds the
address of memory location 1000, but nothing special is stored at
address 1000.  This means that you cannot access the _value_ of a
linker script defined symbol - it has no value - all you can do is
access the _address_ of a linker script defined symbol.
>> See above

   Hence when you are using a linker script defined symbol in source
code you should always take the address of the symbol, and never
attempt to use its value.  For example suppose you want to copy the
contents of a section of memory called .ROM into a section called
.FLASH and the linker script contains these declarations:

       start_of_ROM   = .ROM;
       end_of_ROM     = .ROM + sizeof (.ROM) - 1;
       start_of_FLASH = .FLASH;

   Then the C source code to perform the copy would be:

       extern char start_of_ROM, end_of_ROM, start_of_FLASH;
>> A better practice is to define these variables as char start_of_ROM [], etc.
>> This causes the compiler to complain if these variables are read from or
>> written to, e.g., if the address-of operator & is forgotten, as the author
>> describes below.

>> Furthermore, non-writable sections should be const qualified.

       memcpy (& start_of_FLASH, & start_of_ROM, & end_of_ROM - & start_of_ROM);

   Note the use of the `&' operators.  These are correct.
Comment 1 Nick Clifton 2015-09-15 17:29:52 UTC
Hi Konrad,

  Sorry for letting this PR languish for so long.

  I agree that the wording of the Source Code Reference section is wrong, but I think that your proposed replacement is a little bit too terse.  I particularly want to include the memcpy examples as it was precisely this piece of code that tripped up a customer and caused them to file a bogus bug report.  Thus please could you tell me if you are happy with the following replacement version instead:

Cheers
  Nick

3.5.5 Source Code Reference
---------------------------

The value of a symbol is its address.  Thus to access a symbol's value
from a high level language it should be declared as an external variable
and its address used.

   Note that in most cases, symbols defined by linker scripts do _not_
have any associated storage assigned to them, so it is typically an
error to read from or write to such an external variable.  For example,
suppose that a linker script defines some symbols like this:

       start_of_ROM   = .ROM;
       end_of_ROM     = .ROM + sizeof (.ROM);
       start_of_FLASH = .FLASH;

   Then the following code to copy data from ROM to FLASH will fail:

       extern char start_of_ROM, end_of_ROM, start_of_FLASH;
       memcpy (start_of_FLASH, start_of_ROM, end_of_ROM - start_of_ROM); /* FAIL */

   This is because it is reading from the symbols.  Instead the copy
should be written as:

       extern char start_of_ROM, end_of_ROM, start_of_FLASH;
       memcpy (& start_of_FLASH, & start_of_ROM, & end_of_ROM - & start_of_ROM);

   Note the use of the '&' operators - these are correct.  Or the copy
could be written as:

       extern char start_of_ROM[], end_of_ROM[], start_of_FLASH[];
       memcpy (start_of_FLASH, start_of_ROM, end_of_ROM - start_of_ROM);

   Which is easier to read and enables the C compiler to diagnose
writes, reads (without array dereference) and use of the sizeof operator
as errors.

   Type checking is not performed on linker symbols, so any type can be
used to reference them.  Note however that using the wrong type could
lead to runtime problems.  For example:

       extern int start_of_FLASH[];
       * start_of_FLASH = 1;

   This could result in a runtime failure if the start_of_FLASH symbol
is not assigned to an address that meets the alignment requirements of
the int data type.

   Finally, note that some systems perform a transformation between
variable names as used in a high-level language and symbol names as seen
by the linker.  The transformation can be an artefact of the high level
language - for example name mangling in C++, or it can be part of the
architecture's ABI - for example prepending an underscore to variable
names.  If a linker script symbol is to be accessed from a high level
language then this transformation must be taken into account.  For
example in C a linker script symbol might be referred to as:

       extern int foo[];

   But in the linker script it might need to be defined as:

       _foo = 1000;
Comment 2 Konrad Schwarz 2015-09-16 06:54:36 UTC
I believe ELF reserves section names beginning with a dot for its own use, so technically you shouldn't name your sections .FLASH or .ROM.

The compiler will issue a diagnostic for a write to a const-qualified object.
In your example, the symbols are not const qualified.  You would make more use of the type-checking facilities of the compiler with

extern char const start_of_FLASH [];

This makes the clause about the compiler being able to check writes true.

(You would then probably want to replace start_of_FLASH with start_of_ROM in the later example dealing with alignment.)

I prefer `&start_of_FLASH' to `& start_of_FLASH'; as I leave a space around binary operators `a & b', it makes it easier to distinguish between the unary and binary meanings of `&'.

In the final section, I would write "transformations".  There is one transformation from C++ symbol to C symbol names (mangling), and possibly a second transformation from C to assembler/linker names (e.g., underscore prepending).  Mangling is dealt with using `extern "C"'.  You could add that from C++, you need to declare `foo' as `extern "C" int foo [];'
Comment 3 Nick Clifton 2015-09-16 08:54:39 UTC
Hi Konrad,

   Thank you very much for your prompt response, and for persisting with this issue.

> I believe ELF reserves section names beginning with a dot for its own use,
> so technically you shouldn't name your sections .FLASH or .ROM.

Actually the ELF spec says:

  Section names with a dot (.) prefix are reserved for the system, 
  although applications may use these sections if their existing 
  meanings are satisfactory.  Applications may use names without 
  the prefix to avoid conflicts with system sections.

In the case of the .ROM and .FLASH examples, I think that it is fair to assume that they are system specified sections, rather than application created sections.  I.e., that they refer to ROM and flash regions of the target hardware's address space.

> The compiler will issue a diagnostic for a write to a const-qualified object.
> In your example, the symbols are not const qualified.  You would make more
> use of the type-checking facilities of the compiler with
> 
> extern char const start_of_FLASH [];

A good point.

> I prefer `&start_of_FLASH' to `& start_of_FLASH'; as I leave a space around
> binary operators `a & b', it makes it easier to distinguish between the
> unary and binary meanings of `&'.

Personally I like the spaces, but most people agree with you, so I will make this change.

> In the final section, I would write "transformations".  There is one
> transformation from C++ symbol to C symbol names (mangling), and possibly a
> second transformation from C to assembler/linker names (e.g., underscore
> prepending).  Mangling is dealt with `extern "C"'.  You could add that from
> C++, you need to declare `foo' as `extern "C" int foo [];'

Fair enough.  What do you think of this revised wording ?

Cheers
  Nick
------------------------------------------------------------------------
3.5.5 Source Code Reference
---------------------------

The value of a symbol is its address.  Thus to access a symbol's value
from a high level language it should be declared as an external variable
and its address used.

   Note that in most cases, symbols defined by linker scripts do _not_
have any associated storage assigned to them, so it is typically an
error to read from or write to such an external variable.  For example,
suppose that a linker script defines some symbols like this:

       start_of_ROM   = .ROM;
       end_of_ROM     = .ROM + sizeof (.ROM);
       start_of_FLASH = .FLASH;

   Then the following code to copy data from ROM to FLASH will fail:

       extern char start_of_ROM, end_of_ROM, start_of_FLASH;
       memcpy (start_of_FLASH, start_of_ROM, end_of_ROM - start_of_ROM); /* FAIL */

   This is because it is reading from and writing to the symbols.
Instead the copy should be written as:

       extern char start_of_ROM, end_of_ROM, start_of_FLASH;
       memcpy (&start_of_FLASH, &start_of_ROM, &end_of_ROM - &start_of_ROM);

   Note the use of the '&' operators - these are correct.  Or the copy
could be written as:

       extern const char start_of_ROM[], end_of_ROM[];
       extern char start_of_FLASH[];
       memcpy (start_of_FLASH, start_of_ROM, end_of_ROM - start_of_ROM);

   Which is easier to read and enables the C compiler to diagnose
writes, reads (without array dereference) and use of the sizeof operator
as errors.
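
The array form can be wrapped in a small helper.  This is a minimal
sketch, not part of the manual text; the name copy_region is
hypothetical.  The destination is declared without const because memcpy
writes through it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bytes from src_start up to, but not
   including, src_end into dst.  Every argument is an address; none of
   them is read as a value.  */
void *
copy_region (char *dst, const char *src_start, const char *src_end)
{
  assert ((uintptr_t) src_start <= (uintptr_t) src_end);
  return memcpy (dst, src_start, (size_t) (src_end - src_start));
}

/* With the linker script symbols from the example above, the call
   would be written as follows (not compiled here, since it needs
   that linker script):

     extern const char start_of_ROM[], end_of_ROM[];
     extern char start_of_FLASH[];
     copy_region (start_of_FLASH, start_of_ROM, end_of_ROM);  */
```

Because the parameters are pointers, passing a symbol declared as a
plain `char' without the '&' operator draws a compiler diagnostic
instead of silently reading from the symbol.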

   Type checking is not performed on linker symbols, so any type can be
used to reference them.  Note however that using the wrong type could
lead to runtime problems.  For example:

       extern int start_of_FLASH[];
       * start_of_FLASH = 1;

   This could result in a runtime failure if the start_of_FLASH symbol
is not assigned to an address that meets the alignment requirements of
the int data type.
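
The alignment can be checked at run time before the symbol is accessed
through an int type.  A minimal sketch, assuming a C11 compiler (for
_Alignof); the helper name is_aligned_for_int is hypothetical:

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if ADDR meets the alignment
   requirement of the int data type.  */
int
is_aligned_for_int (const void *addr)
{
  return (uintptr_t) addr % _Alignof (int) == 0;
}
```

A program could test is_aligned_for_int (start_of_FLASH) before
dereferencing the symbol as an int.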

   Finally, note that some systems perform transformations between
variable names as used in high-level languages and symbol names as seen
by the linker.  The transformations can be an artefact of the high level
language - for example name mangling in C++, but they can also be part
of the architecture's ABI - for example prepending an underscore to
variable names.  If a linker script symbol is to be accessed from a high
level language then these transformations must be taken into account.  For
example in C++ a linker script symbol might be referred to as:

      extern "C" const int foo[];

   (Note the use of '"C"' to prevent the C++ name mangling).  In the
linker script however the same symbol might have to be declared as:

       _foo = 1000;

   With a '_' prefix in order to match the ABI requirements.