[Mips] Using DT tags for handling local ifuncs

Mon Dec 23 21:07:00 GMT 2013

Look below for specific comments, but part of my fear is that I may stray into
what I wish we had done as opposed to what was actually done.

I want regions explicitly defined: tags in dynamic sections, for example.

I want section flags to mean something, SHF_MIPS_GPREL for example.

I want the linkers to be amazingly dull and not creative. This takes more work than
it may seem.

On 12/21/2013 03:33 AM, Richard Sandiford wrote:
> Jack Carter <Jack.Carter@imgtec.com> writes:
>> On 12/19/2013 04:35 PM, Richard Sandiford wrote:
>>> Jack Carter <Jack.Carter@imgtec.com> writes:
>>>>>> I also have a hard time with how the GOT is used for binutils. In my
>>>>>> experience and world view, sections have attributes that make them gp
>>>>>> relative or not. All these sections get gathered in gp relative
>>>>>> regions that are 64k from a value that will be in their $GP. If there
>>>>>> are GOT elements that are not gp relative, they should be in another
>>>>>> .got that is not marked SHF_MIPS_GPREL. It will not get laid out and
>>>>>> calibrated with any of the other GOTs.  Other sections in my life that
>>>>>> get bundled up in the equation for multigot are .sbss, .sdata,
>>>>>> .lit[4,8,16], .srdata, but only if they are marked SHF_MIPS_GPREL.
>>>>>
>>>>> Just so I understand, do you think that the ABI GOT should always be 64k
>>>>> or smaller?  I.e. DT_MIPS_LOCAL_GOTNO + (DT_MIPS_SYMTABNO - DT_MIPS_GOTSYM)
>>>>> should be <= 64 * 1024 / sizeof (void *)?  If so, what should happen
>>>>> (under the original or IRIX n32/n64 ABIs) if the number of symbols
>>>>> involved in .rel.dyn relocations exceeds the 64k limit?  Is that a
>>>>> link error?
>>>>>
>>>> Yes, because in sgi's case you count all the SHF_MIPS_GPREL sections as
>>>> the GP area. .got is only one of them and sgi just put gp-relative
>>>> entries in it.
>>>
>>> But why then do you think the R_MIPS_GOTHI16/R_MIPS_GOTLO16 relocs
>>> and R_MIPS_CALLHI16/R_MIPS_CALLLO16 relocs were defined?  (They were
>>> part of the original ABI.)  If the intention really was to limit the
>>> ABI GOT to 64k I don't think these "xgot" relocs would be needed.
>>
>> I believe, remember this is religious, that it was the first attempt
>> to solve the large GP region problem. If we had our multi-got working
>> I don't think xgot would have seen the light of day. Multi-got was
>> invisible to the general user and had no runtime down sides beyond
>> more support sections and startup explicit relocations.
> 
> OK, I thought it might be something like that.  But IIRC the SGI tools
> did continue to support xgot alongside multigot (including for o32,
> which IIRC didn't have multigot retrofitted, at least not on the
> version of IRIX we were using).  The GNU tools support both too.

SGI had to support xgot. Once you let the genie out of the bottle, it is hard to
put it back in. There was no good reason to use it though because after startup
multigot is much faster. We had one customer that autogenerated code and was
forced to use xgot because of a single huge .c file. This is an outlier though.

> 
> I don't think it makes sense to say that a GOT entry must be within the
> 64k region if all GOT accesses use a {GOT,CALL}HI16/LO16 pair, since it
> defeats the point of having the HI16 and LO16 relocs.  And IMO that
> includes the specific case of "all" evaluating to zero, i.e. those cases
> where we only have a GOT entry for the sake of dynamic R_MIPS_REL32 relocs.

You are correct in that it is not forbidden, just not wise. If you are computing whether
or not to break into multiple gots you want to deal with just the gp relative data, not
a lot of other things. The other things can go into non-gp relative sections just fine.
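
The size question Richard raises above can be written out as a small check. This is only a sketch: the function name and parameters are hypothetical, and it assumes the ABI GOT holds DT_MIPS_LOCAL_GOTNO entries plus one entry per dynamic symbol from DT_MIPS_GOTSYM up to DT_MIPS_SYMTABNO, as in his formula.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: does the ABI GOT fit in a 64k gp-relative window?
   local_gotno - value of DT_MIPS_LOCAL_GOTNO
   symtabno    - value of DT_MIPS_SYMTABNO
   gotsym      - value of DT_MIPS_GOTSYM
   entry_size  - size of one GOT entry in bytes (4 for 32-bit, 8 for n64) */
bool abi_got_fits_in_64k(size_t local_gotno, size_t symtabno,
                         size_t gotsym, size_t entry_size)
{
    size_t entries = local_gotno + (symtabno - gotsym);
    return entries <= (64 * 1024) / entry_size;
}
```

With 4-byte entries the limit works out to 16384 entries, and with 8-byte entries to 8192.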

Here is a quote from the last paragraph of the 64-bit ELF Object File Specification:
"Observe that it is acceptable to allocate non-GOT data at gp-relative addresses,
although the current 32-bit system does not do so. Such data (e.g. the .sdata, .sbss and
.litX sections) should be allocated first in the global data area, since its reason for being
here is normally to achieve short-offset addressing."

So, I guess my point is that non-gp-relative GOT entries, although legal, don't fit the model
of being in a region (which includes gp-relative data other than GOT) that is supposed to be
GP relative.

> 
>>>>>> The DT_MIPS_LOCAL_GOTNO describes local got entries. Not other
>>>>>> partitions that we reserve the right to put non-local got entries.
>>>>>
>>>>> I'm still not sure which part you're describing as the local GOT here.
>>>>> Let's go back to the original 32-bit GOT layout, without any GNU extensions:
>>>>>
>>>>>         +------------+   +    <--- DT_PLTGOT
>>>>>         |   entry 0  |   |
>>>>>         +------------+ + B
>>>>>         |  ........  | A |
>>>>>         +------------+ + +    <--- DT_PLTGOT + DT_MIPS_LOCAL_GOTNO * 4
>>>>>         | Global GOT |
>>>>>         +------------+
>>>>>
>>>>> where:
>>>>>
>>>>>     The zero entry in the global offset table is reserved to hold the
>>>>>     address of the entry point in the dynamic linker to call when lazy
>>>>>     resolving text symbols. The dynamic linker must always initialize this
>>>>>     entry regardless of whether lazy binding is or is not enabled.
>>>>>
>>>>> Do you see the local GOT as being A or B?  I.e. does it include
>>>>> the zero entry?
>>>>
>>>> It is by definition A and B,
>>
>> Entry[0] is a cheat, mistake, act of carelessness in my humble
>> opinion. Not what it does, but the fact that it was allowed to be part
>> of the local got region.  It should have been explicitly pointed to in
>> the DT table and the local region start pointed to.
>>
>> It is an oversight and an exception that the lawyers can use to
>> further encroach on the local got region.
>>
>> In my view of the object format world, dynamic areas need to be
>> explicitly called out. This is a dangerous region to be working on
>> heuristics and exceptions. Currently ld.so assumes the local region is
>> DT_MIPS_LOCAL_GOTNO long and starts at PLTGOT.  But wait, we have a
>> special entry so we up the loop counter and maybe we will discover
>> that there is another exception for slot #2 and we up the counter
>> again. Then and only then we run the loop to fix up locals.
> 
> Right.  And IMO this means that the local area is effectively A, since
> like you say A is the only part that gets relocated as a local area
> and is the only part that contains local addresses.  So I don't think
> we should get too hung up on the name "DT_MIPS_LOCAL_GOTNO".  IMO it
> was always misleading.
> 
> Regardless of what it was originally supposed to mean, it is actually
> "the GOT index at which the local area ends and the global area starts"
> or "the number of GOT entries before the global area" (the two being
> equivalent of course).

Yes
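
To make the "end index" reading concrete, here is a rough sketch of the local fixup loop discussed above. It is not taken from any real ld.so; the function name, the explicit count of reserved entries, and the load-bias parameter are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: relocate the local part of a MIPS GOT.
   got         - address of the GOT (the value of DT_PLTGOT)
   local_gotno - value of DT_MIPS_LOCAL_GOTNO, read here as the index
                 at which the global area starts
   reserved    - number of leading reserved entries (assumed known)
   load_bias   - run-time address minus link-time address */
void relocate_local_got(uintptr_t *got, size_t local_gotno,
                        size_t reserved, uintptr_t load_bias)
{
    /* Only entries in [reserved, local_gotno) hold local addresses. */
    for (size_t i = reserved; i < local_gotno; i++)
        got[i] += load_bias;
}
```

The reserved entries and everything from index local_gotno onward are left untouched, which is the sense in which only region A is relocated as a local area.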

> 
>>> But it was an either-or choice. :-)  Does it include entry 0 or not?
>>> If yes, it's B.  If no, it's A.
>>>
>>>> here is the quote from the pre-sgi System V
>>>> Application Binary Interface Mips Processor Supplement:
>>>>
>>>> Global Offset Table (5-9, second paragraph)
>>>> "The global offset tables split into two locally separate subtables:
>>>> local and
>>>> externals. Local entries reside in the first part of the global offset
>>>> table. The
>>>> value of the dynamic tag DT_MIPS_LOCAL_GOTNO holds the number of
>>>> local global offset table entries."
>>>
>>> To me this suggests B if taken at face value.
>>
>> No, the reality is that there should be a pointer to the beginning of
>> the local got region and DT_MIPS_LOCAL_GOTNO represent its size.
> 
> Well, for delimiting an area we can either use "start and size" or
> "start and end".  Since DT_MIPS_LOCAL_GOTNO is effectively the end
> of the local area -- despite the "NO" -- we can keep backward
> compatibility by seeing it as an end rather than a size.
> 
> But I agree completely about having an explicit start for the local area.
> That's what the new tag I was suggesting was.  So going back to the new
> GOT region, I was really thinking about the current:
> 
>     +------------------+
>     | reserved entries |
>     +------------------+
>     |   local entries  |
>     +------------------+  <-- T2
>     |  global entries  |
>     +------------------+
> 
> becoming (with a new name for the new region):
> 
>     +------------------+
>     | reserved entries |
>     +------------------+
>     | general GOT data |
>     +------------------+  <-- T1
>     |   local entries  |
>     +------------------+  <-- T2
>     |  global entries  |
>     +------------------+
> 
> It's entirely up to the static linker what goes in the new region.
> In our case it would be R_MIPS_IRELATIVE-relocated entries, but it could
> be anything really (including .lit4, .lit8, or whatever).  I.e. this region
> would be handled like GOTs are on other targets.
> 
> T2 is currently called DT_MIPS_LOCAL_GOTNO, but if we had:
> 
> T1: DT_MIPS_LOCAL_GOTIDX
> T2: DT_MIPS_GLOBAL_GOTIDX
> 
> with:
> 
> #define DT_MIPS_GLOBAL_GOTIDX DT_MIPS_LOCAL_GOTNO
> 
> then would it be more acceptable namewise?  We could throw in a GOTIDX
> tag for the new region too for completeness.
> 

I like this description
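
As a sketch of how a tool might consume the proposed layout, the fragment below classifies a GOT index by region. The struct, the enum, and the `general_idx` field (standing in for the optional GOTIDX tag for the new region) are hypothetical names, not part of any existing ABI header.

```c
#include <stddef.h>

enum got_region { GOT_RESERVED, GOT_GENERAL, GOT_LOCAL, GOT_GLOBAL };

/* Indices delimiting the regions in the proposed layout. */
struct got_layout {
    size_t general_idx; /* hypothetical tag: start of the new general region */
    size_t local_idx;   /* T1: DT_MIPS_LOCAL_GOTIDX in the proposal */
    size_t global_idx;  /* T2: DT_MIPS_GLOBAL_GOTIDX, i.e. DT_MIPS_LOCAL_GOTNO */
};

/* Return the region that GOT entry idx falls into. */
enum got_region classify_got_index(const struct got_layout *l, size_t idx)
{
    if (idx < l->general_idx)
        return GOT_RESERVED;
    if (idx < l->local_idx)
        return GOT_GENERAL;
    if (idx < l->global_idx)
        return GOT_LOCAL;
    return GOT_GLOBAL;
}
```

An object without the new region would simply have general_idx equal to local_idx, so existing layouts remain describable.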

>>>> For entertainment's sake, here is the comment I wrote back then in my private ELF dumper:
>>>>
>>>> /**
>>>>       @internal
>>>>
>>>>       Function:	mips_print_got
>>>>
>>>>       MIPS has 2 different GOT table variants that are
>>>>       pretty much the same except one depends on symbol
>>>>       table to got table symmetry for runtime fixup purposes
>>>>       and the other uses runtime relocations.
>>>>       
>>>>       If there is multigot there will be entries in the first dynamic section
>>>>       of type DT_MIPS_AUX_DYNAMIC which point to the other
>>>>       dynamic sections which in turn point to and describe their
>>>>       associated gots.
>>>>       
>>>>       DT_MIPS_LOCAL_GOTNO     	Starting point for DEFAULT symbols
>>>>       DT_MIPS_GOTSYM  	    	Index into dsymtab matching DT_MIPS_LOCAL_GOTNO
>>>>       DT_MIPS_HIPAGENO		Number of page table entries.
>>>>       DT_MIPS_LOCALPAGE_GOTIDX	Starting point for a local got page table
>>>>       DT_MIPS_LOCAL_GOTIDX    	Starting point for local full addresses
>>>>       DT_MIPS_HIDDEN_GOTIDX   	Starting point for HIDDEN symbols
>>>>       DT_MIPS_PROTECTED_GOTIDX	Starting point for PROTECTED symbols
>>>>
>>>>       If DT_MIPS_LOCAL_GOTIDX == DT_HIDDEN_GOT_IDX ||
>>>>       	    	    	       DT_PROTECTED_GOT_IDX ||
>>>> 			       DT_MIPS_LOCAL_GOTNO
>>>>       then there are no local entries. Local in this sense
>>>>       means addresses that may or may not have associated
>>>>       entries in the symbol table or relocation table. If
>>>>       they are present in the symbol table they will be marked
>>>>       as STO_INTERNAL and must not be referenced outside of the
>>>>       defining dso/a.out in any form.
>>>>
>>>>       If DT_HIDDEN_GOT_IDX == DT_PROTECTED_GOT_IDX ||
>>>>       	    	    	    DT_MIPS_LOCAL_GOTNO
>>>>       then there are no hidden entries. Hidden symbols
>>>>       are those that are marked STO_HIDDEN in the dynamic
>>>>       symbol table and are accessible from outside the defining
>>>>       dso only non-symbolically such as through pointers.
>>>>
>>>>
>>>>       If DT_PROTECTED_GOT_IDX == DT_MIPS_LOCAL_GOTNO
>>>>       then there are no protected entries. Protected symbols
>>>>       are those that are marked STO_PROTECTED in the dynamic
>>>>       symbol table and are accessible from the outside, but
>>>>       cannot be preempted during runtime loading and thus are
>>>>       "protected".
>>>>       
>>>>       @return  void.
>>>>    */
>>>>
>>>> Note, for multigot this resulted in multiple dynamic sections, dynsyms and
>>>> relocation fixups for the got entries.
>>>
>>> Did it also result in multiple relocation tables, one for each .dynamic
>>> section?  Or was there still a single .rel.dyn table?
>>>
>>> If just a single .rel.dyn table, did all relocations in the table use
>>> the primary GOT's DT_MIPS_GOTSYM as the local/global threshold?  If so,
>>> did that mean that there was no specific limit to the number of distinct
>>> global symbols that could be stored in GOT entries (thanks to multigot),
>>> but that there was a limit of 16k (or 8k for n64) global symbols that
>>> could be used in relocations?  (Sorry for the barrage of questions --
>>> the downside of doing this by email.)
>>>
>>
>> I may not understand the question, but will try to answer.
>> Let's pretend we had a case where the linker broke up a dso it was making
>> into having 3 gp-relative regions (multigot). Each region would have its own
>> .dynamic table pointing to its own unique dynsym, got, sdata, sbss, etc. By
>> basic ELF format definition, if any of these sections need relocations they
>> will have their own unique relocation sections.
>>
>> I know, .dynrel has sort of stretched this definition, but we keep to
>> the current rule by having the dynamic table for the individual got
>> describe where its relocations are and how they are distributed.
>>
>> The limit on symbol indexes is preserved because we are only looking at
>> a sub-region.
>>
>> I guess the key is that each got/gp-rel region has its own individual
>> .dynsym that describes its microcosm independent of the others. The
>> main .dynamic section points to all the extra .dynamic sections
>> through DT tags.
> 
> I was more thinking about a DSO containing something like:
> 
> 	.data
> 	.macro	doit
> 	.word	foo\@
> 	.endm
> 	.rept	20000
> 	doit
> 	.endr
> 
> i.e.:
> 
> 	.data
> 	.word   foo0
>          ...
> 	.word   foo19999
> 
> where we have 20000 R_MIPS_REL32s against various foos and therefore need
> 20000 GOT entries.
> 
> Is this allowed on its own, without explicit GOT references to the foos?
> If it is allowed, do you create 2 GOTs to handle it, so that each GOT is
> still within the 64k limit?  If so, do the two .dynamic sections both
> have their own .rel.dyn sections, each containing the R_MIPS_REL32s for
> the symbols in the associated GOT?

Based on my stated view of the world, which may well be skewed: These symbols
would not end up in the gp-relative GOT. If there is a table you need to look them
up in, it would be different from the gp-relative one and not clutter the gp-rel equation.

That said, I have no idea how sgi dealt with the above. I should, but don't remember.
Possibly they stuffed them into the GOT as well. I know we didn't have 2 main GOTs
so I would have to expect they took up space in the one GOT (sounds religious).

It may well be that they didn't worry about the extra entries because the compiler
would just produce gp-relative references for externs.

> 
> Does the answer change if, in addition to the above, there are also
> explicit GOT references to each foo, as in:
> 
> a.s:
>          lw	$4,%got(foo0)($gp)
>          ...
>          lw	$4,%got(foo9999)($gp)
> 
> b.s:
>          lw	$4,%got(foo10000)($gp)
>          ...
>          lw	$4,%got(foo19999)($gp)
> 
> so that the symbols fall naturally into two GOTs?  Would b.s's GOT
> then have the .data relocations for foo10000 and above and a.s's GOT
> have the .data relocations for the rest?

The static linker should only use one GOT entry for a given symbol for the
case where we don't have multi-got. There is no need to duplicate.

> 
> If we did have multiple .rel.dyn sections, then:
> 

But if we did, yes, this would be a mess with the current setup. Instead of multiple
.dynrel sections, I would deal with it with DT tags and use the same .dynrel section.

This would need more thought.

>>> If there were multiple .rel.dyn tables, each tied to their own
>>> .dynamic sections, how would we sort them so that all IRELATIVE
>>> relocations in an object are applied after all non-IRELATIVE ones?
> 
> ...this would become a concern.

Actually, after some thought, it happens in the same order as with a single GOT.
Go through everything in the same order as now. If IFUNC is last then each
gp-region's GOT would get processed at the same time after all the other info
is processed.
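
One way to picture this ordering is a two-pass walk over every region's relocations, with IRELATIVE entries deferred to the second pass. This is a sketch under assumptions: the `reloc_kind` enum, the structs, and the function are illustrative only, and the code records the order of application rather than applying anything.

```c
#include <stddef.h>

/* Illustrative classification; not real relocation numbers. */
enum reloc_kind { RELOC_NORMAL, RELOC_IRELATIVE };

struct reloc  { enum reloc_kind kind; int id; };
struct region { const struct reloc *relocs; size_t count; };

/* Write reloc ids to out[] in the order they would be applied:
   all non-IRELATIVE relocs of every region first, then all
   IRELATIVE relocs. Returns the number of ids written. */
size_t apply_order(const struct region *regions, size_t nregions, int *out)
{
    size_t n = 0;
    for (int pass = 0; pass < 2; pass++) {
        enum reloc_kind want = (pass == 0) ? RELOC_NORMAL : RELOC_IRELATIVE;
        for (size_t r = 0; r < nregions; r++)
            for (size_t i = 0; i < regions[r].count; i++)
                if (regions[r].relocs[i].kind == want)
                    out[n++] = regions[r].relocs[i].id;
    }
    return n;
}
```

The point of the sketch is only that the per-region split does not change the global rule: nothing of kind IRELATIVE is touched until every region's ordinary relocations are done.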

> 
>>>> I am not proposing that we go down this route, but it may give a sense of
>>>> the world I came from. I liked it because (other than that I designed
>>>> a lot of
>>>> it :-)) of the structure in symbol visibility and that I could dump
>>>> the entries
>>>> symbolically. Also, each GP region was described by its dynamic section.
>>>>
>>>> This is not a trivial change and goes beyond the ifunc scope, but it
>>>> would resolve
>>>> the fixup by relocation issues and usher in GP rel areas that go
>>>> beyond the GOT.
>>>>
>>>> I really just want to get ifunc done without messing up future
>>>> goodness in ld/ld.so.
>>>
>>> OK, this scheme seems to create multiple .dynsyms as a way of avoiding
>>> explicit relocations for the multigot entries.  Is that right?
>>> I.e. rather than have a .rel.dyn entry for a multigot global GOT entry,
>>> it has an entry in a secondary .dynsym instead?
>>
>> Right. No defence. It is a cost of doing business like this. It may be
>> too expensive for some but not if they are sane and forgo building
>> programs that require multigot. My guess is that the ones that need
>> multigot are not afraid of this overhead. I like to guess and am wrong
>> only about 80% of the time.
>>
>> Yes, every GP region had its own gp relative sections and support
>> sections including .dynamic, dynsym, relocations, etc. They all shared
>> the same string table though.
>>>
>>> Does that really pay off though?  In ELF32, symbols are 16 bytes in size
>>> but REL relocations are 8 bytes in size.  And because the global GOT
>>> acts as a cache, resolving normal global relocations is very cheap.
>>> We only look up the symbol once, when resolving the GOT entry.
>>>
>>> (If the same global symbol appeared in two GOTs and .dynsyms, did you
>>> look it up twice, or just once?  If twice then the .rel.dyn approach
>>> seems to win there too, as well as on size.)
>>
>> This (sgi multigot) does not win on the size of the collective dynamic
>> sections.  There is duplication. It is a start up hit that needs to be
>> evaluated before anyone wants to emulate it.
>>
>> Remember, one can build the dynamic ld.so affected part of the object
>> very close to how we do today if everything falls into a single
>> got. If it goes over the threshold one would start to get this
>> overhead, but the duplication part will not be that big because the
>> second got, and in reality it will be only a second got, will probably
>> be a very small subset of the first got and thus few symbol and
>> relocation dups as well as the duplications of gprel data sections.
> 
> I was really comparing the cost of this multigot scheme with the one
> that was used for binutils.  (Note that I had no part in the binutils
> multigot scheme, so I don't have an attachment either way.)  There the
> idea was to treat the secondary GOTs as just another bit of data and
> relocate them in the same way as you would relocate a data section.
> This is of course how other targets handle their primary GOT too.
> 
> It sounds like this could also be the second multigot variant from the
> comment you quoted:
> 
>      MIPS has 2 different GOT table variants that are
>      pretty much the same except one depends on symbol
>      table to got table symmetry for runtime fixup purposes
>      and the other uses runtime relocations.
> 
> So this might not even be an SGI vs. binutils thing, but I'll call
> them that below for the sake of simplicity.
> 
> I think the differences work out as:
> 
> * The SGI scheme relies on changes to the dynamic linker.
>    The binutils scheme works within the original ABI (assuming that
>    the primary GOT is allowed to be bigger than 64k, as above).
> 
>    I think the binutils scheme even worked on o32 IRIX, although I might
>    have made that up.
> 
> * The SGI scheme uses tags to relocate the local part of the GOT.
>    The binutils scheme uses 8-byte .rel.dyn entries instead.
> 
>    So for this part of the GOT the SGI scheme wins, at least for
>    large numbers of local GOT entries.  But if you create .dynsyms
>    for local, internal and hidden symbols -- which binutils currently
>    treats as local -- then the local part is going to be very small.
>    It probably just contains page entries.
> 
> * The SGI scheme uses 16-byte .dynsym entries for each GOT entry
>    that's bound to a symbol.  The binutils scheme uses 8-byte .rel.dyn
>    entries instead.
> 
>    Who wins here depends on how many duplicate .dynsym entries there are.
>    If the GOTs have several symbols in common (which seems likely) then
>    the binutils scheme should win from both a size and speed perspective,
>    since only one lookup is needed per symbol, regardless of how many
>    GOTs reference it.
> 
>    Using .dynsyms for local, internal and hidden symbols adds 8 bytes
>    per entry over the binutils scheme, on top of the string table cost.
> 
> * The SGI scheme allows lazy binding in secondary GOTs.  The binutils
>    scheme doesn't.  This is definitely the big disadvantage of the
>    binutils scheme.  (One that no-one's ever been sufficiently motivated
>    to fix, unfortunately.)
> 
> * The SGI scheme requires several .dynamics and several .dynsyms,
>    which is likely to confuse generic ELF code.  The binutils scheme
>    avoids this.
> 
> * The SGI scheme allows you to dump the secondary GOTs in the same
>    way as the primary GOT.  The binutils scheme doesn't.
> 
> Does that sound right to you?

Yes, sounds correct.
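
The per-entry size argument in the list above can be sketched with the ELF32 record sizes. The structs below are local mirrors of the ELF32 symbol and REL layouts so the fragment does not depend on a system header; the two cost functions are hypothetical and deliberately ignore the string table, duplicate lookups, and tag-based local relocation.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the ELF32 symbol record (16 bytes). */
struct sym32 {
    uint32_t st_name;
    uint32_t st_value;
    uint32_t st_size;
    unsigned char st_info;
    unsigned char st_other;
    uint16_t st_shndx;
};

/* Local mirror of the ELF32 REL record (8 bytes). */
struct rel32 {
    uint32_t r_offset;
    uint32_t r_info;
};

/* Bytes of metadata for n symbol-bound secondary GOT entries when each
   entry is described by its own .dynsym record (the SGI-style scheme). */
size_t sgi_secondary_cost(size_t n) { return n * sizeof(struct sym32); }

/* Same, when each entry is described by a .rel.dyn record
   (the binutils-style scheme). */
size_t binutils_secondary_cost(size_t n) { return n * sizeof(struct rel32); }
```

For the same number of symbol-bound entries the symbol-based description is twice the size, which is the 8-bytes-per-entry difference mentioned above.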

> 
>>> I agree that in the specific case of ifuncs it would probably work
>>> to do things this way, since for ifuncs the type of GOT entry needed
>>> can be determined from the symbol type (IFUNC rather than FUNC).
>>> But it wouldn't extend well to other types of relocation.  E.g.
>>> TLS GOT entries can't be implied from the symbol type in this way.
>>> It might be that the next relocation type we add also has no associated
>>> symbol type.  (The type is only a 4-bit field after all, and most are
>>> already taken.)
>>
>> I wouldn't put them in this got. I would create another one that was
>> not GP relative.  It would not be part of the multigot party. We would
>> have to have DT rules (maybe) for this got as well if there was
>> special handling beyond explicit relocations.
> 
> Sorry, I meant new types of GOT relocation that would be needed in future.
> I.e. cases where we add a new R_MIPS_FOO relocation and also want to be
> able to do something like:
> 
>          lw	$4, %got_foo($gp)
> 
> with %got_foo resolving to the offset of an R_MIPS_FOO-equivalent GOT entry.
> 
> Thanks,
> Richard
> 

Cheers and happy holidays,

Jack