Bug 19434

Summary:	invalid character in attribute value
Product:	libabigail	Reporter:	Ben Woodard <woodard>
Component:	default	Assignee:	Dodji Seketeli <dodji>
Status:	RESOLVED FIXED
Severity:	normal	CC:	libabigail
Priority:	P2
Version:	unspecified
Target Milestone:	---
Host:		Target:
Build:		Last reconfirmed:
Attachments:	reproducing elf file

Description Ben Woodard 2016-01-06 21:29:11 UTC

bash-4.1$ ~/bin/abidw --abidiff /collab/usr/global/tools/totalview/r/toolworks/totalview.8.12.0-1/linux-x86-64/bin/tvdsvrmain_mic 
/tmp/libabigail-tmp-file-HC4EVK:21019: parser error : invalid character in attribute value
      <parameter type-id='type-id-481' name='$5'/>
                                              ^
/tmp/libabigail-tmp-file-HC4EVK:21019: parser error : attributes construct error
      <parameter type-id='type-id-481' name='$5'/>
                                              ^
/tmp/libabigail-tmp-file-HC4EVK:21019: parser error : Couldn't find end of Start Tag parameter
      <parameter type-id='type-id-481' name='$5'/>
                                              ^
Could not read temporary XML representation of elf file back

This looks like it is a new one.

Comment 1 Ben Woodard 2016-01-06 21:30:56 UTC

Created attachment 8886
reproducing elf file

Comment 2 Ben Woodard 2016-01-06 21:31:40 UTC

This was with 1.0 RC1 from the git tree.

Comment 3 Dodji Seketeli 2016-01-18 17:27:02 UTC

This is caused by function parameter names that contain ASCII *control* characters.  I am not sure why this would happen.  Maybe the source code file was encoded in something that is not proper ASCII?  Unfortunately, I am not aware of any way to detect the encoding of the source file from the DWARF information, so I am assuming it should be ASCII.

The fix involves detecting characters that are not simple ASCII identifier characters in parameter names.  If any are found, the parameter name is dropped on the floor.
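
As a rough illustration of the approach described above (this is not the code from the actual commit), a check of this kind could look like the sketch below. The helper names is_simple_ascii_identifier and sanitize_parameter_name are hypothetical, and the accepted character set (ASCII letters, digits and underscore) is an assumption made for the example.

```cpp
#include <string>

// Hypothetical helper, not libabigail's actual code.
// Returns true if NAME is non-empty and every byte is an ASCII
// letter, an ASCII digit, or an underscore (assumed character set).
bool is_simple_ascii_identifier(const std::string& name)
{
  if (name.empty())
    return false;
  for (unsigned char c : name)
    {
      const bool ok = (c >= 'a' && c <= 'z')
                      || (c >= 'A' && c <= 'Z')
                      || (c >= '0' && c <= '9')
                      || c == '_';
      if (!ok)
        return false;
    }
  return true;
}

// Hypothetical helper: keep NAME if it is a simple identifier,
// otherwise drop it by returning an empty string.
std::string sanitize_parameter_name(const std::string& name)
{
  return is_simple_ascii_identifier(name) ? name : std::string();
}
```

With this sketch, a name such as "count" is kept, while a name containing a control byte (for example "a\001b") comes back empty, so no name attribute value with unexpected bytes would reach the XML writer.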

The fix has landed into the master branch at https://sourceware.org/git/?p=libabigail.git;a=commit;h=c3869ecc7bbd6f8370ca29446afdcc1d2631e33d.

Comment 4 Ben Woodard 2016-01-18 19:08:51 UTC

Is dropping the name on the floor the best thing to do? Wouldn't it be better to encode the non-ASCII parameter name into 7-bit-clean ASCII, sort of like uuencode does?
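
To make the suggestion concrete, here is a minimal sketch of such an encoding. It is not uuencode itself but a simpler hex escape, chosen only for illustration: every byte outside printable ASCII (and the backslash itself, so the result stays unambiguous) is written as \xNN. The function name escape_to_7bit_ascii is hypothetical.

```cpp
#include <string>

// Hypothetical helper: returns a copy of NAME in which every byte
// outside the printable ASCII range 0x20..0x7e, as well as the
// backslash, is replaced by a four-character "\xNN" hex escape.
std::string escape_to_7bit_ascii(const std::string& name)
{
  static const char hex[] = "0123456789abcdef";
  std::string out;
  for (unsigned char c : name)
    {
      if (c >= 0x20 && c <= 0x7e && c != '\\')
        out += static_cast<char>(c);
      else
        {
          out += "\\x";
          out += hex[c >> 4];    // high nibble
          out += hex[c & 0x0f];  // low nibble
        }
    }
  return out;
}
```

The output contains only printable 7-bit ASCII, but characters such as &, < and quotes would still need the usual XML escaping before being written as an attribute value.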

Comment 5 Dodji Seketeli 2016-01-19 09:31:37 UTC

> Is dropping the name on the floor the best thing to do? Wouldn't it be
> better to encode the non-ASCII parameter name into 7-bit-clean ASCII,
> sort of like uuencode does?

For now, we don't use the parameter name anyway.  In change reports,
function parameters are referred to using their position.

Furthermore, since we don't know the actual encoding of the
characters, and we are sure they are not ASCII (which is the case here),
I don't think encoding each of the byte values would give us
any usable information.  It would be just as if we had garbage: we
couldn't show anything useful to the user anyway.  Hence
my inclination to drop the name altogether.

But if one day we know the actual encoding of the parameter names, then
we can decode them.  At that point we'll change the code again and avoid
dropping the name if it's not ASCII.  If it's, say, UTF-8, then we'll be
able to decode the byte stream, knowing that it's a UTF-8 stream.
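
For illustration only, if the names were known to be UTF-8, a first step could be a well-formedness check like the sketch below. The function name is_valid_utf8 is hypothetical and nothing here is taken from libabigail; the check rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if S is a well-formed UTF-8
// byte sequence.
bool is_valid_utf8(const std::string& s)
{
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n)
    {
      const unsigned char c = static_cast<unsigned char>(s[i]);
      if (c < 0x80)
        {
          ++i;  // plain ASCII byte
          continue;
        }

      std::size_t extra;    // number of continuation bytes expected
      unsigned long cp;     // code point being assembled
      unsigned long min;    // smallest code point allowed for this length
      if ((c & 0xe0) == 0xc0)      { extra = 1; cp = c & 0x1f; min = 0x80; }
      else if ((c & 0xf0) == 0xe0) { extra = 2; cp = c & 0x0f; min = 0x800; }
      else if ((c & 0xf8) == 0xf0) { extra = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;  // stray continuation byte or invalid lead byte

      if (n - i <= extra)
        return false;  // sequence is truncated

      for (std::size_t k = 1; k <= extra; ++k)
        {
          const unsigned char cc = static_cast<unsigned char>(s[i + k]);
          if ((cc & 0xc0) != 0x80)
            return false;  // not a continuation byte
          cp = (cp << 6) | (cc & 0x3f);
        }

      // Reject overlong forms, surrogates and out-of-range values.
      if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return false;

      i += extra + 1;
    }
  return true;
}
```

Note that ASCII control characters are valid UTF-8, so a check like this alone would not filter out the names that triggered this bug.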