31964 – Add directive for more efficient encoding of binary data

Status:	RESOLVED FIXED

Alias:	None

Product:	binutils
Classification:	Unclassified
Component:	gas
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Nick Clifton

Reported:	2024-07-08 15:07 UTC by Jakub Jelinek
Modified:	2024-07-12 03:32 UTC

Attachments
Proposed patch (5.06 KB, patch) 2024-07-09 17:14 UTC, Nick Clifton

Description Jakub Jelinek 2024-07-08 15:07:22 UTC

When GCC emits binary data into assembly, whether for LTO sections with -flto or for large variable initializers, it currently either uses .ascii/.string directives like
        .string "\304\2347a@\355\004\302tL\302\\\260O>6D\347\266\2527`\200\355\004\276\276L\302j\330'\0279D\347\262\2347`\326v\002_%&a-\354S\017\017\3219i\365\006\214\250\235@\027\324$,\205}\352\271!:G\345\274\331\"l'\020\345f\362U\310>\341\350\020\225+\254\336l\305\265\023\210z1\371\002d\237nz\210\312\3759o\266\216\332\t<EM\276\350\330\247\033\034\242r|\325\233\255\234v\002M{&_j\354\223\315\016Q\271;\347\215V];\201&\035\223\2572\366\311\306\206\250\034\\\365FK\251\235\300\322\227\311W\225}\252\311!*\367\346\274\321\332i'\260Tb\362\025\305>\321\360\020\225S\253\336d\345\265\023H\202\232|1\261O47D\345\320\2347YN\355\004\220\334L\276\202\330\247\031\035\242ra\325\233\254\236v\002H/&_=\354\323\263\207\250\334\226\363\246\312\327N"
        .ascii  "\0325\371\242a\237\2368D\345\252\325\233*[;\201\242=\223"
        .string "/\026\366\211\335!*'u\336T\201\332\t\024\351\230|\241\260O\254\r\321\270\303\352\215\025`;\001\234/\223.B\366\251%\207h\\\241\363\306\312\332N"
        .ascii  "\247\304\244\013\217}b\341!\032\347W\275\261\"j'`\0035\351\232"
        .ascii  "c\237Xn\210\306\3619o\244\204\355\004h\334L\272\322\330\247\025"
or emits it as a sequence of .byte directives
        .byte   127
        .byte   69
        .byte   76
        .byte   70
        .byte   2
        .byte   1
        .byte   1
        .byte   3
        .byte   0
        .byte   0
For ASCII or mostly ASCII data, .string/.ascii are just fine, with 1 or slightly more than 1 assembly character per data section byte, but as can be seen above, for non-ASCII values that is 4 characters per byte, or in the .byte sequence case up to 11 characters per byte.

I've been wondering whether gas could add a .base64 directive; base64 encoding/decoding is pretty fast and can be implemented efficiently in a few lines of C or C++ code.  It is something I'm also proposing for #embed preprocessing.
Perhaps the .base64 argument could be a string, like:
        .base64 "RUxGAgEBAwAAAAAAAAAAAgA+AAEAAABQ00AAAAAAAEAAAAAAAAAA2CBLEAAAAAAAAAAAQAA4AA4A"
        .base64 "QAAsACsABgAAAAQAAABAAAAAAAAAAEAAQAAAAAAAQABAAAAAAAAQAwAAAAAAABADAAAAAAAACAAA"
        .base64 "AAAAAAAAAAAAAAAAAAA="
https://datatracker.ietf.org/doc/html/rfc4648#section-4
I'd probably not put any requirement on the line (string) length, and not accept line breaks or any characters other than [A-Za-z0-9+/=]; just require that the string length is a multiple of 4 characters, so that each string is valid base64 on its own, with at most = or == at the end and no other = chars.
So even
        .base64 "RUxGAgEBAwAAAAAAAAAAAgA+AAEAAABQ00AAAAAAAEAAAAAAAAAA2CBLEAAAAAAAAAAAQAA4AA4AQAAsACsABgAAAAQAAABAAAAAAAAAAEAAQAAAAAAAQABAAAAAAAAQAwAAAAAAABADAAAAAAAACAAAAAAAAAAAAAAAAAAAAAA="
etc. would be valid.
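
The acceptance rules proposed above could be checked roughly as follows. This is only a sketch in C: the name base64_string_is_valid is made up for illustration and is not the check that gas performs. It assumes an ASCII-compatible character set and treats an empty string as valid, since zero is a multiple of 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not taken from gas: returns true if S (LEN
   characters, surrounding quotes already stripped) follows the proposed
   rules.  */
static bool
base64_string_is_valid (const char *s, size_t len)
{
  assert (s != NULL);

  /* The string must be a whole number of 4-character groups.  */
  if (len % 4 != 0)
    return false;

  /* At most "=" or "==" of padding, and only at the very end.  */
  size_t pad = 0;
  while (pad < 2 && pad < len && s[len - 1 - pad] == '=')
    pad++;

  /* Everything before the padding must be in [A-Za-z0-9+/].  This also
     rejects '=' in the middle, line breaks and whitespace.  */
  for (size_t i = 0; i < len - pad; i++)
    {
      char c = s[i];
      bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '+' || c == '/';
      if (!ok)
        return false;
    }
  return true;
}
```

With this sketch, the short third line of the example, "AAAAAAAAAAAAAAAAAAA=", is accepted, while "AA=A" or a string containing a newline is rejected.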

Comment 1 Jakub Jelinek 2024-07-08 15:12:25 UTC

base64 encoding is 4 characters per 3 bytes.
I guess the directive shouldn't be supported for targets which don't have 8-bit bytes (if there are any).
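
To make the "4 characters per 3 bytes" arithmetic concrete, here is a small C sketch comparing the payload size of base64 text with the worst case for .string/.ascii. Both function names are hypothetical, and the worst case simply assumes every byte needs a 4-character octal escape.

```c
#include <assert.h>
#include <stdint.h>

/* Characters of base64 text needed for N data bytes: every group of up
   to 3 bytes becomes 4 characters (a short last group is padded with
   '=').  */
static uint64_t
base64_encoded_len (uint64_t n)
{
  return (n + 2) / 3 * 4;
}

/* Worst case for .string/.ascii, assuming every byte is written as a
   4-character octal escape such as "\304".  */
static uint64_t
octal_escaped_len_worst_case (uint64_t n)
{
  assert (n <= UINT64_MAX / 4);
  return n * 4;
}
```

For 3 bytes of non-ASCII data this gives 4 characters of base64 against 12 characters of octal escapes; directive names, quotes and newlines come on top in both cases.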

Comment 2 Nick Clifton 2024-07-09 10:06:14 UTC

Hi Jakub,

  Does libiberty (or some other library) have a base64 decoding function ?

  If not, I guess I will have to steal^H^H^H^H borrow some code from some 
  other project.

Cheers
  Nick

Comment 3 Jakub Jelinek 2024-07-09 10:13:47 UTC

(In reply to Nick Clifton from comment #2)
> Hi Jakub,
> 
>   Does libiberty (or some other library) have a base64 decoding function ?

I don't think so.

>   If not, I guess I will have to steal^H^H^H^H borrow some code from some 
>   other project.

I simply wrote my own, see
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655156.html
(the base64_dec_fn helper function, base64_dec array and most of
finish_base64_embed).  It uses C++ for the base64_dec array initialization; of course it could be initialized on demand at runtime, I just wanted to make it more efficient.
Now, I don't really remember whether gas does any kind of character set translation, i.e. whether say 'A' in the .ascii/.string etc. routines is expected to be 'A' in the gas source and is what gets written to the sections.
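
For readers who want to see the shape of a table-driven decoder, here is a minimal C sketch of the "initialized on demand at runtime" variant mentioned above. It is not the code from the linked GCC patch and not what gas ended up using; dec_table, init_dec_table and base64_decode are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: dec_table[c] is the 6-bit value of
   character c, or -1 if c is not in the base64 alphabet.  */
static signed char dec_table[256];
static int dec_table_ready;

static void
init_dec_table (void)
{
  static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  memset (dec_table, -1, sizeof dec_table);
  for (int i = 0; i < 64; i++)
    dec_table[(unsigned char) alphabet[i]] = (signed char) i;
  dec_table_ready = 1;
}

/* Decode LEN characters from SRC into DST, which must have room for
   LEN / 4 * 3 bytes.  Returns the number of bytes written, or
   (size_t) -1 if SRC is not well-formed base64.  */
static size_t
base64_decode (const char *src, size_t len, unsigned char *dst)
{
  assert (src != NULL && dst != NULL);
  if (!dec_table_ready)
    init_dec_table ();
  if (len % 4 != 0)
    return (size_t) -1;

  size_t out = 0;
  for (size_t i = 0; i < len; i += 4)
    {
      /* Padding may only appear in the last group, as "=" or "==".  */
      int pad = 0;
      if (i + 4 == len)
        {
          if (src[i + 3] == '=')
            pad++;
          if (pad && src[i + 2] == '=')
            pad++;
        }

      /* Collect four 6-bit values into one 24-bit word.  */
      unsigned long v = 0;
      for (int j = 0; j < 4; j++)
        {
          int d = j < 4 - pad ? dec_table[(unsigned char) src[i + j]] : 0;
          if (d < 0)
            return (size_t) -1;
          v = (v << 6) | (unsigned long) d;
        }

      dst[out++] = (v >> 16) & 0xff;
      if (pad < 2)
        dst[out++] = (v >> 8) & 0xff;
      if (pad < 1)
        dst[out++] = v & 0xff;
    }
  return out;
}
```

The table is indexed by the raw character value, so the sketch assumes that the characters in the source text use the same encoding as the compiler's execution character set, which is the translation question raised above.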

Comment 4 Nick Clifton 2024-07-09 17:14:07 UTC

Created attachment 15612
Proposed patch

Hi Jakub,

  Would you like to try out this patch ?

  With it applied you can use the .base64 directive as you outlined in your description.  The patch allows multiple comma-separated strings to be specified for a single .base64 pseudo-op, in keeping with how other assembler pseudo-ops behave.

Cheers
  Nick

Comment 5 Jakub Jelinek 2024-07-09 18:18:32 UTC

Comment on attachment 15612
Proposed patch

Thanks, will try tomorrow.

Just some nits:
1) @command{uuencode} program's @code{-m} option
   I think base64 program from coreutils is more common than uuencode from sharutils,
   so either mention just that, or both.  For no line wrapping base64 has -w 0 option.
2) I don't know how FRAG_APPEND_1_CHAR is expensive compared to say appending more
   characters at a time; if appending more at a time would be cheaper, with base64
   one can cheaply check for the length of the addition (at least number of non-["=]
   characters divided by 4 times 3); but if it is inexpensive, just ignore
I guess the most important thing will be how fast the parsing of it is (and the encoding on the gcc side).  Just running coreutils base64 on 261M file took around 1s and base64 -d of that too.
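
The length check mentioned in nit 2 can be made exact. The C sketch below uses a hypothetical helper, base64_decoded_len, and assumes the string has already been validated; a caller could use the result to reserve output space once instead of appending byte by byte. Whether that would be a win in gas is the open question in the nit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of data bytes encoded by a well-formed
   base64 string S of LEN characters.  Each non-'=' character carries
   6 bits, so the byte count is the number of such characters times
   3/4, rounded down.  */
static size_t
base64_decoded_len (const char *s, size_t len)
{
  assert (s != NULL && len % 4 == 0);

  /* Strip at most two trailing '=' characters.  */
  size_t chars = len;
  while (chars > 0 && len - chars < 2 && s[chars - 1] == '=')
    chars--;

  /* Split the multiplication to avoid overflow for huge inputs.  */
  return chars / 4 * 3 + (chars % 4) * 3 / 4;
}
```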

Comment 6 Nick Clifton 2024-07-10 08:17:25 UTC

(In reply to Jakub Jelinek from comment #5)
 
> 1) @command{uuencode} program's @code{-m} option
>    I think base64 program from coreutils is more common than uuencode from
> sharutils,
>    so either mention just that, or both.  For no line wrapping base64 has -w
> 0 option.

That is a fair point.  I will change the example to use the base64 program as you suggested.

> 2) I don't know how FRAG_APPEND_1_CHAR is expensive compared to say
> appending more

It is actually pretty fast unless the specific backend involved needs to do something funky.

> Just running coreutils base64 on 261M file
> took around 1s and base64 -d of that too.

I tried a similar test using /usr/bin/lto-dump (29MB) and it took the assembler less than a second to convert the base64 encoded version of the file into an object file containing the binary.

Comment 7 Jakub Jelinek 2024-07-10 11:08:44 UTC

So, I've tried your patch on my short #embed testcase:
unsigned char a[] = {
#embed "cc1plus"
};
with the #embed patchset for GCC, where cc1plus is a 273372376-byte binary.
The assembly GCC emits for this is 1328371852 bytes long, just
        .file   "embed-11.c"
        .text
        .globl  a
        .data
        .align 32
        .type   a, @object
        .size   a, 273372376
a:
        .byte   127
        .string "ELF\002\001\001\003"
        .string ""
        .string ""
...
        .string "(\035\214\034\347_u\244\rz|~\002\253h\267\271\203v\244\266\372\001\353\363\026\346\365\305\211\005\220\372\215h\267\211{\022\257\277'\0256\215G\2013c.~\244\206\360\226|_\226\223\034\177j\232u\300,\003\3273kh\267q\221\302\326\3153\3772\202,\003\327\346\207\3662giJ3\202,\003\327\305\271\234@%v~\2446-\034\257\310\207\302\326=\256h\267\016\237h\267Q\201\023\257\016\313\302\326q\032\\*\205(u\244\237\023t\244\344Vt\244\247\335\243k\007\256\302\326,th\267}\221h\267\317O\034\257\377\373v\244\227\202a\221$\236\3772\263\326X\221\215Mz\244\216\227\034\257F\213\302\326G\316\302\326\033\277\302\326\177\220h\267\023\263\302\326X\236v\244\034Zt\244\003>\177[\0135\022\257\226ph\267|\377\3033Ox\022\257\214\307\340`\356\235\3772M>\245\013\321*\003\327=\377\3033"
...
        .string ""
        .string ""
        .byte   0
        .ident  "GCC: (GNU) 15.0.0 20240703 (experimental)"
        .section        .note.GNU-stack,"",@progbits
Now, if I hand-edit this to replace everything from the first .byte through the last one (including the .string etc. directives in between) with the output of
cat cc1plus | base64 | sed 's/^/\t.base64\t"/;s/$/"/'
the new assembly is 422048853 bytes.
time .../gas/as-new -o embed-11.o embed-11.s 

real	0m10.481s
user	0m10.113s
sys	0m0.356s

time .../gas/as-new -o embed-11_.o embed-11_.s 

real	0m2.519s
user	0m2.282s
sys	0m0.233s
md5sum embed-11.o embed-11_.o
049aaf9fdb9cf6f84fd54984ab032ac0  embed-11.o
049aaf9fdb9cf6f84fd54984ab032ac0  embed-11_.o

So, this looks good to me.
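
The hand edit above relies on base64 and sed; a compiler would presumably produce the directives itself. Here is a minimal C sketch of such an emitter. It is not GCC's actual output code: base64_encode, emit_base64_directives and BYTES_PER_LINE are names made up for this example, and the 57-byte (76-character) line length merely mirrors the default wrapping of coreutils base64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { BYTES_PER_LINE = 57 };   /* 57 bytes encode to 76 characters.  */

/* Encode N bytes from SRC into DST as NUL-terminated base64 text.
   DST must have room for (N + 2) / 3 * 4 + 1 characters.  */
static void
base64_encode (const unsigned char *src, size_t n, char *dst)
{
  static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  for (size_t i = 0; i < n; i += 3)
    {
      size_t left = n - i;
      unsigned long v = (unsigned long) src[i] << 16;
      if (left > 1)
        v |= (unsigned long) src[i + 1] << 8;
      if (left > 2)
        v |= src[i + 2];

      *dst++ = alphabet[(v >> 18) & 63];
      *dst++ = alphabet[(v >> 12) & 63];
      *dst++ = left > 1 ? alphabet[(v >> 6) & 63] : '=';
      *dst++ = left > 2 ? alphabet[v & 63] : '=';
    }
  *dst = '\0';
}

/* Write DATA (N bytes) to OUT as a sequence of .base64 directives.
   BYTES_PER_LINE is a multiple of 3, so only the last line can end in
   '=' padding and every line is valid base64 on its own.  */
static void
emit_base64_directives (FILE *out, const unsigned char *data, size_t n)
{
  char line[BYTES_PER_LINE / 3 * 4 + 1];

  assert (out != NULL && (data != NULL || n == 0));
  for (size_t pos = 0; pos < n; )
    {
      size_t chunk = n - pos < BYTES_PER_LINE ? n - pos : BYTES_PER_LINE;
      base64_encode (data + pos, chunk, line);
      fprintf (out, "\t.base64\t\"%s\"\n", line);
      pos += chunk;
    }
}
```

Since the directive puts no limit on string length, longer lines would cut the per-line overhead of the directive name, quotes and newline further.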

Comment 8 Sourceware Commits 2024-07-10 14:02:27 UTC

The master branch has been updated by Nick Clifton <nickc@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=479edf0a6a61159486f14d5e62403f8769cc591d

commit 479edf0a6a61159486f14d5e62403f8769cc591d
Author: Nick Clifton <nickc@redhat.com>
Date:   Wed Jul 10 15:01:39 2024 +0100

    Add support for a .base64 pseudo-op to gas
    
      PR 31964

Comment 9 Nick Clifton 2024-07-10 14:11:19 UTC

Feature added.

Comment 10 Sourceware Commits 2024-07-11 02:07:06 UTC

The master branch has been updated by Alan Modra <amodra@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=4cf957e7ac44097baa06e6caca5ad444cff78155

commit 4cf957e7ac44097baa06e6caca5ad444cff78155
Author: Alan Modra <amodra@gmail.com>
Date:   Thu Jul 11 11:08:50 2024 +0930

    Re: Add support for a .base64 pseudo-op to gas
    
    Fixes a failure on rx-elf where the standard data section isn't .data.
    run_dump_test has machinery to translate .data in both options and
    expected results for objdump, but not for readelf -x.
    
            PR 31964
            * testsuite/gas/all/base64.d: Dump .data with objdump.  Run on
            all targets.

Comment 11 Sourceware Commits 2024-07-11 11:52:42 UTC

The master branch has been updated by Nick Clifton <nickc@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=a79094915578872a0360c78a54accff994b883b1

commit a79094915578872a0360c78a54accff994b883b1
Author: Nick Clifton <nickc@redhat.com>
Date:   Thu Jul 11 12:51:16 2024 +0100

    base64: Add support for targets with byte size > octet size.
    
    PR 31964

Comment 12 Sourceware Commits 2024-07-12 03:32:39 UTC

The master branch has been updated by Alan Modra <amodra@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=d686a2b68810b4b1f98930cebcf3b2ee256b1ce2

commit d686a2b68810b4b1f98930cebcf3b2ee256b1ce2
Author: Alan Modra <amodra@gmail.com>
Date:   Fri Jul 12 09:50:46 2024 +0930

    Re: base64: Add support for targets with byte size > octet size.
    
    Three extra octets are now expected with the latest change to base64.s.
    They happened to be covered by patterns allowing for zero padding at
    the end of the section, but we don't want to allow fewer octets than
    expected.
    
            PR 31964
            * testsuite/gas/all/base64.d: Adjust.