GCC right now when it emits binary data into assembler, whether it is for LTO sections with -flto or large variable initializers, uses either the .ascii/.string directives like

	.string	"\304\2347a@\355\004\302tL\302\\\260O>6D\347\266\2527`\200\355\004\276\276L\302j\330'\0279D\347\262\2347`\326v\002_%&a-\354S\017\017\3219i\365\006\214\250\235@\027\324$,\205}\352\271!:G\345\274\331\"l'\020\345f\362U\310>\341\350\020\225+\254\336l\305\265\023\210z1\371\002d\237nz\210\312\3759o\266\216\332\t<EM\276\350\330\247\033\034\242r|\325\233\255\234v\002M{&_j\354\223\315\016Q\271;\347\215V];\201&\035\223\2572\366\311\306\206\250\034\\\365FK\251\235\300\322\227\311W\225}\252\311!*\367\346\274\321\332i'\260Tb\362\025\305>\321\360\020\225S\253\336d\345\265\023H\202\232|1\261O47D\345\320\2347YN\355\004\220\334L\276\202\330\247\031\035\242ra\325\233\254\236v\002H/&_=\354\323\263\207\250\334\226\363\246\312\327N"
	.ascii	"\0325\371\242a\237\2368D\345\252\325\233*[;\201\242=\223"
	.string	"/\026\366\211\335!*'u\336T\201\332\t\024\351\230|\241\260O\254\r\321\270\303\352\215\025`;\001\234/\223.B\366\251%\207h\\\241\363\306\312\332N"
	.ascii	"\247\304\244\013\217}b\341!\032\347W\275\261\"j'`\0035\351\232"
	.ascii	"c\237Xn\210\306\3619o\244\204\355\004h\334L\272\322\330\247\025"

or emits it as a sequence of .byte directives

	.byte	127
	.byte	69
	.byte	76
	.byte	70
	.byte	2
	.byte	1
	.byte	1
	.byte	3
	.byte	0
	.byte	0

For ASCII or mostly ASCII data .string/.ascii are just fine, with 1 or slightly more than 1 assembly character per data section byte, but as can be seen above, for non-ASCII values that is 4 characters per byte or, in the .byte sequence case, up to 11 characters per byte.

I've been wondering whether gas couldn't add a .base64 directive; base64 encoding/decoding is pretty fast and can be implemented efficiently in a few lines of C or C++ code. It is something I'm also proposing for #embed preprocessing.
Perhaps the .base64 argument could be a string, like:

	.base64	"RUxGAgEBAwAAAAAAAAAAAgA+AAEAAABQ00AAAAAAAEAAAAAAAAAA2CBLEAAAAAAAAAAAQAA4AA4A"
	.base64	"QAAsACsABgAAAAQAAABAAAAAAAAAAEAAQAAAAAAAQABAAAAAAAAQAwAAAAAAABADAAAAAAAACAAA"
	.base64	"AAAAAAAAAAAAAAAAAAA="

https://datatracker.ietf.org/doc/html/rfc4648#section-4

I'd probably not add any requirements on the line (string) length, and not accept any line breaks or any characters other than [A-Za-z0-9+/=]; just require that the string has a multiple of 4 characters and so is valid base64 on its own, with at most = or == at the end and no other = chars. So even

	.base64	"RUxGAgEBAwAAAAAAAAAAAgA+AAEAAABQ00AAAAAAAEAAAAAAAAAA2CBLEAAAAAAAAAAAQAA4AA4AQAAsACsABgAAAAQAAABAAAAAAAAAAEAAQAAAAAAAQABAAAAAAAAQAwAAAAAAABADAAAAAAAACAAAAAAAAAAAAAAAAAAAAAA="

etc. would be valid.
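The acceptance rule described above (only [A-Za-z0-9+/=], a length that is a multiple of 4, and `=` or `==` only at the very end) can be sketched as a small C check. This is an illustration only, not the code gas ended up with: the function names are hypothetical, the character tests assume an ASCII execution character set, and treating an empty string as valid is an assumption the comment does not address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if C is one of the 64 data characters of the RFC 4648
   section 4 alphabet.  Assumes ASCII, where the letter and digit
   ranges are contiguous.  */
static bool
is_base64_data_char (char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
	 || (c >= '0' && c <= '9') || c == '+' || c == '/';
}

/* Hypothetical helper: check one string operand against the
   proposed rule.  S is the text between the quotes.  */
bool
base64_operand_is_valid (const char *s)
{
  size_t len = strlen (s);
  size_t pad = 0;

  /* The string must be complete base64 on its own.  */
  if (len % 4 != 0)
    return false;

  /* Allow at most two '=' characters, and only at the very end.  */
  while (pad < 2 && pad < len && s[len - 1 - pad] == '=')
    pad++;

  /* Everything before the padding must be a data character, which
     also rejects line breaks, spaces and any earlier '='.  */
  for (size_t i = 0; i < len - pad; i++)
    if (!is_base64_data_char (s[i]))
      return false;

  return true;
}
```

With this sketch "RUxG", "RUw=" and "RQ==" are accepted, while "RUx" (length not a multiple of 4), "R=xG" (padding in the middle) and "RU x" (a space) are rejected. It does not check that the unused low bits before the padding are zero.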
base64 encoding is 4 characters per 3 bytes. Guess the directive shouldn't be supported for targets which don't have 8-bit bytes (if there are any).
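To put the "4 characters per 3 bytes" figure next to the other two encodings mentioned earlier, here is a minimal C sketch of the worst-case text sizes. The function names are hypothetical; the 4 and 11 characters per byte are the figures quoted in the description (an octal escape such as \304, and a tab, .byte, a tab, three digits and a newline).

```c
#include <stddef.h>

/* Characters of base64 text needed for N data bytes, including the
   '=' padding that rounds the last group up to 4 characters.  */
size_t
base64_encoded_len (size_t n)
{
  return 4 * ((n + 2) / 3);
}

/* Worst case for .string/.ascii: every byte needs a 4-character
   octal escape.  Directive and quote overhead is ignored.  */
size_t
octal_escaped_len (size_t n)
{
  return 4 * n;
}

/* Worst case for a .byte sequence: 11 characters per byte.  */
size_t
byte_directive_len (size_t n)
{
  return 11 * n;
}
```

For 300 bytes of non-ASCII data this gives 400 characters of base64, against 1200 characters of octal escapes and 3300 characters of .byte directives.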
Hi Jakub,

Does libiberty (or some other library) have a base64 decoding function ?

If not, I guess I will have to steal^H^H^H^H borrow some code from some other project.

Cheers
Nick
(In reply to Nick Clifton from comment #2)
> Hi Jakub,
>
> Does libiberty (or some other library) have a base64 decoding function ?

I don't think so.

> If not, I guess I will have to steal^H^H^H^H borrow some code from some
> other project.

I simply wrote my own, see
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655156.html
(the base64_dec_fn helper function, base64_dec array and most of finish_base64_embed).
Uses C++ for the base64_dec array initialization, of course it could be initialized on demand at runtime, just wanted to make it more efficient.

Now, I don't really remember if gas does any kind of character set translation or not, or whether say 'A' in .ascii/.string etc. routines is expected to be 'A' in gas source and is what is being written to the sections.
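For readers who do not want to follow the link, a table-driven decoder with the table filled in on demand at runtime could look roughly like the C sketch below. It is not the code from the GCC patch: only the array name base64_dec is borrowed from the comment above, while base64_dec_init, base64_decode and the error convention are hypothetical. It also assumes the source characters are used as-is to index the table, which is exactly the character set question raised above.

```c
#include <stddef.h>
#include <string.h>

/* Maps a character to its 6-bit value, or -1 if it is not a base64
   data character.  Filled in lazily on first use.  */
static signed char base64_dec[256];
static int base64_dec_ready;

static void
base64_dec_init (void)
{
  static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  memset (base64_dec, -1, sizeof base64_dec);
  for (int i = 0; i < 64; i++)
    base64_dec[(unsigned char) alphabet[i]] = (signed char) i;
  base64_dec_ready = 1;
}

/* Decode the NUL-terminated string IN into OUT, which must have room
   for strlen (IN) / 4 * 3 bytes.  Returns the number of bytes
   written, or (size_t) -1 if IN is not valid padded base64.  */
size_t
base64_decode (const char *in, unsigned char *out)
{
  size_t len = strlen (in), n = 0;

  if (!base64_dec_ready)
    base64_dec_init ();
  if (len % 4 != 0)
    return (size_t) -1;

  for (size_t i = 0; i < len; i += 4)
    {
      unsigned int v = 0;
      int pad = 0;

      for (int j = 0; j < 4; j++)
	{
	  char c = in[i + j];
	  int d = base64_dec[(unsigned char) c];

	  /* '=' is only allowed in the last one or two positions of
	     the final group.  */
	  if (c == '=' && i + 4 == len && j >= 2)
	    {
	      pad++;
	      d = 0;
	    }
	  else if (d < 0 || pad)
	    return (size_t) -1;
	  v = v << 6 | (unsigned int) d;
	}

      /* Each 4-character group carries 24 bits, less 8 per '='.  */
      out[n++] = v >> 16;
      if (pad < 2)
	out[n++] = (v >> 8) & 0xff;
      if (pad < 1)
	out[n++] = v & 0xff;
    }
  return n;
}
```

Decoding "RUxGAgEB" with this sketch yields the six bytes 'E', 'L', 'F', 2, 1, 1, which matches the start of the .byte sequence in the description after its leading 127.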
Created attachment 15612
Proposed patch

Hi Jakub,

Would you like to try out this patch ? With it applied you can use the .base64 directive as you outlined in your description.

The patch allows for multiple comma separated strings to be specified for a single .base64 pseudo-op because this is in keeping with how other assembler pseudo-ops behave.

Cheers
Nick
Comment on attachment 15612
Proposed patch

Thanks, will try tomorrow.

Just some nits:

1) @command{uuencode} program's @code{-m} option
   I think the base64 program from coreutils is more common than uuencode from sharutils, so either mention just that, or both. For no line wrapping, base64 has the -w 0 option.

2) I don't know how expensive FRAG_APPEND_1_CHAR is compared to, say, appending more characters at a time; if appending more at a time would be cheaper, with base64 one can cheaply check the length of the addition (at least the number of non-["=] characters divided by 4 times 3); but if it is inexpensive, just ignore this.

I guess the most important thing will be how fast the parsing of it is (and the encoding on the gcc side). Just running coreutils base64 on a 261M file took around 1s, and base64 -d of that too.
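The length check mentioned in nit 2 amounts to one pass over the operand. The following C sketch shows the arithmetic only; the function name is hypothetical and nothing here uses gas internals. It stops at the closing quote or at the end of the string, and for well-formed padded base64 the result is the exact decoded size.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes the operand starting at S
   will decode to, found by counting the characters that are neither
   '"' nor '='.  Every 4 data characters carry 3 bytes; a final group
   with 2 or 3 data characters carries 1 or 2 bytes, which the
   integer division accounts for.  */
size_t
base64_decoded_size (const char *s)
{
  size_t data_chars = 0;

  for (; *s != '\0' && *s != '"'; s++)
    if (*s != '=')
      data_chars++;
  return data_chars * 3 / 4;
}
```

A caller could use the result to reserve the output space once and then write the decoded bytes in bulk, instead of appending them one at a time.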
(In reply to Jakub Jelinek from comment #5)
> 1) @command{uuencode} program's @code{-m} option
> I think base64 program from coreutils is more common than uuencode from
> sharutils,
> so either mention just that, or both. For no line wrapping base64 has -w
> 0 option.

That is a fair point. I will change the example to use the base64 program as you suggested.

> 2) I don't know how FRAG_APPEND_1_CHAR is expensive compared to say
> appending more

It is actually pretty fast unless the specific backend involved needs to do something funky.

> Just running coreutils base64 on 261M file
> took around 1s and base64 -d of that too.

I tried a similar test using /usr/bin/lto-dump (29Mb) and it took the assembler less than a second to convert the base64 encoded version of the file into an object file containing the binary.
So, I've tried your patch on my short #embed testcase:

unsigned char a[] = {
#embed "cc1plus"
};

with the #embed patchset for GCC, where cc1plus is 273372376 bytes long binary. Assembly for this from the gcc is 1328371852 bytes long, just

	.file	"embed-11.c"
	.text
	.globl	a
	.data
	.align 32
	.type	a, @object
	.size	a, 273372376
a:
	.byte	127
	.string	"ELF\002\001\001\003"
	.string	""
	.string	""
...
	.string	"(\035\214\034\347_u\244\rz|~\002\253h\267\271\203v\244\266\372\001\353\363\026\346\365\305\211\005\220\372\215h\267\211{\022\257\277'\0256\215G\2013c.~\244\206\360\2 26|_\226\223\034\177j\232u\300,\003\3273kh\267q\221\302\326\3153\3772\202,\003\327\346\207\3662giJ3\202,\003\327\305\271\234@%v~\2446-\034\257\310\207\302\326=\256h\267\016\237h\267Q \201\023\257\016\313\302\326q\032\\*\205(u\244\237\023t\244\344Vt\244\247\335\243k\007\256\302\326,th\267}\221h\267\317O\034\257\377\373v\244\227\202a\221$\236\3772\263\326X\221\215M z\244\216\227\034\257F\213\302\326G\316\302\326\033\277\302\326\177\220h\267\023\263\302\326X\236v\244\034Zt\244\003>\177[\0135\022\257\226ph\267|\377\3033Ox\022\257\214\307\340`\356 \235\3772M>\245\013\321*\003\327=\377\3033"
...
	.string	""
	.string	""
	.byte	0
	.ident	"GCC: (GNU) 15.0.0 20240703 (experimental)"
	.section	.note.GNU-stack,"",@progbits

Now, if I hand edit this to replace the first .byte up to the last one including .string etc. directives in between with

cat cc1plus | base64 | sed 's/^/\t.base64\t"/;s/$/"/'

the new assembly is 422048853 bytes.

time .../gas/as-new -o embed-11.o embed-11.s

real	0m10.481s
user	0m10.113s
sys	0m0.356s

time .../gas/as-new -o embed-11_.o embed-11_.s

real	0m2.519s
user	0m2.282s
sys	0m0.233s

md5sum embed-11.o embed-11_.o
049aaf9fdb9cf6f84fd54984ab032ac0  embed-11.o
049aaf9fdb9cf6f84fd54984ab032ac0  embed-11_.o

So, this looks good to me.
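As a sanity check on the sizes reported above, the expected length of the base64 form can be estimated in a few lines of C. This is a back-of-the-envelope sketch, not something taken from the thread: the function name is hypothetical, 76 is the default line width of coreutils base64, and the 12 characters of overhead per line are the tab, .base64, tab and opening quote added by the sed command plus the closing quote and the newline.

```c
/* Hypothetical helper: estimated size of the .base64 lines for
   NBYTES of data, with LINE_CHARS base64 characters per line and
   PER_LINE_OVERHEAD extra characters around each line.  */
unsigned long long
base64_asm_size (unsigned long long nbytes, unsigned int line_chars,
		 unsigned int per_line_overhead)
{
  /* 4 characters per 3 bytes, last group padded to 4.  */
  unsigned long long chars = 4 * ((nbytes + 2) / 3);
  /* Number of output lines, the last one possibly short.  */
  unsigned long long lines = (chars + line_chars - 1) / line_chars;

  return chars + lines * per_line_overhead;
}
```

For the 273372376-byte cc1plus, base64_asm_size (273372376, 76, 12) gives 422048588, which is within a few hundred bytes of the reported 422048853; the difference is presumably the directives around the data. That works out to about 1.54 characters per byte, against roughly 4.86 for the original 1328371852-byte assembly.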
The master branch has been updated by Nick Clifton <nickc@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=479edf0a6a61159486f14d5e62403f8769cc591d

commit 479edf0a6a61159486f14d5e62403f8769cc591d
Author: Nick Clifton <nickc@redhat.com>
Date:   Wed Jul 10 15:01:39 2024 +0100

    Add support for a .base64 pseudo-op to gas

    PR 31964
Feature added.
The master branch has been updated by Alan Modra <amodra@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=4cf957e7ac44097baa06e6caca5ad444cff78155

commit 4cf957e7ac44097baa06e6caca5ad444cff78155
Author: Alan Modra <amodra@gmail.com>
Date:   Thu Jul 11 11:08:50 2024 +0930

    Re: Add support for a .base64 pseudo-op to gas

    Fixes a failure on rx-elf where the standard data section isn't .data.
    run_dump_test has machinery to translate .data in both options and
    expected results for objdump, but not for readelf -x.

    PR 31964
    * testsuite/gas/all/base64.d: Dump .data with objdump. Run on
    all targets.
The master branch has been updated by Nick Clifton <nickc@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=a79094915578872a0360c78a54accff994b883b1

commit a79094915578872a0360c78a54accff994b883b1
Author: Nick Clifton <nickc@redhat.com>
Date:   Thu Jul 11 12:51:16 2024 +0100

    base64: Add support for targets with byte size > octet size.

    PR 31964
The master branch has been updated by Alan Modra <amodra@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=d686a2b68810b4b1f98930cebcf3b2ee256b1ce2

commit d686a2b68810b4b1f98930cebcf3b2ee256b1ce2
Author: Alan Modra <amodra@gmail.com>
Date:   Fri Jul 12 09:50:46 2024 +0930

    Re: base64: Add support for targets with byte size > octet size.

    Three extra octets are now expected with the latest change to
    base64.s. They happened to be covered by patterns allowing for
    zero padding at the end of the section, but we don't want to allow
    fewer octets than expected.

    PR 31964
    * testsuite/gas/all/base64.d: Adjust.