Bug 32003

Summary: Specifying --package-metadata might not be possible and is too fragile
Product: binutils Reporter: Benjamin Drung <bdrung>
Component: ldAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal CC: bluca, dilfridge, doko, hjl.tools, jbeulich, sam, toolchain
Priority: P2    
Version: 2.44 (HEAD)   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed: 2024-07-23 00:00:00

Description Benjamin Drung 2024-07-22 12:09:00 UTC
In Ubuntu we enabled setting ELF package metadata for the Debian package that we build starting from Ubuntu 24.10 (oracular) on. Specifying the linker flag --package-metadata is not possible in a robust way. All tried approaches are either not working or too fragile:

1. The comma in the JSON value is used to split the -Wl parameter specified for gcc:

```
$ echo "void main() { }" > test.c
$ gcc '-Wl,--package-metadata={"type":"deb","os":"ubuntu"}' test.c 
/usr/bin/ld: cannot find "os":"ubuntu"}: No such file or directory
collect2: error: ld returned 1 exit status
```

2. The quotation marks in the JSON value are eaten by configure scripts and Makefiles. Example:

```
$ echo "void main() { }" > test.c
$ printf 'test:\n\tgcc $(CFLAGS) test.c\n' > Makefile
$ env CFLAGS='-Wl,--package-metadata={"type":"deb"}' make
gcc -Wl,--package-metadata={"type":"deb"} test.c
/usr/bin/ld: warning: --package-metadata={type:deb} does not contain valid JSON, ignoring: string or '}' expected near 'type'
```

3. Add `-specs=<spec-file>` to the gcc linker flags. Then this spec file could construct the package metadata parameter. Example spec file:

```
$ cat /usr/share/dpkg/elf-package-metadata.specs
*link:
+ --package-metadata={\"type\":\"deb\",\"os\":\"%:getenv(DEB_BUILD_OS_RELEASE_ID \",\"name\":\"%:getenv(DEB_SOURCE \",\"version\":\"%:getenv(DEB_VERSION \",\"architecture\":\"%:getenv(DEB_HOST_ARCH \"}))))
```

Issue with that approach: It requires the spec file to be around and the environment variables to be set. This will be the case during the package build, but not at a later stage. See https://launchpad.net/bugs/2071468 for examples where this breaks.

## Proposed solution

Add support for an `--escaped-package-metadata` parameter to the linkers that takes a percent encoded (RFC 3986) parameter. Example:
```
-Wl,--encoded-package-metadata,%7B%22type%22:%22deb%22%2C%22os%22:%22ubuntu%22%2C%22name%22:%22dpkg%22%2C%22version%22:%221.22.6ubuntu15%22%2C%22architecture%22:%22amd64%22%7D
```
Comment 1 Benjamin Drung 2024-07-22 12:14:46 UTC
Proposed --encoded-package-metadata patches:
* ld patch v4: https://sourceware.org/pipermail/binutils/2024-July/135769.html
* gold patch v3: https://sourceware.org/pipermail/binutils/2024-July/135796.html
Comment 2 H.J. Lu 2024-07-22 20:46:10 UTC
(In reply to Benjamin Drung from comment #0)
> In Ubuntu we enabled setting ELF package metadata for the Debian package
> that we build starting from Ubuntu 24.10 (oracular) on. Specifying the
> linker flag --package-metadata is not possible in a robust way. All tried
> approaches are either not working or too fragile:
> 
> 1. The comma in the JSON value is used to split the -Wl parameter specified
> for gcc:
> 
> ```
> $ echo "void main() { }" > test.c
> $ gcc '-Wl,--package-metadata={"type":"deb","os":"ubuntu"}' test.c 
> /usr/bin/ld: cannot find "os":"ubuntu"}: No such file or directory
> collect2: error: ld returned 1 exit status
> ```

This is a real issue.

> 2. The quotation marks in the JSON value are eaten by configure scripts and
> Makefiles. Example:
> 
> ```
> $ echo "void main() { }" > test.c
> $ printf 'test:\n\tgcc $(CFLAGS) test.c\n' > Makefile
> $ env CFLAGS='-Wl,--package-metadata={"type":"deb"}' make
> gcc -Wl,--package-metadata={"type":"deb"} test.c
> /usr/bin/ld: warning: --package-metadata={type:deb} does not contain valid
> JSON, ignoring: string or '}' expected near 'type'
> ```

The linker testcase has

"--package-metadata='{\"foo\":\"bar\"}'"

which is how this option should be used.

> 3. Add `-specs=<spec-file>` to the gcc linker flags. Then this spec file
> could construct the package metadata parameter. Example spec file:
> 
> ```
> $ cat /usr/share/dpkg/elf-package-metadata.specs
> *link:
> +
> --package-metadata={\"type\":\"deb\",\"os\":\"%:
> getenv(DEB_BUILD_OS_RELEASE_ID \",\"name\":\"%:getenv(DEB_SOURCE
> \",\"version\":\"%:getenv(DEB_VERSION
> \",\"architecture\":\"%:getenv(DEB_HOST_ARCH \"}))))
> ```
> 
> Issue with that approach: It requires the spec file to be around and the
> environment variables to be set. This will be the case during the package
> build, but not at a later stage. See https://launchpad.net/bugs/2071468 for
> examples where this breaks.
> 
> ## Proposed solution
> 
> Add support for an `--escaped-package-metadata` parameter to the linkers
> that takes a percent encoded (RFC 3986) parameter. Example:
> ```
> -Wl,--encoded-package-metadata,%7B%22type%22:%22deb%22%2C%22os%22:
> %22ubuntu%22%2C%22name%22:%22dpkg%22%2C%22version%22:%221.22.
> 6ubuntu15%22%2C%22architecture%22:%22amd64%22%7D
> ```

It is very much unreadable.  The main issue is that compiler drivers eat
comma.  Can we update linker to support an escape for comma which won't be
eaten by compiler drivers?
Comment 3 H.J. Lu 2024-07-22 23:51:21 UTC
(In reply to H.J. Lu from comment #2)
> It is very much unreadable.  The main issue is that compiler drivers eat
> comma.  Can we update linker to support an escape for comma which won't be
> eaten by compiler drivers?

One possibility is to treat "\comma" as ','.
Comment 4 Jan Beulich 2024-07-23 10:37:12 UTC
(In reply to H.J. Lu from comment #3)
> One possibility is to treat "\comma" as ','.

Yet introducing another escape character would come with further complexity. It also looks to me as if the %-encoding of double-quotes dominated that of commas in the example provided.

I'm inclined to suggest to extend the %-encoding scheme instead; that way what's going to be in 2.43 is not at risk of subtle breakage when support further escaping sequences is added. All that needs to be made sure is for existing escaping sequences to retain their original meaning.

What the best scheme here would be is likely better determined by people who make active use of the feature (knowing what set of characters it is that frequently would need escaping one way or the other). I'd be inclined to borrow naming e.g. from HTML's Named Character References, and then use, say, "%[comma]" or "%comma;" to have proper delimiting on both ends.
Comment 5 H.J. Lu 2024-07-23 12:57:29 UTC
I am against the proposed solution.
The solution should be as close to
the normal JSON syntax as possible
and linker tests should cover +90% of
use cases.
Comment 6 Benjamin Drung 2024-07-23 16:17:18 UTC
(In reply to H.J. Lu from comment #2)
> > 2. The quotation marks in the JSON value are eaten by configure scripts and
> > Makefiles. Example:
> > 
> > ```
> > $ echo "void main() { }" > test.c
> > $ printf 'test:\n\tgcc $(CFLAGS) test.c\n' > Makefile
> > $ env CFLAGS='-Wl,--package-metadata={"type":"deb"}' make
> > gcc -Wl,--package-metadata={"type":"deb"} test.c
> > /usr/bin/ld: warning: --package-metadata={type:deb} does not contain valid
> > JSON, ignoring: string or '}' expected near 'type'
> > ```
> 
> The linker testcase has
> 
> "--package-metadata='{\"foo\":\"bar\"}'"
> 
> which is how this option should be used.

That linker testcase snippet is evaluated to:

--package-metadata='{"foo":"bar"}'

Passing that to a shell or makefile works, because the singe quotes protect the JSON content inside. But this escaping protects this snippet from evaluating only once. The problem is that the linker flag can be passed around and evaluated multiple times. This will happen easily during Debian package build, where debian/rules is a Makefile that calls the project build system (that can use makefiles as well).

Simple example that shows the problem:

```
$ cat example.c 
#include <stdio.h>

void main() {
    printf("Hello world!");
}
$ cat Makefile2
all: example

%: %.c
	gcc $(LDFLAGS) -o $@ $<

.PHONY: all
$ cat Makefile
all:
	make -f Makefile2 all LDFLAGS=$(LDFLAGS)

.PHONY: all
$ LANG=C make LDFLAGS="-Wl,--package-metadata='{\"foo\":\"bar\"}'"
make -f Makefile2 all LDFLAGS=-Wl,--package-metadata='{"foo":"bar"}'
make[1]: Entering directory 'makefile2'
gcc -Wl,--package-metadata={"foo":"bar"} -o example example.c
/usr/bin/ld: warning: --package-metadata={foo:bar} does not contain valid JSON, ignoring: string or '}' expected near 'foo'
make[1]: Leaving directory 'makefile2'
```

I am very confident to find a real world Debian package that will resemble this example. There is no way to programmatically determine how many times the linker flag needs to be escaped for each individual Debian package build. Fixing those cases by quoting will be a lot of work and might not cover all cases (i.e. it will be a fragile approach).

> > ## Proposed solution
> > 
> > Add support for an `--escaped-package-metadata` parameter to the linkers
> > that takes a percent encoded (RFC 3986) parameter. Example:
> > ```
> > -Wl,--encoded-package-metadata,%7B%22type%22:%22deb%22%2C%22os%22:
> > %22ubuntu%22%2C%22name%22:%22dpkg%22%2C%22version%22:%221.22.
> > 6ubuntu15%22%2C%22architecture%22:%22amd64%22%7D
> > ```
> 
> It is very much unreadable.  The main issue is that compiler drivers eat
> comma.  Can we update linker to support an escape for comma which won't be
> eaten by compiler drivers?

The encoding is flexible and you could just encode the characters that are problematic in your case:

-Wl,--encoded-package-metadata,{"type":"deb"%2C"os":"ubuntu"%2C"name":"dpkg"%2C"version":"1.22.6ubuntu15"%2C"architecture":"amd64"}
Comment 7 H.J. Lu 2024-07-23 20:31:49 UTC
(In reply to Benjamin Drung from comment #6)

> The encoding is flexible and you could just encode the characters that are
> problematic in your case:
> 
> -Wl,--encoded-package-metadata,{"type":"deb"%2C"os":"ubuntu"%2C"name":
> "dpkg"%2C"version":"1.22.6ubuntu15"%2C"architecture":"amd64"}

Doesn't this have the same issue with '"'?
Comment 8 Benjamin Drung 2024-07-23 22:20:09 UTC
(In reply to H.J. Lu from comment #7)
> (In reply to Benjamin Drung from comment #6)
> 
> > The encoding is flexible and you could just encode the characters that are
> > problematic in your case:
> > 
> > -Wl,--encoded-package-metadata,{"type":"deb"%2C"os":"ubuntu"%2C"name":
> > "dpkg"%2C"version":"1.22.6ubuntu15"%2C"architecture":"amd64"}
> 
> Doesn't this have the same issue with '"'?

Yes, this example parameter needs to be quoted in case it is used in shell (or multiple times quoted in case it is parsed multiple times). I just wanted to make the point that the percent-encoding can be used sparsely to only encode the characters that are problematic in the specific use-case. So in case you control the whole build pipeline, you can quote just the comma. If you want to pass this flag into a random projects, you can encode the quotation marks as well.
Comment 9 H.J. Lu 2024-07-23 22:35:31 UTC
--package-metadata doesn't work with shell, compiler driver nor make.  The new
solution should support for JSON codes with shell, compiler driver and make.
It should be human readable.   For non-working --package-metadata, we should
either remove it or fix it.
Comment 10 Luca Boccassi 2024-07-23 22:40:03 UTC
(In reply to H.J. Lu from comment #9)
> For non-working --package-metadata, we should either remove it or fix it.

Sorry, but this is absolutely wrong, as the existing option works just fine. I do not mind if you add other options, but the existing one cannot be removed, as it is in active use in production, and it will remain in active use in production for the foreseeable future - I have no intention nor plan of stopping its use, even if there are alternatives. It works just fine either directly with some escaping, or indirectly via a spec file. Again, no problem with adding alternative options if you want to have them, absolutely fine to do so, but just not at the expense of the current one.
Comment 11 Benjamin Drung 2024-07-23 22:46:23 UTC
(In reply to H.J. Lu from comment #9)
> It should be human readable.

What do you recommend? IMO percent-escaping is readable enough and increases the size of the already long string not too much.

The encoding of the JSON {"type":"deb","os":"ubuntu","name":"dpkg","version":"1.22.6ubuntu15","architecture":"amd64"} would be:

-Wl,--encoded-package-metadata=%7B%22type%22:%22deb%22%2C%22os%22:%22ubuntu%22%2C%22name%22:%22dpkg%22%2C%22version%22:%221.22.6ubuntu15%22%2C%22architecture%22:%22amd64%22%7D

At first it might look confusing, but the relevant strings can be seen on a second look: "type", "deb", "os", "ubuntu", "name", "dpkg", "version", "1.22.6ubuntu15", "architecture", "amd64". Only the beginning of the version number is harder to see.

There are multiple tools that can encode/decode it. For example Python's urllib.parse.unquote and urllib.parse.quote.

I am open for better encodings. I am open for making --package-metadata percent-decode the value instead of adding a new parameter. Percents are relative safe encoding option. The Debian package name and the Debian version are not allowed to contain percents. The os, type, and architecture will not have percents in them.
Comment 12 H.J. Lu 2024-07-23 22:57:17 UTC
(In reply to Benjamin Drung from comment #11)
> (In reply to H.J. Lu from comment #9)
> > It should be human readable.
> 
> What do you recommend? IMO percent-escaping is readable enough and increases
> the size of the already long string not too much.
> 
> The encoding of the JSON
> {"type":"deb","os":"ubuntu","name":"dpkg","version":"1.22.6ubuntu15",
> "architecture":"amd64"} would be:
> 
> -Wl,--encoded-package-metadata=%7B%22type%22:%22deb%22%2C%22os%22:
> %22ubuntu%22%2C%22name%22:%22dpkg%22%2C%22version%22:%221.22.
> 6ubuntu15%22%2C%22architecture%22:%22amd64%22%7D
> 
> At first it might look confusing, but the relevant strings can be seen on a
> second look: "type", "deb", "os", "ubuntu", "name", "dpkg", "version",
> "1.22.6ubuntu15", "architecture", "amd64". Only the beginning of the version
> number is harder to see.
> 
> There are multiple tools that can encode/decode it. For example Python's
> urllib.parse.unquote and urllib.parse.quote.
> 
> I am open for better encodings. I am open for making --package-metadata
> percent-decode the value instead of adding a new parameter. Percents are
> relative safe encoding option. The Debian package name and the Debian
> version are not allowed to contain percents. The os, type, and architecture
> will not have percents in the

%22 isn't human readable.  Do we need to escape { and }? We need to escape "
and ,.  Should $ be supported in JSON code? Will "%[string]" escape work?
Comment 13 Benjamin Drung 2024-07-23 23:12:35 UTC
(In reply to H.J. Lu from comment #12)
> (In reply to Benjamin Drung from comment #11)
> > (In reply to H.J. Lu from comment #9)
> > > It should be human readable.
> > 
> > What do you recommend? IMO percent-escaping is readable enough and increases
> > the size of the already long string not too much.
> > 
> > The encoding of the JSON
> > {"type":"deb","os":"ubuntu","name":"dpkg","version":"1.22.6ubuntu15",
> > "architecture":"amd64"} would be:
> > 
> > -Wl,--encoded-package-metadata=%7B%22type%22:%22deb%22%2C%22os%22:
> > %22ubuntu%22%2C%22name%22:%22dpkg%22%2C%22version%22:%221.22.
> > 6ubuntu15%22%2C%22architecture%22:%22amd64%22%7D
> > 
> > At first it might look confusing, but the relevant strings can be seen on a
> > second look: "type", "deb", "os", "ubuntu", "name", "dpkg", "version",
> > "1.22.6ubuntu15", "architecture", "amd64". Only the beginning of the version
> > number is harder to see.
> > 
> > There are multiple tools that can encode/decode it. For example Python's
> > urllib.parse.unquote and urllib.parse.quote.
> > 
> > I am open for better encodings. I am open for making --package-metadata
> > percent-decode the value instead of adding a new parameter. Percents are
> > relative safe encoding option. The Debian package name and the Debian
> > version are not allowed to contain percents. The os, type, and architecture
> > will not have percents in the
> 
> %22 isn't human readable.  Do we need to escape { and }?

Maybe escaping { and } is not needed.

> We need to escape " and ,.

Those two are definitively needed to be escaped.

> Should $ be supported in JSON code?

$ needs to be escaped for shells. Theoretically $ might be part of a string.

> Will "%[string]" escape work?

Like this?

-Wl,--encoded-package-metadata={%[quot]type%[quot]:%[quot]deb%[quot]%[comma]%[quot]os%[quot]:%[quot]ubuntu%[quot]%[comma]%[quot]name%[quot]:%[quot]dpkg%[quot]%[comma]%[quot]version%[quot]:%[quot]1.22.6ubuntu15%[quot]%[comma]%[quot]architecture%[quot]:%[quot]amd64%[quot]}

This might work, but it is much longer. IMO on a scale from human readable to random character this is insignificantly more readable. A longer parameter makes it harder to find the relevant log output (in case of problems unrelated to the package metadata) and the log files will be bigger. Debian package tend to print all calls with all compiler parameters.
Comment 14 H.J. Lu 2024-07-23 23:32:01 UTC
(In reply to Benjamin Drung from comment #13)

> > Will "%[string]" escape work?
> 
> Like this?
> 
> -Wl,--encoded-package-metadata={%[quot]type%[quot]:
> %[quot]deb%[quot]%[comma]%[quot]os%[quot]:
> %[quot]ubuntu%[quot]%[comma]%[quot]name%[quot]:
> %[quot]dpkg%[quot]%[comma]%[quot]version%[quot]:%[quot]1.22.
> 6ubuntu15%[quot]%[comma]%[quot]architecture%[quot]:%[quot]amd64%[quot]}

It should be %[quote]".  Will adding support for "%[string]" to existing
--package-metadata option break anything?

> This might work, but it is much longer. IMO on a scale from human readable
> to random character this is insignificantly more readable. A longer
> parameter makes it harder to find the relevant log output (in case of
> problems unrelated to the package metadata) and the log files will be

Is this a real problem? "grep" should work.

> bigger. Debian package tend to print all calls with all compiler parameters.
Comment 15 Jan Beulich 2024-07-24 06:09:48 UTC
(In reply to Benjamin Drung from comment #6)
> That linker testcase snippet is evaluated to:
> 
> --package-metadata='{"foo":"bar"}'
> 
> Passing that to a shell or makefile works, because the singe quotes protect
> the JSON content inside. But this escaping protects this snippet from
> evaluating only once. The problem is that the linker flag can be passed
> around and evaluated multiple times. This will happen easily during Debian
> package build, where debian/rules is a Makefile that calls the project build
> system (that can use makefiles as well).

That, however, is a problem of the build system. Passing around potentially quoted strings needs special care, to retain quotation. The issue isn't unique to Debian; see e.g. the Linux kernel's "escsq" and its uses (and how it further protects e.g. # and $$).

Imo it's just the commas which are the main problem here, as -Wl,... in the compiler has - afaik - no way of escaping them.
Comment 16 Jan Beulich 2024-07-24 06:11:29 UTC
(In reply to H.J. Lu from comment #14)
> It should be %[quote]".  Will adding support for "%[string]" to existing
> --package-metadata option break anything?

You won't know until someone reports a problem. We simply don't know what people might be using.
Comment 17 H.J. Lu 2024-07-24 06:47:59 UTC
(In reply to Jan Beulich from comment #16)
> (In reply to H.J. Lu from comment #14)
> > It should be %[quote]".  Will adding support for "%[string]" to existing
> > --package-metadata option break anything?
> 
> You won't know until someone reports a problem. We simply don't know what
> people might be using.

It is a new option and only supports very limited JSON code.  Can Debian people
provide feedbacks on adding "%[string]" to existing --package-metadata option?
Comment 18 H.J. Lu 2024-07-24 06:50:13 UTC
(In reply to Jan Beulich from comment #15)
> (In reply to Benjamin Drung from comment #6)
> > That linker testcase snippet is evaluated to:
> > 
> > --package-metadata='{"foo":"bar"}'
> > 
> > Passing that to a shell or makefile works, because the singe quotes protect
> > the JSON content inside. But this escaping protects this snippet from
> > evaluating only once. The problem is that the linker flag can be passed
> > around and evaluated multiple times. This will happen easily during Debian
> > package build, where debian/rules is a Makefile that calls the project build
> > system (that can use makefiles as well).
> 
> That, however, is a problem of the build system. Passing around potentially
> quoted strings needs special care, to retain quotation. The issue isn't
> unique to Debian; see e.g. the Linux kernel's "escsq" and its uses (and how
> it further protects e.g. # and $$).
> 
> Imo it's just the commas which are the main problem here, as -Wl,... in the
> compiler has - afaik - no way of escaping them.

'"' may be an issue for make as in comment #6.
Comment 19 Sam James 2024-07-24 07:30:50 UTC
Just to note that I expect us to have a use for this on our side but I've not had any time for me or someone else to look into it or test it yet.
Comment 20 Andreas K. Huettel 2024-07-24 08:24:50 UTC
The general idea is good, but, man, seriously, this feels like you're reinventing the wheel multiple times....

> 1. The comma in the JSON value is used to split the -Wl parameter specified
> for gcc:

Yep. Won't work. So let's not pass raw json.

And as someone who regularly has to look at stuff that is passed around between bash, python, and perl I stronly advise to rely too much on escaping.

> 2. The quotation marks in the JSON value are eaten by configure scripts and
> Makefiles. Example:

Yep. Won't work. So let's not pass raw json.

Dito.

> Add support for an `--escaped-package-metadata` parameter to the linkers
> that takes a percent encoded (RFC 3986) parameter. Example:

Well, this works, but makes the metadata block pretty much unreadable.
Please not.

> 3. Add `-specs=<spec-file>` to the gcc linker flags. Then this spec file
> could construct the package metadata parameter. Example spec file:
> 
> ```
> Issue with that approach: It requires the spec file to be around and the
> environment variables to be set.

And exactly this is the *much* easier solution. 
Fill in the values of the variables at build time in, e.g., a json file, and then pass the filename on the command line.

You definitely don't want to hardcode environment lookups or any sort of similar evaluation into your metadata (imagine your field for DEB_HOST_ARCH being `cat /dev/zero /dev/root`)... Simple strings that are not interpreted in any way are the best option.
Comment 21 Luca Boccassi 2024-07-25 11:49:23 UTC
(In reply to Luca Boccassi from comment #10)
> (In reply to H.J. Lu from comment #9)
> > For non-working --package-metadata, we should either remove it or fix it.
> 
> Sorry, but this is absolutely wrong, as the existing option works just fine.
> I do not mind if you add other options, but the existing one cannot be
> removed, as it is in active use in production, and it will remain in active
> use in production for the foreseeable future - I have no intention nor plan
> of stopping its use, even if there are alternatives. It works just fine
> either directly with some escaping, or indirectly via a spec file. Again, no
> problem with adding alternative options if you want to have them, absolutely
> fine to do so, but just not at the expense of the current one.

Just to be clear, changing the existing option to also take in escaped input, as long as it is backward compatible and still accepts normal input, would also be absolutely fine by me.
Comment 22 Matthias Klose 2024-08-01 02:01:17 UTC
using a specs file is even more ugly, please see
https://sourceware.org/pipermail/binutils/2024-July/135977.html
Comment 23 Benjamin Drung 2024-08-09 20:26:20 UTC
(In reply to H.J. Lu from comment #14)
> (In reply to Benjamin Drung from comment #13)
> 
> > > Will "%[string]" escape work?
> > 
> > Like this?
> > 
> > -Wl,--encoded-package-metadata={%[quot]type%[quot]:
> > %[quot]deb%[quot]%[comma]%[quot]os%[quot]:
> > %[quot]ubuntu%[quot]%[comma]%[quot]name%[quot]:
> > %[quot]dpkg%[quot]%[comma]%[quot]version%[quot]:%[quot]1.22.
> > 6ubuntu15%[quot]%[comma]%[quot]architecture%[quot]:%[quot]amd64%[quot]}
> 
> It should be %[quote]".

You suggested to borrow from HTML's Named Character References and
https://dev.w3.org/html5/spec-LC/named-character-references.html says that U+00022 has the name "quot" (not "quote").

> Will adding support for "%[string]" to existing
> --package-metadata option break anything?

It might theoretical break existing use cases. 

--package-metadata='{"version":"1.0%2"}'

The only safe option that I could come up with is to use a marker that would be invalid JSON. For example: If the string starts with a percent character, decode it:

--package-metadata='%{"foo":"bar"%[comma]"baz":42}'

would be invalid JSON and decode to:

--package-metadata='{"foo":"bar","baz":42}'
Comment 24 Luca Boccassi 2024-08-09 20:37:07 UTC
(In reply to Benjamin Drung from comment #23)
> (In reply to H.J. Lu from comment #14)
> > (In reply to Benjamin Drung from comment #13)
> > 
> > > > Will "%[string]" escape work?
> > > 
> > > Like this?
> > > 
> > > -Wl,--encoded-package-metadata={%[quot]type%[quot]:
> > > %[quot]deb%[quot]%[comma]%[quot]os%[quot]:
> > > %[quot]ubuntu%[quot]%[comma]%[quot]name%[quot]:
> > > %[quot]dpkg%[quot]%[comma]%[quot]version%[quot]:%[quot]1.22.
> > > 6ubuntu15%[quot]%[comma]%[quot]architecture%[quot]:%[quot]amd64%[quot]}
> > 
> > It should be %[quote]".
> 
> You suggested to borrow from HTML's Named Character References and
> https://dev.w3.org/html5/spec-LC/named-character-references.html says that
> U+00022 has the name "quot" (not "quote").
> 
> > Will adding support for "%[string]" to existing
> > --package-metadata option break anything?
> 
> It might theoretical break existing use cases. 
> 
> --package-metadata='{"version":"1.0%2"}'

Are there distros where '%' is an allowed character in a version string or a package name? I care about backward compatibility, but we can be sensible about it, and if in practice it's not a problem, then it's fine to do such a change
Comment 25 Benjamin Drung 2024-08-09 20:45:35 UTC
(In reply to Luca Boccassi from comment #24)
> (In reply to Benjamin Drung from comment #23)
> > (In reply to H.J. Lu from comment #14)
> > > (In reply to Benjamin Drung from comment #13)
> > > 
> > > Will adding support for "%[string]" to existing
> > > --package-metadata option break anything?
> > 
> > It might theoretical break existing use cases. 
> > 
> > --package-metadata='{"version":"1.0%2"}'
> 
> Are there distros where '%' is an allowed character in a version string or a
> package name? I care about backward compatibility, but we can be sensible
> about it, and if in practice it's not a problem, then it's fine to do such a
> change

For all Debian-based distros: % is neither allowed in the package name nor in the package version. Who wants to check the other 400 distributions?
Comment 26 Luca Boccassi 2024-08-09 20:51:40 UTC
(In reply to Benjamin Drung from comment #25)
> (In reply to Luca Boccassi from comment #24)
> > (In reply to Benjamin Drung from comment #23)
> > > (In reply to H.J. Lu from comment #14)
> > > > (In reply to Benjamin Drung from comment #13)
> > > > 
> > > > Will adding support for "%[string]" to existing
> > > > --package-metadata option break anything?
> > > 
> > > It might theoretical break existing use cases. 
> > > 
> > > --package-metadata='{"version":"1.0%2"}'
> > 
> > Are there distros where '%' is an allowed character in a version string or a
> > package name? I care about backward compatibility, but we can be sensible
> > about it, and if in practice it's not a problem, then it's fine to do such a
> > change
> 
> For all Debian-based distros: % is neither allowed in the package name nor
> in the package version. Who wants to check the other 400 distributions?

We don't need to check all of them individually fortunately, the package format is enough - deb is out, I don't think it's allowed in rpm? So if Arch doesn't allow it either, I'd say we are good? I'm the most invested in retaining backward compat, but in a pragmatic way
Comment 27 Benjamin Drung 2024-08-09 21:07:22 UTC
Taking all comments into account, here is my implementation proposals:

Encoding schema
===============

Option 1: Support percent-encoding of the JSON data. Percent-encoding is widely used and supported (for example, Python provides urllib.parse.unquote for decoding and urllib.parse.quote for encoding). Example encoded JSON: '{%22foo%22:%22bar%22}' or '%7B%22foo%22%3A%22bar%22%7D'

This option has the benefit of being easy to implement. Either the encoded string can be read directly (I can spot package names and version in there) or decoded using Python's urllib.parse.unquote or online tools.

Option 2: Support percent-encoding and %[string] (where string refers to the name in HTML's Named Character References) the JSON data. Example encoded JSON: '{%[quot]foo%[quot]:%[quot]bar%[quot]}' or '{%22foo%22:%22bar%22}'

This option allow people to use %[string] encoding in case they dislike percent-encoding. The drawback is that it is more work to implement since there must be a list of names. To make the code simpler, the list of allowed names might be restricted to, e. g. quot, comma, lbrace, rbrace and maybe add space.

These are the two options that I would be happy about to implement. Supporting only %[string] would not satisfy me. Using base64 encoding would make the string shorter, but would not be human readable. Quoted-printable would be an alternative, but the problematic characters like comma and quotes would not be encoded by quoted-printable.

Parameter usage
===============

Option A: Introduce a new --encoded-package-metadata parameter that takes the encoded string.

Option B: Extend --package-metadata to always decode the given string. As previously discussed, package names and version should not contain percent characters. So this change should not break backward compatibility.

Which of those proposals do you want to see implemented? My initial implementation is option 1A.
Comment 28 Luca Boccassi 2024-08-10 11:31:32 UTC
As user and owner of the spec, I am fine with any of those options. A slight preference for a new command line (option A), as that means you don't need to worry about version matching - if the new flag is there, then you can pass encoded input, if it's not, then you know it only takes unencoded input. But not a strong preference.
Comment 29 Andreas K. Huettel 2024-08-10 11:48:31 UTC
> 
> Option A: Introduce a new --encoded-package-metadata parameter that takes
> the encoded string.
> 
> Option B: Extend --package-metadata to always decode the given string. As
> previously discussed, package names and version should not contain percent
> characters. So this change should not break backward compatibility.
> 

What are you going to do once you run into command line length limits
(which can be hidden in many places)?

(How much mystery data are we really talking about here?)
Comment 30 H.J. Lu 2024-08-10 14:32:18 UTC
(In reply to Benjamin Drung from comment #27)

> Option A: Introduce a new --encoded-package-metadata parameter that takes
> the encoded string.

This option isn't readable.

> Option B: Extend --package-metadata to always decode the given string. As
> previously discussed, package names and version should not contain percent
> characters. So this change should not break backward compatibility.

This is the readable option.

> Which of those proposals do you want to see implemented? My initial
> implementation is option 1A.

We should have a readable option.
Comment 31 Benjamin Drung 2024-08-12 14:24:49 UTC
(In reply to Andreas K. Huettel from comment #29)
> > 
> > Option A: Introduce a new --encoded-package-metadata parameter that takes
> > the encoded string.
> > 
> > Option B: Extend --package-metadata to always decode the given string. As
> > previously discussed, package names and version should not contain percent
> > characters. So this change should not break backward compatibility.
> > 
> 
> What are you going to do once you run into command line length limits
> (which can be hidden in many places)?
> 
> (How much mystery data are we really talking about here?)

The command line length limit is probably around 128 KB on Linux. The package metadata contains only a few items. The examples given in this bug report are real world examples. So even with encoding this example is around 250 characters. With longer package/version and added debuginfod URL, you might get 500 or up to 1000 characters. This is probably still far enough away from the command line length limit.