[PATCH v2 09/18] Include \0 in printable wide characters

Wed Feb 23 22:28:53 GMT 2022

>>>>> "Andrew" == Andrew Burgess <aburgess@redhat.com> writes:

Andrew> My confusion here is that I initially thought: if we have multiple
Andrew> characters, some that are printable and some that are not, then
Andrew> surely we would want to print the initial printable ones for real,
Andrew> and only later switch to escape sequences, right?

Andrew> Except, that's not what we do.

Andrew> And the reason (probably obvious to quicker minds than mine) is that
Andrew> characters might have different widths, so we can't "just" print the
Andrew> initial characters, and then print the unprintable as escape
Andrew> sequences, as we wouldn't know where in BUF the unprintable character
Andrew> actually starts.

Yeah, that's my understanding as well.
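
To make the width problem concrete, here is a minimal sketch.  It is
not gdb code: it assumes the target encoding is UTF-8 (gdb really goes
through iconv), and utf8_seq_len and char_offset are hypothetical names
invented for the illustration.  It shows that the byte offset of the
Nth character depends on the width of every character before it, so a
character index alone cannot be mapped back into BUF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by LEAD, or 0 if
// LEAD cannot start a sequence (for example, a continuation byte).
std::size_t
utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80)
    return 1;
  if ((lead & 0xe0) == 0xc0)
    return 2;
  if ((lead & 0xf0) == 0xe0)
    return 3;
  if ((lead & 0xf8) == 0xf0)
    return 4;
  return 0;
}

// Byte offset in BUF at which character number INDEX starts, found by
// walking over every earlier character.  Returns std::string::npos if
// BUF is malformed or has fewer than INDEX characters.
std::size_t
char_offset (const std::string &buf, std::size_t index)
{
  std::size_t pos = 0;
  for (std::size_t i = 0; i < index; ++i)
    {
      if (pos >= buf.size ())
        return std::string::npos;
      std::size_t len = utf8_seq_len ((unsigned char) buf[pos]);
      if (len == 0 || pos + len > buf.size ())
        return std::string::npos;
      pos += len;
    }
  return pos;
}

// Two buffers, each holding two printable characters followed by \0.
void
check_examples ()
{
  // 'a', 'b', \0: the \0 is at byte offset 2.
  assert (char_offset (std::string ("ab\0", 3), 2) == 2);
  // 'a', U+00E9 (two bytes), \0: the \0 is at byte offset 3.
  assert (char_offset (std::string ("a\xc3\xa9\0", 4), 2) == 3);
}
```

In both buffers the \0 is the third character, but it sits at a
different byte offset, which is why the escape-printing code cannot
simply index into BUF by character count.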

Andrew> OK, so my idea of removing wchar_printable is clearly a bad idea, but
Andrew> how does this relate to your change?

Andrew> Well, prior to this patch, if we had 3 characters, the first two are
Andrew> printable, and the third was \0, we would spot the non-printable \0,
Andrew> and so print the whole buffer, all 3 characters, as escape sequences.

Andrew> With this patch, all 3 characters will appear to be printable.  So now
Andrew> we will print the first character just fine.  Then print the second
Andrew> character just fine.  Now for the third character, the \0, we call
Andrew> print_wchar.  The \0 is not handled by anything but the 'default' case
Andrew> of the switch.

Andrew> In the default case, the \0 is non-printable, so we end up in the
Andrew> escape sequence printing code, which then tries to load bytes starting
Andrew> from BUF - which isn't going to be correct.

I think the idea behind this is that only a real \0 in the input will
ever turn into an L'\0' in the wchar_t form.  It seems to me that an
L'\0' pretty much has to correspond exactly to a target \0, because C
is pervasive and an encoding in which stray \0 bytes can appear would
break everything.

Andrew> Now, this is where things are a bit weird.  The code in
Andrew> generic_emit_char is clearly written to handle multiple characters,
Andrew> but, I've only ever seen it print 1 character, which is why, I claim,
Andrew> your above change to wchar_printable works.

That's most likely because you are trying this on Linux.  Linux uses
UTF-32 for wchar_t, so every target character can be converted to a
single wchar_t -- UTF-32 is pretty much designed to round-trip
everything else.  So, on Linux hosts, I think some of these loops
aren't really needed.

However, Windows uses UTF-16 and a single target character can be
converted to two wchar_t, via surrogate pairs.
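
As an illustration of the surrogate-pair case, here is a small sketch
of the standard UTF-16 encoding rule.  It is not taken from gdb, and
to_utf16 is a hypothetical helper name; it also assumes its argument is
a valid Unicode scalar value.

```cpp
#include <cstdint>
#include <vector>

// Encode the code point CP as UTF-16 code units.  Code points below
// U+10000 fit in one unit; anything above needs a surrogate pair.
// Assumes CP is a valid Unicode scalar value (no range checking).
std::vector<std::uint16_t>
to_utf16 (std::uint32_t cp)
{
  if (cp < 0x10000)
    return { static_cast<std::uint16_t> (cp) };

  cp -= 0x10000;
  return {
    static_cast<std::uint16_t> (0xd800 + (cp >> 10)),    // high surrogate
    static_cast<std::uint16_t> (0xdc00 + (cp & 0x3ff))   // low surrogate
  };
}
```

For example, to_utf16 (0x41) yields the single unit 0x0041, while
to_utf16 (0x1F600) yields the two units 0xD83D 0xDE00.  With a 16-bit
wchar_t, that second case is where one target character becomes two
wchar_t values.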

On Solaris and (IIRC) NetBSD, wchar_t is even weirder, though I don't
recall whether it is a variable-length encoding.

Anyway, the \0 case is only really here for Rust.  So maybe another idea
is to handle it exactly there, somehow.  The Rust printer can assume the
use of UTF-32 on the target, so that would all work out fine.

Tom