20164 – fix string printing in rust

Status:	NEW

Alias:	None

Product:	gdb
Classification:	Unclassified
Component:	rust
Version:	unknown

Importance:	P2 normal
Target Milestone:	---
Assignee:	Tom Tromey

Reported:	2016-05-27 20:31 UTC by Tom Tromey
Modified:	2024-12-12 14:45 UTC

See Also:	32413

Description Tom Tromey 2016-05-27 20:31:59 UTC

String printing in rust is somewhat wrong.

I think it always tries to print &[u8] using UTF-8.
Maybe that is ok, but sometimes &[u8] is just bytes.
Maybe some detection could be done; or maybe it could
default to printing bytes and special case the contents
of &str.
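
One way such detection could work, as a minimal sketch in Rust rather than gdb's own C++ code: validate the bytes as UTF-8 and fall back to a plain byte list when validation fails. The helper name `render_bytes` is hypothetical and not part of gdb.

```rust
/// Render a byte slice the way a debugger might: as a quoted string when the
/// bytes are valid UTF-8, otherwise as a plain list of byte values.
/// `render_bytes` is a hypothetical helper, not part of gdb.
fn render_bytes(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        // Valid UTF-8: show it as text (Debug formatting adds quotes and escapes).
        Ok(text) => format!("{:?}", text),
        // Not valid UTF-8: treat the contents as raw bytes.
        Err(_) => format!("{:?}", bytes),
    }
}
```

This heuristic cannot tell text from binary data that merely happens to be valid UTF-8, which is one argument for defaulting to bytes and special-casing &str instead.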

Escapes should be printed Rust-style, like \u{nnn},
but currently are not.
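
For reference, the Rust standard library already produces escapes in this style. The sketch below only illustrates the target syntax using `char::escape_unicode` and `char::escape_default`; the wrapper names are hypothetical, and gdb would need its own implementation.

```rust
/// Always produce the `\u{...}` form for a character.
/// (Hypothetical wrapper around the standard library method.)
fn unicode_escape(c: char) -> String {
    c.escape_unicode().to_string()
}

/// Produce the form Rust would use by default: printable ASCII is left
/// alone, a few characters get backslash escapes such as `\n`, and
/// everything else becomes `\u{...}`.
fn default_escape(c: char) -> String {
    c.escape_default().to_string()
}
```

For example, `unicode_escape('\u{200c}')` yields the text `\u{200c}`, while `default_escape('x')` leaves `x` unchanged.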

Comment 1 Tom Tromey 2017-10-02 18:02:38 UTC

Special-casing &str is being done in bug 22236.

Comment 2 Tom Tromey 2022-02-13 18:03:28 UTC

Working on the escapes now.

Comment 3 Tom Tromey 2022-02-17 23:15:36 UTC

I wonder if char printing should use unicode-ish syntax like:

(gdb) print 'x'
$1 = U+0078 'x'

Not sure.

Comment 4 Mark Wielaard 2022-02-17 23:45:23 UTC

(In reply to Tom Tromey from comment #3)
> I wonder if char printing should use unicode-ish syntax like:
> 
> (gdb) print 'x'
> $1 = U+0078 'x'
> 
> Not sure.

I like it.

Note that the convention is to prepend zeros when there are fewer than 4 digits, but not when there are 5 or 6 digits.
https://www.unicode.org/versions/Unicode14.0.0/appA.pdf
"Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits — for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345."
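
The padding convention quoted above maps directly onto a minimum-width hex format. A minimal sketch in Rust, with a hypothetical function name:

```rust
/// Format a code point in the conventional `U+XXXX` notation: uppercase hex,
/// zero-padded to at least four digits, with no padding beyond that.
/// `format_code_point` is a hypothetical name used for illustration.
fn format_code_point(value: u32) -> String {
    format!("U+{:04X}", value)
}
```

So `format_code_point(0x12)` gives `U+0012`, while `format_code_point(0x12345)` gives `U+12345`.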

I am not sure how to flag the "illegal" char values, U+D800 to U+DFFF and anything larger than U+10FFFF. But it would be good to have some indicator you are dealing with an invalid value.

Comment 5 Tom Tromey 2022-02-17 23:53:00 UTC

(In reply to Mark Wielaard from comment #4)

> I am not sure how to flag the "illegal" char values, U+D800 to U+DFFF and
> anything larger than U+10FFFF. But it would be good to have some indicator
> you are dealing with an invalid value.

You'll already get a hex escape in this situation:

(gdb) p '\u{d800}'
$2 = 55296 '\u{00d800}'

I guess the escapes sort of make the "U+..." idea a bit weird.
It would just be printing the same info twice.

Comment 6 Mark Wielaard 2022-02-18 00:10:55 UTC

(In reply to Tom Tromey from comment #5)
> (In reply to Mark Wielaard from comment #4)
> 
> > I am not sure how to flag the "illegal" char values, U+D800 to U+DFFF and
> > anything larger than U+10FFFF. But it would be good to have some indicator
> > you are dealing with an invalid value.
> 
> You'll already get a hex escape in this situation:
> 
> (gdb) p '\u{d800}'
> $2 = 55296 '\u{00d800}'

But isn't that escape "value" itself invalid?
Shouldn't it be something like

$2 = 55296 '???'

So, use U+xxxxxx 'unicode-char' for valid char values, otherwise "raw number" '???'

Comment 7 Tom Tromey 2022-02-18 00:39:52 UTC

Yeah, I see that rustc rejects a program using that.
We could print something like

$2 = U+D800 <invalid Unicode character>

... at least if the invalid ranges aren't a pain to figure out.
Is it really just surrogates and values > 0x10ffff?  I didn't look yet.

There are still characters that print as escapes, for example U+200C.
But maybe double-printing these oddities isn't so bad.

Comment 8 Mark Wielaard 2022-02-18 01:04:32 UTC

(In reply to Tom Tromey from comment #7)
> Yeah, I see that rustc rejects a program using that.
> We could print something like
> 
> $2 = U+D800 <invalid Unicode character>

I would like that.

> ... at least if the invalid ranges aren't a pain to figure out.
> Is it really just surrogates and values > 0x10ffff?  I didn't look yet.

It is, see https://doc.rust-lang.org/reference/types/textual.html
"A value of type char is a Unicode scalar value (i.e. a code point that is not a surrogate), represented as a 32-bit unsigned word in the 0x0000 to 0xD7FF or 0xE000 to 0x10FFFF range. It is immediate Undefined Behavior to create a char that falls outside this range."
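
In Rust itself this range check is what `char::from_u32` performs, so the proposed output could be sketched as follows. The function name `describe_char_value` and the exact output layout are assumptions based on the suggestions in this thread, not gdb's behavior.

```rust
/// Describe a raw 32-bit value as a Rust `char` would be shown under the
/// proposal in this thread. `describe_char_value` is a hypothetical helper.
fn describe_char_value(value: u32) -> String {
    match char::from_u32(value) {
        // A Unicode scalar value: 0x0000..=0xD7FF or 0xE000..=0x10FFFF.
        Some(c) => format!("U+{:04X} {:?}", value, c),
        // A surrogate, or anything above 0x10FFFF.
        None => format!("U+{:04X} <invalid Unicode character>", value),
    }
}
```

With this sketch, 0x78 is shown as `U+0078 'x'`, and 0xD800 as `U+D800 <invalid Unicode character>`.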
 
> There are still characters that print as escapes, for example U+200C.
> But maybe double-printing these oddities isn't so bad.

I don't think it is that odd. You could also use ascii escapes (https://doc.rust-lang.org/reference/tokens.html#ascii-escapes) when possible.

Comment 9 Tom Tromey 2022-02-18 01:16:05 UTC

(In reply to Mark Wielaard from comment #8)
> I don't think it is that odd. You could also use ascii escapes
> (https://doc.rust-lang.org/reference/tokens.html#ascii-escapes) when
> possible.

That's already done, though I see on that page that only up to 0x7f
is supported for char types, so there's a little bug in the output here,
as gdb checks "if (value <= 255)".
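
To illustrate the threshold in question, here is a minimal sketch in Rust of choosing between the two escape forms for a value that needs escaping. The function name `escape_value` is hypothetical, and this is not gdb's code; in particular, the gdb output shown earlier in this thread pads the digits (as in '\u{00d800}'), which this sketch does not do.

```rust
/// Pick an escape form for a value that is going to be escaped.
/// `\x` escapes in Rust char literals only go up to 0x7f, so anything
/// above that uses the `\u{...}` form instead.
/// `escape_value` is a hypothetical helper for illustration.
fn escape_value(value: u32) -> String {
    if value <= 0x7f {
        format!("\\x{:02x}", value)
    } else {
        format!("\\u{{{:x}}}", value)
    }
}
```

With a `value <= 255` check instead, 0x80 through 0xff would be emitted as `\x80` to `\xff`, which Rust does not accept in a char literal.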