[RFC] [gdb/testsuite] Add xfail in gdb.base/hbreak.exp

Wed Jul 24 22:59:26 GMT 2024

On 7/24/24 12:56, Tom de Vries wrote:
> On 7/24/24 12:45, Luis Machado wrote:
>> On 7/24/24 10:28, Tom de Vries wrote:
>>> On 7/24/24 08:53, Luis Machado wrote:
>>>> On 7/24/24 06:25, Tom de Vries wrote:
>>>>> On 7/23/24 12:02, Luis Machado wrote:
>>>>>> On 7/17/24 16:14, Luis Machado wrote:
>>>>>>> On 7/17/24 16:10, Tom de Vries wrote:
>>>>>>>> On an aarch64-linux system with 32-bit userland running in a chroot, and using
>>>>>>>> target board unix/mthumb I get:
>>>>>>>> ...
>>>>>>>> (gdb) hbreak hbreak.c:27^M
>>>>>>>> Hardware assisted breakpoint 2 at 0x4004e2: file hbreak.c, line 27.^M
>>>>>>>
>>>>>>> That is a bit odd, but it goes through the compat layer, which is not exercised
>>>>>>> as often as the 32-bit code.
>>>>>>>
>>>>>>> Let me see if I can reproduce this one on my end.
>>>>>>
>>>>>> I managed to reproduce this. I checked with the kernel folks and this should
>>>>>> work, but I'm not sure where the error is coming from.
>>>>>>
>>>>>
>>>>> Hi Luis,
>>>>>
>>>>> thanks for looking into this, and the approval, committed.
>>>>>
>>>>> Are you or the kernel folks following up on this, in terms of a linux kernel PR or some such?  It would be nice to add some sort of reference to the xfail.
>>>>
>>>> It's in my TODO. I'm still investigating to understand where the error is coming from. Once located, I plan to check with them for their thoughts and a possible
>>>> fix. I don't think the kernel folks use the PR process much. We could probably ammend this commit later on once we have more information though.
>>>>
>>>
>>> Ok, I spent some more time debugging this issue this morning.
>>>
>>> After reading kernel sources for a while, I tried out reversing the order in which the Breakpoint Register Pair is written in arm_linux_nat_target::low_prepare_to_resume, and ... the test-case passes.
>>>
>>
>> But what would change with reversing writing to the control registers, from gdb's perspective?
>>
> 
> Well, from gdb's perspective, the only difference is that both ptrace calls succeed, while with the original order the first one fails (and consequently there's no second call).> 

I've investigated this further, and I think I see the reason why reversing works. It seems handling of hardware breakpoints is slightly different between aarch64 compat and
32-bit arm.

In summary, it seems aarch64 compat attempts to set the address before doing anything with the passed control register value, in arch/arm64/kernel/ptrace.c:hw_break_set.

We can see it punts if ptrace_hbp_set_addr returns an error, which is where we're failing with EINVAL.

For 32-bit arm, in arch/arm/kernel/ptrace.c:ptrace_sethbpregs, we do things in a different way. The important bit is that we only call modify_user_hw_breakpoint after
we're done setting both the address and the control register.

For aarch64 compat we call modify_user_hw_breakpoint for both ptrace_hbp_set_addr and ptrace_hbp_set_ctrl.

When we have a new task and attempt to use a hw break, the kernel initializes the hw break slots. It does so in arch/arm64/kernel/ptrace.c:ptrace_hbp_get_initialised_bp.

Once a slot is initialized, it isn't initialized again it seems. We will only reuse the slot (with whatever information it has, since we will replace it anyway).

With the above context, it seems we're running into trouble when we have an unaligned thumb address (offset == 2) and the slot's bp_len is set to ARM_BREAKPOINT_LEN_4,
which is the default (or it is there because we previously set a 4-byte hw break on this slot).

We can confirm that setting a 2-byte hw break works if we use an aligned thumb address (offset == 0), because we use a different switch case entry. It also works if we first manage to insert
a 2-byte hw break on an aligned thumb address, delete it and then try to insert the hw break on the unaligned thumb address. This is because inserting a hw break on an
aligned thumb address sets the bp_len to ARM_BREAKPOINT_LEN_2, and we eventually reuse that slot during ptrace_hbp_set_addr, which, I think, is the bug we have in the kernel.

We shouldn't be reusing that information. Instead we should use whatever the user passed as the control register value to the ptrace call.

For a potential workaround, I think we'll need to check for attempts at inserting a hw break at an unaligned thumb address (offset == 2). If so, then we do a dance of
inserting the hw break at the aligned version of that address (offset == 0), only to make sure the slot's bp_len is correctly set to ARM_BREAKPOINT_LEN_2, and then
proceed to insert the hw break on the original requested unaligned thumb address.

Off the top of my head I can't think of potential issues with this approach. I don't think the kernel checks if we insert two hw break's at the same address, so that
corner case might not be an issue.

Thoughts?

>>> My theory at this point is that the following happens in the failing case:
>>> - PTRACE_SETHBPREGS with address 0x4004e2
>>> - compat_arch_ptrace
>>> - compat_ptrace_sethbpregs
>>> - compat_ptrace_hbp_set
>>> - ptrace_hbp_set_addr
>>> - ptrace_hbp_get_initialised_bp
>>> - ptrace_hbp_create
>>> - /* Initialise fields to sane defaults
>>>       (i.e. values that will pass validation).  */
>>>    attr.bp_len = HW_BREAKPOINT_LEN_4;
>>
>>
>> The default starts as a 4-byte breakpoint, but is supposed to be adjusted later on to 2 bytes. If this isn't happening, I think we have a bug somewhere.
>>
> 
> Agreed, you could frame that as a kernel bug.  It would be good to known whether the kernel developers agree with that assessment.
> 
>>> - attr.bp_addr = 0x4004e2;
>>> - modify_user_hw_breakpoint
>>> - modify_user_hw_breakpoint_check
>>> - hw_breakpoint_parse
>>> - hw_breakpoint_arch_parse
>>> - case is_compat_bp(bp)
>>> - offset = 2;
>>> - fallthrough to default
>>> - return -EINVAL
>>>
>>> In short, we try to validate:
>>> - attr.bp_len == HW_BREAKPOINT_LEN_4 && attr.bp_addr == 0x4004e2
>>> and fail.
>>>
>>> By reversing the order, we validate:
>>> - attr.bp_len == HW_BREAKPOINT_LEN_2 && attr.bp_addr == 0, and then
>>> - attr.bp_len == HW_BREAKPOINT_LEN_2 && attr.bp_addr == 0x4004e2
>>> which both succeed.
>>
>> Why do we have HW_BREAKPOINT_LEN_2 above while the first case has HW_BREAKPOINT_LEN_4?
>>
> 
> Well, because we reversed the order of the two ptrace calls.
> 
> So, in the original case, the first call to ptrace uses the default bp_len (HW_BREAKPOINT_LEN_4) and the actual address (0x4004e2), which fails.
> 
> And in the reversed order case, the first call to ptrace uses the default address (0x0) and the actual bp_len (HW_BREAKPOINT_LEN_2).
> 
> [ With "default" meaning, as set by ptrace_hbp_create, and "actual", as set by the ptrace calls. ]
> 
>>>
>>> So, my questions at this point are:
>>> - is this a problem limited to aarch64 32-bit mode, or does it also
>>>    occur for native 32-bit arm?
>>
>> I'm not sure at this point. They are two separate code bases, but it is probably reasonable to assume the compat layer of aarch64 was based on the
>> original 32-bit arm code. Checking hw_breakpoint_arch_parse for arm, it does seem fairly similar.
>>
> 
> I also observed that they're very similar.
> 
>>> - is this a kernel bug?
>>
>> Potentially, if it is assuming a length that is not correct.
>>
>>> - if this is a kernel bug, is there a workaround we can use?
>>> - if this is not a kernel bug, is this because gdb is writing the
>>>    Breakpoint Register Pair in the wrong order?
>>
>> I don't think we have a specific order to write things, but if it is a bug that arises from a specific order of commands, we could potentially
>> work around it.
>>
> 
> OK, I'm currently testing that approach.
> 
> Thanks,
> - Tom
> 
>>>
>>> Thanks,
>>> - Tom
>>>
>>>>>
>>>>> Thanks,
>>>>> - Tom
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> (gdb) PASS: gdb.base/hbreak.exp: hbreak
>>>>>>>> continue^M
>>>>>>>> Continuing.^M
>>>>>>>> Unexpected error setting breakpoint: Invalid argument.^M
>>>>>>>> (gdb) FAIL: gdb.base/hbreak.exp: continue to break-at-exit after hbreak
>>>>>>>> ...
>>>>>>>> due to this call in arm_linux_nat_target::low_prepare_to_resume:
>>>>>>>> ...
>>>>>>>>              if (ptrace (PTRACE_SETHBPREGS, pid,
>>>>>>>>                  (PTRACE_TYPE_ARG3) ((i << 1) + 1), &bpts[i].address) < 0)
>>>>>>>>                perror_with_name (_("Unexpected error setting breakpoint"));
>>>>>>>> ...
>>>>>>>>
>>>>>>>> This problem does not happen if instead we use a 4-byte aligned address.
>>>>>>>>
>>>>>>>> I'm not sure if this is simply unsupported or if there's a kernel bug of some
>>>>>>>> sort, but I don't see what gdb can do about this.
>>>>>>>>
>>>>>>>> Tentatively mark this as xfail.
>>>>>>>>
>>>>>>>> Tested on aarch64-linux.
>>>>>>>> ---
>>>>>>>>     gdb/testsuite/gdb.base/hbreak.exp | 40 ++++++++++++++++++++++++++-----
>>>>>>>>     1 file changed, 34 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/gdb/testsuite/gdb.base/hbreak.exp b/gdb/testsuite/gdb.base/hbreak.exp
>>>>>>>> index 73a3e2afb67..b140257a23e 100644
>>>>>>>> --- a/gdb/testsuite/gdb.base/hbreak.exp
>>>>>>>> +++ b/gdb/testsuite/gdb.base/hbreak.exp
>>>>>>>> @@ -27,10 +27,38 @@ if ![runto_main] {
>>>>>>>>       set breakline [gdb_get_line_number "break-at-exit"]
>>>>>>>>     -gdb_test "hbreak ${srcfile}:${breakline}" \
>>>>>>>> -     "Hardware assisted breakpoint \[0-9\]+ at 0x\[0-9a-f\]+: file .*${srcfile}, line ${breakline}\\." \
>>>>>>>> -     "hbreak"
>>>>>>>> +set re_loc "file \[^\r\n\]*$srcfile, line $breakline"
>>>>>>>> +set re_dot [string_to_regexp .]
>>>>>>>>     -gdb_test "continue" \
>>>>>>>> -     "Continuing\\.\[ \r\n\]+Breakpoint \[0-9\]+, .*break-at-exit.*" \
>>>>>>>> -     "continue to break-at-exit after hbreak"
>>>>>>>> +set addr 0x0
>>>>>>>> +gdb_test_multiple "hbreak ${srcfile}:${breakline}" "hbreak" {
>>>>>>>> +    -re -wrap "Hardware assisted breakpoint $decimal at ($hex): $re_loc$re_dot" {
>>>>>>>> +    set addr $expect_out(1,string)
>>>>>>>> +    pass $gdb_test_name
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +set have_xfail 0
>>>>>>>> +if { [istarget arm*-*-*] } {
>>>>>>>> +    # When running 32-bit userland on aarch64 kernel, thumb instructions that
>>>>>>>> +    # are not 4-byte aligned may not be supported for setting a hardware
>>>>>>>> +    # breakpoint on.
>>>>>>>> +    set have_xfail [expr ($addr & 0x2) == 2]
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +set re_xfail \
>>>>>>>> +    [string_to_regexp \
>>>>>>>> +     "Unexpected error setting breakpoint: Invalid argument."]
>>>>>>>> +
>>>>>>>> +gdb_test_multiple "continue" "continue to break-at-exit after hbreak" {
>>>>>>>> +    -re -wrap "Continuing\\.\[ \r\n\]+Breakpoint \[0-9\]+, .*break-at-exit.*" {
>>>>>>>> +    pass $gdb_test_name
>>>>>>>> +    }
>>>>>>>> +    -re -wrap $re_xfail {
>>>>>>>> +    if { $have_xfail } {
>>>>>>>> +        xfail $gdb_test_name
>>>>>>>> +    } else {
>>>>>>>> +        fail $gdb_test_name
>>>>>>>> +    }
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>>>
>>>>>>>> base-commit: 0ed152c5c6b3c72fc505b331ed77e08b438d643a
>>>>>>>
>>>>>>
>>>>>> In any case, I agree gdb doesn't have a better way to deal with it.
>>>>>>
>>>>>> Approved-By: Luis Machado <luis.machado@arm.com>
>>>>>
>>>>
>>>
>>
>