Bug 14071 - Unstable GDB test results due to change in dl_debug_state()
Status: WAITING
Alias: None
Product: gdb
Classification: Unclassified
Component: testsuite
Version: 7.2
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2012-05-07 15:00 UTC by Vinitha Vijayan
Modified: 2012-07-06 11:01 UTC

Description Vinitha Vijayan 2012-05-07 15:00:30 UTC
Hi All,

I have been testing gdb-7.2 on an SMP kernel in an ARM Cortex-A9 environment.
I am getting more than 1000 additional failures.

This appears to happen only with the SMP kernel.
On a uP (uniprocessor) kernel, no additional unexpected failures are reported.

The testsuite behaviour appears to be unstable in the SMP kernel environment.
For example, the testcase ending-run.exp shows unstable behaviour, as shown below.

 * The testcase sometimes passes, as shown below
{{{
Running ./gdb.base/ending-run.exp ...

                === gdb Summary ===

# of expected passes            19
# of unsupported tests          1
}}}

 * Sometimes it fails as shown below
{{{
Running ./gdb.base/ending-run.exp ...
FAIL: gdb.base/ending-run.exp: run (the program exited)
FAIL: gdb.base/ending-run.exp: clear worked
FAIL: gdb.base/ending-run.exp: cleared bp at line before routine
FAIL: gdb.base/ending-run.exp: Cleared 2 by line
FAIL: gdb.base/ending-run.exp: Clear 2 by default
FAIL: gdb.base/ending-run.exp: all set to continue (didn't clear bps)
FAIL: gdb.base/ending-run.exp: cont (the program is no longer running)
FAIL: gdb.base/ending-run.exp: Step to return (the program is no longer running)
FAIL: gdb.base/ending-run.exp: step out of main (the program is no longer running)
FAIL: gdb.base/ending-run.exp: step to end of run (the program is no longer running)

                === gdb Summary ===

# of expected passes            8
# of unexpected failures        10
# of unsupported tests          2
}}}

On analyzing gdb.log for the failed case, we observed that the binary does not stop at breakpoints.
The 'run' command results in the program running to completion and exiting without stopping at any breakpoint.
{{{
...
b ending-run.c:1^M
Breakpoint 1 at 0x84d0: file ./gdb.base/ending-run.c, line 1.^M
(gdb) PASS: gdb.base/ending-run.exp: bpt at line before routine
b ending-run.c:14^M
Note: breakpoint 1 also set at pc 0x84d0.^M
Breakpoint 2 at 0x84d0: file ./gdb.base/ending-run.c, line 14.^M
(gdb) PASS: gdb.base/ending-run.exp: b ending-run.c:14, one
b ending-run.c:31^M
Breakpoint 3 at 0x853c: file ./gdb.base/ending-run.c, line 31.^M
(gdb) PASS: gdb.base/ending-run.exp: b ending-run.c:31
run ^M
Starting program: /mnt/test/gdb-test-7.2/glibc/gdb-7.2.glibc/gdb/testsuite/gdb.base/ending-run ^M
-1 2 7 14 23 34 47 62 79  Goodbye!^M
^M
Program exited normally.^M
(gdb) FAIL: gdb.base/ending-run.exp: run (the program exited)
cle^M
No source file specified.^M
(gdb) FAIL: gdb.base/ending-run.exp: clear worked
i b^M
Num     Type           Disp Enb Address    What^M
1       breakpoint     keep y   0x000084d0 in callee at ./gdb.base/ending-run.c:1^M
2       breakpoint     keep y   0x000084d0 in callee at ./gdb.base/ending-run.c:14^M
3       breakpoint     keep y   0x0000853c in main at ./gdb.base/ending-run.c:31^M
(gdb) FAIL: gdb.base/ending-run.exp: cleared bp at line before routine
...
}}}

 * The glibc version I am using is glibc-2.11.2.
 * We recently removed frame pointer support from glibc (it is built with -O2, without -fno-omit-frame-pointer).
  * The disassembly of the _dl_debug_state function in ld.so in this build is shown below.
{{{
0000b6d8 <_dl_debug_state>:
    b6d8:       4770            bx      lr
    b6da:       bf00            nop
}}}

 * glibc-2.11.2 built with -fno-omit-frame-pointer works fine in both SMP and uP kernel environments.
  * The disassembly of the _dl_debug_state function in ld.so in this build is shown below.
{{{
0000aee8 <_dl_debug_state>:
    aee8:       b480            push    {r7}
    aeea:       af00            add     r7, sp, #0
    aeec:       46bd            mov     sp, r7
    aeee:       bc80            pop     {r7}
    aef0:       4770            bx      lr
    aef2:       bf00            nop
}}}

 * We found that the change in _dl_debug_state() is what causes the unstable behaviour in the GDB testsuite.
 * We tested this by adding some nop instructions to _dl_debug_state(), as shown below.
{{{
0000b6d8 <_dl_debug_state>:
0000b6d8:       bf00            nop
0000b6da:       bf00            nop
0000b6dc:       bf00            nop
0000b6de:       4770            bx      lr
}}}
 * In this case too, the GDB test results are stable.

 * GDB puts an internal breakpoint on _dl_debug_state to monitor shared library events.
   Since the change in the assembly of this function causes unstable gdb testsuite results,
   we tried putting the gdb internal breakpoint on another function in ld.so, _dl_unload_cache(),
   which is called from dl_main() after the call to _dl_debug_state().
{{{
--- gdb-7.2.orig/gdb/solib-svr4.c
+++ gdb-7.2/gdb/solib-svr4.c
@@ -84,6 +84,7 @@ static char *solib_break_names[] =
 {
   "r_debug_state",
   "_r_debug_state",
+  "_dl_unload_cache",
   "_dl_debug_state",
   "rtld_db_dlactivity",
   "__dl_rtld_db_dlactivity",
}}}
 * In this case the GDB test results are stable.
 * There seems to be some issue when running the gdb testsuite in an SMP kernel environment when glibc is built without a frame pointer.
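
To illustrate why the position of the new entry in solib_break_names matters, here is a minimal sketch in C. It assumes the names are tried in order and the first one that resolves in ld.so is used; this is a simplified model, not GDB's actual lookup code. The struct sym table, pick_break_name(), and the address given for _dl_unload_cache are hypothetical. The name lists contain only the entries visible in the hunk above, and 0xb6d8 is the _dl_debug_state address from the disassembly earlier in this report.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one ld.so symbol table entry.  */
struct sym
{
  const char *name;
  unsigned long addr;
};

/* Return the first name in the NULL-terminated list NAMES that is
   present in TAB, storing its address in *ADDR_OUT.  Return NULL if
   none of the names is found.  */
const char *
pick_break_name (const char *const *names, const struct sym *tab,
                 size_t ntab, unsigned long *addr_out)
{
  for (; *names != NULL; names++)
    for (size_t i = 0; i < ntab; i++)
      if (strcmp (*names, tab[i].name) == 0)
        {
          *addr_out = tab[i].addr;
          return *names;
        }
  return NULL;
}

/* Only the entries visible in the hunk; the real list may be longer.  */
const char *const orig_names[] = {
  "r_debug_state", "_r_debug_state", "_dl_debug_state",
  "rtld_db_dlactivity", "__dl_rtld_db_dlactivity", NULL
};

const char *const patched_names[] = {
  "r_debug_state", "_r_debug_state", "_dl_unload_cache", "_dl_debug_state",
  "rtld_db_dlactivity", "__dl_rtld_db_dlactivity", NULL
};

/* Toy symbol table for ld.so.  */
const struct sym ldso_syms[] = {
  { "_dl_debug_state", 0xb6d8 },   /* address from the disassembly */
  { "_dl_unload_cache", 0xc000 },  /* hypothetical address */
};
```

Under this model, pick_break_name (orig_names, ldso_syms, 2, &addr) selects _dl_debug_state, while the patched list selects _dl_unload_cache because it appears earlier in the list.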

NB: Normal debugging works fine. Only the DejaGnu testsuite shows unstable results.

Has anyone seen a similar issue with the gdb testsuite?

Regards,
Vinitha Vijayan
Comment 1 Vinitha Vijayan 2012-05-17 06:15:53 UTC
Hi,

Some more findings regarding the failing testcase scenario above.

In the FAIL cases, we observed that the single-step breakpoint inserted after hitting the breakpoint at _dl_debug_state (at the bx lr instruction) is not hit, and the binary completes execution without stopping at any breakpoint.

We checked the kernel code (arch/arm/kernel/ptrace.c and arch/arm/kernel/traps.c).

After debugging the kernel, we found that in the FAIL case the kernel does not enter the undefined instruction handler (do_undefinstr() in traps.c) for the single-step breakpoint inserted in dl_main after the first _dl_debug_state call.
GDB writes the breakpoint instruction 0xa000f7f0 to that location.

When the testcase passes, the kernel does enter this function.
There seems to be some timing issue in the SMP kernel environment.

When the breakpoint at _dl_debug_state and the single-step breakpoint inserted after hitting it are in the same function, this testcase shows stable results.

(I tested this by putting a single nop before the bx lr instruction in _dl_debug_state. It works fine.)
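
The placement of the single-step breakpoint can be sketched in C as follows. This is a toy model that assumes software single-stepping puts a temporary breakpoint at the address of the next instruction to execute. It decodes only the two 16-bit Thumb encodings seen in the disassembly in this report (bx Rm and nop), and thumb_next_pc() is a hypothetical name, not a GDB function.

```c
#include <stdint.h>

/* Compute where execution continues after the 16-bit Thumb instruction
   INSN located at PC.  REGS holds r0..r15.  Only "bx Rm" is treated as a
   branch; everything else is assumed to fall through.  */
uint32_t
thumb_next_pc (uint32_t pc, uint16_t insn, const uint32_t regs[16])
{
  /* BX Rm is encoded as 0100 0111 0 Rm 000; bx lr is 0x4770.  */
  if ((insn & 0xff87u) == 0x4700u)
    {
      unsigned rm = (insn >> 3) & 0xfu;
      /* Bit 0 of the target selects Thumb state; it is not part of
         the address.  */
      return regs[rm] & ~(uint32_t) 1;
    }

  /* Non-branch 16-bit instruction (e.g. nop, 0xbf00): next in line.  */
  return pc + 2;
}
```

With bx lr as the first instruction at 0xb6d8, the next PC is the return address held in lr, which lies in the caller. With a nop at 0xb6d8, the next PC is 0xb6da, still inside _dl_debug_state, which corresponds to the variant that gave stable results.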

Any comments?
Comment 2 Gary Benson 2012-07-06 09:52:20 UTC
Note that the _dl_debug_state style of interface in glibc may be about to change:

  http://sourceware.org/ml/libc-alpha/2012-07/msg00096.html

Though from your comments it looks more like this is some generic issue with setting breakpoints on empty functions.  Are you able to reproduce this with a simple testcase?
Comment 3 Pedro Alves 2012-07-06 11:01:03 UTC
It sounds like your kernel has a missing i-cache flush in ptrace.
This is not the first time something like that has been reported.

E.g., see:

 http://kerneltrap.org/mailarchive/git-commits-head/2010/3/1/25453

Make sure you have that patch in your kernel.  There may have been other similar fixes since then, and perhaps others are still necessary.
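
As background on the failure mode, the following C sketch is a user-space analogue, not the kernel ptrace path. It writes instructions into memory as data and then has to make them visible to instruction fetch before executing them. The suggestion above is that the kernel may be skipping the equivalent step when ptrace writes a breakpoint. run_generated_code() is a hypothetical name. The sketch assumes Linux with a GCC-compatible compiler, and the machine code for 32-bit ARM assumes a little-endian, Thumb-capable CPU.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Machine code for "return 42" on a few architectures.  */
#if defined(__x86_64__) || defined(__i386__)
static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
#define ENTRY_BIT 0
#elif defined(__aarch64__)
static const uint32_t code[] = { 0x52800540u, 0xd65f03c0u }; /* mov w0,#42; ret */
#define ENTRY_BIT 0
#elif defined(__arm__)
static const uint16_t code[] = { 0x202a, 0x4770 };           /* movs r0,#42; bx lr */
#define ENTRY_BIT 1   /* bit 0 set: enter in Thumb state */
#else
static const unsigned char code[] = { 0 };
#define ENTRY_BIT 0
#define NO_CODE 1
#endif

/* Copy CODE into fresh memory, flush, execute it, and return its result.
   Returns -1 if the architecture is unsupported or the mapping fails.  */
int
run_generated_code (void)
{
#ifdef NO_CODE
  return -1;
#else
  size_t len = 4096;
  void *buf = mmap (NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED)
    return -1;

  /* The instructions are written through the data side.  */
  memcpy (buf, code, sizeof code);

  if (mprotect (buf, len, PROT_READ | PROT_EXEC) != 0)
    {
      munmap (buf, len);
      return -1;
    }

  /* Without this, a core with separate instruction and data caches may
     fetch stale bytes.  On x86 the builtin is effectively a no-op.  */
  __builtin___clear_cache ((char *) buf, (char *) buf + sizeof code);

  int (*fn) (void) = (int (*) (void)) ((uintptr_t) buf | ENTRY_BIT);
  int result = fn ();
  munmap (buf, len);
  return result;
#endif
}
```

On ARM, omitting the __builtin___clear_cache() call can make the outcome depend on cache state and timing, which is the kind of intermittent behaviour described in this report.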