Bug 22960 - brew gdb 8.1 (but not 8.0.1) breakpoint trap in mac os high sierra 10.13.3
Summary: brew gdb 8.1 (but not 8.0.1) breakpoint trap in mac os high sierra 10.13.3
Status: NEW
Alias: None
Product: gdb
Classification: Unclassified
Component: gdb
Version: 8.1
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2018-03-13 19:21 UTC by xdavidliu
Modified: 2022-03-09 00:04 UTC

Last reconfirmed: 2018-06-27 00:00:00


Description xdavidliu 2018-03-13 19:21:55 UTC
I filed this bug at the homebrew page, so the relevant info can be found there.
https://github.com/Homebrew/homebrew-core/issues/25172

If someone wants me to copy the info into a post here, for further convenience, I would be happy to.

This may not actually be the same as bug #20266 here, which was from two years ago, since the issue occurs only with gdb 8.1 from Homebrew on macOS, *not* with gdb 8.0.1 from Homebrew. In 8.0.1, gdb works correctly (assuming codesigning is done correctly, which is unrelated but has caused many users trouble), with only a dyld version warning.
Comment 1 thor.lilei 2018-04-05 11:33:41 UTC
Same problem with me.
Comment 2 Ray Seyfarth 2018-04-27 19:21:22 UTC
Same problem with me.

I have to wonder how Apple gets lldb to work with no problems.  It appears to not be codesigned nor setuid/gid or anything special.  There is an option which works which Apple should share with the open source community.  They have wasted a lot of my time.
Comment 3 Pedro Alves 2018-04-28 16:25:34 UTC
If gdb 8.0 works, but gdb 8.1 doesn't, then that suggests doing a git bisect to find the exact change in gdb that caused the problem.  Any takers?
Comment 4 Saagar Jha 2018-05-24 08:05:38 UTC
I think I've found the culprit:

	$ git bisect run ../gdb-bisect.sh
	running /Users/saagarjha/Git/bisect-test.sh

	[Snip]

	f6ac5f3d63e03a81c4ff3749aba234961cc9090e is the first bad commit
	commit f6ac5f3d63e03a81c4ff3749aba234961cc9090e
	Author: Pedro Alves <palves@redhat.com>
	Date:   Thu May 3 00:37:22 2018 +0100

	    Convert struct target_ops to C++

	[Snip]

	bisect run success

Could someone confirm this for me? Commits before this one can successfully follow the debuggee to completion without incident, but the ones after and including this one crash with a null pointer dereference in gdb`push_target(struct target_ops *) at target.c:653. From a cursory glance, it seems a little fishy that darwin-nat.c doesn't have any sort of add_target call in it, but I can't understand the code in the C/C++ frankenstein state it's in right now, so I wasn't able to come up with a fix. (I did find a bunch of undefined behavior being hit, though, which I *do* have patches for. Let me know if you're interested in seeing them.)

<rant>
Just as an FYI, confirming this particular commit took well over two days and testing over two hundred revisions, which is something that I find as an outside observer to be truly horrible. Does GDB have *no* automated testing or continuous integration whatsoever? Putting aside the fact that any such infrastructure would catch simple bugs like this one, which are easy to reproduce, it would have also made my life bisecting a lot easier. Many intermediate commits are broken, as in they *literally don't build on macOS*, because someone forgot a header file or messed up a Makefile. Others dereference null pointers or overflow ints during startup, which really threw off my bisect script with false positives: I had to restart the bisect from the beginning at least half a dozen times because it homed in on the wrong bug. I'm aghast that it's possible for such clearly broken patches to land in the master branch. I do apologize for the vitriolic tone here, but I'm extremely frustrated at the amount of time I had to spend finding this when it should have been a rather trivial task. I do hope none of you take it personally–but if you're looking for things to improve, this is one thing I think you should focus on.
</rant>
Comment 5 Saagar Jha 2018-05-24 09:21:15 UTC
(In reply to Ray Seyfarth from comment #2)
> Same problem with me.
> 
> I have to wonder how Apple gets lldb to work with no problems.  It appears
> to not be codesigned nor setuid/gid or anything special.  There is an option
> which works which Apple should share with the open source community.  They
> have wasted a lot of my time.

LLDB is signed with Apple's certificate:

$ codesign -dvv `xcrun -find lldb`
Executable=/Applications/Xcode-beta.app/Contents/Developer/usr/bin/lldb
Identifier=com.apple.lldb
Format=Mach-O thin (x86_64)
CodeDirectory v=20200 size=622 flags=0x0(none) hashes=15+2 location=embedded
Signature size=4535
Authority=Software Signing
Authority=Apple Code Signing Certification Authority
Authority=Apple Root CA
Info.plist entries=6
TeamIdentifier=59GAB85EFG
Sealed Resources=none
Internal requirements count=1 size=64
Comment 6 Pedro Alves 2018-05-24 11:47:01 UTC
(In reply to Saagar Jha from comment #4)
> I think I've found the culprit:
> 
> 	$ git bisect run ../gdb-bisect.sh
> 	running /Users/saagarjha/Git/bisect-test.sh
> 
> 	[Snip]
> 
> 	f6ac5f3d63e03a81c4ff3749aba234961cc9090e is the first bad commit
> 	commit f6ac5f3d63e03a81c4ff3749aba234961cc9090e
> 	Author: Pedro Alves <palves@redhat.com>
> 	Date:   Thu May 3 00:37:22 2018 +0100
> 
> 	    Convert struct target_ops to C++
> 
> 	[Snip]
> 
> 	bisect run success

That commit can't be the culprit for the issue reported in this bug,
because that commit is recent, it is in master only, not in 8.1.
If it caused some breakage, it's something else.  A separate bug report
would have been better.

> 
> Could someone confirm this for me? Commits before this one can successfully
> follow the debuggee to completion without incident, but the ones after and
> including this one crash with a null pointer dereference in
> gdb`push_target(struct target_ops *) at target.c:653. From a cursory glance,
> it seems a little fishy that darwin-nat.c doesn't have any sort of
> add_target call in it, 

The add_target call is in i386-darwin-nat.c:_initialize_i386_darwin_nat

  add_inf_child_target (&darwin_target);

> but I can't understand the code in the C/C++
> frankenstein state it's in right now, 

Yeah.  Anything in particular you'd like to point out?

> so I wasn't able to come up with a
> fix. (I did find a bunch of undefined behavior being hit, though, which I
> *do* have patches for. Let me know if you're interested in seeing them.)

Yes please.  If you could contribute fixes, it'd be awesome:

 https://sourceware.org/gdb/wiki/ContributionChecklist

In case it isn't obvious, the macOS port is in real need of someone motivated to maintain it.  I'm afraid that none of the day-to-day maintainers uses macOS, AFAIK.  You can see it as an opportunity.

> 
> <rant>
> Just as an FYI, confirming this particular commit took well over two days and
> testing over two hundred revisions, which is something that I find as an
> outside observer to be truly horrible. 

Wow.  Sorry about that.  Two hundred revisions sounds way too many for a git bisect?  How could that have happened?

> Does GDB have *no* automated testing
> or continuous integration whatsoever? 

It does, see <https://sourceware.org/gdb/wiki/BuildBot>.  The problem is nobody ever contributed a macOS buildslave.

> Putting aside the fact that any such
> infrastructure would catch simple bugs like this one, which are easy to
> reproduce, it would have also made my life bisecting a lot easier. Many
> intermediate commits are broken, as in they *literally don't build on
> macOS*, because someone forgot a header file or messed up a Makefile. Others
> dereference null pointers or overflow ints during startup, which really
> threw off my bisect script with false positives: I had to restart the bisect
> from the beginning at least half a dozen times because it homed in on the
> wrong bug.

:-(  Sounds like maybe "git bisect skip" would have helped?

> I'm aghast that it's possible for such clearly broken patches to
> land in the master branch. I do apologize for the vitriolic tone here, but
> I'm extremely frustrated at the amount of time I had to spend finding this
> when it should have been a rather trivial task. I do hope none of you take
> it personally–but if you're looking for things to improve, this is one thing
> I think you should focus on.

Nope, sorry.  The thing to improve is _getting someone that actually cares about the port to step up and help maintain it_.  That could be you.  Otherwise, I fear that at some point, the port will just end up deprecated and removed.
Comment 7 Pedro Alves 2018-05-24 12:20:34 UTC
> including this one crash with a null pointer dereference in
> gdb`push_target(struct target_ops *) at target.c:653.

I think I see what is going on here.  I'll send a patch.
Comment 8 Saagar Jha 2018-06-03 08:08:49 UTC
Sorry, I took a break from this because I couldn't figure it out: my bisect kept ending up on 4bbd4ef219c5b4c7d437618ba8937af86dd1032e, with a one character diff. My guess is that this commit changes what methods get called, so I might be able to discover what it changes if I could log every method call, but I don't know how to do that in gdb.

> Yeah.  Anything in particular you'd like to point out?

The darwin-nat/i386-darwin-nat thing was kind of confusing to me, since I thought darwin-nat was for x86_64 and i386-darwin-nat was for, well, i386. Plus this one didn't really follow the example set by other platforms so I didn't have much to go off of. Just my thoughts.

> Yes please.  If you could contribute fixes, it'd be awesome

I have a couple of clumsy patches for issues I found up here (as well as yours), if you find them useful: https://github.com/saagarjha/binutils-gdb. If they're useful I could try to format them to follow the guidelines.

> Two hundred revisions sounds way too many for a git bisect?  How could that have happened?

Well, each bisect ideally should have been around a dozen commits to test, but I kept needing to run bisect again because my bisect script, having no real way of testing whether the current commit was good, ended up doing something along the lines of checking "echo r | gdb -return-child-result a.out". But a lot of commits had issues such as not building (which meant that my script, which "git bisect skip"ed any commit that didn't build, ended up mired in a 90-odd commit block where none of the commits built, jumping around randomly to find one that compiled), or many that had some sort of undefined behavior that was recognized much later, which means I had to manually backport fixes to rule out the false positives it discovered.
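
For illustration, a bisect helper along the lines described above could be sketched in Python as follows. This is not the script used in this report. The build command, the gdb/gdb path, the a.out test binary, and the timeouts are all assumptions. The only fixed part is the exit-code convention of "git bisect run": 0 means good, 125 means skip, and any other value from 1 to 127 means bad.

```python
"""Sketch of a `git bisect run` helper; commands and paths are assumptions."""
import subprocess

# Exit codes that `git bisect run` interprets: 0 = good, 125 = skip,
# any other value from 1 to 127 = bad.
GOOD, BAD, SKIP = 0, 1, 125

BUILD_CMD = ["make", "-j4"]  # assumed build command
# Assumed location of the freshly built gdb and of a small test program.
TEST_CMD = ["gdb/gdb", "-batch", "-return-child-result", "-ex", "run", "./a.out"]


def run(cmd, timeout):
    """Return the exit status of cmd, or None if it could not run or timed out."""
    try:
        return subprocess.run(cmd, stdin=subprocess.DEVNULL, timeout=timeout).returncode
    except (OSError, subprocess.TimeoutExpired):
        return None


def verdict(build_rc, test_rc):
    """Map build and test results to a bisect exit code."""
    if build_rc != 0:
        return SKIP  # a commit that does not build cannot be judged
    return GOOD if test_rc == 0 else BAD  # a hang (None) or nonzero status is bad


def main():
    build_rc = run(BUILD_CMD, timeout=3600)
    test_rc = run(TEST_CMD, timeout=60) if build_rc == 0 else None
    return verdict(build_rc, test_rc)
```

To use it, the file would end with "raise SystemExit(main())" and be passed to "git bisect run". Note that this sketch counts any hang or crash as bad, so an unrelated startup bug still produces the false positives described above; separating those cases needs a more specific check than the exit status.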
Comment 9 Pedro Alves 2018-06-04 11:40:07 UTC
(In reply to Saagar Jha from comment #8)

> > Yes please.  If you could contribute fixes, it'd be awesome
> 
> I have a couple of clumsy patches for issues I found up here (as well as
> yours), if you find them useful: https://github.com/saagarjha/binutils-gdb.
> If they're useful I could try to format them to follow the guidelines.

They do seem to point at real issues that should be fixed somehow.  If you send the fixes to the list, they can be discussed there.
Comment 10 Pedro Alves 2018-06-04 12:33:02 UTC
> 
> > Yeah.  Anything in particular you'd like to point out?
> 
> The darwin-nat/i386-darwin-nat thing was kind of confusing to me, since I
> thought darwin-nat was for x86_64 and i386-darwin-nat was for, well, i386.
> Plus this one didn't really follow the example set by other platforms so I
> didn't have much to go off of. Just my thoughts.

OK.  There are exceptions for single-arch ports, but $OS-nat.c is usually _not_ architecture-specific.  E.g., linux-nat.c is for all Linux architectures, and then we have i386-linux-nat.c/amd64-linux-nat.c.  Same with fbsd-nat.c, etc.
Maybe we should rename i386-darwin-nat to x86-darwin-nat though, as that's the convention we follow most everywhere (i386=>32-bit, amd64=>64-bit, x86=>both).
Please do feel free to pop in to #gdb on freenode, where several maintainers hang out.  I'd be happy to help you get around the codebase a bit more, if you're interested.
Comment 11 Pedro Alves 2018-06-04 12:37:08 UTC
Back to the original topic:

(In reply to Saagar Jha from comment #8)
> Sorry, I took a break from this because I couldn't figure it out: my bisect
> kept ending up on 4bbd4ef219c5b4c7d437618ba8937af86dd1032e, with a one
> character diff. My guess is that this commit changes what methods get called,
> so I might be able to discover what it changes if I could log every method
> call, but I don't know how to do that in gdb.

Hmm, at least that is indeed changing something in the darwin-related code.
The original patch was submitted here, but it didn't come with any sort of detail: <https://sourceware.org/ml/gdb-patches/2017-07/msg00447.html>.

Did you try reverting that commit on top of current master, see if it makes a difference?
Comment 12 Saagar Jha 2018-06-06 08:25:41 UTC
> They do seem to point at real issues that should be fixed somehow.  If you send the fixes to the list, they can be discussed there.

Sure, I'll make sure to stop by after we get this figured out so I can make a clean set of patches.

> Did you try reverting that commit on top of current master, see if it makes a difference?

Yup, the issue (mostly) goes away if I do that. There are other latent issues hiding out somewhere that we can get to later, but at least I can get GDB to execute my program to successful termination as it did in 8.0.1.
Comment 13 Tom Tromey 2018-06-27 22:33:03 UTC
I built gdb 8.0 from git and that did not work for me on macOS 10.13.5.
Neither did git master.  I also tried the 8.0.1 from brew.

They both fail in the same way, with "Unknown signal".

I tend to think this is a dup of 20266.
Comment 14 Pedro Alves 2018-06-28 00:05:44 UTC
Tom, does reverting the offending commit work for you?
Comment 15 Tom Tromey 2018-06-28 13:57:40 UTC
(In reply to Pedro Alves from comment #14)
> Tom, does reverting the offending commit work for you?

No.

What happens for me is that darwin_decode_message gets
a MACH_NOTIFY_DEAD_NAME (the "== 0x48" case).  Then the
subsequent wait4() call returns with wstatus=5.

wstatus=5 is a strange response.  It is not WIFEXITED,
but neither is it WIFSIGNALED.  So far I haven't found
any documentation about what it might be.

One wild guess is that maybe this mach message actually
does carry the name of the new port and it could be
extracted via darwin_find_new_inferior.  But that seems
like a longshot.

I looked at the lldb patch that Jason Molenda posted
(see https://sourceware.org/bugzilla/show_bug.cgi?id=20266#c6),
but lldb seems to work in a completely different way here,
I guess hooking into some low-level mach thing somehow?  Like,
those functions aren't obviously called from anywhere.
So, I do wonder whether the answer is a bigger rewrite of
darwin-nat.c, to use mach stuff everywhere and not ptrace
or wait.  However, this experience has shown me that even
minor revisions of macOS can come with big changes, so
modifying this code seems somewhat tricky.
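
As a way to inspect values like the wstatus discussed above, the sketch below classifies a raw wait status with the standard POSIX macros as Python exposes them in the os module. The helper names describe_wait_status and status_of_child are hypothetical and are not part of gdb. The raw value from the comment can be passed to describe_wait_status to see how the local libc classifies it; that result is platform-dependent and is not claimed here.

```python
import os
import signal


def describe_wait_status(wstatus):
    """Classify a raw wait status using the POSIX W* macros."""
    if os.WIFEXITED(wstatus):
        return ("exited", os.WEXITSTATUS(wstatus))
    if os.WIFSIGNALED(wstatus):
        return ("signaled", os.WTERMSIG(wstatus))
    if os.WIFSTOPPED(wstatus):
        return ("stopped", os.WSTOPSIG(wstatus))
    return ("unknown", wstatus)


def status_of_child(action):
    """Fork, run action() in the child, and describe how the child ended."""
    pid = os.fork()
    if pid == 0:
        action()
        os._exit(0)  # reached only if action() returns
    _, wstatus = os.waitpid(pid, 0)
    return describe_wait_status(wstatus)
```

For example, status_of_child(lambda: os._exit(3)) reports a normal exit with code 3, and a child that sends itself SIGKILL is reported as signaled.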
Comment 16 Pedro Alves 2018-06-28 14:20:29 UTC
Are you testing with "set startup-with-shell off", perhaps?  Or maybe Saagar was?  I could see that affecting whether you see the SIGTRAP, since this all seems to be exec-event related.
Comment 17 Tom Tromey 2018-06-28 14:47:59 UTC
(In reply to Pedro Alves from comment #16)
> Are you testing with "set startup-with-shell off", perhaps?  Or maybe Saagar
> was?  I could see that affecting whether you see the SIGTRAP, since this all
> seems to be exec-event related.

I have tried it both ways to no avail.
Comment 18 Tom Tromey 2018-06-29 14:53:36 UTC
In my case the first problem was that I was trying "gdb /bin/ls" --
but that is subject to System Integrity Protection.
Using my own test executable gives a different problem.

I'll file a separate bug about detecting SIP.
gdb could at least tell the user what is going on.
Comment 19 Richard Tran Mills 2018-10-15 21:36:55 UTC
I'd just like to confirm that I am seeing the exact same error on Mac OS 10.13.6 using GDB 8.2 installed via Homebrew. Downgrading to 8.0.1 via Homebrew gives me a working version of GDB.
Comment 20 Tom Tromey 2018-10-15 23:31:40 UTC
Try git master gdb.  There have been a few High Sierra fixes there.
I used to get this problem there but now I no longer do.
Comment 21 Saagar Jha 2018-10-16 02:50:50 UTC
Ooh, it's nice to see that the underlying issue has been fixed. Can confirm that this works on macOS Mojave with a small patch to deal with new load commands. I'll look into the process of getting this merged in so we can extend support to 10.14 as well.
Comment 22 Tom Tromey 2018-10-16 19:42:49 UTC
(In reply to Saagar Jha from comment #21)
> Ooh, it's nice to see that the underlying issue has been fixed. Can confirm
> that this works on macOS Mojave with a small patch to deal with new load
> commands. I'll look into the process of getting this merged in so we can
> extend support to 10.14 as well.

Looking forward to that.
See also bug #23728, bug #23742, and bug #23746.
Comment 23 Saagar Jha 2018-10-26 13:35:49 UTC
I've taken the time to clean up my patches and submit them to the gdb-patches mailing list (though, I don't see them in the archives. Is this just a standard delay, or did I mess up somewhere?)
Comment 24 Tom Tromey 2018-10-26 19:34:27 UTC
(In reply to Saagar Jha from comment #23)
> I've taken the time to clean up my patches and submit them to the
> gdb-patches mailing list (though, I don't see them in the archives. Is this
> just a standard delay, or did I mess up somewhere?)

I didn't see them either, so maybe try re-sending.
Comment 25 Saagar Jha 2018-10-27 05:09:37 UTC
The patches should be on the list now. Turns out gdb-patches is extremely strict about emails that contain any kind of HTML ;P
Comment 26 Saagar Jha 2018-12-06 07:44:23 UTC
Ok, back to the original topic, now that the patches have been merged: I'm still intermittently seeing the original issue about half the time. Tom, for me it seems like wait4 is giving me WIFSTOPPED with SIGTRAP. Does this mean we need to refresh our task port?
Comment 27 Tom Tromey 2018-12-10 17:25:01 UTC
(In reply to Saagar Jha from comment #26)
> Ok, back to the original topic, now that the patches have been merged: I'm
> still intermittently seeing the original issue about half the time. Tom, for
> me it seems like wait4 is giving me WIFSTOPPED with SIGTRAP. Does this mean
> we need to refresh our task port?

I don't really know.  Actually I'm surprised to hear that this is still
a problem as I would have thought the earlier round of macOS changes would
have fixed this.  However, I don't have Mojave, only High Sierra, so
I can't really try it.  I haven't been able to reproduce this bug there.
Comment 28 Roman Bolshakov 2018-12-11 22:53:42 UTC
Tom, Saagar

I'm not seeing SIGTRAP on master but it exists in the latest stable gdb from homebrew (8.2_1).

Do you know the commits which could resolve the issue?

Thank you,
Roman
Comment 29 Saagar Jha 2018-12-11 22:59:45 UTC
This issue is intermittent for me; I'm building straight off of master (./configure --disable-werror CFLAGS="-g -fsanitize=address -fsanitize=undefined" CXXFLAGS="-g -fsanitize=address -fsanitize=undefined" LDFLAGS="-g -fsanitize=address -fsanitize=undefined") on macOS Mojave 10.14.3 Beta (18D21c). I'll try gdb a couple times on a toy binary and it'll work, and then it will start to randomly hang until I SIGKILL it. This is similar to the behavior I had on High Sierra when I was on it back in May, so I'm guessing that the underlying issue is still there somewhere. As to what it is, I have no idea…
Comment 30 Roman Bolshakov 2018-12-12 00:11:22 UTC
So, there seem to be two different issues:
* The program quits shortly after start with SIGTRAP (the issue in this report). I haven't seen this on master.

* The program doesn't quit at all (I don't know if it has a bug #), but I could easily catch it by running gdb in a loop:

for i in $(seq 1 100); do sudo /usr/local/Cellar/gdb/HEAD-750b258_1/bin/gdb -ex 'r' -ex 'quit' ./a.out; done


Sampling shows gdb hangs in darwin_decode_message:

Analysis of sampling gdb (pid 746) every 1 millisecond
Process:         gdb [746]
Path:            /usr/local/Cellar/gdb/HEAD-750b258_1/bin/gdb
Load Address:    0x100000000
Identifier:      gdb
Version:         0
Code Type:       X86-64
Parent Process:  sudo [745]

Date/Time:       2018-12-12 03:07:55.146 +0300
Launch Time:     2018-12-12 03:07:30.741 +0300
OS Version:      Mac OS X 10.14.1 (18B75)
Report Version:  7
Analysis Tool:   /usr/bin/sample

Physical footprint:         5268K
Physical footprint (peak):  5272K
----

Call graph:
    2826 Thread_21543800   DispatchQueue_1: com.apple.main-thread  (serial)
      2826 start  (in libdyld.dylib) + 1  [0x7fff63cd508d]
        2826 main  (in gdb) + 44  [0x1000039dc]
          2826 gdb_main(captured_main_args*)  (in gdb) + 3701  [0x10019399c]
            2826 catch_command_errors(void (*)(char const*, int), char const*, int)  (in gdb) + 53  [0x1001941af]
              2826 execute_command(char const*, int)  (in gdb) + 489  [0x1002836e4]
                2826 cmd_func(cmd_list_element*, char const*, int)  (in gdb) + 104  [0x100079296]
                  2826 run_command_1(char const*, int, run_how)  (in gdb) + 594  [0x100162c6a]
                    2826 darwin_nat_target::create_inferior(char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char**, int)  (in gdb) + 939  [0x1000bf50b]
                      2826 fork_inferior(char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char**, void (*)(), void (*)(int), void (*)(), char const*, void (*)(char const*, char* const*, char* const*))  (in gdb) + 376  [0x100126069]
                        2826 darwin_ptrace_him(int)  (in gdb) + 99  [0x1000bf96f]
                          2826 gdb_startup_inferior(int, int)  (in gdb) + 22  [0x100125727]
                            2826 startup_inferior(int, int, target_waitstatus*, ptid_t*)  (in gdb) + 205  [0x1001262d4]
                              2826 target_wait(ptid_t, target_waitstatus*, int)  (in gdb) + 67  [0x10026901c]
                                2826 darwin_nat_target::wait(ptid_t, target_waitstatus*, int)  (in gdb) + 39  [0x1000be861]
                                  2826 darwin_wait(ptid_t, target_waitstatus*)  (in gdb) + 290  [0x1000be98d]
                                    2826 darwin_decode_message(mach_msg_header_t*, darwin_thread_info**, inferior**, target_waitstatus*)  (in gdb) + 1091  [0x1000c155d]
                                      2826 __wait4_nocancel  (in libsystem_kernel.dylib) + 10  [0x7fff63e15e72]

Total number in stack (recursive counted multiple, when >=5):

Sort by top of stack, same collapsed (when >= 5):
        __wait4_nocancel  (in libsystem_kernel.dylib)        2826
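
A loop like the one above blocks forever on the first hang. A variant with a per-run timeout, sketched below in Python, can count how often the hang occurs instead. The function name count_hangs and the timeout value are assumptions; the gdb arguments in the usage note are the ones from the loop above.

```python
import subprocess


def count_hangs(cmd, runs, timeout):
    """Run cmd `runs` times and count the runs that exceed `timeout` seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before raising; a process
            # started by that child (here, the debuggee) may be left behind.
            hangs += 1
    return hangs
```

A call such as count_hangs(["gdb", "-ex", "r", "-ex", "quit", "./a.out"], runs=100, timeout=30) would then give a rough hang rate. The timeout has to be chosen well above the normal run time of the test program, otherwise slow runs are miscounted as hangs.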
Comment 31 kemlath 2019-01-06 15:42:31 UTC
I dug into this problem, and the issue is that gdb hangs in darwin_decode_message.
I had a look at the most current version from the ftp server, gdb-8.2.50.20190105.tar.xz.

In darwin-nat.c, in darwin_decode_message(...), wait4 is called for the first time at line 1131 to check if and how the thread exited.
At line 1154, wait4 is called a second time on the now potentially terminated thread.

darwin-nat.c line 1154:  wait4 (inf->pid, &wstatus, 0, NULL);

changing this to 

darwin-nat.c line 1154:  wait4 (inf->pid, &wstatus, WNOHANG, NULL);

tells wait4 not to wait for threads that won't report in any more.

This makes the frequent hangs of gdb under Mojave go away.

I've tested this with C++ and fortran from the console and from eclipse under Mojave and had no problems.
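
The difference between the two wait4 calls can be seen in isolation with the Python sketch below, which uses os.wait4 as a stand-in for the C call. It only illustrates the flag semantics; it is not gdb code, the function name poll_then_reap is hypothetical, and it does not show that WNOHANG is the right fix for darwin-nat.c.

```python
import os
import signal
import time


def poll_then_reap():
    """Show that WNOHANG returns at once while options=0 blocks until exit."""
    pid = os.fork()
    if pid == 0:
        time.sleep(30)  # child: stay alive long enough to be polled
        os._exit(0)
    # With WNOHANG, wait4 returns pid 0 immediately if the child has not
    # changed state, instead of blocking.
    polled_pid, _, _ = os.wait4(pid, os.WNOHANG)
    os.kill(pid, signal.SIGKILL)
    # With options=0, wait4 blocks until the child has terminated.
    reaped_pid, wstatus, _ = os.wait4(pid, 0)
    return polled_pid, reaped_pid == pid, os.WIFSIGNALED(wstatus)
```

One design consequence is worth noting: a non-blocking call that returns 0 has not reaped anything, so if the process exits later it stays a zombie until something waits for it again.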
Comment 33 Philippe Blain 2022-03-09 00:04:15 UTC
I think the problem(s) on High Sierra mentioned in this bug report eventually got fixed, if I read the history of this bug correctly. The issue discussed later in here, starting in https://sourceware.org/bugzilla/show_bug.cgi?id=22960#c21 where Mojave is mentioned, seems to be a duplicate of https://sourceware.org/bugzilla/show_bug.cgi?id=24069, which should be fixed in master.