This is the mail archive of the
libc-ports@sources.redhat.com
mailing list for the libc-ports project.
Re: [PATCH] Unify pthread_once (bug 15215)
- From: Torvald Riegel <triegel at redhat dot com>
- To: Rich Felker <dalias at aerifal dot cx>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>, libc-ports <libc-ports at sourceware dot org>
- Date: Thu, 09 May 2013 17:14:28 +0200
- Subject: Re: [PATCH] Unify pthread_once (bug 15215)
- References: <1368024237 dot 7774 dot 794 dot camel at triegel dot csb> <20130508175132 dot GB20323 at brightrain dot aerifal dot cx> <1368046046 dot 7774 dot 1441 dot camel at triegel dot csb> <20130508212502 dot GF20323 at brightrain dot aerifal dot cx> <1368088765 dot 7774 dot 1571 dot camel at triegel dot csb> <20130509140245 dot GI20323 at brightrain dot aerifal dot cx>
On Thu, 2013-05-09 at 10:02 -0400, Rich Felker wrote:
> On Thu, May 09, 2013 at 10:39:25AM +0200, Torvald Riegel wrote:
> > > However, the idea is that pthread_once only runs
> > > init routines a small finite number of times, so even if you had to do
> > > some horrible hack that makes the synchronization on return 1000x
> > > slower (e.g. a syscall), it would still be better than incurring the
> > > cost of a full acquire barrier in each subsequent call, which ideally
> > > should have the same cost as a call to an empty function.
> >
> > That would be true if non-first calls appear
> > 1000*(syscall_overhead/acquire_mbar_overhead) times. But do they?
>
> In theory they might. Imagine a math function that might be called
> millions or billions of times, but which depends on a precomputed
> table. Personally, my view of best-practices is that you should use
> 'static const' for such tables, even if they're huge, rather than
> runtime generation, but unfortunately I think my view is still a
> minority one...
>
> Also, keep in mind that even large overhead on the first call to
> pthread_once is likely to be small in comparison to the time spent in
> the initialization function, while even small overhead is huge in
> comparison to a call to pthread_once that doesn't call the
> initialization function.
>
> > I think the way forward here is to:
> > 1) Fix the implementation (ie, add the mbars).
> > 2) Let the arch maintainers of the affected archs with weak memory models
> > (or people interested in this) look at this and come up with some
> > measurements for how much overhead the mbars actually present in real
> > code.
> > 3) Decide whether this overhead justifies adding optimizations.
> >
> > This patch is step 1. I don't think we need to merge this with step 3.
>
> I think this is a reasonable approach.
>
> > > > > Since it's impossible to track whether a call is the first call in a
> > > > > given thread
> > > >
> > > > Are you sure about this? :)
> > >
> > > It's impossible with bounded memory requirements, and thus impossible
> > > in general (allocating memory for the tracking might fail).
> >
> > I believe you are thinking of tracking more than you actually need
> > to know. All you need to know is whether a thread established a
> > happens-before with whoever initialized the once_control in the past.
> > So you do need per-thread state, and per-once_control state, but not
> > necessarily more. If in doubt, you can still do the acquire barrier.
>
> The number of threads and the number of once controls are both
> unbounded.
They are bounded by the available memory :) So if you can do with a
fixed amount of data in both thread state and once_control state, you
should be fine.
> You might be able to solve the problem with serial numbers if
> there were room to store a sufficiently large one in the once control,
> but the once control is 32-bit and the serial numbers could (in a
> pathological but valid application) easily overflow 32 bits.
The overflow can be an issue, but in that case I guess you can still try
to detect an overflow globally using global state, and just do the
acquire barrier in this case.
Informally, one can try to trade off a comparison of state in
once_control with a TLS variable; if that is significantly faster than
an acquire barrier, it can be useful; if it's about the same, it doesn't
make sense.
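
A minimal sketch of that trade-off in C11, not glibc's implementation. All
names here (my_once_t, my_once, global_serial, seen_serial, slow_lock,
build_table) are hypothetical. It assumes a single global lock for the slow
path, so an init routine that itself calls my_once would deadlock, and it
handles serial-number exhaustion by falling back to the acquire load for
every later once control.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ONCE_NOT_DONE       0u          /* init has not completed yet */
#define ONCE_DONE_NO_SERIAL UINT32_MAX  /* done, but serials ran out */

typedef struct { _Atomic uint32_t state; } my_once_t;  /* hypothetical */

/* Number of serials handed out so far; saturates at UINT32_MAX - 1. */
static _Atomic uint32_t global_serial;
/* Per-thread: last value of global_serial this thread acquire-loaded. */
static _Thread_local uint32_t seen_serial;
static pthread_mutex_t slow_lock = PTHREAD_MUTEX_INITIALIZER;

static void my_once(my_once_t *once, void (*init)(void))
{
    assert(once != NULL && init != NULL);

    /* Fast path: relaxed load plus a compare against TLS, no barrier.
       If this thread already acquire-loaded global_serial at a value
       >= s, it synchronized with the initializer that was given s. */
    uint32_t s = atomic_load_explicit(&once->state, memory_order_relaxed);
    if (s != ONCE_NOT_DONE && s <= seen_serial)
        return;

    /* "If in doubt": fall back to the acquire load. */
    s = atomic_load_explicit(&once->state, memory_order_acquire);
    if (s == ONCE_NOT_DONE) {
        pthread_mutex_lock(&slow_lock);
        s = atomic_load_explicit(&once->state, memory_order_relaxed);
        if (s == ONCE_NOT_DONE) {
            init();
            uint32_t cur = atomic_load_explicit(&global_serial,
                                                memory_order_relaxed);
            if (cur < UINT32_MAX - 1)
                s = atomic_fetch_add_explicit(&global_serial, 1,
                                              memory_order_release) + 1;
            else
                s = ONCE_DONE_NO_SERIAL;  /* overflow: always acquire */
            atomic_store_explicit(&once->state, s, memory_order_release);
        }
        pthread_mutex_unlock(&slow_lock);
    }

    /* Remember how far this thread has synchronized. */
    seen_serial = atomic_load_explicit(&global_serial, memory_order_acquire);
}

/* Hypothetical usage: a table that is built once. */
static int init_calls;
static void build_table(void) { init_calls++; }
static my_once_t table_once = { ONCE_NOT_DONE };
static void use_table(void) { my_once(&table_once, build_table); }
```

Whether the fast path pays off depends on whether the relaxed load, the TLS
access and the compare are measurably cheaper than the acquire load on the
target architecture; where they are not, the plain acquire version is the
simpler choice.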
> > > I think my confusion is merely that POSIX does not define the phrase
> > > "synchronize memory", and in the absence of a definition, "full memory
> > > barrier" (both release and acquire semantics) is the only reasonable
> > > interpretation I can find. In other words, it seems like a
> > > pathological conforming program could attempt to use the language in
> > > the specification to use pthread_once as a release barrier. I'm not
> > > sure if there are ways this could be meaningfully arranged (i.e. with
> > > well-defined ordering); off-hand, I would think tricks with cancelling
> > > an in-progress invocation of pthread_once might make it possible.
> >
> > I agree that the absence of a proper memory model makes reasoning about
> > some of this hard. I guess it would be best if POSIX would just endorse
> > C11's memory model, and specify the intended semantics in relation to
> > this model where needed.
>
> Agreed, and I suspect this is what they'll do. I can raise the issue,
> but perhaps you'd be better at expressing it. Let me know if you'd
> rather I do it.
I have no idea how the POSIX folks would feel about this. After all, it
would create quite a dependency for POSIX. With that in mind, trying to
resolve this isn't very high on my todo list. If people think
that this would be beneficial for how we deal with POSIX
requirements, or for our users to understand the POSIX requirements
better, I can definitely try to follow up on this. If you want to go
ahead and start discussing with them, please do so (please CC me on the
tracker bug).
Torvald