This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC PATCH] Single threaded stdio optimization


On Tue, 2017-05-16 at 16:42 +0100, Szabolcs Nagy wrote:
> Locking overhead can be significant in some stdio operations
> which are common in single-threaded applications, so it makes
> sense to optimize that case.
> 
> there are two basic approaches:
> 1) avoid the use of expensive atomics in the lowlevellock code
> while there is only one thread (x86 already implements this).
> 2) jump to the _unlocked variant of an stdio function in case
> the file needs no locking (the process will be single threaded
> until the end of the stdio function).
> 
> this patch is an incomplete implementation of 2) which is target
> independent and improves stdio performance more (however it does
> not affect lock usage in malloc for example).
> 
> issues not tackled:
> 
> - files with _IO_USER_LOCK flag set could use the same mechanism
> which would mean less checks.
> - malloc interposition is not handled yet. whenever (non-as-safe)
> user code may run between flockfile and funlockfile, the optimization
> must be disabled in case a thread is created; i just don't know
> what's the best way to detect malloc interposition at libc startup.
> - i used a new libc symbol (_IO_enable_locks) that pthread_create
> can call to enable the stdio locks, there might be a better way.
> (abilists are not updated yet).
> - stdio has various configurations that i did not test (non-linux
> or non-multi-threaded setups).
> 
> my question is whether this approach is acceptable, or whether the
> target-specific lowlevellock optimization (as x86 does it) is preferred.

I think approach 2) is much better.  If we want to avoid synchronization
costs, we should try to do it at the highest level so that we can also
optimize the users of synchronization (e.g., just for illustrative
purposes, avoiding both the lock and the unlock call with just one
branch).

Once we've implemented that approach for all clients of lowlevellock
where this might matter, I'd also remove the x86-specific optimization
(at which point we could remove the x86-specific lll implementation too,
I believe, and use the generic one).  For example, do something similar
in malloc (or do something else that makes it unlikely that
synchronization is used if that's not necessary).  In contrast, in
pthread_mutex*, IMO we can assume that the caller wouldn't use these
unless the program really is multi-threaded; I know others have voiced
other opinions, but even if we don't want to assume that, optimizing for
single-threaded use in pthread_mutex_* would still be better than doing
it in the low-level lock.

