This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Fixes tree-loop-distribute-patterns issues
- From: Roland McGrath <roland at hack dot frob dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, Carlos O'Donell <carlos at redhat dot com>, "GNU C. Library" <libc-alpha at sourceware dot org>, Siddhesh Poyarekar <siddhesh at redhat dot com>
- Date: Thu, 20 Jun 2013 13:59:19 -0700 (PDT)
- Subject: Re: [PATCH] Fixes tree-loop-distribute-patterns issues
- References: <51C0AFB7 dot 1060009 at linux dot vnet dot ibm dot com> <20130618205608 dot 9CCE22C0AC at topped-with-meat dot com> <51C1BFE9 dot 4070805 at linux dot vnet dot ibm dot com> <51C1CEFC dot 9000100 at redhat dot com> <51C1FE4C dot 3020400 at linux dot vnet dot ibm dot com> <20130619221130 dot 7B91A2C10E at topped-with-meat dot com> <51C31177 dot 90303 at linux dot vnet dot ibm dot com> <20130620175832 dot 0E6FA2C133 at topped-with-meat dot com> <20130620213141 dot GA4833 at domone dot kolej dot mff dot cuni dot cz>
> Actually you should split simple_* to separate files and compile them with
> O0.
__attribute__ ((optimize ("O0"))) is sufficient in compilers that support
it (GCC 4.4 and later, I believe) and less hassle than breaking up files.
I don't think anyone does or should care about performance analysis using
compilers too old to have it.
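For illustration, the per-function attribute could be applied roughly as in
the sketch below. The names simple_memset, check_memset and the NO_OPT macro
are hypothetical stand-ins chosen for this example, not the actual glibc test
sources, and the fallback branch simply assumes a compiler without the
attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* NO_OPT is a hypothetical helper macro for this sketch.  Clang also
   defines __GNUC__ but does not implement the optimize attribute, so
   it is excluded here.  */
#if defined __GNUC__ && !defined __clang__
# define NO_OPT __attribute__ ((optimize ("O0")))
#else
# define NO_OPT /* No per-function override available.  */
#endif

/* Byte-at-a-time reference routine, compiled without optimization
   where supported so the loop is not rewritten into a library call.  */
NO_OPT void *
simple_memset (void *s, int c, size_t n)
{
  unsigned char *p = s;
  for (size_t i = 0; i < n; ++i)
    p[i] = (unsigned char) c;
  return s;
}

/* Compare the reference routine against the library memset on a
   small buffer.  Returns 1 when both produce the same bytes.  */
int
check_memset (int c, size_t n)
{
  unsigned char expected[64], actual[64];
  assert (n <= sizeof expected);
  memset (expected, c, n);
  simple_memset (actual, c, n);
  return memcmp (expected, actual, n) == 0;
}
```

The rest of the file keeps whatever optimization level it was built with;
only the annotated function is affected.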
> Doing otherwise makes their performance dependent on gcc version and
> this makes results even more unreliable.
Perhaps that matters for benchtests, if they are intended to use the
simple_* implementations' performance as a baseline for comparison. The
correctness tests (i.e. all tests outside benchtests/) do not care about
that, and that's all I'm personally concerned with.
If what you want as a performance baseline is "the obvious loop handling a
byte at a time", then -O0 code can easily be substantially worse than this
and give a misleading impression of what naive code would actually do.
With -O0, the compiler is exceedingly stupid (by design), and usually every
operation has excess spill and reload operations, which could easily
dominate the performance of what would otherwise be a very tight loop.
Short of hand-coding naive assembly for each machine, I'm not sure how you
can robustly address that issue. Perhaps -O1 is a good match for the
assembly a human would write when not trying to be especially clever,
but that's just a shot in the dark.
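One way to look at this empirically is to build the same naive loop at
several optimization levels and compare. The following is only a sketch:
simple_strlen and time_simple_strlen are hypothetical names for this example,
and clock () gives a coarse CPU-time figure, not a careful benchmark.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical byte-at-a-time baseline, not taken from glibc.  */
size_t
simple_strlen (const char *s)
{
  const char *p = s;
  while (*p != '\0')
    ++p;
  return (size_t) (p - s);
}

/* Return the CPU seconds spent in ITERS calls of simple_strlen.
   The pointer is re-read through a volatile object on every
   iteration so an optimizing build cannot hoist the call out of
   the loop, and the results go to a volatile sink so they are
   not discarded.  */
double
time_simple_strlen (const char *s, unsigned long iters)
{
  const char *volatile src = s;
  volatile size_t sink = 0;
  clock_t start = clock ();
  for (unsigned long i = 0; i < iters; ++i)
    sink += simple_strlen (src);
  (void) sink;
  return (double) (clock () - start) / CLOCKS_PER_SEC;
}
```

Building this file once with gcc -O0 and once with gcc -O1, and calling
time_simple_strlen from a small driver, shows how far apart the two baselines
are on a given machine. Running gcc -S at each level makes the extra stack
traffic in the -O0 loop visible in the generated assembly.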
Thanks,
Roland