This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Massive performance regression of glibc string functions
- From: Petr Baudis <pasky at suse dot cz>
- To: hjl dot tools at gmail dot com, drepper at sourceware dot org
- Cc: libc-alpha at sourceware dot org, matz at suse dot de
- Date: Fri, 6 Nov 2009 14:04:09 +0100
- Subject: Massive performance regression of glibc string functions
Hi!
I have been benchmarking several string functions and discovered that
some of them are *much* slower than in the past; the regressions are
measured against glibc-2.9. I'm testing on small strings (4..128, though
for 128 a much bigger sample of calls would be needed for a good
comparison), following the common wisdom that operations on small
strings are the bulk of the calls.
In case of strlen(), there seems to be a regression only with very small
strings on AMD, so this is probably fine.
In case of memcmp(), strcmp() and strncmp(), glibc-2.10.1 seems to
improve performance somewhat, especially for larger strings, but
glibc-2.11 shows a massive performance drop across all vendors!
(Interestingly, glibc-2.10.1 is also slightly slower than glibc-2.9 in
these functions on Core i7.)
I'd like to ask how the string routine changes were benchmarked, for
which architectures and string sizes they are supposed to be optimized,
and why. I think it would be good to do something about this
regression. ;-)
For the benchmarking, I'm using
http://pasky.or.cz/~pasky/dev/glibc/strbench/
that I quickly hacked together. Here is the data I have collected
on various x86_64 systems, running with 2048 iterations; apply
reasonable error margins, of course:
model name : AMD Opteron (tm) Processor 848
cache size : 1024 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good nopl
func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4 5.630000 6.890000 7.060000 5.660000
strlen8 4.940000 3.580000 3.700000 4.170000
strlen32 2.220000 1.340000 1.490000 2.310000
strlen128 1.220000 0.830000 0.900000 1.330000
memcmp4 3.350000 3.330000 4.400000 3.310000
memcmp8 1.840000 1.740000 2.660000 2.140000
memcmp32 0.970000 0.800000 1.770000 1.300000
memcmp128 0.330000 0.310000 1.050000 0.650000
strcmp4 2.400000 2.290000 5.620000 2.470000
strcmp8 1.600000 1.280000 3.260000 1.560000
strcmp32 0.950000 0.600000 1.630000 0.870000
strcmp128 0.350000 0.210000 1.010000 0.310000
strncmp4 2.560000 2.250000 5.880000 2.960000
strncmp8 1.400000 1.410000 3.230000 1.700000
strncmp32 0.710000 0.770000 1.370000 0.940000
strncmp128 0.270000 0.270000 0.670000 0.350000
model name : Dual Core AMD Opteron(tm) Processor 165
cache size : 1024 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4 6.780000 8.350000 8.580000 6.850000
strlen8 5.920000 4.300000 4.420000 5.010000
strlen32 2.570000 1.440000 1.430000 2.660000
strlen128 1.260000 0.910000 0.850000 1.240000
memcmp4 3.960000 4.040000 5.160000 2.840000
memcmp8 2.020000 2.060000 3.000000 1.890000
memcmp32 0.770000 0.720000 1.350000 0.980000
memcmp128 0.260000 0.240000 0.540000 0.430000
strcmp4 2.740000 2.750000 6.790000 2.910000
strcmp8 1.410000 1.410000 3.600000 1.620000
strcmp32 0.630000 0.580000 1.260000 0.700000
strcmp128 0.200000 0.180000 0.620000 0.230000
strncmp4 3.080000 2.720000 7.180000 3.540000
strncmp8 1.580000 1.440000 3.940000 1.880000
strncmp32 0.720000 0.670000 1.310000 0.840000
strncmp128 0.240000 0.220000 0.550000 0.280000
model name : Intel(R) Xeon(R) CPU X3220 @ 2.40GHz
cache size : 4096 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4 3.870000 3.050000 3.270000 3.870000
strlen8 2.370000 1.530000 1.640000 3.450000
strlen32 1.040000 0.480000 0.470000 1.520000
strlen128 0.600000 0.290000 0.280000 0.680000
memcmp4 2.080000 2.260000 2.680000 1.800000
memcmp8 1.040000 1.130000 1.460000 1.860000
memcmp32 0.270000 0.270000 0.350000 0.770000
memcmp128 0.070000 0.070000 0.090000 0.190000
strcmp4 1.910000 1.910000 3.480000 1.920000
strcmp8 0.960000 0.950000 1.200000 0.960000
strcmp32 0.240000 0.240000 0.290000 0.240000
strcmp128 0.060000 0.060000 0.080000 0.060000
strncmp4 2.030000 1.690000 4.240000 2.810000
strncmp8 1.020000 0.850000 1.610000 1.410000
strncmp32 0.260000 0.210000 0.380000 0.360000
strncmp128 0.070000 0.060000 0.100000 0.080000
model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
cache size : 6144 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4 3.090000 2.960000 2.750000 3.450000
strlen8 1.890000 1.230000 1.360000 3.140000
strlen32 0.810000 0.370000 0.340000 1.220000
strlen128 0.460000 0.220000 0.200000 0.660000
memcmp4 2.160000 1.820000 2.500000 1.800000
memcmp8 1.100000 0.910000 1.500000 1.170000
memcmp32 0.310000 0.220000 0.320000 0.380000
memcmp128 0.090000 0.060000 0.090000 0.110000
strcmp4 1.860000 1.910000 3.530000 1.570000
strcmp8 0.960000 0.960000 1.170000 0.840000
strcmp32 0.280000 0.250000 0.300000 0.270000
strcmp128 0.050000 0.050000 0.090000 0.070000
strncmp4 1.740000 1.750000 3.790000 2.840000
strncmp8 0.940000 0.850000 1.380000 1.380000
strncmp32 0.220000 0.220000 0.320000 0.400000
strncmp128 0.050000 0.050000 0.090000 0.080000
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
cache size : 8192 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida
func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4 3.440000 3.500000 2.780000 3.320000
strlen8 2.260000 1.750000 1.440000 2.220000
strlen32 0.850000 0.500000 0.380000 0.900000
strlen128 0.470000 0.260000 0.200000 0.500000
memcmp4 2.180000 2.060000 2.500000 1.840000
memcmp8 1.100000 1.050000 1.320000 1.060000
memcmp32 0.270000 0.260000 0.350000 0.330000
memcmp128 0.080000 0.070000 0.090000 0.090000
strcmp4 1.660000 1.930000 2.250000 1.640000
strcmp8 0.830000 0.970000 1.140000 0.840000
strcmp32 0.210000 0.240000 0.240000 0.210000
strcmp128 0.050000 0.070000 0.080000 0.060000
strncmp4 1.740000 1.830000 2.490000 2.570000
strncmp8 0.870000 0.920000 1.220000 1.300000
strncmp32 0.220000 0.230000 0.260000 0.320000
strncmp128 0.050000 0.050000 0.090000 0.080000
* numbers after function names indicate string sizes
** 2.11-amd is a very old AMD-provided x86_64 string routines patch
(it doesn't implement some of the new things like bounded pointer
checks support) that we still use in SUSE glibc:
http://pasky.or.cz/~pasky/dev/glibc/amd64-string-2.11.diff
If the regression against 2.10.1 is fixed, it is probably not very
interesting; it performs better only on very short memcmp()s.
*** I can't seem to find newer AMD processors to test on right now,
sorry. If you have any, feel free to run the benchmark there - just
get the /strbench/ directory and run `./strbench.sh outfile`.
Kind regards,
--
Petr "Pasky" Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth