bash 5.2.21-1: a bug in [0-9] expansion

Sam Edge sam.edge.cygwin@gmx.com
Mon Sep 1 21:23:53 GMT 2025


On 01/09/2025 18:19, Brian Inglis via Cygwin wrote:
 > On 2025-08-31 13:06, Mariusz Wodzicki via Cygwin wrote:
 >> Description of the problem.
 >> [0-9]  picks also certain Unicode superscript characters ( namely,
 >> ⁰ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ), and every Unicode subscript character.
 >>
 >> Example: the directory has the following files:
 >> $ /bin/ls
 >> ₀.txt  ₁.txt  ₂.txt  ₃.txt  ₄.txt  ₅.txt  ₆.txt  ₇.txt ₈.txt  ₉.txt
 >> ⁰.txt  ¹.txt  ².txt  ³.txt  ⁴.txt  ⁵.txt  ⁶.txt  ⁷.txt ⁸.txt  ⁹.txt
 >>
 >> $ /bin/ls [0-9].txt
 >> ₀.txt  ₁.txt  ₃.txt  ⁴.txt  ⁵.txt  ⁶.txt  ⁷.txt  ⁸.txt
 >> ⁰.txt  ₂.txt  ₄.txt  ₅.txt  ₆.txt  ₇.txt  ₈.txt
 >>
 >> $ locale
 >> LANG=en_US.UTF-8
 >> LC_CTYPE="en_US.UTF-8"
 >> LC_NUMERIC="en_US.UTF-8"
 >> LC_TIME="en_US.UTF-8"
 >> LC_COLLATE="en_US.UTF-8"
 >> LC_MONETARY="en_US.UTF-8"
 >> LC_MESSAGES="en_US.UTF-8"
 >> LC_ALL=
 >>
 >> System.
 >> Fully up to date Windows 11
 >> cygwin 3.6.4-1
 >> bash    5.2.21-1
 >
 > For reproducible results prefix commands with LC_ALL=C … or possibly
 > just LC_COLLATE=C or LC_CTYPE=C or =POSIX to standardize the locale,
 > otherwise many commands will respect the current locale, and some
 > respect Unicode regardless of locale e.g. `info wc`:
 >
 > "Unless the environment variable ‘POSIXLY_CORRECT’ is set, GNU ‘wc’
 > treats the following Unicode characters as white space even if the
 > current locale does not: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE,
 > U+202F NARROW NO-BREAK SPACE, and U+2060 WORD JOINER."
 >
 > For GNU utilities, where info pages are preferred, such as
 > coreutils*, compiler and language processors, and tools packages, many
 > details do not appear in the man pages, for example:
 >
 > "Full documentation <https://www.gnu.org/software/coreutils/wc> or
 > available locally via: info '(coreutils) wc invocation'"
 >
 > although `info wc` shows the same page.
 >
 > —————
 > * [ arch b2sum base32 base64 basename cat chcon chgrp chmod chown
 > chroot cksum comm cp csplit cut date dd df dir dircolors dirname du echo
 > env expand expr factor false fmt fold gkill groups head hostid id
 > install join link ln logname ls md5sum mkdir mkfifo mknod mktemp mv nice
 > nl nohup nproc numfmt od paste pathchk pinky pr printenv printf ptx pwd
 > readlink realpath rm rmdir runcon seq sha1sum sha224sum sha256sum
 > sha384sum sha512sum shred shuf sleep sort split stat stdbuf stty sum
 > sync tac tail tee test timeout touch tr true truncate tsort tty uname
 > unexpand uniq unlink users vdir wc who whoami yes
 >

Bash is GNU but isn't part of coreutils as far as I know. Type 'man 
bash' and then read the 'Pattern Matching' section for its globbing 
behaviour.

TL;DR: For bash 5.2, any one of 'export LC_ALL=C.UTF-8' (a variant of the
LC_ALL=C that Brian suggests), 'export LC_COLLATE=C.UTF-8' or
'shopt -s globasciiranges' should revert to simple ASCII ranges for
'[0-9]', '[a-z]' etc.
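
A minimal sketch of that workaround, for anyone who wants to try it in a
scratch directory. The file names and the 'demo_dir' and 'matches'
variables are made up for illustration. It uses plain LC_ALL=C rather
than C.UTF-8, on the assumption that the C locale is available
everywhere while C.UTF-8 may not be:

```shell
#!/usr/bin/env bash
# Illustrative only: demo_dir, matches and the file names are hypothetical.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Three ASCII-digit names plus a few subscript/superscript look-alikes.
touch 0.txt 5.txt 9.txt ₀.txt ⁴.txt ⁹.txt

# Ask bash to treat bracket ranges as in the traditional C locale,
# and switch the shell itself to the C locale for good measure.
shopt -s globasciiranges
LC_ALL=C

# Expand the glob into an array so the result can be inspected.
matches=([0-9].txt)
printf '%s\n' "${matches[@]}"
# Expected to list only 0.txt, 5.txt and 9.txt.
```

Remove the scratch directory afterwards with 'rm -r "$demo_dir"'.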

I'm seeing the correct behaviour with up-to-date Cygwin bash/coreutils 
etc. by the way. 'echo [0-9]*' only expands out sub/super-digits if I 
use 'LC_COLLATE=en_GB.UTF-8' or similar with 'shopt -u globasciiranges'.
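
To compare setups, something like the following prints the two things
that seem to matter here: the state of the globasciiranges option and
the locale variables bash sees. This is only a rough diagnostic, and
the 'state' variable name is my own:

```shell
#!/usr/bin/env bash
# Report whether globasciiranges is currently set in this shell.
if shopt -q globasciiranges; then
    state=on
else
    state=off
fi
printf 'globasciiranges: %s\n' "$state"

# Show the locale variables that can influence range matching;
# unset variables are printed as empty strings.
printf 'LC_ALL=%s LC_COLLATE=%s LANG=%s\n' \
    "${LC_ALL-}" "${LC_COLLATE-}" "${LANG-}"
```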


-- 
Sam Edge


