Bug 31859

Summary: Transliteration rules with two input characters like "ḌḌ" "DDH" do not work.
Product: glibc Reporter: Mike FABIAN <maiku.fabian>
Component: locale Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED    
Severity: normal CC: carlos, fweimer, maiku.fabian
Priority: P2 Flags: carlos: security-
Version: 2.39   
Target Milestone: 2.41   
Host: Target:
Build: Last reconfirmed:

Description Mike FABIAN 2024-06-07 13:44:01 UTC
See: https://sourceware.org/pipermail/libc-alpha/2024-May/156769.html

If transliteration rules like this:

translit_start
"ḌḌ" "DDH"
"ḍḍ" "ddh"
"Ḍḍ" "Ddh"
translit_end

are used in the LC_CTYPE section of a locale, they don’t work.

These are in our new scn_IT locale, but commented out for the moment because they do not work.

If localedata/locales/translit_combining is not changed, the rules for the single characters Ḍ U+1E0C and ḍ U+1E0D from translit_combining always won when I tested; the longer input sequences "ḌḌ", "ḍḍ", and "Ḍḍ" were never used.

But when I commented out these single-character transliteration rules in translit_combining like this:

diff --git a/localedata/locales/translit_combining b/localedata/locales/translit_combining
index ce2f19eee1..6f879d9caf 100644
--- a/localedata/locales/translit_combining
+++ b/localedata/locales/translit_combining
@@ -2486,9 +2486,9 @@ translit_start
 % LATIN SMALL LETTER D WITH DOT ABOVE
 <U1E0B> <U0064>
 % LATIN CAPITAL LETTER D WITH DOT BELOW
-<U1E0C> <U0044>
+%<U1E0C> <U0044>
 % LATIN SMALL LETTER D WITH DOT BELOW
-<U1E0D> <U0064>
+%<U1E0D> <U0064>
 % LATIN CAPITAL LETTER D WITH LINE BELOW
 <U1E0E> <U0044>
 % LAT


then

bash-5.2# echo 'ḌḌ'|iconv -f UTF-8 -t ASCII//translit
^C
bash-5.2#

uses 100% CPU and never stops until I stop it with Control-C.
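
Because the command above never returns on its own, it can help to run the reproduction under a time limit instead of relying on Control-C. The following is only a minimal sketch: the helper name run_with_timeout and the 5 second limit are hypothetical, and it assumes an iconv binary on PATH and a locale environment that already contains the rules under test.

```python
import subprocess


def run_with_timeout(cmd, data, seconds):
    """Run cmd with data on stdin; report whether it exited or timed out.

    Hypothetical helper for this report, not part of glibc or its test suite.
    subprocess.run() kills the child process when the timeout expires.
    """
    try:
        proc = subprocess.run(cmd, input=data, capture_output=True,
                              timeout=seconds)
    except subprocess.TimeoutExpired:
        # The converter did not finish in time: treat this as the hang.
        return "timeout", b""
    return "exited", proc.stdout


# Example call (assumes the locale under test is active in the environment):
#   cmd = ["iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT"]
#   status, out = run_with_timeout(cmd, "\u1e0c\u1e0c\n".encode("utf-8"), 5.0)
#   print(status, out)
```

A "timeout" status would correspond to the 100% CPU loop described here; an "exited" status with some output would mean the conversion terminated.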
Comment 2 Carlos O'Donell 2024-08-16 12:41:04 UTC
commit 1b0a2062c8938c7333cd118d85d9976c4e7c92af
Author: Andreas Schwab <schwab@suse.de>
Date:   Mon Jun 10 12:19:17 2024 +0200

    iconv: Fix matching of multi-character transliterations (bug 31859)
    
    Only return __GCONV_INCOMPLETE_INPUT for a partial match when the end of
    the input buffer is reached.  Otherwise it is a non-match, and other
    patterns should be tried.
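
The rule stated in the commit message can be illustrated with a small model. This is a simplified sketch, not the glibc code: the names Result, match_rule and lookup are hypothetical, rules are tried in list order (the real table lookup in glibc is organized differently), and patterns are assumed to be non-empty.

```python
from enum import Enum


class Result(Enum):
    MATCH = "match"
    NO_MATCH = "no match"
    INCOMPLETE = "incomplete input"  # stands in for __GCONV_INCOMPLETE_INPUT


def match_rule(pattern, buf, pos):
    """Compare one rule's input sequence against buf, starting at pos."""
    cnt = 0
    while cnt < len(pattern) and pos + cnt < len(buf):
        if pattern[cnt] != buf[pos + cnt]:
            break
        cnt += 1
    if cnt == len(pattern):
        return Result.MATCH
    # A partial match is "incomplete" only if the buffer ended first.
    if cnt > 0 and pos + cnt == len(buf):
        return Result.INCOMPLETE
    # Otherwise it is a plain non-match, so other rules can be tried.
    return Result.NO_MATCH


def lookup(rules, buf, pos):
    """Try each (pattern, replacement) pair in order at position pos."""
    for pattern, replacement in rules:
        result = match_rule(pattern, buf, pos)
        if result is Result.MATCH:
            return result, replacement
        if result is Result.INCOMPLETE:
            return result, None
    return Result.NO_MATCH, None


# U+1E0C is LATIN CAPITAL LETTER D WITH DOT BELOW.
RULES = [("\u1e0c\u1e0c", "DDH"), ("\u1e0c", "D")]
```

In this model, lookup(RULES, "\u1e0cx", 0) falls through from the two-character rule to the single-character rule, while lookup(RULES, "\u1e0c", 0) reports incomplete input because the buffer ends after a partial match.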
Comment 3 Carlos O'Donell 2024-08-16 12:50:28 UTC
In general it might have been possible to cause service breakage by building a custom locale with these transliterations, enabling the locale on a server, and then attempting to process these conversions with the locale enabled. However, since glibc didn't ship such a locale, this would be a failure in testing for the developer using the custom locale. There is no actual, concrete, non-synthetic scenario reported here, so I'm marking this security- for the hang in the converter.