Unicode width data inconsistent/outdated

Thomas Wolff towo@towo.net
Tue Aug 8 00:29:00 GMT 2017


Am 07.08.2017 um 23:29 schrieb Brian Inglis:
> On 2017-08-07 13:30, Thomas Wolff wrote:
>> Am 07.08.2017 um 21:07 schrieb Brian Inglis:
>>> Implementation considerations for handling the Unicode tables described in
>>>      http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
>>> and implemented in
>>>      https://www.strchr.com/multi-stage_tables
>>>
>>> ICU icu4[cj] uses a folded trie of the properties, where the unique property
>>> combinations are indexed, strings of those indices are generated for fixed size
>>> groups of character codes, unique values of those strings are then indexed, and
>>> those indices assigned to each character code group. The result is a multi-level
>>> indexing operation that returns the required property combination for each
>>> character.
>>>
>>> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>>>
>>>
>>> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
>>> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
>>> eliminate redundancy.
>>>
>>> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>>>
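(A toy illustration of the multi-stage lookup scheme described in the links above, not Cygwin's or ICU's actual tables: the code point is split into a block number and an offset; stage 1 maps each block to one of the few unique property rows stored in stage 2, so identical blocks share storage. Block size, tables, and width values here are made up for the example.)

```c
#include <assert.h>
#include <stdint.h>

enum { BLOCK_BITS = 4, BLOCK_SIZE = 1 << BLOCK_BITS };

/* Stage 2: unique rows of per-character width values.
 * Row 0: all width 1; row 1: all width 2 (standing in for a
 * CJK-style "wide" block).  Real tables have many mixed rows. */
static const uint8_t stage2[2][BLOCK_SIZE] = {
    { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 },
    { 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2 },
};

/* Stage 1: one index byte per 16-code-point block; this toy table
 * marks blocks 2 and 3 (U+0020..U+003F) as "wide" for the demo. */
static const uint8_t stage1[8] = { 0, 0, 1, 1, 0, 0, 0, 0 };

static int toy_width(uint32_t cp)
{
    uint32_t block = cp >> BLOCK_BITS;
    if (block >= sizeof stage1 / sizeof stage1[0])
        return 1;                       /* default outside the toy range */
    return stage2[stage1[block]][cp & (BLOCK_SIZE - 1)];
}
```

Because stage 2 holds only the unique rows, thousands of identical blocks collapse to a handful of rows plus one small index byte per block.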
>> Thanks for the interesting links, I'll check them out.
>> But such multi-level tables don't really help without a documented
>> procedure for updating them (one exists only for the lowest level, not
>> for the code-embedded levels).
> Unicode estimates that property tables can be reduced to 7-8KB using
> these techniques, including using minimal integer sizes for indices and
> array elements, e.g. char or short if you can keep the indices small,
> rather than pointers.
>
> Creation scripts used by PCRE and Python projects are linked from the bottom of
> the second link above. Source and docs for these packages and ICU is available
> under Cygwin, and FOX Toolkit is available in some distros and by FTP.
>
>> Also, as I've demonstrated, my more straightforward and more efficient
>> approach will even use less total space than the multi-level approach
>> if packed table entries are used.
> Unicode recommends the double table index approach as a means of
> eliminating the massive redundancy that exists in character property
> entries and character groups, using small integers instead of pointers;
> it can be optimized to meet conformance levels and platform speed and
> size limits, at the cost of an annual review of properties and a
> rebuild. The amount of redundancy removed by this approach is estimated
> in the FOX Toolkit doc and ranges across orders of magnitude.
> Unfortunately, none of these docs or sources quotes sizes for any
> Unicode release!
>
> My own first take on these was to use run-length encoded bitstrings for
> each binary property, similar to database bitmap indices. But the
> grouping of property blocks in Unicode, and their recommendation,
> persuaded me that their approach is likely backed by a bunch of
> supporting corps' and devs' R&D, and is similar to techniques used for
> decades in database query handling, where equivalence classes over
> small value sets reduce memory pressure while speeding up selections.
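(A toy illustration of the run-length encoded bitstring idea mentioned above, not any project's actual code: a binary property is stored as alternating run lengths, starting with an "off" run at code point 0, and membership is tested by walking the cumulative run boundaries. The run data here is invented for the example.)

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Alternating run lengths for a hypothetical binary property that
 * holds for U+0030..U+0039 and U+0100..U+017F; even-indexed runs
 * are "off", odd-indexed runs are "on". */
static const uint32_t runs[] = {
    0x30,            /* U+0000..U+002F: off */
    10,              /* U+0030..U+0039: on  */
    0x100 - 0x3A,    /* U+003A..U+00FF: off */
    0x80,            /* U+0100..U+017F: on  */
};

static int has_property(uint32_t cp)
{
    uint32_t base = 0;
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++) {
        if (cp < base + runs[i])
            return (int)(i & 1);   /* odd runs carry the property */
        base += runs[i];
    }
    return 0;                      /* past the last run: off */
}
```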
I am not quite sure what you're trying to suggest or recommend now, but
the thing is, I just wanted to get an update of the width data in the
first place, which is an easy and undisputed change; then Corinna
pointed out that the ctype functions are based on old Unicode data too,
so I made an attempt to update them as well. I use the approach that I
also use for two other projects (mined and mintty), and I didn't mean
for this to become a research project for me :/
I am certainly willing to consider specs and all that to achieve a
suitable result, but I don't feel like implementing a fancy algorithm
recommended by Unicode with an unconvincing rationale, especially after
I've calculated that my method uses even less memory.
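(For comparison, the straightforward approach referred to here is, as far as I can tell, a sorted table of code point ranges searched by bisection, in the spirit of Markus Kuhn's reference wcwidth(). A minimal sketch with made-up sample ranges, not a complete or current Unicode list:)

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sorted, non-overlapping [first, last] ranges sharing a property
 * (here: "double width").  Sample entries only. */
struct range { uint32_t first, last; };

static const struct range wide[] = {
    { 0x1100, 0x115F },   /* Hangul Jamo (sample) */
    { 0x4E00, 0x9FFF },   /* CJK Unified Ideographs (sample) */
    { 0xFF00, 0xFF60 },   /* Fullwidth forms (sample) */
};

/* Binary search over the range table: O(log n) lookups, and the
 * whole table costs just 8 bytes per range. */
static int is_wide(uint32_t cp)
{
    size_t lo = 0, hi = sizeof wide / sizeof wide[0];
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (cp < wide[mid].first)
            hi = mid;
        else if (cp > wide[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

Since the range list is regenerated mechanically from the Unicode data files, updating it for a new Unicode release is a plain table rebuild with no code changes.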
Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
