This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


Re: RFC: IDN support in getaddrinfo().


Thanks for your thoughts.

Ulrich Drepper <drepper@redhat.com> writes:

> Simon Josefsson wrote:
>> Continuing an old thread regarding support for Internationalized
>> Domain Names (IDN) in glibc, prompted by the adaption of my glibc
>> patches by developers from some Linux distributions, I'd like to
>> formalize my ideas in a proposal for extending the getaddrinfo() API.
>
> The problem I have with this is: we do not have the idn code in glibc.
> It is big, and changing, which makes me not wanting to add this.  And
> getaddrinfo is core functionality.  Requiring some external code for it
> to work is undesirable.  The interface might change or whatever other
> incompatibilities can arise.  This is highly unpleasant.

Right.  Still, the specifications are not likely to change at this
point, and if they do it wouldn't happen overnight, but rather take
years.  They are published RFCs at Proposed Standard level, and are
currently being revised for Draft Standard level (only editorial
changes).  So the code changes at this point are bug fixes, or feature
additions unrelated to IDN (which could be stripped out of libc).

I have to admit the libidn API has been changing somewhat, but mostly
the reason has been my own inexperience in designing good C APIs.
Still, if libidn were part of libc, I believe it would be best to
only advertise the getaddrinfo interface to it, and wait a year or two
before exporting the full libidn API (if ever).  Having the libidn
API available via libc does have some benefits, though, because there
are many programs that will need stringprep functionality in the
future (e.g., iSCSI, XMPP instant messaging, Kerberos, SASL).

Applications need to explicitly ask for IDN functionality, so it is
not something that would likely get in the way of existing code
either.

I have been thinking about a dlopen() approach, to reduce the code
size in libc.  E.g., when the application requests IDN, libc tries to
dlopen("idn").  The IDN code patched into libc would then only amount
to, say, less than 100 lines.  Any thoughts on this?  Is it feasible
at all?
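
To make the dlopen() idea concrete, here is a rough sketch of what
the lazy-loading glue could look like.  It is only an illustration
under assumptions: the helper name idn_to_ascii_lazy and the list of
library file names are made up for this example, and the only libidn
symbol it relies on is idna_to_ascii_lz().  It is not thread-safe and
is not what a real libc patch would look like.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Signature of libidn's idna_to_ascii_lz(): locale-encoded input,
   malloc'd ASCII-compatible output, 0 on success. */
typedef int (*to_ascii_fn)(const char *input, char **output, int flags);

/* Hypothetical helper (not a libc or libidn function).
   Returns 0 on success and stores a malloc'd string in *ace;
   returns -1 if libidn cannot be loaded or the conversion fails. */
int idn_to_ascii_lazy(const char *name, char **ace)
{
    /* Assumed file names; the real soname depends on the installed libidn. */
    static const char *const candidates[] = {
        "libidn.so.12", "libidn.so.11", "libidn.so"
    };
    static void *handle;          /* cached across calls, never dlclose'd */
    static to_ascii_fn to_ascii;
    size_t i;

    *ace = NULL;
    if (handle == NULL) {
        for (i = 0; handle == NULL
                    && i < sizeof candidates / sizeof candidates[0]; i++)
            handle = dlopen(candidates[i], RTLD_LAZY);
        if (handle == NULL)
            return -1;            /* IDN requested but libidn not installed */
        /* POSIX idiom for storing dlsym()'s result in a function pointer. */
        *(void **)&to_ascii = dlsym(handle, "idna_to_ascii_lz");
    }
    if (to_ascii == NULL)
        return -1;
    if (to_ascii(name, ace, 0) != 0) {
        *ace = NULL;
        return -1;
    }
    return 0;
}
```

A caller would only reach this path when the application has asked
for IDN, so programs that never request it pay nothing beyond the
size of the glue itself.  On older systems the sketch needs -ldl at
link time.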

> I do see that this form of the interface is nice.  So my questions are:
>
> ~ do you need all of the libidn interface to implement the suggested
>   getaddrinfo extension?

There are some functions that aren't called, but they don't contribute
any substantial code size.  The minimum number of API functions
required is 2 (for punycode encode/decode) + 1 (stringprep) + 2 (IDNA
ToASCII and ToUnicode) = 5, but some utility functions to convert
between UTF-8 and UCS-4 are used internally, so make it ~10 API
functions.  (Perhaps those functions already exist elsewhere in libc
though?)
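
For a sense of how small such a conversion utility is, here is a
minimal sketch of a UTF-8 to UCS-4 decoder.  The function name is
hypothetical; it is neither the libidn function nor a glibc-internal
interface, just an illustration of the kind of helper meant above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into UCS-4.
   Returns the number of code points written, or -1 on malformed
   input (including overlong forms and surrogates) or if out is full. */
long utf8_to_ucs4_sketch(const char *in, uint32_t *out, size_t outlen)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    while (*p != 0) {
        uint32_t cp, min;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return -1;           /* stray continuation or invalid lead byte */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return -1;        /* also catches a premature NUL */
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;            /* overlong, out of range, or surrogate */
        if (n >= outlen)
            return -1;
        out[n++] = cp;
    }
    return (long)n;
}
```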

The current libidn API consists of 27 functions.  Most of the
additional functions are wrappers around the core functions that take
input in locale or UCS-4 format, and convert output to locale or UCS-4
format.

Basically, five separate functionalities are needed to implement IDN:
charset conversion (locale->UTF-8, UTF-8->UCS-4, etc), punycode,
Unicode NFKC normalization, stringprep and nameprep.
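
Of these five, punycode is the one that is easiest to show in
isolation.  Below is a compact sketch of the RFC 3492 encoding
procedure, written from the RFC's description rather than copied from
libidn's punycode.c; the function name is made up for this example,
and it omits the mixed-case annotation support.  The "xn--" prefix is
added by the IDNA ToASCII step, not here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bootstring parameters for Punycode (RFC 3492, section 5). */
#define BASE 36u
#define TMIN 1u
#define TMAX 26u
#define SKEW 38u
#define DAMP 700u
#define INITIAL_BIAS 72u
#define INITIAL_N 0x80u

/* Digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'. */
static char digit(uint32_t d)
{
    return (char)(d < 26 ? 'a' + d : '0' + (d - 26));
}

/* Bias adaptation (RFC 3492, section 6.1). */
static uint32_t adapt(uint32_t delta, uint32_t numpoints, int first)
{
    uint32_t k = 0;

    delta = first ? delta / DAMP : delta / 2;
    delta += delta / numpoints;
    while (delta > ((BASE - TMIN) * TMAX) / 2) {
        delta /= BASE - TMIN;
        k += BASE;
    }
    return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Hypothetical helper: encode inlen UCS-4 code points as a
   NUL-terminated Punycode string.  Returns 0 on success, -1 if
   out is too small or the delta computation would overflow. */
int punycode_encode_sketch(const uint32_t *in, size_t inlen,
                           char *out, size_t outlen)
{
    uint32_t n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
    uint32_t h, b, m, q, k, t;
    size_t i, o = 0;

    for (i = 0; i < inlen; i++)       /* copy basic (ASCII) code points */
        if (in[i] < 0x80) {
            if (o + 2 > outlen) return -1;
            out[o++] = (char)in[i];
        }
    h = b = (uint32_t)o;
    if (b > 0) {                      /* delimiter after the basic part */
        if (o + 2 > outlen) return -1;
        out[o++] = '-';
    }
    while (h < inlen) {
        /* m = smallest code point >= n that is not yet encoded */
        for (m = UINT32_MAX, i = 0; i < inlen; i++)
            if (in[i] >= n && in[i] < m) m = in[i];
        if (m - n > (UINT32_MAX - delta) / (h + 1)) return -1;
        delta += (m - n) * (h + 1);
        n = m;
        for (i = 0; i < inlen; i++) {
            if (in[i] < n && ++delta == 0) return -1;
            if (in[i] != n) continue;
            /* emit delta as a generalized variable-length integer */
            for (q = delta, k = BASE; ; k += BASE) {
                t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
                if (q < t) break;
                if (o + 2 > outlen) return -1;
                out[o++] = digit(t + (q - t) % (BASE - t));
                q = (q - t) / (BASE - t);
            }
            if (o + 2 > outlen) return -1;
            out[o++] = digit(q);
            bias = adapt(delta, h + 1, h == b);
            delta = 0;
            h++;
        }
        delta++;
        n++;
    }
    if (o >= outlen) return -1;
    out[o] = '\0';
    return 0;
}
```

Feeding it the code points of "bücher" (U+0062 U+00FC U+0063 U+0068
U+0065 U+0072) yields "bcher-kva".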

Libidn currently supports non-IDN related stringprep profiles as well,
but they re-use the IDN-related stringprep tables.  They add only
about 20 lines of initialization in a static const table (100-200
bytes?  Dunno.)

> ~ what is the size of the absolutely minimum amount of code (source
>   and object file)

Self-contained, portable C89 source code; the only external
requirements are iconv and nl_langinfo (the lib/ directory of libidn):

    795 idna.c
   1058 nfkc.c
    309 profiles.c
    456 punycode.c
   3544 rfc3454.c      GENERATED from rfc3454.txt
    663 stringprep.c
    273 toutf8.c
    109 version.c
   9353 gunibreak.h    GENERATED from Unicode standard (from Glib)
    658 gunicomp.h     GENERATED from Unicode standard (from Glib)
  10362 gunidecomp.h   GENERATED from Unicode standard (from Glib)
    286 idn-int.h      GENERATED by autoconf to get 'uint32_t'.
     95 idna.h
    216 punycode.h
    211 stringprep.h
  28388 total

Most of the large files are generated, here are the "real" files:

jas@latte:~/src/libidn/lib$ wc -l idna.c nfkc.c profiles.c punycode.c stringprep.c toutf8.c idna.h  punycode.h stringprep.h
    795 idna.c
   1058 nfkc.c
    309 profiles.c
    456 punycode.c
    663 stringprep.c
    273 toutf8.c
     95 idna.h
    216 punycode.h
    211 stringprep.h
   4076 total

Also note that the files are heavily commented -- the manual is (in
parts) generated from the source code.

Here are the object sizes for Libidn built on GNU/Linux with GCC 3.3.2
and "-O2".

-rw-r--r--    1 jas      jas        192272 Nov 26 09:47 libidn.a

-rw-r--r--    1 jas      jas          5984 Nov 26 09:47 idna.o
-rw-r--r--    1 jas      jas         91520 Nov 26 09:47 nfkc.o
-rw-r--r--    1 jas      jas          7248 Nov 26 09:47 profiles.o
-rw-r--r--    1 jas      jas          2768 Nov 26 09:47 punycode.o
-rw-r--r--    1 jas      jas         76570 Nov 26 09:47 rfc3454.o
-rw-r--r--    1 jas      jas          4540 Nov 26 09:47 stringprep.o
-rw-r--r--    1 jas      jas          2784 Nov 26 09:47 toutf8.o
-rw-r--r--    1 jas      jas          1648 Nov 26 09:47 version.o

As you can see, the only significant parts are the Unicode NFKC
normalization tables and the RFC 3454 tables.  The Unicode NFKC
normalization tables come from Glib, and I haven't investigated how
much they could be optimized in size.  I believe the rfc3454 tables
could be optimized considerably, though.

Note that the Unicode tables must be Unicode version 3.2, so they are
not something that can be easily re-used from another library or
another part of libc, even if there are other NFKC tables on the
system somewhere.

> For the encoding conversion code, in glibc you'd have to use the
> glibc-internal interfaces, and not iconv() itself.

Thanks for the pointer.

Hope this helps,
Simon

