Re: Issues about N2401 (Code Conversion Facets)
"Alberto Ganesh Barbati" <AlbertoBarbati@libero.it> wrote in message
news:HjiIi.118527$U01.966046@twister1.libero.it...
> Hi Everybody,
>
> (for reference, this is about N2401
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2401.htm)
>
> -- Issue #1: Maxcode
>
> The Maxcode template parameter has practically only two reasonable
> values, namely 0xffff (for applications supporting the BMP only) and
> 0x10ffff (for applications supporting the entire Unicode range). It's
> very hard to believe that an application would use any other value for
> Maxcode. Is there really the intent to provide support for other values?
Yes. We've had occasion to use 0x7fffffff, and even 0xffffffff [sic].
> If not, why not enforce that, using an enum instead of an unsigned long,
> such as:
>
>     enum codecvt_maxcode { codecvt_bmp, codecvt_full };
>
>     template<class Elem,
>         codecvt_maxcode Maxcode = codecvt_full,
>         codecvt_mode Mode = (codecvt_mode)0>
>     class codecvt_utf8 {...};
>
> Alternatively, we could merge the two enums and get rid of one parameter:
>
>     enum codecvt_mode {
>         restrict_to_bmp = 8,
>         consume_header = 4,
>         generate_header = 2,
>         little_endian = 1};
>
>     template<class Elem,
>         codecvt_mode Mode = (codecvt_mode)0>
>     class codecvt_utf8 {...};
All interesting redesigns. I've proposed codifying existing practice,
however revolutionary that may appear these days.
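For what it's worth, the merged-enum design would make callers combine flags with a bitwise OR and then cast back, since OR-ing two enumerators yields an int. A minimal sketch, following the quoted snippet (the codecvt_utf8 usage line is illustrative only, not the proposed interface):

```cpp
// Hypothetical merged enum, as in the quoted suggestion.
enum codecvt_mode {
    restrict_to_bmp = 8,
    consume_header = 4,
    generate_header = 2,
    little_endian = 1
};

// OR-ing two enumerators yields int, so a cast is needed to pass
// the result as a codecvt_mode template argument, e.g.:
//   codecvt_utf8<wchar_t,
//       (codecvt_mode)(restrict_to_bmp | little_endian)> cvt;
inline codecvt_mode combine(codecvt_mode a, codecvt_mode b) {
    return (codecvt_mode)(a | b);
}
```

The cast is legal because every OR of these flag values is within the range of the enumeration, but it is one more thing for users to get right.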
> -- Issue #2: endianness
>
> The choice of big endianness as the default is arbitrary and is
> going to be very confusing for all people working on little endian
> machines.
And the choice of little endianness as the default would be arbitrary
and might be confusing to people working on big endian machines.
> I can think of three possible suggestions to overcome that:
>
> 1) add a new enumerator that specifies the native endianness and use
> that as the default:
>
>     enum codecvt_mode {
>         restrict_to_bmp = 8,
>         consume_header = 4,
>         generate_header = 2,
>         little_endian = 1,
>         native_endianness = /* implementation defined: either 0 or 1 */
>     };
>
>     template<class Elem,
>         codecvt_mode Mode = native_endianness>
>     class codecvt_utf8 {...};
>
> This solution has the inconvenience that the user might forget to add
> native_endianness when specifying another enumerator, for example
> consume_header.
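To make that forgetfulness concrete: naming any flag explicitly replaces the default rather than adding to it. A sketch, using the hypothetical enum from suggestion 1 and assuming a little endian platform, so that native_endianness equals little_endian:

```cpp
// Hypothetical enum from suggestion 1, on an assumed little endian
// platform where the implementation defines native_endianness as 1.
enum codecvt_mode {
    restrict_to_bmp = 8,
    consume_header = 4,
    generate_header = 2,
    little_endian = 1,
    native_endianness = 1   // implementation defined: 1 on this platform
};

// True if a facet with this Mode would read/write little endian data.
inline bool uses_little_endian(codecvt_mode mode) {
    return (mode & little_endian) != 0;
}
```

With Mode defaulted, the native endianness is used; but a user who writes codecvt_utf8<wchar_t, consume_header> has silently thrown the default away and is back to big endian, because consume_header alone has the little_endian bit clear.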
> 2) provide two enum values for both endianness with the default matching
> the platform's native endianness:
>
>     enum codecvt_mode {
>         restrict_to_bmp = 8,
>         consume_header = 4,
>         generate_header = 2,
>         big_endian = /* implementation defined: either 0 or 1 */,
>         little_endian = /* implementation defined: either 1 or 0 */
>     };
>
> This solution has only the inconvenience that the same symbol
> codecvt_utf8<T,0> refers to big endianness on some platforms and to
> little endianness on others, which might be a problem in libraries.
> 3) add a template parameter:
>
>     enum codecvt_mode {
>         restrict_to_bmp = 8,
>         consume_header = 4,
>         generate_header = 2
>     };
>
>     enum codecvt_endianness {
>         little_endian,
>         big_endian,
>         native = /* implementation defined: either little_endian or big_endian */
>     };
>
>     template<class Elem,
>         codecvt_mode Mode = (codecvt_mode)0,
>         codecvt_endianness Endian = native>
>     class codecvt_utf8 {...};
>
> This solution has none of the previous inconveniences but has... er...
> one more parameter.
All interesting redesigns. I've proposed codifying existing practice,
however revolutionary that may appear these days.
> -- Issue #3: UTF-8 encoding clarification
>
> The paper states the intent to provide support for Unicode, but when
> describing UTF-8 encoding it refers to UCS2 and UCS4, which are encoding
> forms that are part of ISO 10646 and are *not* part of Unicode. This is
> not just nit-picking, because ISO 10646 defines UTF-8 in a slightly
> different way than Unicode does, so it's not clear which of the two
> definitions the paper is referring to. For example, in Unicode the
> so-called "non-shortest" sequences, as well as all sequences that would
> refer to surrogate code points or to non-characters, are invalid UTF-8
> sequences, while they are valid in ISO 10646. Which of the two is the
> intent of the paper? This point is very important, IMHO. Mis-handling
> non-shortest forms is considered a security issue (see
> http://unicode.org/reports/tr36/), so the library should at least handle
> those, but I would suggest we do it right and support the whole Unicode
> semantics.
I actually favor the ISO 10646 formalism, and implicitly did so in this
proposal (and the implementation on which it's based). I'll raise the
issue of changing the terms (to UTF-16 and UTF-32, I assume) next week,
but I think that an ISO committee should favor ISO standards.
As for the security issue, and its purported fix in Unicode, I observe
that more computing sins are committed these days in the name of
improving security, without necessarily achieving it, than for most
other reasons, including blind stupidity. (With apologies to Bill Wulf.)
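For concreteness, the difference between the two definitions amounts to a handful of checks. Here is a minimal validator sketch under the Unicode definition (not the proposed facet, and not from N2401): it rejects non-shortest forms, surrogate code points, and values above 0x10FFFF, all of which a pure ISO 10646 reading could accept (non-character checks are omitted for brevity):

```cpp
#include <cstddef>

// Validates a single UTF-8 sequence of n bytes under the Unicode rules.
bool is_valid_unicode_utf8(const unsigned char* s, std::size_t n) {
    if (n == 0)
        return false;
    unsigned long cp;   // accumulated code point
    std::size_t len;    // expected sequence length from the lead byte
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else
        return false;   // 0x80-0xBF continuation, or 5/6-byte ISO forms
    if (n != len)
        return false;
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)  // each trail byte must be 10xxxxxx
            return false;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    // Reject non-shortest ("overlong") forms: each length has a
    // minimum code point it is allowed to encode.
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    if (cp < min_cp[len])
        return false;
    // Reject surrogate code points and out-of-range values.
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;
    return cp <= 0x10FFFF;
}
```

The classic attack is the overlong slash, 0xC0 0xAF, which decodes to '/' but slips past naive substring filters; the overlong check above rejects it.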
> Just my two eurocents,
About USD 0.028 these days (sigh).
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]