Re: Reading unicode files?

From:

"Alexander Nickolov" <agnickolov@mvps.org>

Newsgroups:

microsoft.public.vc.language

Date:

Wed, 15 Aug 2007 11:18:55 -0700

Message-ID:

<e#5p3i23HHA.5796@TK2MSFTNGP05.phx.gbl>

UTF = Unicode Transformation Format

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnickolov@mvps.org
MVP VC FAQ: http://vcfaq.mvps.org
=====================================

"Ulrich Eckhardt" <eckhardt@satorlaser.com> wrote in message
news:epabp4-lmn.ln1@satorlaser.homedns.org...

Daniel C. Gindi wrote:

I need to implement a class that transparently reads unicode files (e.g.
for xml reading...)

Now I have read some on Google, and it seems like there are many
standards: Unicode,

Unicode is the standard.

Unicode-BE (Big-endian),

This is not a meaningful term, as Unicode itself is independent of
endianness.

UTF-8, UTF-16, UTF-16-BE, UTF-8-NO-BOM, UTF-16-NO-BOM,
UTF-16-BE-NO-BOM...

UTF = Unicode Transport (Transfer?) Format, i.e. the byte-wise
representation.

BE/LE denotes the endianness, which matters as soon as more than one
byte is used to store a code unit, i.e. for UTF-16 and UTF-32.
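
To make the byte-order point concrete, here is a minimal sketch of how two
bytes combine into one UTF-16 code unit. The function name read_utf16_unit
and its big_endian parameter are made up for illustration; they do not come
from any particular library.

```cpp
#include <cstdint>

// Hypothetical helper: combine two consecutive bytes into one 16-bit
// UTF-16 code unit. The caller must guarantee that p[0] and p[1] exist.
std::uint16_t read_utf16_unit(const unsigned char* p, bool big_endian)
{
    return big_endian
        ? static_cast<std::uint16_t>((p[0] << 8) | p[1])   // high byte first
        : static_cast<std::uint16_t>((p[1] << 8) | p[0]);  // low byte first
}
```

The same two bytes 20 AC give U+20AC (the euro sign) when read as big-endian
and the unrelated value 0xAC20 when read as little-endian, which is why the
byte order has to be known before decoding.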

BOM means Byte Order Mark. This is a special codepoint (U+FEFF) that, when
written at the beginning of a file, makes it easy to distinguish between
UTF-8, UTF-16 and UTF-32; for the latter two it also allows figuring out
the byte order.
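
As a rough sketch of that check, the function below looks at the first few
bytes of a buffer and compares them with the encoded forms of U+FEFF. The
names Encoding and detect_bom are invented for this example, and a real
reader would still need a fallback for files that have no BOM at all.

```cpp
#include <cstddef>

// Hypothetical result type: the encodings a leading BOM can identify.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: report which BOM, if any, starts the buffer.
// Returns Encoding::Unknown when no BOM is found.
Encoding detect_bom(const unsigned char* data, std::size_t size)
{
    // UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
    // begins with the same two bytes as the UTF-16LE BOM (FF FE).
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00)
        return Encoding::Utf32LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF)
        return Encoding::Utf32BE;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Encoding::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Encoding::Utf16BE;
    return Encoding::Unknown;
}
```

Note that FF FE 00 00 is ambiguous in principle: it could also be a UTF-16LE
BOM followed by U+0000. The sketch simply assumes UTF-32LE in that case.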

I want to be able to parse all of those, but I need some pointers...

I'd suggest looking at the Unicode standard and perhaps reading a bit of
Wikipedia. However, doing all of this yourself is pointless, because someone
has already done the work for you. In other words, instead of reinventing
the wheel, use an XML parser.

The biggest problem right now is that it seems like UTF-8 has no
constant-length characters,
It could be a sequence of 1-byte characters, followed by 2-byte
characters...
How do I parse this?

I suggest the WP entry for that. In short, if you write out the bits of the
first byte of each sequence, you will see that it always starts with n set
bits followed by a zero bit: n=0 means a single-byte character, and n=2..6
gives the total number of bytes in the sequence (current UTF-8 only uses up
to 4). A byte with n=1 is a continuation byte. From that you can determine
the length of the codepoint representation. While understanding this is not
a bad idea, I still suggest you use a library.
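
For illustration only, the lead-byte rule can be sketched as below. The
function name utf8_sequence_length is made up, and the sketch assumes the
current 4-byte limit of UTF-8 (RFC 3629), so it rejects the 5- and 6-byte
forms of the original design. It only classifies the first byte; it does not
validate the continuation bytes that follow.

```cpp
// Hypothetical helper: return the total number of bytes in the UTF-8
// sequence that starts with `lead`, or 0 if `lead` cannot start a sequence.
int utf8_sequence_length(unsigned char lead)
{
    // Count the set bits at the top of the byte, stopping at the first zero.
    int n = 0;
    for (unsigned char mask = 0x80; mask != 0 && (lead & mask) != 0; mask >>= 1)
        ++n;

    if (n == 0) return 1;            // 0xxxxxxx: single byte (ASCII range)
    if (n >= 2 && n <= 4) return n;  // 110xxxxx, 1110xxxx, 11110xxx
    return 0;                        // n == 1: continuation byte; n >= 5: invalid
}
```

A reader loop would call this on the current byte, consume that many bytes,
and treat a return value of 0 as a decoding error.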

Oh, there are also tools that transcode. That way, you could transform
everything to e.g. UTF-8 and go on from there.

Also, a totally different problem you will face is that you need to figure
out what you want to do with the data, because that determines which
internal representation you need for it.

Uli