Re: CharNext

From: "Alexander Nickolov" <agnickolov@mvps.org>
Newsgroups: microsoft.public.vc.language
Date: Fri, 18 Jan 2008 13:59:15 -0800
Message-ID: <#ER1q0hWIHA.4440@TK2MSFTNGP06.phx.gbl>

Actually, IDN names use only ASCII characters on the wire, since they
are relayed through the old DNS system. They can be decoded into
Unicode for display purposes, though. I'm talking about the names
that start with "xn--".
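To make that concrete, here is a minimal sketch of how one might spot such an
encoded name. The function name has_ace_label is made up for illustration; it
only checks for the "xn--" prefix on each dot-separated label and does not
decode anything.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns true if any dot-separated label of `host`
// starts with the prefix "xn--" (compared case-insensitively).
// It does not validate or decode the rest of the label.
bool has_ace_label(const std::string& host) {
    static const char prefix[] = "xn--";
    std::size_t start = 0;
    while (start <= host.size()) {
        std::size_t end = host.find('.', start);
        if (end == std::string::npos) end = host.size();
        if (end - start >= 4) {
            bool match = true;
            for (std::size_t i = 0; i < 4; ++i) {
                unsigned char c = static_cast<unsigned char>(host[start + i]);
                if (std::tolower(c) != prefix[i]) { match = false; break; }
            }
            if (match) return true;
        }
        start = end + 1;  // continue with the next label
    }
    return false;
}
```

For example, has_ace_label("www.xn--bcher-kva.example") is true, while
has_ace_label("example.com") is false. Either way the string itself is
plain ASCII.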

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnickolov@mvps.org
MVP VC FAQ: http://vcfaq.mvps.org
=====================================

"Norbert Unterberg" <nunterberg@newsgroups.nospam> wrote in message
news:%235KqHBBWIHA.4696@TK2MSFTNGP05.phx.gbl...

Alex wrote:

How do I figure out what encoding the text in a file is in?

You need to know. Nowadays, there is no longer such a thing as "plain
text"; you always need to know the encoding:

* The user tells you with a command line switch or with a menu selection
* The encoding is known from some other information (e.g. encoding
information in an XML header or in the e-mail headers)
* The encoding is implied by some other rules (e.g. a company or project
rule that source files are always encoded in ANSI Latin-1 and config files
are UTF-16LE)
* Sometimes you can guess by statistical analysis of the first 1000 or so
bytes. I think there is a Windows API function (IsTextUnicode) that tries
to guess for you.
* Unicode text files may contain a BOM as the first character, which can be
used to identify the file's Unicode encoding type. Unfortunately, this BOM
is optional. See http://en.wikipedia.org/wiki/Byte_Order_Mark
* Most config files that do not contain user-displayable text still use
plain ASCII. But since even Internet domain names can contain extended
characters, you can no longer rely even on that.
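
The BOM check from the list above can be sketched roughly like this. The
names Bom and detect_bom are made up for illustration, and the code assumes
you have already read the first few bytes of the file into a buffer.

```cpp
#include <cstddef>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Looks at the first bytes of a buffer and reports which BOM, if any,
// it starts with. A missing BOM proves nothing about the encoding.
Bom detect_bom(const unsigned char* data, std::size_t size) {
    // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins
    // with the UTF-16LE BOM FF FE. (A UTF-16LE file whose first real
    // character is U+0000 would be misreported; that is rare in text.)
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) return Bom::Utf32LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) return Bom::Utf32BE;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB &&
        data[2] == 0xBF) return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) return Bom::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}
```

If this returns Bom::None you are back to the other items in the list:
ask the user, look for other encoding information, or guess.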

In many cases you can assume that a text file is either in the user's
default ANSI code page or in UTF-16 with BOM (Byte Order Mark) when the
file was created by some standard Windows application. But for files that
came from the net, you never really know. 8-bit text files from current
Linux systems are usually encoded in UTF-8, with or without a BOM.
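
For 8-bit files without a BOM, one heuristic (my assumption, not a
guarantee) is to test whether the bytes form valid UTF-8: text in an ANSI
code page that contains non-ASCII characters rarely passes that test by
accident. A minimal sketch, with the made-up name is_valid_utf8:

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the buffer is well-formed UTF-8.
// Pure ASCII also passes, so a "true" result for ASCII-only data says
// nothing about whether the file is UTF-8 or ANSI.
bool is_valid_utf8(const unsigned char* s, std::size_t n) {
    // Smallest code point that legitimately needs 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        if (b < 0x80) { ++i; continue; }           // ASCII byte
        std::size_t len;
        unsigned long cp;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;            // stray continuation or invalid lead byte
        if (i + len > n) return false;              // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;                 // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;     // surrogate
        if (cp > 0x10FFFF) return false;                    // out of range
        i += len;
    }
    return true;
}
```

Note that if you only feed it the first 1000 or so bytes, the sample may
end in the middle of a multi-byte character and be rejected as truncated,
so cut the sample at a character boundary or tolerate a bad tail.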

Norbert