How do you figure out what encoding the text in a file is in?
You need to know. Nowadays there is no longer such a thing as "plain text";
you always need to know the encoding. There are several ways to find out:
* The user tells you with a command line switch or with a menu selection.
* The encoding is known from some other information (e.g. the encoding
declaration in an XML header or the charset given in e-mail headers).
* The encoding is implied by some other rule (e.g. a company or project rule
that source files are always encoded in ANSI Latin-1 and config files in UTF-16LE).
* Sometimes you can guess by statistical analysis of the first 1000 or so bytes.
I think there is a Windows API function (IsTextUnicode) that tries to guess for
you.
* Unicode text files may start with a BOM (byte order mark), a special first
character that can be used to identify the file's Unicode encoding.
Unfortunately, the BOM is optional. See http://en.wikipedia.org/wiki/Byte_Order_Mark
* Most config files that do not contain user-displayable text still use plain
ASCII. But since even Internet domain names can now contain non-ASCII
characters, you can no longer rely even on that.
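The BOM check from the list above can be sketched in a few lines of Python. This is only a minimal illustration: the helper name `sniff_bom` is made up for this example, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longer BOMs are listed first: the UTF-32LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM, else None.

    sniff_bom is a hypothetical helper name used only for this sketch.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

A result of None only means that no BOM was found; it says nothing about the actual encoding. Also note that FF FE 00 00 is ambiguous in principle: it could be a UTF-16LE file whose first character after the BOM is NUL, which this sketch reports as UTF-32LE.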
In many cases you can assume that a text file is either in the user's default
ANSI code page or in UTF-16 with a BOM (byte order mark) if the file was created
by a standard Windows application. But for files that came from the net, you
never really know. 8-bit text files from current Linux systems are usually
encoded in UTF-8, with or without a BOM.
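Those assumptions can be combined into a fallback chain, sketched here in Python. It is a heuristic under stated assumptions, not a reliable detector: the function name `decode_text` and its `fallback` parameter are invented for this example, and the system's preferred encoding stands in for "the user's default ANSI code page".

```python
import locale


def decode_text(data, fallback=None):
    """Decode bytes using a BOM check, then strict UTF-8, then a fallback.

    decode_text and its fallback parameter are hypothetical names for this
    sketch; the fallback defaults to the locale's preferred encoding.
    """
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    # UTF-16 with a BOM: Python's "utf-16" codec reads and strips the BOM.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return data.decode("utf-16")
    try:
        # "utf-8-sig" accepts UTF-8 with or without a BOM and fails on
        # byte sequences that are not valid UTF-8.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Last resort: an assumed 8-bit code page. This is a guess, so
        # undecodable bytes are replaced instead of raising an error.
        return data.decode(fallback, errors="replace")
```

Trying strict UTF-8 before the 8-bit fallback works reasonably well because text in a legacy code page that contains non-ASCII characters is rarely valid UTF-8 by accident. Text that falls through to the last step may still be decoded with the wrong code page, which is why the encoding should come from the user or from metadata whenever possible.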