Re: Text File problem - VC++ MFC Studio 2008 MFC app

From:

"Tom Serface" <tom.nospam@camaswood.com>

Newsgroups:

microsoft.public.vc.mfc

Date:

Tue, 16 Sep 2008 11:14:50 -0700

Message-ID:

<87562353-5AB9-451F-AB52-122D121721E7@microsoft.com>

I agree. A BOM should be required and is even specified by Microsoft.

Tom
"Giovanni Dicanio" <giovanniDOTdicanio@REMOVEMEgmail.com> wrote in message
news:%236s9Zh%23EJHA.4104@TK2MSFTNGP04.phx.gbl...

Hi Tom,

"Tom Serface" <tom.nospam@camaswood.com> wrote in message
news:A416EB5B-6AC7-4370-A6EE-ADEF45CC74AB@microsoft.com...

In addition to what others have written, if you are using CStdioFile you
should use WriteString and ReadString. How did you look at the file?

I still don't trust CStdioFile to write text to files...

I tried this simple MFC code snippet using VS2008, in Unicode mode:

<code>

   CStdioFile outFile;
   if ( ! outFile.Open( L"test.txt",
         CFile::modeCreate | CFile::modeWrite | CFile::typeText ) )
   {
       AfxMessageBox( L"Error opening file" );
       return;
   }

   outFile.WriteString( L"Ciao\n" );
   outFile.WriteString( L"Poiché" );

</code>

Then I opened the file with Cygnus Free Edition in binary mode, and I
found that the file bytes are (hex): 43 69 61 ... E9.
There are 12 bytes in total, i.e. one byte per character (counting the
CR LF pair that text mode writes for '\n'). That means that the text was
not written in Unicode UTF-16, because UTF-16 uses at least 2 bytes for
each character. Moreover, there is no BOM (which should be required for
UTF-16, e.g. to identify whether it is using UTF-16 LE or BE).
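
As a rough sketch of the byte-count argument above, here is what the same 12 characters would occupy as UTF-16LE with a BOM. This is portable C++11 rather than MFC, and the helper names utf16le_with_bom and expected_utf16_file_size are made up for illustration; it also assumes that text mode turned '\n' into CR LF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of MFC): serialize UTF-16 code units as
// little-endian bytes, preceded by the UTF-16LE byte order mark FF FE.
std::vector<unsigned char> utf16le_with_bom(const std::u16string& text)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(2 + 2 * text.size());
    bytes.push_back(0xFF);   // BOM U+FEFF, low byte first
    bytes.push_back(0xFE);
    for (char16_t unit : text)
    {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));         // low byte
        bytes.push_back(static_cast<unsigned char>((unit >> 8) & 0xFF));  // high byte
    }
    return bytes;
}

// Size of the test file if it had been written as UTF-16LE with a BOM.
// Assumption: the file holds 12 characters, "\n" having become CR LF.
std::size_t expected_utf16_file_size()
{
    const std::u16string text = u"Ciao\r\nPoich\u00E9";
    assert(text.size() == 12);
    return utf16le_with_bom(text).size();   // 2 BOM bytes + 12 * 2
}
```

A 12-byte file therefore cannot be UTF-16: the same text would need 24 bytes even without the BOM.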

But this text is not Unicode UTF-8, either. In fact, the Italian 'é' of
"poiché" is written as one single byte E9 in the file, but 'é' is not
encoded as byte E9 in UTF-8.
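
To check the UTF-8 claim, a minimal encoder sketch follows (portable C++11, not MFC; encode_utf8 is a hypothetical name). Under the standard UTF-8 rules, U+00E9 falls in the two-byte range and comes out as C3 A9, not as a single E9.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF);               // must be inside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);   // surrogates cannot be encoded
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 1 byte: ASCII
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x00E9) returns the two bytes C3 A9, so a UTF-8 file with the same text would be 13 bytes long instead of 12.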

So, I think that CStdioFile used some form of local code-page to write
text data to file, and using local code-pages is IMHO very bad. In fact,
if I give this file written on my computer with an Italian/West-Europe
code-page, to someone who has a different default code-page (like Chinese,
Japanese, etc.) I believe that the content of the file will be seen as
different (i.e. they will not read "poiché", but something else in place
of "é").

I think that Unicode is the way to go for international text (CStdioFile
may be good for pure-ASCII, i.e. only English characters), and to me it
seems that CStdioFile ignores Unicode.
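
To make the code-page concern concrete, here is a small sketch that decodes a single byte under two single-byte Windows code pages. Windows-1251 (Cyrillic) is picked here only because it is easy to tabulate; it is an assumption for illustration, not one of the code pages named above (the Chinese and Japanese ones are double-byte, where E9 is, as far as I know, treated as the first byte of a two-byte sequence). The names CodePage and decode_byte are hypothetical, and only the ranges 00..7F and E0..FF are covered.

```cpp
#include <cassert>

enum class CodePage { Windows1252, Windows1251 };

// Hypothetical helper: map one byte to the Unicode code point it stands
// for in the given code page. Only 00..7F and E0..FF are handled; any
// other byte yields U+FFFD (replacement character) in this sketch.
char32_t decode_byte(unsigned char byte, CodePage cp)
{
    if (byte < 0x80)
    {
        return byte;   // ASCII range: identical in both code pages
    }
    if (byte >= 0xE0)
    {
        switch (cp)
        {
        case CodePage::Windows1252:
            return byte;   // U+00E0..U+00FF (Latin letters such as é)
        case CodePage::Windows1251:
            return static_cast<char32_t>(0x0430 + (byte - 0xE0));   // U+0430..U+044F
        }
        assert(false && "unknown code page");
    }
    return 0xFFFD;   // not covered by this sketch
}
```

The byte E9 decodes to U+00E9 (é) under Windows-1252 but to U+0439 (й) under Windows-1251, while the ASCII bytes of "Ciao" decode the same way in both.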

The text should be written in some Unicode form; I prefer UTF-8, but
UTF-16 could be fine, too. And if UTF-16 is used, CStdioFile should write
a BOM, to specify if it is using UTF-16LE or UTF-16BE (in fact, one of the
advantages of UTF-8 is that no BOM is required to specify the "endianness"
BE/LE: there are neither UTF-8 LE nor BE, there is just UTF-8 :)
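
For reference, a reader can tell these encodings apart by inspecting the first bytes of a file. The sketch below is portable C++11, not MFC; Bom, detect_bom and bom_length are hypothetical names, and UTF-32 BOMs are deliberately ignored, so a UTF-32LE file would be reported as UTF-16LE here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: classify the byte order mark at the start of
// 'head' (the first few bytes of a file, read in binary mode).
Bom detect_bom(const std::string& head)
{
    if (head.size() >= 3 && head.compare(0, 3, "\xEF\xBB\xBF") == 0)
    {
        return Bom::Utf8;      // optional UTF-8 signature
    }
    if (head.size() >= 2 && head.compare(0, 2, "\xFF\xFE") == 0)
    {
        return Bom::Utf16LE;   // U+FEFF stored low byte first
    }
    if (head.size() >= 2 && head.compare(0, 2, "\xFE\xFF") == 0)
    {
        return Bom::Utf16BE;   // U+FEFF stored high byte first
    }
    return Bom::None;
}

// Number of bytes to skip before the actual text starts.
std::size_t bom_length(Bom bom)
{
    switch (bom)
    {
    case Bom::Utf8:    return 3;
    case Bom::Utf16LE: return 2;
    case Bom::Utf16BE: return 2;
    case Bom::None:    return 0;
    }
    assert(false && "unknown Bom value");
    return 0;
}
```

Run on the 12 bytes of the test file, detect_bom reports Bom::None, which matches what the hex editor showed.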

These are reasons why I don't use CStdioFile.
Maybe a better replacement would be CodeProject::CStdioFileEx

http://www.codeproject.com/KB/files/stdiofileex.aspx

or your Tom::CStdioFileEx...

The class I wrote is more restricted in scope (i.e. it writes only in
UTF-8), but I think that it does its (simple) job well :)

However, Mihai is the "king" in internationalization, so better wait for
him to have a definitive word about CStdioFile.

G