Re: How to read Unicode(Big-Endian) text file(s) in Non-MFC

From:

"meme" <meme@myself.com>

Newsgroups:

microsoft.public.vc.language

Date:

Wed, 20 Feb 2008 03:28:01 +0530

Message-ID:

<OrUQ2K0cIHA.5712@TK2MSFTNGP04.phx.gbl>

"Giovanni Dicanio" <giovanni.dicanio@invalid.com> wrote in message
news:ehNgwPocIHA.2688@TK2MSFTNGP06.phx.gbl...

"meme" <meme@myself.com> wrote in message
news:eALO1CmcIHA.4140@TK2MSFTNGP04.phx.gbl...

so I tried the following..... but I think I missed or messed up
something, and therefore all I see is some junk characters when it is
executed ..... :(

You can solve this problem in several ways; there is no single right way.

You might consider this code of mine (it needs more testing and can be
optimized, but it seems to work).
I've put comments in the code, so you can read them.

Hi... Thanks again... :-)

Yes, this seems to be working.... finally :-D

However, I made some changes.... And I also have a few queries in mind...

   //
   // Check that file is UTF-16 BE
   //
   BYTE bom[2];
   if ( fread( bom, sizeof(bom), 1, file) != 1 )
   {
       // Could not read the 2-byte BOM (read error or file too short)
       ASSERT(FALSE);

       fclose(file);
       return false;
   }

   // UTF-16 BE BOM is FE FF; reject the file if either byte differs
   if ( bom[0] != 0xFE || bom[1] != 0xFF )
   {
       // No UTF-16 BE (BOM does not match)
       ASSERT(FALSE);

       fclose(file);
       return false;
   }

This did not work for me.... so I used the following instead...

  int fByte[2];
  file = _wfopen(szFile, L"rb" );

  if (file != NULL)
  {
   // Read the first two bytes to see if we have a BOM
   fByte[0] = fgetc(file);
   fByte[1] = fgetc(file);
   //fclose(file);

   if((fByte[0] == 255) && (fByte[1] == 254))
   {
    //FF FE i.e. UTF-16(Unicode Little-Endian)
    readUnicode(file, false);
   }
   else if((fByte[0] == 254) && (fByte[1] == 255))
   {
    //FE FF i.e. UTF-16(Unicode Big-Endian)
    readUnicode(file, true);
   }
   else if((fByte[0] == 239) && (fByte[1] == 187))
   {
    //EF BB (BF) i.e. UTF-8 with BOM; only the first two of its three bytes are checked here
    readUTF8(file);
   }
   else //ansi
   {
    readAnsi(file);
   }
  }
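
The same check could also be written against a small byte buffer, so that all three bytes of the UTF-8 BOM (EF BB BF) are compared and the caller learns how many bytes to skip. This is only a sketch: the names TextEncoding, DetectBom and DetectFileBom are made up for illustration, and DetectFileBom assumes the file was opened in binary mode and is still positioned at its start.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical names for this sketch; they are not part of the original code.
enum TextEncoding { ENC_ANSI_OR_UNKNOWN, ENC_UTF16_LE, ENC_UTF16_BE, ENC_UTF8_BOM };

// Classify a buffer by its leading bytes. bomLength receives the number
// of BOM bytes to skip (0 when no BOM is recognised).
TextEncoding DetectBom(const unsigned char* bytes, std::size_t count,
                       std::size_t& bomLength)
{
    bomLength = 0;
    if (count >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
    {
        bomLength = 3;          // EF BB BF: UTF-8 with BOM
        return ENC_UTF8_BOM;
    }
    if (count >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
    {
        bomLength = 2;          // FF FE: UTF-16 little-endian
        return ENC_UTF16_LE;
    }
    if (count >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
    {
        bomLength = 2;          // FE FF: UTF-16 big-endian
        return ENC_UTF16_BE;
    }
    return ENC_ANSI_OR_UNKNOWN; // no BOM: ANSI, or UTF-8 without BOM
}

// Read up to three bytes from the start of an open binary stream, classify
// them, and leave the stream positioned just after the BOM (if any).
TextEncoding DetectFileBom(std::FILE* file)
{
    unsigned char head[3];
    std::size_t got = std::fread(head, 1, sizeof(head), file);
    std::size_t bomLength = 0;
    TextEncoding enc = DetectBom(head, got, bomLength);
    std::fseek(file, static_cast<long>(bomLength), SEEK_SET);
    return enc;
}
```

With this shape the readers do not need to know how many bytes were already consumed, because the stream always ends up just past the BOM, or at offset 0 when there is none.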

And I changed the following...

   //
   // Now convert from BE to LE, swapping byte order in WORDs
   //
   BYTE * pBuffer = &(buffer[0]);
   ASSERT(pBuffer != NULL);
   for ( long i = 0; i < size; i++ )
   {
       // Swap low and high bytes (*pBuffer and *(pBuffer+1))
       SwapBytes( *pBuffer, *(pBuffer+1) );

       // Go to next WORD (2 bytes)
       pBuffer += 2;
       i += 2;
   }

to ......

  //
  // Convert from BE to LE, swapping byte order in WORDs
  //
  long i = 0;
  while( i + 1 < size )   // stop before a trailing odd byte, if any
  {
   SwapBytes(data[i], data[i+1]);
   i = i + 2;
  }

here data is....

    BYTE *data = new BYTE[size];
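
For comparison, here is a minimal sketch of the same swap over a std::vector instead of a raw new[] buffer, so that nothing has to be freed by hand. SwapUtf16ByteOrder is a made-up name, and the swap is written out inline rather than relying on the SwapBytes helper from the code above.

```cpp
#include <cstddef>
#include <vector>

// Swap the two bytes of every complete 16-bit code unit in place, turning
// UTF-16 BE data into UTF-16 LE (or the reverse).
// A trailing odd byte, if any, is left untouched.
void SwapUtf16ByteOrder(std::vector<unsigned char>& data)
{
    for (std::size_t i = 0; i + 1 < data.size(); i += 2)
    {
        unsigned char tmp = data[i];
        data[i] = data[i + 1];
        data[i + 1] = tmp;
    }
}
```

The loop condition i + 1 < data.size() keeps both indexes inside the buffer even when the byte count is odd.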

Now the code can read ANSI, Unicode (UTF-16 LE/BE, thanks to you ;-) ) and
UTF-8 (with BOM) files.
That leaves "UTF-8 without BOM" files.... In fact the code can read them all
right, but it cannot tell them apart from plain ANSI files, because there is
no BOM to test, so the check above is useless there.... Any thoughts on this?
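
One heuristic that might help is to test whether the whole buffer is well-formed UTF-8. ANSI text with accented characters rarely happens to form valid UTF-8 multi-byte sequences, so a buffer that passes is probably UTF-8. This is a guess based on the data, not a guarantee, and the name LooksLikeUtf8 below is made up for this sketch. Pure ASCII passes too, but for such files the two encodings give the same result anyway.

```cpp
#include <cstddef>

// Returns true when the buffer is well-formed UTF-8: every lead byte is
// followed by the right number of continuation bytes, and overlong forms,
// surrogates and values above U+10FFFF are rejected.
bool LooksLikeUtf8(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n)
    {
        unsigned char c = p[i];
        if (c < 0x80) { ++i; continue; }          // plain ASCII byte

        std::size_t extra;      // continuation bytes expected
        unsigned long cp;       // code point being assembled
        unsigned long minCp;    // smallest value allowed for this length
        if      ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; minCp = 0x10000; }
        else return false;      // stray continuation byte or invalid lead byte

        if (n - i <= extra) return false;         // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char cont = p[i + k];
            if ((cont & 0xC0) != 0x80) return false;  // not a 10xxxxxx byte
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp < minCp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}
```

It would be called in the final else branch, on the bytes already read from the file: treat the file as UTF-8 when the check passes and fall back to ANSI otherwise. On Windows, calling MultiByteToWideChar with CP_UTF8 and the MB_ERR_INVALID_CHARS flag should be another way to make the same decision.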