Re: CHTMLView::GetSource UNICODE, junk after </html>

From: Giovanni Dicanio <giovanniDOTdicanio@REMOVEMEgmail.com>
Newsgroups: microsoft.public.vc.mfc
Date: Sun, 13 Sep 2009 00:28:13 +0200
Message-ID: <#kQNYh$MKHA.1280@TK2MSFTNGP04.phx.gbl>

PRMARJORAM wrote:

I've noticed some files do not always end with </html> either.

But I don't see what I can do about it, as it's internal to the MFC code and
GetSource() uses IStream.

The MFC code has bugs (at least in the VS2008 implementation, which I tested).

First of all: thanks to Igor Tandetnik for private communication about
this issue, and for having identified the problems in the original MFC code.

The first bug (the cause of the garbage displayed at the tail of the
string, after </html>) comes from the wrong assumption that
IPersistStreamInit::Save() produces a NUL-terminated string.
Instead, the correct way is to use IStream::Stat() to get the
actual length of the data, and pass that length to the CString
constructor.

If you fix that, you don't have garbage anymore.
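
To show the first bug in isolation, here is a minimal, portable C++ sketch
(not MFC code). The helper name CopyWithLength is made up for this example;
std::string stands in for CStringA, and the explicit byte count plays the
role of the size reported by IStream::Stat().

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of MFC): copies exactly 'cb' bytes
// starting at 'p' into a string. It never scans for a terminating NUL,
// so it is safe for buffers that are not NUL-terminated.
std::string CopyWithLength(const char* p, std::size_t cb)
{
    return std::string(p, cb);
}
```

Building the string from the pointer alone would instead keep reading until
a zero byte happens to appear in memory, which is where the junk after
</html> comes from.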

However, there is another problem.
The original MFC code uses the CString constructor to convert the string
returned by the stream to a Unicode string.
But this conversion implicitly assumes that the encoding is CP_ACP (i.e. the
system default code page). Instead, the correct code page for the
conversion should be read from the proper <meta> tag in the HTML text, e.g.

   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

and this should be used with MultiByteToWideChar() to convert the string
to Unicode (UTF-16).
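
As a rough sketch of the lookup step only, the following portable C++ code
searches the HTML text for "charset=" and maps a few well-known charset
names to Windows code page identifiers. The function names FindCharset and
CodePageFromCharset are hypothetical, the table is deliberately tiny, and
the search is a naive text scan rather than a real HTML parser.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the lower-cased charset name found after
// the first "charset=" in 'html', or an empty string if there is none.
std::string FindCharset(const std::string& html)
{
    // Lower-case a copy so that the search is case-insensitive.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();
    pos += key.size();

    // Skip an optional opening quote, as in <meta charset="utf-8">.
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\''))
        ++pos;

    // The name ends at the first quote, space, semicolon or '>'.
    const std::size_t end = lower.find_first_of("\"' ;>", pos);
    return lower.substr(pos, end == std::string::npos ? end : end - pos);
}

// Hypothetical helper: maps a few common charset names to Windows code
// page identifiers. Returns 0 (the value of CP_ACP) for unknown names.
unsigned int CodePageFromCharset(const std::string& name)
{
    if (name == "utf-8")        return 65001; // CP_UTF8
    if (name == "iso-8859-1")   return 28591;
    if (name == "windows-1252") return 1252;
    if (name == "us-ascii")     return 20127;
    return 0;
}
```

The returned number is what would be passed as the CodePage argument of
MultiByteToWideChar(). Falling back to 0 (CP_ACP) for unknown names simply
reproduces the original MFC behaviour, so a real implementation would
probably want a larger table.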

A possible fix for problem #1 could be the following:

<code>

//////////////////////////////////////////////////////////////////////////
//
// This is a partially fixed version of CHtmlView::GetSource() method.
//
// One of the problems of the original MFC implementation is that
// it incorrectly assumes that IPersistStreamInit::Save would produce a
// NUL-terminated string. Instead, the correct way to do it is to use
// IStream::Stat to figure out the actual length of the data.
//
// Another problem is the conversion from the given HTML source text
// to Unicode using CString.
// In fact, original MFC code wrongly assumed that the text is always
// CP_ACP (i.e. system default code page). Instead, the code page for
// conversion should be read from the proper <meta> tag in HTML text,
// e.g.
// <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
//
// Thanks to Igor Tandetnik for pointing that out.
//
// Giovanni Dicanio
//

#define DELETE_EXCEPTION(e) do { if(e) { e->Delete(); } } while (0)

BOOL CHtmlView_GetSource_Fixed(
     IN CHtmlView * pView,

     OUT CStringA & strSource
     //
     // Note on 'strSource' parameter:
     //
     // The HTML data is always returned as ANSI/MBCS string in a CStringA.
     // The caller should figure out the actual encoding searching for
     // proper <meta> tags like:
     //
     // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
     //
     // Then it could use MultiByteToWideChar() to convert the given
     // HTML string to Unicode (UTF-16).
     //

     )
{
     ASSERT(pView != NULL);

     BOOL bRetVal = FALSE;

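     // GetHtmlDocument() returns an already AddRef'ed pointer, so attach it
     // instead of assigning it (assignment would AddRef again and leak).
     CComPtr<IDispatch> spDisp;
     spDisp.Attach(pView->GetHtmlDocument());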
     if (spDisp != NULL)
     {
         HGLOBAL hMemory;
         hMemory = GlobalAlloc(GMEM_MOVEABLE, 0);
         if (hMemory != NULL)
         {
             CComQIPtr<IPersistStreamInit> spPersistStream = spDisp;
             if (spPersistStream != NULL)
             {
                 CComPtr<IStream> spStream;
                  if (SUCCEEDED(CreateStreamOnHGlobal(hMemory, TRUE, &spStream)))
                 {
                     spPersistStream->Save(spStream, FALSE);

                     // Get stream size
                     STATSTG statstg;
                     ZeroMemory(&statstg, sizeof(statstg));
                      if (SUCCEEDED(spStream->Stat(&statstg, STATFLAG_NONAME)))
                     {
                         // File must not be very big
                         ASSERT(statstg.cbSize.HighPart == 0);

                         // Size of string, in bytes
                         int cbSize = static_cast<int>(statstg.cbSize.LowPart);

                         LPCSTR pstr = static_cast<LPCSTR>(GlobalLock(hMemory));
                         if (pstr != NULL)
                         {
                             // Stream is always ANSI

                             bRetVal = TRUE;
                             TRY
                             {
                                 // Build string using proper size
                                 // (in fact, 'pstr' cannot be assumed
                                 // to be NUL-terminated)
                                 strSource = CStringA(pstr, cbSize);
                             }
                             CATCH_ALL(e)
                             {
                                 bRetVal = FALSE;
                                 DELETE_EXCEPTION(e);
                             }
                             END_CATCH_ALL

                             // The stream owns hMemory (fDeleteOnRelease
                             // is TRUE), so just unlock it here; it is
                             // freed when spStream is released.
                             GlobalUnlock(hMemory);
                         }
                     }

                 }
                 else
                 {
                     GlobalFree(hMemory);
                 }
             }
             else
             {
                 GlobalFree(hMemory);
             }
         }
     }

     return bRetVal;
}

//////////////////////////////////////////////////////////////////////////

</code>

You can find an updated VS2008 solution project here, which includes the
above fix:

http://www.geocities.com/giovanni.dicanio/vc/TestHtmlView.zip

Note that no garbage is displayed.

However, you still have to write code to fix bug #2 (i.e. conversion
from the code page specified in HTML text to Unicode UTF-16).

HTH,
Giovanni