Re: _stprintf

From:

"Norman Diamond" <ndiamond@community.nospam>

Newsgroups:

microsoft.public.vc.mfc

Date:

Thu, 3 Aug 2006 10:34:15 +0900

Message-ID:

<OWSw1zptGHA.3264@TK2MSFTNGP03.phx.gbl>

Multibyte Character Set is an *encoding* of a character set.

Yes, ANSI code page 932 is an encoding just like other ANSI code pages such
as (I might not be remembering these numbers correctly) 1252 and 850.

however, StringCchPrintf, sprintf, etc. do only convert characters using
code pages in special cases, e.g., %lc or %C format.

And %s and stuff like that. (If you're compiling in an ANSI environment
then simply use %s, but if you're compiling in a Unicode environment and
want to produce an ANSI encoded string then use %S.)

For ANSI mode, this means that 'character' is 'byte'. In ANSI mode, one
character is one byte.

For some reason I thought that you had sometimes written code targetting
ANSI code pages in which you knew that these statements are not true. It
looks like I misremembered. OK, then it seems that this is your
introduction to such code pages. In ANSI mode, one character is one or more
bytes. In the ANSI code pages that Microsoft implemented, one character is
one or two bytes, no more than two.

I haven't been using Japanese Microsoft systems for nearly 20 years, I've
only been using them for half that length of time and occasionally seen them
in use the other half of that time while I was using Japanese Unix and
Japanese VMS systems. I've used %s format in printf in Japanese Unix and
VMS and Windows systems. This is one kind of experiment that you don't need
to tell me to do.

I will continue to respect your expertise on matters other than character
encodings.

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:b9i1d2p7ca3n59258h63bc1mavfgjngicd@4ax.com...

Multibyte Character Set is an *encoding* of a character set. In ANSI
mode, MBCS can be
used to encode 'characters' in an extended set; however, StringCchPrintf,
sprintf, etc. do
only convert characters using code pages in special cases, e.g., %lc or %C
format. The
formal definition for %c, the formatting code being discussed in this
example, is that
the int argument is converted to 'unsigned char' and formatted as a
character. For ANSI
mode, this means that 'character' is 'byte'. In ANSI mode, one character
is one byte.

In a multibyte character set, a glyph might be represented by one to four
successive 8-bit
bytes. Note that using %c would be erroneous for formatting an integer
value, if the
intent was to produce a multibyte sequence representing a single logical
character.

This can easily be seen by looking at the %c formatting code in output.c
in the CRT
source. %c formats exactly one byte in ANSI mode. So arguing that %c
requires two bytes
for a character is not correct.

The exact code executed for %c formatting is
                   unsigned short temp;
                   temp = (unsigned short) get_int_arg(&argptr);
                   {
                       buffer.sz[0] = (char) temp;
                       textlen = 1;
                   }

I see nothing here that can generate more than one byte of output. Note
that the %C and
%lc formats, which take wide character values and format them in
accordance with the code
page, *can* generate more than one byte of character, which does satisfy
the objection
raised. But the format here is clearly %c, and %c is clearly defined, and
the
implementation reflects that definition. So I'm not sure what the issue
is here.

StringCchPrintf is defined in terms of 8-bit characters and 16-bit
characters, not in
terms of logical characters encoded in an MBCS. MBCS does not enter the
discussion; if
you format using %lc or %C it will actually truncate the multibyte string
to fit in the
buffer. Thus, it obeys its requirement of not allowing a buffer overrun.

This can be seen trivially simply by--get this--DOING THE EXPERIMENT!!!!!
So while you
can contend until the cows come home that you think that you know how to
read the
documentation, it is a matter of a couple minutes to actually do the
experiment. I found
that even when the wctomb function produces a sequence of multiple bytes
to represent the
wide character as a multibyte character, when formatting with %lc, the
ANSI definition of
StringCchPrintf is in terms of ANSI characters, 8-bit bytes, and it writes
exactly one of
the three bytes of the multibyte sequence, the first byte. So the
sequence

StringCchPrintf(buffer, '%lc', 0xF95C);

will simply transfer to the target buffer the first 8-bit byte of what
turned out to be a
3-byte multibyte sequence.

Note that since I don't have appropriate multinational support, I had to
actually set a
breakpoint and "fake" the results of wctomb, because what it does on my
machine is fail
the conversion and return -1. So I simply placed two bytes and a NUL into
the buffer as
if wctomb had worked correctly, changed the length to 2, and proceeded
with the execution.
Otherwise, I just get an empty string.

UTF-8 is one of the many multibyte character encodings that exist. I
chose it as an
example because it is specified in the Unicode standard.

joe

On Wed, 2 Aug 2006 09:12:11 +0900, "Norman Diamond"
<ndiamond@community.nospam> wrote:

I wrote:

The documentation for StringCchPrintf talks about counts of characters.

Dr. Newcomer's response emphasises several times that the documentation
for
StringCchPrintf talks about counts of ***** characters ***** EXACTLY as I
said it does. It is reassuring to see this agreement, though I wonder why
it's expressed so oddly.

But then odd questions arises

Now where, in the above documentation, does it say that a 'character' is
exactly one byte?
How do you infer that a 'character', in ANSI mode, can occupy two bytes?

Very very true. In the documentation of StringCchPrintf, MSDN correctly
refrains from saying that a 'character' is exactly one byte. Microsoft is
well aware that code page 932 (Shift-JIS) and the code page for the
world's
largest country by population and a couple of other code pages contain
characters that, in ANSI mode, occupy two bytes. Dr. Newcomer, I think
you
are well aware of this too, and I am really confused why you ask these
questions.

Meanwhile, this is still the reason why, if MSDN's documentation is
correct,
buffer overflow can still occur. A caller of the ANSI version can have a
buffer 2 bytes long, long enough for 1 single-byte character plus 1
single-byte null character, and say that its buffer length is 2. But
StringCchPrintf, if it behaves as documented, will copy in 1 character no
matter how many bytes it requires, plus 1 single-byte null character. If
the first character occupies two bytes then the null character goes into
the
third byte of the two-byte buffer.

I don't know where the discussion of UTF-8 came from but I'm not joining
it,
at least not for the moment.

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:pbpuc2lq2o7e2ca2pi9opis83ubrilg4vp@4ax.com...

The documentation for StringCchPrintf talks about counts of characters.
In the ANSI
compilation, each character occupies exactly one TCHAR. I'm not sure
how
you figure a
character can occupy two TCHARs (which are just chars in ANSI) since
each
char has a value
of exactly the range 0..255, which fits in exactly one char.

The documentation for StringCchPrintf says
============================================
StringCchPrintf Function

StringCchPrintf is a replacement for sprintf. It accepts a format string
and a list of
arguments and returns a formatted string. The size, in characters, of
the
destination
buffer is provided to the function to ensure that StringCchPrintf does
not
write past the
end of this buffer.

Syntax

HRESULT StringCchPrintf(
   LPTSTR pszDest,
   size_t cchDest,
   LPCTSTR pszFormat,
    ...
);
Parameters

pszDest
   [out] Pointer to a buffer which receives the formatted,
null-terminated
string created
from pszFormat and its arguments.
cchDest
    [in] Size of the destination buffer, in ****characters****. This
value
must be
             sufficiently large to accommodate the final formatted
string
plus 1 to
             account for the terminating null character. The maximum
number of
             characters allowed is STRSAFE_MAX_CCH.
pszFormat
   [in] Pointer to a buffer containing a printf-style format string.
This
string must be
             null-terminated.
...
   [in] Arguments to be inserted into pszFormat.
=====================================
Note that the word ****characters**** is clearly in italics in the
original documentation.
Now where, in the above documentation, does it say that a 'character' is
exactly one byte?
How do you infer that a 'character', in ANSI mode, can occupy two bytes?
Where is there
the slightest confusion between the char and wchar_t data type here? I
think you have a
very serious confusion in understanding the difference between the terms
'character'
(which is one or two bytes depending on the compilation mode), 'char'
(which is always one
byte), 'wchar_t' (which is always two bytes), and TCHAR (which is one or
two bytes
depending on the compilation mode).

I have no idea what you mean by "one 2-TCHAR character". This is a
contradiction. A
character is by definition a 1-TCHAR character, because that is what is
meant by
"character". A TCHAR[2] holds two characters. A string is a sequence
of
zero or more
characters followed by a NUL character. In ANSI mode, this means for a
TCHAR[2] to
represent a string, it holds a single 8-bit character and a single 8-bit
NUL character, in
Unicode this means it holds a single 2-byte Unicode character and 2-byte
NUL character.
How can you get a 2-byte "character" in ANSI mode? This contradicts the
whole concept of
"character" as specified for each mode. (Note that in ANSI mode, you
can
have UTF
encoding that represents a single 8-bit character as two characters, but
note that this is
two characters, and in ANSI mode that is two bytes. But StringCchPrintf
is not going to
somehow magically convert anything to UTF-8 in the process of formatting
it. Since the
target formatting string, %c, formats exactly one character, a
2-character
buffer, in any
mode, will suffice, and StringCchPrintf will work. UTF-8 is a multibyte
encoding and that
is a discussion completely separate from the one we are having here).
joe

On Tue, 1 Aug 2006 10:27:37 +0900, "Norman Diamond"
<ndiamond@community.nospam> wrote:

The documentation for StringCchPrintf talks about counts of characters.
In
an ANSI compilation each character occupies one or two TCHARs depending
on
the actual character. The documentation for StringCchPrintf doesn't say
that TCHARs are counted where it does say that characters are counted.

Dr. Newcomer, you KNOW how, in an ANSI compilation, one 2-TCHAR
character
will overflow a buffer which has enough space for only one 1-TCHAR
character.

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:tppsc21810onsurc601ligkkiivh5pui77@4ax.com...

The libraries are shared and there is already a copy of them loaded.

What is wrong with StringCchPrintf? It won't overflow the buffer,
which
is a good thing.

The char/wchar_t is what TCHAR means. But it is signed, which implies
sign extension for
any Unicode character > 7FFFU. This will not produce a good result in
most cases. WORD
will handle a char value because it won't sign extended.

I made B an array of two characters. not two bytes. I distinctly
recall
writing
TCHAR B[2];
which is two characters. This means in Unicode it is 4 bytes.

StringCchPrintf will format the string, which is one character plus a
terminal null
character. Do not confuse "character" with "byte". StringCchPrintf
will
copy the single
character and add a NULL character, which the last I looked, was two
characters, the size
of the array.
joe

On Mon, 31 Jul 2006 19:40:24 +0900, "Norman Diamond"
<ndiamond@community.nospam> wrote:

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:e57oc2lrr2nd1j0nt83h8e7h02ahjsbqih@4ax.com...

Use CString::Format as the preferred choice.

On "real" Windows I agree. On Windows CE where extra libraries will
occupy
the machine's RAM, it might not be a good idea.

If you MUST use some form like _stprintf, use StringCchPrintf (I
think
that's the name, but search for strsafe.h on the MSDN) which at
least
will
avoid any possibility of buffer overflow

As documented it will not have such a beneficial effect.

StringCchPrintf(_T("%c"), B, sizeof(B) / sizeof(TCHAR), (BYTE)('a' +
i));

Mihai N. addressed a problem with your cast to BYTE and you made an
adjustment which I'm still thinking about. Since arguments to
StringCchPrintf are either Unicode or ANSI, the last argument should
be
either char or wchar_t, and I'm trying to figure out if WORD is
guaranteed
to marshall a char value properly.

More importantly is that, as documented, buffer overflow can very
easily
occur. Suppose we have an ANSI compilation and make B an array of 2
chars.
Then the buffer has enough space for 1 single-byte character plus a
null
character. But if the last argument is a double-byte character then
StringCchPrintf is documented to copy both bytes plus a single-byte
null
character, total 3 bytes.

Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm