Re: _stprintf

From: "Norman Diamond" <ndiamond@community.nospam>
Newsgroups: microsoft.public.vc.mfc
Date: Tue, 8 Aug 2006 12:46:21 +0900
Message-ID: <#RJd#0puGHA.3912@TK2MSFTNGP03.phx.gbl>

"READING THE CODE"

I don't have the source code to StringCchPrintf. (Source code to some of
Microsoft's versions of ISO printf and stuff like that yes, this one no.)

"PERFORMING THE EXPERIMENT"

And getting a result which works today in one version of Windows XP with one
version of MS Office and one version of Internet Explorer and four versions
of Visual Studio to muddy the waters. It won't work tomorrow. For some of
the incorrect statements in MSDN this kind of experiment is useful in
proving them incorrect, but it's not a reliable way to make reliable code
myself.

"Now, if you have a working system with code page 932 in place,"

give or take an order of magnitude (in the quantity of such systems)...

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:5thdd297fn5d56bd53jnfl78pql63nu1gm@4ax.com...

I think the confusion here is that you are interpreting "character" in one
context as "a sequence of bytes representing a glyph", and StringCchPrintf,
as I said, when %c is used, does NOT interpret the word 'character' this
way. So you can interpret it any way you want, but the only interpretation
that matters is the interpretation given by StringCchPrintf, and you can
see that easily, as I said, by READING THE CODE and PERFORMING THE
EXPERIMENT. Now, if you have a working system with code page 932 in place,
try the experiments I did, and tell us what you get. Try %c, in an ANSI
code page, using any bit value of your choice for the character value, and
tell us what StringCchPrintf does with respect to %c. I was not discussing
%s, but %c, which you insist won't work. So if you're convinced it
produces more than one 8-bit character or 16-bit character of output,
please demonstrate this. Note that %lc and %C *do* expand wide character
codes to multibyte representations, but that was not what we were
discussing.
joe

On Thu, 3 Aug 2006 10:34:15 +0900, "Norman Diamond"
<ndiamond@community.nospam> wrote:

Multibyte Character Set is an *encoding* of a character set.

Yes, ANSI code page 932 is an encoding just like other ANSI code pages such
as (I might not be remembering these numbers correctly) 1252 and 850.

however, StringCchPrintf, sprintf, etc. do only convert characters using
code pages in special cases, e.g., %lc or %C format.

And %s and stuff like that. (If you're compiling in an ANSI environment
then simply use %s, but if you're compiling in a Unicode environment and
want to produce an ANSI encoded string then use %S.)

For ANSI mode, this means that 'character' is 'byte'. In ANSI mode, one
character is one byte.

For some reason I thought that you had sometimes written code targeting
ANSI code pages in which you knew that these statements are not true. It
looks like I misremembered. OK, then it seems that this is your
introduction to such code pages. In ANSI mode, one character is one or more
bytes. In the ANSI code pages that Microsoft implemented, one character is
one or two bytes, no more than two.
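
To make the distinction concrete, here is a rough sketch that counts
characters separately from bytes in a code page 932 (Shift-JIS) string. The
function names are made up for illustration, and the lead-byte ranges
(0x81-0x9F and 0xE0-0xFC) are hard-coded as an assumption; on Windows one
would normally ask the system, for example with IsDBCSLeadByteEx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if b is the lead byte of a two-byte
   character in code page 932 (ranges hard-coded as an assumption). */
int is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts logical characters in a NUL-terminated CP932 string.
   A lead byte followed by a non-NUL byte counts as one character. */
size_t cp932_char_count(const char *s)
{
    assert(s != NULL);
    size_t chars = 0;
    const unsigned char *p = (const unsigned char *)s;
    while (*p != 0) {
        if (is_cp932_lead_byte(*p) && p[1] != 0)
            p += 2;   /* double-byte character */
        else
            p += 1;   /* single-byte character, or a dangling lead byte */
        ++chars;
    }
    return chars;
}
```

For the three-byte string 0x82 0xA0 'a' this counts two characters, which
is exactly the gap between "count of characters" and "count of bytes" under
discussion.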

I haven't been using Japanese Microsoft systems for nearly 20 years, I've
only been using them for half that length of time and occasionally seen
them in use the other half of that time while I was using Japanese Unix and
Japanese VMS systems. I've used %s format in printf in Japanese Unix and
VMS and Windows systems. This is one kind of experiment that you don't need
to tell me to do.

I will continue to respect your expertise on matters other than character
encodings.

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:b9i1d2p7ca3n59258h63bc1mavfgjngicd@4ax.com...

Multibyte Character Set is an *encoding* of a character set. In ANSI mode,
MBCS can be used to encode 'characters' in an extended set; however,
StringCchPrintf, sprintf, etc. do only convert characters using code pages
in special cases, e.g., %lc or %C format. The formal definition for %c, the
formatting code being discussed in this example, is that the int argument
is converted to 'unsigned char' and formatted as a character. For ANSI
mode, this means that 'character' is 'byte'. In ANSI mode, one character is
one byte.

In a multibyte character set, a glyph might be represented by one to four
successive 8-bit bytes. Note that using %c would be erroneous for
formatting an integer value, if the intent was to produce a multibyte
sequence representing a single logical character.

This can easily be seen by looking at the %c formatting code in output.c in
the CRT source. %c formats exactly one byte in ANSI mode. So arguing that
%c requires two bytes for a character is not correct.

The exact code executed for %c formatting is
                   unsigned short temp;
                   temp = (unsigned short) get_int_arg(&argptr);
                   {
                       buffer.sz[0] = (char) temp;
                       textlen = 1;
                   }

I see nothing here that can generate more than one byte of output. Note
that the %C and %lc formats, which take wide character values and format
them in accordance with the code page, *can* generate more than one byte of
character, which does satisfy the objection raised. But the format here is
clearly %c, and %c is clearly defined, and the implementation reflects that
definition. So I'm not sure what the issue is here.
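
The one-byte behaviour of %c can be checked with a small sketch. This uses
standard snprintf as a portable stand-in; it is not the strsafe.h
implementation, and the wrapper name is hypothetical. ISO C specifies that
for %c the int argument is converted to unsigned char, so only the low
byte of the value survives.

```c
#include <assert.h>
#include <stdio.h>

/* Formats value with "%c" into out (capacity cap bytes) and returns the
   number of bytes the conversion produced, not counting the NUL. */
int format_percent_c(char *out, size_t cap, int value)
{
    assert(out != NULL && cap >= 2);
    return snprintf(out, cap, "%c", value);
}
```

Passing 0x82A0 (the Shift-JIS code for one double-byte character) yields a
single byte, 0xA0, not the two-byte sequence.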

StringCchPrintf is defined in terms of 8-bit characters and 16-bit
characters, not in terms of logical characters encoded in an MBCS. MBCS
does not enter the discussion; if you format using %lc or %C it will
actually truncate the multibyte string to fit in the buffer. Thus, it obeys
its requirement of not allowing a buffer overrun.

This can be seen trivially simply by--get this--DOING THE EXPERIMENT!!!!!
So while you can contend until the cows come home that you think that you
know how to read the documentation, it is a matter of a couple minutes to
actually do the experiment. I found that even when the wctomb function
produces a sequence of multiple bytes to represent the wide character as a
multibyte character, when formatting with %lc, the ANSI definition of
StringCchPrintf is in terms of ANSI characters, 8-bit bytes, and it writes
exactly one of the three bytes of the multibyte sequence, the first byte.
So the sequence

StringCchPrintf(buffer, sizeof(buffer) / sizeof(TCHAR), _T("%lc"), 0xF95C);

will simply transfer to the target buffer the first 8-bit byte of what
turned out to be a 3-byte multibyte sequence.

Note that since I don't have appropriate multinational support, I had to
actually set a breakpoint and "fake" the results of wctomb, because what it
does on my machine is fail the conversion and return -1. So I simply placed
two bytes and a NUL into the buffer as if wctomb had worked correctly,
changed the length to 2, and proceeded with the execution. Otherwise, I
just get an empty string.
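
For anyone repeating this, a minimal sketch of the wctomb step may help.
The wrapper name is hypothetical. wctomb converts according to the current
locale and returns -1 when the wide character has no multibyte
representation there, which is the failure described above.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical wrapper: converts wc to its multibyte form in the current
   locale. Returns the byte count, or -1 if wc has no representation or
   out is too small to hold it. */
int encode_wide_char(wchar_t wc, char *out, size_t cap)
{
    char tmp[MB_LEN_MAX];
    assert(out != NULL);
    int n = wctomb(tmp, wc);
    if (n < 0 || (size_t)n > cap)
        return -1;
    for (int i = 0; i < n; ++i)
        out[i] = tmp[i];
    return n;
}
```

Whether a given code such as 0xF95C converts at all depends on the locale
in effect, so the result on a machine without the relevant code page says
little about a machine that has it.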

UTF-8 is one of the many multibyte character encodings that exist. I chose
it as an example because it is specified in the Unicode standard.

joe

On Wed, 2 Aug 2006 09:12:11 +0900, "Norman Diamond"
<ndiamond@community.nospam> wrote:

I wrote:

The documentation for StringCchPrintf talks about counts of characters.

Dr. Newcomer's response emphasises several times that the documentation for
StringCchPrintf talks about counts of ***** characters ***** EXACTLY as I
said it does. It is reassuring to see this agreement, though I wonder why
it's expressed so oddly.

But then odd questions arise:

Now where, in the above documentation, does it say that a 'character' is
exactly one byte?
How do you infer that a 'character', in ANSI mode, can occupy two bytes?

Very very true. In the documentation of StringCchPrintf, MSDN correctly
refrains from saying that a 'character' is exactly one byte. Microsoft is
well aware that code page 932 (Shift-JIS) and the code page for the world's
largest country by population and a couple of other code pages contain
characters that, in ANSI mode, occupy two bytes. Dr. Newcomer, I think you
are well aware of this too, and I am really confused why you ask these
questions.

Meanwhile, this is still the reason why, if MSDN's documentation is
correct, buffer overflow can still occur. A caller of the ANSI version can
have a buffer 2 bytes long, long enough for 1 single-byte character plus 1
single-byte null character, and say that its buffer length is 2. But
StringCchPrintf, if it behaves as documented, will copy in 1 character no
matter how many bytes it requires, plus 1 single-byte null character. If
the first character occupies two bytes then the null character goes into
the third byte of the two-byte buffer.
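
The arithmetic can be sketched in code. Below is a hypothetical bounded
copy that treats the capacity as a count of BYTES and refuses to split a
double-byte character; it is an illustration of the sizing problem, not a
description of what StringCchPrintf does. The code page 932 lead-byte
ranges are hard-coded as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed CP932 lead-byte ranges, hard-coded for this sketch. */
int is_lead_byte_932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Copies src into dst, whose capacity is cap bytes including the NUL,
   stopping before any character that would not fit whole.
   Returns the number of bytes written, excluding the NUL. */
size_t copy_dbcs_bounded(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    size_t used = 0;
    const unsigned char *p = (const unsigned char *)src;
    while (*p != 0) {
        size_t len = (is_lead_byte_932(*p) && p[1] != 0) ? 2 : 1;
        if (used + len + 1 > cap)   /* character plus NUL must fit */
            break;
        for (size_t i = 0; i < len; ++i)
            dst[used++] = (char)p[i];
        p += len;
    }
    dst[used] = '\0';
    return used;
}
```

With a 2-byte buffer, one double-byte character does not fit alongside the
null terminator, so the sketch copies nothing; a 3-byte buffer is the
minimum that holds it.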

I don't know where the discussion of UTF-8 came from but I'm not joining
it, at least not for the moment.

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:pbpuc2lq2o7e2ca2pi9opis83ubrilg4vp@4ax.com...

The documentation for StringCchPrintf talks about counts of characters. In
the ANSI compilation, each character occupies exactly one TCHAR. I'm not
sure how you figure a character can occupy two TCHARs (which are just chars
in ANSI) since each char has a value of exactly the range 0..255, which
fits in exactly one char.

The documentation for StringCchPrintf says
============================================
StringCchPrintf Function

StringCchPrintf is a replacement for sprintf. It accepts a format string
and a list of arguments and returns a formatted string. The size, in
characters, of the destination buffer is provided to the function to ensure
that StringCchPrintf does not write past the end of this buffer.

Syntax

HRESULT StringCchPrintf(
   LPTSTR pszDest,
   size_t cchDest,
   LPCTSTR pszFormat,
    ...
);
Parameters

pszDest
   [out] Pointer to a buffer which receives the formatted, null-terminated
   string created from pszFormat and its arguments.
cchDest
   [in] Size of the destination buffer, in ****characters****. This value
   must be sufficiently large to accommodate the final formatted string
   plus 1 to account for the terminating null character. The maximum
   number of characters allowed is STRSAFE_MAX_CCH.
pszFormat
   [in] Pointer to a buffer containing a printf-style format string. This
   string must be null-terminated.
...
   [in] Arguments to be inserted into pszFormat.
=====================================
Note that the word ****characters**** is clearly in italics in the original
documentation.
Now where, in the above documentation, does it say that a 'character' is
exactly one byte?
How do you infer that a 'character', in ANSI mode, can occupy two bytes?
Where is there the slightest confusion between the char and wchar_t data
type here? I think you have a very serious confusion in understanding the
difference between the terms 'character' (which is one or two bytes
depending on the compilation mode), 'char' (which is always one byte),
'wchar_t' (which is always two bytes), and TCHAR (which is one or two bytes
depending on the compilation mode).
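
A small sketch of the sizes being compared. The typedef name is a
hypothetical stand-in for TCHAR so the code does not depend on <tchar.h>:
it maps to wchar_t when UNICODE is defined and to char otherwise. Note that
sizeof(wchar_t) is 2 with Microsoft's compilers but is commonly 4
elsewhere, so the "two bytes" figure is specific to Windows.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the Windows TCHAR mapping (hypothetical name). */
#ifdef UNICODE
typedef wchar_t tchar_sketch;
#else
typedef char tchar_sketch;
#endif

/* Bytes occupied by a buffer of count TCHAR-like elements. */
size_t tchar_buffer_bytes(size_t count)
{
    assert(count <= (size_t)-1 / sizeof(tchar_sketch));
    return count * sizeof(tchar_sketch);
}
```

In an ANSI build a two-element buffer is 2 bytes; in a Unicode build on
Windows it is 4. Neither figure says how many bytes one logical character
of a double-byte code page needs.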

I have no idea what you mean by "one 2-TCHAR character". This is a
contradiction. A character is by definition a 1-TCHAR character, because
that is what is meant by "character". A TCHAR[2] holds two characters. A
string is a sequence of zero or more characters followed by a NUL
character. In ANSI mode, this means for a TCHAR[2] to represent a string,
it holds a single 8-bit character and a single 8-bit NUL character, in
Unicode this means it holds a single 2-byte Unicode character and 2-byte
NUL character. How can you get a 2-byte "character" in ANSI mode? This
contradicts the whole concept of "character" as specified for each mode.
(Note that in ANSI mode, you can have UTF encoding that represents a single
8-bit character as two characters, but note that this is two characters,
and in ANSI mode that is two bytes. But StringCchPrintf is not going to
somehow magically convert anything to UTF-8 in the process of formatting
it. Since the target formatting string, %c, formats exactly one character,
a 2-character buffer, in any mode, will suffice, and StringCchPrintf will
work. UTF-8 is a multibyte encoding and that is a discussion completely
separate from the one we are having here).
joe

On Tue, 1 Aug 2006 10:27:37 +0900, "Norman Diamond"
<ndiamond@community.nospam> wrote:

The documentation for StringCchPrintf talks about counts of characters. In
an ANSI compilation each character occupies one or two TCHARs depending on
the actual character. The documentation for StringCchPrintf doesn't say
that TCHARs are counted where it does say that characters are counted.

Dr. Newcomer, you KNOW how, in an ANSI compilation, one 2-TCHAR character
will overflow a buffer which has enough space for only one 1-TCHAR
character.

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:tppsc21810onsurc601ligkkiivh5pui77@4ax.com...

The libraries are shared and there is already a copy of them loaded.

What is wrong with StringCchPrintf? It won't overflow the buffer, which is
a good thing.

The char/wchar_t is what TCHAR means. But it is signed, which implies sign
extension for any Unicode character > 7FFFU. This will not produce a good
result in most cases. WORD will handle a char value because it won't be
sign-extended.
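
The sign-extension concern can be illustrated for the 8-bit case with a
short sketch. The function names are hypothetical, and signed char is used
explicitly because the signedness of plain char is implementation-defined.

```c
#include <assert.h>

/* Widens a signed 8-bit value directly: the sign bit is extended. */
int widen_sign_extended(signed char c)
{
    return (int)c;
}

/* Widens via unsigned char first: the result stays in 0..255. */
int widen_zero_extended(signed char c)
{
    return (int)(unsigned char)c;
}

/* Small self-check using the bit pattern 0x82, a typical lead byte. */
void demo_widening(void)
{
    signed char lead = (signed char)-126;   /* bit pattern 0x82 */
    assert(widen_sign_extended(lead) == -126);
    assert(widen_zero_extended(lead) == 0x82);
}
```

Converting through an unsigned type before widening is what keeps a byte
such as 0x82 from turning into a negative int when it is passed through a
variadic argument list.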

I made B an array of two characters, not two bytes. I distinctly recall
writing
TCHAR B[2];
which is two characters. This means in Unicode it is 4 bytes.

StringCchPrintf will format the string, which is one character plus a
terminal null character. Do not confuse "character" with "byte".
StringCchPrintf will copy the single character and add a null character,
which, the last I looked, makes two characters, the size of the array.
joe

On Mon, 31 Jul 2006 19:40:24 +0900, "Norman Diamond"
<ndiamond@community.nospam> wrote:

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:e57oc2lrr2nd1j0nt83h8e7h02ahjsbqih@4ax.com...

Use CString::Format as the preferred choice.

On "real" Windows I agree. On Windows CE where extra libraries will occupy
the machine's RAM, it might not be a good idea.

If you MUST use some form like _stprintf, use StringCchPrintf (I think
that's the name, but search for strsafe.h on the MSDN) which at least will
avoid any possibility of buffer overflow

As documented it will not have such a beneficial effect.

StringCchPrintf(B, sizeof(B) / sizeof(TCHAR), _T("%c"), (BYTE)('a' + i));

Mihai N. addressed a problem with your cast to BYTE and you made an
adjustment which I'm still thinking about. Since arguments to
StringCchPrintf are either Unicode or ANSI, the last argument should be
either char or wchar_t, and I'm trying to figure out if WORD is guaranteed
to marshal a char value properly.

More important is that, as documented, buffer overflow can very easily
occur. Suppose we have an ANSI compilation and make B an array of 2 chars.
Then the buffer has enough space for 1 single-byte character plus a null
character. But if the last argument is a double-byte character then
StringCchPrintf is documented to copy both bytes plus a single-byte null
character, total 3 bytes.

Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm