Re: stream bytes
Christopher <cpisz@austin.rr.com> wrote in
news:43e044b9-4ef9-45df-bd83-7547199c73c6@e2g2000vbb.googlegroups.com:
I am trying to debug some code that supposedly stores text as bytes
and then back again.
I am not very sure what this is supposed to mean. In a computer any data is
stored as bytes, including any original "text". Probably you mean that you
want to change the encoding of the characters in some text. In that case,
from which encoding to which?
The first step is to examine the contents.
Somewhere data was inserted that contained at least one character with
the signed bit set.
So, the supposed "text" isn't even valid text at all.
It depends on the encoding of the text. It seems you assume this is ASCII.
Later you mention that your files are XML. The XML standard says that almost
any Unicode character is valid, and specifically requires that all readers
support at least the UTF-8 and UTF-16 encodings
(http://www.w3.org/TR/REC-xml/#charsets 2.2 "All XML processors MUST accept
the UTF-8 and UTF-16 encodings of Unicode [Unicode]"). So your assumption
that an XML file contains only ASCII is basically flawed.
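To make this concrete, here is a minimal sketch (the name countHighBitBytes
is made up for illustration, it is not from your code). It counts the bytes
with the top bit set. Every non-ASCII character in UTF-8 encoded XML consists
of such bytes, and these are exactly the ones that look negative where char
is signed:

```cpp
#include <cstddef>
#include <string>

// Count the bytes whose most significant bit is set (values 0x80..0xFF).
// The cast to unsigned char makes the test independent of whether plain
// char is signed on the platform.
std::size_t countHighBitBytes(const std::string& bytes) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (static_cast<unsigned char>(bytes[i]) > 0x7f) {
            ++count;
        }
    }
    return count;
}
```

For "<a>\xC3\xA9</a>" this gives 2, because U+00E9 (e with acute accent) is
encoded in UTF-8 as the two bytes 0xC3 0xA9. For a pure ASCII document it
gives 0.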
I had originally written this code, which threw if an invalid text
char (or non-ascii) was found:
std::wstring StringBufferList::GetBytesAsText() const
According to the function name this should convert to "text", not from
"text". But your complaint is that it is the "text" that contains "invalid"
characters. How come?
{
// This class should have been storing bytes as unsigned char rather than char
// to begin with and needs to be changed later.
//
// I am just adding this method quickly for debugging purposes.
//
// Because of the lack of type safety currently in insertion of any type using a reinterpret_cast
// this class made use of, we must check each byte for validity.
//
// It was assumed only ascii characters would be used, but that might not be the case
std::wstringstream output;
size_t numBytes = getSize();
for( const_iterator itBuffer = begin(); itBuffer != end(); itBuffer++ )
{
for( size_t byteIndex = 0; byteIndex < numBytes; ++byteIndex )
{
char & data = itBuffer->buffer->value_.get()[byteIndex];
Why do you use a reference? Why a non-const reference?
if( data < 0 )
{
// Error - Invalid byte value
std::wstringstream msg;
msg << L"Attempted to convert byte values to wide character hex text values and came across a negative signed character.";
Hex encoding is always ASCII so "wide character hex text value" is an
oxymoron.
LOG4CXX_ERROR(logger, msg.str());
throw InternalErrorException(__WFILE__, __LINE__) << msg.str();
}
output << std::hex << std::setw(2) << std::setfill(L'0') << data;
And as this is ASCII there is no need to use a wide stream here (but it
should not harm either).
output << "' '";
}
}
return output.str();
}
In debugging, my exception happened.
Can I examine the values just by taking my if statement out? How do I
recognize the signed values in hex?
In debugging, the easiest way to examine values is by using the debugger.
But of course you can output them as well.
Hex is a textual (ASCII) representation of some integer value. This
representation has no notation for negative values, so a negative value is
normally converted to an unsigned type before formatting. A negative value
then shows up as a large positive value (for example, -1 as a 32-bit int
prints as ffffffff).
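A rough sketch of how you could dump the bytes yourself, using a plain
narrow stream. The names hexNaive and hexDump are my own, not from your
class. Note that streaming a char prints the character itself, because
std::hex only affects integer types, so each byte has to be converted to an
integer first. hexNaive shows what happens if the conversion goes straight
from a signed value to int, and hexDump goes through unsigned char so that
each byte comes out as two hex digits:

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Naive version: the (possibly negative) byte is widened straight to int,
// so the sign is carried into the wider type and the hex output is long.
std::string hexNaive(signed char byte) {
    std::ostringstream out;
    out << std::hex << static_cast<int>(byte);
    return out.str();
}

// Dump a buffer as space-separated two-digit hex values.
std::string hexDump(const char* buffer, std::size_t length) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < length; ++i) {
        if (i != 0) {
            out << ' ';
        }
        // Go through unsigned char first so every byte lands in 0..255.
        out << std::setw(2)
            << static_cast<unsigned int>(static_cast<unsigned char>(buffer[i]));
    }
    return out.str();
}
```

In the hexDump output any value of 80 or above is a byte that looked
negative to your check. With a 32-bit int, hexNaive(-23) should give
ffffffe9, while the same byte in hexDump comes out as e9.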
It seems that you have been confused by the legacy habit of using 'char'
buffers for storing text, and by the fact that 'char' is a signed datatype
in many common implementations. Actually no text encoding uses negative
codes, so this habit does not make much sense. One ought to use unsigned
char instead, but this is cumbersome because of a myriad of legacy
interfaces. Fortunately, most data transport routines do not care whether it
is char or unsigned char, so one needs to worry only when processing the
values. What one usually does is just cast char values to unsigned char
before any processing.
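A typical place where the cast matters is a table indexed by byte value.
The sketch below (byteHistogram is just an illustrative name, and it assumes
a compiler with std::array) would index outside the table for bytes above
0x7f if the char were used directly on a platform where char is signed:

```cpp
#include <array>
#include <cstddef>

// Count how often each byte value occurs in a char buffer.
std::array<std::size_t, 256> byteHistogram(const char* buffer,
                                           std::size_t length) {
    std::array<std::size_t, 256> counts{};  // all counters start at zero
    for (std::size_t i = 0; i < length; ++i) {
        // Cast before using the value as an index: a negative char would
        // otherwise address memory in front of the table.
        unsigned char c = static_cast<unsigned char>(buffer[i]);
        ++counts[c];
    }
    return counts;
}
```

The buffer itself stays a plain char buffer, only the point of use changes.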
If this buffer supposedly held XML in ASCII, how would you go about
looking at the contents to see where it became invalid?
"XML in ASCII" is an oxymoron. But OK, if you want to check if some char
buffer can be interpreted as a text in ASCII encoding:
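#include <cstddef>
#include <iostream>

bool IsAscii(const char* buffer, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        unsigned char c = static_cast<unsigned char>(buffer[i]);
        if (c > 0x7f) {
            // Convert to unsigned int for output; streaming an unsigned char
            // would print the raw character instead of its numeric value.
            std::cerr << "Byte at index " << i << " has value "
                      << static_cast<unsigned int>(c) << "\n";
            return false;
        }
    }
    return true;
}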
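If you want to see every place where the buffer stops being ASCII rather
than only the first one, a variation along these lines might help
(nonAsciiOffsets is a hypothetical helper, not something from your code):

```cpp
#include <cstddef>
#include <vector>

// Return the offsets of all bytes that are outside the ASCII range.
std::vector<std::size_t> nonAsciiOffsets(const char* buffer,
                                         std::size_t length) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < length; ++i) {
        if (static_cast<unsigned char>(buffer[i]) > 0x7f) {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

With the offsets in hand you can print a few bytes around each one and
compare them against what was originally inserted.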
hth
Paavo