Re: peek() vs unget(): which is better?
Seungbeom Kim wrote:
I'm writing a simple lexer. It has to determine when to stop reading
for the current token, and it seems to have basically two options:
(1) peek(), and if valid for the current token, get() and continue
(2) get(), and if not valid for the current token, unget() and continue
Which is better? Or are they equally good?
Better in what sense? For various reasons, I feel more at home
using peek(), so I use peek(). Somehow, it seems more rational to
look at the character, then consume it, rather than to consume
it, then put it back. More generally, I think it's usually
clearer in the code when you need a new character than it is
when you have to reject the character you already have. (And of
course: would you systematically write *iter++, and then do
--iter if you found you'd gone too far, or would you write
*iter, and then ++iter when you wanted to advance?)
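To make the look-then-consume pattern concrete, here is a minimal sketch of reading one token with peek(). The names read_word and demo_read_word are made up for illustration, and "a run of letters and digits" is only an assumed token rule, not anything from the original lexer.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: collects a run of letters and digits.
// peek() returns either EOF or a value in [0, UCHAR_MAX], so the result
// can be passed straight to the one-argument isalnum().
std::string read_word(std::istream& is)
{
    std::string word;
    while (std::isalnum(is.peek())) {
        word += static_cast<char>(is.get());   // consume only what was accepted
    }
    return word;
}

// Usage sketch: the character that ends the token is still in the stream.
void demo_read_word()
{
    std::istringstream in("abc123 def");
    assert(read_word(in) == "abc123");
    assert(in.peek() == ' ');
}
```

Nothing is ever put back here: a character is only extracted once it is known to belong to the token.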
It seems to me that (1) makes the code more cluttered and
incurs two unformatted input function calls per character. But I
have read somewhere that unget() is not guaranteed to work
across buffer boundaries, so I suspect (2) is rather unsafe
though simple. Is this correct?
unget() is guaranteed to work for at least one character.
Typically, it will also work for more, as long as you don't cross
a buffer boundary, but this isn't guaranteed. (For that matter,
the input might be unbuffered -- which means in practice a
single-character buffer.)
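For comparison, here is a rough sketch of option (2) that never puts back more than the one character just read, and that checks whether the put-back worked. The names read_digits and demo_read_digits are invented for this example; the only stream behavior relied on is that a failed unget() sets badbit.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads a run of decimal digits with get()/unget().
// Returns false only if the single put-back failed (badbit is then set).
bool read_digits(std::istream& is, std::string& digits)
{
    digits.clear();
    char c;
    while (is.get(c)) {
        if (!std::isdigit(static_cast<unsigned char>(c))) {
            is.unget();            // put back exactly one character
            return !is.bad();
        }
        digits += c;
    }
    return true;                   // stopped at end of input, nothing to put back
}

// Usage sketch: the rejected character can be read again afterwards.
void demo_read_digits()
{
    std::istringstream in("123+4");
    std::string digits;
    assert(read_digits(in, digits));
    assert(digits == "123");
    assert(in.get() == '+');
}
```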
Comments about any other part of the implementation are
welcome, too. Thank you in advance.
------------------------------------------------------------------------
token get_token(std::istream& is)
{
    typedef std::istream::traits_type traits;
    char c;
    int i;
    // skip whitespaces
    while (is.get(c) && std::isspace(c)) { }
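This results in undefined behavior. You can't call the
one-parameter version of isspace with a char and expect to get
away with it: if char is signed, any character with a negative
value is outside the range the function accepts. (In practice,
both Solaris and the ctype.h used by g++ under Linux make it work
for all characters except 'ÿ', which is 0xFF in ISO 8859-1 and
so compares equal to EOF as a signed char. But it's still
undefined behavior according to the standard.)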
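If you do want to keep the get( c ) form, the usual repair is to convert the char to unsigned char before the call. A minimal sketch follows; the helper names is_space_char, skip_spaces_get and demo_skip_spaces_get are made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>

// Converting through unsigned char keeps the argument in [0, UCHAR_MAX],
// which is what the one-argument isspace() requires.
inline bool is_space_char(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical variant of the loop in question: read, test, and put the
// first non-space character back.
void skip_spaces_get(std::istream& is)
{
    char c;
    while (is.get(c)) {
        if (!is_space_char(c)) {
            is.unget();
            break;
        }
    }
}

// Usage sketch.
void demo_skip_spaces_get()
{
    std::istringstream in("  \t x");
    skip_spaces_get(in);
    assert(in.get() == 'x');
}
```

With peek(), the problem doesn't arise in the first place, since peek() already returns an int in the valid range.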
I'd write:
    while ( isspace( is.peek() ) ) {
        is.get() ;
    }
More likely, I'd write something a little more complicated,
using std::ctype, so that my code would be independent of the
global locale. But I'd definitely use peek() like this.
Unless, of course, performance raised its head. In that case,
I'd use the streambuf directly, e.g.:
    std::streambuf* sb = is.rdbuf() ;
    if ( sb == NULL ) {
        //  Handle error, probably shouldn't happen...
    }
    while ( isspace( sb->sgetc() ) ) {
        sb->sbumpc() ;
    }
Typically, the low-level streambuf functions are inline, and
have a very low cost, but if, for some reason, I didn't want to
call them more than necessary, I'd write:
    int lookAhead = sb->sgetc() ;
    while ( isspace( lookAhead ) ) {
        lookAhead = sb->snextc() ;
    }
The use of a variable lookAhead and sb->snextc() is probably the
fastest solution available, and IMHO, is also very readable.
The one thing you have to watch out for is that the streambuf
functions don't update the state of the stream: you have to
ensure yourself that eofbit gets set in is if you see an end of
file here.
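A sketch of how the eofbit bookkeeping might look when the loop is wrapped in a function. The names skip_spaces_fast and demo_skip_spaces_fast are invented here, and setting badbit for a null streambuf is just one assumed way of handling that error.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <streambuf>

// Hypothetical helper: skips white space through the streambuf, then
// reports end of file to the stream, because streambuf calls bypass the
// istream's own state handling.
void skip_spaces_fast(std::istream& is)
{
    typedef std::istream::traits_type traits;
    std::streambuf* sb = is.rdbuf();
    if (sb == 0) {
        is.setstate(std::ios_base::badbit);    // assumed error policy
        return;
    }
    // sgetc()/snextc() return EOF or a value in [0, UCHAR_MAX], so the
    // result is a valid argument for the one-argument isspace().
    traits::int_type lookAhead = sb->sgetc();
    while (std::isspace(lookAhead)) {
        lookAhead = sb->snextc();
    }
    if (traits::eq_int_type(lookAhead, traits::eof())) {
        is.setstate(std::ios_base::eofbit);
    }
}

// Usage sketch: the stream stops on the first non-space character, and
// knows about end of file when only white space was left.
void demo_skip_spaces_fast()
{
    std::istringstream in("   x");
    skip_spaces_fast(in);
    assert(in.peek() == 'x');
    assert(!in.eof());

    std::istringstream blank("   ");
    skip_spaces_fast(blank);
    assert(blank.eof());
}
```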
Using <locale>, of course, this would become:
    typedef std::ctype< char > CType ;
    CType const& ctype
        = std::use_facet< CType >( std::locale::classic() ) ;
    //  or
    //      = std::use_facet< CType >( is.getloc() ) ;
    //  depending on whether you are imposing an encoding, or
    //  you want to accept that of the file.

    //  sgetc() looks at the current character without advancing.
    int lookAhead = sb->sgetc() ;
    while ( lookAhead != EOF
            && ctype.is( CType::space, (char)lookAhead ) ) {
        lookAhead = sb->snextc() ;
    }
(In a stand-alone application, I'd probably force the global
C-style locale, and use ::isspace( int ). Unless I wanted to
handle different input encodings. But then, neither <locale> nor
<locale.h> is much help; in UTF-8, for example, the two-byte
sequence 0xC2, 0xA0 encodes a (no-break) space.)
The rest should follow from the strategy used in skipping
blanks. Just be careful -- you have three different ways to
check for the type of a character in C++, and the simplest
(which you are apparently trying to use) doesn't work with a
variable of type char. Basically, they are:
::isxxx( int ch )/::iswxxx( wint_t ch )
    Requires ch == EOF || (ch >= 0 && ch <= UCHAR_MAX) for the
    char version. All functions return 0 for EOF, which can be
    used to avoid an external check. Depends on the global
    locale -- depending on the application, that's either not a
    problem, or it can cause all sorts of problems.
template< typename charT >
bool std::isxxx( charT ch, std::locale const& )
    Usable only if charT is char or wchar_t (the only types for
    which a locale is guaranteed to have a ctype facet), doesn't
    work for EOF (because EOF is not representable in a
    character type), and requires two parameters. I suspect that
    it's also fairly slow; it must call std::use_facet for each
    invocation.
    In fact, I think this one was only designed for occasional
    use.
template< typename charT >
bool std::ctype< charT >::is( std::ctype_base::mask test, charT ch ) const
    Available only if charT is char or wchar_t. Requires
    explicitly extracting the ctype facet from the locale
    beforehand. Doesn't work for EOF.
    There are also functions in std::ctype (scan_is and
    scan_not) for scanning over characters which are/are not
    xxx. Regrettably, they only work on charT const*, which
    makes them pretty useless here (and in just about any code
    I write).
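To put the three side by side, here is a small sketch that tests for white space each way, plus scan_not on a plain character buffer. All of the function names (is_space_c, is_space_loc, is_space_facet, skip_spaces_in_buffer, demo_three_ways) are made up for this example; only the standard library calls themselves are real.

```cpp
#include <cassert>
#include <cctype>
#include <locale>

// 1. C function: the int argument must be EOF or an unsigned char value,
//    hence the cast. Uses the global C locale.
inline bool is_space_c(char ch)
{
    return std::isspace(static_cast<unsigned char>(ch)) != 0;
}

// 2. Two-argument template from <locale>: takes the char directly, and
//    looks up the facet in the locale on each call.
inline bool is_space_loc(char ch, std::locale const& loc)
{
    return std::isspace(ch, loc);
}

// 3. Facet member: the caller extracts the facet once and reuses it.
inline bool is_space_facet(std::ctype<char> const& ct, char ch)
{
    return ct.is(std::ctype_base::space, ch);
}

// scan_not returns the first character in [begin, end) that is NOT a
// space; it needs a contiguous buffer, not a stream.
inline char const* skip_spaces_in_buffer(std::ctype<char> const& ct,
                                         char const* begin,
                                         char const* end)
{
    return ct.scan_not(std::ctype_base::space, begin, end);
}

// Usage sketch with the classic ("C") locale.
void demo_three_ways()
{
    std::locale const& loc = std::locale::classic();
    std::ctype<char> const& ct = std::use_facet< std::ctype<char> >(loc);

    assert(is_space_c(' ') && is_space_loc(' ', loc) && is_space_facet(ct, ' '));
    assert(!is_space_c('x') && !is_space_loc('x', loc) && !is_space_facet(ct, 'x'));

    char const buf[] = "  \tabc";
    assert(skip_spaces_in_buffer(ct, buf, buf + 6) == buf + 3);
}
```

In a lexer loop, the third form is the one where the facet lookup is paid for only once.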
None of the above handles multibyte encodings, like UTF-8; the
only way to do that within standard C++ is to read from a
wistream, with the appropriate locale to convert the UTF-8 into
Unicode (UCS-4), and use the wchar_t versions of the above
functions. Supposing such a locale exists in your
implementation, of course. (And that it supports UCS-4 -- in some
implementations, wchar_t is only 16 bits, which makes such
support impossible.)
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]