Re: question on java lang spec chapter 3.3 (unicode char lexing)

Arne Vajhøj
Wed, 02 Jan 2013 20:40:32 -0500
On 1/2/2013 8:21 PM, Lew wrote:

Arne Vajhøj wrote:

Lew wrote:

Aryeh M. Friedman wrote:

If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
representation of text) and 100% of the code that I will be lexing is from that environment, do I still need
to deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]? Assume that no code
will be imported from non-unicode environments.

What do you mean "have to deal with"?

If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
authority on what that constitutes.

Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
any responsibilities.

Nor does it obviate the need for the occasional "\uXXXX" in source.

However, I don't think the lexer deals with that. Unicode escape sequences are a precompile
phenomenon. Everything is substituted before parsing starts.

Well - lexing happens before parsing so ...

So does writing source code. What's your point?

That it being done before parsing does not imply not done by lexer.

My point is that the lexer picks up after the substitution of Unicode sequences.
However, my point is wrong, and yours is right.

I am not quite sure what that source code snippet shows.

But a lexer is something that converts from a stream of
source code to a stream of tokens.

Given that:
- the source code contains the escape sequences
- escape sequences are treated the same way as real Unicode characters
and if we assume that:
- the parser has not duplicated a ton of logic to handle
   a Unicode token
then the conversion of escape sequences must happen in
the lexer.

Whether it is a filter in front of the real lexer or buried
more deeply in the lexer is harder to say.
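
To make the "filter in front of the real lexer" option concrete, here is
a minimal sketch in Java. It is not how javac or any particular compiler
does it; the class name UnicodeEscapeFilter and the methods translate and
hexValue are made up for this example. It assumes the rules as I read them
in JLS 3.3: a backslash starts an escape only if an even number of
backslashes contiguously precede it, any number of 'u' characters may
follow it, exactly four hex digits are required, and the character an
escape produces never takes part in a further escape.

```java
/**
 * Sketch of a pre-lexing filter that expands Unicode escapes
 * (a backslash, one or more 'u' characters, then four hex digits).
 * The class and method names are hypothetical, chosen for this example.
 */
final class UnicodeEscapeFilter {

    /** Returns the value of an ASCII hex digit, or -1 if c is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Expands every eligible Unicode escape in src; other text is copied unchanged. */
    static String translate(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int precedingBackslashes = 0; // contiguous raw backslashes just before index i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean startsEscape = c == '\\'
                    && precedingBackslashes % 2 == 0
                    && i + 1 < src.length()
                    && src.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);
                precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < src.length() && src.charAt(j) == 'u') {
                j++; // any number of 'u' characters may follow the backslash
            }
            int value = 0;
            for (int k = 0; k < 4; k++) {
                int digit = (j + k < src.length()) ? hexValue(src.charAt(j + k)) : -1;
                if (digit < 0) {
                    throw new IllegalArgumentException(
                            "malformed Unicode escape at index " + i);
                }
                value = value * 16 + digit;
            }
            out.append((char) value);
            // Assumption: the produced character never takes part in another
            // escape, so the backslash count starts over after it.
            precedingBackslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("int \\u0061 = 1;")); // prints: int a = 1;
    }
}
```

The real lexer would then run over the output of translate and never see
an escape at all. A production version would also have to map positions
in the translated text back to the original source for error messages,
which this sketch ignores.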


