Re: Is this UTF-8 Regular Expression semantically correct?
On 5/8/2010 10:37 AM, Peter Olcott wrote:
BYTE_ORDER_MARK [\xEF][\xBB][\xBF]
ASCII [\x0-\x7f]
U2 [\xC2-\xDF][\x80-\xBF]
U3 [\xE0][\xA0-\xBF][\x80-\xBF]
U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5 [\xED][\x80-\x9F][\x80-\xBF]
U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
U {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
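For reference, the byte ranges above can also be checked by hand without a lexer. The following is only a minimal sketch under stated assumptions, not the poster's code: the name is_valid_utf8 is hypothetical, it assumes a compiler with lambda support (C++11 or later), and it gives the byte order mark no special treatment (EF BB BF is simply accepted as an ordinary U6 sequence).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the original post: returns true when the
// whole string can be split into sequences matching ASCII or U2..U9 above.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;

    // True if position p exists and its byte lies in [lo, hi].
    auto in = [&](std::size_t p, unsigned lo, unsigned hi) {
        if (p >= n) return false;
        const unsigned b = static_cast<unsigned char>(s[p]);
        return lo <= b && b <= hi;
    };
    // True if position p holds a continuation byte, i.e. [\x80-\xBF].
    auto tail = [&](std::size_t p) { return in(p, 0x80, 0xBF); };

    while (i < n) {
        if (in(i, 0x00, 0x7F)) i += 1;                                        // ASCII
        else if (in(i, 0xC2, 0xDF) && tail(i + 1)) i += 2;                    // U2
        else if (in(i, 0xE0, 0xE0) && in(i + 1, 0xA0, 0xBF) && tail(i + 2))
            i += 3;                                                           // U3
        else if (in(i, 0xE1, 0xEC) && tail(i + 1) && tail(i + 2)) i += 3;     // U4
        else if (in(i, 0xED, 0xED) && in(i + 1, 0x80, 0x9F) && tail(i + 2))
            i += 3;                                                           // U5
        else if (in(i, 0xEE, 0xEF) && tail(i + 1) && tail(i + 2)) i += 3;     // U6
        else if (in(i, 0xF0, 0xF0) && in(i + 1, 0x90, 0xBF) && tail(i + 2) &&
                 tail(i + 3))
            i += 4;                                                           // U7
        else if (in(i, 0xF1, 0xF3) && tail(i + 1) && tail(i + 2) && tail(i + 3))
            i += 4;                                                           // U8
        else if (in(i, 0xF4, 0xF4) && in(i + 1, 0x80, 0x8F) && tail(i + 2) &&
                 tail(i + 3))
            i += 4;                                                           // U9
        else return false;  // no pattern matches at this position
    }
    return true;
}
```

Each branch mirrors one named pattern. The narrowed second-byte ranges in U3, U5, U7 and U9 are what reject overlong encodings, UTF-16 surrogates and code points above U+10FFFF.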
I will be building a UTF-8 string probably called utf8string
that will provide a subset of the exact std::string
interface. This is my starting point. I will post the code
for code review, and provide a license to use this code for
any purpose as long as the original authorship is retained
in the source-code.
After the above regular expression has passed peer review, I
will post the interface subset that I will be providing.
If you hadn't posted all those "I will do that" and "this will provide
such and such functionality", your regular expression question would
simply be answered on its merit (whatever that might be). Why make all
those promises? When/if you have something to share/discuss, post. If
you don't, then do you need to discuss your intentions? I guess I'm just
tired of the previous UTF-8 "discussion"... Don't mind me, please. Or
better yet, just killfile me...
Oh, speaking of your question, does it have anything to do with the C++
language? I use regular expressions in my work so rarely, and of such a
specific nature (my editor search accepts optional regexp syntax, only
specific to it), that I haven't paid attention to the regexp area of the
C++ Standard library at all... Is UTF-8 now part of C++?
V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask