Re: How to check variables for uniqueness ?

From: John Ersatznom <j.ersatz@nowhere.invalid>
Newsgroups: comp.lang.java.programmer
Date: Tue, 16 Jan 2007 11:46:43 -0500
Message-ID: <eoivjd$u5e$1@aioe.org>

Chris Uppal wrote:

String.toUpperCase() does /not/ change the spelling of words (how could it, it
doesn't know anything about words ?). What it does follow are the correct
(insofar as the Unicode spec is correct) rules for mapping lowercase to
uppercase. It produces the /same/ word with the /same/ spelling[*], but
(naturally) a different representation. In this case the number of visually
separable glyphs changes because the U+00DF character (LATIN SMALL LETTER SHARP
S) is a ligature of two logical characters, long s and short s (U+017F and
U+0073). There is no upper case ligature for that combination (compare fi and
FI in English typography), so the correct uppercase version of those (logical)
characters is the sequence SS. (At least that's the theory the Unicode people
seem to be operating on -- they know more about it than me so I'm willing to
believe them).
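
For anyone who wants to see the mapping described above in action, here is a minimal sketch. The class name SharpSDemo and the helper upper() are made up for illustration; only String.toUpperCase(Locale) and Locale.ROOT are standard API.

```java
import java.util.Locale;

class SharpSDemo {
    // Upper-case with the locale-neutral rules, so the result does not
    // depend on the default locale of the machine running this.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String word = "stra\u00DFe";   // "strasse" written with U+00DF, 6 chars
        String shout = upper(word);

        System.out.println(shout);                                   // STRASSE
        System.out.println(word.length() + " -> " + shout.length()); // 6 -> 7
    }
}
```

The result is one char longer than the input, which is the one-to-many mapping under discussion.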

This seems to be excessively technical when the matter under discussion
is simply capitalizing strings. In any event, equalsIgnoreCase should
collapse these "ligatures" of yours as well. Also, I don't notice "fi"
and "FI" producing strange behavior myself -- even if the letters are
often run together so the 'i' hasn't got a separate dot *when typeset*,
this doesn't affect the representation of a string in a computer, only
the visually displayed output (and then usually only when serious
typesetting software is used). Likewise, it makes sense to represent any
other logical sequence of characters in a sensible way under the hood,
regardless of any rendering fanciness that is done when presenting them
to the user.

It is simply erroneous to expect String.toUpperCase() to map characters
one-to-one in the way that English case mapping works. It can't, it isn't
supposed to, and it doesn't...

No, it is not erroneous to expect a method to do exactly and only what
its name implies. It is erroneous, of course, to give a method a name
that is misleading. If toUpperCase needs a lengthy documentation block
explaining why its behavior is surprising, then it's a sure bet that it
should not have been named that, since it's apparently really
toUpperCaseAndDoesSomeExtraStuffToo.

String.equalsIgnoreCase(), on the other hand, is badly broken in that it does
/not/ follow those rules.

So you at least agree with me that it should be consistent with
toUpperCase (and toLowerCase) -- all strings should have a single
canonical toUpperCase, a single canonical toLowerCase, both should
define equivalence classes on the mixed-case input strings, these should
be the SAME equivalence class, and equalsIgnoreCase should implement and
embody the corresponding equivalence relation.
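
The inconsistency being argued about can be reproduced with a short sketch. CaseMismatchDemo and equalViaUpper() are hypothetical names used only here, and comparing upper-cased copies is just one possible way to build the equivalence relation, not a claim about how the library ought to do it.

```java
import java.util.Locale;

class CaseMismatchDemo {
    // Hypothetical helper: treat two strings as equal when their
    // locale-neutral upper-case forms are identical.
    static boolean equalViaUpper(String a, String b) {
        return a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String lower = "stra\u00DFe";  // contains U+00DF
        String upper = "STRASSE";

        // equalsIgnoreCase compares char by char, so the differing
        // lengths (6 vs 7) already make it fail.
        System.out.println(lower.equalsIgnoreCase(upper)); // false

        // Going through toUpperCase gives the opposite answer.
        System.out.println(equalViaUpper(lower, upper));   // true
    }
}
```

So the two methods do not agree on which strings differ only in case, which is the point both posters accept.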

Or, since its behaviour is clearly documented,
perhaps "broken" is too strong a term -- "badly misleading" might be preferred.

It sounds like toUpperCase has a "badly misleading" name since it
(supposedly) does transformations that go well beyond what is normally
meant by everyday blokes by "to upper case", and the method name is
supposed to be a reasonably meaningful capsule summary for everyday
blokes of what the method does. If a method is supposed to behave in a way
that's surprising for any English speaker but not for a German speaker,
maybe it should have a German rather than an English name? :) If it's
supposed to do locale-dependent stuff, then it should have a version
that accepts a Locale object. The version that doesn't shouldn't
surprise English speakers; the version that does shouldn't surprise
anyone familiar with its locale-specific behavior for the locale
actually used. Having locale-dependent behavior invoked randomly without
explicit use of Locale objects, and which furthermore doesn't use the
system locale, is by itself a sign of a questionable design as well as a
sure source of bugs and problems.
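
For reference, a small sketch of the two overloads as they exist in the standard library: String.toUpperCase() uses the default locale, and String.toUpperCase(Locale) takes one explicitly. LocaleUpperDemo and upperIn() are illustrative names, and the choice of "en" and "tr" as language tags is an assumption made just to show a visible difference.

```java
import java.util.Locale;

class LocaleUpperDemo {
    // Hypothetical helper: upper-case under an explicitly named locale.
    static String upperIn(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // English rules: 'i' becomes plain 'I'.
        System.out.println(upperIn("title", "en"));

        // Turkish rules: 'i' becomes U+0130 (capital I with dot above).
        System.out.println(upperIn("title", "tr"));

        // The sharp-s expansion happens under English rules too.
        System.out.println(upperIn("stra\u00DFe", "en"));
    }
}
```

Passing the locale explicitly keeps the result independent of the machine's default settings. The last call suggests the SS expansion is not tied to a German locale.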

I've even encountered somewhere a notion that aString.length() is not
even accurate in current Java versions if a string contains obscure
characters. It suggests aString.<something using the obscure term "code
point", apparently just Unicode-geek for "character"> as its
replacement, while of course there's a ton of legacy code using
length(). I don't suppose it occurred to them that the new fancy-whosit
should have been a replacement length() implementation instead of some
new name that doesn't suggest anything to do with the length of a string
to someone who doesn't care about all the Unicode bells and whistles and
just wants to process strings while remaining agnostic about what they
are ultimately used for or contain? Those users will gravitate to
length() (plus all that legacy code), not caring about the actual
storage length of the internal representation but the length in
characters of their data as a general rule. So there should be a
length() method that returns the true length of the string, and if
necessary a getSize() method that returns the representation's size in
bytes or whatever in case someone needs such low level data. (If they
persist strings as UTF-8 in a text format file that is parsed, or use
serialization, then they don't.)
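
The difference between the two counts can be shown with a short sketch. It assumes the method alluded to above is String.codePointCount(int, int), which seems likely but is not stated in the post. LengthDemo and codePoints() are names invented for the example.

```java
class LengthDemo {
    // Number of Unicode code points, as opposed to UTF-16 char units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', then U+1D538 (stored as a surrogate pair), then 'b'.
        String s = "a\uD835\uDD38b";

        System.out.println(s.length());     // 4 (UTF-16 units)
        System.out.println(codePoints(s));  // 3 (code points)

        // For text inside the Basic Multilingual Plane the two agree.
        System.out.println("abc".length() == codePoints("abc")); // true
    }
}
```

length() reports storage units of the UTF-16 representation, so the two numbers only diverge when a character needs a surrogate pair.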

[*] Arguably the concept "same spelling" is flawed in the context of Unicode
case mapping.

A concept like "same spelling" can't be flawed. It's generally accepted
that "color" and "colour" are the same word, but have different
spellings, right? While "two" and "too" are different words spelled
differently that sound the same, "tomato" and "tomato" are the same word
spelled the same but pronounced differently, and "ant" (the bug) and
"ant" (the build tool) are different words both spelled and pronounced
the same.