Re: 32-bit String (was: CodePoint amanuensis)
On Sat, 2 Jan 2010, Thomas Pornin wrote:
According to Steven Simpson <ss@domain.invalid>:
You suggest that there should be a "32-bit analog to String, perhaps
called CodePoints", but it shouldn't really be necessary, as String
doesn't reveal its internal char[] representation anywhere, right?
String.charAt(int) uses an index which is counted in 16-bit code units
(which Java calls 'char'). A Unicode code point is represented as either
one or two consecutive code units. String.codePointAt(int) returns the
code point which begins at the specified code unit.
If you want to iterate over code points, you end up using code such as:
for (int i = 0, n = str.length(); i < n; i ++) {
int cp = str.codePointAt(i);
if (cp > 0xFFFF)
i ++;
}
which is slightly inelegant. For random access ("I want the 42nd code
point in this string"), things are worse. Basically, String lacks
support for a by-code-point index.
Internally, it would be expensive to maintain both the by-code-unit
index and the by-code-point index
You could have a representation that was like:
/**
The indices in the string's sequence of code points of each supplementary
character (that from a high plane), which is encoded as a pair of
surrogates in the string's sequence of characters. If there are no
supplementary characters, this should be null rather than empty.
*/
private final int[] supplementaryIndices;
/**
Gets the ith code point from the string.
This almost certainly contains at least one off-by-one error, but you get
the idea.
*/
public int codePointAtCodePointIndex(int i) {
if (supplementaryIndices == null) return charAt(i) & 0xffff; // fast path
int slip = Arrays.binarySearch(supplementaryIndices, i);
if (slip < 0) slip = -(slip + 1);
return codePointAt(i + slip);
}
In the very common case that there are no supplementary characters in the
string, this will be just as fast as indexing by character. If there are
supplementary characters, it adds the cost of a binary search through
them, ie O(log(m)), where m is the number of supplementaries, which should
be pretty cheap. Indeed, m is likely to be so small that a linear search
would often be faster; it might be worth having a threshold below which
linear search is used (if Arrays.binarySearch doesn't do this already).
Storage overhead is O(m), reducing to a single word in the common
no-supplementaries case.
Mind you, i wouldn't bother. The mismatch between chars and code points is
the least of java's string representation problems - the mismatch between
either of those and actual letters that results from the existence of
combining marks, precomposed ligatures, and various other unicode oddities
is much more of a headache.
tom
--
Science of a sufficiently advanced form is indistinguishable from magic