Re: Optimizing Java method

From:

Eric Sosman <esosman@ieee-dot-org.invalid>

Newsgroups:

comp.lang.java.programmer,comp.lang.java.help

Date:

Wed, 11 Jul 2007 08:58:25 -0400

Message-ID:

<xoudnYekpvCISQnbnZ2dnUVZ_tadnZ2d@comcast.com>

Benjamin White wrote:

The routine below is supposed to convert a String containing decimal
digits to a IBM mainframe Z/series packed decimal number. It seems to
consume a lot of CPU time. Background about this is at
http://benjaminjwhite.name/zdecimal/doc/index.html Is there a faster
way to code this?

    /**
     * Converts String to packed decimal number. Decimal points, commas and
     * spaces are ignored. Sign character is processed. Avoid multiple signs.
     * Characters other than digits are invalid and will cause DataException.
     * Comma, blank, period, dollar sign and plus are ignored. Scaling and
     * exponents are not valid.
     *
     * @param str
     * String of number to convert
     * @return byte array of packed decimal, 16 long
     */
    public static byte[] stringToPack(String str) throws DataException {
        int i; // string index
        int j; // byte array index
        boolean nibble_ordinal = false;
        char ch1;
        byte nibble;
        int strlen = str.length();
        byte[] pknum = new byte[16];
        for (i = 0; i < 16; i++) { // initialize to zero
            pknum[i] = 0;
        }

This loop is unnecessary, since the newly-created array
is already zero-filled.

         i = strlen - 1;
        j = 15; /* byte index */
        pknum[j] = 12; // start with positive sign
        while (i > -1) {
            ch1 = str.charAt(i);
            switch (ch1) {
            case '0':
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7':
            case '8':
            case '9':

The majority of the input characters probably wind up in
this branch. It might -- *might* -- be quicker to use an
explicit comparison '0' <= ch1 && ch1 <= '9' than to pull
out the switch machinery; you could resort to a switch if
you found a non-digit character.

nibble = (byte) Character.getNumericValue(ch1);

Since you don't accept (and hence can ignore) locale-dependent
"other digit" characters, (byte)(ch1 - '0') gives the same answer
and might be faster.

                 if (nibble_ordinal) {
                    pknum[j] = (byte) (pknum[j] | nibble);
                    nibble_ordinal ^= true;
                } else {
                    pknum[j] = (byte) (pknum[j] | nibble << 4);
                    nibble_ordinal ^= true;
                    --j;
                }

This logic could be streamlined a bit. Instead of testing a
flag and branching, consider maintaining a shift count that
alternates between 0 and 4 (shift = 4 - shift is one way). The
only conditional that remains is deciding when to decrement j,
which you'd do after shift returns to zero.

                 --i; // get next char
                break;
            case ',':
            case ' ':
            case '.':
            case '$':
            case '+':
                --i; // get next char
                break;
            case '-':
                pknum[15] = (byte) (pknum[15] & 0xf0);
                pknum[15] = (byte) (pknum[15] | 0x0d);

This could also be streamlined, but probably doesn't have
much effect on the performance since it's executed only once
per number.

                 --i; // get next char
                break;
            default:
                throw new DataException("Invalid decimal digit: " + ch1);

Tastes vary, but I think it would be more helpful to display
the entire objectionable String. When you're trying to track
down the source of an invalid input, knowing that the invalid
digit was "x" tells you a little, but knowing that the wider
context was "8.5x11" tells you more.

             }
        }
        return (pknum);
    }

     One thing I don't like is the way the maintenance of the
index 'i' is scattered all over the code: It's just a loop
counter, yet it's modified at three different sites inside
the loop. This introduces comprehensional complexity ("How
sure am I that 'i' is decremented exactly once on each path
through the loop?"), and may also make it harder for the JIT
compiler to recognize that it's just an index and optimize
accordingly.

     Putting it all together, I'd suggest something along the
lines of (just typed in, not tested):

    public static byte[] stringToPack(String str)
        throws DataException
    {
        byte[] pknum = new byte[16];
        byte[15] = 0x0C;
        int j = 15;
        int shift = 4;
        for (int i = str.length; --i >= 0; ) {
            char ch = str.charAt(i);
            if ('0' <= ch && ch <= '9') {
                // optional:
                if (j < 0)
                    throw ... // something nicer than ArrayIndexOOBE
                pknum[j] |= (byte)((ch - '0') << shift);
                shift = 4 - shift;
                if (shift == 0)
                    --j;
            }
            else {
                switch (ch) {
                case ',':
                case ... // the remaining punctuation
                    break;
                case '-':
                    pknum[15] |= 0x01;
                    break;
                default:
                    throw ...;
            }
        }
        return pknum;
    }

     The explicit digit-ness test may or may not save time;
if it's important you should code it both ways and measure.

     Validation: There's not much in the way of validation
in your original code, nor in my rewrite. I can't say how
careful you should be because I don't know the context:
how trustworthy are the inputs? But it seems to me you
might consider allowing only one + or one -, and insisting
that it be either at the end or at the beginning rather
than embedded: "123+-0.4" probably shouldn't be valid. If
you want to make such tests, I'd suggest checking the first
and last characters before starting the loop and adjusting
the loop's start and end accordingly; then any embedded
sign characters will show up as invalid. Be careful when
checking those boundary characters, though: if the input
string is "" there won't be any (you might or might not
want to consider that an error in its own right).

--
Eric Sosman
esosman@ieee-dot-org.invalid