Re: reading filenames from stdin - with umlauts?

From: strombrg@gmail.com
Newsgroups: comp.lang.java.programmer
Date: Sun, 14 Sep 2008 14:06:41 -0700 (PDT)
Message-ID: <5bbd1156-10c4-4965-a462-e38ed19cd385@p25g2000hsf.googlegroups.com>

I found some good help with this over on OpenJDK's i18n-dev mailing
list.

It turns out that in java (and perhaps other languages with
localization support), many locales do not guarantee a correct
round-trip conversion from 8-bit filenames to 16-bit strings and back
to 8-bit filenames. The result is phantom files: files that seem to be
there for one purpose but not for another. en_US.ISO-8859-1 is one of
the few locales that does make this guarantee, that is, no phantom
files. I'd been trying that locale among a handful of others, but it
wasn't working because I didn't have that locale configured on my
system.
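
To make the round-trip point concrete, here is a minimal sketch (not taken from the program itself; the class and method names are made up for illustration). It decodes a byte sequence to a String and encodes it again, then checks whether the original bytes come back. The sample bytes assume a filename stored on disk in ISO-8859-1, where the umlaut is the single byte 0xF6.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Decode the raw bytes with the given charset, encode them again,
    // and report whether we got the original bytes back.
    static boolean roundTrips(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs);
        return Arrays.equals(raw, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        // "Björk" as ISO-8859-1 bytes (assumption: the name is stored this way).
        byte[] name = { 'B', 'j', (byte) 0xF6, 'r', 'k' };

        // Every byte value is a valid ISO-8859-1 character, so nothing is lost.
        System.out.println("ISO-8859-1: " + roundTrips(name, StandardCharsets.ISO_8859_1)); // true

        // A lone 0xF6 is malformed UTF-8; decoding replaces it with U+FFFD,
        // so the original byte cannot be recovered.
        System.out.println("UTF-8:      " + roundTrips(name, StandardCharsets.UTF_8)); // false
    }
}
```

A name that fails this check under the charset in effect is a candidate for becoming a phantom file.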

The python, perl and java versions of the program are now at
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html

Thanks to all who took an interest in the project!

On Jul 27, 3:54 pm, Dan Stromberg <dstrombergli...@gmail.com> wrote:

I wrote a small java program to read filenames from stdin (produced by
Linux' "find"), and then to divide those files up into like groups.

Actually, it was originally a python program, but I've been wanting to
expand my horizons a little, so I rewrote it in perl, and now I'm trying
to redo it in java to celebrate java going opensource, and I'll likely
rewrite it in Haskell and/or Objective Caml after the java version.

The java version of the program seems to work pretty well, and I have a
feeling it's going to prove faster than the python or perl versions
(which are at
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
- and I hope to put the java version there too after it's working a
little better).

However, to my disappointment, the java version of the program can't seem
to deal with filenames that have umlauts in them. Filenames using only
characters in the English alphabet seem fine.

I suspect the problem is that the filename, as it appears in a Linux
ext3 filesystem, has an 8-bit-per-character representation, but java
wants to convert the string I read from stdin to a 16-bit-per-character
representation, and then doesn't reverse the conversion when I go to open
the file by its name.
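
The suspected mismatch can be reproduced in isolation. The sketch below is only an illustration with made-up names. It assumes the name arrives as ISO-8859-1 bytes and that the JVM encodes filenames with US-ASCII when it hands them to the OS, which is what tends to happen when the requested locale is not installed and the C locale is used instead. Neither assumption is confirmed for the system in question.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Decode as ISO-8859-1 (what the reader on stdin is told to do), then
    // encode with US-ASCII (assumed filename encoding of the JVM here).
    // The result is returned as text so it can be printed.
    static String nameAsSeenByOs(byte[] rawName) {
        String decoded = new String(rawName, StandardCharsets.ISO_8859_1);
        // getBytes() substitutes the charset's replacement byte ('?' for
        // US-ASCII) for every character it cannot represent.
        byte[] reencoded = decoded.getBytes(StandardCharsets.US_ASCII);
        return new String(reencoded, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] rawName = { 'B', 'j', (byte) 0xF6, 'r', 'k' };
        System.out.println(nameAsSeenByOs(rawName)); // prints "Bj?rk"
    }
}
```

The decode step on stdin is lossless; the damage in this sketch happens on the way back out, when the 16-bit string is encoded with a charset that has no umlauts.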

I've googled about this for around 4 hours now, and found little but
other people having similar issues - sometimes with files, sometimes with
files inside zip archives.

The error looks like:

find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US java -jar equivs.jar equivs.main
Encoding on isr is ISO8859_1
IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:10
        at Sortable_file.get_prefix(Sortable_file.java:63)
        at Sortable_file.compareTo(Sortable_file.java:266)
        at Sortable_file.compareTo(Sortable_file.java:1)
        at java.util.Arrays.mergeSort(Arrays.java:1144)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.sort(Arrays.java:1079)
        at equivs.main(equivs.java:54)

The code I'm reading filenames with looks like:

      InputStreamReader isr = null;
      try
         {
         isr = (new InputStreamReader(System.in, "ISO-8859-1"));
         }
      catch (UnsupportedEncodingException uee)
         {
         System.err.println("UnsupportedEncodingException: " + uee);
         uee.printStackTrace();
         java.lang.System.exit(1);
         }
      System.err.println("Encoding on isr is " + isr.getEncoding());
      BufferedReader stdin = new BufferedReader (isr);
      String line;

      try
         {
         while((line = stdin.readLine()) != null)
            {
            // System.out.println(line);
            // System.out.flush();
            lst.add(new Sortable_file(line));
            }
         }
      catch(java.io.IOException e)
         {
         System.err.println("IO error 0.5: " + e);
         e.printStackTrace();
         java.lang.System.exit(1);
         }

...and the code I'm opening the filenames with looks like:

      byte[] buffer = new byte[128];
      java.io.File this_file;
      try
         {
         this_file = new java.io.File(this.filename);
         java.io.FileInputStream file = new java.io.FileInputStream(this_file);
         file.read(buffer);
         // System.out.println("this.prefix.length " + this.prefix.length);
         file.close();
         }
      catch (java.io.IOException ioe)
         {
         System.out.println( "IO error 1: " + ioe );
         ioe.printStackTrace();
         java.lang.System.exit(1);
         }

(This is just one small part of the compareTo function. The goal was to
make things fast, and one of the optimizations is to compare just the
first 128 bytes of a file early in the comparison, and keep it cached in
memory to make the sort fast. Only if two files have the same prefix do
we do the expensive md5 hash, etc.)
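
The prefix-then-hash idea described above can be sketched roughly as follows. This is not the code from Sortable_file: the class and method names are invented, it tests for equality rather than implementing compareTo, and it does not cache the prefix the way the real program does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class PrefixThenHash {
    static final int PREFIX_LEN = 128;

    // Read up to PREFIX_LEN bytes from the start of the file.
    static byte[] prefix(Path p) throws IOException {
        byte[] buf = new byte[PREFIX_LEN];
        int total = 0;
        try (InputStream in = Files.newInputStream(p)) {
            int n;
            // read() may return fewer bytes than requested, so loop.
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        }
        return Arrays.copyOf(buf, total);
    }

    // MD5 of the whole file, streamed in chunks.
    static byte[] md5(Path p) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(p)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    // Cheap check first; hash only when the prefixes match.
    static boolean sameContent(Path a, Path b)
            throws IOException, NoSuchAlgorithmException {
        if (!Arrays.equals(prefix(a), prefix(b))) {
            return false;
        }
        return Arrays.equals(md5(a), md5(b));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java PrefixThenHash FILE1 FILE2");
            return;
        }
        System.out.println(sameContent(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

Most pairs of unrelated files already differ within the first 128 bytes, so the full read needed for the hash is skipped for them.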

Has anyone found a way to do:

find <options> -print | ./java-prog

...and have java-prog act on the files coming from stdin - including
opening them?
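
One way to narrow this kind of problem down is a small diagnostic that reads the names from stdin and reports which ones the JVM can actually find. The sketch below is hypothetical (StdinNames and readNames are made-up names). It also prints the charsets the JVM picked up from the environment; sun.jnu.encoding is an implementation-specific property, so it may be absent on some JVMs.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class StdinNames {
    // Read one filename per line, mapping each input byte to the char
    // with the same value (ISO-8859-1), so no input byte is dropped.
    static List<String> readNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = reader.readLine()) != null) {
            names.add(line);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // What the JVM derived from the locale environment variables.
        System.err.println("default charset:  " + Charset.defaultCharset());
        System.err.println("sun.jnu.encoding: " + System.getProperty("sun.jnu.encoding"));

        for (String name : readNames(System.in)) {
            String status = new File(name).exists() ? "ok      " : "MISSING ";
            System.out.println(status + name);
        }
    }
}
```

It would be run the same way as the real program, for example `find <options> -print | LANG=en_US.ISO-8859-1 java StdinNames`. If the locale named in LANG is not installed (`locale -a` lists the installed ones), the reported charset will typically not be ISO-8859-1 and names with umlauts are likely to show up as MISSING.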

Thanks!

PS: I suspect I could write a class to read bytes and piece together
strings, but 1) That'd probably be slow and 2) I want to use the
established java class hierarchy where possible and 3) the byte arrays
still might get upconverted to a different encoding upon converting them
to a string anyway. But if that's the only way, that's fine.
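
For what it's worth, the byte-level approach from the PS might look something like the sketch below (invented names, not part of the actual program). It splits stdin on newline bytes and keeps each name as a byte array. Since java.io.File only accepts a String, the bytes still have to be decoded at some point, so this does not avoid the locale issue; it only postpones the conversion.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class RawNameReader {
    // Split the input on '\n' bytes without decoding anything.
    static List<byte[]> readRawNames(InputStream in) throws IOException {
        List<byte[]> names = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                names.add(current.toByteArray());
                current.reset();
            } else {
                current.write(b);
            }
        }
        // Keep a final name that is not terminated by a newline.
        if (current.size() > 0) {
            names.add(current.toByteArray());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Buffering keeps the byte-at-a-time loop from being slow.
        List<byte[]> names = readRawNames(new BufferedInputStream(System.in));
        System.out.println(names.size() + " names read");
    }
}
```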