Re: parsing a tab delimited or CSV, but keep the delimiter
On Mar 22, 1:04 pm, "Sideswipe" <christian.bongio...@gmail.com> wrote:
I know this question has been asked before, and believe me I checked
the newsgroup and web extensively before asking, but I think my needs
are slightly different.
I need to parse either a CSV or a tab-delimited file, BUT I need to
keep the delimiting token -- I am parsing these files as generated
from Excel, and the user expects them to process EXACTLY as they appear
in the spreadsheet.
I am cross-posting this in the Perl and Java groups because my
implementation is in Java, but Perl users use regexps far more
frequently.
Cross-posting is fine, but you should add a Followup-To header, which I
have (to comp.lang.java.programmer).
Here are the 3 different regular expressions I have found or created, but
none are correct. The only certainty I can get is to get rid of all
the delimiters. I have to maintain the delimiters because the
information I am accessing is column based (and thus fixed).
private final Pattern COLUMN_PATTERN = Pattern.compile("(\"[^\"]*\",,|[^,]+)"); // I think this is close
private final Pattern COLUMN_PATTERN = Pattern.compile("([^\",]*|\"([^\"]|\"\")+\")(,)");
private final Pattern COLUMN_PATTERN = Pattern.compile(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*(?![^\\\"]*\\\"))");
So, you have the cases of:
1) continuous string or with space -> single ',' (comma) separated
2) String has a comma in it, and is "" -> it is followed by a ",,"
double comma token. So the string in "" is a token and the double
comma is also a token
3) blank cells are just a single comma ,
That's my understanding of the cases. The logic should be IDENTICAL
for tab-delimited files; simply substitute the delimiter character.
I'm not sure that a regex is good enough to do everything...
Anyway, here are the cases that I can think of, ignoring the
delimiters.
Field value:
Field value: ,
Field value: "
Field value: a,b
Field value: "a and b"
Field value: 6"3
What are the encodings of these?
I'm guessing that
Field value:
Encoded value:
Field value: ,
Encoded value: ","
Field value: "
Encoded value: ""
Field value: a,b
Encoded value: "a,b"
Field value: "a and b"
Encoded value: ""a and b""
Field value: 6"3
Encoded value: 6""3
You can verify these cases in Excel.
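For comparison, here is a minimal sketch of the quoting rule described in RFC 4180: a field is wrapped in double quotes when it contains the delimiter, a double quote, or a line break, and any embedded double quote is doubled. The class name CsvEncodeSketch and the method encode are hypothetical, made up for this illustration. Note that this rule gives different encodings from the guesses above for fields that contain a quote but no comma, so what Excel really writes is still worth checking.

```java
final class CsvEncodeSketch {
    // Hypothetical helper (not from the original post): encodes one field
    // following the RFC 4180 convention. Quotes are added only when needed.
    static String encode(String field, char delim) {
        boolean needsQuotes = field.indexOf(delim) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        // Double every embedded quote, then wrap the whole field in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // The six field values listed above.
        String[] values = { "", ",", "\"", "a,b", "\"a and b\"", "6\"3" };
        for (String value : values) {
            System.out.println("Field value: " + value);
            System.out.println("Encoded value: " + encode(value, ','));
        }
    }
}
```

Under this rule 6"3 becomes "6""3" rather than 6""3, and a lone " becomes """". Whether Excel emits the guessed forms or these is what the check in Excel would settle.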
If those ARE the correct cases, then this would work:
import java.util.List;
import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class ParseCSV {
    // A quoted field: "..." in which any embedded quote is doubled ("").
    final static String quoted = "\"(?:[^\"]|\"\")+\"";

    public static List<String> parseCSV(String csv, String delim) {
        final Pattern NEXT_COLUMN = nextColumnRegex(delim);
        final List<String> strings = new ArrayList<String>();
        final Matcher matcher = NEXT_COLUMN.matcher(csv);
        // hitEnd() becomes true once a match has run into the end of the
        // input; without this check find() would match one more empty
        // field after the last column.
        while (!matcher.hitEnd() && matcher.find()) {
            String match = matcher.group(1);
            if (match.matches(quoted)) {
                // Strip the enclosing quotes.
                match = match.substring(1, match.length() - 1);
            }
            // Collapse each doubled quote into a single literal quote.
            match = match.replaceAll("\"\"", "\"");
            strings.add(match);
        }
        return strings;
    }

    // The delimiter is spliced into a character class, so it should be a
    // single character that is not a regex metacharacter, e.g. "," or "\t".
    private static Pattern nextColumnRegex(String comma) {
        String unquoted = "(?:[^\"" + comma + "]|\"\")*";
        String ending = "(?:" + comma + "|$)";
        return Pattern.compile('(' + quoted + '|' + unquoted + ')' + ending);
    }

    public static void main(String[] args) {
        String csv = ",\",\",\"\",\"a,b\",\"\"a and b\"\",6\"\"3";
        List<String> result = parseCSV(csv, ",");
        for (String col : result) {
            System.out.println("Field value:" + col);
        }
    }
}
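Given the doubt above about whether a regex can cover everything, a hand-written scanner is a useful cross-check. The sketch below is a different technique from the regex parser, not a rewrite of it: it walks the line one character at a time and tracks whether it is inside a quoted field. The names ScanCSV and scan are hypothetical, and it assumes RFC 4180 style input, where a quote at the very start of a field opens a quoted field. It will therefore not decode the guessed unquoted form ""a and b"" the way the regex version does.

```java
import java.util.ArrayList;
import java.util.List;

final class ScanCSV {
    // Hypothetical helper: splits one line into decoded fields.
    // Works for ',' or '\t' (or any other single-character delimiter).
    static List<String> scan(String line, char delim) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean atFieldStart = true;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);          // delimiters are literal here
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"');        // doubled quote -> one quote
                    i++;
                } else {
                    inQuotes = false;         // closing quote
                }
            } else if (c == delim) {
                fields.add(field.toString()); // end of the current field
                field.setLength(0);
                atFieldStart = true;
                continue;
            } else if (c == '"' && atFieldStart) {
                inQuotes = true;              // opening quote
            } else {
                field.append(c);
            }
            atFieldStart = false;
        }
        fields.add(field.toString());         // the last field has no delimiter
        return fields;
    }

    public static void main(String[] args) {
        String tsv = "a\t\"b\tc\"\t\t\"say \"\"hi\"\"\"";
        for (String col : scan(tsv, '\t')) {
            System.out.println("Field value:" + col);
        }
    }
}
```

Running both parsers over the same Excel export and comparing the resulting lists is a cheap way to find inputs where the regex and the scanner disagree.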