Strings and Things
3.13 Parsing Comma-Separated Data
Problem
You have a string or a file of lines containing comma-separated values (CSV) that you need to read. Many Windows-based spreadsheets and some databases use CSV to export data.
Solution
Use my CSV class or a regular expression (see Chapter 4).
Discussion
CSV is deceptive. It looks simple at first glance, but the values may be quoted or unquoted. If quoted, they may further contain escaped quotes. This far exceeds the capabilities of the StringTokenizer class (Recipe 3.2). Either considerable Java coding or the use of regular expressions is required. I’ll show both ways.
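To see why a plain tokenizer falls short, here is a minimal sketch (the class name NaiveSplitDemo and its naiveSplit( ) method are hypothetical, not part of the book's source) that splits on every comma without regard to quoting:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical demo: splitting on every comma, ignoring quotes. */
public class NaiveSplitDemo {

    /** The naive approach: treat every comma as a field separator. */
    static List<String> naiveSplit(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            fields.add(st.nextToken());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Two logical fields; the first is quoted and contains a comma.
        String line = "\"Darwin, Ian\",Java";
        List<String> fields = naiveSplit(line);
        System.out.println(fields.size() + " fields: " + fields);
    }
}
```

The tokenizer reports three fields rather than two, because it cuts the quoted name at its embedded comma and leaves the quote characters in place. StringTokenizer also skips empty tokens, so consecutive commas would silently lose a field.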
First, a Java program. Assume for now that we have a class called CSV that has a no-argument constructor and a method called parse( ) that takes a string representing one line of the input file. The parse( ) method returns a list of fields. For flexibility, the fields are returned as a List, from which you can obtain an Iterator (see Recipe 7.4). I simply use the Iterator’s hasNext( ) method to control the loop and its next( ) method to get the next object:
import java.util.*;
/* Simple demo of CSV parser class.
 */
public class CSVSimple {
    public static void main(String[] args) {
        CSV parser = new CSV( );
        List list = parser.parse(
            "\"LU\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625");
        Iterator it = list.iterator( );
        while (it.hasNext( )) {
            System.out.println(it.next( ));
        }
    }
}
Once the Java compiler has processed the backslash-escaped quotes, the string being parsed is actually the following:
"LU",86.25,"11/4/1998","2:19PM",+4.0625
Running CSVSimple yields the following output:
> java CSVSimple
LU
86.25
11/4/1998
2:19PM
+4.0625
>
But what about the CSV class itself? The code in Example 3-10 started as a translation of a CSV program written in C++ by Brian W. Kernighan and Rob Pike that appeared in their book The Practice of Programming (Addison Wesley). Their version commingled the input processing with the parsing; my CSV class does only the parsing, since the input could be coming from any of a variety of sources. It has also been substantially rewritten over time. The main work is done in parse( ), which delegates handling of individual fields to advQuoted( ) in cases where the field begins with a quote; otherwise, to advPlain( ).
Example 3-10. CSV.java
import java.util.*;

import com.darwinsys.util.Debug;

/** Parse comma-separated values (CSV), a common Windows file format.
 * Sample input: "LU",86.25,"11/4/1998","2:19PM",+4.0625
 * <p>
 * Inner logic adapted from a C++ original that was
 * Copyright (C) 1999 Lucent Technologies
 * Excerpted from 'The Practice of Programming'
 * by Brian W. Kernighan and Rob Pike.
 * <p>
 * Included by permission of the http://tpop.awl.com/ web site,
 * which says:
 * "You may use this code for any purpose, as long as you leave
 * the copyright notice and book citation attached." I have done so.
 * @author Brian W. Kernighan and Rob Pike (C++ original)
 * @author Ian F. Darwin (translation into Java and removal of I/O)
 * @author Ben Ballard (rewrote advQuoted to handle '""' and for readability)
 */
public class CSV {

    /** The separator char for this parser */
    protected char fieldSep;

    /** Construct a CSV parser that splits on commas. */
    public CSV( ) {
        fieldSep = ',';
    }

    /** parse: break the input String into fields
     * @return java.util.List containing each field
     * from the original as a String, in order.
     */
    public List parse(String line) {
        // This part of the listing is abridged here; the loop below is a
        // minimal reconstruction from the description in the text.
        List list = new ArrayList( );
        StringBuffer sb = new StringBuffer( );
        int i = 0;
        do {
            sb.setLength(0);
            if (i < line.length( ) && line.charAt(i) == '"')
                i = advQuoted(line, sb, ++i);    // skip opening quote
            else
                i = advPlain(line, sb, i);
            list.add(sb.toString( ));
            i++;                                 // step past the separator
        } while (i < line.length( ));
        return list;
    }

    /** advQuoted: quoted field; return index of next separator */
    protected int advQuoted(String s, StringBuffer sb, int i) {
        int j;
        int len = s.length( );
        for (j = i; j < len; j++) {
            if (s.charAt(j) == '"' && j+1 < len) {
                if (s.charAt(j+1) == '"') {
                    j++; // skip escape char
                } else if (s.charAt(j+1) == fieldSep) { // next delimiter
                    j++; // skip end quotes
                    break;
                }
            } else if (s.charAt(j) == '"' && j+1 == len) { // end quotes at end of line
                break; // done
            }
            sb.append(s.charAt(j)); // regular character.
        }
        return j;
    }

    /** advPlain: unquoted field; return index of next separator */
    protected int advPlain(String s, StringBuffer sb, int i) {
        int j;

        j = s.indexOf(fieldSep, i); // look for separator
        Debug.println("csv", "i = " + i + ", j = " + j);
        if (j == -1) { // none found
            sb.append(s.substring(i));
            return s.length( );
        } else {
            sb.append(s.substring(i, j));
            return j;
        }
    }
}

In the online source directory, you’ll find CSVFile.java, which reads a text file and runs it through parse( ). You’ll also find Kernighan and Pike’s original C++ program.

We haven’t discussed regular expressions yet (we will in Chapter 4). However, many readers are familiar with regexes in a general way, so the following example demonstrates the power of regexes, as well as providing code for you to reuse. Note that this program replaces all the code* in both CSV.java and CSVFile.java. The key to understanding regexes is that a little specification can match a lot of data.

* With the caveat that it doesn’t handle different delimiters; this could be added using GetOpt and constructing the pattern around the delimiter.
import java.io.*;
import java.util.*;
import java.util.regex.*;

/* Simple demo of CSV matching using Regular Expressions.
 * Does NOT use the "CSV" class defined in the Java CookBook, but uses
 * a regex pattern simplified from Chapter 7 of <em>Mastering Regular
 * Expressions</em> (p. 205, first edn.)
 * @version $Id: ch03,v 1.3 2004/05/04 18:03:14 ian Exp $
 */
public class CSVRE {
    /** The rather involved pattern used to match CSV's consists of three
     * alternations: the first matches a quoted field, the second unquoted,
     * the third a null field.
     */
    public static final String CSV_PATTERN = "\"([^\"]+?)\",?|([^,]+),?|,";
    private static Pattern csvRE = Pattern.compile(CSV_PATTERN);

    public static void main(String[] argv) throws IOException {
        System.out.println(CSV_PATTERN);
        new CSVRE().process(new BufferedReader(new InputStreamReader(System.in)));
    }

    public void process(BufferedReader in) throws IOException {
        String line;

        // For each line (this loop is abridged in the listing; what follows
        // is a minimal reconstruction that prints the fields of each line)
        while ((line = in.readLine()) != null) {
            System.out.println(parse(line));
        }
    }

    /** Parse one line into a List of fields (method header reconstructed). */
    public List parse(String line) {
        List list = new ArrayList();
        Matcher m = csvRE.matcher(line);
        // For each field
        while (m.find()) {
            System.out.println(m.groupCount());
            String match = m.group();
            if (match == null)
                break;
            if (match.endsWith(",")) { // trim trailing ,
                match = match.substring(0, match.length() - 1);
            }
            if (match.startsWith("\"")) { // assume also ends with
                match = match.substring(1, match.length() - 1);
            }
            if (match.length() == 0)
                match = null;
            list.add(match);
        }
        return list;
    }
}
It is sometimes “downright scary” how much mundane code you can eliminate with a single, well-formulated regular expression.
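The delimiter caveat noted for the regex version could be addressed by constructing the pattern around the delimiter. The following is only a sketch under stated assumptions: the class DelimitedRE and its methods patternFor( ) and parse( ) are hypothetical names, the delimiter is a single character other than the double quote, and it is passed as a parameter instead of being read from the command line with GetOpt. It also reads the fields from the capture groups, where CSVRE trims the matched text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: the three-alternative CSV pattern built around a delimiter. */
public class DelimitedRE {

    /** Build the pattern for one single-character delimiter (assumed not to be '"'). */
    static Pattern patternFor(char delim) {
        // In java.util.regex, a backslash before a non-alphanumeric character
        // always yields that literal character, inside or outside a class.
        String d = Character.isLetterOrDigit(delim)
                ? String.valueOf(delim)
                : "\\" + delim;
        // quoted field | unquoted field | empty field
        return Pattern.compile(
                "\"([^\"]+?)\"" + d + "?|([^" + d + "]+)" + d + "?|" + d);
    }

    /** Split one line into fields; an empty field is stored as null. */
    static List<String> parse(String line, char delim) {
        List<String> fields = new ArrayList<>();
        Matcher m = patternFor(delim).matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                fields.add(m.group(1));   // quoted field, quotes already excluded
            } else {
                fields.add(m.group(2));   // unquoted field, or null if empty
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Semicolon-separated input with one quoted and one empty field.
        System.out.println(parse("\"LU\";86.25;;x", ';'));  // prints [LU, 86.25, null, x]
    }
}
```

Compiling the pattern on every call keeps the sketch short; a reusable version would presumably compile it once per delimiter, as CSVRE does for the comma.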