Re: Parsing CSV files

From:

Hector Santos <sant9442@nospam.gmail.com>

Newsgroups:

microsoft.public.vc.mfc

Date:

Thu, 21 Jan 2010 22:52:07 -0500

Message-ID:

<er7uFZxmKHA.1212@TK2MSFTNGP04.phx.gbl>

Stanza wrote:

What is the easiest way of reading a line at a time through a textual
CSV file, and then extracting the comma-separated elements from each line?

"Easiest" depends on what language and framework you are using and how
you hold, store, process the data in memory.

Assuming C language, the traditional implementation is to use
strtok(), is a C/C++ simple example:

// File: d:\wc5beta\testtok.cpp

// compile with: cl testtok.cpp

#include <stdio.h>
#include <afx.h>

int main(char argc, char *argv[])
{
    //
    // get file name from command line
    //

    char *pfn = (argc>1)?argv[1]:NULL;

    if (!pfn) {
        printf("- syntax: testeol csv_filename\n");
        return 1;
    }

    //
    // open text file for reading
    //

    FILE *fv = fopen(pfn,"rt");
    if (!fv) {
        printf("ERROR %d Opening file\n",GetLastError());
        return 1;
    }

    //
    // read each line using fgets() and parse
    // the "," and cr/lf (\r\n) token characters.
    //

    char *tok = ",\r\n";

    int nLine = 0;
    char szLine[1024];
    memset(&szLine,sizeof(szLine),0);
    while (fgets(szLine,sizeof(szLine)-1,fv)) {
        nLine++;
        printf("# %d | %s",nLine, szLine);

        //
        // parse the line by the tok characters
        //
        char *fld = strtok(szLine, tok);
        while(fld) {
            printf("- [%s]\n",fld);
            fld = strtok(NULL, tok);
        }
    }

    fclose(fv);
    return 0;
}

So for example testdata.csv file containing these lines:

hector santos,email1@whatever.com
stanza,email2@whatever2.com
Joe Newcomer,email3@whatever3.com

compiling and running testtok testdata.csv, you get:

# 1 | hector santos,email1@whatever.com
- [hector santos]
- [email1@whatever.com]
# 2 | stanza,email2@whatever2.com
- [stanza]
- [email2@whatever2.com]
# 3 | Joe Newcomer,email3@whatever3.com
- [Joe Newcomer]
- [email3@whatever3.com]

This is very simplistic and doesn't many design issues in regards to
parsing csv bases files.

The #1 design issue is the idea of "escaping" the token character you
are using to separate fields, in this case the comma (',') because it
is possible to have the comma with the field strings. That depends on
the type and data specifications. Maybe your program doesn't expect
them and maybe the creator the file will never ADD them and/or escapes
them. All this is implementation base.

For example, the data file can have a 3rd field that is a description
like field, OR the name field can have commas this, thus introduce the
idea that it can escaping is requiring. i.e, the data file can look
like this:

hector santos,email1@whatever.com,whatever,whatever,whatever
stanza,email2@whatever2.com,"whatever,whatever,whatever"
Joe Newcomer,email3@whatever3.com
Serlace, tom,email4@whatever4.com

So you can roll up sleeves and begin to use the above simple C/C++
code as a basis to fine tune the reading requirements for your CSV by
adding token escaping concepts, or you can use 3rd party libraries and
functions available to do these things, and your requirements will be
that these 3rd party libraries and function have the features of
escaping tokens.

Now, I purposely creates the testdata.csv above that would normally be
considered bad formatting and doesn't promote or help good csv
reading. A good practice it surround the fields with double quotes
and that MAY be enough for escaping embedded commas, for example,
the first line has a 3rd field:

        whatever,whatever,whatever

well, if you parsing only by comma, the field results in just
"whatever". So what is normally done is use lines like the 2nd line
where the 3rd field is quoted:

        "whatever,whatever,whatever"

The same issue with the 4th line with the first "expected" field has:

         Serlace, tom,

and this causes your fields to be shifted and off.

There are other concepts to deal with, namely, how you are reading
into memory storage, if needed or if your processing each line and
forgetting about it.

So writing a robust CSV reader that takes into account, such as:

   - escaping and embedded tokens
   - reading into memory

are common design requirements here. It really isn't that hard. I
would encourage to learn and gain the rewarding experiences to program
this yourself. It covers ideas that will be common ideas in a
programmers life. I will say, that sometimes it pays do to just a
byte stream parser instead of using strtok() checking each possible
token and delimiter, double quoted strings, etc. For example, instead
of the strtok block of lines, you can use something like:

        char *p = szLine;
        while (*p) {
           switch(*p) {
             case '\r':
               ... add logic for this ...
               break;
             case '\n':
               ... add logic for this ...
               break;
             case '\"':
               ... add logic for this ...
               break;
             case ',':
               ... add logic for this ...
               break;
           }
           p++;
        }

It can be simple to complex depending on the CSV reading requirements.

Anyway, if you just wish to get a solution, you can use one the many
3rd party libraries, classes, that will do these things for you.

If you using another language, the same ideas apply, but some
languages already have a good library, like .NET perhaps. It has an
excellent text I/O reader class in its collections library, See
OpenTextFieldParser(). It supports CSV reading and covers the two
important ideas above for escaping and storage.

--
HLS