==================================================================
HowTo: Convert Large Comma Delimited Files to a Column Spaced File
==================================================================
.. include:: ..\version.h
.. include:: howto-reference.h


.. contents:: Table of Contents
   :backlinks: top


-------
Preface
-------

This document discusses how to put together a script that performs an efficient
traversal of a comma delimited (CSV) file to create a column spaced file.

For example, imagine that you wish to convert the following CSV line to a line
with 12 space columns,

  :: 

    '1, 2'

the columns would look like this,

  ::

    '1           2           '

Often data is presented in very, large--greater than two
Gigabytes--comma delimited files, and you want to convert the file to a
column spaced file. In this case we can not load the entire file into
the Computer's memory and then start processing the data. Instead we
need to read one line and write one line at a time to reduce the overall
memory consumption.

.. sidebar:: Efficient File Read/Write

   With Python you can trust that the file input/output routines will not
   operate only on one line at a time. Instead, Python will maintain a small
   buffer with N bytes of data within the Computer's memory.  Python will
   operate the file input/output routines with in the most efficient manner
   possible.

This document will discuss the steps to building a script that accomplishes the
task of converting the CSV file to a column spaced file. In addition, this
document will discuss how to read command window options, and how to efficiently
traverse a very large file in Python.


Assumptions
-----------

It is assumed that before you begin with this HowTo, that you are familiar with
the basic concepts of File Input/Output (I/O) and string formatting within
Python. These concepts are discussed within the *Python Tutorial* under the
"Input and Output" section [#python-tutorial-input-and-output]_.

In addition you should have a working understanding of Python's
Exception [#python-lib-builtin-exceptions]_ handling in order to be
familiar with how we handle the user's arguments to our script. "Errors
and Exceptions" [#python-tutorial-errors-and-exceptions]_ within *Python
Tutorial* provides some good documentation on the Python's error
handling process.

---------------
Getting Started
---------------

We are going to go through the steps required to write a script called
:file:`csv2col.py`, which will convert a comma delimited (CSV) file into a
column spaced file.

The steps to building our script will be as follows,

  1. `Getting the User Input`_
  
     We need to figure out what are user's desired input and output filenames.

  2. `Configuring our Format`_

     Here will set up the format string such that it can accept N columns. We
     will also configure it to be of an arbitrary width. This will allow us to
     handle any CSV file with any number of fields per line.

  3. `Processing the File`_

     This is the heart of the script. This is where we will read each CSV line
     from the input file and write out the appropriate column spaced line to
     the output file.

  4. `Speeding up the Script`_
   
     After we have gone over the basic mechanics required to write our script
     we will review it to determine if we can find a few places to make the
     script more efficient.


------------------------------
Building our Conversion Script
------------------------------

The following sections will go step by step through how to build our conversion script. At any time you can examine the complete script by reviewing `The Completed Script`_ section.

Now lets start to build our conversion script. 

Getting the User Input
----------------------

Before we can start processing the file we need to discover the name of the
input file and the name of the output file. Let us work with the notion that
the user will specify the script's input and output file via the command window 
with something like,

  ::
     
     c:\>cvs2col.py infile.csv outfile.col

This means the first program argument is the input file and the second program
argument is the output file.

.. sidebar:: Using Python's ``getopt`` 
   :subtitle: An alternative command window parser.

   Python's ``getopt`` [#python-lib-getopt]_ provides a very robust and useful
   built-in module for performing command window argument parsing. The module
   is far to much for our simple little script, but if you are considering
   adding options to the script you should consider using this very useful
   module rather than attempting to parse the command window options yourself. 

To get the command window arguments we will use Python's ``sys``
[#python-lib-sys]_ module. In this module there is a list defined as :d:`argv`
that contains all the arguments passed into the script. The first element in
this list is always the name of the program, and the remaining arguments in
this case will correspond to the user's input and output files.  So as a good
start lets do the following, 

 .. Python::
  
    import sys

    infile = sys.argv[1]   
    outfile = sys.argv[2] 
   
Now if the user did not specify either an input or output file, the above calls
would fail because the index may be out of range (meaning that the :d:`argv`
list only contains one argument). We can add a little error handling to our
script as follows, 

 .. Python::
  
    import sys

    try:
       infile = sys.argv[1]   
       outfile = sys.argv[2]  
    except IndexError, e:
       # either index 1 or 2 was not defined.
       print "error: invalid parameters specified"
       sys.exit(-1)

In the above code snippet we will generate an error if we are unable to get index 1 or index 2.


Configuring our Format
----------------------

Before starting this section you should have a working familiarity with string
formatting as presented in the "Input and Output" section of the *Python Tutorial* [#python-tutorial-input-and-output]_.

We want to create a format string that can be N columns wide, where N is
determined by the number of fields in a comma-delimited line. Furthermore, we
wish to set up the spacing between columns to of arbitrary size.  

First we will create a constant to help define the arbitrary column width,

 .. Python::

   COLUMN_WIDTH = 12

Next we will define a format string,

 .. Python::

   FORMAT_STRING = "%%-%ds" % (COLUMN_WIDTH)

This will effectively set our format string to be: "%-12s". 

The next step is to notice that if we were to take "%-12s" and multiple it by N
we would get N versions of "%-12s". For example, say N was equal to 10, then the
following snippet,

 .. Python::

   columns = '%-12s'*10

would result in columns being equal to
'%-12s%-12s%-12s%-12s%-12s%-12s%-12s%-12s%-12s%-12s'. This will be useful for
later, when we read a line from the comma-delimited file. We will be able to
read a line, determine the number of fields and dynamically construct a format
string required to generate a column spaced line.


Processing the File
-------------------

The next step is to read the contents of the input file and write out the
contents to an output file. The trickiest part about this operation is that we
are working with a very large file. We can not read everything into the
computer's memory, and expect to be able to process the file efficiently.

Fortunately, Python scripts are very efficient when it comes to file
traversal. Furthermore they provide a convenient method for processing a file in
a line by line fashion.

To begin our file processing we will open an input :c:`file` and an output
:c:`file` based on the user inputs that we gathered in `Getting the User
Input`_. The following will create our :c:`file` objects with the :a:`infile` and :a:`outfile` paths, 

 .. Python::

   f_in = open(infile, "rt")
   f_out = open(outfile, "wt")

The "rt" and "wt" in the above lines state that we are interested in reading
text and writing text respectively. 

Now we will traverse the input file line-by-line, and do the following,

  1. Split the line into a list of strings based on the "," separator.
  2. Strip the whitespace around each field.
  3. Determine the number of fields.
  4. Generate our format string based on the number of fields.
  5. Generate our output line based on the format string and the input
     fields.
  6. Write the new output line to the output file.

The above steps can be done as follows,

 .. Python::

   for line in f_in.readlines():
       fields = line.split(",")
       fields = map(string.strip, fields)
       numfields = len(fields)
       format_out = FORMAT_STRING*numfields + "\n"
       f_out.write(format_out % (tuple(fields)))

Now let us review the above loop in further detail.

The loop reads one line at a time from the input :c:`file` object called,
:a:`f_in`. For each of the lines it splits the input line into a list of fields
based on the "," separator. For example, the following Python code splits a
string based on the "," into three distinct fields,

   >>> import string
   >>> aline = "field1, field2, field3"
   >>> aline.split(",")
   ['field1', ' field2', ' field3']
   >>> map(string.strip, aline.split(","))
   ['field1', 'field2', 'field3']

We use the :f:`map` function to apply the :f:`string.strip` function to each
field in the line. The application of the :f:`string.strip` function to each
line removes any whitespace that may be left over from the :f:`split` operation.

The next step is to determine how many fields there are present within our new
list. We take that information and construct our format string. So to continue
with our Python code, we would have something like,

   >>> import string
   >>> aline = "field1, field2, field3"
   >>> fields = map(string.strip, aline.split(","))
   >>> "%-12s"*len(fields)
   '%-12s%-12s%-12s'

Now that we have our format set up, we simply need to apply our arguments to
the format string. To apply the arguments--the fields in our case--to the
format string, we convert the list to a tuple and use the standard Python
format operator (%). Again, continuing with our Python code, we would have, 

   >>> import string
   >>> aline = "field1, field2, field3"
   >>> fields = map(string.strip, aline.split(",")) 
   >>> format = "%-12s"*len(fields)
   >>> print format % (tuple(fields))
   field1      field2      field3

The above little python snippet is performed for each line of in our file
processing loop, taking an arbitrary line with N commas in it and generating the
appropriate column spaced line.


Speeding up the Script
----------------------

Consider that we have a comma delimited file with fields nearly equal for each
line. With the current implementation a file with the same field width for
nearly every line would result in us needlessly constructing the same format
string for every line that we process.

To speed things up, we will store some common format strings into a dictionary
ahead of time, this way we can quickly retrieve the format from the dictionary,
and if the dictionary lacks the format, we can add it while we process. Let us
create our format cache (:d:`FORMAT_CACHE`) based on the :d:`FORMAT_STRING` as
follows,

 .. Python::

   FORMAT_CACHE = {}
   for i in range(1, 65):
      FORMAT_CACHE[i] = FORMAT_STRING*i + "\n"

Now if we have a cache capable of handling  1 to 64 fields, and easily accessible via the number of fields (the key to the dictionary).

The next step is to re-write our file processing shown in `Processing the
File`_ to use the our new format cache.  We would re-write it as follows,

 .. Python::

   for line in f_in.readlines():
       fields = line.split(",")
       fields = map(string.strip, fields)
       numfields = len(fields)

       if numfields in FORMAT_CACHE:
          format_out = FORMAT_CACHE[numfields]
       else:
          format_out = FORMAT_STRING*numfields + "\n"
	  FORMAT_CACHE[numfields] = format_out

       f_out.write(format_out % (tuple(fields)))

The additional if statement checks to see if the format cache has the format
string already stored within it. If it does, then it uses that to create its
format out string. If the format cache does not contain the format string
within it, we create one, and store it in the cache for future use.

The addition of this cache provided approximately 25% speed up.


--------------------
The Completed Script
--------------------

The complete script, :file:`csv2col.py` in its entirety is as follows,

.. Python::

 """Converts csv file into a column spaced file."""

 import sys
 import os
 import string

 #
 # CONFIGURING OUR FORMAT
 #
 COLUMN_WIDTH = 12
 FORMAT_STRING = "%%-%ds" % (COLUMN_WIDTH)

 # 
 # SPEEDING UP OUR SCRIPT
 # 
 FORMAT_CACHE = {}
 for i in range(1, 65):
     FORMAT_CACHE[i] = FORMAT_STRING*i + "\n"


 ERROR_USAGE_MESSAGE = """\
 error: invalid arguments supplied.
 try: csv2col.py infile outfile"""

 ERROR_INVALID_FILE_MESSAGE = """\
 error: file does not exist: '%s'"""

 def csv2col(infile, outfile, 
	     format_str=FORMAT_STRING,
	     format_cache=FORMAT_CACHE):
     """Take a csv infile and generate a column spaced outfile."""
     # 
     # PROCESSING THE FILE
     #
     f_in = open(infile, "rt")
     f_out = open(outfile, "wt")
     for line in f_in.readlines():
         fields = line.split(",")
         fields = map(string.strip, fields)
         numfields = len(fields)

	 # 
	 # SPEEDING UP THE SCRIPT
	 #
         if numfields in format_cache:
            format_out = format_cache[numfields]
         else:
            format_out = format_string*numfields + "\n"
            format_cache[numfields] = format_out

         f_out.write(format_out % (tuple(fields)))
                    

 if __name__=="__main__":
     #
     # GETTING THE USER INPUT
     #
     try:
         infile = sys.argv[1]
         outfile = sys.argv[2]
     except IndexError, e:
         print ERROR_USAGE_MESSAGE
         sys.exit(-1)

     if not os.path.isfile(infile):
         print ERROR_INVALID_FILE_MESSAGE % (infile)
         sys.exit(-1)
     
     # 
     # RUNNING OUR ROUTINE
     #
     csv2col(infile, outfile)

Notice that the way we have written the script the :f:`csv2col()` is now a
function that can be called from another script. For example, say as part of
writing a routine you want to first convert some input file into a columnar
file, you could do something like,

.. Python::

   import csv2col
   csv2col.csv2col(inputfile, outputfile) 

and that will generate a file that meets the default format string
specifications.

However, you can still run the script from the command window using by invoking the python interpreter as follows,

 ::

  c:\>python csv2col.py input.csv output.col
  
----------
References
----------

.. [#python-tutorial-input-and-output]

    *Python Tutorial*. "Input and Output". 30 May 2003. 1 June 2004.
    <http://www.python.org/doc/2.2.3/tut/node9.html>

.. [#python-lib-builtin-exceptions]

  "Built-in Exceptions" *Python Library Reference*. 30 May 2003. 8 June 2004.
  <http://www.python.org/doc/2.2.3/lib/module-exceptions.html>


.. [#python-tutorial-errors-and-exceptions] 
    
   "Errors and Exceptions" *Python Tutorial*. 30 May 2003. 8 June 2004.
   <http://www.python.org/doc/2.2.3/tut/node10.html>
   
.. [#python-lib-getopt]
   
   "getopt -- Parser for command line options" *Python Library Reference*. 30
   May 2003. 8 June 2004.
   <http://www.python.org/doc/2.2.3/lib/module-getopt.html>

.. [#python-lib-sys]

   "sys -- System-specific parameters and functions" *Python Library
   Reference*. 30 May 2003. 7 June 2004.
   <http://www.python.org/doc/2.2.3/lib/module-sys.html>

.. [#python-lib-builtin] 

   "Built-in Functions". *Python Library Reference*.  30 May 2003. 7 June 2004.
   <http://www.python.org/doc/2.2.3/lib/built-in-funcs.html>