================================================================== HowTo: Convert Large Comma Delimited Files to a Column Spaced File ================================================================== .. include:: ..\version.h .. include:: howto-reference.h .. contents:: Table of Contents :backlinks: top ------- Preface ------- This document discusses how to put together a script that performs an efficient traversal of a comma delimited (CSV) file to create a column spaced file. For example, imagine that you wish to convert the following CSV line to a line with 12 space columns, :: '1, 2' the columns would look like this, :: '1 2 ' Often data is presented in very, large--greater than two Gigabytes--comma delimited files, and you want to convert the file to a column spaced file. In this case we can not load the entire file into the Computer's memory and then start processing the data. Instead we need to read one line and write one line at a time to reduce the overall memory consumption. .. sidebar:: Efficient File Read/Write With Python you can trust that the file input/output routines will not operate only on one line at a time. Instead, Python will maintain a small buffer with N bytes of data within the Computer's memory. Python will operate the file input/output routines with in the most efficient manner possible. This document will discuss the steps to building a script that accomplishes the task of converting the CSV file to a column spaced file. In addition, this document will discuss how to read command window options, and how to efficiently traverse a very large file in Python. Assumptions ----------- It is assumed that before you begin with this HowTo, that you are familiar with the basic concepts of File Input/Output (I/O) and string formatting within Python. These concepts are discussed within the *Python Tutorial* under the "Input and Output" section [#python-tutorial-input-and-output]_. In addition you should have a working understanding of Python's Exception [#python-lib-builtin-exceptions]_ handling in order to be familiar with how we handle the user's arguments to our script. "Errors and Exceptions" [#python-tutorial-errors-and-exceptions]_ within *Python Tutorial* provides some good documentation on the Python's error handling process. --------------- Getting Started --------------- We are going to go through the steps required to write a script called :file:`csv2col.py`, which will convert a comma delimited (CSV) file into a column spaced file. The steps to building our script will be as follows, 1. `Getting the User Input`_ We need to figure out what are user's desired input and output filenames. 2. `Configuring our Format`_ Here will set up the format string such that it can accept N columns. We will also configure it to be of an arbitrary width. This will allow us to handle any CSV file with any number of fields per line. 3. `Processing the File`_ This is the heart of the script. This is where we will read each CSV line from the input file and write out the appropriate column spaced line to the output file. 4. `Speeding up the Script`_ After we have gone over the basic mechanics required to write our script we will review it to determine if we can find a few places to make the script more efficient. ------------------------------ Building our Conversion Script ------------------------------ The following sections will go step by step through how to build our conversion script. At any time you can examine the complete script by reviewing `The Completed Script`_ section. Now lets start to build our conversion script. Getting the User Input ---------------------- Before we can start processing the file we need to discover the name of the input file and the name of the output file. Let us work with the notion that the user will specify the script's input and output file via the command window with something like, :: c:\>cvs2col.py infile.csv outfile.col This means the first program argument is the input file and the second program argument is the output file. .. sidebar:: Using Python's ``getopt`` :subtitle: An alternative command window parser. Python's ``getopt`` [#python-lib-getopt]_ provides a very robust and useful built-in module for performing command window argument parsing. The module is far to much for our simple little script, but if you are considering adding options to the script you should consider using this very useful module rather than attempting to parse the command window options yourself. To get the command window arguments we will use Python's ``sys`` [#python-lib-sys]_ module. In this module there is a list defined as :d:`argv` that contains all the arguments passed into the script. The first element in this list is always the name of the program, and the remaining arguments in this case will correspond to the user's input and output files. So as a good start lets do the following, .. Python:: import sys infile = sys.argv[1] outfile = sys.argv[2] Now if the user did not specify either an input or output file, the above calls would fail because the index may be out of range (meaning that the :d:`argv` list only contains one argument). We can add a little error handling to our script as follows, .. Python:: import sys try: infile = sys.argv[1] outfile = sys.argv[2] except IndexError, e: # either index 1 or 2 was not defined. print "error: invalid parameters specified" sys.exit(-1) In the above code snippet we will generate an error if we are unable to get index 1 or index 2. Configuring our Format ---------------------- Before starting this section you should have a working familiarity with string formatting as presented in the "Input and Output" section of the *Python Tutorial* [#python-tutorial-input-and-output]_. We want to create a format string that can be N columns wide, where N is determined by the number of fields in a comma-delimited line. Furthermore, we wish to set up the spacing between columns to of arbitrary size. First we will create a constant to help define the arbitrary column width, .. Python:: COLUMN_WIDTH = 12 Next we will define a format string, .. Python:: FORMAT_STRING = "%%-%ds" % (COLUMN_WIDTH) This will effectively set our format string to be: "%-12s". The next step is to notice that if we were to take "%-12s" and multiple it by N we would get N versions of "%-12s". For example, say N was equal to 10, then the following snippet, .. Python:: columns = '%-12s'*10 would result in columns being equal to '%-12s%-12s%-12s%-12s%-12s%-12s%-12s%-12s%-12s%-12s'. This will be useful for later, when we read a line from the comma-delimited file. We will be able to read a line, determine the number of fields and dynamically construct a format string required to generate a column spaced line. Processing the File ------------------- The next step is to read the contents of the input file and write out the contents to an output file. The trickiest part about this operation is that we are working with a very large file. We can not read everything into the computer's memory, and expect to be able to process the file efficiently. Fortunately, Python scripts are very efficient when it comes to file traversal. Furthermore they provide a convenient method for processing a file in a line by line fashion. To begin our file processing we will open an input :c:`file` and an output :c:`file` based on the user inputs that we gathered in `Getting the User Input`_. The following will create our :c:`file` objects with the :a:`infile` and :a:`outfile` paths, .. Python:: f_in = open(infile, "rt") f_out = open(outfile, "wt") The "rt" and "wt" in the above lines state that we are interested in reading text and writing text respectively. Now we will traverse the input file line-by-line, and do the following, 1. Split the line into a list of strings based on the "," separator. 2. Strip the whitespace around each field. 3. Determine the number of fields. 4. Generate our format string based on the number of fields. 5. Generate our output line based on the format string and the input fields. 6. Write the new output line to the output file. The above steps can be done as follows, .. Python:: for line in f_in.readlines(): fields = line.split(",") fields = map(string.strip, fields) numfields = len(fields) format_out = FORMAT_STRING*numfields + "\n" f_out.write(format_out % (tuple(fields))) Now let us review the above loop in further detail. The loop reads one line at a time from the input :c:`file` object called, :a:`f_in`. For each of the lines it splits the input line into a list of fields based on the "," separator. For example, the following Python code splits a string based on the "," into three distinct fields, >>> import string >>> aline = "field1, field2, field3" >>> aline.split(",") ['field1', ' field2', ' field3'] >>> map(string.strip, aline.split(",")) ['field1', 'field2', 'field3'] We use the :f:`map` function to apply the :f:`string.strip` function to each field in the line. The application of the :f:`string.strip` function to each line removes any whitespace that may be left over from the :f:`split` operation. The next step is to determine how many fields there are present within our new list. We take that information and construct our format string. So to continue with our Python code, we would have something like, >>> import string >>> aline = "field1, field2, field3" >>> fields = map(string.strip, aline.split(",")) >>> "%-12s"*len(fields) '%-12s%-12s%-12s' Now that we have our format set up, we simply need to apply our arguments to the format string. To apply the arguments--the fields in our case--to the format string, we convert the list to a tuple and use the standard Python format operator (%). Again, continuing with our Python code, we would have, >>> import string >>> aline = "field1, field2, field3" >>> fields = map(string.strip, aline.split(",")) >>> format = "%-12s"*len(fields) >>> print format % (tuple(fields)) field1 field2 field3 The above little python snippet is performed for each line of in our file processing loop, taking an arbitrary line with N commas in it and generating the appropriate column spaced line. Speeding up the Script ---------------------- Consider that we have a comma delimited file with fields nearly equal for each line. With the current implementation a file with the same field width for nearly every line would result in us needlessly constructing the same format string for every line that we process. To speed things up, we will store some common format strings into a dictionary ahead of time, this way we can quickly retrieve the format from the dictionary, and if the dictionary lacks the format, we can add it while we process. Let us create our format cache (:d:`FORMAT_CACHE`) based on the :d:`FORMAT_STRING` as follows, .. Python:: FORMAT_CACHE = {} for i in range(1, 65): FORMAT_CACHE[i] = FORMAT_STRING*i + "\n" Now if we have a cache capable of handling 1 to 64 fields, and easily accessible via the number of fields (the key to the dictionary). The next step is to re-write our file processing shown in `Processing the File`_ to use the our new format cache. We would re-write it as follows, .. Python:: for line in f_in.readlines(): fields = line.split(",") fields = map(string.strip, fields) numfields = len(fields) if numfields in FORMAT_CACHE: format_out = FORMAT_CACHE[numfields] else: format_out = FORMAT_STRING*numfields + "\n" FORMAT_CACHE[numfields] = format_out f_out.write(format_out % (tuple(fields))) The additional if statement checks to see if the format cache has the format string already stored within it. If it does, then it uses that to create its format out string. If the format cache does not contain the format string within it, we create one, and store it in the cache for future use. The addition of this cache provided approximately 25% speed up. -------------------- The Completed Script -------------------- The complete script, :file:`csv2col.py` in its entirety is as follows, .. Python:: """Converts csv file into a column spaced file.""" import sys import os import string # # CONFIGURING OUR FORMAT # COLUMN_WIDTH = 12 FORMAT_STRING = "%%-%ds" % (COLUMN_WIDTH) # # SPEEDING UP OUR SCRIPT # FORMAT_CACHE = {} for i in range(1, 65): FORMAT_CACHE[i] = FORMAT_STRING*i + "\n" ERROR_USAGE_MESSAGE = """\ error: invalid arguments supplied. try: csv2col.py infile outfile""" ERROR_INVALID_FILE_MESSAGE = """\ error: file does not exist: '%s'""" def csv2col(infile, outfile, format_str=FORMAT_STRING, format_cache=FORMAT_CACHE): """Take a csv infile and generate a column spaced outfile.""" # # PROCESSING THE FILE # f_in = open(infile, "rt") f_out = open(outfile, "wt") for line in f_in.readlines(): fields = line.split(",") fields = map(string.strip, fields) numfields = len(fields) # # SPEEDING UP THE SCRIPT # if numfields in format_cache: format_out = format_cache[numfields] else: format_out = format_string*numfields + "\n" format_cache[numfields] = format_out f_out.write(format_out % (tuple(fields))) if __name__=="__main__": # # GETTING THE USER INPUT # try: infile = sys.argv[1] outfile = sys.argv[2] except IndexError, e: print ERROR_USAGE_MESSAGE sys.exit(-1) if not os.path.isfile(infile): print ERROR_INVALID_FILE_MESSAGE % (infile) sys.exit(-1) # # RUNNING OUR ROUTINE # csv2col(infile, outfile) Notice that the way we have written the script the :f:`csv2col()` is now a function that can be called from another script. For example, say as part of writing a routine you want to first convert some input file into a columnar file, you could do something like, .. Python:: import csv2col csv2col.csv2col(inputfile, outputfile) and that will generate a file that meets the default format string specifications. However, you can still run the script from the command window using by invoking the python interpreter as follows, :: c:\>python csv2col.py input.csv output.col ---------- References ---------- .. [#python-tutorial-input-and-output] *Python Tutorial*. "Input and Output". 30 May 2003. 1 June 2004. .. [#python-lib-builtin-exceptions] "Built-in Exceptions" *Python Library Reference*. 30 May 2003. 8 June 2004. .. [#python-tutorial-errors-and-exceptions] "Errors and Exceptions" *Python Tutorial*. 30 May 2003. 8 June 2004. .. [#python-lib-getopt] "getopt -- Parser for command line options" *Python Library Reference*. 30 May 2003. 8 June 2004. .. [#python-lib-sys] "sys -- System-specific parameters and functions" *Python Library Reference*. 30 May 2003. 7 June 2004. .. [#python-lib-builtin] "Built-in Functions". *Python Library Reference*. 30 May 2003. 7 June 2004.