Wednesday, March 17, 2010

COBOL File Processing in Python (really)

Years ago (6? 7?) I did some data profiling in Python.

This required reading COBOL files with Python code.

Superficially, this is not really very hard.
  1. Python slice syntax will pick fields out of the record. For example: data[12:14].
  2. Python codecs will convert from EBCDIC to Unicode without pain. codecs.decode( someField, 'cp037' ).
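
Here's a minimal sketch of those two steps together, assuming Python 3 and a made-up 14-byte record (a 12-character name followed by a 2-character state code). The record contents and the field positions are invented for illustration; they don't come from any real copybook.

```python
import codecs

# Hypothetical EBCDIC record: 12-byte name, then a 2-byte state code.
record = ("SMITH".ljust(12) + "NY").encode("cp037")

# Slice the raw bytes by offset, then decode each field from EBCDIC (code page 037).
name = codecs.decode(record[0:12], "cp037").rstrip()
state = codecs.decode(record[12:14], "cp037")

print(name, state)  # SMITH NY
```

Note that the offsets 0, 12 and 14 are hard-coded. That's fine for two fields and miserable for two hundred.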
With some more finesse, one can handle COMP-3 fields. Right?

Maybe not.

Problems

There are three serious problems.
  • Computing the field offsets (and in some cases sizes) is a large, error-prone pain.
  • The string slice notation makes the COBOL record structure completely opaque.
  • COMP-3 conversion is both ubiquitous and tricky.
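
To show why COMP-3 is tricky, here's a rough sketch of unpacking one packed-decimal field, assuming Python 3. The function name unpack_comp3 and the scale parameter are my own inventions for this example. Each byte holds two decimal digits, except the last byte, which holds one digit and a sign nibble. The number of decimal places isn't in the data at all; it comes from the V in the PICTURE clause.

```python
from decimal import Decimal

def unpack_comp3(raw, scale=0):
    """Decode packed-decimal bytes. scale is the implied number of decimal places."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)      # high nibble
        digits.append(byte & 0x0F)    # low nibble
    digits.append(raw[-1] >> 4)       # last byte: one digit...
    sign_nibble = raw[-1] & 0x0F      # ...and the sign
    if any(d > 9 for d in digits):
        raise ValueError("not a packed-decimal field: %r" % (raw,))
    # Assumption: 0xD (and the alternate 0xB) mean negative; 0xC and 0xF are non-negative.
    sign = 1 if sign_nibble in (0x0B, 0x0D) else 0
    return Decimal((sign, tuple(digits), -scale))

print(unpack_comp3(b"\x12\x34\x5C", scale=2))  # 123.45
print(unpack_comp3(b"\x00\x12\x3D"))           # -123
```

The conversion itself is short. The hard part is knowing, for every field, that it is COMP-3, how many bytes it occupies, and what the scale is.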
Okay, what's the solution?

COBOL DDE Parsing

What I did was write a simple parser that read the COBOL "copybook" -- the COBOL source that defined the file layout. Given this Data Definition Entry (DDE) it's easy to work out offset, size and type conversion requirements.
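
To give the flavor of the idea (this is not the actual parser), here's a small sketch that works out offsets and sizes from PICTURE and USAGE clauses. It assumes the fields have already been pulled out of the copybook into a list; the field names, the layout function and the three-field record are hypothetical. It ignores group levels, OCCURS, REDEFINES, binary COMP and separate signs, all of which a real parser has to handle.

```python
import re

def picture_length(picture):
    """Count the character or digit positions in a PICTURE such as 'S9(7)V99'."""
    # Expand repeat counts: '9(3)' becomes '999'.
    expanded = re.sub(r"(.)\((\d+)\)",
                      lambda m: m.group(1) * int(m.group(2)),
                      picture.upper())
    # S (sign) and V (implied decimal point) occupy no position of their own here.
    return sum(1 for ch in expanded if ch in "X9A")

def field_size(picture, usage="DISPLAY"):
    """Storage size in bytes for DISPLAY or COMP-3 usage."""
    n = picture_length(picture)
    if usage == "COMP-3":
        return n // 2 + 1   # two digits per byte, plus a nibble for the sign
    return n                # DISPLAY: one byte per position

def layout(fields):
    """Map each field name to its (offset, size), in declaration order."""
    offset, result = 0, {}
    for name, picture, usage in fields:
        size = field_size(picture, usage)
        result[name] = (offset, size)
        offset += size
    return result

fields = [
    ("CUST-NAME", "X(12)", "DISPLAY"),
    ("CUST-STATE", "XX", "DISPLAY"),
    ("CUST-BALANCE", "S9(7)V99", "COMP-3"),
]
offsets = layout(fields)
print(offsets["CUST-BALANCE"])  # (14, 5)
```

With a table like this, the slices come from the copybook instead of from hand arithmetic, and the code can refer to fields by their COBOL names.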

It was way cool, so I delivered the results -- but not the code -- to the customer. I posted parts of the code on my personal site.

Over the years, a few people have found it and asked pointed questions.

Recently, however, I got a patch kit because of a serious bug.

Unit Tests

The code was written in Python 2.2 style -- very primitive. I cleaned it up, added unit tests, and -- most importantly -- corrected a few serious bugs.

And, I posted the whole thing to SourceForge, so others can -- in principle -- fix the remaining bugs. The project is here: https://sourceforge.net/projects/cobol-dde/.