Thursday, May 29, 2014

Stingray 4.4 Update -- the Posix split command applied to COBOL files

Here's an interesting problem. Implement the split command for mainframe COBOL EBCDIC files with their BDW and RDW headers.

The conventional split can't handle COBOL EBCDIC files because they don't have sensible \n line breaks. Translating an EBCDIC file to ASCII is high-risk because COMP and COMP-3 fields will be trashed by the translation.
If the files include Occurs Depending On, then the FTP transfer should include the RDW/BDW headers. The SITE RDW (or LOCSITE RDW) are essential. It's much faster to include this overhead. Stingray can process files without the headers, but it's slower.
There are two essential Python techniques for building file splitters than involve parsing.
  • The itertools.groupby() function.
  • The with statement.
Along with this, we need an iterator over the underlying records.  For example, the stingray.cobol.RECFM subclasses will parse the various mainframe RECFM options and iterate over records or records+RDW headers or blocks (BDW headers plus records with RDW headers.

The itertools.groupby() function can break a record iterator into groups based on some group-by criteria. We can use this to break into sequential batches.

itertools.groupby( enumerate(reader), lambda x: x[0]//batch_size )

This expression will break the iterable, reader, into groups each of which has a size of batch_size records. The last group will have total%batch_size records.

The with statement allows us to make each individual group into a separate context. This assures that each file is properly opened and closed no matter what kinds of exceptions are raised.

Here's a typical script.

    import itertools
    import stringray.cobol
    import collections
    import pprint
    
    batch_size= 1000
    counts= collections.defaultdict(int)
    with open( "some_file.schema", "rb" ) as source:
        reader= stringray.cobol.RECFM_VB( source ).bdw_iter()
        batches= itertools.groupby(enumerate(reader), lambda x: x[0]//batch_size):
        for group, group_iter in batches:
            with open( "some_file_{0}.schema".format(group), "wb" ) as target:
            for id, row in group_iter:
                target.write( row )
                counts['rows'] += 1
                counts[str(group)] += 1
    pprint.pprint( dict(counts) )

There are several possible variations on the construction of the reader object.

  • cobol.RECFM_F( source ).record_iter() -- result is RECFM_F
  • cobol.RECFM_F( source ).rdw_iter() -- result is RECFM_V; RDW's have been added. 
  • cobol.RECFM_V( source ).rdw_iter() -- result is RECFM_V; RDW's have been preserved. 
  • cobol.RECFM_VB( source ).rdw_iter() -- result is RECFM_V; RDW's have been preserved; BDW's have been discarded. 
  • cobol.RECFM_VB( source ).bdw_iter() -- result is RECFM_VB; BDW's and RDW's have been preserved. The batch size is the number of blocks, not the number of records.
This should allow slicing up a massive mainframe file into pieces for parallel processing.

Thursday, May 22, 2014

Python Package Design, Refactoring and the Stingray Reader Project

We'll be digging into Mastering Object-Oriented Python. Chapter 17, specifically.

We'll also be looking at a big refactoring of the Stingray Schema-Based File Reader.

We can identify three species of packages.

One common design is a Simple Package. A directory with an empty __init__.py file. This package name becomes a qualifier for the internal module names. The package is simply a namespace for modules. We’ll use the package with something like this:

import package.module


Another common design is the Module-Package. This is a package which appears to be a module.  It will have a larger and more sophisticated __init__.py that is a effectively, a  module definition. There are two variations on this theme. Sometimes we'll use this during the early stages of development because we don't know if the package will get really big or stay small. If we start out with a package and all the code is in the __init__.py, we can refactor down to a module.

The more common use for a module-package is to have the __init__.py import objects or other modules from the package directory. Or, it can stand as a part of a larger design that includes the top-level module and the qualified sub-modules. We’ll use the package with something like this:

import package

or perhaps

from package import thing


The third common pattern is a package where the __init__.py selects among alternative implementations. The os module is a good example of this. We’ll use the package with something like this:


import package


Knowing that it did something roughly like the following for us.

import package.implementation as package


Refactoring Module to Package

The Stingray angle on this is the need to add iWork '13 numbers to the collection of spreadsheets which it can parse. The iWork '13 format is unique.

Previously, all of the spreadsheets fell into three families:

iWork '13 uses Snappy compress and Protobuf Serialization. Without some documentation, the files would be incomprehensible.  Read this: See https://github.com/obriensp/iWorkFileFormat. Brilliant. 

The previous releases of Stingray had a single, large module to handle a variety of workbook formats. Folding iWork '13 into this module would have been lunacy. It was already large to the point of being painful to understand.

The original module will be transparently turned into Module-Package. The API (import stingray.workbook or from stingray.workbook import SomeClass) will remain the same.

However.

The implementation will involve a package with each workbook format as a separate module inside that package. At the top, the __init__.py will include code like the following.

    from stingray.workbook.csv import CSV_Workbook
    from stingray.workbook.xls import XLS_Workbook
    from stingray.workbook.xlsx import XLSX_Workbook
    from stingray.workbook.ods import ODS_Workbook
    from stingray.workbook.numbers_09 import Numbers09_Workbook
    from stingray.workbook.numbers_13 import Numbers13_Workbook
    from stingray.workbook.fixed import Fixed_Workbook

This has the advantage of allowing us to include additional parsing cruft in each module that's not part of the exposed API in the workbook package.

The Mastering Object-Oriented Python book has more details on this kind of design.

Thursday, May 15, 2014

Want a copy of Mastering Object-Oriented Python? Free?

Want a copy free? See this contest: http://www.blog.pythonlibrary.org/2014/05/12/ebook-contest-win-a-free-copy-of-mastering-object-oriented-python/.

If you're really interested, I can sign a copy.  That will double the shipping cost, so perhaps that's not the best idea.

The bad news is that the errata have started to trickle in. Some of the default serialization cases in chapter 9 aren't handled properly. The demonstrations don't exercise the defaults very well, so things happen to work for me, but don't work when generalized or pulled out of context.

Sigh.

What's important -- I guess -- is that we have a critical mass of readers who are applying the concepts and finding problems.