Moved. All new content goes to the new site. This is a legacy blog, and will likely be dropped five years after the last post, in Jan 2023.

Tuesday, December 17, 2013

Apple's Feckless Download Protocol

For those of us who live mostly off the grid, with inconsistent WiFi through hosts that flake out frequently, Apple's AppStore download protocol is absolutely infuriating. It's the most brittle damn protocol I've ever seen.

When there's any disruption, it simply discards the data it has and starts again.

How stupid. How blatantly and infuriatingly stupid.

If I pause a download, it will resume. If it breaks, it will not resume. WTF?

For some things, I can use BitTorrent, which tolerates noisy links. But for proper AppStore Apps, their protocol is the pits.

Anyone know anyone at Apple who's able to work on a solution to this?

Thursday, December 12, 2013

Secure Salted Password Hashing

An excellent exposition of secure salted password hashing.

This was really quite nice. It didn't have a Python version, but the clarity of the exposition makes the Python easy to write.

A few months back, I had this mystery conversation:

While this is not going to produce identical results to the code shown in the blog post, it seems to fit the requirements.

from hashlib import sha256
from hmac import compare_digest
import os

class Authentication:
    iterations= 1000
    def __init__( self, username, password ):
        """Works with bytes. Not Unicode strings."""
        self.username= username
        self.salt= os.urandom(24)
        self.hash= self._iter_hash( self.iterations, self.salt, username, password )
    @staticmethod
    def _iter_hash( iterations, salt, username, password ):
        seed= salt+b":"+username+b":"+password
        for i in range(iterations):
            seed= sha256( seed ).digest()
        return seed
    def __eq__( self, other ):
        return self.username == other.username and self.hash == other.hash
    def __hash__( self ):
        return hash(self.hash)
    def __repr__( self ):
        salt_x= "".join( "{0:02x}".format(b) for b in self.salt )
        hash_x= "".join( "{0:02x}".format(b) for b in self.hash )
        return "{username} {iterations:d}:{salt}:{hash}".format(
            username=self.username, iterations=self.iterations,
            salt=salt_x, hash=hash_x)
    def match( self, password ):
        test= self._iter_hash( self.iterations, self.salt, self.username, password )
        return compare_digest( self.hash, test ) # Constant Time is Best

It may be helpful to use __slots__ with this to reduce the storage and make the object less mutable.
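As a quick sanity check, the iterated-hash core can be exercised on its own (the standalone function and sample credentials here are illustrative, not part of the class above):

```python
from hashlib import sha256
import os

def iter_hash(iterations, salt, username, password):
    # same iterated-SHA256 algorithm as Authentication._iter_hash
    seed = salt + b":" + username + b":" + password
    for _ in range(iterations):
        seed = sha256(seed).digest()
    return seed

salt = os.urandom(24)
stored = iter_hash(1000, salt, b"user", b"secret")
good = iter_hash(1000, salt, b"user", b"secret")
bad = iter_hash(1000, salt, b"user", b"wrong")
```

The same salt with the same password reproduces the stored hash; any other password does not.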

Perhaps I didn't google well enough to find a clear explanation that also included Python code samples.

Tuesday, December 3, 2013

Python vs. R for Data Science

This: Python Displacing R As The Programming Language For Data Science.

Recently, I've had a former colleague asking questions about Data Science. See Obstinate Idiocy.

They -- weirdly -- insisted that the only language that made sense to them was Excel.

My response was a blunt "What?"

The Python vs. R post cited above clarifies the reasons why a programming language is a better choice than a "tool" or "platform".

Tuesday, November 26, 2013

Mac OS X 10.9 and Python 3.3

Finally upgraded to Python 3.3.3. The ordinary maintenance release fixed the problem with the previous release.

The previously documented patch is no longer required.

Time to start incrementally installing all the various add-on components: docutils, PyYaml, Django, Jinja2, SQLAlchemy, etc.

Also, time to put more focus into rewriting various projects to finally cut the cord with Python2. At this point, there's no longer a reason to be looking backwards.

Tuesday, October 29, 2013

When to choose Python over Java and vice versa?: A Very Silly Question

The correct answer is: It Doesn't Matter.

In spite of this.

(A) The question gets asked.

And worse.

(B) It gets answered. And people take their answers seriously. As if there are Profound Differences among programming languages.

Among Turing Complete programming languages there are few Profound Differences.

The pragmatic differences are the relative ease (or pain) of expressing specific algorithms or data structures.

This means that there's no easy, blanket, one-size-fits-all answer to such a silly question.

You can have some code (or data) which is painful in Java and less painful in Python.


You can also find an extension library that makes it much, much less painful.

This, alone, makes the question largely moot.

When it comes to a specific project, the question of the team's skills, the existing infrastructure, and any integration requirements are the driving considerations.

Because of this, an incumbent language has huge advantages.

If you've already got a dozen web sites in Java, there's no good reason to flip-flop between Java and Python.

If you're going to switch from some Java Framework to Django, however, you'd do this as part of a strategic commitment to drop Java and convert to Python.

To read the discussion, see LinkedIn Python Community.

Friday, October 25, 2013

Mac OS X 10.9 "Mavericks" Crashes Python -- Patch Available

I upgraded to Mac OS X 10.9 "Mavericks" today.

Python 3.2.4 crashed in odd places. Not all the time. But help() stopped working, to give one example.

I upgraded to Python 3.3 (it's about time to do that, anyway.)

It crashed, also. Similar places.

Here's the workaround:

There's a script which you can download and run. This will tweak your installations until the next maintenance release of Python.

Whew. That was close.

Now I need to square away setuptools and use "easy_install-3.3" to install all the missing pieces of my 3.2 environment.

Thursday, October 24, 2013

Required Reading for everyone who said "If it ain't broke, don't fix it."

Here's an important lesson.

Code Rot is Real.  It Leads to Subtle and Expensive Bugs.

Claiming that code cleanup is just pointless "gold plating" is the kind of thing that can drive a company out of business.

Tuesday, October 15, 2013

Literate Programming: PyLit3

I've revised PyLit to work with Python3.


The code seems to pass all the unit tests.

The changes include Python3 revisions, plus a small change to handle trailing spaces in a slightly cleaner fashion. This was necessary because I have most of my editors set to remove trailing spaces from the files I create, and PyLit tended to create trailing spaces. This made the expected output from the unit tests not precisely match the actual output.

Thursday, October 3, 2013

Literate Programming and PyLit

Even though I wrote a literate programming tool (PyWeb) I slowly came to realize that it's not very good.

Mostly, I followed the Web/Weave world view and cribbed their markup syntax. It's not bad, but, the PyWeb markup is based on some presumptions about literate programming that were, perhaps, true with some languages, but are not true at all when working with Python.
  1. The source presentation order may be incomprehensible. To fix this, we create a literate programming document, and from that tangle the source into an order that's acceptable to the compiler, but perhaps hard to understand for people. We weave a document that's easy for people to understand.
  2. The source syntax may be incomprehensible. To fix this, we have fine grained substitution. The target source can be built at any level of syntax (token, line, or higher-level language construct.) We can assure that the woven document for people is written using elegant symbols even if the tangled source code uses technical gibberish.
  3. The woven documentation needs a lot of additional output markup. The original web/weave toolset created extensive TeX markup. Later tools reduced the markup to allow HTML or XML, minimizing the added markup in a woven document.
In Python, there's very little "boilerplate" or overhead in a module file. Also, because of very late binding, the presentation order of the source can better match reader expectations. For definitions, inter-class references mandate an order for the class statements in an inheritance hierarchy, but almost everything else is remarkably flexible.

Python syntax doesn't benefit from fine-grained web/weave techniques. It's pretty clear as written in its normal form.

Finally, the presence of the RST markup language means that a whole new meta-markup for literate programming isn't necessary.

PyLit demonstrates that an additional markup language is not helpful. RST is sufficient. PyLit is an elegant parser of RST and Python. It can reshape RST into Python as well as reshape Python into RST. Do your literate programming in either language and produce the other easily.

Enter Python 3

The problem with PyLit is that it's oriented to Python 2.4 through 2.7. How can we use PyLit for Python 3?
  • Use a compatibility library to make a single version that covers both Python2 and Python3.
  • Rewrite PyLit it for Python3 and move forward.
My preference is to move forward. The backward compatibility is helpful when there's a vast user base, lots of ongoing development, and the very real possibility of bug fixes that apply to Python2 as well as Python3.

PyLit has a small user base, no real development to speak of, and a very remote possibility of backward compatible bug fixes.

The rewrites are small. Here's the summary.
  • Remove print statement and exec statements.
  • Replace string formatting % with .format().
  • Replace raise statements and except statements with Python3 (and Python2.7) syntax.
  • Upgrade for dict method changes in Python3.
  • Replace DefaultDict with collections.defaultdict.
  • Replace optparse with argparse.
I've done this in my Python3.2 installation.
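A few of those conversions, side by side, in Python3 form (illustrative snippets, not PyLit code):

```python
from collections import defaultdict

# print is a function; % string formatting becomes .format()
msg = "converted {0} files".format(3)

# "except ... as ..." syntax (Python3 and Python2.7)
try:
    raise ValueError("demo")
except ValueError as exc:
    caught = str(exc)

# dict methods return views in Python3; wrap in list() when a list is needed
keys = list({"a": 1}.keys())

# collections.defaultdict replaces a home-grown DefaultDict
counts = defaultdict(int)
counts["x"] += 1
```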

This doesn't address the Sphinx documentation, however, which should probably be tweaked to be the latest and greatest Sphinx version, also. Not much will change, there, however, since the RST remains compatible.

Also, it doesn't address the files with names that differ only in case. There are two graphics files in the /trunk/rstdocs/logo/ path that differ only in case of letters. Bad, but acceptable for Linux. Fatal for Mac OS X with the default filesystem.

The question is, what's the polite way to proceed? 
  • Fork the PyLit 0.7.5 to create PyLit3 release 1.0? A whole, new project.
  • Try to use a compatibility library to create a 2-3 compatible source file and call this PyLit 0.8?
Adding a compatibility layer to a package that was a single module file seems like a bit of overkill. One of the elegant features of PyLit was that it was so simple, it didn't even have a package structure. However, there may be a community of staunchly Python2 literate programming advocates.

Tuesday, September 24, 2013

Introduction to Programming: iBook Edition for Python 3.2

That was challenging.

I rewrote almost all of my Introduction to Programming book into an iBook. Trimmed it down. Refocused it. Changed from Python 2.6 to 3.2. A complete refactoring from which almost nothing of the original book survives except the goals.

Look for it October 1st in the iTunes bookstore. Programming for Absolute Beginners: Building Skills in Programming.

[My intent is to have several Building Skills titles. We'll see how far I get.]

The rewrite involved three substantial changes.

  1. I removed all of the duplicative reference material. The Python library reference is (now) outstandingly good. When I started using Python over ten years ago, it was not very good, and I started writing a Python reference of my own merely to organize the documentation. The books grew from there; the reference aspect is now useless.
  2. I dropped almost all Python 2 references. There's a little bit of Python 2 history, but that's it. It's time to move forward, now that most of the major packages seem to have made the switch. 
  3. I changed the focus from processing to data.

Processing vs. Data

When looking at a multi-faceted language like Python, it's difficult to know what's the most gentle introduction to software development.

Historically, the procedural, imperative style of programming appears the most appealing. The roots of Python come from procedural programming. It reaches back to Pascal (and even Algol 60) by elegantly restating the core principles of those languages with an easier-to-read syntax.

Indeed, if you read classic foundational CACM articles where essential algorithms were first formally described, they used a neatly typeset variant on Algol that (for the early years of my career) was the gold standard in how code should look. Python follows this tradition nicely.


That doesn't mean that procedural programming is really the absolutely best way to introduce the language.

Data First

I think that it may be possible to introduce the language with a focus on data objects first and the procedural/imperative statements as a secondary consideration.

When it comes to anything beyond trivial Rate-Time-Distance calculations, the data structure matters more than almost any other aspect of the software. The objects, their relationships, their operations and their attributes are core to the problem. The presentation, user actions and persistent representation are secondary considerations after the structure of the data.

It seems like the data structures should "drive" the presentation. The outline of the book should be introductions of each of the important and visible builtin data structures. Additionally, the library extensions that are most often used can be introduced, also.

Definitional features (def, return, yield, class, and the ideas of module and package) are central, but a step behind the builtin data structures.

Procedural features (if, for, while, break, continue, with, etc.) are clearly second-class; they exist only to support access to the data structures. A for statement makes a "for all" assertion about a data structure. A for with a break (or a while) makes a "there exists" assertion about a data structure. The data is central. The imperative statements are secondary.
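That distinction can be seen in a tiny sketch: the plain for visits every element of the data, while for with break stops at the first witness.

```python
data = [2, 4, 6, 7]

# "for all" assertion: every element is checked
all_even = True
for x in data:
    if x % 2 != 0:
        all_even = False

# "there exists" assertion: stop at the first witness
found_odd = False
for x in data:
    if x % 2 != 0:
        found_odd = True
        break
```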

Other features (global, nonlocal, del, raise, try, etc.) are tertiary, and exist to create more elegant programs that don't annoy the other developers as much. 

This also means that generator expressions and comprehensions are first-class, front-and-center features of the language. This parallels the approach in the NLTK Book, which puts the focus on generator expressions as a way to clearly state the processing.

Other Forms

Currently, I only have the iBook available.

The iBook Author application can (and does) produce a PDF. I think I may offer that separately.

Tuesday, September 17, 2013

iWeb File Extract and XML Iterators

Once upon a time, Apple offered iBlog. Then they switched to iWeb. Then they abandoned that market entirely.

That leaves some of us with content in iBlog as well as iWeb. Content we'd like to work with without doing extensive cutting and pasting. Or downloading from a web server. After all, the files are on our computer.

The iWeb files are essentially XML, making them relatively easy to work with. We can reduce the huge, and hugely complex iWeb XML to a simple iterator and use a simple for statement to extract the content.

[Historical note. I wrote a Python script to convert iBlog to RST. It worked reasonably well, all things considered. This is not the first time I've tried to preserve old content from obsolete tools. Sigh.]

Some tools (like SandVox) have an "extract iWeb content" mode. But that's not what we want. We don't want to convert from iWeb to another blog. We want to convert from iWeb to CSV or some other more useful format so we can do some interesting processing, not simple web presentation.

This is a note on how to read iWeb files to get at the content. And further, how to get at XML content in the form of a simple iterator.

Opening The Package

Here's how to overview the package.

    path_full= os.path.expanduser(path+".sites2")
    for filename in os.listdir(path_full):
        name, ext = os.path.splitext( filename )
        if ext.lower() in ( ".jpg", ".png", ".mov", ".m4v", ".tiff", ".gif", ".m4a", ".mpg", ".pdf" ): continue
        print( filename )

This will reveal the files; we only really care about the "index.xml.gz" file since that has the bulk of the content.

    with closing( gzip.GzipFile( os.path.join(path_full,"index.xml.gz") ) ) as index:
        index_doc= xml.parse( index )
        index_root= index_doc.getroot()

This gets us the XML version of the blog.
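For anyone trying this without an iWeb file handy, the gzip-plus-parse step can be sketched against an in-memory file (the XML content here is a made-up stand-in for index.xml.gz):

```python
import gzip
import io
import xml.etree.ElementTree as xml

# build a tiny gzipped XML document in memory, standing in for index.xml.gz
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(b'<site><page name="one"/></site>')
buf.seek(0)

index_doc = xml.parse(gzip.GzipFile(fileobj=buf))
index_root = index_doc.getroot()
```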

Finding the Pages

We can use the following to thread through the XML. We're looking for a particular "Domain", a "Site" and a particular blog page within that site. The rest of the blog is mostly text. This portion of the blog is more structured.

For some reason, the domain is "Untitled". The site is "Cruising", and the blog page is "Travel 2012-2013". We insert these target names into XPath search strings to locate the relevant content.
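The {namespace}tag convention in those search strings is ElementTree's Clark notation. A minimal sketch, using a hypothetical URI (not the real iWeb namespace):

```python
import xml.etree.ElementTree as xml

NS = "{urn:example:iweb}"  # hypothetical namespace URI, for illustration only
doc = xml.fromstring(
    '<d:domain xmlns:d="urn:example:iweb" name="Untitled">'
    '<d:site name="Cruising"/>'
    '</d:domain>'
)
site = doc.find('.//' + NS + 'site')
```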

search= '{{}}domain[@{{}}name="{0}"]'.format(domain_name)
domain= index_root.find( search )
mdu_uuid_tag= domain.find('{}metadata/{}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{}string').get('{}string')
domain_filename= "domain-{0}".format( mdu_uuid_value )

search= './/{{}}site[@{{}}name="{0}"]'.format(site_name)
cruising= domain.find(search)
mdu_uuid_tag= cruising.find('{}metadata/{}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{}string').get('{}string')
site_filename= "site-{0}".format(mdu_uuid_value)

search= '{{}}site-blog[@{{}}name="{0}"]'.format(site_blog_name)
site_nodes= cruising.find('{}site-nodes')
travel= site_nodes.find(search)
mdu_uuid_tag= travel.find('{}metadata/{}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{}string').get('{}string')
site_blog_filename= "site-blog-{0}".format(mdu_uuid_value)

This will allow us to iterate through the blog entries, called "pages". Each page, it turns out, is stored in a separate XML file with the page details and styles. A lot of styles. We have to assemble the path from the base path, the domain, site,  site-blog and site-page names. We'll find an ".xml.gz" file that has the individual blog post.

    for site_page in travel.findall('{}series/{}site-page'):
        mdu_uuid_tag= site_page.find('{}metadata/{}MDUUID')
        mdu_uuid_value= mdu_uuid_tag.find('{}string').get('{}string')
        site_page_filename= "site-page-{0}".format(mdu_uuid_value)

        blog_path= os.path.join(path_full, domain_filename, site_filename, site_blog_filename, site_page_filename )
        with closing( gzip.GzipFile( os.path.join(blog_path,site_page_filename+".xml.gz") ) ) as child:
            child_doc= xml.parse( child )
            child_root= child_doc.getroot()
        main_layer= child_root.find( '{}site-page/{}drawables/{}main-layer' )

Once we have access to the page XML document, we can extract the content. At this point, we could define a function which simply yielded the individual site_page tags.

Summary Iterable

The most useful form for the pages is an iterable that yields the date, title and content text. In this case, we're not going to preserve the internal markup, we're just going to extract the text in bulk.

        content_map = {}
        for ds in main_layer.findall( '{}drawable-shape' ):
            style_name= ds.get('{}name')
            if style_name is None:
                #xml.dump( ds ) # Never has any content.
                continue
            for tb in ds.findall('{}text/{}text-storage/{}text-body' ):
                # Simply extract text. Markup is lost.
                content_map[style_name] = tb.itertext()
        yield content_map

This works because the text has no useful semantic markup. It's essentially HTML formatting full of span and div tags.

Note that this could be a separate generator function, or it could be merged into the loop that finds the site-page tags. It's unlikely we'd ever have another source of site-page tags. But, it's very likely that we'd have another function for extracting the text, date and title from a site-page tag. Therefore, we should package this as a separate generator function.  We didn't, however. It's just a big old function named postings_iter().

There are three relevant style names. We're not sure why these are used, but they're completely consistent indicators of the content.
  • "generic-datefield-attributes (from archive)"
  • "generic-title-attributes (from archive)"
  • "generic-body-attributes (from archive)"
These become keys of the content_map mapping. The values are iterators over the text.

Processing The Text

Here's an iterator that makes use of the postings_iter() function shown above.

def flatten_posting_iter( postings=postings_iter(path="~/Documents/iWeb/Domain") ):
    """Minor cleanup to the postings to parse the date and flatten out the title."""
    for content_map in postings:
        date_text= " ".join( content_map['generic-datefield-attributes (from archive)'] )
        date= datetime.datetime.strptime( date_text, "%A, %B %d, %Y" ).date()
        title= " ".join( content_map['generic-title-attributes (from archive)'] )
        body= content_map['generic-body-attributes (from archive)']
        yield date, title, body

This will parse the dates, compress the title to remove internal markup, but otherwise leave the content untouched. 
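The strptime format above handles dates exactly as iWeb renders them. In isolation:

```python
import datetime

date_text = "Tuesday, September 17, 2013"
date = datetime.datetime.strptime(date_text, "%A, %B %d, %Y").date()
```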

Now we can use the following kind of loop to examine each posting.

    for date, title, text_iter in sorted(flat_postings):
        for text in text_iter:
           # examine the text for important content.

We've sorted the posting into date order. Now we can process the text elements to look for the relevant content.

In this case, we're looking for Lat/Lon coordinates, which have rather complex (but easy to spot) regular expressions. So the "examine" part is a series of RE matches to collect the data points we're looking for.

We'll leave off those application-specific details. We'll leave it at the following outline of the processing.

def fact_iter( flat_postings=flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain")) ):
    for date, title, text_iter in sorted(flat_postings):
        fact= Fact()
        for text in text_iter:
           # examine the text for important content, set attributes of fact
           if fact.complete(): 
               yield fact
               fact= Fact()

This iterates through focused data structures that include the requested lat/lon points.

Final Application

The final application function that uses all of these iterators has the following kind of structure.

source= flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
with open('target.csv', 'w', newline='') as target:
    wtr= csv.DictWriter( target, Fact.heading )
    for fact in fact_iter( source ):
        wtr.writerow( fact.as_dict() )

We're simply iterating through the facts and writing them to a CSV file.

We can also simplify the last bit to this.

wtr.writerows( f.as_dict() for f in fact_iter( source ) )
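The same DictWriter pattern can be exercised against an in-memory target (the heading and fact rows here are stand-ins for Fact.heading and fact_iter's output):

```python
import csv
import io

heading = ["date", "title", "lat", "lon"]  # stand-in for Fact.heading
facts = [
    {"date": "2013-09-17", "title": "iWeb", "lat": "36.8", "lon": "-76.3"},
]

target = io.StringIO()
wtr = csv.DictWriter(target, heading)
wtr.writeheader()
wtr.writerows(facts)
result = target.getvalue()
```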

The iWeb XML structure, while bulky and complex, can easily be reduced to a simple iterator. That's why I love Python.

Thursday, September 12, 2013

Omni Outliner, XML Processing, and Recursive Generators

First, and most important, Omni Outliner is a super-flexible tool. Crazy levels of flexibility. It's very much a generic-all-singing-all-dancing information management tool.

It has a broad spectrum of file export alternative formats. Most of which are fine for import into some kind of word processor.

But what if the data is more suitable for a spreadsheet or some more structured environment? What if it was a detailed log or a project outline decorated with a column of budget numbers?

We have two approaches, one is workable, but not great, the other has numerous advantages.

In the previous post, "Omni Outliner and Content Conversion", we read an export in tab-delimited format. It was workable but icky.

Here's the alternative. This uses a recursive generator function to flatten out the hierarchy. There's a trick to recursion with generator functions.

Answer 2: Look Under the Hood

At the Mac OS X level, an Omni Outline is a "package". A directory that appears to be a single file icon to the user. If we open that directory, however, we can see that there's an XML file inside the package that has the information we want.

Here's how we can process that file.

import xml.etree.ElementTree as xml
import os
import gzip

packagename= "{0}.oo3".format(filename)
assert 'contents.xml' in os.listdir(packagename)
with gzip.GzipFile(packagename+"/contents.xml", 'rb' ) as source:
   self.doc= xml.parse(source)

This assumes it's compressed on disk. The outlines don't have to be compressed. It's an easy try/except block to attempt the parsing without unzipping. We'll leave that as an exercise for the reader.
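One way that exercise might look (parse_contents is a hypothetical helper name, not from the original; the demo builds a throwaway package directory with an uncompressed contents.xml):

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as xml

def parse_contents(packagename):
    """Parse contents.xml whether or not it is gzip-compressed."""
    path = os.path.join(packagename, "contents.xml")
    try:
        with gzip.GzipFile(path, 'rb') as source:
            return xml.parse(source)
    except OSError:
        # not gzipped: parse the plain XML file directly
        return xml.parse(path)

# demo against a throwaway package directory with an uncompressed file
pkg = os.path.join(tempfile.mkdtemp(), "demo.oo3")
os.mkdir(pkg)
with open(os.path.join(pkg, "contents.xml"), "wb") as f:
    f.write(b"<outline/>")
doc = parse_contents(pkg)
```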

And here's how we can get the column headings: they're easy to find in the XML structure.

self.heading = []
for c in self.doc.findall( '{}columns/{}column' ):
    # print( c.tag, c.attrib, c.text )
    if c.attrib.get('is-note-column','no') == "yes":
        # is-outline-column == "yes"? May be named "Topic".
        # other columns don't have a special role
        title= c.find("{}title")
        name= "".join( title.itertext() )
        self.heading.append( name )

Now that we have the columns titles, we're able to walk the outline hierarchy, emitting normalized data. The indentation depth number is provided to distinguish the meaning of the data.

This involves a recursive tree-walk. Here's the top-level method function.

def __iter__( self ):
    """Find  for outline itself. Each item has values and children.
    Recursive walk from root of outline down through the structure.
    """
    root= self.doc.find("{}root")
    for item in root.findall("{}item"):
        for row in self._tree_walk(item):
            yield row

Here's the internal method function that does the real work.

    def _tree_walk( self, node, depth=0 ):
        """Iterator through item structure; descends recursively.
        note= node.find( '{}note' )
        if note is not None:
            note_text= "".join( note.itertext() )
            note_text= None
        data= []
        values= node.find( '{}values' )
        if values is not None:
            for c in values:
                if c.tag == "{}text":
                    text= "".join( c.itertext() )
                    data.append( text )
                elif c.tag == "{}null":
                    data.append( None )
                    raise Exception( c.tag )
        yield depth, note_text, data
        children= node.find( '{}children' )
        if children is not None:
            for child in children.findall( '{}item' ):
                for row in self._tree_walk( child, depth+1 ):
                    yield row

This gets us the data in a form that doesn't require a lot of external schema information.

Each row has the indentation depth number, the note text, and the various columns of data. The only thing we need to know is which of the data columns has the indented outline.

The Trick

Here's the tricky bit.

When we recurse using a generator function, we have to explicitly iterate through the recursive result set. This is different from recursion in simple (non-generator) functions. In a simple function, it looks like this.

def function( args ):
    if base case: return value
    return calculation on function( other args )
When there's a generator involved, we have to do this instead.

def function_iter( args ):
    if base case: yield value
    for x in function_iter( other args ):
        yield x
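The same trick in a runnable form: flattening nested lists with a recursive generator. Note the explicit loop over each recursive call.

```python
def flatten(items):
    """Recursive generator: we must iterate over the recursion explicitly."""
    for item in items:
        if isinstance(item, list):
            for x in flatten(item):
                yield x
        else:
            yield item

result = list(flatten([1, [2, [3, 4]], 5]))
```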

Columnizing a Hierarchy

The depth number makes our data look like this.

0, "2009"
1, "November"
2, "Item In Outline"
3, "Subitem in Outline"
1, "December"
2, "Another Item"
3, "More Details"

We can normalize this into columns. We can take the depth number as a column number. When the depth numbers are increasing, we're building a row. When the depth number decreases, we've finished a row and are starting the next row.

"2009", "November", "Item in Outline", "Subitem in Outline"
"2009", "December", "Another Item", "More Details"

The algorithm works like this.

row, depth_prev = [], -1
for depth, text in source:
    while len(row) <= depth+1: row.append(None)
    if depth <= depth_prev: yield row
    row[depth:]= [text]+(len(row)-depth-1)*[None]
    depth_prev= depth
yield row

The yield will have to also handle the non-outline columns that may also be part of the Omni Outliner extract.
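A runnable version of the single-text case, assuming (depth, text) pairs as input; it copies each finished row before yielding, since the working row keeps mutating:

```python
def columnize(source):
    """Turn (depth, text) pairs into normalized fixed-position rows."""
    row, depth_prev = [], -1
    for depth, text in source:
        while len(row) < depth + 1:
            row.append(None)
        if depth <= depth_prev:
            yield list(row)  # copy: the working row keeps mutating
        row[depth:] = [text] + (len(row) - depth - 1) * [None]
        depth_prev = depth
    yield list(row)

outline = [
    (0, "2009"), (1, "November"), (2, "Item In Outline"),
    (3, "Subitem in Outline"), (1, "December"),
    (2, "Another Item"), (3, "More Details"),
]
rows = list(columnize(outline))
```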

Tuesday, September 10, 2013

Omni Outliner and Content Conversion

First, and most important, Omni Outliner is a super-flexible tool. Crazy levels of flexibility. It's very much a generic-all-singing-all-dancing information management tool.

It has a broad spectrum of file export alternative formats. Most of which are fine for import into some kind of word processor.

But what if the data is more suitable for a spreadsheet or some more structured environment? What if it was a detailed log or a project outline decorated with a column of budget numbers?

We have two approaches, one is workable, but not great, the other has numerous advantages.

Answer 1: Workable

Sure, you say, that's easy. Export into a Plain Text with Tabs (or HTML or OPML) and then parse the resulting tab-delimited file.

In Python. Piece of cake.

import csv

class Tab_Delim(csv.Dialect):
    # tab-delimited, no quoting
    delimiter= '\t'
    quotechar= '"'
    doublequote= True
    skipinitialspace= False
    lineterminator= '\n'
    quoting= csv.QUOTE_NONE
rdr= csv.reader( source, dialect=Tab_Delim )
column_names= next(rdr)
for row in rdr:
   # Boom. There it is.    

That gets us started. But.

Each row is variable length. The number of columns varies with the level of indentation. The good news is that the level of indentation is consistent. Very consistent. Year, Month, Topic, Details in this case.

[When an outline is super consistent, one wonders why a spreadsheet wasn't used.]

Each outline node in the export is prefaced with "- ".

It looks pretty when printed. But it doesn't parse cleanly, since the data moves around.

Further, it turns out that "notes" (blocks of text attached to an outline node, but not part of the outline hierarchy) show up in the last column along with the data items that properly belong in the last column.


The good news is that notes seem to appear on a line by themselves, where the data elements seem to be properly attached to outline nodes. It's still possible to have a "blank" outline node with data in the columns, but that's unlikely.

We have to do some cleanup.

Answer 1A: Cleanup In Column 1 

We want to transform indented data into proper first-normal form schema with a consistent number of fixed columns. Step 1 is to know the deepest indent. Step 2 is to then fill each row with enough empty columns to normalize the rows.

Each specific outline has a kind of schema that defines the layout of the export file. One of the tab-delimited columns will be the "outline" column: it will have tabs and leading "-" to show the outline hierarchy. The other columns will be non-outline columns. There may be a notes column and there will be the interesting data columns which are non-notes and non-outline.

In our tab-delimited export, the outline ("Topic") is first. Followed by two data columns. The minimal row size, then, will be three columns. As the topics are indented more and more, then the number of columns will appear to grow. To normalize, then, we need to pad, pushing the last two columns of data to the right.

That leads to a multi-part cleanup pipeline. First, figure out how many columns there are.

    rows= list( rdr )
    width_max= max( len(r) for r in rows )+1

This allows us the following two generator functions to fill each row and strip "-".

def filled( source, width, data_count ):
    """Iterable with each row filled to given width.
    Rightmost {data_count} columns are pushed right to preserve
    their position.
    """
    for r_in in source:
        yield r_in[:-data_count] + ['']*(width-len(r_in)) + r_in[-data_count:]

def cleaned( source ):
    """Iterable with each column cleaned of leading "- ".
    """
    def strip_dash( c ):
        return c[2:] if c.startswith('- ') else c

    for row in source:
        yield list( strip_dash(c) for c in row )
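A quick check of those two generators against some assumed sample rows (width 5, two rightmost data columns):

```python
def filled(source, width, data_count):
    # pad each row to `width`, keeping the last `data_count` columns rightmost
    for r_in in source:
        yield r_in[:-data_count] + [''] * (width - len(r_in)) + r_in[-data_count:]

def cleaned(source):
    # strip the leading "- " outline marker from each cell
    def strip_dash(c):
        return c[2:] if c.startswith('- ') else c
    for row in source:
        yield [strip_dash(c) for c in row]

rows = [['- 2009', '10', '20'],
        ['', '- November', '30', '40']]
out = list(cleaned(filled(rows, 5, 2)))
```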

That gets us to the following main loop in a conversion function.

    for row in cleaned( filled( rows, width_max, len(columns) ) ):
        # Last column may have either a note or column data.
        # If all previous columns are empty, it's probably a note, not a numeric value.
        if all( len(c)==0 for c in row[:-1] ):
            row[4]= row[-1]  # Index 4 is this schema's notes column.
            row[-1]= ''
        yield row

Now we can do some real work with properly normalized data. With overheads, we have an 80-line module that lets us process the outline extract in a simple, civilized CSV-style loop.
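To make the shape of that "simple, civilized CSV-style loop" concrete, here's a minimal, self-contained sketch of the whole pipeline in one function. The `converted()` wrapper, the `data_count` parameter, and the sample rows are hypothetical stand-ins for illustration, not the actual 80-line module:

```python
import csv
import io

# Hypothetical one-piece version of the pipeline described above:
# pad each row out to the widest row, keeping the last data_count
# columns in place, then strip the leading "- " outline markers.
def converted( text, data_count=2 ):
    rows = list( csv.reader( io.StringIO(text), delimiter="\t" ) )
    width = max( len(r) for r in rows )
    for r in rows:
        padded = r[:-data_count] + ['']*(width-len(r)) + r[-data_count:]
        yield [ c[2:] if c.startswith('- ') else c for c in padded ]

# An outline export: the child topic is indented by one extra tab.
result = list( converted( "- Top\tq1\tq2\n\t- Child\tq3\tq4\n" ) )
```

Each resulting row has the same fixed number of columns, ready for ordinary CSV-style processing.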

The Ick Factor

What's unpleasant about this is that it requires a fair amount of configuration.

The conversion from tab-delim outline to normalized data requires some schema information that's difficult to parameterize.

1. Which column has the outline?
2. Will there be notes on lines by themselves?

We can deduce how many columns of ancillary data are present, but the order of the columns is a separate piece of logical schema that we can't deduce from the export itself.

Tuesday, August 27, 2013

Obstinate Idiocy, Expanded

See Obstinate Idiocy for some background.

Here are three warning signs I was able to deduce.

  • No Rational Justification
  • Ineffective Tool Choice
  • Random Whining

To which I can now add two more.

Symptom 4 of Obstinate Idiocy is that all questions are rhetorical and they often come with pre-argued answers.

Actual email quote:

Me: "Excel is almost the stupidest choice possible"

OI: "What criteria are you using to make that statement? My criteria was that I needed a way for non-tech people and non-programmers..."

And on the email spins, pre-arguing points and pre-justifying a bad answer. Since their argument is already presented (in mind-numbing detail), there's no effective way to answer the question they asked. Indeed, there's little point in trying to answer, since the pre-argued response is likely to be the final response.

In order to answer, we have to get past the pre-argued response. This can be difficult because it devolves to "it's political, you don't need the details." So, if it's not technical, why am I involved?

Symptom 5 of Obstinate Idiocy is Learning is Impossible. This may actually be the root cause for Symptom 3, Ineffective Tool Choice. It now seems to me that the tool was chosen to minimize learning. I had suggested using Mathematica. I got this response: "I don't know Python or R or SAS." The answer seems like a non-sequitur because it is. It's justification for a bad decision.

The problem they're trying to solve is gnarly; perhaps it's time to consider learning a better toolset.

Excel has already failed the OI. They asked for an opinion ("Q2: What do you believe are the pros/cons of ... using Excel with "Excel Solver" ...?") that seems to ignore the fact that they already failed trying to use Excel. They already failed, and they followed up by asking for the pros and cons of a tool they already failed with. 

From this limited exchange it appears that they're so unwilling to learn that they can't gather data from their own experience and learn from it.

Thursday, August 8, 2013

Negative Requirements

Nothing is funnier than requirements which state things that are not to happen.

An actual quote.

... don't screw up cutting and pasting and the "/" vs "\" depending on unix / windows.

Why not list everything that's not supposed to happen?

  • No fire in the server room.
  • No anthrax outbreak.
  • No Zombie apocalypse.

The list could go on. I wonder why it doesn't.

Tuesday, August 6, 2013

How to Manage Risk

Also see "On Risk and Estimating and Agile Methods". This post is yet another angle on a common theme.

Orders of Ignorance and Risk Management.

Software risk management has two sides.  First, there's the classical meaning of risk, we'll call that "casino risk" because it's really random events.  This includes fire, flood, famine, conquest, war, pestilence, death, etc.  Actual risks.

The second meaning of risk is a load of malarkey.  It's a code word that includes two things: "bad management" and "ignorance".  Some things called project risks are just plain old bad management—generally driven by a nonexistent process for handling ignorance.  The events aren't random.  

There are five orders of ignorance, and each of them leads to project management problems.  None of these are "random events"; none of this is like casino gambling.  There aren't any odds; most of these things are certainties.

0 Order Ignorance: Things We Know.

There are two sides to the things we think we know about a project.  There are the things we know which are true, and things which are false.  Falsehoods come from at least two places:  we assumed something or we were actually lied to.  (Other choices, like illusions and hallucinations, are too creepy to pursue.)

Our assumptions aren't facts.  This sounds so obviously stupid, but projects get into trouble based on assumptions that were never checked to see if they were true or not.  Managers insist on doing "risk analysis" and then pad their project estimates with time and money instead of simply challenging their assumptions.

Some "assumptions" are explicit placeholders for facts to be found out later.  These formally documented assumptions are a different thing, they're 1st order ignorance, something we know we don't know.

Example.  The customer says they need an app to do [X].  There are 8 people in the department who are likely actors.  We assume the user population is 8.  With no fact checking.  We don't put it in the plan as a formal, documented assumption; we just assume it.

Bottom line.  What are the odds that a plan is based on false knowledge?  This isn't casino gambling.  There aren't any odds that we assumed something, or odds that we were lied to.  This is simple fact-checking, simple bad management. Every unchecked fact is going to be false.

1st Order Ignorance: Things We Don't Know.

There are two ways to deal with things we don't know.  Make a guess and document this, or actually ask a question.

The formally documented guesses (usually called "assumptions") are the hallmark of software project plans.  Documents are often full of lists of assumptions on which the plan and associated estimates rest.

Each one of these "assumptions" is a question that—for some reason—couldn't be asked or couldn't be answered.  Some questions are "politically sensitive" and can't be asked.  Some questions require lots of research to develop an answer.

The answers, of course, change the project.  In most cases they change the project dramatically.  If they didn't have a big impact, we wouldn't spend any time documenting them so carefully, would we?

And all project managers are punished for making any changes.  We can't expand the scope without a lot of accusatory conversations where people keep repeating the original price back to us.  

The canonical line is something like "I thought this was only 1.8 million dollars, how can you change it now?" or "I've already committed to $750K, we can't change the price, something else has to be changed.  We have to work smarter not harder."  Bleah.

This doesn't involve casino-like odds of finding answers that will change the scope of the project.  We know we have questions.  We know we made guesses and documented them as "assumptions".  There were no odds; it was absolutely certain there would be changes.

Bottom line.  What are the odds that a plan is based on things we guessed at?  Typically, this is a fact of life: parts of the estimate are guesses.  

What are the odds that the real answer will be different from the guess?  This, too, is absolutely certain. It's merely a question of magnitude.

For first-order ignorance problems, we should create a contingency budget for each answer that will diverge from the guesses.  This isn't a book-making exercise, it's a list of alternate guesses (or assumptions).

It isn't enough to simply detail the assumptions.  We have to provide alternative answers and the associated costs when the assumptions turn out to be false.

2nd Order Ignorance: Things We Didn't Know to Ask.

If we didn't know to ask, what's that?  Is that a risk?  What are the odds we forgot to ask something?

Here's the canonical quote: "Are there any other unforeseen problems?"

What? If they're unforeseen, then, uhhh, we can't identify them now.

If we can identify them, then, uhhh, they're not unforeseen.

There aren't any "odds" of an unforeseen problem.  It's an absolute certainty that there will be unforeseen problems.

Remember, these are things we didn't know to ask.  Things that didn't make the list of "assumptions".  These are things that completely escaped our attention.

What will the impact be?  We have no way of knowing. No. Way. Of. Knowing.

We can't even put a contingency in place for these things.  We didn't know to ask.  So we don't know what it will cost for rework when we figure out what we should have asked.

All we can do here is use a good, transparent management process.  Each new piece of information -- each thing that's learned that we didn't know to ask -- will change scope, schedule, cost, deliverables, everything.

This isn't "casino risk".  There are no odds associated with this.  This is just change management. It's inevitable. Calling it a project risk is lying about it.

3rd Order Ignorance: No Process for Managing Ignorance.

When we have second-order ignorance (we didn't know to ask) there are two responses: an organized change-management process, or a leap down to 3rd order ignorance.  Third order ignorance slips from simply not knowing into denying that the level of knowledge changes through time.

When we learn something unexpected, we can either deny that it is something new, or we can expose it.  When a business analyst learns that the "simple" calculation involves a magical MS-Access database with no known author, magical numbers and no discernible calculations, this is going to change the scope of the work.  Or make it impossible to make progress until someone explains the MS-Access database.

Denying this kind of unexpected information is common. It's done by playing the management trump card of "schedule is sacred."  Once the schedule is sacred, all learning is either trivially denied or learning turns into ways of shaving scope or quality to make the schedule.  

3rd Order Ignorance means there's no change process and the "schedule is sacred".  If the only thing that matters is schedule, then buggy, useless software will be delivered on time and on budget.

What are the odds of 3rd order ignorance?  Either 1.0 or 0.0.  Either the organization has an effective change management process (in which case, we don't have 3rd order ignorance) or there will be problems in delivering software that works on time. 

Bottom Line.

Here's the summary of ignorance and mitigation.

0th order ignorance: do basic fact checking to validate your assumptions.

1st order ignorance: do contingency planning. Define specific contingencies around each specific unknown fact.  Don't just document an "assumption", plan for alternatives when the assumption is invalidated.

2nd order ignorance: have a change management process.

3rd order ignorance (i.e., no change management): stop using waterfall-style methodologies. Switch to Agile methods so that change and the management of ignorance become essential features of the overall process.

4th order ignorance is the state of not being aware that ignorance is one of the most significant driving forces behind project failure.  A symptom of 4th-order ignorance is conflating "risk analysis" for a project with "casino risks" (or "insurance risks.")  With rare exceptions, all project risk analysis is just ways of coping with bad management.

When there's 4th order ignorance, folks are told that it's helpful or meaningful to try and assign odds to the veracity of the facts, the presence of things which were forgotten, and the change management process itself.

Avoiding 4th order ignorance means recognizing that software project management "risks" are just bad management (with minor exceptions for fire, flood, famine, conquest, war, pestilence, and death.)

Here's how to manage risk:

  • Check the facts, 
  • plan specific contingencies, 
  • use Agile methods because of their built-in ability to manage change.

Thursday, July 25, 2013

Database Conversion or Schema Migration

I was told a horror story about a bad database schema migration. Let's call the author of the horror story HPL.

HPL related a tale of woeful conversion problems from a mismanaged schema migration.

While I could feel HPL's pain, the reasons given for their pain were wrong. They didn't quite get the lessons they had learned. Consequently, HPL sounded like someone doomed to repeat the mistake, or—worse—unlearning their lessons.

Here's HPL's most distressing comment.
"we can't migrate over the weekend and be done w/ it."
Apparently, the horror of a weekend migration was somehow desirable to HPL. Who wants a lost weekend? And who wants to put all of the eggs in a single basket?

Anyone who's done more than one "lost weekend migration"—and who's also honest—knows that they don't go well. There are always subsets of data that (a) don't get converted properly and (b) have to get swept under the carpet in order to claim to meet the schedule.

It's a standard situation to have less than 100% of the data successfully converted and still call the effort a success. If 100% was not required, why lose a weekend over it?

Good Plans and Bad Plans

From far wiser people than me, I learned an important lesson in schema migration.

These Wiser Heads ran a "conversion" business. They moved data and applications from platform to platform. They knew a lot about database schema migrations. A lot.

Their standard plan was to build a schema migration script (usually a sequence of apps) that could be run to convert the database (or files or whatever) from old to new schema as often as was necessary.

I'll repeat that.

As often as was necessary.

They debugged the script to get to an acceptable level of conversion. The data conversion (or schema migration) was perfectly repeatable. Of course, they longed for 100% conversion; but pragmatically, the legacy software had bad data. So some fraction would not convert. And once that fraction was found, the schema migration applications could be modified to treat the non-convertable data in some intelligent way.

Their stated goal was to convert data and run parallel testing with that converted data as often as necessary to create confidence that the new data was as perfect a conversion as was possible. At some point, the confidence became certainty and the parallel testing was deemed complete. Since they were parallel testing with live data, the decision amounted to a formalized "commissioning" of the new application. And by then, the new application was already being used.
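As a sketch of what "as often as was necessary" means in practice, the conversion pass below keeps whatever converts cleanly and reports the rest for the next debugging cycle. The `migrate()` and `to_new_schema()` functions and the two-column legacy rows are invented for illustration; they're not the Wiser Heads' actual tooling:

```python
# One repeatable conversion pass: convert what converts cleanly,
# collect the legacy data that doesn't, and report it.
def migrate( legacy_rows, transform ):
    converted, rejects = [], []
    for row in legacy_rows:
        try:
            converted.append( transform(row) )
        except ValueError as problem:
            rejects.append( (row, str(problem)) )
    return converted, rejects

# A hypothetical old-to-new schema transform.
def to_new_schema( row ):
    name, amount = row
    return { "name": name.strip(), "amount": int(amount) }

# Re-running this same pass against fresh legacy data is the debugging loop:
# each run shrinks the reject list until the conversion is acceptable.
good, bad = migrate( [("Alice ", "10"), ("Bob", "n/a")], to_new_schema )
```

Because the pass is a script, not a lost weekend, the reject report drives the next refinement of the transform.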

There are bad ways to do schema migration, of course. HPL listed many.

Horrible Mistakes

The horror story from HPL included this:
"For the migrated tables, create views in the old system and create instead of triggers on those views to ship data to the new system."
It appears that they used views and triggers to create a new system "facade" over the legacy system. Apparently, they wanted both suites of application software to coexist. Not a good approach to schema migration. It appeared that they were trying to share one database with two application schema.

This seems like it's doomed. Unless they're all geniuses.

Wiser Heads would have strongly suggested that the new system use an extract of the old system's data.

HPL goes on to complain,
"Sometimes we can take over a column or 2 and sometimes we can only take over some of the data in the table".  
HPL emphasizes this point with "This is not that far fetched". I'm not sure why the emphasis was needed.

This is not "far fetched". It doesn't need emphasis. It's not really much of a problem, either. It's a standard part of schema migration. Extracting a copy of the data makes this quite easy. Triggers and views to create some kind of active SQL-based Facade is what created the complexity. Not the number of columns involved.

HPL summarizes,
"So you end up w/ [many] tables/views triggers all moving data back and forth and faking stuff out"
Back and forth. A fundamental mistake. A copy can be much easier to manage. One way data movement: Legacy to New.

HPL concludes with a litany of errors of various types: performance, change management, file system issues, error logging and auditing. Blah blah blah. Yes, it was a nightmare. I feel their pain.

What About Coexistence? 

It appears that HPL was involved in a project where the entire old and new applications were supposed to somehow coexist during the conversion.

It appeared that they failed to do any kind of partitioning.

Coexistence is not a trivial exercise. Nor is it a monolith where the entire legacy application suite must coexist with the entire new schema and the entire new application suite.

Pragmatically, coexistence usually means that some portion of the legacy must be kept running while some other portion is modernized. This means the coexistence requires that the application portfolio be partitioned.

Step 1: Some suite of functionality is migrated. That means data from the legacy database/file system is copied to new. That also means some data from new is copied back into the legacy database file/system. Copied.

Step 2: Some other suite of functionality is migrated. As functionality is moved, less and less data is copied back to the legacy.

At some point, this copying back is of no value and can be discontinued.

What About Timing?

This copying clearly requires some coordination. It's not done haphazardly.

Does it require "real time" data movement? i.e. triggers and views?

Rarely is real time movement required. This is the point behind partitioning wisely. Partitioning includes timing considerations as well as data quality and functionality considerations.

It's remotely possible that timing and partitioning are so pathological that data is required in both legacy and new applications concurrently. This is no reason to throw the baby out with the bathwater. This is nothing more than an indication that the data is being copied back to the legacy application close to real time.

This also means performance must be part of the test plan. As well as error handling and diagnostic logging. None of this is particularly difficult. It simply requires care.

Lessons Learned

HPL appeared to make the claim that schema migration is super hard. Or maybe that coexistence is really hard.

Worse, HPL's horror story may be designed to indicate that a horrifying lost weekend is the only way to do schema migration.

Any or all of these are the wrong lessons to learn.

I think there are several more valuable lessons here.

  1. Schema migration can and should be done incrementally. It's ideally tackled as an Agile project using Scrum techniques. It's okay to have release cycles that are just days apart as each phase of the conversion is run in parallel and tested to the user's satisfaction.
  2. Coexistence requires partitioning to copy any data back to unconverted legacy components. Triggers and views and coexistence of entire suites of software make a difficult problem harder.
  3. The conversion script is just another first-class application. The same quality features apply to the conversion as to every other component of the app suite.
  4. The conversion must be trivially repeatable. It must be the kind of thing that can be run as often as necessary to move legacy data to the new schema. 

Tuesday, July 23, 2013

Almost a good idea

Appleworks (formerly Clarisworks) is software that's been dead meat since 2007.


Which is fine unless you have an old computer with old applications that still works. For example, a 2002-vintage iMac G4 still works. Slowly.

When someone jumps 11 years to a new iMac, they find that their 2002 iMac with 2007 apps has files which are essentially unreadable by modern applications.

How can someone jump a decade and preserve their content?

1.  iWork Pages is cheap. Really. $19.99.  I could have used it to convert their files to their new iMac and then told them to ignore the app. Pages can be hard to learn. For someone jumping from 2007-vintage apps, it's probably too much. However, they can use TextEdit once the files are converted to RTF format.
2.  iWork for iCloud may be a better idea. But they have to wait a while for it to come out. And they want their files now.
3.  Try to write a data extractor.

There aren't many places to start: documentation on the Appleworks file format does not seem to exist. It's a very weird void, utterly bereft of information.

In the long run $19.99 for a throw-away copy of Pages is probably the smartest solution.

However, if you're perhaps deranged, you can track down the content through a simple brute-force analysis of the file. This is Python3 code to scrape the content out of a .CWK file.

import argparse
import struct
import sys
import os
from io import open

class CWK:
    """Analyzes a .CWK file; must be given a file opened in "rb" mode.
    """
    DSET = b"DSET"
    BOBO = b"BOBO"
    ETBL = b"ETBL"

    def __init__( self, open_file ):
        self.the_file= open_file
        self.data= open_file.read()

    def header( self ):
        # First bytes: version, the b"BOBO" magic, previous version.
        self.version= self.data[0:4]
        bobo= self.data[4:8]
        version_prev= self.data[8:12]
        #print( self.version[:3] )
        assert bobo == self.BOBO
        #print( version_prev[:3] )
        return self.version

    def margins( self ):
        self.height_page= struct.unpack( ">h", self.data[30:32] )[0]
        self.width_page= struct.unpack( ">h", self.data[32:34] )[0]
        self.margin_1= struct.unpack( ">h", self.data[34:36] )[0]
        self.margin_2= struct.unpack( ">h", self.data[36:38] )[0]
        self.margin_3= struct.unpack( ">h", self.data[38:40] )[0]
        self.margin_4= struct.unpack( ">h", self.data[40:42] )[0]
        self.margin_5= struct.unpack( ">h", self.data[42:44] )[0]
        self.margin_6= struct.unpack( ">h", self.data[44:46] )[0]
        self.height_page_inner= struct.unpack( ">h", self.data[46:48] )[0]
        self.width_page_inner= struct.unpack( ">h", self.data[48:50] )[0]

    def dset_iter( self ):
        """First DSET appears to have content.

        This DSET parsing may not be completely correct.
        But it finds the first DSET, which includes all
        of the content except for headers and footers.

        It seems wrong to simply search for DSET; some part of the
        resource directory should point to this or provide an offset to it.
        """
        for i in range(len(self.data)):
            if self.data[i:i+4] == self.DSET:
                #print( "DSET", i, hex(i) )
                pos= i+4
                for b in range(5): # Really? Always 5?
                    size, count= struct.unpack( ">Ih", self.data[pos:pos+6] )
                    pos += size+4
                #print( self.data[i:pos] )
                yield pos

    def content_iter( self, position ):
        """A given DSET may have multiple contiguous blocks of text."""
        done= False
        while not done:
            size= struct.unpack( ">I", self.data[position:position+4] )[0]
            content= self.data[position+4:position+4+size].decode( "MacRoman" )
            #print( "ENDING", repr( self.data[position+4+size-1] ) )
            if self.data[position+4+size-1] == 0:
                yield content[:-1]  # Drop the trailing NUL.
                done= True
            else:
                yield content
                position += size+4

The function invoked from the command line is this.

def convert( *file_list ):
    for f in file_list:
        base, ext = os.path.splitext( f )
        new_file= base+".txt"
        print( '"Converting {0} to {1}"'.format(f,new_file) )
        with open(f,'rb') as source:
            cwk= CWK( source )
            with open(new_file,'w',encoding="MacRoman") as target:
                position = next( cwk.dset_iter() )
                for content in cwk.content_iter(position):
                    # print( content.encode("ASCII",errors="backslashreplace") )
                    target.write( content )
        atime, mtime = os.path.getatime(f), os.path.getmtime(f)
        os.utime( new_file, (atime,mtime) )

This is brute-force. But. It seemed to work. Buying Pages would have been less work and probably produced better results.

This does have the advantage of producing files with the original date stamps.  Other than that, it seems an exercise in futility because there's so little documentation.

What's potentially cool about this is the sane way that Python3 handles bytes as input. Particularly pleasant is the way we can transform the file-system sequence of bytes into proper Python strings with a very simple bytes.decode().
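For instance, turning a run of MacRoman-encoded bytes into a proper string is one explicit step (the sample bytes here are just an illustration):

```python
# AppleWorks stores text as MacRoman-encoded bytes; Python 3 keeps
# bytes and str strictly separate, so the conversion is explicit.
raw = b"caf\x8e"               # 0x8E is "é" in the MacRoman encoding
text = raw.decode("MacRoman")  # a proper Python 3 str: 'café'
```

No codecs boilerplate, no mutable decoder state; just `bytes.decode()`.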

Thursday, July 18, 2013

NoSQL Befuddlement: DML and Persistence

It may be helpful to look back at 'How Managers Say "No"' which is about breaking the RDBMS Hegemony.

I got an email in which the simple concepts of "data manipulation" and "persistence" had become entangled with SQL DML to a degree that the conversation failed to make sense to me.

They had been studying Pandas and had started to realize that the RDBMS and SQL were not an essential feature of all data processing software.

I'll repeat that with some emphasis to show what I found alarming.
They had started to realize that the RDBMS and SQL were not an essential feature of all data processing.
Started to realize.

They were so entrenched in RDBMS thinking that the very idea of persistent data outside the RDBMS was novel to them.

They asked me about extending their growing realization to encompass other SQL DML operations: INSERT, UPDATE and DELETE. Clearly, these four verbs were all the data manipulation they could conceive of.

This request meant several things, all of which are unnerving.
  1. They were sure—absolutely sure—that SQL DML was essential for all persistent data. They couldn't consider read-only data? After all, a tool like Pandas is clearly focused on read-only processing. What part of that was confusing to them? 
  2. They couldn't discuss persistence outside the narrow framework of SQL DML. It appears that they had forgotten about the file system entirely.
  3. They conflated data manipulation and persistence, seeing them as one thing.
After some back-and-forth it appeared that they were looking for something so strange that I was unable to proceed. We'll turn to this, below.

Persistence and Manipulation

We have lots of persistent data and lots of manipulation. Lots. So many that it's hard to understand what they were asking for.

Here are some places to start looking for hints on persistence.

This list might provide some utterly random hints as to how persistent data is processed outside of the narrow confines of the RDBMS.

For manipulation... Well... Almost the entire Python library is about data manipulation. Everything except itertools is about stateful objects and how to change state ("manipulate the data.")

Since the above lists are random, I asked them for any hint as to what their proper use cases might be. It's very difficult to provide generic hand-waving answers to questions about concepts as fundamental as state and persistence. State and persistence pervade all of data processing. Failure to grasp the idea of persistence outside the database almost seems like a failure to grasp persistence in the first place.
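As one deliberately mundane illustration of persistence with no RDBMS anywhere in sight: the standard library's pickle module plus the file system is a complete round trip. The record contents and file name below are invented for the example:

```python
import os
import pickle
import tempfile

# A stateful object, persisted and restored with nothing but
# the file system and the standard library.
record = {"user": "hpl", "visits": 3}
path = os.path.join(tempfile.mkdtemp(), "state.pickle")

with open(path, "wb") as target:
    pickle.dump(record, target)   # persist

with open(path, "rb") as source:
    restored = pickle.load(source)  # restore: an equal, distinct object
```

shelve, json, csv, and plain file writes all follow the same pattern: persistence is a property of the application's design, not a feature of SQL.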

The Crazy Request

Their original request was—to me—incomprehensible. As far as I can tell, they appeared to want the following.

I'm guessing they were hoping for some kind of matrix showing how DML or CRUD mapped to other non-RDBMS persistence libraries.

So, it would be something like this.

Each verb would map to file system, pandas, json, and csv operations:

  • CREATE: file(), some pandas request, json.dump(), csv.writer()
  • INSERT: file.write(), depends on the requirements, could be anything, csv.writerow(); file.write()
  • UPDATE: doesn't make sense, not something that generalizes well, depends on the requirements; file.write(), inappropriate for analysis
  • DELETE: depends on the requirements, hard to make this up without more details
  • APPEND (not part of DML): file.write(), depends on requirements, could be anything, csv.writerow()

The point here is that data manipulation, state and persistence is intimately tied to the application's requirements and processing.

All of which presumes you are persisting stateful objects. It is entirely possible that you're persisting immutable objects, and state change comes from appending new relationships, not changing any objects.

The SQL reductionist view isn't really all that helpful. Indeed, it appears to have been deeply misleading.
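To show how far the four-verb view can be from real persistence, here's an append-only sketch where a "state change" is a new immutable row, not an UPDATE. The order events are invented for the example:

```python
import csv
import io

# An append-only event log: later rows supersede earlier ones.
# Nothing is ever updated or deleted in place.
buffer = io.StringIO()
wtr = csv.writer(buffer)
wtr.writerow(["order-1", "created"])
wtr.writerow(["order-1", "shipped"])  # a new fact, not an UPDATE

history = buffer.getvalue().splitlines()
```

The current state of order-1 is simply the last row mentioning it; the full history is preserved for free.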

The Log File

Here's an example that seems to completely violate the spirit of their request. This is ordinary processing that doesn't fit the SQL DML mold very well at all.

Let's look at log file processing.
  1. Logs can be persisted as simple files in simple directories. Compressed archives are even better than simple files.
  2. For DML, a log file is append-only. There is no insert, update or delete.
  3. For retrieval, a query-like algorithm can be elegantly simple. 
Without any brain-cramping, one can create simple map-reduce style processing for log files. See "Map Reduce -- How Cool is that?" for a snippet of Python code that turns each row of an Apache log file into a record-like tuple. It also shows how to scan multiple files and directories in simple map-reduce style loops.

Interestingly, we would probably lose considerable performance if we tried to load a log file into an RDBMS table. Why? The RDBMS file for a table that represents a given log file is much, much larger than the original file. Reading a log file directly involves far fewer physical I/O operations than the table.

Here's something that I can't answer for them without digging into their requirements.

A "filter" could be considered as a DELETE.  Or a DELETE can be used to implement a filter. Indeed, the SQL DELETE may work by changing a row's status, meaning that the SQL DELETE operation is actually a filter that rejects deleted records from future queries.

Which is it? Filter or Delete? This little conundrum seems to violate the spirit of their request, also.

Python Code

Here's an example of using persistence to filter the "raw" log files. We keep the relevant events and write these in a more regular, easier-to-parse format. Or, perhaps, we delete the irrelevant records. In this case, we'll use a CSV file (with quotes and commas) to speed up future parsing.

We might have something like this:

import re
import csv

log_row_pat= re.compile( r'(\d+\.\d+\.\d+\.\d+) (\S+?) (\S+?) (\[[^\]]+?]) ("[^"]*?") (\S+?) (\S+?) ("[^"]*?") ("[^"]*?")' )

def log_reader( row_source ):
    """Parse raw Apache log rows into tuples of fields."""
    for row in row_source:
        m= log_row_pat.match( row )
        if m is not None:
            yield m.groups()

def some_filter( source ):
    """Keep the relevant events; some_condition() is application-specific."""
    for row in source:
        if some_condition(row):
            yield row

with open( subset_file, "w" ) as target:
    with open( source_file ) as source:
        rdr= log_reader( source )
        wtr= csv.writer( target )
        wtr.writerows( some_filter( rdr ) )

This is amazingly fast and very simple. It uses minimal memory and results in a subset file that can be used for further analysis.

Is the filter operation really a DELETE?

This should not be new; it should not even be interesting.

As far as I can tell, they were asking me to show them how data processing can be done outside a relational database. This seems obvious beyond repeating. Obvious to the point where it's hard to imagine what knowledge gap needs to be filled.


Persistence is not a thing you haphazardly laminate onto an application as an afterthought.

Data Manipulation is not a reductionist thing that has exactly four verbs and no more.

Persistence—like security, auditability, testability, maintainability—and all the quality attributes—is not a checklist item that you install or decline.

Without tangible, specific use cases, it's impossible to engage in general hand-waving about data manipulation and persistence. The answers don't generalize well and depend in a very specific way on the nature of the problem and the use cases.

Tuesday, July 16, 2013

How Managers Say "No": The RDBMS Hegemony Example

Got an email looking for help in attempting to break through the RDBMS Hegemony. It's a little confusing, but this is the important part of how management says "no".
"Their response was nice but can you flush [sic] it out more"
[First: the word is "flesh": "flesh it out." Repeat after me: "Flesh it out," "Flesh it out," "Flesh it out." Flesh. Put flesh on the bones. No wonder your presentation went nowhere, either you or the manager or both need help. English as a second language is only an excuse if you never read anything in English.]

There's a specific suggestion for this "more". But it indicates a profound failure to grasp the true nature of the problem. It amounts to a drowning person asking us to throw them a different colored brick. It's a brick! You want a life preserver! "No," they insist, "I want a brick to build steps to climb out."

Yes, RDBMS Hegemony is a real problem. I've talked about it before "Hadoop and SQL/Relational Hegemony". Others have noted it: "NoSQL and NewSQL overturning the relational database hegemony". You can read more concrete details in articles like this: "Introduction to Non-Relational Data Storage using Hbase".

RDBMS Hegemony is most visible when every single in-house project seems to involve the database. And some of those uses of the database are clearly inappropriate.

For example, trying to mash relatively free-form "documents" into an RDBMS is simple craziness. Documents—you know, the stuff created by word processors—are largely unstructured or at best semi-structured. For most RDBMS's, they're represented as Binary Large Objects (BLOBs). To make it possible to process them, you can decorate each document with "metadata" or tags and populate a bunch of RDBMS attributes. Which is fine for the first few queries. Then you realize you need more metadata. Then you need more flexible metadata. Then you need interrelated metadata to properly reflect the interrelationships among the documents. Maybe you flirt with a formal ontology. Then you eventually realize you really should have started with document storage, not a BLOB in an RDBMS.
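To make the alternative concrete, here's a minimal sketch of document storage outside the RDBMS: each document gets a flexible JSON metadata sidecar in the file system, so new metadata fields never require a schema migration. The function names (save_document, find_documents) and the metadata fields are hypothetical, not from any particular product.

```python
import json
from pathlib import Path

def save_document(root, name, content, **metadata):
    """Store a document plus a flexible JSON metadata sidecar file."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / name).write_bytes(content)
    (root / (name + ".meta.json")).write_text(json.dumps(metadata))

def find_documents(root, **criteria):
    """Yield document paths whose metadata matches all the given criteria."""
    for meta_path in Path(root).glob("*.meta.json"):
        meta = json.loads(meta_path.read_text())
        if all(meta.get(k) == v for k, v in criteria.items()):
            yield meta_path.with_name(meta_path.name[: -len(".meta.json")])
```

Adding "more metadata" here is just another key in the JSON, not an ALTER TABLE and a DBA ticket.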

Yes, some companies offer combo products that do both. The point is this: avoiding the RDBMS pitfall in the first place would have been a big time and money saver. Google Exists. The RDBMS is not the best choice for all problems.

The problem is this:
  • Getting away from RDBMS Hegemony requires management thinking and action.
  • Management thinking is a form of pain.
  • Management action is a form of pain. 
  • Managers hate pain.
In short, the only way to make progress away from the RDBMS is to create or expose existing pain. Or make it possible for the manager to avoid pain entirely.

Let's look at the various approaches.

Doing A "Presentation"

The email hinted at a conversation or presentation on the problem of RDBMS Hegemony. 
"I finally convinced my current client that RDBMS's are expensive in terms of adding another layer to the archtiecture [sic] and then trying to maintain it."
It's not clear from the email what the details of this conversation or presentation were, but it clearly involved the two key technical points (1) the RDBMS has specific use cases, and (2) not all applications fit those use cases.

However. Those two key technical points involve no real management pain.

Real pain comes from cost. And since the RDBMS license is usually site-wide, there's no obvious cost to the technology.

The labor cost for DBA support, similarly, is site-wide and already in the budget. So there's no obvious cost to the labor.

No cost means no pain. No pain means no change.

Asking a manager to think, however, causes actual pain. Managers want technical people to do the thinking for them.

Asking a manager to consider the future means they may have to take action in the future. That's potential pain. 

Either way, a management presentation on database hegemony is pure pain. No useful action will ever come from a simple, direct encapsulation of how the RDBMS is not really the universal data tool. Management said "no" by asking for more information.

We'll return to the "more information" part below.

It was good to start the conversation.

It's good to continue the conversation. But the specific request was silliness.

Exposing the Existing Pain

What's more important than a hypothetical conversation is showing how the RDBMS is causing pain right now. It's easier to convince managers of the hidden cost of the RDBMS by exposing existing actual pain in the current environment. And it has to be a level of pain that exceeds the pain of thinking and taking action.

What's most clear is a specific and avoidable labor cost. Ideally, this specific—and avoidable—labor cost will obviously be associated with something obviously database-related. It must be obvious or it won't yield a technology-related management understanding. If it's not obvious, management will say "no", by asking for more data; they'll claim it's people or process or measurement error.

The best place to look for avoidable labor is break-fix problem reports, bugs and enhancements. Another good source of avoidable costs are schema migrations: waiting for the DBA's to add columns to a table, or add tables to a database.

If you can point to specific trouble tickets that come from wrong use of an RDBMS, then you might be able to get a manager to think about it.
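Turning those trouble tickets into a number a manager can't ignore is mostly a tallying exercise. Here's a sketch, assuming a hypothetical export of tickets with a root-cause tag and a labor-hours column; the field names are illustrative, not from any real ticketing system.

```python
import csv
from collections import Counter

def avoidable_hours(ticket_rows):
    """Total labor hours per root-cause tag from trouble-ticket records."""
    totals = Counter()
    for row in ticket_rows:
        totals[row["root_cause"]] += float(row["hours"])
    return totals

# Usage (hypothetical file):
# with open("tickets.csv") as f:
#     print(avoidable_hours(csv.DictReader(f)).most_common())
```

The tags that dominate the total — schema migrations, BLOB handling, lock contention — are your avoidable, database-related labor cost in hours.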

The Airtight Case

Your goal on breaking RDBMS Hegemony is to have a case that is "airtight". Ideally, so airtight that the manager in question sits up, takes notice, and demands that a project be created to rip out the database and save the company all that cost. Ideally, their action at the end of the presentation is to ask how long it will take to realize the savings.


It is actually pretty easy to make an airtight case. There are often a lot of trouble tickets and project delays due to overuse and misuse of the RDBMS.


Few managers will actually agree to remove the RDBMS from an application that's limping along. Your case may be airtight, and compelling, and backed with solid financials, but that's rarely going to result in actual action.

"If it ain't broke, don't fix it," is often applied to projects with very high thresholds for broken. Very high.

This is another way management says "no". By claiming that the costs are acceptable or the risk of change is unacceptable. Even more farcical claims will often be made in favor of the status quo. They may ask for more cost data, but it's just an elaborate "no".

It's important to make the airtight case.

It's important to accept the "no" gracefully.

Management Rewards

When you look at the management reward structure, project managers and their ilk are happiest when they have a backlog of huge, long-running projects that involve no thinking and no action. Giant development efforts with stable requirements, unchallenging users, mature technology and staff who don't mind multiple-hour status meetings.

A manager with a huge long-running project feels valuable. When the requirements, people and technology are stable, then thinking is effectively prevented.

Suggesting that technology choices are not stable introduces thinking. Thinking is pain. The first response to pain is "no". Usually in the form of "get more data."

Making a technology choice may require that a manager facilitate a conversation which selects among competing technology choices. That involves action. And possible thinking.

Real Management Pain. The response? Some form of "no".

Worse. (And it does get worse.)

Technology selection often becomes highly political. The out-of-favor project managers won't get projects approved because of "risky technology." More Management Pain.

War story. Years ago, I watched the Big Strategic Initiative shot down in flames because it didn't have OS/370 as the platform. The "HIPPO" (Highest Paid Person's Opinion) was that Unix was "too new" and that meant risk. Unix predates OS/370 by many years. When it comes to politics, facts are secondary.

Since no manager wants to think about potential future pain, no manager is going to look outside the box. Indeed, they're often unwilling to look at the edge of the box. The worst are unwilling to admit there is a box.

The "risk" claim is usually used to say "no" to new technology. Or. To say "no" to going back to existing, well-established technology. Switching from database BLOBs to the underlying OS file system can turn into a bizarro-world conversation where management is sure that the underlying OS file system is somehow less trustworthy than RDBMS BLOBs. The idea that the RDBMS is using the underlying file system for persistence isn't a compelling argument.

It's important to challenge technology choices for every new project every time.

It's necessary to accept the "no" gracefully.

The "stop using the database for everything" idea takes a while to sink in.

Proof Of Concept

The only way to avoid management pain (and the inaction that comes from pain avoidance) is to make the technology choice a fait accompli.

You have to actually build something that actually works and passes unit tests and everything.

Once you have something which works, the RDBMS "question" will have been answered. But—and this is very important—it will involve no management thought or action. By avoiding pain, you also default into a kind of management buy-in.

War Story

The vendors send us archives of spreadsheets. (Really.) We could unpack them and load them into the RDBMS. But. Sadly. The spreadsheets aren't consistent. We either have a constant schema migration problem adding yet another column for each spreadsheet, or we have to get rid of the RDBMS notion of a maximalist schema. We don't want the schema to be an "at most" definition; we'd need the schema be an "at least" that tolerates irregularity.
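An "at least" schema can be expressed in a few lines. This sketch (the required column names are hypothetical) keeps any row that has at least the required columns and passes extra, irregular columns through untouched — no schema migration needed when a vendor adds a column.

```python
REQUIRED = {"vendor", "date", "amount"}  # hypothetical "at least" column set

def conforming_rows(row_dicts, required=REQUIRED):
    """Keep rows that have at least the required columns; extras pass through."""
    for row in row_dicts:
        if required <= row.keys():
            yield row
```

Contrast this with the RDBMS "at most" view, where an unexpected column means a rejected load or an ALTER TABLE.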

It turns out that the RDBMS is utterly useless anyway. We're barely using any SQL features. The vendor data is read-only. We can't UPDATE, INSERT or DELETE under any circumstances. The delete action is really a ROLLBACK when we reject their file and a CREATE when they send us a new one.

We're not using any RDBMS features, either. We're not using long-running locks for our transactions; we're using low-level OS locks when creating and removing files. We're not auditing database actions; we're doing our own application logging on several levels.
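Those low-level OS locks amount to an atomic lock-file create. A minimal sketch, assuming a single-host deployment (the function names are mine, not from our actual libraries):

```python
import os

def acquire_lock(path):
    """Atomically create a lock file; False if another process holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release_lock(path):
    """Remove the lock file, letting the next process acquire it."""
    os.remove(path)
```

The O_CREAT|O_EXCL combination is atomic at the OS level, which is all the "transaction isolation" this workload needs.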

All that's left are backups and restores. File system backups and restores. It turns out that a simple directory tree handles the vendor-supplied spreadsheet issue gracefully. No RDBMS used.

We had—of course—originally designed a lot of fancy RDBMS tables for loading up the vendor-supplied spreadsheets. Until we were confronted with reality and the inconsistent data formats.

We quietly stopped using the RDBMS for the vendor-supplied data. We wrote some libraries to read the spreadsheets directly. We wrote application code that had methods with names like "query" and "select" and "fetch" to give a SQL-like feel to the code.
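The SQL-like feel is nothing more than method naming. A sketch of the idea (the class and its columns are illustrative, not our actual library):

```python
class SheetTable:
    """A SQL-flavored facade over rows loaded from a vendor spreadsheet."""
    def __init__(self, rows):
        self.rows = list(rows)
    def select(self, **criteria):
        """Rows matching all column=value criteria -- a WHERE clause, in effect."""
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in criteria.items())]
    def fetch(self, **criteria):
        """The first matching row, or None -- a fetchone(), in effect."""
        matches = self.select(**criteria)
        return matches[0] if matches else None
```

Application code reads like database code, but there's no database underneath — just spreadsheet rows in memory.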

Management didn't need to say "no" by asking for more information. They couldn't say no because (a) it was the right thing to do and (b) it was already done. It was cheaper to do it than to talk about doing it.

Failure To See The Problem

The original email continued to say this:
"how you can achieve RDBMS like behavior w/out an actual RDBMS"
What? Or perhaps: Why?

If you need RDBMS-like behavior, then you need an RDBMS. That request makes precious little sense as written. So. Let's dig around in the email for context clues to see what they really meant.
"consider limting [sic] it to
    An update requires a unique key. Let's limit the key to contain only 1 column.
    A delete requires a unique key. Let's limit the key to contain only 1 column."
Oh. Apparently they really are totally fixated on SQL DML.

It appears that they're unable to conceive of anything outside the SQL DML box.

As noted in the above example, INSERT, UPDATE and DELETE are not generic, universal, always-present use cases. For a fairly broad number of "big data" applications, they're not really part of the problem.

The idea that SQL DML CRUD processing forms a core or foundational set of generic, universal, always-present use cases is part of their conceptual confusion. They're deeply inside the SQL box wondering how they can get rid of SQL.

Back to the drowning person metaphor. 

It's actually not like a drowning person asking for a different colored brick because they're building steps to walk out.

It's like a person who fell face down in a puddle claiming they're drowning in the first place. The brick vs. life preserver question isn't relevant. They need to stand up and look around. They're not drowning. They're not even in very deep water.

They've been lying face-down in the puddle so long, they think it's as wide as the ocean and as deep as a well. They've been down so long it looks like up.

Outside the SQL Box

To get outside the SQL box means to actually stop using SQL even for metaphoric conversations about data manipulation, persistence, transactions, auditing, security and anything that seems relevant to data processing.

To FLESH OUT ["flesh", the word is "flesh"] the conversation on breaking the SQL Hegemony, you can't use hypothetical hand-waving. You need tangible real-world requirements. You need something concrete, finite and specific so that you can have a head-to-head benchmark shootout (in principle) between an RDBMS and something not an RDBMS.

You may never actually build the RDBMS version for comparison. But you need to create good logging and measurement hooks around your first noSQL application. The kind of logging and measurement you'd use for a benchmark. The kind of logging and measurement that will prove it actually works outside the RDBMS. And it works well: reliably and inexpensively. 
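The measurement hooks can be as simple as a decorator that logs the duration of every data-access call. A sketch, assuming the standard library's logging module and a logger name of my choosing:

```python
import functools
import logging
import time

logger = logging.getLogger("benchmark")

def measured(func):
    """Log each call's duration -- benchmark-grade hooks for the noSQL path."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            logger.info("%s took %.6fs",
                        func.__name__, time.perf_counter() - start)
    return wrapper
```

Decorate the noSQL application's query and load functions with @measured, and the logs become the benchmark data that proves the non-RDBMS version works reliably and inexpensively.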

This is entirely about asking for forgiveness instead of asking for permission.  

Managers can't give permission; it involves too much pain.

They can offer forgiveness because it requires neither thinking nor action.