Friday, May 21, 2010

Python in the News

Making the rounds: Droopy: easy file receiving. Apparently, there were some widely-read blog posts about this. Google "Droopy: A Tiny Web Server That Makes Receiving Files a Snap" to see the buzz.

The point here is that 750 lines of Python code can go a long way. It's a complete web server to support large file transfers without the botheration of email servers and their limitations.

Elegantly, it's packaged as a single module including HTML page templates, translations into several languages, plus the core server.

Wednesday, May 19, 2010

Technology Adoption and the "No"-gates

Let's say you've found some new, good way to do business.

JSON, for example. Or Agile Methods in general. Or TDD specifically. Or use of an ORM.

You read up on it. You build a spike solution to show that it's more efficient.

The First No-Gate

You make The Essential Pitch. You keep it simple and direct.

A manager says "that's too Blah-Blah-Blah, we don't want to add the cost/risk/complexity." Of course, it doesn't matter what specific things are said. This is management "No #1". We just can't. There is too much "we" (that is, "I, as a non-technical manager") don't understand.

The first answer must generally be "No." No manager can agree unless it was their idea before it was your idea. If they'd already heard of this and asked you to look into it, then you might get a yes. But if it was your idea first, management must say "No".

You work some more, you refine your pitch to address the Blah-Blah-Blah issue and show that it does not actually increase cost, risk or complexity.

BTW, there's no point in trying to pre-empt the initial "No". It has to be there. You have to get specific objections, so you have to go through this "No" gate.

The Second No-Gate

You make the Address-Your-Concerns Pitch. You elaborate with lots of what-ifs and alternatives and objections and solutions. Two PowerPoint slides expand to about a dozen.

A manager says "I'm not sure that it has the cost-benefit we need to see." This is management "No #2". We just can't afford it. [Yes, this is a repeat of the cost argument, but it's different because the expected response is now different.]

The second answer must always "raise the bar" from technical issues to monetary issues.

At this point, you really can't go too far. There's essentially no cost-benefit information on any element of any technology stack in use anywhere. No one sits down and finds the most cost-effective operating system, the most cost-effective language, the most cost-effective protocols. There's no data and cost-benefit is not a core part of approval. It's tangential at best.

Further, the real answer in technology selection always boils down to skills. Does the organization have the necessary skills? Do they understand it?

You work some more, you refine your pitch to address the cost-benefit issue and show that it does not actually increase cost, may reduce risk, and has some tangible benefits.

The Third No-Gate

You make the Cost-Benefit pitch. You try to keep it factual.

At this point, you've entered a loop. Essentially, you will be redirected to address one more concern. That concern, once addressed, still won't come with the monetized justification management wants. Back and forth until something breaks you out of the loop.

You're stuck here because there's no compelling reason to adopt. Managers talk about cost and risk and benefits and other vaguely monetary ways to determine if the technology has or creates value. But those are reasons to say "no", not "yes." Technology rarely has a compelling monetized business case. It's just a little better or a little less risky. But adoption involves change, and change is seen as inherently riskier than the status quo.

Remember: the first fax machine was useless until someone else got a fax machine.

So you iterate. Pitch, refine. Pitch, refine.

Compulsion

At some point, you either implement your spike solution, which makes management's approval a fait accompli, or some force outside IT (business demand for a new kind of software product, external force from competitors) compels the organization to make a change.

Note that there has been no change to the technology itself. JSON was unacceptable until a customer demanded JSON-format files. Now JSON is required. The organization, however, has flipped from "No with a million reasons" to "Yes".

Agile cannot be done until a customer requires it. TDD has no ROI until someone gets their project done early because of TDD. An ORM is needless complexity until the new web framework requires it.

At this point, there is a series of steps to go from "acceptable" to "required" to "standard" to "irreplaceable legacy". Those steps are just as puzzling as the No-Gates.

Finesse

It isn't possible to finesse this and reduce the frustration. The organization must resist change until compelled to make the change. Once compelled, it must then stumble forward as though all the nay-saying never happened. And what could have been a simple technology adoption must turn into a morass of competing bad ideas.

So, we can (and should) continue to find new technology. We can (and will) make the pitch. We will be shot down.

The trick is not to take it personally. Just keep refining so that when the organization is eventually compelled to adopt, we've already got it planned out.

Monday, May 10, 2010

A Limit to Reuse

We do a lot of bulk loads. A lot.

So many that we have some standard ETL-like modules for generic "Validate", "Load", "Load_Dimension", "Load_Fact" and those sorts of obvious patterns.

Mostly our business processes amount to a "dimensional conformance and fact load", followed by extracts, followed by a different "dimensional conformance and fact load". We have multiple fact tables, with some common dimensions. In short, we're building up collections of facts about entities in one of the dimensions. [And no, we're not building up data on individual consumers. Really.]

Until, of course, someone has a brain-fart.

Overall Application Design

An overall load application is a simple loop. For each row in the source document, conform the various dimensions, and then load the fact. Clearly, we have a bunch of dimension conformance objects and a fact loading object. Each object gets a crack at the input row and enriches it with some little tidbit (like a foreign key).
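Here's a minimal sketch of that outer loop, assuming each builder object exposes a hypothetical build() method that enriches the row in place:

import csv

def load(source_file, builders):
    """Give every dimension/fact builder a crack at every source row."""
    for row in csv.DictReader(source_file):
        for builder in builders:
            builder.build(row)   # hypothetical method: enriches the row, e.g. adds a foreign key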

This leads us to a pretty generic "Builder", "Dimension Builder", and "Fact Builder" class hierarchy. Very tidy.

Each new kind of feed (usually because no two customers are alike) is really just a small module with builders that are specific to that customer. And the builders devolve to two methods, sketched below:
  • Transform a row to a new-entity dict, suitable for a Django model. Really, just a simple dict( field=source['Column'], field=source['Column'], ... ) block of code.
  • Transform a row to a dimension conformance query, suitable for a Django filter. Again, a simple dict( some_key__iexact= source['Column'] ).
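
Here's a minimal sketch of one such customer-specific plug-in. The base class, the column names, and the name__iexact lookup are illustrative assumptions; the real framework would run the actual Django filter and insert.

class DimensionBuilder(object):
    """Hypothetical framework base class; plug-ins supply the two transforms."""
    def conformance_query(self, source):
        raise NotImplementedError
    def new_entity(self, source):
        raise NotImplementedError

class CustomerDimensionBuilder(DimensionBuilder):
    """One customer-specific plug-in: nothing but the two dict transforms."""
    def conformance_query(self, source):
        # Row -> keyword arguments for a Django .filter() lookup.
        return dict(name__iexact=source['Customer Name'])
    def new_entity(self, source):
        # Row -> keyword arguments for a new Django model instance.
        return dict(name=source['Customer Name'], region=source['Region'])
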
The nice thing is that the builders abstract out all the messy details. Except.

Hard-to-Conform Data

We're now getting data that's not -- narrowly -- based on things our customers tell us. We're getting data that might be useful to our customer. Essentially, we're processing their data as well as offering additional data.

Cool, right?

But... We lack the obvious customer-supplied keys required to do dimensional conformance. Instead, we have to resort to a multi-step matching dance.

Limiting Factors

The multi-step matching dance pushed the "Builder" design one step beyond. It moved from tidy to obscure. There's a line that seems to be drawn around "too much" back-and-forth between framework code and our Builders.

Something as bone-simple as a bulk loader has two candidate design patterns.
  • Standard loader app with plug-in features for mappings. This is what I chose. The mappings have been (until now) simple. The app is standard. Plug a short list of classes into the standard framework. Done.
  • Standard load support libraries that make a simple load app look simple. In this case, each load app really is a top-level app, not simply some classes that plug into an existing, standardized app. Write the standard outer loop? Please.
What's wrong with plug-ins?

It's hard to say. But it seems that a plug-in passes some limit to OO understandability. It seems that if we refactor too much up to the superclass then our plug-ins become hard to understand because they lose any "conceptual unity".

The limiting factor seems to be a "conceptually complete" operation or step. Not all code is so costly that a simple repeat is an accident waiting to happen.

Hints from Map-Reduce

It seems like there are two conceptual units. The loop. The function applied within the loop. And we should write all of the loop or all of the mapped function.

If we're writing the mapped function, we might call other functions, but it feels like we should limit how much other functions call back to the customer-specific piece.

If we're writing the overall loop -- because some bit of logic is really convoluted -- we should simply write the loop without shame. It's a for statement. It's not obscure or confusing. And there's no reason to try and factor the for statement into the superclass just because we can.
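
For contrast, a sketch of the "just write the loop" alternative. The support-library helpers here (conform_customer, load_fact) are hypothetical stand-ins; the point is that the top-level for statement lives in the customer-specific module, not in a framework superclass.

import csv

def conform_customer(row):
    """Stand-in for a library call that returns the conformed dimension key."""
    return row['Customer Name'].strip().upper()

def load_fact(row):
    """Stand-in for a library call that inserts the fact row."""
    pass

def load(source_path):
    # The whole loop, written out plainly -- no plug-in framework.
    with open(source_path, 'rb') as source:
        for row in csv.DictReader(source):
            row['customer_key'] = conform_customer(row)
            load_fact(row)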

Friday, May 7, 2010

Functional Programming Goodness -- Python to the Rescue

Here's the situation.

A vendor sent us three separate files which need to be merged. 70,000+ records each. They're CSV files, so column position doesn't much matter. The column name (in row 1) is what matters.

I looked at three solutions. Two of which are merely OK. The third was some functional programming that was very cool.

Option 1 -- OS Sort/Merge

To get the files into a consistent order, we need to sort. The Linux sort, however, is biased toward columns that are known positionally.

So we need to exploit the Decorate-Sort-Undecorate design pattern, using a shell script something like the following.

decorate.py a.csv | sort >a_sorted.csv
decorate.py b.csv | sort >b_sorted.csv
decorate.py c.csv | sort >c_sorted.csv
sort -m a_sorted.csv b_sorted.csv c_sorted.csv | undecorate.py >result.csv
This works well because decorate.py and undecorate.py are such simple programs. Here's decorate.py.

from __future__ import print_function
import csv
import sys

with open(sys.argv[1],"rb") as source:
    rdr= csv.DictReader( source )
    for row in rdr:
        # Decorate: prefix each output line with the row's sort key.
        print( row['the key'], row )
Undecorate is similar. It uses the str.partition() method to remove the decoration.
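
Here's a minimal sketch of what undecorate.py might look like, assuming the decoration is the key followed by a single space, as decorate.py above writes it:

import sys

for line in sys.stdin:
    # Strip the sort key: keep everything after the first space.
    key, sep, rest = line.partition(' ')
    sys.stdout.write(rest)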

Note that the initial "decorate" steps can be run concurrently, leading to some time reduction. This scales well. It doesn't use much memory; the OS concurrency management means that it uses every core available.

I didn't benchmark this, BTW.

Option 2 -- Big In-Memory Dict

Since the files aren't insanely big, they do fit in memory. This is pretty simple, also.

import csv
from collections import defaultdict

# build the result set
result = defaultdict( dict )
for f in ( 'a.csv', 'b.csv', 'c.csv' ):
    with open( f, 'rb' ) as source:
        rdr = csv.DictReader( source )
        for row in rdr:
            result[row['key']].update( row )

# find the column titles
keys = set()
for row in result:
    keys |= set( result[row].keys() )

# write the result set
with open( 'output.csv', 'wb' ) as target:
    wtr= csv.DictWriter( target, sorted(keys) )
    wtr.writerow( dict(zip(keys,keys)) )
    for row in result:
        wtr.writerow( result[row] )
This isn't too bad. For insanely big files, however, it won't scale well.

Elapsed time for the real files (which were zipped, adding processing that's not relevant to this posting) was 218 seconds on my little laptop.

Option 3 -- Functional Programming

The functional programming approach is a bit more code than option 1. But it's way cool and very extensible. It offers more flexibility without the memory limitation of the big dictionary.

Let's start with the end in mind.

We're doing a 3-file merge. The algorithm for 2-file merge is really simple. The algorithm for an n-file merge, however, is not so simple. We can easily build up an n-file merge as a composition of n-1 pair-wise merges.

Here's how it should look.
with open('temp.csv','wb') as output:
    wtr= csv.DictWriter( output, sorted(fieldNames) )
    wtr.writerow( dict( zip( fieldNames, fieldNames )))
    for row in merge( merge( s1, s2 ), s3 ):
        wtr.writerow( row )
We're doing merge( merge( s1, s2 ), s3 ) to compose a 3-file merge from 2 2-file merges. And yes, it can be just that simple.
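
For an arbitrary number of files, the same composition can be spelled with reduce. A sketch, using the merge() generator defined below and hypothetical sorted sources:

from functools import reduce  # reduce is also a builtin in Python 2

def merge_all(sources, key='key'):
    """Compose an n-file merge from n-1 pair-wise merges."""
    return reduce(lambda left, right: merge(left, right, key), sources)

# merge_all([s1, s2, s3]) is equivalent to merge(merge(s1, s2), s3)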

Composable Sort

To be "composable", we must write iterator functions which read and write data of the same type. In our case, since we're using a DictReader, our various functions must work with an iterable over dicts which yields dicts.

In order to merge, the input must be sorted. Here's our composable sort.
def key_sort( source, key='key' ):
    def get_key( x ):
        return int(x[key])
    for row in sorted(source, key=get_key ):
        yield row
Yes, we need to pre-process the keys; they're not simple text, they're numbers.
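
For reference, here's how the s1, s2, s3 sources in the earlier snippet might be built (the file names are hypothetical): wrap each csv.DictReader in the composable sort.

import csv

# Each source becomes a sorted iterable of row dicts.
s1, s2, s3 = [key_sort(csv.DictReader(open(name, 'rb')))
              for name in ('a.csv', 'b.csv', 'c.csv')]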

Composable 2-File Merge

The composable merge has a similar outline. It's a loop over the inputs and it yields outputs of the same type.
def merge( f1, f2, key='key' ):
    """Merge two sequences of row dictionaries on a key column."""
    r1, r2 = None, None
    try:
        r1= f1.next()
        r2= f2.next()
        while True:
            if r1[key] == r2[key]:
                r1.update(r2)
                yield r1
                r1, r2 = None, None
                r1= f1.next()
                r2= f2.next()
            elif r1[key] < r2[key]:
                yield r1
                r1= None
                r1= f1.next()
            elif r1[key] > r2[key]:
                yield r2
                r2= None
                r2= f2.next()
            else:
                raise Exception # Yes, this is impossible
    except StopIteration:
        pass
    if r1 is not None:
        yield r1
        for r1 in f1:
            yield r1
    elif r2 is not None:
        yield r2
        for r2 in f2:
            yield r2
    else:
        pass # Exhausted with an exact match.
This runs in 214 seconds. Not a big improvement in time. However, the improvement in flexibility is outstanding. And the elegant simplicity is delightful. Having the multi-way state managed entirely through the Generator Function/Iterator abstraction is amazing.

Also, this demonstrates that the bulk of the time is spent reading the zipped CSV files and writing the final CSV output file. The actual merge algorithm doesn't dominate the complexity.

Wednesday, May 5, 2010

Goodhart's Law and Numerosity

They say "You Can't Manage What You Don't Measure". This isn't quite right, however. You can measure lots of things you can't manage. Rainfall, for example. Software development is like that. You can measure stuff that you can't actually control.

The original Deming quotes are more subtle: there are things which you cannot measure, but they are still important. And there are visible measures that are an attractive nuisance.

What gets lost is that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." This is Goodhart's Law.

As soon as you try to measure "programmer productivity" or "quality" or similar things, folks will find ways to tweak the numbers without actually improving anything.

Metrics are troubling things.