S.Lott-Software Architect: January 2010

Thursday, January 28, 2010

Aristotle's Poetics and Project Management

It can be a fatal mistake to impose a story arc on a project.

Aristotle's Poetics is a commentary on drama, in which he identified two story arcs that are sure-fire hits: Big Man Brought Down, and Small Man Lifted Up. These are the standard "Change in Fortune" story lines that we like.

Most movies in the category called "westerns" have elements of both. When we look at a movie like "The Man Who Shot Liberty Valance", we see both changes in fortune intertwined in this story.

Movies, further, have a well-defined three-act structure with an opening that introduces characters, context and the dramatic situations. Within the first minutes of the film, there will be some kind of initiating incident that clarifies the overall conflict and sets up the characters' goals. We can call this the narrative framework. Or a story design pattern.

Project Narrative

A "project" -- in principle -- has a narrative arc, much like a movie. Walker Royce (Software Project Management: A Unified Framework) breaks big waterfall-like projects into four acts (or "phases"):

Inception
Elaboration
Construction
Transition

Even if done in a spiral, instead of a waterfall, these are the acts in the narrative between the opening "Fade In:" and the closing credits.

In some cases, folks will try to impose this four-act structure on an otherwise Agile method. It's often a serious mistake to attempt to impose this convention of fiction on reality.

Things That Don't Exist

One of the most important parts of the narrative arc is "inception". Every story has a "beginning".

Projects, however, do not always have a clear beginning. They can have a "kick-off" meeting, but that's only a fictionalized beginning. Work began long before the kick-off meeting. Often, the kick-off is just one small part of Inception.

Some projects will have a well-defined narrative structure. Projects labeled "strategic", however, do not ever have this structure. They can't.

For large projects, something happened before "inception"; this is a real part of the project. The fiction is that the project somehow begins with the inception phase. This narrative framework is simply wrong; the folks that helped plan and execute inception know this thing that filmmakers call "back story". This pre-inception stuff is a first-class part of the project, even though it's not an official part of the narrative framework.

Even if you have an elaborate governance process for projects, there's a lot that happens before the first governance tollgate. In really dysfunctional organizations, there can be a two-tiered inception, where there's a big project to gather enough information to justify the project governance meeting to justify the rest of the work. The "rest of the work" -- the real project -- starts with an "inception" effort that's a complete falsification. It has to ignore (or at best summarize) the stuff that happened prior to inception.

The Price of Ignorance

The Narrative Arc of a project requires us to collect things into an official Inception or story setup. It absolutely requires that we discard things that happened before inception.

Here's what happens.

Users say they want something. "Automated customer name resolution". Something that does an immediate one-click credit check on prospective B2B e-commerce customers.

In order to justify the project, we do some preliminary work. We talk to our vendors and discover that business names are always ambiguous, and there's no such thing as one-click resolution. So we write sensible requirements. Not the user's original dream.

We have a kick-off meeting for a quick, three-sprint project. We have one user story that involves two multi-step alternative scenarios. We have some refactoring in the first sprint, then we'll build one scenario for the most-common case, then the other scenario for some obscure special cases.

When we get ready for release, the customer asks about the one-click thing.

"What one-click thing?" we ask.

"I always understood that this would be one-click," the customer says.

"Do you recall the project governance meetings, the signed-off requirements and the kick-off meeting? You were there. We walked through the story and the scenarios. There can't be one-click."

Communicate More? Hardly

What can be done to "prevent" this? Essentially, nothing.

The standard project narrative framework -- start, work, finish -- or perhaps inception, elaboration, construction, transition -- doesn't actually exist.

Stuff that happened "before" the start is part of the project. We can claim (or hope) that it doesn't exist, but it really does. We can claim that a project has a "start", but it doesn't. It sort of eases into being based on lots of preliminary conversations.

When the users asked for "one click", it was a result of several other conversations that happened before going to the business analyst to ask for the additional feature.

Indeed, the "one click" was a compromise (of sorts) reached between a bunch of non-IT people having a conversation prior to engaging the business analyst to start the process of starting the project. All of that back story really is part of the project, irrespective of the requirements of the project's standard narrative structure.

Bottom Line

Poetics don't apply to large, strategic projects. A project is a subtle and complex thing. There's no tidy narrative arc. Backstory matters; it can't be summarized out of existence with a kick-off slide show.

Monday, January 25, 2010

Map-Reduce, Python and Named Tuples

A year and change back, I wrote this on "Exploratory Programming".

It turns out that it was a mistake. While the subclass-expansion technique is a cool way to bang out a program incrementally, in the long run, the subclassing is ill-advised.

The more I look at Python generator functions and the idea of using Map-Reduce, the more I realize that Visitors and Subclass Extension are not the best design pattern. The same kind of exploration can be done with map-reduce techniques, and the resulting application is slightly simpler.

Design Coupling

The problem with design-by-subclass is that the map and reduce operations are often defined relatively informally. After all, they're just method invocations within the same class. You can, unthinkingly, create obscure dependencies and assumptions. This can make the Visitor rather complex or make subclass features hard to refactor into the Visitor.

As a concrete example, we have an application that processes directories of files and file archives (ZIP and TAR) of workbooks with multiple sheets. All of this nesting merely wraps a source of spreadsheet rows. This should be a collection of simple nested map operations to transform directories, files, archives of files, etc., into a spreadsheet row source.

Part way through our class hierarchy, however, a subclass introduced a stateful object that other method functions could use. However, when we tried to refactor things into a simple Visitor to visit all of the workbooks (ignoring all the directory, archive and file structure), we worked around the hidden stateful object without realizing it.

Named Tuples and Immutability

A much cleaner solution is to make use of Python's namedtuple constructor and write generator functions which map one kind of namedtuple to another kind of namedtuple. This has the advantage that -- unless you've done something really bad -- you should be able to pickle and unpickle each tuple, easily seeding a multi-processing pipeline. Maximal concurrency, minimal work.

Each stage in a map-reduce pipeline can have the following form.

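from collections import namedtuple
SomeResult = namedtuple('SomeResult',['a','b','c'])
def someStage():
    for x in someSource():
        assert isinstance( x, SomeSource )
        yield SomeResult( *transform(x) )  # transform() is a placeholder for this stage's mapping; it returns (a, b, c)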

The assertion should be obvious from inspection of the someSource function. It's provided here because it's essential to creating map-reduce pipelines that work. It's better to prove that the assertion is true through code inspection and comment out the assert statement.
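Here's a minimal sketch of one concrete stage in that form, written for Python 3. The tuple types RawRow and Total, the sample data and the function names are all invented for illustration; they aren't from the application described above.

```python
from collections import namedtuple

# Hypothetical tuple types, invented for this illustration.
RawRow = namedtuple('RawRow', ['sheet', 'cells'])
Total = namedtuple('Total', ['sheet', 'count', 'total'])

def raw_rows():
    """Head of the pipeline: yields RawRow tuples from hard-coded sample data."""
    yield RawRow('Q1', (1.0, 2.5))
    yield RawRow('Q2', (4.0,))

def totals(source):
    """Map stage: turns each RawRow into a Total."""
    for x in source:
        assert isinstance(x, RawRow)
        yield Total(x.sheet, len(x.cells), sum(x.cells))

results = list(totals(raw_rows()))
print(results[0])   # Total(sheet='Q1', count=2, total=3.5)

# Named tuples are immutable: assigning to a field raises AttributeError.
try:
    results[0].total = 0
    mutated = True
except AttributeError:
    mutated = False
```

Because both tuple types are defined at module level, they should also be acceptable to pickle, which is what makes it practical to hand them to another process.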

Pipelining

What pops out of this are stateless objects. Since named tuples are immutable, it appears that we've done some purely functional programming without really breaking a sweat.

Further, we can see our way toward encapsulating a generator function and its resulting namedtuple as a single MapReduce object. The assertion can then be plucked out of the loop and refactored into a pipeline manager.

This might give us several benefits.

A way to specify a pipeline as a connected series of generator functions. E.g., Pipeline( generator_1, map_2, map_3, reduce_4 ).
Inspection of the pipeline to be sure that constraints will be met. The head of the pipeline has a resulting type; all other stages have a required source type and a result type. Since they're named tuples, we only care that the required attributes are a subset of the previous stage's result attributes.
Implementation through injection of a pickle/unpickle wrapper around each stage.
Distribution through a "runner" that forks a subprocess pipeline. This should yield map-reduce operations that swamp the OS by having multiple concurrent stages.
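The first two benefits can be sketched in a few lines of Python 3. This is only one possible shape, under stated assumptions: the Stage and Pipeline classes, their parameters and the Point/Norm example are all hypothetical, and the pickle wrapper and subprocess runner are left out.

```python
import math
from collections import namedtuple

class Stage:
    """Hypothetical wrapper pairing a generator function with its tuple types."""
    def __init__(self, function, result, requires=()):
        self.function = function        # generator function for this stage
        self.result = result            # namedtuple class the stage yields
        self.requires = set(requires)   # attribute names needed from the previous stage

class Pipeline:
    def __init__(self, head, *stages):
        self.head = head
        self.stages = stages
        self.check()

    def check(self):
        """Verify each stage's required attributes are a subset of the prior result's fields."""
        previous = self.head
        for stage in self.stages:
            missing = stage.requires - set(previous.result._fields)
            if missing:
                raise TypeError(
                    f"{stage.function.__name__} needs {sorted(missing)} "
                    f"not provided by {previous.function.__name__}")
            previous = stage

    def run(self):
        """Compose the generators: each stage consumes the previous stage's output."""
        source = self.head.function()
        for stage in self.stages:
            source = stage.function(source)
        return source

Point = namedtuple('Point', ['x', 'y'])
Norm = namedtuple('Norm', ['x', 'y', 'r'])

def points():
    yield Point(3, 4)
    yield Point(5, 12)

def norms(source):
    for p in source:
        yield Norm(p.x, p.y, math.sqrt(p.x ** 2 + p.y ** 2))

pipeline = Pipeline(Stage(points, Point), Stage(norms, Norm, requires=['x', 'y']))
lengths = [n.r for n in pipeline.run()]
print(lengths)   # [5.0, 13.0]
```

The check runs when the pipeline is built, so a stage that asks for an attribute the previous stage doesn't produce fails before any data is read.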

Goals

Generally, our goal is to get the CPU 100% committed to the right task. Either 100% doing web services, or 100% doing database queries or 100% doing batch processing of massive flat-file datasets.

Currently, we can't get much past 66%: one core is close to 100%, but the other is only lightly involved. By growing to multi-processing, we should be able to red-line a processor with any number of cores.

Thursday, January 21, 2010

Exacting Definitions

Interesting comments to Splitting Meta-Hairs.

Terms like Scripting, Interpreted and Dynamic are not "marketing terms". New, Improved, All Natural, Visual, Groovy, Lucid, etc., are marketing terms. I regret giving the impression that one should "not to spend time trying to get definitions that are exacting". One absolutely must get definitions which are exacting. Clearly, I failed. Exact definitions matter.

Hiding behind the edge and corner cases, however, isn't helpful. Pointing out that some terms could be redundant, or could be ambiguous (when stripped of useful meaning), doesn't accomplish much. Harping on the "ambiguity" or "contradiction" or "redundancy" isn't helpful. Yes, Python has edge cases. Outside the purity of mathematics, no categorization is ever perfect.

Scripting. The Python community steers away from this because it's limiting. However, it's also true. What's important is that some folks overlook this and over-engineer solutions. To be indistinguishable from binary executables, Python modules require three things: (1) an appropriate #! line, (2) a file mode that includes the appropriate "x" flags and (3) a location on the PATH.
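A small sketch shows the three requirements. The file name greet and its contents are invented for illustration, and the #! line assumes a Python 3 interpreter can be found through env.

```python
#!/usr/bin/env python3
"""Hypothetical script, saved as a file named greet with no extension."""
import sys

def main(argv):
    # Greet the name given as the first argument, or "world" if none is given.
    name = argv[1] if len(argv) > 1 else "world"
    print(f"hello, {name}")
    return 0

if __name__ == "__main__":
    main(sys.argv)
```

After `chmod +x greet` and a move into a directory on the PATH, typing `greet Ada` at the shell prompt runs it like any other command. Nothing about the invocation reveals that it's Python.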

I find it necessary to repeat "scripting" to prevent over-engineering. Clearly, scripting isn't a completely distinct dimension of language description, but it's still an important clarification to many of the people I work with.

Python's on this scripting language list.

[We had a RHEL system with SELinux settings that prevented Python scripts from running. A sysadmin said -- seriously -- that I just needed to use `sudo su -` to get past this. The admin, it appeared, couldn't see why Python scripts should behave exactly like all other scripts. Hence the need to emphasize that Python is a scripting language. Otherwise people forget.]

Interpreted. Python is a byte-code interpreter. Saying things like "compiling to machine code is also interpreted" eliminates all meaning from the words, so it can't be true or useful. We need to distinguish between compiled to machine code and interpreted; machine code binary executes directly. And Python doesn't compile to machine code. Python is interpreted.

[The fact that some hardware had microprogramming is irrelevant; there are programmable ASIC chips in my Macintosh, that doesn't matter to my Python interpreter. There's a clear line between the binary machine code and the Python or Java interpreter. Yes, there are other levels of abstraction. No, they don't matter when discussing the nature of Python.]

You can use cython or py2exe or py2app to create binaries from Python. But that's not the interpreted Python language. This is the distinction I'm trying to emphasize.

I find it necessary to repeat "interpreted" so people are not confused by the presence of visible bytecode cache (.pyc) files.
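The byte code itself is easy to look at. Here's a minimal sketch, assuming CPython 3; the source string and the variable names are made up for the example.

```python
import dis

source = "total = price * quantity"

# compile() turns source text into a code object holding byte code.
code = compile(source, "<example>", "exec")

# The byte code is a bytes object for the CPython virtual machine, not native machine code.
print(type(code.co_code))

# dis lists the interpreter's instructions, one per line.
dis.dis(code)

# The interpreter then executes the code object.
namespace = {"price": 3, "quantity": 4}
exec(code, namespace)
print(namespace["total"])   # 12
```

A .pyc file is essentially a cached copy of a code object like this one. It saves the compile step on the next import; it doesn't turn the program into a binary executable.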

Dynamic. Python is dynamic. Dynamic is clearly distinct from the other dimensions. There's less confusion over this word, but it still fails to sink in.

I find that this needs to be repeated frequently so people stop looking for static type declarations. The number of Stack Overflow questions that include "python" and "declaration" is always disappointing.
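A tiny Python 3 example of what "no declarations" means in practice. The function and variable names are invented for illustration.

```python
def describe(thing):
    # No declared parameter type: any object that supports len() works.
    return f"{type(thing).__name__} of length {len(thing)}"

value = "abc"            # the name is bound to a str
first = describe(value)
value = [1, 2, 3, 4]     # the same name is rebound to a list
second = describe(value)

print(first)    # str of length 3
print(second)   # list of length 4
```

The type belongs to the object, not to the name, so there's nothing to declare.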

Wednesday, January 20, 2010

Splitting Meta-Hairs

Recently, I've been involved in some hair-splitting over the nature of Python.

I've described it as "interpreted", "scripting" and "dynamic", all of which seem to be true, but yet each seems to lead to a standard -- and pointless -- dispute.

Yes but No

Some folks object to "interpreted". They feel a need to repeat the fact that Python is compiled to byte code which is then interpreted. Somehow, compiling to byte code interferes with their notion of interpreter. Further exploration of the objection reveals their unwavering conviction that an interpreter must work directly with the original source. And it must be slow.

Eventually, they'll admit that Python is interpreted, but not really. I don't know why it is so important to raise the objection.

So noted. Are we done? Can we move beyond this?

Scripting Means Bad

Some folks object to "scripted". They insist that scripting languages must also include performance problems, limited data representation or other baggage. Python is a scripting language because it responds properly to the shell #! processing rules. Period.

I don't know why it's important, but someone feels the need to object to calling Python a scripting language. Somehow the #! thing doesn't convey enough complexity; scripting just has to be bad. Pages like Wikipedia's Scripting Language don't seem to help clarify that scripting isn't inherently bad.

Again, objection noted. And overruled. Scripting doesn't have to be complex or bad. It's just a relationship with the shell.

Further Nuances

I'm baffled when some folks take this further and object to Scripted and Interpreted being separate terms. I guess they feel (very strongly) that it's redundant and the redundancy is somehow confusing. A shell script language pretty much has to be interpreted, otherwise the #! line wouldn't mean anything. I guess that this is why they have to emphasize their point that Scripted is a proper subset of Interpreted.

But then, of course, Python is technically compiled before being interpreted, so what then? What's the point of bringing up the detail yet again?

Dynamic

More rarely, folks will object to using Dynamic and Interpreted as separate dimensions of the language space.

Hard-core C++ and Java programmers object to Dynamic in the first place; sometimes claiming that a dynamic language isn't a "robust" language. Or that it isn't "safe enough" for production use. Or that it can't scale for "enterprise" use. Or that there are no "real" applications built with dynamic languages.

Once we get past the "dynamic" argument, they go on to complain that dynamic languages must be interpreted. The byte-code compiling -- and the possibility that the byte code could be translated to native binary -- doesn't enter into the discussion early enough.

Also, some folks don't like the fact that an IDE can't do code completion for a dynamic language. To me, it seems like just pure laziness to object to a language based on the lack of code completion. But some folks claim that IDE auto-completion makes VB a good language.

Hair Resplitting

How about we stop wasting so much bandwidth resplitting these hairs? It's scripted. It's interpreted. It's dynamic. How does it help to micro-optimize the words? Even if scripted really is a proper subset of interpreted, these prominent features solve different kinds of problems; it seems to help the potential language user to keep these concepts separate.

Can we slow down the repetition of the (irrelevant) fact that Python is compiled (but not to executable binary) and interpreted? It's not confusing: byte-code compilation really is a well-established design pattern for interpreted languages. Has been for decades. Applesoft Basic on the Apple ][+ used byte-codes. Okay?

Duck Typing is not a "flaw" or "weakness". Binary compilation is not a "strength". It's trivial to corrupt a binary file and introduce bugs or viruses; binary compilation is not inherently superior.

Can we move on and actually solve problems rather than split meta-hairs?

Tuesday, January 12, 2010

FW: Eight Things Business Hates About IT

Eight Things Business Hates About IT. Plus eight things IT hates about business.

I suppose.

While there are 8 things identified, they seem to boil down to 2 things to fix: Replace bureaucratic with Agile; replace Keep The Lights On Management with any other way of budgeting.

While Agility is not a panacea, it does address a lot of problems in more active, engaging and solution-oriented ways.

1. IT can be bureaucratic. That's IT's dumb reaction to failed projects -- add more process. While it seems obvious that projects don't fail for lack of process, that's the standard remediation. The pithy management summary is "The Project Is Out Of Control." But it isn't a lack of process. It's a lack of clarity, plus milestones that are (a) irrelevant (who needs a detailed design document, really?) and (b) too far apart.

Good point. Something to fix. Use Agile methods.

2. IT can be condescending techies. Whatever. That's a consequence of the huge technical complexity. If non-IT people want a "simplified" explanation, it's going to sound condescending. I don't like this one.

Rotten point. Hard to fix.

3. IT can be reactive. When IT chooses the low-road of "Keep The Lights On Management", it elects to be reactive.

Good point. Something to fix. Spend more time with the business and less time in the server room.

4. IT can propose "deluxe" solutions. Sometimes. Programming is a hobby and too many folks in IT really enjoy their hobbies. But. Managers on both sides (IT and business) pad projects with stuff that will "add enough value" to justify the costs. It's equally bad on both sides. I reject this as an IT problem per se.

In some cases, this is really a consequence of #1. By forcing projects to be Big, Complex Affairs, solutions get padded.

Duplicate. Agile methods and incremental delivery can push the deluxe features out far enough in time that they don't impact this year's budget.

5. IT doesn't deliver on time. This is a consequence of point #1. IT often adds process where none is needed. Delivery on time is easy, if IT simply delivers incrementally. Customer IT departments have said that a partial solution was of no value, and refused to entertain the idea of risk and cost reduction through incremental ("Agile") delivery. The end users who were in the meeting had to disagree with their own IT people over this -- and agree with us -- that incremental delivery did create value.

Duplicate.

6. IT doesn't understand customization. This is also a consequence of point #1. A too complex, overly bureaucratic project can't tolerate customization (or change).

Duplicate.

7. IT doesn't support innovation. What? That's point #3 again.

8. IT inhibits business change. This is also a consequence of point #1. A too complex, overly bureaucratic project can't tolerate customization (or change).

Friday, January 8, 2010

Map Reduce -- How Cool is that?

From time to time I hear a few mentions of MapReduce; up until recently, I avoided looking into it.

This month's CACM, however, is chock-full of MapReduce goodness.

After reading some of the articles, I decided to look a little more closely at that approach to handling large datasets.

Python Implementation

Map-Reduce is a pleasant functional approach to handling several closely-related problems.

Concurrency.
Filtering and Exclusion.
Transforming.
Summarizing.

The good part is that it works with no magic. The best part is that you can build map-reduce processors very easily with readily available tools.

Map-Reduce on the Cheap

The basics of map reduce can be done several ways in Python. We could use the built-in map and reduce functions. This can lead to problems if you provide a poorly designed function to reduce.
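Here's a minimal sketch of the built-in approach in Python 3, where reduce lives in functools. The sample data and the to_int helper are invented for illustration.

```python
from functools import reduce

sizes = ["512", "2048", "-", "1024"]   # hypothetical byte counts from a log; "-" means none

def to_int(text):
    # Map step: treat the "-" placeholder as zero bytes.
    return 0 if text == "-" else int(text)

# map transforms each item; reduce folds the results into one value.
total = reduce(lambda a, b: a + b, map(to_int, sizes), 0)
print(total)   # 3584
```

The explicit initial value of 0 is one of the design details that matters: without it, reduce raises a TypeError on an empty source.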

But Python also provides generator functions. See PEP 255 for background on these. A generator function makes it really easy to implement simple map-reduce style processing on a single host.

Here's a simple web log parser built in the map-reduce style with some generator functions.

Here's the top-level operation. This isn't too interesting because it just picks out a field and reports on it. The point is that it's delightfully simple and focused on the task at hand, free of clutter.

def dump_log( log_source ):
    for entry in log_source:
        print( entry[3] )

We can improve this, of course, to do yet more calculations, filtering and even reduction. Let's not clutter this example with too much, however.
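As one example of a reduction, here's a sketch in Python 3 that tallies entries instead of printing them. It assumes each entry is a tuple of strings whose field at index 5 is the HTTP status code; the function name and the sample entries are made up.

```python
from collections import Counter

def count_by_status(log_source):
    """Reduce step: tally entries by the field at index 5 (assumed to be the HTTP status)."""
    counts = Counter()
    for entry in log_source:
        counts[entry[5]] += 1
    return counts

# Hypothetical parsed entries, shaped as 9-tuples of strings.
sample = [
    ("10.0.0.1", "-", "-", "[08/Jan/2010:10:00:00 -0500]", '"GET / HTTP/1.1"', "200", "512", '"-"', '"-"'),
    ("10.0.0.2", "-", "-", "[08/Jan/2010:10:00:01 -0500]", '"GET /x HTTP/1.1"', "404", "0", '"-"', '"-"'),
    ("10.0.0.1", "-", "-", "[08/Jan/2010:10:00:02 -0500]", '"GET /y HTTP/1.1"', "200", "99", '"-"', '"-"'),
]
print(count_by_status(sample))
```

It takes the same kind of log_source argument as dump_log, so it can sit at the head of the same composition.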

Here's a map function that can fill the role of log_source. Given a source of rows, this will determine if they're parseable log entries and yield up the parse as a 9-tuple. This maps strings to 9-tuples, filtering away anything that can't be parsed.

import re
log_row_pat= re.compile( r'(\d+\.\d+\.\d+\.\d+) (\S+?) (\S+?) (\[[^\]]+?]) ("[^"]*?") (\S+?) (\S+?) ("[^"]*?") ("[^"]*?")' )
def log_from_rows( row_source ):
    for row in row_source:
        m= log_row_pat.match( row )
        if m is not None:
            yield m.groups()

This log source has one bit of impure functional programming. The tidy, purely functional alternative to saving the match object, m, doesn't seem to be worth the extra lines of code.

Here's a map function that can participate as a row source. This will map a file name to a sequence of individual rows. This can be decomposed if we find the need to reuse either part separately.

import logging
def rows_from_name( name_source ):
    for aFileName in name_source:
        logging.info( aFileName )
        with open(aFileName,"r") as source:
            for row in source:
                yield row

Here's a mapping from directory root to a sequence of filenames within the directory structure.

import logging
import os
def names_from_dir( root='/etc/httpd/logs' ):
    for path, dirs, files in os.walk( root ):
        for f in files:
            logging.debug( f )
            if f.startswith('access_log'):
                yield os.path.join(path,f)

This applies a simple name filter. We could have used Python's fnmatch, which would give us a slightly more extensible structure.
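Here's a sketch of the fnmatch variation in Python 3. The function name names_matching and its pattern parameter are invented for illustration; the temporary directory is only there to make the example self-contained.

```python
import fnmatch
import os
import tempfile

def names_matching(root, pattern='access_log*'):
    """Yield paths under root whose file name matches a shell-style pattern."""
    for path, dirs, files in os.walk(root):
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(path, f)

# Build a throwaway directory with three empty files to search.
with tempfile.TemporaryDirectory() as root:
    for name in ('access_log', 'access_log.1', 'error_log'):
        open(os.path.join(root, name), 'w').close()
    found = sorted(os.path.basename(p) for p in names_matching(root))

print(found)   # ['access_log', 'access_log.1']
```

The pattern becomes a parameter, so the same function can pick up error logs or rotated logs without a code change.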

Putting it Together

This is the best part of this style of functional programming. It just snaps together with simple composition rules.

import logging
import sys
logging.basicConfig( stream=sys.stderr, level=logging.INFO )
dump_log( log_from_rows( rows_from_name( names_from_dir() ) ) )
logging.shutdown()

We can simply define a composition of map functions. Our goal, expressed in dump_log, is the head of the composition. It depends on the tail, which is parsing, reading a file, and locating all files in a directory.

Each step of the map pipeline is a pleasant head-tail composition.

Pipelines

This style of programming can easily be decomposed to work through Unix-style pipelines.

We can cut a map-reduce sequence anywhere. The head of the composition will get its data from an unpickle operation instead of the original tail.

The original tail of the composition will be used by a new head that pickles the results. This new head can then be put into the source of a Unix-style pipeline.
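The two halves of such a cut might look like the following Python 3 sketch. The function names pickle_sink and unpickle_source are hypothetical, and an in-memory buffer stands in for the pipe between the two processes.

```python
import io
import pickle

def pickle_sink(source, stream):
    """New head for the upstream half: write each item to a binary stream."""
    for item in source:
        pickle.dump(item, stream)

def unpickle_source(stream):
    """New tail for the downstream half: yield items until the stream ends."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            return

# An in-memory buffer stands in for the pipe between two processes.
pipe = io.BytesIO()
pickle_sink(iter([("a", 1), ("b", 2)]), pipe)
pipe.seek(0)
received = list(unpickle_source(pipe))
print(received)   # [('a', 1), ('b', 2)]
```

In a real shell pipeline the upstream script would presumably write to sys.stdout.buffer and the downstream script would read from sys.stdin.buffer; the generator functions on either side of the cut wouldn't need to change.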

Parallelism

There are two degrees of parallelism available in this kind of map-reduce. By default, in a single process, we don't get either one.

However, if we break the steps up into separate physical processes, we get huge performance advantages. We force the operating system to do scheduling. And we have processes that have a lot of resources available to them.

[Folks like to hand-wring over "heavy-weight" processing vs. threads. Practically, it rarely matters. Create processes until you can prove it's ineffective.]

Additionally, we can -- potentially -- parallelize each map operation. This is more difficult, but that's where a framework helps to wring the last bit of parallel processing out of a really large task.

Until you need the framework, though, you can start doing map-reduce today.

A Link: http://hadoop.apache.org/mapreduce/

Sunday, January 3, 2010

Fossil and SQLite

Interesting thoughts: http://nedbatchelder.com/blog/201001/d_richard_hipps_software_universe.html

I use SQLite heavily. Time to look into Fossil.

Saturday, January 2, 2010

Building Skills in Object-Oriented Design

Completely revised Building Skills in Object-Oriented Design. Cleaned up many of the exercises to make them simpler and more sensible to the n00b designer.

Also, to make it easier to follow, I made use of the Sphinx "ifconfig" feature to separate the text into two parallel editions: a Python edition and a Java edition. A little language-specific focus may help.

Interestingly, I got an email recently from someone who wanted the entire source code for the projects described in the book. I was a little puzzled. Providing the source would completely defeat the purpose of the book, which is to build skills by actually doing the work. So, no, there is no source code available for these exercises. The point is to do the work, all the work, all by yourself.

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.