Thursday, December 29, 2011

LANGSEC: Language-theoretic Security

Wow.  Just wow.  See "LANGSEC explained in a few slogans".

Short, easy-to-grasp explanation of why complex protocols create new problems.

I'm happy with REST and the stack of stuff under it (HTTP, TCP/IP, etc.)

Once upon a time (2001), I invented my own version of a RESTful protocol outside HTTP.  That was cool.  Very simple, and very fast.  But relatively inflexible.  The syntax was more like FTP and SMTP; the semantics were mostly just CRUD rules and RESTful state transfers.

I was way too dumb to leverage HTTP methods and the genius of a URI.

Tuesday, December 27, 2011

Technology Refresh

I've been refurbishing an older project -- written in 2008.  Probably with Django 1.0.1.  Certainly with Python 2.5.

The Django 1.3 release has been around since March.  The change underscored the importance of technology refresh.

The best part was to delete code.  There were two significant reasons.

  • The testserver command allowed me to eliminate a bunch of low-value test harness code.  Without this command, we had to create our own test database, start a server, run integration tests, and then kill the server.  With this command, we simply start and kill the server.
  • The RESTful web services can be securely integrated into the main web application.  A simple piece of middleware can authenticate requests based on headers containing ForgeRock OpenAM tokens.  It may be that this was always a feature of Django, but over the last few years, we've figured out how to exploit it with simple middleware.
Few things are better than removing old code and replacing it with code written (and tested) by someone else.
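The middleware idea can be sketched in a few lines.  This is a hypothetical shape, not the project's actual code: the header name and the token check are invented, and the Django plumbing is stripped away; a real version would validate the token against the OpenAM server.

```python
# Hypothetical sketch of header-based authentication, stripped of the
# Django plumbing.  The header name and token check are made up; a real
# implementation would validate the token against OpenAM.
FORBIDDEN = 403

def validate_token(token):
    # Placeholder: a real implementation asks the OpenAM server.
    return bool(token)

def authenticate(headers):
    """Return None to pass the request through, or a 403 status."""
    token = headers.get("X-OpenAM-Token")
    if not validate_token(token):
        return FORBIDDEN
    return None
```

In real Django middleware, the same check lives in a `process_request` method and reads the header from `request.META`.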

In addition to the deletes, we also rearranged some of the dependencies.  We had (incorrectly) thought of the Django project as somehow central or essential.  It turns out that a bunch of other Python libraries were actually core to the application.  The Django web presentation was just one of the sensible use cases.  A suite of command-line apps could also be built around the underlying libraries.

In addition to this cleanup, we also replaced the documentation with a new Sphinx project.  The project originally used Epydoc markup.  This meant that every single docstring had to be rewritten to use RST markup.  The upside of this is that we corrected numerous errors.

There Was Pain

This wasn't without some pain.  

Was the cost worth the effort?  That's the real question here.

I think that many IT managers adopt a silly "If it ain't broke, don't fix it" policy that focuses on short-term cost and short-term value.  It ignores long-term accrual from even tiny short-term cost savings.

Here are two important lessons.  
  • Money saved today is saved forever.
  • Savings accrue.  Forever.
It's important to avoid short-term thinking about cost and benefit.

Tuesday, December 20, 2011

Color Schemes

I worked with this a few years ago to tweak up some web pages.

I just rediscovered it.  It's a cool toy.  You get some colors that all "go" together.  If you're careful with your .CSS definitions, you can give people this page and let them fuss around until they're positively silly with color palettes.

Thursday, December 15, 2011

Good Summary of Bad Security Assumptions

This isn't the OWASP Top 10 list, but it's still very handy.

Top 10 Dumb Computer Security Notions.

I'm particularly fond of the "security can't be perfect; since it can't be perfect, why bother?" approach.

One other notion that amuses me is the silliness of changing a password every 90 days.  The argument is that "it's harder to hit a moving target".  That's obviously false.  A good rainbow table and a bad password without salt can be broken in about half an hour.  There's no "moving target" here.  At 30 minutes to crack a password, the only way the target can really move is to make every password a one-time-only password based on some kind of external source (like a token generator).
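The salt point is easy to demonstrate.  Here's a minimal sketch (SHA-256 purely for illustration; a real system would use a deliberately slow hash like PBKDF2 or bcrypt):

```python
# Why salt defeats a precomputed rainbow table: the same password
# produces a different digest for every salt, so a table keyed on
# unsalted digests finds nothing.
import hashlib
import os

def salted_hash(password, salt=None):
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(salt + password.encode("utf-8")).hexdigest()
    return salt, digest

salt1, digest1 = salted_hash("hunter2")
salt2, digest2 = salted_hash("hunter2")
# digest1 != digest2, even though the password is identical.
```

Rotating the password every 90 days changes nothing about this; only the salt (and a slow hash) makes the precomputed table useless.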

Tuesday, December 13, 2011

The need for ping

Years ago, when designing an interface to a vendor's web services, I did the following.  This isn't a genius move, but it's worth emphasizing how important it is.  And what's most important isn't technical.

  1. I built a simple spike solution to access their service.
  2. I morphed this into a "sanity check" to be sure that their service really was working.  Mostly, I cleaned up the code so that it was testable and deliverable without embarrassment.
  3. I morphed this into a "diagnostic tool" to bypass the higher-levels of the application and simply access the vendor (and optionally dump the results) to help determine what wasn't work.  This involved adding the dump option to the sanity check and renaming the command-line application.
  4. I morphed this into a "credentials check and diagnostic tool".  This was -- ahem -- merely taking the hard-wired credentials out of the application.  Yes.  The first versions had hard-wired credentials.
That brings us to the version in use today.  The "vendor ping" application.

The default behavior is a credentials check.

One optional behavior is to dump the interface details.

Another optional behavior is to allow selection among a small number of simple interactions just to be sure things are working.
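As a sketch, the command-line surface of such a tool might look like this.  The option names are illustrative, not the real tool's:

```python
# Hypothetical command-line surface for a "vendor ping" tool.
# The default behavior (no options) is a credentials check.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Check connectivity and credentials for the vendor service")
    parser.add_argument("--dump", action="store_true",
        help="dump the interface details")
    parser.add_argument("--check", choices=("echo", "status"),
        help="run one simple interaction to be sure things are working")
    parser.add_argument("--credentials",
        help="credentials file (default: read from the environment)")
    return parser.parse_args(argv)
```

The point is that each step of the morphing above just added one option to a tool that already worked.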

Unplanned Work

What's important here isn't that I did all this.  What's important is that the deliverables, user stories and project plans didn't include this little nugget of high-value goodness.

It gets run fairly frequently in crunch situations.  The actor in the story ("As system admin...") is rarely considered as a first-class user of the application.  Yet, the admin is a first-class user, and needs to have proper user stories for confirming that the application is working properly.

Friday, December 9, 2011

Statically Typed Language Nonsense

Read this: "Here Comes Functional Programming" by Larry O'Brien in SD Times.

people who should know better continue to assert that statically typed languages are "safer, because the compiler can catch errors that otherwise wouldn't show up until runtime." While it's true a statically typed language can detect that you've assigned a string to a double without running your code, no type system is so strict that it can substitute for a test suite, and if you have a test suite, type-assignment errors are discovered and precisely diagnosed with little difficulty.
Thank you.   A language like Python, which lacks static type declarations for variables, is not evil or an accident waiting to happen.

The article is about functional languages.  But the static declaration statement is universally true.

Tuesday, December 6, 2011

I'm Confused by this Marketing Ploy

Got this a few weeks back.

My job is to persuade bloggers to link to our site. 
I really love my job! We have a friendly team and good management, but unfortunately I have no idea how to convince a blogger to link to us, I'm afraid I might lose my job because of it :( 
And that is why, instead of sending letters to thousands of different blogs, I am reading yours.
Couldn't parse it.

It seems to be a calculated Pity Ploy.  "I'm afraid I might lose my job...I am reading your [blog]."

The product seemed cool enough.  The pitch, however, was too sketchy for me.

Thursday, December 1, 2011

Agile "Religion" Issues

See "Limitations of Agile Software Development" and the earlier post "The Agile "Religion" -- What?".  What's important is that the limitations of Agile are not limitations.  They're (mostly) intentional roadblocks to Agile.

Looking for "limitations" in the Agile approach misses the point of Agile in several important ways.
The most important problem with this list of "limitations" is that five of the six issues are simply anti-Agile positions that a company can take.

In addition to being anti-Agile, a company can be anti-Test Driven Development.  They can be Anti-Continuous Integration.  They can be Anti-NoSQL.  There are lots of steps a company can take to subvert any given development practice.  Taking a step against a practice does not reveal a limitation.

"1. A team of stars... it takes more than the average Joe to achieve agility".   This is not a specific step against agility.  I chalk this up to a project manager who really likes autocratic rule.  It's also possible that this is from a project manager that's deeply misanthropic.  Either way, the underlying assumption is that developers are somehow too dumb or disorganized to be trusted.

Agile only requires working toward a common goal.  I can't see how a project manager is an essential feature of working toward a common goal.  A manager may make things more clear or more efficient, but that's all.  Indeed, the "clarity" issue is emphasized in most Agile methods: a "Scrum Master" is part of the team specifically to foster clarity of purpose.

Further, some Agile methods require a Product Owner to clarify the team's direction.

"A team of stars" is emphatically not required.  The experience of folks working in Agile environments confirms this.  Real, working Agile teams really really are average.

"2. Fit with organizational culture".  This has nothing to do with Agile methods.  This is just a sweeping (and true) generalization about organizations.  An organization that refuses autonomy and refuses flexibility can't use Agile methods.  An organization that refuses to create a "Big Design Up Front" can't use a traditional waterfall method and must use Agile methods.

Organizational fit is not a limitation of Agile.  It's just a fact about people.

"3. Small team...Assuming that large projects tend to require large teams, this restriction naturally extends to project size."

The assumption simply contradicts Agile principles.  It's not a "limitation" at all.  Large projects (with large numbers of people) have a number of smaller teams.  I've seen projects with over a dozen parallel Agile teams.  This means that in addition to a dozen daily scrums, there's also a scrum-of-scrums by the scrum masters.

Throwing out the small team isn't a limitation of Agile.  It's a failure to understand Agile.  A project with many small teams works quite well.  It's not "religion".  It's experience.

A single large team has been shown (for the last few decades) to be expensive and risky.

"4. Collocated team...We can easily think of a number of situations where this limitation prevents using agile:"  These are not limitations of Agile, but outright refusals to follow Agile principles.  Specifically:

  • "Office space organized by departments" is not a limitation of Agile.  That's a symptom of an organization that refuses to be Agile.  See #2 above; this indicates a bad fit with the culture.  An organization that doesn't have space organized by department might have trouble executing a traditional waterfall method.
  • "Distributed environment" is not a limitation of Agile.  Phones work.  Skype works.
  • "Subcontracting... We have to acknowledge that there is no substitute for face-to-face".  Actually, subcontracting is irrelevant.  Further, subcontracting is not a synonym for a failure to be collocated.  When subcontractors are located remotely, phones still work.  Skype works better and is cheaper.  
"5. Where’s my methodology?"  This is hard to sort out, since it's full of errors.  Essentially, this appears to be a claim that a well-defined, documented process is somehow essential to software development.  Experience over the last few decades is quite clear that the written processes and the work actually performed diverge a great deal.  Most of the time, what people do is not documented, and the documented process has no bearing on what people actually do.  A documented process -- in most cases -- appears irrelevant to the work actually done.

Agile is not chaos.  It's a change in the rules to de-emphasize unthinking adherence to a plan and replace this with focus on working software.  Well-organized software analysis, design, code and test still exist even without elaborately documented (and irrelevant) process definitions.

"6. Team ownership vs. individual accountability... how can we implement it since an organization’s performance-reward system assesses individual performance and rewards individuals, not teams...?"  Again, the assumption ("performance-reward system assesses individual performance") is simply a rejection of Agile principles.  It's not a limitation of Agile, it's an intentional step away from an Agile approach.  

If an organization insists on individual performance metrics, see #2.  The culture is simply antithetical to Agile. Agile still works; the organization, however, is taking active steps to subvert it.

Agile isn't a religion.  It doesn't suffer from hidden or ignored "limitations".

"But did we question the assumption that Agile was indeed superior to traditional methodologies?"  

The answer is "yes".  A thousand times yes.  The whole reason for Agile approaches is specifically and entirely because of folks questioning traditional methodologies.  Traditional command-and-control methodologies have a long history of not working out well for software development.  The Agile Manifesto is a result of examining the failures of traditional methods.

A traditional "waterfall" methodology works when there are few unknowns.  Construction projects, for example, rarely have the kinds of unknowns that software development has.  Construction usually involves well-known techniques applied to well-documented plans to produce a well-understood result.  Software development rarely involves so many well-known details.  Software development is 80% design and 20% construction.  And the design part involves 80% learning something new and 20% applying experience.

Agile is not Snake Oil.  It's not something to be taken on faith.  

The Agile community exists for exactly one reason.  Agile methods work.

Agile isn't a money-making product or service offering.  Agile -- itself -- is free.  Some folks try to leverage Agile techniques to sell supporting products or services, but Agile isn't an IBM or Oracle product.  There are no "backers".  There's no trail of money to see who profits from Agility.

Folks have been questioning "traditional" methodologies for years.  Why?  Because "traditional" waterfall methodologies are a crap-shoot.  Sometimes they work and sometimes they don't work.  The essential features of long term success are summarized in the Agile Manifesto.  Well-run projects all seem to have certain common features; the features of well-run projects form the basis for the Agile methods.

Tuesday, November 29, 2011

The Value of Microsoft's Tools

See Andrew Binstock's "Windows 8: Microsoft's Development Re-Do".
The costs of these migrations has been enormous and continues to accumulate...
I can only rub my hands with glee and engage in shameless "I Told You So" self-congratulations.

Only you can prevent being held hostage by Microsoft.

More than once, I've observed that a strategy of using only proprietary tools would be expensive and complex.  And every time, the folks I was talking to trivialized my concerns as hardly worth considering.

I've seen orphaned software: it only compiles on an old version of Visual Studio.   I've seen software orphaned so badly that it can only be compiled on one creaky old PC.  The cost to convert was so astronomical that the customer preferred to hope for a product to arise somewhere in the marketplace.  When no suitable product appeared over the decades, the problem reached palpable Pants On Fire (POF) levels of panic.  All due to the hidden costs of Microsoft's tools.

I've even been told that VB is a terrible language, but Visual Studio makes it acceptable.

Thursday, November 24, 2011

Justification of Project Staffing

I really dislike being asked to plan a project.  It's hard to predict the future accurately.

In spite of the future being -- well -- the future, and utterly unknowable, we still have to have the following kinds of discussions.

Me: "It's probably going to take a team of six."

Customer: "We don't really have the budget for that.  You're going to have to provide a lot of justification for a team that big."

What's wrong with this picture?  Let's enumerate.
  1. Customer is paying me for my opinion based on my experience.  If they want to provide me with the answers, I have a way to save them a lot of money.  Write their own project plan with their own answers and leave me out of it.
  2. I've already provided all the justification there is.  I'm predicting the future here.  Software projects are not simple Rate-Time-Distance fourth-grade math problems.  They involve an unknown number of unknowns.  I can't provide a "lot" of justification because there isn't any indisputable basis for the prediction.
  3. I don't know the people. The customer -- typically -- hasn't hired them yet.  Since I don't know them, I don't know how "productive" they'll be.  They could hire a dozen n00bz who can't find their asses blindfolded even using both hands.  Or.  They could hire two singular geniuses who can knock the thing out in a weekend.  Or.  They could hire a half-dozen arrogant SOB's who refuse to follow my recommendations. 
  4. They're going to do whatever they want no matter what I say.  Seriously.  I could say "six".  They could argue that I should rewrite the plan to say "four" without changing the effort and duration.  Why ask me to change the plan?  A customer can only do what they know to be the right thing. 
Doing the Right Thing

Let's return to that last point.  A customer project manager can only do what they absolutely know is the right thing.  I can suggest all kinds of things.  If they're too new, too different, too disturbing, they're going to get ignored.

Indeed, since people have such a huge Confirmation Bias, it's very, very hard to introduce anything new.  A customer doesn't bring in consultants without having already sold the idea that a software development project is in the offing.  They justify spending a few thousand on consulting by establishing some overall, ball-park, big-picture budget and showing that the consulting fees are just a small fraction of the overall.

As consultants, we have to guess this overall, ball-park, big-picture budget accurately, or the project will be shut down.  If we guess too high, then the budget is out of control, or the scope isn't well-enough defined, or some other smell will stop all progress.  If we guess too low, then we have to lard on additional work to get back to the original concept.

Architectures, components and techniques all have to meet expectations. A customer that isn't familiar with test-driven development, for example, will have an endless supply of objections.  "It's unproven."  "We don't have the budget for all that testing."  "We're more comfortable with our existing process."

The final trump card is the passive aggressive "I'll have to see the detailed justification."  It means "Don't you dare."  But it sounds just like passive acceptance.

Since project managers can only do what they know is right, they'll find lots of ways of subverting the new and unfamiliar.

If they don't like the architecture, the first glitch or delay or problem will immediately lead to a change in direction to yank out the new and replace it with the familiar.

If they don't like a component, they'll find numerous great reasons to rework that part of the project to remove the offending component.

If they don't like a technique (e.g., Code Walk Throughs) they'll subvert it.  Either not schedule them.  Or cancel them because there are "more important things to do."  Or interrupt them to pull people out of them.

Overcoming the Confirmation Bias

I find the process of overcoming the confirmation bias to be tedious.  Some people like the one-on-one "influencing" role.  It takes patience and time to overcome the confirmation bias so that the customer is open to new ideas.  I just don't have the patience.  It's too much work to listen patiently to all the objections and slowly work through all the alternatives.

I've worked with folks who really relish this kind of thing.  Endless one-on-one meetings.  Lots of pre-meetings and post-meetings and reviews of drafts.  I suppose it's rewarding.  Sigh.

Tuesday, November 22, 2011

How to Learn

A recent question.
i came up with two options.
 1.  building skills 1 (+ other references)... then algorithms & data
structures.... then your books 2 & 3


 2.  your three books 1,2 & 3... then algo & ds

kindly help me decide so i can start soon. 

I have two pieces of advice.

First.  Programming is a language skill.  Just like English.  If you can't get the English right, the odds of getting Python, Java, HTML or SQL right are considerably reduced.  Please, please, please take more care with grammar, syntax and punctuation.  Otherwise, your future as a programmer doesn't look very good.  For example, the personal pronoun is spelled "I".  In the 21st century, we spell out "and"; we stopped writing "&" as a stand-in for the Latin "et" centuries ago.  Also, ellipses ("...") shouldn't be used except when eliding part of a quote.  Clarity and precision actually matter.

Second, and more relevant, your two choices don't really amount to a significant difference.  If you're waiting around for advice, you're wasting your time.  Both sequences are good ideas. It's more important to get started than it is to carefully choose the precise and exact course of study. Just start doing something immediately.

Learning to program is a life-long exercise. There will always be more to learn. Start as soon as you can. The exact choices don't matter.  Why?  Because, eventually, you'll read all of those books plus  many, many others.

Spend less time waiting for advice and more time studying.

Thursday, November 17, 2011

More On Inheritance vs. Delegation

Emphasis on the "More On" as in "Moron".  This is a standard design error story.  The issue is that inheritance happens along an "axis" or "dimension" where the subclasses are at different points along that axis.  Multi-dimensional inheritance is an EPIC FAIL.


Data warehouse processing can involve a fair amount of "big batch" programs.  Loading 40,000 rows of econometric data in a single swoop, updating dimensions and loading facts, for example. 

When you get data from customers and vendors, you have endless file-format problems.  To assure that things will work, each of these big batch programs has at least two operating modes.
  • Validate.  Go through all the motions.  Except.  Don't commit any changes to the database; don't make any filesystem changes.  (i.e., write the new files, but don't do the final renames to make the files current.)
  • Load.  Go through all the motions including a complete commit to the database and any filesystem changes.

What's the difference between the two modes?  Clearly, one is a subclass of the other.
  • Load can be the superclass.  The Validate subclass simply replaces the save methods with stubs that do nothing.
  • Validate can be the superclass.  The Load subclass simply implements the save method stubs with methods that do something useful.
Simple, right?
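A sketch of the second option, with hypothetical class and method names:

```python
# Hypothetical sketch: Validate goes through all the motions with a
# do-nothing save; Load overrides save to actually commit.
class Validate(object):
    def process(self, row):
        cleaned = self.clean(row)
        self.save(cleaned)
    def clean(self, row):
        return row  # stand-in for the real validation logic
    def save(self, cleaned):
        pass  # stub: validate mode commits nothing

class Load(Validate):
    def __init__(self):
        self.committed = []
    def save(self, cleaned):
        self.committed.append(cleaned)  # stand-in for a database commit
```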


What Doesn't Work

This design has a smell.  The smell is that we can't easily extend the overall processing to include an additional feature. 

Why not? 

This design has the persistence feature set as the inheritance axis or dimension.  This is kind of limited.  We really want a different feature set for inheritance.

Consider a Validate for two dimensions (Company and Time) that loads econometric facts.  It has stub "save" methods.

We subclass the Validate to create the proper Load for these two dimensions and one fact.  We replace the stub save methods with proper database commits. 

After the actuaries think for a while, suddenly we have a file which includes an additional dimension (e.g., business location) or an additional fact (e.g., econometric data at a different level of granularity).  What now?  If we subclass Validate to add the dimension or fact, we have a problem.  We have to repeat the Load subclass methods for the new, extended Load.  Oops.

If we subclass Load to add the dimension or fact, we have a problem.  We have to repeat the Validate stubs in the new extended Load to make it into a Validate.  Oops.

Recognizing Delegation

It's difficult to predict inheritance vs. delegation design problems.

The hand-waving advice is to consider the essential features of the object.  This isn't too helpful.  Often, we're so focused on the database design that persistence seems essential.

Experience shows, however, that some things are not essential.  Persistence, for example, is one of those things that should always be delegated.

Another thing that should always be delegated is the more general problem of representation: JSON, XML, etc., should rely on delegation since this is never essential.  There's always another representation for data.  Representation is always independent of the object's essential internal state changes.
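A sketch of what the delegation looks like, again with hypothetical names.  Persistence becomes a strategy object handed to the processor, so adding a new dimension or fact means extending the processor without touching (or repeating) any save logic:

```python
# Persistence delegated to a strategy object.  Swapping Validation for
# Database switches modes without subclassing the processor itself.
class Validation(object):
    def save(self, obj):
        pass  # go through the motions, commit nothing

class Database(object):
    def __init__(self):
        self.committed = []
    def save(self, obj):
        self.committed.append(obj)  # stand-in for a real commit

class FactProcessor(object):
    def __init__(self, persistence):
        self.persistence = persistence
    def process(self, row):
        self.persistence.save(row)
```

The validate-vs.-load choice is now a constructor argument, and the inheritance axis is free for the things that actually vary: dimensions and facts.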


In my case, I've got about a dozen implementations using a clunky inheritance that had some copy-and-paste programming.  Oops.

I'm trying to reduce that technical debt by rewriting each to be a proper delegation.  With good unit test coverage, there's no real technical risk.  Just the tedious work of fixing the same mistake that I rushed into production twelve separate times. 

Really.  Colossally dumb.

Tuesday, October 25, 2011

VMware, VIX and PyVIX2

The topic of VMware came up at my local 757 Python Users Group.

A common administrative need is to control VM farms.  While there are a number of pointy-clicky GUI tools, VMware offers the VIX library to permit writing scripts to control VM's.

Here's some information we looked at recently on PyVIX2 and VMware.

The idea behind PyVIX2 is to provide a relatively simple Python binding to VIX.    This, too, is a command-line interface, following on the heels of More Command-Line Goodness and Command-Line Applications.

Thursday, October 20, 2011

The Agile "Religion" -- What?

Received "it seems that software development has caught the agile religion. Personally, I have an issue w/ being unimodal."


First.  "agile religion".  As in the deprecating statement: Agile is nothing more than a religion?  As in Agile is nothing more than a vague religious practice with no tangible value to an organization?  Interesting, I guess.

I'm assuming that the author did not read the Manifesto for Agile Software Development.  Or -- worse -- they read it and found that the four values (Individuals and interactions, Working software, Customer collaboration and Responding to change) are of no tangible value.

That's alarming.  Really.  The alternative (processes and tools, comprehensive documentation, contract negotiation, and following a plan without regard to changes) seems like it's a recipe for cost, risk, low-value work and a cancelled project.  Indeed, it seems like non-Agile project management is the best way to get to the fabled "Software Crisis" where lots of money gets spent but little of value gets created.

Further, it seems that all modifications of the classic waterfall method (e.g., the spiral method) specifically create "iterative, incremental" approaches to software development.  That is, everything that's not a strict (brain-dead) waterfall has some elements of Agile.

This causes me to think that Agile isn't a religion.  It causes me to think that Waterfall methods were a religious practice of no tangible value.  All the methodology experiments over the last 15 years have been ways of introducing flexibility (agility, brains) into a foolishly inflexible methodology definition.

Indeed, it appears that the heavyweight waterfallish methods are an attempt to replace thinking with process.  And it didn't work.  So, we have to go back to the thinking part.  Only, we call it Agile now.

Religious Wars.

Second.  "agile religion" (again).  As in methodology discussions are just religious wars?  As in methodology discussions are just quibbling over no-value details?  Some folks may get this impression of making a choice between Agile vs. Non-Agile methods.  I think that those folks haven't actually had the opportunity to work from a prioritized backlog and build the most valuable part first.  I think that someone who thinks Agile is just a religious war hasn't been allowed to fix a broken project plan based on lessons learned during the first release.


Third.   "unimodal".  As in being exclusively Agile is bad?  As in sometimes you need to have a rigid, unyielding process that sticks strictly to the schedule irrespective of changes which may occur?  That doesn't seem rational.

Change happens.  Forcing the inevitable changes to conform to some farcical schedule made up by people who didn't have all the details seems silly.  Making contract negotiation the focal point of response to change seems like a waste of effort.  Trying to document everything so completely that all possible changes are already accounted for seems impossible.  And replacing change with a process that regulates change seems -- perhaps -- unhinged.

There were some links and some charts and graphs attached.  I couldn't get past the two sentences above to see if there was, perhaps, something more to it.  All I could do was respond with a request for clarification that didn't involve the trivialization of Agile methods.  It doesn't seem sensible to try and remove the human element from software development.

I'll provide whatever follow-up clarification surfaces on this topic.  It's interesting to see if the "agile religion" was misplaced, or if there are folks who think that responding to the messiness of real software development is a bad idea.

We tried the waterfall method.  And it didn't work very well.  Agile isn't a "religion".  It's a simple acknowledgement that reality is messy.

Thursday, October 13, 2011

More Command-Line Goodness

In Command-Line Applications, we looked at a Python main-import switch which boiled down to this.

for file in args.file: 
    with open( file, "r" ) as source:
        process_file( source, args )

The point was that each distinct file on the command-line was processed in a more-or-less uniform way by a single function that does the "real work" for that input file.

It turns out that we often have flat files which are spreadsheets or spreadsheet-like.   Indeed, for some people (and some organizations) the spreadsheet is their preferred user interface.  As I've said before, 
Spreadsheets are the universal user interface. Everyone likes them, they're almost inescapable. And they work. There's no reason to attempt to replace the spreadsheet with a web page or a form or a desktop application. It's easier to cope with spreadsheet vagaries than to replace them.
They have problems, but they are surprisingly common.  

Enter Stingray Reader.  This is a small Python library to make it easy to have programs which read workbooks--collections of spreadsheets--or spreadsheet-like files with a degree of transparency.  

And.  It allows a clean command-line interface.

With a little care, we can reduce the main-import switch to something like this.

if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    args= parse_args()
    logging.getLogger().setLevel( args.verbosity )
    builder= make_builder( args )
    try:
        for file in args.file:
            with workbook.open_workbook( file ) as source:
                process_workbook( source, builder )
        status= 0
    except Exception as e:
        logging.exception( e )
        status= 3
    sys.exit( status )

The workbook-specific lines are the make_builder() call and the loop over workbook.open_workbook().  A "builder" creates application-specific Python objects from spreadsheet rows.  The workbook.open_workbook function builds a workbook reader based on the file name.  It can handle a number of file types.  

The process_workbook function is the "real work" function that handles a workbook of individual spreadsheets (or a spreadsheet-like file).
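The idea can be sketched with plain Python data standing in for a workbook.  The sheets-as-tuples structure and the builder's build method here are illustrative assumptions, not the actual Stingray Reader API.

```python
class EchoBuilder( object ):
    """Hypothetical builder: turn a row (a list of cell values)
    into an application object (here, just a dict)."""
    def __init__( self, schema ):
        self.schema = schema
    def build( self, row ):
        return dict( zip( self.schema, row ) )

def process_workbook( source, builder ):
    """The 'real work' function: apply the builder to every row
    of every sheet in the workbook."""
    results = []
    for sheet_name, rows in source:
        for row in rows:
            results.append( builder.build( row ) )
    return results

# A workbook reduced to plain Python data for the demo.
workbook = [ ( "Sheet1", [ [1, "red"], [2, "blue"] ] ) ]
objects = process_workbook( workbook, EchoBuilder( ["id", "color"] ) )
```

Because process_workbook only sees an iterable of sheets and a builder, the same function works unchanged for .CSV, .XLSX or fixed-format sources.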

Tuesday, October 11, 2011

A smoothly operating, well-oiled engine for failure

It occurs to me that much of "Big IT" creates a well-oiled organization that makes broken software seem acceptable. The breakage is wrapped in layers of finely-tuned process.

Consider a typical Enterprise Application.  There's a help desk, ticket tracking, a user support organization that does "ad-hoc" processing, and a development organization to handle bug fixes and enhancement requests.  All those people doing all that work.


If people need all that support, then the application is -- from a simplistic view -- broken.

The organization, however, has coped with the broken application by wrapping it in layers of people, process, tools, technology, management and funding.  The end users have a problem, they call the help desk, and the machine kicks in to resolve their problem.

It is a given -- a going-in assumption -- a normal, standard expectation that any enterprise software is so broken that a huge organization will be essential for pressing forward.  It is expected that good software cannot be built.

We're asked to help a client create a sophisticated plan for the New Enterprise App support organization.  Planning this organization feels like planning for various kinds of known, predicted, expected failures. Failure is the expectation.  Broken is the standard operating mode.

Consider a typical non-Enterprise Application.  Let's say, the GNU C compiler.  Or Python.  Or Linux.  An almost entirely volunteer organization, no help desk, no trouble tickets, no elaborate support organization plan.  Yet.  These products actually work flawlessly.  They're not wrapped in a giant organization.

Why is the bar for acceptability so low for "Enterprise" applications?  Why is this tolerated?

Thursday, October 6, 2011

Command Line Applications

I'm old -- I admit it -- and I feel that command-line applications are still very, very important. Linux, for example, is packed full of almost innumerable command-line applications. In some cases, the Linux GUI tools are specifically just wrappers around the underlying command-line applications.

For many types of high-volume data processing, command-line applications are essential.

I've seen command-line applications done very badly.

Overusing Main

When writing OO programs, it's absolutely essential that the OS interface (public static void main in Java or the if __name__ == "__main__": block in Python) does as little as possible.

A good command-line program has the underlying tasks or actions defined in some easy-to-work with class hierarchy built on the Command design pattern. The actual main program part does just a few things: gather the relevant environment variables, parse command-line options and arguments, identify the configuration files, and initiate the appropriate commands. Nothing application-specific.

When the main method does application-specific work, that application functionality is buried in a method that's particularly hard to reuse. It's important to keep the application functionality away from the OS interface.

I'm finding that main programs should look something like this:

if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    args= parse_args()
    logging.getLogger().setLevel( args.verbosity )
    try:
        for file in args.file:
            with open( file, "r" ) as source:
                process_file( source, args )
        status= 0
    except Exception as e:
        logging.exception( e )
        status= 3
    sys.exit( status )

That's it.  Nothing more in the top-level main program.  The process_file function becomes a reusable "command" and something that can be tested independently.
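A sketch of why this matters for testing: because process_file takes an open file and an arguments object, a test needs no command line at all.  The pattern-counting behavior here is invented purely for illustration.

```python
import io

def process_file( source, args ):
    """A hypothetical 'command': count lines containing args.pattern."""
    return sum( 1 for line in source if args.pattern in line )

class Args( object ):
    """Stand-in for an argparse.Namespace."""
    pattern = "error"

# No OS interface required: hand the function an in-memory file.
source = io.StringIO( u"error: one\nok\nerror: two\n" )
count = process_file( source, Args() )
```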

Tuesday, October 4, 2011

"Hard Coding" Business Rules

See this: "Stop hard-coding business rules" in SD Times.

Here's what's exasperating: "Memo to developers: Stop hard-coding business rules into applications. Use business rules engines instead."

Business Rules Engines?  You mean Python?

It appears that they don't mean Python.

"Developers can use [a BPM suite or rules engine] and be more productive, so long as they don’t use C# or Java as a default for development".

I'm guessing that by "C# or Java" they mean "a programming language" and I would bet that Python is included in "bad" languages for development.

Python has all the simplicity and expressive power of a Domain-Specific Language (DSL) for business rules.

Don't hard-code business rules in Java.  Code them in an interpreted language like Python.
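As a sketch of what "business rules in Python" can look like, here's a hypothetical rule table: plain data plus small functions that a developer can edit without touching the application core.  The rules and field names are invented for illustration.

```python
# Hypothetical rule table: name, predicate, discount rate.
# A developer edits this data, not the compiled core.
RULES = [
    ( "preferred customer", lambda order: order["total"] > 1000, 0.10 ),
    ( "bulk order",         lambda order: order["units"] >= 50,  0.05 ),
]

def discount( order ):
    """Apply the first matching rule's discount rate."""
    for name, matches, rate in RULES:
        if matches( order ):
            return order["total"] * rate
    return 0.0

small_order = discount( { "total": 200.0, "units": 10 } )
big_order = discount( { "total": 2000.0, "units": 10 } )
```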

Also, don't be misled by any claims that business analysts or (weirdly) users can somehow "code" business rules.  They can't (and mostly, they won't).  That's why SD Times wisely says "developers".  That's how coding gets done.

Thursday, September 29, 2011

The Politics of Estimating

Computerworld, September 12, page 10.

Microburst: IT Disasters
According to a study of 1,471 big IT projects, 15% turn out to be money pits, with cost overruns averaging 200%.

How is this a politically-charged statement?  We hear this kind of thing all the time.

The implication is that we, as developers (or project leaders), are failing to execute.



An "overrun" is isomorphic to "badly justified" or "badly budgeted" or "oversold to executive sponsors".

An "overrun" can be a failure to use (or even permit) realistic estimates.  It may reflect an executive sponsor restating objectives to make the project large enough to justify it.  An overrun can mean anything.

Calling it an overrun is a way to label it as "failure to execute".

I prefer to call it a failure of vision (or whatever it is executive sponsors do).  It's more likely to be an under-estimate than it is to be an over-run.

After all, how many times have we been told to reduce an estimate?  How many times have folks gotten their "attaboys" and "attagirls" for "sharpening their pencils" and reducing the proposal to the smallest amount that the customer would approve?

Tuesday, September 27, 2011

Threads and I/O

Threads don't promote concurrent I/O.

Kernel threads may.  Most of us write user threads.  Here's a great summary under Thread (Computer Science).
However, the use of blocking system calls in user threads (as opposed to kernel threads) or fibers can be problematic. If a user thread or a fiber performs a system call that blocks, the other user threads and fibers in the process are unable to run until the system call returns. A typical example of this problem is when performing I/O: most programs are written to perform I/O synchronously. When an I/O operation is initiated, a system call is made, and does not return until the I/O operation has been completed. In the intervening period, the entire process is "blocked" by the kernel and cannot run, which starves other user threads and fibers in the same process from executing.

The point is this.

If it involves I/O, multi-threading doesn't help.  Processes do.

If it involves computation, multi-threading may help.

Thursday, September 22, 2011

"Strict" Unit Testing -- Everything In Isolation Is Too Much Work

Folks like to claim that unit testing absolutely requires each class be tested in isolation using mocks for all dependencies.  This is a noble aspiration, but it doesn't work out well in Python.

First, "unit" is intentionally vague.  It could be a class, a function, a module or a package.  It's a "unit" of code.  Anything could be considered a "unit".

Second--and more important--the extensive mocking isn't fully appropriate for Python programming.  Mocks are very helpful in statically-typed languages where you must be very fussy about assuring that all of the interface definitions are carefully matched up properly.  

In Python, duck typing allows a mock to be defined quite trivially.  A mock library isn't terribly helpful, since it doesn't reduce the code volume or complexity in any meaningful way.

Dependencies without Injection

The larger issue with trying to unit test in Python with mock objects is the impact of change.

We have some class with an interface.

class AppFeature( object ):
    def app_method( self, anotherObject ):
        pass

class AnotherClass( object ):
    def another_method( self ):
        pass

We've properly used dependency injection to make AppFeature depend on an instance of AnotherClass.  This means that we're supposed to create a mock of AnotherClass to test AppFeature.

class MockAnotherClass( object ):
    def another_method( self ):
        pass

In Python, this mock isn't a best practice.  It can be helpful.  But adding a mock can also be confusing and misleading.

Refactoring Scenario

Consider the situation where we're refactoring and change the interface to AnotherClass.  We modify another_method to take an additional argument, for example.

How many mocks do we have?  How many need to be changed?  What happens when we miss one of the mocks and have the mysterious Isolated Test Failure?  

While we can use a naming convention and grep to locate the mocks, this can (and does) get murky when we've got a mock that replaces a complex cluster of objects with a simple Facade for testing purposes.  Now, we've got a mock that doesn't trivially replace the mocked class.

Alternative: Less Strict Mocking

In Python--and other duck typing languages--a less mock-heavy approach seems more productive.  The goal of testing every class in isolation surrounded by mocks needs to be relaxed.  A more helpful approach is to work up through the layers.
  1. Test the "low-level" classes--those with few or no dependencies--in isolation.  This is easy because they're already isolated by design.
  2. The classes which depend on these low-level classes can simply use the low-level classes without shame or embarrassment.  The low-level classes work.  Higher-level classes can depend on them.  It's okay.
  3. In some cases, mocks are required for particularly complex or difficult classes.  Nothing is wrong with mocks.  But fussy overuse of mocks does create additional work.
The benefit of this is 
  • The layered architecture is tested the way it's actually used.  The low-level classes are tested in isolation as well as being tested in conjunction with the classes that depend on them.
  • It's easier to refactor.  The design changes aren't propagated into mocks.
  • Layer boundaries can be more strictly enforced.  Circularities are exposed in a more useful way through the dependencies and layered testing.
We still need to work out proper dependency injection.  If we try to mock every dependency, we are forced to confront every dependency in glorious detail.  If we don't mock every single dependency, we can slide by without properly isolating our design.
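The layered approach can be sketched with the classes from earlier; the method bodies are invented for illustration.

```python
class AnotherClass( object ):
    """Low-level: no dependencies; tested first, in isolation."""
    def another_method( self ):
        return 42

class AppFeature( object ):
    """Higher-level: depends on an injected AnotherClass instance."""
    def __init__( self, another ):
        self.another = another
    def app_method( self ):
        return self.another.another_method() * 2

# Layer 1: the low-level class in isolation.
low = AnotherClass()
assert low.another_method() == 42
# Layer 2: the dependent class uses the real AnotherClass -- no mock.
feature = AppFeature( AnotherClass() )
result = feature.app_method()
```

A refactoring of another_method now breaks exactly one test suite, not a scattering of mocks.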

Tuesday, September 13, 2011

Thursday, September 8, 2011

I was going to be talking about Schema Migration, tacit knowledge, and -- of course -- Python.

The hard part would have been avoiding a LONG rant on how devilishly hard the problem really is.

Apparently, however, DevDays is cancelled.  Sigh.

Thursday, September 1, 2011

Data Warehousing and SQL -- Tread Carefully

"Are you implying that a scalable Data Warehouse solution could be implemented using Python and serialised files?"

Not "implying".  I'm trying to state it as clearly as I can.

A scalable data warehouse solution involves a lot of flat file processing.

ETL, for example, is mostly a flat-file pipeline.  It starts with source application extract (to create a flat file) and proceeds through a number of transformation steps to filter, cleanse, recode, conform dimensions, and eventually relate facts to dimensions.  This is generally very, very fast when done with simple flat files and considerably slower when done with a database.

This is the "Data Warehouse Bus" that Kimball describes in chapter 9 of The Data Warehouse Lifecycle Toolkit.

Ultimately, the cleansed, conformed files will lie around in a "staging area" forever.  When a datamart is built, a subset of these files can be (rapidly) loaded into an RDBMS for query processing.

Doing this in Python is no different from doing it in Java, C++ or (for that matter) Syncsort.  Yes.  You can build a data warehouse using processing steps written around Syncsort and be quite successful.

The important part of this is to recognize the following.

When trying to do data warehouse flat-file processing in C++ (or Java) you have the ongoing schema maintenance issue.  The source data changes.  You must tweak the schema mapping from source to warehouse.  You can encode this schema mapping as property files or some such, or you can simply use an interpreted language like Python and encode the mappings as Python code.
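One way to encode such a mapping as Python; the column names and conversions here are hypothetical.

```python
# Schema mapping as executable Python: one entry per target column,
# trivially edited when the source feed changes.
MAPPING = {
    "customer_id": lambda row: int( row["CUST_NO"] ),
    "state":       lambda row: row["ST"].strip().upper(),
}

def transform( row ):
    """Apply the mapping to one source row, producing a clean row."""
    return dict( (target, fn( row )) for target, fn in MAPPING.items() )

clean = transform( { "CUST_NO": "0042", "ST": " va " } )
```

When the source adds or renames a column, the change is one line of Python, not a rebuild.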

The "Data Warehouse Bus" is a lot of applications that are trivially written as simple, parallel, multi-processing, small, read-match-write programs.  Forget threads.  Simply use heavy-weight, OS-level processes so that you can maximize the I/O bandwidth.  (Remember: when one thread makes an I/O request, the entire process waits; an I/O-bound application isn't helped by multi-threading.)

    with open('some_data','rb') as source:
        rdr= csv.DictReader( source )
        wtr= csv.DictWriter( sys.stdout, some_schema )
        for row in rdr:
            if exclude( row ): continue
            clean = cleanse( row )
            wtr.writerow( clean )

This example writes to stdout so that it can be connected in a pipeline with other steps in the processing.  Programs running in an OS pipeline run concurrently.  They tie up all the cores available without any real programming effort other than decomposing the problem into discrete parallel steps that apply to each row being touched.

Simple file processing is much, much faster than SQL processing.  Why?  No overheads for locking or buffer pooling or rollback segments, or logging, or after-image journaling or deadlock detection, etc.

Note that a data warehouse database has no need for sophisticated locking.  All of the "updates" are bulk loads.  80% of the activity is "insert".  With some Slowly Changing Dimension (SCD) operations there is a trivial status-change update, but this can be handled with a single database-wide lock during insert.

The primary reason for using SQL is to handle "SELECT something ... GROUP BY" queries.  SQL does this reasonably well most of the time.  Python does it pretty well, also.

    sum_col1 = defaultdict( float )
    count_group = defaultdict( int )
    with connection.cursor() as c:
        c.execute( "SELECT COL1, GROUP FROM..." )
        for col1, group in c.fetchall():
            sum_col1[group] += col1
            count_group[group] += 1
    print( sum_col1, count_group )

That's clearly wordier than SQL.  But not much wordier.  The SELECT statement embedded in the Python is simpler because it omits the GROUP BY clause.  Since it's simpler, it's more likely to benefit from being reused in the RDBMS.

The Python may actually run faster than a pure SQL query because it avoids the (potentially expensive) RDBMS sort step.  The Python defaultdict (or Java HashMap) is how we avoid sorting.  If we need to present the keys in some kind of user-friendly order, we have limited the sort to just the distinct key values, not the entire join result.
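For example, limiting the sort to just the distinct keys looks like this (with made-up groups):

```python
from collections import defaultdict

# Accumulate per-group totals without sorting any raw rows.
sum_col1 = defaultdict( float )
sum_col1["west"] += 2.5
sum_col1["east"] += 1.0
sum_col1["east"] += 3.0

# For presentation, sort only the handful of distinct group keys.
report = [ (group, sum_col1[group]) for group in sorted( sum_col1 ) ]
```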

Because of the huge cost of group by, there are two hack-arounds.  One is "materialized views".  The idea is that a group-by view is updated when the base tables are updated to avoid the painful cost of sorting at query time.  In addition to this, there are reporting tools which are "aggregate aware".  They can leverage the materialized view to avoid the sort.

How about we avoid all the conceptual overhead of materialized views and aggregate-aware reporting?  Instead, we can write simple Python procedures that do the processing we want.

Bottom Line

Data Warehouse does not imply SQL.  Indeed, it doesn't even suggest SQL except for datamart processing of flexible ad-hoc queries where there's enough horsepower to endure all the sorting.

Thursday, August 4, 2011

Brain-Damaged Data

We process a fair number of externally-prepared datasets -- 40,000 rows of econometric data that we purchased from a third party, for example.  Mostly, the data is in a usable format: .CSV or .XLSX.

Once in a while, we get CSV with | (pipe).  A few times, we got fixed-format COBOL-style records.

Recently, we got a CSV-with-pipe that included 2 records with embedded \n sequences in the middle of a CSV row of data.  Really.

Painful Elimination

There are two ways to "eliminate" this problem. 
  • Subclass our input processing to handle this special CSV-with-pipe case.
  • Actually read and parse the source file creating a clean intermediate file that we can simply process with an existing CSV-with-pipe configuration.
I elected to do the first.  The second is (to my mind) an auditing nightmare because we touched the file.  We have to prove that we didn't disturb any other fields.  While not impossible, it becomes a very strange special case for this one-and-only file.

CSV Simplicity

The CSV module's epic simplicity makes it easy to work around this kind of goofy data.  Our subclass for this case had the following extra foolishness put in

def make_reader( self ):
    def filter_damage( aFile ):
        file_iter= iter(aFile)
        for row in file_iter:
            if row.rfind('"') >= len(row)-3:
                logger.error( "Damaged Line: %r", row )
                rest= next(file_iter)
                line= row[:row.rfind('"')] + rest[3:]
                logger.warning( "Repaired Line: %r", line )
                yield line
            else:
                yield row
    tweaked_file= filter_damage( self.sourceFile )
    return csv.reader( tweaked_file, delimiter='|', doublequote=False, escapechar='"' )

That's it.  Since the Python CSV reader merely wants an iterator over lines, we can (with a simple generator function) provide the necessary "iterator-over-lines". 



The murky-looking row.rfind('"') >= len(row)-3 condition is one of those consequences of trying to find just a few irregular line endings in an otherwise regular file.  For CSV processing, files often have to be opened in "rb" mode because they originate (or will be used with) MS-Excel.  This makes the damaged line-ending either '"\n' or maybe '"\r\n'.  Rather than spend too much time negotiating with Python's universal newline and "rb" mode, it's slightly easier to look for a '"' near the end. 

We're hoping this is a one-time-only subclass that we can safely ignore in the future.  If hope is dashed, it's a distinct subclass, so it's easily reused and didn't break anything else.

Tuesday, July 26, 2011

One of Those Things

Check out this question on Stack Overflow: "Python: replace a string by a float in txt file".

The question is confusing, but it appears to be a longish and confused description of simple formatting or template substitution.  It's hard to be sure, but it sounds like one of Those Things™ (TT).

Most of Those Things (TT) are standard problems with standard solutions.  Until you've seen a lot of TT's, it seems like your problem is unique and special.  It's hard to see TT's for what they are.

In this case, the problem appears to be solved by Python's string.Template class with minor modifications.  The documentation for customizing string.Template isn't clear, so here's an example.

from string import Template
class MyTemplate( Template ):
    delimiter= '@'
    pattern= r"@(?P<escaped>@)|@(?P<named>[_a-z][_a-z0-9]*)@|@(?P<braced>[_a-z][_a-z0-9]*)@|@(?P<invalid>)"

That appears to be the standard solution to the standard problem.  Define a new delimiter ('@') and some slightly different delimiter parsing rules and away you go.

This can be used as follows to replace any '@x@' variables in any template file.  What's important is that very little actual code is needed, since it's one of Those Things that's already been solved.

with open( 'a.txt', 'r' ) as source:
    t = MyTemplate( source.read() )
    result= t.substitute( x=15 )
    print( result )

Thursday, July 21, 2011

Spam Email Footers

I don't want the spamilicious email.  I'm trying to actually unsubscribe.

The footer says "If you are not the intended recipient, you are hereby notified that any dissemination, distribution or copying of any information contained in or attached to this communication is strictly prohibited. If you have received this message in error, please notify the sender immediately and delete the material from any computer."

I don't feel like the intended recipient because it's just irrelevant junk.  Perhaps you should not have disseminated, distributed, copied or sent me this.  Wouldn't that have been simpler? Keep it to yourself?

I also think I've received the message in error.  Since I don't want the damn thing. And that means that I have to delete it?  Why can't you stop sending it?  Wouldn't that be simpler for both of us?

Monday, July 18, 2011

757 Python User's Group Meetup

Wednesday night.  At 757 Labs.  Be there.

Here's the details on

Lacking any other agenda, I'll do some more presentation on the supreme coolness of Django.

Tuesday, July 12, 2011

I almost wet myself

Someone sent me this: "“Building Skills in Python” – Steven F. Lott".

I had a vague idea that this book would get some traction.  This response was surprising.  I guess I should get to work on the upgrades.  And focus on the "no-nonsense" comment.

Thursday, July 7, 2011

Security Vulnerabilities

Just saw this for the first time today:

I'd always relied on this:

Both are really good lists of security vulnerabilities.

I once had to listen to a DBA tell me that "we don't know what we don't know" as a way of saying that there was no way to be sure that a web app was "secure".  That comment led the project manager to go through the classic "risk exposure" exercise (and hours of discussion) to determine that security mattered.  We defined the risks, the costs and the probability of occurrence so that we could document all kinds of potential exposures or something.

Instead of hand-wringing, these kinds of simple lists of common vulnerabilities provide actionable steps for design, code, test and audit of operations.  Further, they guide the selection, configuration and operation of web server technology to assure that the vulnerabilities are addressed.

Thursday, June 30, 2011

Implementing the Unsubscribe User Story

I've been unsubscribing from some junk email recently.

The user story is simple: As a not-very-interested person, I want to get off your dumb-ass mailing list so that I don't have to flag your crap as spam any more.

The implementations vary from good to evil.  Here's what I've found.

The best sites have an unsubscribe link that simply presents the facts -- you are unsubscribed.  I almost feel like re-subscribing to a site that handles this use case so well.

The first level of crap is a site which forces me to click an OK or Unsubscribe button to confirm that I really want to unsubscribe and wasn't clicking the tiny little links at the end of the message randomly.

The deeper level of "marketing" crap is a form that allows me to "configure my subscription settings".  This is done by some marketing genius who wanted to "offer additional value" rather than simply do what I asked.  This is a hateful (but not yet evil) practice.  I don't want to "configure" my settings.  I want out.

The third-from-worst is a form in which I must enter my email address.  What?  I have several email aliases that redirect to a common mailbox.  I have to -- what? -- guess which of the aliases was used?  This is pernicious because I can make a spelling mistake and they can continue to send me dunning email.  This fill-in-the-blanks unsubscribe is simply evil because it gives them plausible deniability when they continue to send me email.  It's now my fault that I didn't spell my email address correctly.

The next-to-worst is a "mailto:" link that jumps into my emailer.  I have to -- what? -- fill in the magic word "Complete" somewhere?  You're kidding, right?  This is so 1980's-vintage listserv that I'm hoping these companies can be sued because they failed to actually unsubscribe folks.  Again, this gives the spammer a legitimate excuse because I failed to do the arcane step properly.

The worst is no link at all.  Just instructions explaining that an email must be sent with the magic word "Complete" or "Unsubscribe" in the subject or body.  Because I use aliases, this will probably not unsubscribe anything useful, but will only unsubscribe my outbound email address.  This is the worst kind of evil.  In a way, it meets the user story.  But only in a very, very oblique way.

Monday, June 27, 2011

Simplicity vs. Depth

During chapter technical reviews, the question of technical depth has come up time and again.  Essentially, in every single chapter.

In the older Building Skills in Python book, there are a number of topics that feel "digressive" to the reviewer and editor.  Too much depth.

However, there are a number of Python tutorials, many of which are very shallow.  I'd like to find a way to retain the technical depth, without it feeling "digressive".

Choice 1.  Split each chapter into separate "basic" and "advanced" sections.  This would retain a sensible outline of parts (Language Fundamentals, Data Structures, Classes, Modules and a bunch of advanced projects) and chapters within each part.  Some chapters would still have to be split because a number of "advanced" concepts (e.g., alternative function argument passing with * and **) really have to be delayed until after an appropriate data structure chapter.

Choice 2.  Separate the material into two kinds of chapters, "basic" and "pro".  This would lead to a "basics" thread for n00bz (read all the "basics" chapters) and a "pro" thread for professionals (read all the chapters in order without skipping).  This would create some more chapters, but each chapter would be shorter and more focused.


Tuesday, June 21, 2011


Just started learning about "Hackerspace".

Without really knowing what I was doing, I fell into the 757 Labs Hackerspace.

The 757 Python Users' Group, specifically.

What a great idea.  Bright people.  Interested in the same area of technology.

It's like hanging around with sailors at a marina.

Thursday, June 9, 2011

An Object-Lesson in How to Stifle Innovation

Read this: How Ma Bell Shelved the Future for 60 Years.
AT&T firmly believed that the answering machine, and its magnetic tapes, would lead the public to abandon the telephone.
How many good ideas are set aside by managers who simply don't have a clue what users actually want?

How many great IT projects are rejected because of this kind of delusional paranoia?

Tuesday, June 7, 2011

Multithreading -- Fear, Uncertainty and Doubt

Read this: "How to explain why multi-threading is difficult".

We need to talk. This is not that difficult.

Multi-threading is only difficult if you do it badly. There are an almost infinite number of ways to do it badly. Many magazines and bloggers have decided that the multithreading hurdle is the Next Big Thing (NBT™). We need new, fancy, expensive language and library support for this and we need it right now.

Parallel Computing is the secret to following Moore's Law. All those extra cores will go unused if we can't write multithreaded apps. And we can't write multi-threaded apps because—well—there are lots of reasons, split between ignorance and arrogance. All of which can be solved by throwing money after tools. Right?


One thing that makes multi-threaded applications error-prone is simple arrogance. There are lots and lots of race conditions that can arise. And folks aren't trained to think about how simple it is to have a sequence of instructions interrupted at just the wrong spot. Any sequence of "read, work, update" operations will have threads doing reads (in any order), threads doing the work (in any order) and then doing the updates in the worst possible order.

Compound "read, work, update" sequences need locks. And the locations of the locks can be obscure because we rarely think twice about reading a variable. Setting a variable is a little less confusing. Because we don't think much about reads, we fail to see the consequences of moving the read of a variable around as part of an optimization effort.


The best kind of lock is not a mutex or a semaphore. It surely isn't an RDBMS (but God knows, numerous organizations have used an RDBMS as a large, slow, complex and expensive message queue.)

The best kind of lock seems to be a message queue. The various concurrent elements can simply dequeue pieces of data, do their tasks and enqueue the results. It's really elegant. It has many, simple, uncoupled pieces. It can be scaled by increasing the number of threads sharing a queue.

A queue (read with an official "get") means that the reads aren't casually ignored and moved around during optimization. Further, the creation of a complex object can be done by one thread which gets pieces of data from a queue shared by multiple writers. No locking on the complex object.

Using message queues means that there's no weird race condition when getting data to start doing useful work; a get is atomic and guaranteed to have that property. Each thread gets a thread-local, thread-safe object. There's no weird race condition when passing a result on to the next step in a pipeline. It's dropped into the queue, where it's available to another thread.
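A minimal sketch of this style: worker threads share one input queue and one result queue.  The squaring task is a stand-in for real work.

```python
import queue
import threading

work = queue.Queue()
results = queue.Queue()

def worker():
    # get() is atomic: no two workers ever see the same item.
    while True:
        item = work.get()
        if item is None:
            break            # sentinel: this worker is done
        results.put( item * item )
        work.task_done()

threads = [ threading.Thread( target=worker ) for _ in range(3) ]
for t in threads:
    t.start()
for n in range(5):
    work.put( n )
work.join()                  # wait until every item is processed
for t in threads:
    work.put( None )         # one sentinel per worker
for t in threads:
    t.join()
squares = sorted( results.get() for _ in range(5) )
```

No explicit locks anywhere: the queues are the only shared state, and they do their own locking.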

Dining Philosophers

The Dining Philosophers Code Kata has a queue-based solution that's pretty cool.

A queue of Forks can be shared by the various Philosopher threads. Each Philosopher must get two Fork resources from the queue, eat, philosophize and then enqueue the two Forks again. It's quite short, easy to write and easy to demonstrate that it must work.

Perhaps the hardest thing is designing the Dining Room (also known as the Waiter, Conductor or Footman) that only allows four of the five philosophers to dine concurrently. To do this, a departing Philosopher must enqueue themselves into a "done eating" queue so that the next waiting Philosopher can be seated.

A queue-based solution is delightfully simple. 200 or so lines of code, including docstring comments, so that the documentation looks nice, too.

Additional Constraints

The simplest solution uses a single queue of anonymous Forks. A common constraint is to insist that each Philosopher use only the two adjacent forks. Philosopher p can use forks (p+1 mod 5) and (p-1 mod 5).

This is pleasant to implement. The Philosopher simply dequeues a fork, checks the position, and re-enqueues it if it's a wrong fork.
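That check-and-re-enqueue loop might look like this sketch (illustrative names, with a plain set standing in for "the two adjacent forks"):

```python
import queue

def acquire_adjacent_fork( forks, allowed ):
    """Dequeue forks until one in the allowed set appears,
    re-enqueueing any wrong fork.  A sketch of the constrained
    acquire_fork; not the full Philosopher implementation."""
    while True:
        fork = forks.get()
        if fork in allowed:
            return fork
        forks.put( fork )    # wrong fork: put it back for a neighbor

forks = queue.Queue()
for f in ( 3, 1, 4, 0 ):
    forks.put( f )

# Suppose philosopher 0 may only use the adjacent forks 0 and 1.
first = acquire_adjacent_fork( forks, {0, 1} )
second = acquire_adjacent_fork( forks, {0, 1} )
```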

FUD Factor

I think that the publicity around parallel programming and multithreaded applications is designed to create Fear, Uncertainty and Doubt (FUD™).
  1. Too many questions on StackOverflow seem to indicate that a slow program might magically get faster if somehow threads were involved. For programs that involve scanning the entire hard drive or downloading Wikipedia or doing a giant SQL query, the number of threads has little relevance to the real work involved. These programs are I/O bound; since threads must share the I/O resources of the containing process, multi-threading won't help.
  2. Too many questions on StackOverflow seem to have simple message queue solutions. But folks seem to start out using inappropriate technology. Just learn how to use a message queue. Move on.
  3. Too many vendors of tools (or languages) are pandering to (or creating) the FUD factor. If programmers are made suitably fearful, uncertain or doubtful, they'll lobby for spending lots of money for a language or package that "solves" the problem.
Sigh. The answer isn't software tools, it's design. Break the problem down into independent parallel tasks and feed them from message queues. Collect the results in message queues.

Some Code

import logging
import random
import threading
import time

class Philosopher( threading.Thread ):
    """A Philosopher.  When invited to dine, they will
    cycle through their standard dining loop.
    -   Acquire two forks from the fork Queue
    -   Eat for a random interval
    -   Release the two forks
    -   Philosophize for a random interval
    When done, they will enqueue themselves with
    the "footman" to indicate that they are leaving.
    """
    def __init__( self, name, cycles=None ):
        """Create this philosopher.
        :param name: the number of this philosopher.
            This is used by a subclass to find the correct fork.
        :param cycles: the number of cycles they will eat.
            If unspecified, it's a random number, u, 4 <= u < 7
        """
        super( Philosopher, self ).__init__()
        self.name= str( name )
        self.cycles= cycles if cycles is not None else random.randrange(4,7)
        self.log= logging.getLogger( "{0}.{1}".format(self.__class__.__name__, name) )
        self.log.info( "cycles={0:d}".format( self.cycles ) )
        self.forks= None
        self.leaving= None
    def enter( self, forks, leaving ):
        """Enter the dining room.  This must be done before the
        thread can be started.
        :param forks: The queue of available forks
        :param leaving: A queue to notify the footman that they are leaving
        """
        self.forks= forks
        self.leaving= leaving
    def dine( self ):
        """The standard dining cycle:
        acquire forks, eat, release forks, philosophize.
        """
        for cycle in range(self.cycles):
            f1= self.acquire_fork()
            f2= self.acquire_fork()
            self.release_fork( f1 )
            self.release_fork( f2 )
            self.philosophize()
        self.leaving.put( self )
    def eat( self ):
        """Eating task."""
        self.log.info( "Eating" )
        time.sleep( random.random() )
    def philosophize( self ):
        """Philosophizing task."""
        self.log.info( "Philosophizing" )
        time.sleep( random.random() )
    def acquire_fork( self ):
        """Acquire a fork.
        :returns: The Fork acquired.
        """
        fork= self.forks.get()
        fork.held_by= self
        return fork
    def release_fork( self, fork ):
        """Release a fork.
        :param fork: The Fork to release.
        """
        fork.held_by= None
        self.forks.put( fork )
    def run( self ):
        """Interface to Thread.  After the Philosopher
        has entered the dining room, they may engage
        in the main dining cycle.
        """
        assert self.forks and self.leaving
        self.dine()

The point is to have the dine method be a direct expression of the Philosopher's dining experience.  We might want to override the acquire_fork method to permit different fork acquisition strategies.

For example, a picky philosopher may only want to use the forks adjacent to their place at the table, rather than reaching across the table for the next available Fork.

The Fork, by comparison, is boring.

class Fork( object ):
    """A Fork.  A Philosopher requires two of these to eat."""
    def __init__( self, name ):
        """Create the Fork.
        :param name: The number of this fork.  This may
            be used by a Philosopher looking for the correct Fork.
        """
        self.name= name
        self.holder= None
        self.log= logging.getLogger( "{0}.{1}".format(self.__class__.__name__, name) )
    @property
    def held_by( self ):
        """The Philosopher currently holding this Fork."""
        return self.holder
    @held_by.setter
    def held_by( self, philosopher ):
        if philosopher:
            self.log.info( "Acquired by {0}".format( philosopher ) )
        else:
            self.log.info( "Released by {0}".format( self.holder ) )
        self.holder= philosopher

The Table, however, is interesting.  It includes the special "leaving" queue that's not a proper part of the problem domain, but is a part of this particular solution.

import Queue  # Python 2; renamed "queue" in Python 3

class Table( object ):
    """The dining Table.  This uses a queue of Philosophers
    waiting to dine and a queue of forks.
    This seats Philosophers, allows them to dine and then
    cleans up after each one is finished dining.
    To prevent deadlock, there's a limit on the number
    of concurrent Philosophers allowed to dine.
    """
    def __init__( self, philosophers, forks, limit=4 ):
        """Create the Table.
        :param philosophers: The queue of Philosophers waiting to dine.
        :param forks: The queue of available Forks.
        :param limit: A limit on the number of concurrently dining Philosophers.
        """
        self.philosophers= philosophers
        self.forks= forks
        self.limit= limit
        self.leaving= Queue.Queue()
        self.log= logging.getLogger( "table" )
    def dinner( self ):
        """The essential dinner cycle:
        admit philosophers (to the stated limit);
        as philosophers finish dining, remove them and admit more;
        when the dining queue is empty, simply clean up.
        """
        self.at_table= self.limit
        while not self.philosophers.empty():
            while self.at_table != 0 and not self.philosophers.empty():
                p= self.philosophers.get()
                self.seat( p )
            # Must do a Queue.get() to wait for a resource
            p= self.leaving.get()
            self.excuse( p )
        assert self.philosophers.empty()
        while self.at_table != self.limit:
            p= self.leaving.get()
            self.excuse( p )
        assert self.at_table == self.limit
    def seat( self, philosopher ):
        """Seat a philosopher.  This consumes one of the
        available seats and starts the Philosopher's thread.
        :param philosopher: The Philosopher to be seated.
        """
        self.log.info( "Seating {0}".format( philosopher ) )
        philosopher.enter( self.forks, self.leaving )
        philosopher.start()
        self.at_table -= 1 # Consume a seat
    def excuse( self, philosopher ):
        """Excuse a philosopher.  This releases their seat.
        :param philosopher: The Philosopher to be excused.
        """
        self.log.info( "Excusing {0}".format( philosopher ) )
        philosopher.join() # Cleanup the thread
        self.at_table += 1 # Release a seat

The dinner method assures that all Philosophers eat until they are finished.  It also assures that four Philosophers sit at the table and when one finishes, another takes their place.  Finally, it also assures that all Philosophers are done eating before the dining room is closed.
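The seat/excuse bookkeeping can be sketched independently of the Philosophers. Here `diner` is a simplified stand-in for a dining thread, and the seat counting mirrors the `at_table` logic in `dinner` (this uses Python 3's `queue` module):

```python
import queue
import threading
import time

def diner(name, leaving, state, lock):
    """A stand-in for a dining Philosopher thread."""
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)                 # "eating"
    with lock:
        state["active"] -= 1
    leaving.put(name)                # announce departure to the footman

waiting = queue.Queue()
for name in range(10):               # ten diners waiting to be seated
    waiting.put(name)

leaving = queue.Queue()
state = {"active": 0, "peak": 0}
lock = threading.Lock()
limit = 4                            # at most four concurrent diners
seats = limit
threads = {}
excused = 0

while not waiting.empty():
    while seats and not waiting.empty():
        name = waiting.get()
        t = threading.Thread(target=diner, args=(name, leaving, state, lock))
        threads[name] = t
        t.start()
        seats -= 1                   # consume a seat
    name = leaving.get()             # block until someone finishes
    threads[name].join()
    seats += 1                       # release the seat
    excused += 1

while seats != limit:                # clean up the stragglers
    name = leaving.get()
    threads[name].join()
    seats += 1
    excused += 1
```

Because a seat is consumed before each thread starts and released only after a departure arrives on the `leaving` queue, no more than `limit` diners can ever be active at once.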

Friday, June 3, 2011

Changed the Page Template

The "default" template I chose was too narrow for presenting code samples.  Changed it.

Thursday, May 26, 2011

Code Kata : "Simple" Database Design

Here's a pretty simple set of use cases for a code-kata database application.

This is largely transactional, not analytical.

It's a simple inventory of ingredients, recipes and locations.

  • 42' sailboat.
  • Lots of places to keep stuff. Lots.
Stuff gets lost or misplaced. It's helpful to marry recipes with ingredients, to use up the last of something before it goes bad and stinks up the boat.

The actor is essentially the cook.

Use Cases
  • Perishables to be eaten soon?
  • Shopping list for specific recipes.
  • Where did I put that?

Entities

  • Ingredient. A generic description: "lime", "coconut". Not too much more is needed. A "food safety" notation (refrigeration required, etc.) is a helpful attribute. Maybe a "food group" or other nutrition information.
  • Location. A text description of where things can be stored. This shouldn't have too many attributes, because boats aren't big grids. Phrases like "port saloon upper cabinet", or "galley outer cooler" make sense to folks who live on the boat.
  • On Hand. This is simply ingredient, location and a measurement of some kind. Example: 3 limes in the starboard galley center cooler. There's a lot of magic around units and unit conversion that can be fun. But that strays outside the database domain.
  • Recipe. Example: "One of sour, two of sweet, three of strong, and four of weak.", lime, simple syrup, rum, water. Plain text using a lightweight markup is what's required here. Along with a many-to-many relationship with ingredients. This is not carefully defined above because it should be done as a "more advanced" exercise.
I think this has the right amount of complexity and isn't very abstract. Since the use cases are pretty obvious to anyone who's cooked or been to a grocery store, use case details aren't essential.
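A minimal sketch of the four entities as SQLite tables, via Python's sqlite3 module. All table and column names here are illustrative choices, not part of the kata statement:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ingredient (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    food_safety TEXT              -- e.g. 'refrigeration required'
);
CREATE TABLE location (
    id INTEGER PRIMARY KEY,
    description TEXT NOT NULL     -- e.g. 'port saloon upper cabinet'
);
CREATE TABLE on_hand (
    ingredient_id INTEGER REFERENCES ingredient(id),
    location_id INTEGER REFERENCES location(id),
    amount TEXT                   -- '3', '500 ml'; unit conversion is out of scope
);
CREATE TABLE recipe (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT                     -- plain text with lightweight markup
);
CREATE TABLE recipe_ingredient (  -- the many-to-many relationship
    recipe_id INTEGER REFERENCES recipe(id),
    ingredient_id INTEGER REFERENCES ingredient(id)
);
""")

# "Where did I put that?" for limes:
db.execute("INSERT INTO ingredient VALUES (1, 'lime', 'refrigeration required')")
db.execute("INSERT INTO location VALUES (1, 'starboard galley center cooler')")
db.execute("INSERT INTO on_hand VALUES (1, 1, '3')")
row = db.execute(
    "SELECT i.name, o.amount, l.description FROM on_hand o "
    "JOIN ingredient i ON i.id = o.ingredient_id "
    "JOIN location l ON l.id = o.location_id").fetchone()
```

The use cases are then simple joins over `on_hand`, plus a date or food-safety filter for the "perishables to be eaten soon" report.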

Wednesday, May 25, 2011

Meetup Tonight

Tonight (May 25th). Red Dog. Colley Ave. Ghent. I'll be wearing my Stack Overflow shirt. I'll be there about 7. I know that at least one other person won't be there until 8.

The Meetup link.

I like this meetup idea a lot. Probably because the WFH life-style is a little isolating.

There's the small "Hampton Stack Overflow Community". We have a common interest in Stack Overflow.

Also, there's the 757 Python Users Group. We have a common interest in Python. I've decided to become the "official" organizer for this. I'm going to join the 757 Labs Hackerspace, also.