Saturday, May 30, 2009

Paranoid Schizophrenic Programming (Revised)

Some folks love the twin ideas that (1) "someone" might break the API rules and (2) they must write lots of bonus code to "prevent" problems.

Sigh.

There are three distinct things here.
  • API definition - something we do all the time.
  • "Defensive Programming" - something that may or may not actually exist.
  • Paranoid Schizophrenic programming - a symptom of larger problems; this exists far too often.
It's not that complicated: there's a simple four-element checklist for API design.  Unless "someone" is out to break your API.   Whatever that means.  

A related topic is this kind of thing on Stack Overflow:  How Do I Protect Python Code? and Secure Plugin System For Python Application.

Following the Rules

When we define an API for a module, we define some rules.  Failure to follow the rules is -- simply -- bad behavior.  And, just as simply, when someone breaks the API rules, the module can't work.  Calling the API improperly is the same as trying to install and execute a binary on the wrong platform.

It's the obligation of the designer to specify what will happen when the rules are followed.  While it might be nice to specify what will happen if the rules are not followed, it is not an obligation.

Here's my canonical example.



def sqrt( n ):
    """sqrt(n) -> x such that x**2 == n, where n >= 0."""


The definition of what will happen is stated.  The definition of what happens when you attempt sqrt(-1) is not defined.  It would be nice if sqrt(-1) raises an exception, and it would be nice to include that in the documentation, but it isn't an obligation of the designer.  It's entirely possible that sqrt(-1) could return 0.  Or (0+1j).  Or nan.
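Python's own standard library illustrates the point: two of its square-root functions make different -- and equally legitimate -- choices for the undefined case.  A quick sketch (shown in modern Python):

```python
import math
import cmath

# math.sqrt chooses to raise an exception for the undefined case...
try:
    math.sqrt(-1)
    outcome = "no exception"
except ValueError:
    outcome = "ValueError"

# ...while cmath.sqrt chooses to return a complex number.
complex_root = cmath.sqrt(-1)

print(outcome, complex_root)
```

Neither choice is wrong, and neither is promised by the one-line API definition.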

Item one on the checklist: define what the function will do.

And note that there's a world of difference between failing and being used improperly.  We're talking about improper use here; failure is unrelated.

Complete Specification

When I remind people that they are only obligated to specify the correct behavior, some folks say "That's just wrong!  An API document should specify every behavior!  You can't omit the most important behavior -- the edge cases!"  

Ummm... That position makes no sense. 

There are lots and lots of situations unspecified in the API documentation.  What about sqrt(2) when the underlying math libraries are mis-installed?  What about sqrt(2) when the OS has been corrupted by a virus in the math libraries?  What about sqrt(2) when the floating-point processor has been partially fried?  What about sqrt(2) when the floating-point processor has been replaced by a nearly-equivalent experimental chipset that doesn't raise exceptions properly?

Indeed, there are an infinite number of situations not specified in the API documentation.  For the most part, there is only one situation defined in the API documentation: the proper use.  All other situations may as well be left unspecified.    Sometimes, a few additional behaviors are specified, but only when those behaviors provide value in diagnosing problems.

Diagnosing Problems

An API with thoughtful documentation will at least list the exceptions that are most likely to be raised.  What it does not need is an exhaustive list of exceptions.  Again, that's an absurd position -- why list MemoryError on every single function definition?

What's important about things like exceptions and error conditions is the diagnostic value of this information.  A good designer will provide some diagnostic hints instead of lots of words covering every "possible" case.

If there's no helpful diagnostic value, don't specify it.  For example, there's little good to be done by adding a "Could raise MemoryError" to every method function description.  It's true, but it isn't helpful.  Except in the rare case of an API function that -- if used wrong -- will raise a MemoryError; in this rare case you're providing diagnostic information that can be helpful.  You are over-specifying the API, but you're being helpful.

Item two on the checklist: provide diagnostic hints where they're actually meaningful and helpful.

Error Checking

How much error checking should our sqrt() function do?
  • None?  Just fail to produce an answer, or perhaps throw an exception?
  • Minimal.  This is easy to define, but many folks are unhappy with minimal.
  • More than minimal but not everything.  This is troubling.
  • Everything.  This is equally troubling.
No error checking is easiest.  And it fits with our philosophy.  If our sqrt function is used improperly -- i.e., someone broke the rule and provided a negative number -- then any exception (or nan value) will propagate to the caller and we're in good shape.  We didn't overspecify -- we provided a wrong answer when someone asked a wrong question.

Again, we're not talking about some failure to process the data.  We're talking about being called in a senseless way by a client that's not following the rules.

There's a subtlety to this, however.

A Non-Math Example

Yesterday, I tried to use a postal scale to measure the temperature in my oven.  The scale read 2.5 oz.  

What does that mean?

I asked an ill-formed question.  I got something back.  It isn't an answer -- the question was ill-formed -- but it looks like an answer.  It's a number where I expected a number.

Here's another one.  "Which is heavier, the number 7 or the color green?"  Any answer ("7", "green" or "splice the main brace") is valid when confronted with a question like that.

Perhaps I should have run a calibration (or "unit") test first.

The Termination Question

In the case of a function like square root, there is an additional subtlety.  If we're using logarithms to compute the square root, our log function may raise an exception for sqrt(-1) or it may return nan; either of which works out well -- an ill-formed question gets an improper answer.

However, we might be using a search algorithm that will fail to terminate (a bisection algorithm, or Newton's method, for example.) Failure to terminate is a much, much worse thing.  In this case -- and this case only -- we have to actually do some validation on the range of inputs.

Termination is undecidable by automated means.  It's a design feature that we -- as programmers -- must assert independently of any lint, compiler or testing discipline.

Note that this is not "defensive programming".  This is ordinary algorithm design.  Every loop structure must terminate.  If we're trying a simple bisection algorithm and we have not bracketed a root properly (because, for example, it's a complex number), the bisection won't terminate.  A root-finding bisection algorithm must actually do two things to assure termination:  check the range of the inputs and limit the number of iterations.
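Here's a sketch of what those two guards might look like in a bisection square root (the names, tolerance and iteration cap are my own choices, not a canonical implementation):

```python
def sqrt_bisect(n, tolerance=1e-9, max_iterations=200):
    """Square root by bisection.  The two checks below are not
    defensive programming; both exist only to guarantee termination."""
    if n < 0:
        # Guard 1: a negative n means the root is never bracketed,
        # so the loop below could never converge.
        raise ValueError("sqrt undefined for negative numbers")
    low, high = 0.0, max(n, 1.0)   # brackets the root for any n >= 0
    for _ in range(max_iterations):  # Guard 2: hard upper bound on iterations
        mid = (low + high) / 2
        if abs(mid * mid - n) < tolerance:
            return mid
        if mid * mid < n:
            low = mid
        else:
            high = mid
    return (low + high) / 2
```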

This isn't defensive programming because we're not checking that a mysterious "someone" is abusing the API.  We're asserting that our loop terminates.

Item three on the checklist: reject values that would keep loops from terminating.



def sqrt( n ):
    """sqrt(n) -> x such that x**2 == n, where n >= 0."""
    assert n >= 0


Incorrect Error Checking

Once we start checking for loop termination, folks say that "we're on a slippery slope" and ask where's that "fine line" between the minimal level of error checking (loops will terminate) and the paranoid schizophrenic level of error checking.

It isn't a slope.  It's a cliff.  Beyond loop termination, there's (almost) nothing more that's relevant. 

By "almost", I mean that languages like Python have a tiny realm where an additional assertion about the arguments is appropriate.  

Because of duck typing, many algorithms in Python can be written very generically.  Very generically.  Sorting, for example, can be applied to lists of -- almost -- anything.  Except, of course, it isn't meaningful for things with no useful __cmp__ function.  And in the case of things like a dictionary, what's the basis for comparison?  

In the case of dynamic languages and duck typing, it's possible that an algorithm will terminate, producing a wrong answer.  (BTW, this is one reason why Python has / and // as distinct division operators -- to assure that ints and floats can be used interchangeably and the algorithm still works.)

Item four on the checklist: when you have a known problem with a type, reject only those types that are a problem.   This is very rare, BTW.  Mostly it occurs with overlapping types (lists and tuples, floats and ints.)  Most well-designed algorithms work with a wide variety of types.  Except in the overlapping-types situation, Python will raise exceptions for types that don't work; make use of this.
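For instance, a hypothetical recursive flatten() shows why a narrow rejection is sometimes needed: a str is iterable, and each of its elements is itself a str, so recursing into one would never bottom out.  That one overlapping type is worth rejecting explicitly; everything else can rely on Python's own exceptions.

```python
def flatten(items):
    """Flatten nested lists and tuples into a single sequence."""
    if isinstance(items, str):
        # A str is iterable, and each element is itself a str --
        # recursion into it would never terminate.  Reject only this type.
        raise TypeError("flatten() does not accept a bare string")
    for item in items:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item
```

Any other unusable type (an int, say) simply raises TypeError on its own when iteration is attempted.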

What About "Business Rules"?

By "business rules" most people mean value ranges or codes that are defined by some externality.  As in "the claim value must be a number between the co-pay and the life-time limit".  

This is not a "Defensive Programming" issue.  This is just a policy statement written into the code.  Your API won't break if the claim value is less than the co-pay.  Your users will be pissed off, but that's a separate problem.

Also, you rarely raise an exception for business rules.  Usually, you'll collect business rule violations into a formal error report or log.  For example, Django's Forms will collect a dictionary of validation errors.  Each element in the dictionary has a list of problems with a particular field on the form.
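A sketch of the collect-don't-raise style, using the claim-value rule from above (the field names and the rules themselves are invented for illustration):

```python
def validate_claim(claim_value, co_pay, lifetime_limit):
    """Collect business-rule violations into a dict instead of raising.
    An empty dict means the claim passed every rule."""
    errors = {}
    if claim_value < co_pay:
        errors.setdefault("claim_value", []).append(
            "must be at least the co-pay")
    if claim_value > lifetime_limit:
        errors.setdefault("claim_value", []).append(
            "exceeds the life-time limit")
    return errors
```

The caller decides what to do with the report: render it on a form, log it, or reject the batch.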

What About "Someone" Who Can't Use The API?

Here's where the conversation goes awry.  

First, if this is a hypothetical "someone", you need to relax.  Consider these use cases.  Are you worried that "someone" will download your software, install it, configure it, start to use it, and refuse to follow the documented API?  Are you worried that they will send you angry emails saying that they insist on doing the wrong thing and your software doesn't work?  You don't need "defensive programming"; you need to either add the features they want or steer them to a package that does what they're expecting.

Here's another version of a hypothetical someone: you're working as part of a larger team, and you provide a package with an API.  Are you worried that a team member will refuse to follow the documented API?  Are you worried that they will send you angry emails saying that they insist on doing the wrong thing and your software doesn't work?  This isn't a call for "defensive programming," this is a call for a conversation.  Perhaps you built the wrong thing.  Perhaps your API documentation isn't as crystal-clear as you thought.

Someone Really Is Using It Wrong

A common situation is someone who's actually using the API wrong.  The conversation didn't help, they refuse to change their software.  Or you can't easily call them out on it because -- for example -- your boss wrote detailed specs for you, which you followed, but someone else isn't following.  What can you do?  The specification contradicts the actual code that uses the API.

Is this a place where we can apply "Defensive Programming"?  

Still no.

This is a call for some diagnostic support.  You need error messages and logs that help you diagnose the problem and locate the root cause.

Root Causes

The issue with "Defensive Programming" is that it conflates two unrelated use cases.
  • API Design.
  • Unwilling (or unable) to Follow Instructions. (UFI™)
API design has four simple rules.
  1. Document what it does.
  2. For diagnostic aid, in common edge cases, document other things it might do.  Specifically, describe conditions that are root causes of exceptions or weird answers.  Sometimes a subclass of exception is handy for handling this. 
  3.  Be sure that it terminates.  If necessary, validate the arguments and raise an exception for values that would prevent termination.
  4. In rare cases, check the data types to be sure the algorithm will actually work.  Most of the time, wrong data types will simply throw exceptions; leverage that built-in behavior.
Sociopaths

If (1) someone refuses to follow the rules and (2) complains that it's your API and (3) you elect to make changes, then...

First, you can't prevent this.  There's no "defensive programming" to head this off.

Second, know that what you're doing is wrong.   Making changes when someone else refuses to follow the rules and blames you is enabling someone else's bad behavior.  But, we'll assume you have to make changes for external political reasons.

Third -- and most important -- you're relaxing the API to tolerate ordinarily invalid data.

Expanding What's "Allowed"

When someone refuses to follow the API -- and demands you make a change -- you're having this conversation.

Them: "I need you to 'handle' sqrt(-1)."
You: "Square Root is undefined for negative numbers."
Them: "I know that, but you need to 'handle' it."
You: "There's no answer, you have to stop requesting sqrt(-1)."
Them: "Can't change it.  I'm going to make sqrt(-1) requests for external political reasons.  I can't stop it, prevent it or even detect it."
You: "What does 'handle' mean?"

At this point, they usually want you to do something that lets them limp along.  Whatever they ask you to do is crazy.  But you've elected to cover their erroneous code in your module.  You're writing diagnostic code for their problem, and you're burying it inside your code.

If you're going to do this, you're not doing "defensive programming", you're writing some unnecessary code that diagnoses a problem elsewhere.  Label it this way and make it stand out.  It isn't "defensive" programming.  It's "dysfunctional co-dependent relationship" programming.

Thursday, May 28, 2009

Updates to Building Skills in Python

I got a bug report (back in April) about an exercise in Building Skills in Python.  It was a change from 2.2 that I never validated in 2.5.   Thanks to my readers for responding with questions and complaints.

I've finally updated and posted the revisions.

Further, after some questions on Stack Overflow, I've decided to revisit parts of Chapter 21.  Specifically this question leads me to conclude that there's an audience that's served by a little more depth in this area.

Wednesday, May 27, 2009

That's odd -- Python faster than Java

Here's an amazing Stack Overflow question.  The follow-up conversation is great stuff.

The question shows two versions of approximately the same processing.  Python is faster than Java.  That's unexpected.

Java has static compilation and the hot-spot translation to machine code.  Apparently, Python has some optimizations that are just as valuable.

Semantic Markup with Docutils Interpreted Text Roles

A resume is a slippery thing -- a package of semi-structured data.

It has a kind of database-like feel to it, but there are so many exceptions and special cases that the database never works out quite the way you wanted.

For example, I've got -- essentially -- one employer over the past 30+ years.  But I've been on hundreds of projects for almost 100 different clients.  Since projects overlap, there's no tidy timeline.  The database has a token "Employer" table, a "Client" table, and a "Project" table, which is an association between "Client" and "Employer".  For each "Project" I can have a number of roles or positions.  Most importantly, each project has a large number of hardware, software, skill, language and other "features" to it.

Relax

A more relaxed model is some kind of markup so that keywords can be identified semantically and culled out to create tag clouds or indices.

The usual culprit for mixed-content models like this is XML.  We would define a DTD or XSD with our tags in a new namespace.  Sadly, this also means that I have to rewrite my resume into XML.  Not that bad, but still...

Can we do similarly detailed semantic markup in RST?

What Role Do These Words Play?

RST offers a flexible mechanism called Interpreted Text Roles.  There are two parts to getting started with this.

1.  Name the role in a .. role:: name directive.
2.  Markup your content with :name:`words`.  
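For instance, a hypothetical "skill" role (the role name and the sentence are invented for illustration):

```rst
.. role:: skill

Rebuilt the ETL pipeline in :skill:`Python` against an :skill:`Oracle` warehouse.
```

Each marked phrase becomes a span with class "skill", which downstream tools can find.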

By default, the role name is the class name that will be put into the HTML <span> tag when the document is written in HTML.  If you want, you can supply special formatting in addition to marking the words with a role.

You can do considerably more with interpreted roles, but we'll look at creating a tag cloud.

Gathering Data

The gathering part is easy.  You can snarf out the interpreted text roles with a simple visitor-based design.

import sys
from collections import defaultdict
from docutils.core import publish_doctree
from docutils.nodes import SparseNodeVisitor

class RoleVisitor( SparseNodeVisitor ):
    def __init__( self, role="skill", *args, **kw ):
        SparseNodeVisitor.__init__( self, *args, **kw )
        self.role= role
        self.cloud = defaultdict(int)
    def visit_inline( self, aNode ):
        if self.role in aNode['classes']:
            self.cloud[ aNode.astext() ] += 1


This visitor will accumulate a map with tag and frequency for a given role.

We can parse the RST resume file and accumulate the tag cloud statistics as follows.

def tagFreq( aFile ):
    source= aFile.read()
    structure= publish_doctree( source )

    skills= RoleVisitor( "skill", structure)

    structure.walkabout(skills)
    return skills.cloud

Once we have the data we can emit a tag cloud.

Frequency to Font Size

Converting frequencies to font sizes is a little alignment exercise.   A clever page designer might have clever style names based on the tag frequency.  I decided to name the styles after the font-sizes, since that seems simple.

def sizeMap( cloud ):
    """Many common tags piled into xx-large."""
    size_name = [ 'xx-small', 'x-small', 'small', 'medium', 'large',
         'x-large', 'xx-large' ]
    freq=list(set(cloud.values()))
    offset = max( 0, (len(size_name)-len(freq))//2 )
    size_map= {}
    for sz, f in enumerate(sorted(freq)):
        size_map[f]= size_name[sz+offset] if sz+offset < len(size_name) else size_name[-1]
    #print size
    return size_name, size_map

This assigns all the words that occur just once to the smallest font.  There are usually a large number of tags that occur just once.  A few tags will have a large number of occurrences; these will all wind up with 'xx-large' as their class.

Emitting The Cloud

Writing the tag cloud (in RST) looks like this.

def rst( names, sizes, cloud, destination ):
    sys.stdout= destination
    for s in names:
        print "..  role::", s # The formatting roles that match our CSS.
    print "\n----------\n"
    for k in sorted(cloud):
        print ':%s:`%s`' % ( sizes[cloud[k]], k, )

We can then tack this cloud onto the end of the resume to get a summary of skills, frameworks, OS's, languages and the like.

Style Points

The docutils section on overriding the style sheet suggests we include something like the following in the working directory.

resume.css

@import url(html4css1.css);

span.xx-small { font-size:0.65em; font-family:sans-serif }
span.x-small { font-size:0.7em; font-family:sans-serif }
span.small { font-size:0.85em; font-family:sans-serif }
span.medium { font-size:1em; font-family:sans-serif }
span.large { font-size:1.3em; font-family:sans-serif }
span.x-large { font-size:1.6em; font-family:sans-serif }
span.xx-large { font-size:1.9em; font-family:sans-serif }

We include this with the following command: rst2html.py --stylesheet-path=resume.css

Workflow

This makes it much more pleasant to edit my resume.  

1.  Make the changes.
2.  Run the tag-cloud script.
3.  Run rst2html. 

Now I just have to remember to do it more often than once every five years.

Monday, May 25, 2009

ReStructured Text markup and Content Management

I can't say enough good things about ReStructuredText (RST).  I've used all of the available markup languages (SGML, HTML and XML).  They have their place, but they all fall short of being truly usable.

In "This sounds complicated, because it is", I reviewed some of my history of cheap content management.   

In looking at content of all kinds, I'm finding that RST is much, much easier to work with than SGML, HTML or XML.  In short, I think that RST makes the file system into a really good content management system (CMS).  Unstructured content is a big win.  Structured content is a "don't care".  But there's a middle ground of semi-structured content that requires sophisticated semantic markup.

SGML At The Dawn Of Time

When the web started its ascent (back in the 90's), I was lucky.  I had already been working with folks who did military contracting, and they had introduced me to SGML.   When I moved from SGML to HTML, I saw it as a pleasant simplification because it had a more-or-less fixed DTD.  

My first personal web pages were lovingly hand-crafted HTML masterpieces.  (Okay, they were lovingly hand-crafted.)   There was  a lot of work involved in markup, cross-references, and presentation. 

HTML via a Class Hierarchy

My first templating was via proper Python classes.  I created class hierarchies that embodied the page template and filled in required data.  The heart of each class was an emit method that wrote the final HTML.

Variant page layouts and special cases were easily handled by Python simple inheritance.  

Of course, the big problem is that HTML is just representation.  There's often some bleed-through between the problem domain model and the HTML representation of that underlying model.  You don't want your problem domain objects to encode any HTML.  You can have a generic Tag class, but the Page class is specific to your problem domain.

The Python class structure is nice, but it's only suitable for structured content management.  When you have semi-structured and unstructured data -- the strong suit of HTML -- you find the class hierarchy to be too rigid.

Some time in the early 00's, I discovered Cheetah.

HTML via Templates

Cheetah (and template engines like Mako, Jinja, and numerous others) did what I wanted.  A base template was -- effectively -- a superclass.  Each block in that template could be overridden by a subclass.

The content, then, becomes a relatively simple template file that extends a page layout.  You can handle unstructured and semi-structured content very nicely.  I changed my ways of working with HTML to leverage this elegant, extensible view of the world.  I redid my personal web site: it became a collection of Cheetah templates that contained all the content.

Note that I've *added* a markup language.  In addition to HTML, I also have some Cheetah markup on each page.  While this got me consistency and flexibility (and a reduction in the volume of stuff on each page) it did make things slightly more complex.

Look at http://cadesignquilts.com/ for another example of an all-Cheetah static site.  I did several sites like this.  The workflow involved (1) designing the overall page, (2) getting the data into a usable form, (3) generating the page-level template files, and (4) running Cheetah to emit HTML from the templates.  All static content.  Runs like lightning.  

The JSP Distraction

Eventually, I started doing development with Struts, which depends heavily on JSP.  You have HTML commingled with Java code.  Plus, you've got custom actions via a tag library to extend JSP processing.  You can create page-level templates with a reasonably smart JSP tag library.

This template solution doesn't work well for unstructured or semi-structured data.  It's a pure programming solution.

DocBook XML and Semantic Markup

I wrote Building Skills in Python entirely in AppleWorks.  That was pretty well unmaintainable and unpublishable in that form.

I converted the text to DocBook XML.  I used the Leo outliner to manage the document as a whole.  I wrote my own publishing workflow to transform the XML to HTML and PDF.   It worked reasonably well.

More important, using DocBook reinforced the importance of semantic markup.  It took me back to my SGML days.  It also showed why presentational markup and other HTML styling have to be moved out of the document and into the stylesheet.

This was a very nice way to handle the semi-structured and unstructured content in a book.  Direct use of XML is a pain in the neck.  XML has a lot of syntax.  It's much nicer to do your thinking with something lighter weight.  

ReStructured Text (RST) for Unstructured Content

Somewhere in the late 00's, I found Python's docutils and RST.  I can't figure out when I started -- precisely -- but using RST as part of content management didn't fully click at first.

After reworking my personal site, which includes a lot of really unstructured ("random" might be a better word) content, I'm seeing the value in RST + Filesystem as a CMS.  I think the Sphinx folks are right.  If you have a simple markup system and all the filesystem tools that have evolved over the past few decades, you're covered.

Further, on larger projects, I've found that I can pop out a nice template documentation tree with a simple .. toctree:: directive on the index.rst page and generate a tidy, complete documentation package without much pain.

Structured Content

For structured data, you have ordinary classes and programs.  You have SQL databases, ORM to map to classes; all of that technology.  It's easy to write applications that emit RST which you can then publish.  

Most structured content can be boiled down to tables and charts.  The .. csv-table:: directive makes it easy to have an application emit data that you fold into a more elegant-looking report.
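A sketch of what that might look like (the data here is invented):

```rst
.. csv-table:: Recent Projects
   :header: "Client", "Role", "Years"

   "Client A", "Architect", "3"
   "Client B", "Lead Programmer", "2"
```

An application need only emit the rows; the directive and header live in the report template.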

The Nuance -- Semi-Structured Data

My worst-case scenarios are my résumés: sailing, programming and writing.  The data has deep semantic meaning:  it isn't just words.  On the other hand, the data has lots of special-cases and exceptions: it isn't totally amenable to a database.

The absolute best part of docutils is that the parser's output is available for processing.  You can -- easily -- add directives and text roles to create semantic meaning.

I experimented with XML and YAML for my résumés.  The XML is cumbersome.  The YAML requires a fairly sophisticated class model to make use of the information.  

RST with a few text roles, however, rocks.  The .. role:: directive makes it easy to throw roles into a document for later use by applications.

Friday, May 22, 2009

Open Source Use Rising

Or so claims SD Times...


The decision process includes: "find a low-cost solution".  More importantly, it includes "justify the fees to purchase and for support."

This drives down the cost of software and support for commercial products.   It also rationalizes what you buy when you buy a license and pay for support.

In the olden days, you just paid.  Now, you debate the merits of support and determine what you're getting and if the value is commensurate with the cost.

Thursday, May 21, 2009

Name Matching Alternatives

The users want to locate people by last name.  They want flexible matching.  That's not very hard.

The DBA wants to do some wild-card searches efficiently.  The DBA may not be responding to the users' actual request, making this more complex than it needs to be.

I'm not in contact with the users, so I don't know the real requirements.  I'm hearing this through the DBA-filter ("all singing, all dancing, all SQL".)  I may also be hearing this through IT management filter ("only use technology I recognize from my programming days".)

In my experience, wild-card searches are rarely the user's first choice.  They want more flexible matching.  While the SQL LIKE-clause is one solution that might work, it is rarely what the users really want.

The DBA knows that the SQL LIKE-clause effectively defeats indexing and forces row-by-row comparison.  And we all know that row-by-row processing is evil.

Premature Optimization

Question 1.  Is this premature optimization?   

There's no way to tell.  The database server may be beefy enough and the query rare enough that a basic LIKE-clause regular expression will work just fine.

Step 1.  Benchmark this baseline solution.

As Fast as Possible -- in SQL

One way to find names quickly is to denormalize the database.  In addition to the proper names, also store the soundex of the name.  Since this is stored, and there's no function call in the WHERE clause, and this is fully indexed, it will find "similar-sounding" names very quickly.

Soundex has limitations, so some folks use metaphone.  The principle is the same.  When inserting or updating the name, also insert (or update) the metaphone of the name.
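For reference, a minimal sketch of the classic American Soundex encoding (a real system would more likely use the database's built-in SOUNDEX() function or a library implementation; this assumes an alphabetic, non-empty name):

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for digit, letters in enumerate(
            ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
        for letter in letters:
            codes[letter] = str(digit)
    name = name.upper()
    first = name[0]
    result = []
    prev = codes.get(first, "")
    for ch in name[1:]:
        if ch in "HW":
            continue    # H and W are invisible: they don't separate duplicates
        code = codes.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code     # a vowel (code "") resets prev, separating duplicates
    return (first + "".join(result) + "000")[:4]
```

So soundex("Robert") and soundex("Rupert") both yield "R163" -- similar-sounding names land in the same indexed bucket.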

This, BTW, does not involve any wild-card.  Except in unusual cases, it always returns a set of candidates.  And the set of candidates is a better fit than any wild-card search.   More focused, and the whole name is considered.

Step 2.  Prototype the soundex solution.  It's hard to explain, and impossible to visualize.  Actual result sets make it concrete.

Throw Memory At It

Here's an alternative that works really well.  

Stop using the database.

Don't waste brain cells trying to write this kind of super-flexible search in SQL. It's better done in code.  Write a simple materialized view with name and PK and nothing else.  Create the smallest possible table that can be used just for name matching -- nothing else in this table.  It's little more than an index.

Write a simple web service that queries this physically small table, doing a search algorithm.  The web service will locate near-matches in this small table.  It could return full rows for the top matches, or simply return the names and PK's for users to pick from. 

You have several candidate algorithms for this server.  A wisely-written web service can use a combination of algorithms and return a match score along with the names and PK's.

Web Service for Wildcards

An alternative web service can query the name/PK table using a nice regular expression library.  Since RE syntax can be complex, you would translate from a user-friendly syntax to a proper RE syntax.  

For instance, the LIKE-like syntax can be reformulated to proper RE syntax.  The %'s become .* and the _'s become .'s.  Or perhaps you offer your users shell-like syntax.  In this case, the *'s become .* and the ?'s become .'s.

Either way, the user's wild-card becomes a proper regular expression.  The web service queries the table, matching all input against the RE.  The service could return full rows for the top matches, or simply return the names and PK's for users to pick from. 
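A sketch of that translation for the SQL LIKE flavor (shell-style * and ? would work the same way; the case-insensitive anchoring is my own choice, since LIKE is often case-insensitive):

```python
import re

def like_to_regex(pattern):
    """Translate SQL LIKE wildcards (% and _) into an anchored,
    case-insensitive regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # % matches any run of characters
        elif ch == "_":
            parts.append(".")       # _ matches any single character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)

matcher = like_to_regex("Sm_th%")
```

Here "Smith" and "Smythe" both match, while "Samson" does not.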

This little web service can be granted a large amount of memory to cache large row sets.  Boy will it be fast.

Also, depending on the pace of change in the underlying table, it may be possible for this service to query all names into a cache once every few minutes.   Perhaps it can do this by first making a SQL request to refresh the materialized view and then a query to fetch the updated view into memory.

What the DBA wants

The DBA wants some magical pixie dust that somehow makes a query with a LIKE clause use an index and behave like other properly indexed columns. 

The actual email enumerated four of the possible ways a LIKE clause could be used.  I'm guessing the hope was that somehow the enumeration of a subset of candidate LIKE clauses would help locate the pixie dust.

Here's my advice.  If this magical LIKE clause feature already existed, it would be in the DBA guide.  Since it isn't in the DBA guide, perhaps it doesn't exist.   Enumerating four use cases (name, *name, name* and *name*) doesn't help; it's still not going to work out well.  Remember, SQL's been around in this form for decades; the LIKE clause continues to be a challenge.

First, benchmark.  Second, offer the users soundex.  Then, well, you've got work to do.

Applet Not Inited; the "Red X" problem

I haven't done Applet stuff in years.

I do -- intensely -- like embedding functionality in web pages.  RIA/Ajax and what-not are something I have trouble with because I'm not a graphic designer.  Javascript and Applets fall into three clear categories:
  1. Basic usability.  Javascript offers lots of little enhancements to HTML presentation that make sense.  Emphasis on "little".
  2. Client-side features.  Many things are simple calculators or other processing that makes sense on the client side -- the relevant factors can be downloaded and used by an applet or javascript script.  
  3. Junk.  There are lots of graphical effects that vary from gratuitous to irritating.  Too many folks in marketing see some "pop-up" technique and think it's cool.  Worse, they'll take an application that lacks solid use cases and try to add flashing to scrolling to emphasize something instead of reducing clutter and distraction.  Sigh.
The Common Problems

In all web-based software development, the number one problem is always permissions.  Always.  In the case of applet development, this is always hurdle number one.  The file isn't owned by the right person or doesn't have the right permissions.  You see the "applet not inited" and "red X icon" as symptoms of the applet not being downloaded at all.

The number two problem is access to resources.  Usually this is a CLASSPATH issue, but it can also be an HTML page with a wrong URI for the applet's code.  You see the "applet not inited" and "red X icon" as symptoms of the applet not being referenced correctly, or not being able to locate all of its parts.

[Technically, the basic access comes before permissions, but you usually don't get access wrong first.  Usually, you get permissions wrong; later, you discover you have a subtle access issue.]

One of the more subtle manifestations is the case-matching issue.  Your Java class definitions are usually UpperCase.  The source file and resulting class file will have this same UpperCase format.  But if you get the case wrong in your HTML, you just get an applet not inited error.  Arrgh.

When you don't work with applets all that often, the "applet not inited" is baffling.

Misdirection

I wasted hours on Google and Stack Overflow looking up "applet not inited" and "Red X icon" and similar stuff.

Then I looked at the HTML I was testing.

Surprise.  No one had moved the .jar file into the proper directory.

There's a lot of stuff on the applet not inited error.  Most of it misses the usual culprits: permissions and access to the resources.  

Wednesday, May 20, 2009

This sounds complicated, because it is

For a while, I generated documentation with Cheetah. I wrote bodies as a fragment of HTML and used Cheetah to wrap those bodies in standard templates with navigation and branding.

To write my books, I learned DocBook markup and used DocBook XSL tools to create HTML and PDF versions of the book's text. Even though XML is hard to work with, I managed to muddle through. It's painful -- at times -- but doable.  

[Eventually, I found XMLMind's XML Editor.  It rocks.  But that's off-topic.]

Then, I found RST and RST2HTML.  For a while, I wrote my documentation in RST and used a simple script to create the HTML version of the documentation from RST source.

Why ReStructuredText?

From their site: "reStructuredText is an easy-to-read, what-you-see-is-what-you-get plaintext markup syntax".  
  • Easy-to-Read.  The markup is very, very simple.  Mostly spacing and simple quoting.  Yet, for edge cases, there is enough richness to approach DocBook XML.
  • WYSIWYG.  The markup doesn't get in the way; you write the text with a few conventions for spacing and quoting.
  • Plain Text.  A few spacing and quoting rules are used to distinguish structure from content.  Presentation is a limited part of RST (as in HTML, some presentation is present in the structural markup, but it can be avoided).
RST led me, eventually, to Sphinx.

The Secret of Sphinx

Sphinx is RST-based markup.  You write in plaintext (plus some quoting and spacing) and you get an elegant HTML web site with inter-document references all resolved correctly, contents, indexes, auto-generated API documentation for your Python software, syntax coloring, everything.  Wow.

I can't stop myself from doing everything in Sphinx.  You create a development structure for your source files.  You use a series of toctree directives to build the resulting documentation structure that people will see and use.
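A top-level index.rst with a toctree directive looks something like this; the document names are placeholders, not files from the original project.

```rst
Project Documentation
=====================

.. toctree::
   :maxdepth: 2

   requirements
   design
   api
```

Each name refers to another .rst file; Sphinx resolves the hierarchy, builds the navigation, and numbers everything for you.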

I've decided to convert some ancient Cheetah-based stuff to Sphinx.  

Unmarking Up

Revising HTML-based document bodies to RST is annoying.  It can be done with Beautiful Soup.  The HTML is pretty regular (and pretty simple) so it wouldn't be too bad.  Except for a bunch of edge cases that have significant complexity.

The original Cheetah-based site wasn't purely documentation.  It doesn't fit the Sphinx use cases perfectly.  A fairly significant percentage of the Cheetah-based pages are HTML pages with complex, embedded applets to do calculations.

These pages are not -- strictly speaking -- documentation.  They're an application.  They contain markup (<embed> mostly) that RST can't generate.  Further, they have to be unit tested prior to running Sphinx to build the documentation, since the HTML is actually part of the application.

Raw HTML?

The applet pages are -- more or less -- raw HTML pages that need to be folded in with the Sphinx-generated documentation.  Sphinx has an html_static_path configuration parameter that can copy these applications from project folders into destination directories.

But this leaves me with dozens of Cheetah-generated pages as part of this application.  The presence of Cheetah in the midst of this Sphinx operation makes things complicated.

Or, perhaps it doesn't.

It turns out that Sphinx is built on Jinja.  There's a template engine under the hood!  That's handy.  That lets me build the application HTML with a slightly different template engine; one that's compatible with the rest of the Sphinx-generated site.

I think I've got a clean, RST-based replacement for my lovingly hand-crafted HTML.  It's a lot of rework, but the simplification is of immense value.

Sunday, May 17, 2009

Data Structures in Python and SQL

This is -- partially -- about the object-relational impedance mismatch.  But it's also about the parallel concepts between objects and relations.  We'll use Python as our object model.

First, the obvious.

A SQL table is a list of rows.  A row is a dictionary that maps a column name to a column value.  A SQL table has a defined type for a named column; Python doesn't pre-define the type of each column.

Some folks like to think of a table as a rigidly-defined class, which is partly true.  It can be rigidly-defined.  However, the extra meta-data doesn't help much.

Indexing

As a practical matter, most databases go beyond the minimalist definition of a relation as a collection of rows.  An index extends the structure in one of two ways.

A unique-key index transforms the SQL table into a dictionary that maps a key to a row.
    class UniqueKeyTable( object ):
        def __init__( self ):
            self.rows = {}
        def insert( self, aRow ):
            self.rows[aRow.key()] = aRow
The non-unique key index transforms the SQL table into a dictionary that maps a key to a list of rows.
    import collections

    class KeyedTable( object ):
        def __init__( self ):
            self.rows = collections.defaultdict(list)
        def insert( self, aRow ):
            self.rows[aRow.key()].append( aRow )
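A quick usage sketch.  The Row class here is an invented stand-in with a key() method, and the KeyedTable class is restated so the example runs on its own.

```python
import collections

class KeyedTable(object):
    """Non-unique index: each key maps to a list of rows."""
    def __init__(self):
        self.rows = collections.defaultdict(list)
    def insert(self, aRow):
        self.rows[aRow.key()].append(aRow)

class Row(object):
    """Stand-in row; key() returns the indexed column."""
    def __init__(self, name, dept):
        self.name, self.dept = name, dept
    def key(self):
        return self.dept   # non-unique: many rows per department

table = KeyedTable()
table.insert(Row('Smith', 'sales'))
table.insert(Row('Jones', 'sales'))
table.insert(Row('Brown', 'ops'))
```

A lookup like `table.rows['sales']` is the in-memory equivalent of an indexed equality predicate.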
SQL Operations

The single-table SELECT algorithm has a WHERE clause that gets broken into two parts: key filtering and everything else.

The basic SELECT looks something like this.
    for r in table.rows[key]:
        if other_where_clause( r ):
            select_group_by( r )
That's the essential feature of a basic select -- it expresses a number of design patterns.  There's a key-to-list map, a filter, and the "select-group-by" map to results.

In theory, the SELECT operation is the more general "filter" algorithm, where every row passes through a general where_clause_filter process.  
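The whole pattern, as a runnable sketch with invented data: key filtering first, then the rest of the where clause, then the select-group-by map to results.

```python
import collections

# A keyed "table": department -> list of (name, salary) rows.
rows = collections.defaultdict(list)
for name, dept, salary in [('Smith', 'sales', 50),
                           ('Jones', 'sales', 70),
                           ('Brown', 'ops', 60)]:
    rows[dept].append((name, salary))

# SELECT name FROM table WHERE dept = 'sales' AND salary > 55
# Key filtering handles dept = 'sales'; the comprehension's
# condition is "everything else" in the where clause.
result = [name for name, salary in rows['sales'] if salary > 55]
```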

The Join Algorithms

We have a number of alternative join algorithms.  In some cases, we have two dictionaries with the same keys.  This leads to a highly optimized query where one key locates rows on both sides of the join.

In other cases, we have a kind of nested-loops join.  We find a row in one table, and use this row's attributes to locate a row in another table.
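A sketch of that nested-loops join, with two in-memory dictionaries standing in for tables; the data and column names are invented.

```python
# Two "tables": employees keyed by id, departments keyed by dept code.
employees = {1: {'name': 'Smith', 'dept': 'S'},
             2: {'name': 'Jones', 'dept': 'O'}}
departments = {'S': {'title': 'Sales'},
               'O': {'title': 'Operations'}}

# Nested-loops join: for each employee row, use its attributes
# to locate the matching department row by key.
joined = [(e['name'], departments[e['dept']]['title'])
          for e in employees.values()]
```

For tables this small, the dictionary lookups beat a round-trip to the RDBMS; the benchmarking question below is where the crossover lies.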

The "Which is Better?" Question

We always have two alternatives for every algorithm:  the SQL version and the Python version.  This is essential to resolving the Object-Relational Impedance Mismatch.  We can implement our algorithm on either side: Python objects or SQL relations.

Note that there's no simple "Use SQL for this" or "Use Python for that" decision process.  The two structures -- objects and relations -- are completely isomorphic.  There's no specific set of features that dominate either representation.  

The literal question that I got was "Should I use a complex data structure in a programming language or should I use SQL ?"

Ideally, the answer is "SQL does [X] better", leading to an easy decision.  But this kind of answer doesn't exist.

The two structures are isomorphic; the correct answer is hard to determine.  You want the RDBMS to filter rows and return the smallest relevant set of data to the object representation.  While locating the fewest rows seems simple, a few things make even this hard to determine.  

While it seems that the RDBMS can be the best way to handle join algorithms, this doesn't always work.  When we're doing a join involving small tables, the RDBMS may be less effective than an in-memory dictionary.  It sometimes occurs that SQL is best for filtering very large tables only.

Indeed, the only way to choose between two isomorphic representations (objects vs. relations) is to benchmark each implementation.

Thursday, May 14, 2009

iWeb -- not so nice

For technical blogging (like this) iWeb is weak.  

The total MacOSX integration -- pictures, podcast, etc. -- is nice.  It's very cool for my travelogues. But for code samples and the kind of customized HTML widgets that are required by Technorati, it's too hard to deal with.

The size of the iWeb pages is magnificent.  Reading a posting is an undertaking.  I don't really like that, since this is mostly simple text; the ultra complex graphics aren't an asset.

For now, I still have to mess around with the ShareThis link.  I'm not sure if I like the sophistication of Share This or if I want button-by-button links to digg, facebook, tweet and track, stumble upon, reddit, del.icio.us, and yahoo buzz buttons on each post.

Wednesday, May 6, 2009

Multi-threaded apps and module globals

Learned about module globals the hard way.


The mod_wsgi daemon by default spawns 15 threads.  This is important, but not obvious.


During load testing, we had intermittent weird errors.  We were seeing an odd inconsistency in replies.  My experience in creating military software in the ’80’s leads me to put loop-back self-tests everywhere.  One of our loopbacks wasn’t looping back properly.


The symptom looked like a single value being overwritten.  After a design review, it appears that one information source -- a module global -- wasn’t working well.


Module globals -- like other Singletons -- are a seductive trap.   The issue is that a multi-threaded application will have one copy of the module.  The one copy may not be thread safe.  


The problem is that  thread-safety requires some fairly detailed analysis. Simple unit testing isn’t quite enough.  But the process of designing for testability is helpful.  Isolation and encapsulation are important for testability as well as locating thread-safety issues.
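A minimal sketch of the repair, with names of my own invention: guard the module global with a lock so each read-modify-write is atomic.

```python
import threading

# The module global -- one copy shared by all mod_wsgi threads.
_cache = {}
_cache_lock = threading.Lock()

def update_status(key, value):
    # Without the lock, two threads can interleave a read-modify-write
    # on the shared dictionary, and one update silently overwrites another.
    with _cache_lock:
        _cache[key] = value

def get_status(key):
    with _cache_lock:
        return _cache.get(key)
```

Keeping all access behind two small functions is the isolation-and-encapsulation point: the lock lives in one place, and there's one obvious seam to instrument or mock during testing.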

Monday, May 4, 2009

All Those TODO's

About a year ago, we started out doing Python development with simple rst2html documents for requirements, design, etc.  In the code, we had comments that used epydoc with the epytext markup language.


No, it wasn’t confusing.  Free-text documents (requirements, architecture, design, test plans, etc.) are easy and fun to write in RST.  Just write.  Leave the formatting to someone else.  A little semantic markup doesn’t hurt, but you don’t spend hours with MS-Word trying to  disentangle your bullets and your numbering.


Adding comments to code in epytext was pretty easy, also.


Then I discovered Sphinx.   Sphinx can add module documentation to a document tree very elegantly.  Further, Sphinx can pull in RST-formatted module comment strings.  Very nice.


Except, of course, we have hundreds of modules in epytext.  Today, I started tracking down all of the 150+ modules without proper document strings in RST notation.  Hopefully, this time tomorrow, I’ll have a much, much better -- and internally consistent -- set of documentation.