Bio and Publications

Tuesday, October 26, 2010

Python and the "Syntactic Whitespace Problem"

Check out this list of questions on Stack Overflow.
About 10% of these are really just complaints about Python's syntax. Almost every Stack Overflow question on Python's use of syntactic whitespace is really just a complaint.
Here's today's example: "Python without whitespace requirements".
Here's the money quote: "I could potentially be interested in learning Python but the whitespace restrictions are an absolute no-go for me."
Here's the reality.
Everyone Indents Correctly All The Time In All Languages.
Everyone. All the time. Always.
It's amazing how well, and how carefully, people indent code. Not just Python code.
All Code. XML. HTML. CSS. Java. C++. SQL. All Code.
Everyone indents. And they always indent correctly. It's truly amazing how well people indent. In particular, when the syntax doesn't require any indentation, they still indent beautifully.
Consider this snippet of C code.
if( a == 0 )
   printf( "a is zero" );
   r = 1;
else
   printf( "a is non-zero" );
   r = a % 2;
Over the last few decades, I've probably spent a complete man-year reading code like that and trying to figure out why it doesn't work. Without braces, each if and else controls only the single statement that follows it, so the code doesn't do what the indentation says it does. It's not easy to debug.
The indentation completely and accurately reflects the programmer's intention. Everyone gets the indentation right. All the time. In every language.
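For contrast, here's the same logic as a quick Python sketch. The indentation the programmer already wrote is the block structure, so the code does exactly what it looks like it does:

if a == 0:
    print( "a is zero" )
    r = 1
else:
    print( "a is non-zero" )
    r = a % 2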
And people still complain about Python, even when they indent beautifully in other languages.

Thursday, October 21, 2010

Code Base Fragmentation

Here's what I love -- an argument that can only add cost and complexity to a project.

It sounds like this to me: "We need to fragment the code base into several different languages. Some of the application programming simply must be written in a language that's poorly-understood, with tools that are not widely available, and supported by a select few individuals that have exclusive access to this code. We haven't benchmarked the technical benefit."

Further, we'll create complex organizational roadblocks in every single project around this obscure, specialized, hard-to-support language.

Perhaps I'm wrong, but database triggers always seem to create more problems than they solve.

They Totally Solve a Problem

The most common argument boils down to application-specific cross-cutting concerns. The claim is that these concerns (logging, validation, data model integrity, whatever) can only be solved with triggers. For some reason, though, these cross-cutting concerns can't be solved through ordinary software design. I'm not sure why triggers are the only solution when simple OO design would be far simpler.

Some folks like to adopt the "multiple application programming languages" argument. That is, that ordinary OO design won't work because the code would have to be repeated in each language. This is largely bunk. It's mostly folks scent-marking territory and refusing to cooperate.

Step 1. Write a library and share it. It's hard to find a language that can't be used to write a sharable library. It's easy to find an organization where the Visual C# programmers are not on speaking terms with the Java programmers and the isolated Python folks are pariahs. This isn't technology. Any one of the languages can create the necessary shared library. A modicum of cooperation would be simpler than creating triggers.

Step 2. Get over it. "Duplicated" business logic is rampant in most organizations. Now that you know about it, you can manage it. You don't need to add Yet Another Language to the problem. Just cooperate to propagate the changes.

They're Totally Essential To The Database

The silly argument is that some business rules are "closer to" or "essential to" the database. The reason I can call this silly is because when the data is converted to another database (or extracted to the data warehouse) the triggers aren't relevant or even needed. If the triggers aren't part of "interpreting" or "using" the data, they aren't essential. They're just convenient.

The data really is separate from the processing. And the data is far, far more valuable than the processing. The processing really is mostly application-specific. Any processing that isn't specific to the application really is a cross-cutting concern (see above). There is no "essential" processing that's magically part of the data.

What If...

Life is simpler if all application programming is done in application programming languages. And all triggers are just methods in classes. And everyone just uses the class library they're supposed to use.
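To make that concrete, here's a hypothetical sketch of "the trigger is just a method": a Django model whose save() does the validation and audit logging that would otherwise live in BEFORE/AFTER triggers. The model and field names are invented for illustration.

from django.db import models

class AuditLog( models.Model ):
    """Hypothetical audit table, written by the model rather than by a trigger."""
    message = models.CharField( max_length=200 )
    timestamp = models.DateTimeField( auto_now_add=True )

class Invoice( models.Model ):
    """Hypothetical model: the 'trigger' logic lives in save()."""
    customer = models.CharField( max_length=64 )
    amount = models.DecimalField( max_digits=12, decimal_places=2 )
    def save( self, *args, **kwargs ):
        # What would have been a BEFORE INSERT/UPDATE validation trigger.
        if self.amount < 0:
            raise ValueError( "negative invoice amount" )
        super( Invoice, self ).save( *args, **kwargs )
        # What would have been an AFTER INSERT/UPDATE logging trigger.
        AuditLog.objects.create( message="Invoice %s saved" % self.pk )

Every application -- web, batch, command-line -- that uses this class library gets the same behavior, in one language, with ordinary tools.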

"But what if someone doesn't use the proper library? A trigger would magically prevent problems."

If someone refuses to use the application libraries, they need career coaching. As in "find another job where breaking the rules is tolerated."

Tuesday, October 19, 2010

Technical Debt

Love this from Gartner. "Gartner Estimates Global 'IT Debt' to Be $500 Billion This Year, with Potential to Grow to $1 Trillion by 2015".

NetworkWorld ran a quicky version of the story. Gartner: Global 'IT debt' hits $500 billion, on the way to $1 trillion.

ComputerWorld -- to be proper journalists -- have to get a balancing quote. Their version of the story is this: Gartner warns of app maintenance 'debt'. The balancing quote is the following:
"There are many good reasons to NOT upgrade/modernize many applications, and I believe Gartner is out of line using words like 'debt' which have guilt associated with them,"
"Guilt"? That's a problem? Why are we pandering to an organization's (i.e., CIO's) emotional response?

I'm not sure that using a word like "debt" is a problem. Indeed, I think they should ramp up the threat level on this and add words like "short-sighted" and "daft" and perhaps even "idiotic".

Anyone who doesn't believe (or doesn't understand) technical debt needs only to review the Y2K plans and budgets. A bad technology decision led to a mountain of rework. Yes, it was all successful, but it made IT budgeting difficult for years afterward.

The rest of the organization was grumpy about having their projects stalled until after Y2K. IT created its own problems by letting the technology debt accumulate to a level where it was "fix or face an unacceptable risk of not being able to stay in business."

How many other latent Y2K-like problems are companies ignoring?

Wednesday, October 13, 2010

Real Security Models

Lots of folks like to wring their hands over the Big Vague Concept (BVC™) labeled "security".

There's a lot of quibbling. Let's move beyond BVC to the interesting stuff.

I've wasted hours listening to people identify risks and costs of something that's not very complex. I've been plagued by folks throwing up the "We don't know what we don't know" objection to a web services interface. This objection amounts to "We don't know every possible vulnerability; therefore we don't know how to secure it; therefore all architectures are bad and we should stop development right now!" The OWASP top-ten list, for some reason, doesn't sway them into thinking that security is actually manageable.

What's more interesting than quibbling over BVC, is determining the authorization rules.

Basics

Two of the pillars of security are Authentication (who are you?) and Authorization (what are you allowed to do?)

Authentication is not something to be invented. It's something to be used. In our case, with an Apache/Django application, the Django authentication system works nicely for identity management. It supports a simple model of users, passwords and profiles.
We're moving to Open SSO. This takes identity management out of Django.

The point is that authentication is -- largely -- a solved problem. Don't invent. It's solved and it's easy to get wrong. Download or license an established identity management product and use it for all authentication.

Authorization

The Authorization problem is always more nuanced, and more interesting, than Authentication. Once we know who the user is, we still have to determine what they're really allowed to do. This varies a lot. A small change to the organization, or a business process, can have a ripple effect through the authorization rules.

In the case of Django, there is a "low-level" set of authorization tests that can be attached to each view function. Each model has an implicit set of three permissions (can_add, can_delete and can_change). Each view function can test to see if the current user has the required permission. This is done through a simple permission_required decorator on each view function.
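Here's a sketch of what that looks like on a view. The "inventory" application and "widget" model are made-up names, but the decorator and the app_label.codename permission string are standard Django:

from django.contrib.auth.decorators import permission_required
from django.shortcuts import render_to_response

@permission_required( 'inventory.change_widget' )
def update_widget( request, widget_id ):
    """Django redirects to the login page unless the user has the permission."""
    # ... ordinary view processing here ...
    return render_to_response( "inventory/widget_form.html" )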

However, that's rarely enough information for practical — and nuanced — problems.

The auth profile module can be used to provide additional authorization information. In our case, we just figured out that we have some "big picture" authorizations. For sales and marketing purposes, some clusters of features are identified as "products" (or "features" or "options" or something). They aren't smallish things like Django models. They aren't largish things like whole sites. They're intermediate things based on what customers like to pay for (and not pay for).

Some of these "features" map to Django applications. That's easy. The application view functions can all simply refuse to work if the user's contract doesn't include the option.

Sadly, however, some "features" are part of an application. Drat. We have two choices here.
  • Assure that there's a "default" option and configure the feature or the default at run time. For a simple class (or even a simple module) this isn't too hard. Picking a class to instantiate at run time is pretty standard OO programming (see the sketch after this list).
  • Rewrite the application to refactor it into two applications: the standard version and the optional version. This can be hard when the feature shows up as one column in a displayed list of objects or one field in a form showing object details. However, it's very Django to have applications configured dynamically in the settings file.
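Picking the implementation at run time can be as simple as a little factory function. Here's a minimal sketch; the "premium pricing" feature and the class names are invented:

class DefaultPricingEngine( object ):
    """The behavior every customer gets."""
    def price( self, item ):
        return item.base_price

class PremiumPricingEngine( DefaultPricingEngine ):
    """The behavior only licensed customers get."""
    def price( self, item ):
        return item.base_price * item.discount

def make_pricing_engine( features ):
    """Pick the implementation based on the customer's licensed features."""
    if "premium_pricing" in features:
        return PremiumPricingEngine()
    return DefaultPricingEngine()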
Our current structure is simple: all customers get all applications. We have to move away from that to mix-and-match applications on a per-customer basis. And Django supports this elegantly.

Security In Depth

This leads us to the "Defense in Depth" buzzword bingo. We have SSL. We have SSO. We have high-level "product" authorizations. We have fine-grained Django model authorizations.

So far, all of this is done via Django group memberships, allowing us to tweak permissions through the auth module. Very handy. Very nice. And we didn't invent anything new.

All we invented was our high-level "product" authorization. This is a simple many-to-many relationship between the Django Profile model and a table of license terms and conditions with expiration dates.
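In outline, that looks something like the following. The model and field names here are invented, and the real terms-and-conditions table carries more detail:

import datetime
from django.db import models
from django.contrib.auth.models import User

class Product( models.Model ):
    """A sellable cluster of features -- a "product" or "option"."""
    name = models.CharField( max_length=64 )

class LicenseTerm( models.Model ):
    """One product licensed under specific terms, with an expiration date."""
    product = models.ForeignKey( Product )
    expires = models.DateField()

class Profile( models.Model ):
    """The auth profile; the many-to-many is the "product" authorization."""
    user = models.OneToOneField( User )
    licenses = models.ManyToManyField( LicenseTerm )

def has_product( user, product_name ):
    """True if the user's contract includes the product and it hasn't expired."""
    terms = user.get_profile().licenses.filter(
        product__name=product_name, expires__gte=datetime.date.today() )
    return terms.count() > 0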

Django rocks. The nuanced part is fine-tuning the available bits and pieces to match the marketing and sales pitch and the legal terms and conditions in the contracts and statements of work.

Monday, October 4, 2010

.xlsm and .xlsx Files -- Finally Reaching Broad Use

For years, I've been using Apache POI in Java and XLRD in Python to read spreadsheets. Finally, now that .XLSX and .XLSM files are in more widespread use, we can move away from those packages and their reliance on successful reverse engineering of undocumented features.

Spreadsheets are -- BTW -- the universal user interface. Everyone likes them, they're almost inescapable. And they work. There's no reason to attempt to replace the spreadsheet with a web page or a form or a desktop application. It's easier to cope with spreadsheet vagaries than to replace them.

The downside is, of course, that users often tweak their spreadsheets, meaning that you never have a truly "stable" interface. However, transforming each row of data into a Python dictionary (or Java mapping) often works out reasonably well to make your application mostly immune to the common spreadsheet tweaks.

Most of the .XLSX and .XLSM spreadsheets we process can be trivially converted to CSV files. It's manual, yes, but a quick audit can check the counts and totals.

Yesterday we got an .XLSM file with over 80,000 rows. It couldn't be trivially converted to CSV by my installation of Excel.

What to do?

Python to the Rescue

Step 1. Read the standards. Start with the Wikipedia article: "Office Open XML". Move on to the ECMA-376 standard.

Step 2. It's a zip archive. So, to process the file, we need to locate the various bits inside the archive. In many cases, the zip members can be processed "in memory". In the case of our 80,000+ row spreadsheet, the archive is 34M. The sheet in question expands to a 215M beast. The shared strings are 3M. This doesn't easily fit into memory.

Further, a simple DOM parser, like Python's excellent ElementTree, won't work on files this huge.

Expanding an XLSX or XLSM file

Here's step 2. Expanding the zip archive to locate the shared strings and sheets.
import zipfile

def get_worksheets(name):
    arc= zipfile.ZipFile( name, "r" )
    member= arc.getinfo("xl/sharedStrings.xml")
    arc.extract( member )
    for member in arc.infolist():
        if member.filename.startswith("xl/worksheets") and member.filename.endswith('.xml'):
            arc.extract(member)
            yield member.filename

This does two things. First, it locates the shared strings and the various sheets within the zip archive. Second, it expands the sheets and shared strings into the local working directory.

There are many other parts to the workbook archive. The good news is that we're not interested in complex workbooks with lots of cool Excel features. We're interested in workbooks that are basically file-transfer containers. Usually a few sheets with a consistent format.

Once we have the raw files, we have to parse the shared strings first. Then we can parse the data. Both of these files are simple XML. However, they don't fit in memory. We're forced to use SAX.

Step 3 -- Parse the Strings

Here's a SAX ContentHandler that finds the shared strings.
import xml.sax
import xml.sax.handler

class GetStrings( xml.sax.handler.ContentHandler ):
    """Locate Shared Strings."""
    def __init__( self ):
        xml.sax.handler.ContentHandler.__init__(self)
        self.context= []
        self.count= 0
        self.string_dict= {}
    def path( self ):
        return [ n[1] for n in self.context ]
    def startElement( self, name, attrs ):
        print( "***Non-Namespace Element", name )
    def startElementNS( self, name, qname, attrs ):
        self.context.append( name )
        self.buffer= ""
    def endElementNS( self, name, qname ):
        if self.path() == [u'sst', u'si', u't']:
            self.string_dict[self.count]= self.buffer
            self.buffer= ""
            self.count += 1
        while self.context[-1] != name:
            self.context.pop(-1)
        self.context.pop(-1)
    def characters( self, content ):
        if self.path() == [u'sst', u'si', u't']:
            self.buffer += content
This handler collects the strings into a simple dictionary, keyed by their relative position in the XML file.

This handler is used as follows.
string_handler= GetStrings()
rdr= xml.sax.make_parser()
rdr.setContentHandler( string_handler )
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
rdr.parse( "xl/sharedStrings.xml" )
We create the handler, create a parser, and process the shared strings portion of the workbook. When this is done, the handler has a dictionary of all strings. This is string_handler.string_dict. Note that a shelve database could be used if the string dictionary was so epic that it wouldn't fit in memory.
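If the strings ever did get that epic, a minimal sketch of the swap might look like this. Shelve keys must be strings, so a tiny wrapper keeps the integer-key interface the handlers already use; the file name is made up:

import shelve

class StringShelf( object ):
    """Dict-like wrapper over shelve, so integer keys still work."""
    def __init__( self, filename ):
        self.db= shelve.open( filename )
    def __setitem__( self, index, value ):
        self.db[str(index)]= value
    def __getitem__( self, index ):
        return self.db[str(index)]
    def get( self, index, default=None ):
        return self.db.get( str(index), default )

string_handler= GetStrings()
string_handler.string_dict= StringShelf( "sharedStrings.shelve" )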

The Final Countdown

Once we have the shared strings, we can then parse each worksheet, using the share string data to reconstruct a simple CSV file (or JSON document or something more usable).

The Content Handler for the worksheet isn't too complex. We only want cell values, so there's little real subtlety. The biggest issue is coping with the fact that sometimes the content of a tag is reported in multiple parts.

import re

class GetSheetData( xml.sax.handler.ContentHandler ):
    """Locate column values."""
    def __init__( self, string_dict, writer ):
        xml.sax.handler.ContentHandler.__init__(self)
        self.id_pat = re.compile( r"(\D+)(\d+)" )
        self.string_dict= string_dict
        self.context= []
        self.row= {}
        self.buffer= ""
        self.writer= writer
    def path( self ):
        return [ n[1] for n in self.context ]
    def startElement( self, name, attrs ):
        print( "***Non-Namespace Element", name )
    def startElementNS( self, name, qname, attrs ):
        self.context.append( name )
        if name[1] == "row":
            self.row_num = attrs.getValueByQName(u'r')
        elif name[1] == "c":
            if u't' in attrs.getQNames():
                self.cell_type = attrs.getValueByQName(u't')
            else:
                self.cell_type = None # default, not a string
            self.cell_id = attrs.getValueByQName(u'r')
            id_match = self.id_pat.match( self.cell_id )
            self.row_col = self.make_row_col( id_match.groups() )
        elif name[1] == "v":
            self.buffer= "" # Value of a cell
        else:
            pass # might do some debugging here.
    @staticmethod
    def make_row_col( col_row_pair ):
        col = 0
        for c in col_row_pair[0]:
            col = col*26 + (ord(c)-ord("A")+1)
        return int(col_row_pair[1]), col-1
    def endElementNS( self, name, qname ):
        if name[1] == "row":
            # Write the completed row to the CSV result file.
            self.writer.writerow( [ self.row.get(i) for i in xrange(max(self.row.keys())+1) ] )
            self.row= {}
        elif name[1] == "v":
            if self.cell_type is None:
                try:
                    self.value= float( self.buffer )
                except ValueError:
                    print( self.row_num, self.cell_id, self.cell_type, self.buffer )
                    self.value= None
            elif self.cell_type == "s":
                try:
                    self.value= self.string_dict[int(self.buffer)]
                except ValueError:
                    print( self.row_num, self.cell_id, self.cell_type, self.buffer )
                    self.value= None
            elif self.cell_type == "b":
                self.value= bool(self.buffer)
            else:
                print( self.row_num, self.cell_id, self.cell_type, self.buffer, self.string_dict.get(int(self.buffer)) )
                self.value= None
            self.row[self.row_col[1]] = self.value
        while self.context[-1] != name:
            self.context.pop(-1)
        self.context.pop(-1)
    def characters( self, content ):
        self.buffer += content
This class and the shared string handler could be refactored to eliminate a tiny bit of redundancy.

This class does two things. At the end of each "v" tag, it determines what value was found: a number, a boolean value or a shared string. At the end of each "row" tag, it writes the accumulated row to the CSV writer.

This handler is used as follows.
import csv

rdr= xml.sax.make_parser()
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
for s in sheets:
    with open(s+".csv","wb") as result:
        handler= GetSheetData(string_handler.string_dict,csv.writer(result))
        rdr.setContentHandler( handler )
        rdr.parse( s )
This iterates through each sheet, transforming it into a simple .CSV file. Once we have the file in CSV format, it's smaller and simpler. It can easily be processed by follow-on applications.

The overall loop actually looks like this.

sheets= list( get_worksheets(name) )

string_handler= GetStrings()
rdr= xml.sax.make_parser()
rdr.setContentHandler( string_handler )
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
rdr.parse( "xl/sharedStrings.xml" )

rdr= xml.sax.make_parser()
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
for s in sheets:
    with open(s+".csv","wb") as result:
        handler= GetSheetData(string_handler.string_dict,csv.writer(result))
        rdr.setContentHandler( handler )
        rdr.parse( s )
This expands the shared strings and individual sheets. It iterates through the sheets, using the shared strings, to create a bunch of .CSV files from the .XLSM data.

The resulting .CSV -- stripped of the XML overheads -- is 80,000+ rows and only 39M. Also, it can be processed with the Python csv library.

CSV Processing

This, after all, was the goal. Read the CSV file and do some useful work.
def csv_rows(source):
    rdr= csv.reader( source )
    headings = []
    for n, cols in enumerate( rdr ):
        if n < 4:
            if headings:
                headings = [ (top+' '+nxt).strip() for top, nxt in zip( headings, cols ) ]
            else:
                headings = cols
            continue
        yield dict(zip(headings,cols))
We locate the four header rows and build labels from those four rows of data. Given these big, complex headers, we can then build a dictionary from each data row. The resulting structure is exactly like the results of a csv.DictReader, and can be used to do the "real work" of the application.
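A final usage sketch, with a made-up sheet name; each row comes back as a dictionary keyed by the combined heading labels:

with open( "xl/worksheets/sheet1.xml.csv", "rb" ) as source:
    for row in csv_rows( source ):
        # row is a dict; do the "real work" of the application here.
        print( row )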