Thursday, December 30, 2010

pyWeb Literate Programming Tool

I've (finally) updated the pyWeb Literate Programming Tool.

There were feature requests and bug reports. Much to do. Sadly, I'm really slow at doing it.

Top Language Skills

Check out this item on eWeek: Java, C, C++: Top Programming Languages for 2011 - Application Development - News & Reviews - eWeek.com.

The presentation starts with Java, C, C++, C# -- not surprising. These are clearly the most popular programming languages. These seem to be the first choice made by many organizations.
In some cases, it's also the last choice. Many places are simply "All C#" or "All Java" without any further thought. This parallels the "All COBOL" mentality that was so pervasive when I started my career. The "All Singing-All Dancing-All One Language" folks find the most shattering disruptions when their business is eclipsed by competitors with language and platform as a technical edge.

The next tier of languages starts with JavaScript, which is expected. Just about every web site in common use has some JavaScript somewhere. Browsers being what they are, there's really no viable alternative.

Weirdly, Perl is 6th. I say weirdly because the TIOBE Programming Community Index puts Perl much further down the popularity list.

PHP is next. Not surprising.

Visual Basic weighs in above Python. Being above Python is less weird than seeing Perl in 6th place. This position is closer to the TIOBE index. It is distressing to think that VB is still so wildly popular. I'm not sure what VB's strong suit is. C# seems to have every possible advantage over VB. Yet, there it is.

Python and Ruby are the next two. Again, this is more-or-less in the order I expected to see them. This is the second tier of languages: really popular, but not in the same league as Java or one of the innumerable C variants.

After this, they list Objective-C as number 11. This language is tied to Apple's iOS and MacOS platforms, so its popularity (like C# and VB) is driven in part by platform popularity.

Third Tier

Once we get past the top 10 Java/C/C++/C#/Objective C and PHP/Python/Perl/Ruby/Javascript tier, we get into a third realm of languages that are less popular, but still garnering a large community of users.

ActionScript. A little bit surprising. But -- really -- it fills the same client-side niche as JavaScript, so this makes sense. Further, almost all ActionScript-powered pages will also have a little bit of JavaScript to help launch things smoothly.

Now we're into interesting -- "perhaps I should learn this next" -- languages: Groovy, Go, Scala, Erlang, Clojure and F#. Notable by their absence are Haskell, Lua and Lisp. These seem like languages to learn in order to grab the good ideas that make them both popular and distinctive from Java or Python.

Tuesday, December 28, 2010

Amazing Speedup

A library had unit tests that ran for almost 600 seconds. Two small changes dropped the run time to 26 seconds.

I was amazed.

Step 1. I turned on cProfile. I added two functions to the slowest unit test module.

def profile():
    """Profile this module's main() function and save the stats."""
    import cProfile
    cProfile.run( 'main()', 'the_slow_module.prof' )
    report()

def report():
    """Show the 24 most expensive functions, with their callees."""
    import pstats
    p = pstats.Stats( 'the_slow_module.prof' )
    p.sort_stats('time').print_callees(24)
Now I can add profiling or simply review the report. Looking at the "callees" provided some hints as to why a particular method was so slow.

Step 2. I replaced ElementTree with cElementTree (duh.) Everyone should know this. I didn't realize how much this mattered. The trick is to note how much time was spent doing XML parsing. In the case of this unit test suite, it was a LOT of time. In the case of the overall application that uses this library, that won't be true.
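The swap can be hedged with a try/except import, a common idiom of the era. (On Python 3.3+ the plain ElementTree is itself C-accelerated, and cElementTree was removed in 3.9, so on modern interpreters the fallback is the normal path.)

```python
# Prefer the C-accelerated parser, fall back to the pure-Python one.
try:
    import xml.etree.cElementTree as etree
except ImportError:
    import xml.etree.ElementTree as etree

doc = etree.fromstring("<root><child>text</child></root>")
print(doc.find("child").text)  # text
```

Nothing else in the module changes; every use of `etree` picks up the fastest parser available.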

Step 3. The slowest method was assembling a list. It did a lot of list.append() and list.__len__(). It looked approximately like the following.


def something( self ):
    result= []
    for index, value in some_source:
        while len(result)+1 != index:
            result.append( None )
        result.append( SomeClass( value ) )
    return result

This is easily replaced by a generator. The API changes, so every use of this method function may need to be modified to use the generator instead of the list object.


def something_iter( self ):
    counter= 0
    for index, value in some_source:
        while counter+1 != index:
            yield None
            counter += 1
        yield SomeClass( value )
        counter += 1

The generator was significantly faster than list assembly.
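Here's a self-contained sketch of the two shapes, written as standalone functions rather than methods. The sparse `some_source` of (index, value) pairs is made up, and `str.upper` stands in for `SomeClass`:

```python
# Hypothetical sparse source: 1-based indexes with gaps to be None-filled.
some_source = [(1, 'a'), (3, 'b'), (6, 'c')]

def something():
    """List-assembly version: repeated append() and len() calls."""
    result = []
    for index, value in some_source:
        while len(result) + 1 != index:
            result.append(None)
        result.append(value.upper())
    return result

def something_iter():
    """Generator version: a simple counter replaces the len() calls."""
    counter = 0
    for index, value in some_source:
        while counter + 1 != index:
            yield None
            counter += 1
        yield value.upper()
        counter += 1

print(something())             # ['A', None, 'B', None, None, 'C']
print(list(something_iter()))  # ['A', None, 'B', None, None, 'C']
```

Callers that only iterate never pay to materialize the list at all; only a caller that truly needs a list wraps the generator in list().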

Two minor code changes and a significant speed-up.

Thursday, December 23, 2010

The Anti-IF Campaign

Check this out: http://www.antiifcampaign.com/.

I'm totally in favor of reducing complexity. I've seen too many places where a Strategy or some other kind of Delegation design pattern should have been used. Instead a cluster of if-statements was used. Sometimes these if-statements suffer copy-and-paste repetition because someone didn't recognize the design pattern.

What's important is that the if statement -- in general -- isn't the issue. The anti-if folks are simply demanding that folks don't use if as a stand-in for proper polymorphism.

Related Issues

Related to abuse of the if statement is abuse of the else clause.

My pet-peeve is code like this.
if condition1:
    work
elif condition2:
    work
elif condition3:
    work
else:
    what condition applies here?
When the various conditions share common variables, it can be very difficult to deduce the condition that applies for the else clause.

My suggestion is to Avoid Else.

Write it like this.
if condition1:
    work
elif condition2:
    work
elif condition3:
    work
elif not (condition1 or condition2 or condition3):
    work
else:
    raise AssertionError( "Oops. Design Error. Sorry" )

Then you'll know when you've screwed up.
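A runnable version of the pattern, with a made-up rate table standing in for the real conditions:

```python
def shipping_rate(weight):
    """Hypothetical rate table; the conditions and rates are invented."""
    if weight < 1:
        return 5
    elif weight < 10:
        return 9
    elif weight < 70:
        return 20
    elif not (weight < 1 or weight < 10 or weight < 70):
        return 45
    else:
        # Unreachable by design. If a maintainer edits a condition so
        # the branches no longer cover every case, this fails loudly
        # instead of silently doing the wrong work.
        raise AssertionError("Oops. Design Error. Sorry")

print(shipping_rate(0.5))  # 5
print(shipping_rate(50))   # 20
print(shipping_rate(500))  # 45
```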

[Update]

Using an assert coupled with an else clause is a kind of code-golf optimization that doesn't seem to help much. An elif will have the same conditional expression as the assert would have. But the comment did lead to rewriting this to use AssertionError instead of vague, generic Exception.

Tuesday, December 14, 2010

Code Base Fragmentation -- Again

Check this out: "Stupid Template Languages".

Love this: "The biggest annoyance I have with smart template languages (Mako, Genshi, Jinja2, PHP, Perl, ColdFusion, etc) is that you have the capability to mix core business logic with your end views, hence violating the rules of Model-View-Controller architecture."

Yes, too much power in the template leads to code base fragmentation: critical information is not in the applications, but is pushed into the presentation. This also happens with stored procedures and triggers.

I love the questions on Stack Overflow (like this one) asking how to do something super-sophisticated in the Django Template language. And the answer is often "Don't. That's what view functions are for."
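The division of labor can be sketched without Django at all. Here `string.Template` stands in for a real template engine, and the order data is invented; the point is that the view function computes and the template only formats:

```python
from string import Template

def order_view(order_lines):
    """'View function': all the business logic lives here."""
    total = sum(qty * price for qty, price in order_lines)
    return {'count': len(order_lines), 'total': '%.2f' % total}

# The "template" has no logic in it at all -- just placeholders.
page = Template("$count lines, total $$ $total")

context = order_view([(2, 1.50), (1, 4.00)])
print(page.substitute(context))  # 2 lines, total $ 7.00
```

If the total needs tax, discounts, or currency rules, that change lands in order_view -- testable Python -- not in template markup.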

Thursday, December 9, 2010

The Wrapper vs. Library vs. Aspect Problem

Imagine that we've got a collection of applications used by customers to provide data, a collection of applications we use to collect data from vendors. We've got a third collection of analytical tools.

Currently, they share a common database, but the focus, use cases, and interfaces are different.

Okay so far? Three closely-related groups or families of applications.

We need to introduce a new cross-cutting capability. Let's imagine that it's something central like using celery to manage long-running batch jobs. Clearly, we don't want to just hack celery features into all three families of applications. Do we?

Choices

It appears that we have three choices.
  1. A "wrapper" application that unifies all the application families and provides a new central application. Responsibilities shift to the new application.
  2. A site-specific library that layers some common features so that our various families of applications can be more consistent. This involves less of a responsibility shift.
  3. An "aspect" via Aspect-Oriented programming techniques. Perhaps some additional decorators added to the various applications to make them use the new functionality in a consistent way.
Lessons Learned

Adding a new application to be an overall wrapper turned out to be a bad idea. After implementing it, it was difficult to extend. We had two dimensions of extension.
  1. The workflows in the "wrapper" application needed constant tweaking as the other applications evolved. Every time we wanted to add a step, we had to update the real application and also update the wrapper. Python has a lot of introspection, but these aren't technical changes, these are user-visible workflow changes.
  2. Introducing new data types and file formats was painful. The responsibility for this is effectively split between the wrapper and the underlying applications. The wrapper merely serves to dilute the responsibilities.
Libraries/Aspects

It appears that new common features are almost always new aspects of existing applications.

What makes this realization painful is the process of retrofitting a supporting library into multiple, existing applications. It seems like a lot of cut-and-paste to add the new import statements, add the new decorators and lines of code. However, it's a pervasive change. The point is to add the common decorator in all the right places.

Trying to "finesse" a pervasive change by introducing a higher-level wrapper isn't a very good idea.

A pervasive change is simply a lot of changes and regression tests. Okay, I'm over it.

Tuesday, December 7, 2010

Intuition and Experience

First, read EWD800.

It has harsh things to say about relying on intuition in programming.

Stack Overflow is full of questions where someone takes their experience with one language and applies it incorrectly and inappropriately to another language.

I get email, periodically, also on this subject. I got one recently on the question of "cast", "coercion" and "conversion" which I found incomprehensible for a long time. I had to reread EWD800 to realize that someone was relying on some sort of vague intuition; it appears that they were desperate to map Java (or C++) concepts on Python.

Casting

In my Python 2.6 book, I use the word "cast" exactly twice. In the same paragraph. Here it is.
This also means the "casting" an object to match the declared type
of a variable isn't meaningful in Python. You don't use C++ or Java-style
casting.
I thought that would be enough information to close the subject. I guess not. It appears that some folks have some intuition about type casting that they need to see reflected in other languages, no matter how inappropriate the concept is.

The email asked for a "a nice summary with a simple specific example to hit the point home."
It's quite hard to provide an example of something that doesn't exist. But, I guess, intuition provides a strong incentive to see things which aren't there. I'm not sure how to word it more strongly or clearly. I hate to devolve into blow-by-blow comparison between languages because there are concepts that don't map. I'll work on being more forceful on casting.

Coercion

The words coercion (and coerce) occur more often, since they're sensible Python concepts. After all, Python 2 has formal type coercion rules. See "Coercion Rules". I guess my summary ("Section 3.4.8 of the Python Language Reference covers this in more detail; along with the caveat that the Python 2 rules have gotten too complex.") wasn't detailed or explicit enough.

The relevant quote from the Language manual is this: "As the language has evolved, the coercion rules have become hard to document precisely; documenting what one version of one particular implementation does is undesirable. Instead, here are some informal guidelines regarding coercion. In Python 3.0, coercion will not be supported."

I guess I could provide examples of coercion. However, the fact that it is going to be expunged from the language seems to indicate that it isn't deeply relevant. It appears that some readers have an intuition about coercion that requires some kind of additional details. I guess I have to include the entire quote to dissuade people from relying on their intuition regarding coercion.

Further, the request for "a nice summary with a simple specific example to hit the point home" doesn't fit well with something that -- in the long run -- is going to be removed. Maybe I'm wrong, but omitting examples entirely seemed to hit the point home.

Conversion

Conversion gets its own section, since it's sensible in a Python context. I kind of thought that a whole section on conversion would cement the concepts. Indeed, there are (IMO) too many examples of conversions in the conversion section. But I guess that showing all of the numeric conversions somehow wasn't enough. I have certainly failed at least one reader. However, I can't imagine what more could be helpful, since it is -- essentially -- an exhaustive enumeration of all conversions for all built-in numeric types.

What I'm guessing is that (a) there's some lurking intuition and (b) Python doesn't match that intuition. Hence the question -- in spite of exhaustively enumerating the conversions. I'm not sure what more can be done to make the concept clear.

It appears that all those examples weren't "nice", "simple" or "specific" enough. Okay. I'll work on that.

Thursday, December 2, 2010

More Open Source and More Agile News

ComputerWorld, November 22, 2010, has this: "Open Source Grows Up". The news of the weird is "It's clear that open-source software has moved beyond the zealotry phase." I wasn't aware this phase existed. I hope to see the project plan with "zealotry" in it.

The real news is "More than two-thirds (69%) of the respondents said they expect to increase their investments in open source." That's cool.

Be sure to read the sidebar "Many Enterprises Aren't Giving Back." There's still a lot of concern over intellectual property. I've seen a lot of corporate software -- it's not that good. Most companies that are wringing their hands over losing control of their trade secrets should really be wringing their hands because their in-house software won't measure up to open-source standards.

I like this other quote: 'Five years ago, the South Carolina government was "considering writing a policy to prohibit or at least 'control' open source".' I like the "Must Control Open Source" feeling that IT leadership has. Without this mysterious "control", the organization could be swamped by software it didn't write. How's that different from being swamped by software products that involve contracts and fees? And requires Patch Tuesday?

Agility

SD Times has two articles on Agile methods. Both on the front page of a print publication. That's how you know the technique has "arrived".

First, there's "VersionOne survey finds agile knowledge and use on the rise". My favorite quote: "Interestingly, management support, the ability to change organizational culture and general resistance to change, remained at the forefronts of participants’ minds when indicating barriers to further agile adoption." I like the management barriers. I like it when management tries to exert more 'control' over a process (like software creation) that's so poorly understood.

Here's the companion piece, "For agile success, leaders must let teams loose". This is all good advice. Particularly, this: '"It’s hard to not command and control, but leadership is not about managing work. It’s about creating a capable organization that can manage work," [Rick Simmons] added.'

If you're micro-managing, you're not building an organization. Excellent advice. However, tell that to the financial control crowd.

Budgets and "Control"

Finally, be sure to read this by Frank Hayes in ComputerWorld: "Big Projects, Done Small". Here are the relevant quotes: "The logical conclusion: We should break up all IT projects into sub-million-dollar pieces." "The political reality: Everybody wants multimillion-dollar behemoths." "...huge projects get big political support."

In short, Agile is the right thing to do until you're trying to get approval. Bottom line: use Agile methods. But for purposes of pandering to executives who want to see large numbers with lots of zeroes, it's often necessary to write giant project "plans" that you don't actually use.

Go ahead, write waterfall plans. Don't feel guilty or conflicted. Some folks won't catch up with Agility because they think "Control" is better. Pander to them. It's okay.

Tuesday, November 30, 2010

Questions, or, How to Ask For Help

Half the fun on Stack Overflow is the endless use of closed-ended questions. "Can I do this in Python?" being so common and so hilarious.

The answer is "Yes." You can do it.

Perhaps that's not the question they really meant to ask.

See "Open versus Closed Ended Questions" for a great list of examples.

Closed-Ended Questions have short answers, essentially yes or no. Leading Questions and presuming questions are common variations on this theme. A closed-ended question is sometimes called "dichotomous" because there are only two choices. They can also be called "saturated", possibly because all the possible answers are laid out in the question.

Asking Questions

The most important part about asking questions is to go through a few steps of preparation.
  1. Search. Use Google, use the Stack Overflow search. A huge number of people seem to bang questions into Stack Overflow without taking the time to see if it's been asked (and answered) already.
  2. Define Your Goal. Seriously. Write down your objective. In words. Be sure the goal includes an active-voice verb -- something you want to be able to do. If you want to be able to write code, write down the words "I want to write code for [X]". If you want to be able to tell the difference between two nearly identical things, write down the words "I want to distinguish [Y] from [Z]". When in doubt, use active voice verbs to write down the thing you want to do. Focus on actions you want to take.
  3. Frame Your Question. Rewrite your goal into a sentence by changing the fewest words. 90% of the time, you'll switch "I want to" to "How do I". The rest of the time, you'll have to think for a moment because your goal didn't make sense. If your goal is not an active-voice verb phrase (something you want to do) then you'll have trouble with the rewrite.
In some cases, folks will skip one or more steps. Hilarity Ensues.

Leading/Presuming Questions

Another form of closed-ended question is the veiled complaint. "Why doesn't Python do [X] the way Perl/PHP/Haskell/Java/C# does it?"

Essentially, this is "my favorite other language has a feature Python is missing." The question boils down to, "Why is [Y] not like [Z]?" Often it's qualified by some feature, but the question is the same: "Regarding [X], why is Python not like language [Z]?"

The answer is "Because they're different." The two languages are not the same, that's why there's a difference.

This leads to "probing" questions of no real value. "Why did Python designers decide to leave out [X]" and other variants on this theme.

If the answer was "Because they're evil gnomes" what does it matter? If the answer was "because it's inefficient" how does that help? Feature [X] is still missing, and all the "why?" questions won't really help add it back into the language.

It's possible that there's a legitimate question hidden under the invective. It might be "How do I implement [X] in Python? For examples, see Perl/PHP/Haskell/Java/C#." Notice that this question is transformed into an active-voice verb: "implement".

If we look at the three-step question approach above, there's no active-voice verb behind a "why question". What you "know" isn't really all that easy to provide answers for. Knowledge is simply hard to provide. Questions about what you want to do, are much, much easier to answer.

Probing/Confirming Questions

One other category are the "questions" that post a pile of details looking for confirmation. There are three common variations.
  • tl;dr. The wealth of detail was overwhelming. I'm a big fan of the "detail beat-down". It seems like some folks don't need to summarize. There appear to be people with massive brains that don't need models, abstractions or summaries, but are perfectly capable of coping with endless details. It would be helpful if these folks could "write down" to those of us with small brains who need summaries.
  • No question at all, or the question is a closed-ended "Do you agree?" An answer of "No." is probably not what they wanted. But what can you do? That's all they asked for.
  • Sometimes the question is "Any comments?" This often stems from having no clear goal. Generally, if you've done a lot of research and you simply want confirmation, there's no question there. If you've got doubts, that means you need to do something to correct the problems.
Here's what is really important with tl;dr questions: What do you want to do?

80% of the time, it's "Fix my big, complex tl;dr proposal to correct problem [X]." [X] could be "security" or "deadlock" or "patent infringement" or "cost overrun" or "testability".

Here's how to adjust this question from something difficult to answer to something good.

You want to know if your tl;dr proposal has problem [X]. You're really looking for confirmation that your tl;dr proposal is free from problem [X]. This is something you want to know -- but knowledge is not a great goal. It's too hard to guess what you don't know; lots of answers can provide almost the right information.

Reframe your goal: drop knowledge and switch to action. What do you want to do? You want to show that your tl;dr proposal is free from problem [X]. So ask that: "How do I show my tl;dr proposal is free from problem [X]?"

Once you write that down, you now have to focus your tl;dr proposal to get the answer to this question. In many cases, you can pare things down to some relevant parts that can be shown to be free from problem [X]. In most cases, you'll uncover the problem on your own. In other cases, you've got a good open-ended question to start a useful conversation that will give you something you can do.

Tuesday, November 23, 2010

Open-Source, moving from "when" to "how"

Interesting item in the November 1 eWeek: "Open-Source Software in the Enterprise".

Here's the key quote: "rather than asking if or when, organizations are increasingly focusing on how".

Interestingly, the article then goes on to talk about licensing and intellectual property management. I suppose those count, but they're fringe issues, only relevant to lawyers.

Here are the two real issues:
  1. Configuration Management
  2. Quality Assurance
Many organizations do things so poorly that open source software is unusable.

Configuration Management

Many organizations have non-existent or very primitive CM. They may have some source code control and some change management. But the configuration of the test and production technology stacks are absolutely mystifying. No one can positively say what versions of what products are in production or in test.

The funniest conversations center on the interconnectedness of open source projects. You don't just take a library and plug it in. It's not like neatly-stacked laundry, all washed and folded and ready to be used. Open Source software is more like a dryer full of a tangled collection of stuff that's tied in knots and suffers from major static cling.

"How do we upgrade [X]"? You don't simply replace a component. You create a new tech stack with the upgraded [X] and all of the stuff that's knotted together with [X].

Changing from Python 2.5 to 2.6 changes any binary-compiled libraries like PIL or MySQL_python, mod_wsgi, etc. These, in turn, may require OS library upgrades.

A tech stack must be a hallowed thing. Someone must actively manage change to be sure the stacks are complete and consistent across the enterprise.

Quality Assurance

Many organizations have very weak QA. They have an organization, but it has no authority, and developers are permitted to run rough-shod over QA any time they use the magic words "the users demand it".

The truly funny conversations center on how the organization can be sure that open source software works, or is free of hidden malware. I've been asked how a client can vet an open source package to be sure that it is malware-free. As if the client's Windows PCs are pristine works of art and the Apache POI project is just a logic bomb.

The idea that you might do acceptance testing on open source software always seems foreign to everyone involved. You test your in-house software. Why not test the downloaded software? Indeed, why not test commercial software for which you pay fees? Why does QA only seem to apply to in-house software?

Goals vs. Directions

I think one other thing that's endlessly confusing is "Architecture is a Direction not a Goal." I get the feeling that many organizations strive for a crazy level of stability where everything is fixed, unchanging and completely static (except for patches.)

The idea that we have systems on a new tech stack and systems on an old tech stack seems to lead to angry words and stalled projects. However, there's really no sensible alternative.

We have tech stack [X.1], [X.2] and [X.3] running in production. We have [X.4] in final quality assurance testing. We have [X.5] in development. The legacy servers running version 1 won't be upgraded, they'll be retired. The legacy servers running version 2 may be upgraded, depending on the value of the new features vs. the cost of upgrading. The data in the version 3 servers will be migrated to the version 4, and the old servers retired.

It can be complex. The architecture is a direction in which most (but not all) servers are heading. The architecture changes, and some servers catch up to the golden ideal and some servers never catch up. Sometimes the upgrade doesn't create enough value.

These are "how" questions that are more important than studying the various licensing provisions.

Thursday, November 18, 2010

Software Patents

Here's an interesting news item: "Red Hat’s Secret Patent Deal and the Fate of JBoss Developers".

Here's an ancient -- but still relevant -- piece from Tim O'Reilly: "Software and Business Method Patents".

Here's a great article in Slate on the consequences of software patents. "Weapons of Business Destruction: How a tiny little 'patent troll' got BlackBerry in a headlock".

The biggest issue with software patents is always the "non-obvious" issue. Generally, this can be debated, so a prior art review is far more valuable.


To participate, see Peer To Patent. Locate prior art and make patent trolls get real jobs.


Thursday, November 11, 2010

Hadoop and SQL/Relational Hegemony

Here's a nice article on why Facebook, Yahoo and eBay use Hadoop: "Asking Any Question Of All Your Data".

The article has one tiny element of pandering to the SQL hegemonists.

Yes, it sounds like a conspiracy theory, but it seems like there really are folks who will tell you that the relational database is effectively perfect for all data processing and should not be questioned. To bolster their point, they often have to conflate all data processing into one amorphous void. Relational transactions aren't central to all processing, just certain elements of data processing. There, I said it.

Here's the pandering quote: "But this only works if the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner".

What?

Is he saying that relational databases do not impose the same constraint?

Clearly, the RDBMS has the same "catch". The relational database only works if "...the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner."

Pandering? Really?

Here's why it seems like the article is pandering. Because it worked. It totally appealed to the target audience. I saw this piece because a DBA -- a card-carrying member of the SQL Hegemony cabal -- sent me the link, and highlighted two things. The DBA highlighted the "powerful enough" quote.

As if to say, "See, it won't happen any time soon, Hadoop is too resource intensive to displace the RDBMS."

Which appears to assume that the RDBMS isn't resource intensive.

Further, the DBA had to add the following. "The other catch which is not stated is the skill level required of the people doing the work."

As if to say, "It won't happen any time soon, ordinary programmers can't understand it."

Which appears to assume that ordinary programmers totally understand SQL and the relational model. If they did understand SQL and the relational model perfectly, why would we have DBA's? Why would we have performance tuning? Why would we have DBA's adjusting normalization to correct application design problems?

Weaknesses

So the weaknesses of Hadoop are that it (a) demands resources and (b) requires specialized skills. Okay. But isn't that the exact same weakness as the relational database?

Which causes me to ask why an article like this has to pander to the SQL cabal by suggesting that Hadoop requires a big compute engine? Or is this just my own conspiracy theory?

Tuesday, November 9, 2010

Data Mapping and Conversion Tools -- Sigh

Yes, ETL is interesting and important.

But creating a home-brewed data mapping and conversion tool isn't interesting or important. Indeed, it's just an attractive nuisance. Sure, it's fun, but it isn't valuable work. The world doesn't need another ETL tool.

The core problem is talking management (and other developers) into a change of course. How do we stop development of Yet Another ETL Tool (YAETLT)?

First, there's products like Talend, CloverETL and Pentaho open source data integration. Open Source. ETL. Done.

Then, there's this list of Open Source ETL products on the Manageability blog. This list is all Java, but there's nothing wrong with Java. There are a lot of jumping-off points in this list. Most importantly, the world doesn't need another ETL tool.

Here's a piece on Open Source BI, just to drive the point home.

Business Rules

The ETL tools must have rules. Either simple field alignment or more complex transformations. The rules can either be interpreted ("engine-based" ETL) or used to build a stand-alone program ("code-generating" ETL).

The engine-based ETL, when written in Java, is creepy. We have a JVM running a Java app. The Java app is an interpreter for a bunch of ETL rules. Two levels of interpreter. Why?

Code-generating ETL, OTOH, is a huge pain in the neck because you have to produce reasonably portable code. In Java, that's hard. Your rules are used to build Java code; the resulting Java code can be compiled and run. And it's often very efficient. [Commercial products often produce portable C (or COBOL) so that they can be very efficient. That's really hard to do well.]

Code-generating, BTW, has an additional complication. Bad Behavior. Folks often tweak the resulting code. Either because the tool wasn't able to generate all the proper nuances, or because the tool-generated code was inefficient in a way that's so grotesque that it couldn't be fixed by an optimizing compiler. It happens that we can have rules that run afoul of the boilerplate loops.

Old-School Architecture

First, we need to focus on the "TL" part of ETL. Our applications receive files from our customers. We don't do the extract -- they do. This means that each file we receive has a unique and distinctive "feature". We have a clear SoW and examples. That doesn't help. Each file is an experiment in novel data formatting and Semantic Heterogeneity.

A common old-school design pattern for this could be called "The ETL Two-Step". This design breaks the processing into "T" and "L" operations. There are lots of unique, simple, "T" options, one per distinctive file format. The output from "T" is a standardized file. A simple, standardized "L" loads the database from the standardized file.

Indeed, if you follow the ETL Two-Step carefully, you don't need to actually write the "L" pass at all. You prepare files which your RDBMS utilities can simply load. So the ETL boils down to "simple" transformation from input file to output file.
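The "L" pass then reduces to the RDBMS's own bulk loader. As a stand-in sketch using the stdlib sqlite3 (the table name, column names and the sample data are invented for illustration; a real warehouse would use its native loader such as COPY or SQL*Loader):

```python
import csv
import io
import sqlite3

# A standardized file, as produced by a "T" pass (invented sample data).
standardized = io.StringIO("column1,column2\na,1\nb,2\n")

# The "L" pass: nothing but a bulk insert of the standardized rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE target (column1 TEXT, column2 INTEGER)")
rows = csv.DictReader(standardized)
db.executemany(
    "INSERT INTO target (column1, column2) VALUES (:column1, :column2)",
    rows,
)
db.commit()
count = db.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

The point is that the load step contains no transformation logic at all; a quick count-and-totals audit confirms the file arrived intact.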

Folks working on YAETLT have to focus on just the "T" step. Indeed, they should be writing Yet Another Transformation Tool (YATT) instead of YAETLT.

Enter the Python

If all we're doing is moving data around, what's involved?

import csv
result = {
    'column1': None,
    'column2': None,
    # etc.
}
with open("source", "rb") as source:
    rdr = csv.DictReader(source)
    with open("target", "wb") as target:
        wtr = csv.DictWriter(target, result.keys())
        for row in rdr:
            result['column1'] = row['some_column']
            result['column2'] = some_func(row['some_column'])
            # etc.
            wtr.writerow(result)

That's really about it. There appear to be 6 or 7 lines of overhead. The rest is working code.

But let's not be too dismissive of the overhead. An ETL depends on the file format, summarized in the import statement. With a little care we can produce libraries similar to Python's csv that work with XLS directly, as well as XLSX and other formats. Dealing with COBOL-style fixed layout files can also be boiled down to an importable module. The import isn't overhead; it's a central part of the rules.
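A COBOL-style fixed-layout reader can be sketched as a csv.DictReader look-alike. The layout tuples, field names and sample records below are invented for illustration:

```python
# Sketch of an importable fixed-layout module: iterate over fixed-width
# records as dictionaries, the same way csv.DictReader iterates over rows.
class FixedReader:
    """Iterate over fixed-width records as dictionaries."""
    def __init__(self, source, layout):
        # layout: [(name, start, end), ...] character positions per field
        self.source = source
        self.layout = layout
    def __iter__(self):
        for line in self.source:
            yield {
                name: line[start:end].strip()
                for name, start, end in self.layout
            }

# Invented layout and data for illustration.
layout = [("name", 0, 10), ("amount", 10, 16)]
lines = ["Smith     001250", "Jones     000300"]
rows = list(FixedReader(lines, layout))
```

The import of such a module carries the file-format rules, exactly as the import of csv does.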

The file open functions could be seen as overhead. Do we really need a full line of code when we could -- more easily -- read from stdin and write to stdout? If we're willing to endure the inefficiency of processing one input file multiple times to create several standardized outputs, then we could eliminate the two with statements. If, however, we have to merge several input files to create a standardized output file, the one-in-one-out model breaks down and we need the with statements and the open functions.
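If we adopt the one-in-one-out model, the two with statements disappear and the transform reads any open source and writes any open target. A minimal sketch (column names and the upper-case stand-in for some_func are invented):

```python
import csv
import io

# Sketch of the stdin-to-stdout model. In production this would be wired
# as transform(sys.stdin, sys.stdout); here we demonstrate with in-memory
# files so the sketch is self-contained.
def transform(source, target):
    rdr = csv.DictReader(source)
    wtr = csv.DictWriter(target, ["column1", "column2"])
    wtr.writeheader()
    for row in rdr:
        wtr.writerow({
            "column1": row["some_column"],
            "column2": row["some_column"].upper(),  # stand-in for some_func
        })

out = io.StringIO()
transform(io.StringIO("some_column\nabc\n"), out)
```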

The for statement could be seen as needless overhead. It goes without saying that we're processing the entire input file. Unless, of course, we're merging several files. Then it's not a simple loop that can be implied.

It's Just Code

The point of Python-based ETL is that the problem "solved" by YATT isn't that interesting. Python is an excellent ETL transformation engine. Rather than write a fancy rule interpreter, just write Python. Done.

We don't need a higher-level data transformation engine written in Java. Emit simple Python code and use the Python engine. (We could try to emit Java code, but it's not as simple and requires a rather complex supporting library. Python's Duck Typing simplifies the supporting library.)

If we don't write a new transformation engine, but use Python, that leaves a tiny space left over for the YATT: producing the ETL rules in Python notation. Rather than waste time writing another engine, the YATT developers could create a GUI that drags and drops column names to write the assignment statements in the body of the loop.

That's right, the easiest part of the Python loop is what we can automate. Indeed, that's about all we can automate. Everything else requires complex coding that can't be built as "drag-and-drop" functionality.
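A sketch of that GUI's entire code-generation job. The mapping, column names and sample row are invented; the only thing being generated is the body of the loop:

```python
# A drag-and-drop mapping captured by a hypothetical YATT GUI:
# target column -> source column.
mapping = {
    "column1": "some_column",
    "column2": "other_column",
}

def generate_body(mapping):
    """Emit the assignment statements for the body of the loop."""
    lines = []
    for target_col, source_col in sorted(mapping.items()):
        lines.append("result[{!r}] = row[{!r}]".format(target_col, source_col))
    return "\n".join(lines)

body = generate_body(mapping)

# The generated statements are ordinary Python; exercise them once.
row = {"some_column": 1, "other_column": 2}
result = {}
exec(body)
```

Everything outside those assignments is the boilerplate already shown above; there's nothing else for the tool to write.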

Transformations

There are several standard transformations.
  • Column order or name changes. Trivial assignment statements handle this.
  • Mapping functions. Some simple (no hysteresis, idempotent) function is applied to one or more columns to produce one or more columns. This can be as simple as a data type conversion, or a complex calculation.
  • Filter. Some simple function is used to include or exclude rows.
  • Reduction. Some summary (sum, count, min, max, etc.) is applied to a collection of input rows to create output rows. This is an ideal spot for Python generator functions. But there's rarely a simple drag-n-drop for these kinds of transformations.
  • Split. One file comes in, two go out. This breaks the stdin-to-stdout assumption.
  • Merge. Two go in, one comes out. This breaks the stdin-to-stdout assumption, also. Further, the matching can be of several forms. There's the multi-file merge when several similarly large files are involved. There's the lookup merge when a large file is merged with smaller files. Merging also applies to doing key lookups required to match natural keys to locate database FK's.
  • Normalization (or Distinct Processing). This is a more subtle form of filter because the function isn't idempotent; it depends on the state of a database or output file. We include the first of many identical items; we exclude the subsequent copies. This is also an ideal place for Python generator functions.
Of these, only the first three are candidates for drag-and-drop. And for mapping and filtering, we either need to write code or have a huge library of pre-built mapping and filtering functions.
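The filter and normalization transformations above, sketched as Python generator functions (the row structure and key functions are invented for illustration):

```python
def filter_rows(rows, predicate):
    """Filter: include only rows passing the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def distinct(rows, key):
    """Normalization: keep the first of many identical items,
    exclude the subsequent copies."""
    seen = set()
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            yield row

rows = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 10},
    {"id": 2, "amount": -3},
]
kept = list(filter_rows(rows, lambda r: r["amount"] > 0))
unique = list(distinct(rows, key=lambda r: r["id"]))
total = sum(r["amount"] for r in unique)  # a simple reduction
```

The distinct generator carries state (the seen set) from row to row, which is exactly why it isn't drag-and-drop material.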

Problems and Solutions

The YATT problem has two parts. Creating the rules and executing the rules.

Writing another engine to execute the rules is a bad idea. Just generate Python code. It's a delightfully simple language for describing data transformation. It already works.

Writing a tool to create rules is a bad idea. Just write the Python code and call it the rule set. Easy to maintain. Easy to test. Clear, complete, precise.

Thursday, November 4, 2010

Pythonic vs. "Clean"

This provokes thought: "Pythonic".

Why does Python have a "Pythonic" style? Why not "clean"?

Is it these lines from Tim Peters' "The Zen of Python" (a/k/a import this)?
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Perhaps having a PEP 8, a BDFL (and FLUFL) means that there's a certain "pressure" to conform?

Or do we have higher standards than other languages? Or less intellectual diversity?

I think that "pythonic" is just a catchy phrase that rolls off the tongue. I think a similar concept exists in all languages, but there isn't a good phrase for it in most other languages. Although Ned Batchelder has some really good suggestions. (Except for C++, which should be "C-Posh-Posh" for really good coding style.)

History

When I was a COBOL programmer, there were two buzz-phrases used. "Clean" and "Structured". Clean was poorly-defined and really just a kind of cultural norm. In those days, each shop had a different opinion of "clean" and the lack of widespread connectivity meant that each shop had a more-or-less unique style. Indeed, as a traveling consultant, I helped refine and adjust those standards because of the wide variety of code I saw in my travels.

"Structured" is pretty much an absolute. Each GOTO-like thing had to be reworked as properly nested IFs or PERFORMs. No real issue there. Except from folks who argued that "Structured" was slower than non-Structured. A load of malarkey, but one I heard remarkably often.

When I was a Fortran (and Ada) programmer, I worked for the military in which there were simply absolute standards for every feature of the source code. Boring. And no catchy buzz-word. Just "Compliant" or "Wrong".

Since it was the early '90's (and we were sequestered) we didn't have much Internet access. Once in a while we'd have internal discussions on style where the details weren't covered by any standard. Not surprisingly, they amounted to "Code Golf" questions. Ada has to be perfectly clear, which can be verbose, and some folks don't like clarity.

When I became a C programmer, I found Thomas Plum's Reliable Data Structures in C. That provided a really good set of standards. The buzzword I used was "Reliable".

The problem with C programming is that "Clean" and "Code Golf" get conflated all the time. Folks write the craziest crap, claim it's "clean" and ignore the resulting obscurity. Sigh. I wish folks would stick with "Reliable" or "Maintainable" rather than "Clean".

While doing Perl programming I noticed that some folks didn't seem to realize the golden rule.
No One Wins At Code Golf
I don't know why. Other than to think that some folks felt that Perl programs weren't "real" programs. They were just "scripts" and could be treated with a casual contempt.

When I learned Java, I noted that an objective was to have a syntax that was familiar. It was a stated goal to have the Java style guidelines completely overlap with C and C++ style guidelines. Fair enough. Doesn't solve the "Code Golf" vs. "Clean" problem. But it doesn't confound it with another syntax, either.

Python

From this history, I think that "Pythonic" exists because we have a BDFL with high standards.

Tuesday, November 2, 2010

"Might Be Misleading" is misleading

My books (Building Skills in Programming, Building Skills in Python and Building Skills in OO Design) develop a steady stream of email. [Also, as a side note, I need to move them to the me.com server; Apple is decommissioning the homepage.mac.com domain.]

The mail falls into several buckets.

Thanks. Always a delight. Keep 'em coming.

Suggestions. These are suggestions for new topics. Recently, I've had a few requests for Python 3 coverage. I'm working with a publisher on this, and hope -- before too long -- to have real news.

Corrections. I get a lot of these. A lot. Keep 'em coming. I didn't pay a copy editor; I tried to do it myself. It's hard and I did a poor job. More valuable than spelling corrections are technical corrections. (I'm happy to report that I don't get as many of these.) Technical corrections are the most serious kind of correction and I try to fix those as quickly as possible.

Source Code Requests. No. I don't supply any source. If I send you the source, what skill did you build? Asking for source? Not a skill that has much value, IMHO. If you want to learn to program, you have to create the source yourself. That is the job. Sorry for making you work, but you have to actually do the work. There's no royal road to programming.

The "Other" Bucket

I get some emails that I file under "other" because they're so funny. They have the following boilerplate.

"Code fragment [X] might be misleading because [Y]."

First, it's a complaint, not a question. That's not very helpful. That's just complaining. Without a suggested improvement, it's the worst kind of bare negativity.

The best part is that — without exception — the person sending the email was not misled. They correctly understood the code examples.

Clearly, the issue isn't that the code is "misleading" in the usual sense of "lying" or "mendacious". If it was actually misleading, then (a) they wouldn't have understood it and (b) there'd be a proper question instead of a complaint.

Since they correctly understood it, what's misleading?

User Interface Reviews

In software development, we used to go through the "misleading" crap-ola in user interface reviews. In non-Agile ("waterfall") development, we have to get every nuance, every aspect, every feature of the UI completely specified before we can move on. Everyone has to hand-wring over every word, every font choice, field order, button placement, blah, blah and blah.

It seems like 80% of the comments are "label [X] might be misleading". The least useful comment, of course, is this sort of comment with no suggestion. The least useful reviewer is the person who (1) provides a negative comment and, when asked for an improvement, (2) calls a meeting of random people to come up with replacement text.

[Hint: If you eventually understood the misleading label, please state your understanding in your own words. Often, hilarity ensues when their stated understanding cycles back to the original label.]

The "label [X] might be misleading" comment is — perhaps — the most expensive nuisance comment ever. Projects wind up spinning through warrens of rat-holes chasing down some verbiage that is acceptable. After all, you can't go over the waterfall until the entire UI is specified, right?

Worse, of course, the best sales people do not interpose themselves into the sales process. They connect prospective customers with products (or services). Really excellent sales people can have trouble making suggestions. Their transparency is what makes them good. It's not sensible demanding suggestions from them.

Underneath a "Might Be Misleading" comment, the person complaining completely understood the label. They were not actually misled at all. If it were misleading, then (a) they wouldn't have understood it and (b) there'd be a proper question instead of a complaint.

Thank goodness for Agile product owners who can discard the bad kind of negativity. The right thing to do is put a UI in front of more than one user and bypass the negativity with a consensus that the UI actually is usable and isn't really misleading.

Might Be Misleading

The "Might be Misleading" comments are often code-speak for "I don't like it because..." And the reason why is often "because I had to think." I know that thinking is bad.

I understand that Krug's famous Don't Make me Think is the benchmark in usability. And I totally agree that some thinking is bad.

There are two levels of thinking.
  • Thinking about the problem.
  • Thinking about the UI and how the UI models the problem.
Krug's advice is clear. Don't make users think about the UI and how the UI models the problem. Users still have to think about the problem itself.

In the case of UI labels which "Might Be Misleading", we have to figure out if it's the problem or the UI that folks are complaining about. In many cases, parts of the problem are actually hard and no amount of UI fixup can ever make the problem easier.

Not Completely Accurate

One of the most common UI label complaints is that the label isn't "completely" accurate. They seem to object to the fact that a UI label can only contain a few words and they have to actually understand the few words. I assume that folks who complain about UI labels also complain about light switches having just "on" and "off" as labels. Those labels aren't "completely" accurate. It should say "power on". Indeed it should say "110V AC power connected". Indeed it should say "110V AC power connected through load". Indeed it should say "110V AC 15 A power connected via circuit labeled #4 through load with ground".

Apparently this is news. Labels are Summaries.

No label can be "completely" accurate. You heard it here first. Now that you've been notified, you can stop complaining about labels which "might be misleading because they're not completely accurate." They can't be "completely" accurate unless the label recapitulates the entire problem domain description and all source code leading to the value.

Apologies

In too many cases of "Might Be Misleading," people are really complaining that they don't like the UI label (or the code example) because the problem itself is hard. I'm sympathetic that the problem domain is hard and requires thinking.

Please, however, don't complain about what "Might Be Misleading". Please try to focus on "Actually Is Misleading."

Before complaining, please clarify your understanding.

Here's the rule. If you eventually understood it, it may be that the problem itself is hard. If the problem is hard, fixing the label isn't going to help, is it?

If the problem is hard, you have to think. Some days are like that. The UI designer and I apologize for making you think. Can we move on now?

If the label (or example) really is wrong, and you can correct it, that's a good thing. Figure out what is actually misleading. Supply the correction. Try to escalate "Might Be Misleading" to "Actually Misled Someone". Specifics matter.

Also, please remember that labels are summaries. At some point, details must be elided. If you have trouble with the concept of "summary", you can do this. (1) Write down all the details that you understand. Omit nothing. (2) Rank the details in order of importance. (3) Delete words to pare the description down to an appropriate length to fit in the UI. When you're done, you have a suggestion.

Tuesday, October 26, 2010

Python and the "Syntactic Whitespace Problem"

Check out this list of questions on Stack Overflow:

http://stackoverflow.com/search?q=%5Bpython%5D+whitespace+syntax

About 10% of these are really just complaints about Python's syntax. Almost every Stack Overflow question on Python's use of syntactic whitespace is really just a complaint.

Here's today's example: "Python without whitespace requirements".

Here's the money quote: "I could potentially be interested in learning Python but the whitespace restrictions are an absolute no-go for me."

Here's the reality.

Everyone Indents Correctly All The Time In All Languages.

Everyone. All the time. Always.

It's amazing how well, and how carefully, people indent code. Not Python code.

All Code. XML. HTML. CSS. Java. C++. SQL. All Code.

Everyone indents. And they always indent correctly. It's truly amazing how well people indent. In particular, when the syntax doesn't require any indentation, they still indent beautifully.

Consider this snippet of C code.

if( a == 0 )
    printf( "a is zero" );
    r = 1;
else
    printf( "a is non-zero" );
    r = a % 2;


Over the last few decades, I've probably spent a complete man-year reading code like that and trying to figure out why it doesn't work. It's not easy to debug.

The indentation completely and accurately reflects the programmer's intention. Everyone gets the indentation right. All the time. In every language.
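Rendered in Python (with a value supplied for a, since the sketch needs one to run), the indentation is the structure, so the code can only mean what it appears to mean:

```python
a = 0

# The same logic as the C snippet; in Python the dangling assignments
# are unambiguously inside their branches, because indentation is syntax.
if a == 0:
    print("a is zero")
    r = 1
else:
    print("a is non-zero")
    r = a % 2
```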

And people still complain about Python, even when they indent beautifully in other languages.

Thursday, October 21, 2010

Code Base Fragmentation

Here's what I love -- an argument that can only add cost and complexity to a project.

It sounds like this to me: "We need to fragment the code base into several different languages. Some of the application programming simply must be written in a language that's poorly-understood, with tools that are not widely available, and supported by a select few individuals that have exclusive access to this code. We haven't benchmarked the technical benefit."

Further, we'll create complex organizational roadblocks in every single project around this obscure, specialized, hard-to-support language.

Perhaps I'm wrong, but database triggers always seem to create more problems than they solve.

They Totally Solve a Problem

The most common argument boils down to application-specific cross-cutting concerns. The claim is that these concerns (logging, validation, data model integrity, whatever) can only be solved with triggers. For some reason, though, these cross-cutting concerns can't be solved through ordinary software design. I'm not sure why triggers are the only solution when simple OO design would be far simpler.

Some folks like to adopt the "multiple application programming languages" argument. That is, that ordinary OO design won't work because the code would have to be repeated in each language. This is largely bunk. It's mostly folks scent-marking territory and refusing to cooperate.

Step 1. Write a library and share it. It's hard to find a language that can't be used to write a sharable library. It's easy to find an organization where the Visual C# programmers are not on speaking terms with the Java programmers and the isolated Python folks are pariahs. This isn't technology. Any one of the languages can create the necessary shared library. A modicum of cooperation would be simpler than creating triggers.

Step 2. Get over it. "Duplicated" business logic is rampant in most organizations. Now that you know about it, you can manage it. You don't need to add Yet Another Language to the problem. Just cooperate to propagate the changes.

They're Totally Essential To The Database

The silly argument is that some business rules are "closer to" or "essential to" the database. The reason I can call this silly is because when the data is converted to another database (or extracted to the data warehouse) the triggers aren't relevant or even needed. If the triggers aren't part of "interpreting" or "using" the data, they aren't essential. They're just convenient.

The data really is separate from the processing. And the data is far, far more valuable than the processing. The processing really is mostly application-specific. Any processing that isn't specific to the application really is a cross-cutting concern (see above). There is no "essential" processing that's magically part of the data.

What If...

Life is simpler if all application programming is done in application programming languages. And all triggers are just methods in classes. And everyone just uses the class library they're supposed to use.

"But what if someone doesn't use the proper library? A trigger would magically prevent problems."

If someone refuses to use the application libraries, they need career coaching. As in "find another job where breaking the rules is tolerated."

Tuesday, October 19, 2010

Technical Debt

Love this from Gartner. "Gartner Estimates Global 'IT Debt' to Be $500 Billion This Year, with Potential to Grow to $1 Trillion by 2015".

NetworkWorld ran a quicky version of the story. Gartner: Global 'IT debt' hits $500 billion, on the way to $1 trillion.

ComputerWorld -- to be proper journalists -- have to get a balancing quote. Their version of the story is this: Gartner warns of app maintenance 'debt'. The balancing quote is the following:
"There are many good reasons to NOT upgrade/modernize many applications, and I believe Gartner is out of line using words like 'debt' which have guilt associated with them,"
"Guilt"? That's a problem? Why are we pandering to an organization's (i.e., CIO's) emotional response?

I'm not sure that using a word like "debt" is a problem. Indeed, I think they should ramp up the threat level on this and add words like "short-sighted" and "daft" and perhaps even "idiotic".

Anyone who doesn't believe (or doesn't understand) technical debt needs only to review the Y2K plans and budgets. A bad technology decision led to a mountain of rework. Yes, it was all successful, but it made IT budgeting difficult for years afterwards.

The rest of the organization was grumpy about having their projects stalled until after Y2K. IT created its own problems by letting the technology debt accumulate to a level where it was "fix or face an unacceptable risk of not being able to stay in business."

How many other latent Y2K-like problems are companies ignoring?

Wednesday, October 13, 2010

Real Security Models

Lots of folks like to wring their hands over the Big Vague Concept (BVC™) labeled "security".

There's a lot of quibbling. Let's move beyond BVC to the interesting stuff.

I've wasted hours listening to people identify risks and costs of something that's not very complex. I've been plagued by folks throwing up the "We don't know what we don't know" objection to a web services interface. This objection amounts to "We don't know every possible vulnerability; therefore we don't know how to secure it; therefore all architectures are bad and we should stop development right now!" The OWASP top-ten list, for some reason, doesn't sway them into thinking that security is actually manageable.

What's more interesting than quibbling over BVC, is determining the authorization rules.

Basics

Two of the pillars of security are Authentication (who are you?) and Authorization (what are you allowed to do?)

Authentication is not something to be invented. It's something to be used. In our case, with an Apache/Django application, the Django authentication system works nicely for identity management. It supports a simple model of users, passwords and profiles.
We're moving to Open SSO. This takes identity management out of Django.

The point is that authentication is -- largely -- a solved problem. Don't invent. It's solved and it's easy to get wrong. Download or license an established product for identity management and use it for all authentication.

Authorization

The Authorization problem is always more nuanced, and more interesting, than Authentication. Once we know who the user is, we still have to determine what they're really allowed to do. This varies a lot. A small change to the organization, or a business process, can have a ripple effect through the authorization rules.

In the case of Django, there is a "low-level" set of authorization tests that can be attached to each view function. Each model has an implicit set of three permissions (can_add, can_delete and can_change). Each view function can test to see if the current user has the required permission. This is done through a simple permission_required decorator on each view function.
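The idea behind such a decorator can be sketched in plain Python. This is not Django's implementation; the user objects, the view's return values and the permission name are all invented for illustration:

```python
import functools

def permission_required(perm):
    """Sketch of a view decorator: refuse to run the view unless the
    current user holds the named permission."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(user, *args, **kwargs):
            if perm not in user["permissions"]:
                return "403 Forbidden"
            return view(user, *args, **kwargs)
        return wrapper
    return decorator

@permission_required("app.can_change")
def edit_view(user):
    return "200 OK"

alice = {"permissions": {"app.can_change"}}
bob = {"permissions": set()}
```

In Django the user comes from the request and the refusal is a redirect to login, but the shape of the check is the same.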

However, that's rarely enough information for practical — and nuanced — problems.

The auth profile module can be used to provide additional authorization information. In our case, we just figured out that we have some "big picture" authorizations. For sales and marketing purposes, some clusters of features are identified as "products" (or "features" or "options" or something). They aren't smallish things like Django models. They aren't largish things like whole sites. They're intermediate things based on what customers like to pay for (and not pay for).

Some of these "features" map to Django applications. That's easy. The application view functions can all simply refuse to work if the user's contract doesn't include the option.

Sadly, however, some "features" are part of an application. Drat. We have two choices here.
  • Assure that there's a "default" option and configure the feature or the default at run time. For a simple class (or even a simple module) this isn't too hard. Picking a class to instantiate at run time is pretty standard OO programming.
  • Rewrite the application to refactor it into two applications: the standard version and the optional version. This can be hard when the feature shows up as one column in a displayed list of objects or one field in a form showing object details. However, it's very Django to have applications configured dynamically in the settings file.
Our current structure is simple: all customers get all applications. We have to move away from that to mix-and-match applications on a per-customer basis. And Django supports this elegantly.
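The first choice is ordinary run-time class selection. A sketch with invented class names and an invented contract option:

```python
# Standard and optional implementations of one feature; the discount
# rule is a made-up stand-in for whatever the option actually does.
class StandardPricing:
    def price(self, amount):
        return amount

class PremiumPricing:
    def price(self, amount):
        return amount * 9 // 10  # contracted discount (integer math)

def pricing_for(contract_options):
    """Pick the implementation at run time from the customer's contract."""
    if "premium_pricing" in contract_options:
        return PremiumPricing()
    return StandardPricing()

basic = pricing_for(set())
premium = pricing_for({"premium_pricing"})
```

Picking a class to instantiate at run time really is that small; the hard part is refactoring the feature so it sits behind one such seam.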

Security In Depth

This leads us to the "Defense in Depth" buzzword bingo. We have SSL. We have SSO. We have high-level "product" authorizations. We have fine-grained Django model authorizations.

So far, all of this is done via Django group memberships, allowing us to tweak permissions through the auth module. Very handy. Very nice. And we didn't invent anything new.

All we invented was our high-level "product" authorization. This is a simple many-to-many relationship between the Django Profile model and a table of license terms and conditions with expiration dates.
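In outline, with plain classes standing in for the Django models and invented field and product names:

```python
import datetime

class License:
    """One row of the license terms-and-conditions table."""
    def __init__(self, product, expires):
        self.product = product
        self.expires = expires

class Profile:
    """Profile holding the many-to-many link to license terms."""
    def __init__(self, licenses):
        self.licenses = licenses
    def has_product(self, product, today=None):
        today = today or datetime.date.today()
        return any(
            lic.product == product and lic.expires >= today
            for lic in self.licenses
        )

profile = Profile([
    License("reporting", datetime.date(2011, 12, 31)),
    License("forecasting", datetime.date(2009, 1, 1)),   # expired
])
check_date = datetime.date(2010, 10, 13)
```

A view (or a whole application) simply refuses to work when has_product comes back False.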

Django rocks. The nuanced part is fine-tuning the available bits and pieces to match the marketing and sales pitch and the legal terms and conditions in the contracts and statements of work.

Monday, October 4, 2010

.xlsm and .xlsx Files -- Finally Reaching Broad Use

For years, I've been using Apache POI in Java and XLRD in Python to read spreadsheets. Finally, now that .XLSX and .XLSM files are in more widespread use, we can move away from those packages and their reliance on successful reverse engineering of undocumented features.

Spreadsheets are -- BTW -- the universal user interface. Everyone likes them; they're almost inescapable. And they work. There's no reason to attempt to replace the spreadsheet with a web page or a form or a desktop application. It's easier to cope with spreadsheet vagaries than to replace them.

The downside is, of course, that users often tweak their spreadsheets, meaning that you never have a truly "stable" interface. However, transforming each row of data into a Python dictionary (or Java mapping) often works out reasonably well to make your application mostly immune to the common spreadsheet tweaks.

Most of the .XLSX and .XLSM spreadsheets we process can be trivially converted to CSV files. It's manual, yes, but a quick audit can check the counts and totals.

Yesterday we got an .XLSM with over 80,000 rows. It couldn't be trivially converted to CSV by my installation of Excel.

What to do?

Python to the Rescue

Step 1. Read the standards. Start with the Wikipedia article: "Office Open XML". Move to the ECMA-376 standard.

Step 2. It's a zip archive. So, to process the file, we need to locate the various bits inside the archive. In many cases, the zip members can be processed "in memory". In the case of our 80,000+ row spreadsheet, the archive is 34M. The sheet in question expands to a 215M beast. The shared strings are 3M. This doesn't easily fit into memory.

Further, a simple DOM parser, like Python's excellent ElementTree, won't work on files this huge.

Expanding an XLSX or XLSM file

Here's step 2. Expanding the zip archive to locate the shared strings and sheets.
import zipfile

def get_worksheets(name):
    arc = zipfile.ZipFile( name, "r" )
    member = arc.getinfo("xl/sharedStrings.xml")
    arc.extract( member )
    for member in arc.infolist():
        if member.filename.startswith("xl/worksheets") and member.filename.endswith('.xml'):
            arc.extract(member)
            yield member.filename

This does two things. First, it locates the shared strings and the various sheets within the zip archive. Second, it expands the sheets and shared strings into the local working directory.

There are many other parts to the workbook archive. The good news is that we're not interested in complex workbooks with lots of cool Excel features. We're interested in workbooks that are basically file-transfer containers. Usually a few sheets with a consistent format.

Once we have the raw files, we have to parse the shared strings first. Then we can parse the data. Both of these files are simple XML. However, they don't fit in memory. We're forced to use SAX.

Step 3 -- Parse the Strings

Here's a SAX ContentHandler that finds the shared strings.
import xml.sax
import xml.sax.handler

class GetStrings( xml.sax.handler.ContentHandler ):
    """Locate Shared Strings."""
    def __init__( self ):
        xml.sax.handler.ContentHandler.__init__(self)
        self.context = []
        self.count = 0
        self.string_dict = {}
    def path( self ):
        return [ n[1] for n in self.context ]
    def startElement( self, name, attrs ):
        print( "***Non-Namespace Element", name )
    def startElementNS( self, name, qname, attrs ):
        self.context.append( name )
        self.buffer = ""
    def endElementNS( self, name, qname ):
        if self.path() == [u'sst', u'si', u't']:
            self.string_dict[self.count] = self.buffer
            self.buffer = ""
            self.count += 1
        while self.context[-1] != name:
            self.context.pop(-1)
        self.context.pop(-1)
    def characters( self, content ):
        if self.path() == [u'sst', u'si', u't']:
            self.buffer += content

This handler collects the strings into a simple dictionary, keyed by their relative position in the XML file.

This handler is used as follows.
string_handler= GetStrings()
rdr= xml.sax.make_parser()
rdr.setContentHandler( string_handler )
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
rdr.parse( "xl/sharedStrings.xml" )

We create the handler, create a parser, and process the shared strings portion of the workbook. When this is done, the handler has a dictionary of all strings. This is string_handler.string_dict. Note that a shelve database could be used if the string dictionary was so epic that it wouldn't fit in memory.
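That shelve fallback might look like the following. Note that shelve keys must be strings, unlike the int keys used by the handler above; the file name here is a throwaway temp path:

```python
import os
import shelve
import tempfile

# Sketch: a disk-backed dictionary for shared strings that won't fit
# in memory. Keys are str(index) rather than int.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "strings")

string_dict = shelve.open(path)
string_dict["0"] = u"First shared string"
string_dict["1"] = u"Second shared string"
value = string_dict["1"]
string_dict.close()
```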

The Final Countdown

Once we have the shared strings, we can then parse each worksheet, using the share string data to reconstruct a simple CSV file (or JSON document or something more usable).

The Content Handler for the worksheet isn't too complex. We only want cell values, so there's little real subtlety. The biggest issue is coping with the fact that sometimes the content of a tag is reported in multiple parts.

import re

class GetSheetData( xml.sax.handler.ContentHandler ):
    """Locate column values."""
    def __init__( self, string_dict, writer ):
        xml.sax.handler.ContentHandler.__init__(self)
        self.id_pat = re.compile( r"(\D+)(\d+)" )
        self.string_dict = string_dict
        self.context = []
        self.row = {}
        self.writer = writer
    def path( self ):
        return [ n[1] for n in self.context ]
    def startElement( self, name, attrs ):
        print( "***Non-Namespace Element", name )
    def startElementNS( self, name, qname, attrs ):
        self.context.append( name )
        if name[1] == "row":
            self.row_num = attrs.getValueByQName(u'r')
        elif name[1] == "c":
            if u't' in attrs.getQNames():
                self.cell_type = attrs.getValueByQName(u't')
            else:
                self.cell_type = None # default, not a string
            self.cell_id = attrs.getValueByQName(u'r')
            id_match = self.id_pat.match( self.cell_id )
            self.row_col = self.make_row_col( id_match.groups() )
        elif name[1] == "v":
            self.buffer = "" # Value of a cell
        else:
            pass # might do some debugging here.
    @staticmethod
    def make_row_col( col_row_pair ):
        col = 0
        for c in col_row_pair[0]:
            col = col*26 + (ord(c)-ord("A")+1)
        return int(col_row_pair[1]), col-1
    def endElementNS( self, name, qname ):
        if name[1] == "row":
            # write the row to the CSV result file.
            self.writer.writerow( [ self.row.get(i) for i in xrange(max(self.row.keys())+1) ] )
            self.row = {}
        elif name[1] == "v":
            if self.cell_type is None:
                try:
                    self.value = float( self.buffer )
                except ValueError:
                    print( self.row_num, self.cell_id, self.cell_type, self.buffer )
                    self.value = None
            elif self.cell_type == "s":
                try:
                    self.value = self.string_dict[int(self.buffer)]
                except ValueError:
                    print( self.row_num, self.cell_id, self.cell_type, self.buffer )
                    self.value = None
            elif self.cell_type == "b":
                self.value = self.buffer == "1" # the XML holds "0" or "1"
            else:
                print( self.row_num, self.cell_id, self.cell_type, self.buffer, self.string_dict.get(int(self.buffer)) )
                self.value = None
            self.row[self.row_col[1]] = self.value
        while self.context[-1] != name:
            self.context.pop(-1)
        self.context.pop(-1)
    def characters( self, content ):
        self.buffer += content

This class and the shared string handler could be refactored to eliminate a tiny bit of redundancy.

This class does two things. At the end of a value (v) tag, it determines what data was found: a number, a boolean value or a shared string. At the end of a row tag, it writes the accumulated row to a CSV writer.

This handler is used as follows.
rdr= xml.sax.make_parser()
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
for s in sheets:
    with open(s+".csv","wb") as result:
        handler= GetSheetData( string_handler.string_dict, csv.writer(result) )
        rdr.setContentHandler( handler )
        rdr.parse( s )
This iterates through each sheet, transforming it into a simple .CSV file. Once we have the file in CSV format, it's smaller and simpler. It can easily be processed by follow-on applications.

The overall loop actually looks like this.

sheets= list( get_worksheets(name) )

string_handler= GetStrings()
rdr= xml.sax.make_parser()
rdr.setContentHandler( string_handler )
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
rdr.parse( "xl/sharedStrings.xml" )

rdr= xml.sax.make_parser()
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
for s in sheets:
    with open(s+".csv","wb") as result:
        handler= GetSheetData( string_handler.string_dict, csv.writer(result) )
        rdr.setContentHandler( handler )
        rdr.parse( s )
This expands the shared strings and individual sheets. It iterates through the sheets, using the shared strings, to create a bunch of .CSV files from the .XLSM data.

The resulting .CSV -- stripped of the XML overheads -- is 80,000+ rows and only 39M. Also, it can be processed with the Python csv library.

CSV Processing

This, after all, was the goal. Read the CSV file and do some useful work.
def csv_rows(source):
    rdr= csv.reader( source )
    headings = []
    for n, cols in enumerate( rdr ):
        if n < 4:
            if headings:
                headings = [ (top+' '+nxt).strip() for top, nxt in zip( headings, cols ) ]
            else:
                headings = cols
            continue
        yield dict(zip(headings,cols))
We locate the four header rows and build labels from the four rows of data. Given these big, complex headers, we can then build a dictionary from each data row. The resulting structure is exactly like the results of a csv.DictReader, and can be used to do the "real work" of the application.
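The heading-merge logic is worth seeing in isolation. Here's a small, self-contained demonstration (the header rows are made up for illustration) of how stacked header rows fold into combined labels:

```python
def merge_headings(header_rows):
    """Fold successive header rows into one list of combined labels,
    mirroring the logic inside csv_rows() above."""
    headings = []
    for cols in header_rows:
        if headings:
            # Pair each accumulated label with the next row's cell.
            headings = [(top + ' ' + nxt).strip() for top, nxt in zip(headings, cols)]
        else:
            headings = cols
    return headings

# Two stacked header rows: the top row labels a group, the second a column.
rows = [
    ["", "2009", "2010"],
    ["Region", "Sales", "Sales"],
]
print(merge_headings(rows))  # ['Region', '2009 Sales', '2010 Sales']
```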

Thursday, September 30, 2010

SQL Can Be Slow -- Why Do People Doubt This?

Here's a typical problem that results from "SQL Hegemony" -- all data must be in a database, and all access must be via SQL. This can also be called the "SQL Fetish" school of programming.

War Story. On a Data Warehousing project, we had to load and process the organizational hierarchy. SQL doesn't do hierarchies well because they can (and should) involve a join of indefinite depth. One of the DBAs wanted to use a "pure SQL" traversal of the hierarchy.

My opinion was that it was a waste of code. We were writing Java programs. We could -- trivially -- fetch the entire tree into Java objects and work with the hierarchy as a hierarchy.

The DBA finally "won" because of the SQL Hegemony argument -- all access must be in SQL, right? I say "won" because we eventually had to throw all the SQL away and use flat files. A "pure SQL" data warehouse is generally unacceptably slow for loading. Data mart subsets can be done in pure SQL, but loads can't.

Recent Events. "a table called [LOTSADATA] and it has 14.7 million rows. One of the columns in [LOTSADATA] table is BRAND" for which they need to do a select distinct. "The disadvantage of [SELECT DISTINCT] is that the Oracle database engine will do a sort which is an insanely expensive operation. Question: Are there alternative approaches to obtaining the unique brands in a table?"

Response 1. Duh. Of course there are alternatives. What are you, stupid? You have programming languages. Use them.

Response 2. You're kidding, right? Why ask me? Why not just run it? How hard can it be to benchmark this? What are you, stupid? Seriously.

Response 3. Oh. SQL Hegemony. Folks are actually arguing about the cost of a query and -- it appears -- no one can actually write the eight lines of code required to demonstrate that SELECT ALL is faster than SELECT DISTINCT.

[Sorry for calling you stupid. You're paralyzed by fear, not stupidity. What if SQL isn't the perfect end-all, do-all language? If SQL isn't perfect for all data processing, what other lies have we been living? Is this the end of organized data processing? The collapse of western civilization?

Indeed, I'm repeatedly shocked that the question even comes up. And I'm more shocked that the "appeal to authority" argument has to be used. It's trivial to measure. It appears that it's easier to ask me than to gather data.]

Edit. SQL Hegemony? Yes. Rather than run a demonstration program, written in Java or C# or Python, they argued about the SQL. Doing this with minimalist SQL didn't seem to make anyone's radar. Why not? SQL Hegemony. Rather than consider real alternatives, everyone was reduced to looking for sneaky SQL tricks.

Benchmarking. Here is what I did. It's 5 lines of code for each case. [How hard can this be? Apparently, SQL hegemony makes it impossible for some organizations to do even this.]
def select_distinct():
    q1= db.cursor()
    q1.execute( "SELECT DISTINCT BRAND FROM LOTSADATA" )
    print q1.fetchall()
    q1.close()

def select_all():
    q2= db.cursor()
    q2.execute( "SELECT ALL BRAND FROM LOTSADATA" )
    print set( q2.fetchall() )
    q2.close()

Notes.
  • I only simulated 100,000 rows. [I don't have the patience to wait for 15 million rows to be created, loaded and queried.]
  • The table only had four columns.
  • I used SQLite3 -- which is mostly in-memory -- and runs much, much faster than Oracle.
  • The select all is not a specious result based on cache being filled; the results are repeatable in any ordering of the queries.
Results.

select_distinct 0.417096
select_all 0.162827

For this data, the SQL SELECT DISTINCT took almost 3x as long as the SELECT ALL. It's just that simple.
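For anyone who wants to run the comparison themselves, here's a self-contained version of the benchmark using an in-memory SQLite database and timeit. The table name matches the story; the brand values and row count are made up for illustration, and the relative timings will of course vary with the database engine and data.

```python
import sqlite3
import random
import timeit

# Build a toy LOTSADATA table: 100,000 rows, four columns, few distinct brands.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE LOTSADATA (ID INTEGER, BRAND TEXT, C1 TEXT, C2 TEXT)")
brands = ["Acme", "Apex", "Zenith", "Globex"]
db.executemany(
    "INSERT INTO LOTSADATA VALUES (?,?,?,?)",
    [(i, random.choice(brands), "x", "y") for i in range(100000)],
)
db.commit()

def select_distinct():
    # Let the database engine do the de-duplication.
    q = db.cursor()
    q.execute("SELECT DISTINCT BRAND FROM LOTSADATA")
    result = set(row[0] for row in q.fetchall())
    q.close()
    return result

def select_all():
    # Fetch everything and de-duplicate in the application with a set.
    q = db.cursor()
    q.execute("SELECT ALL BRAND FROM LOTSADATA")
    result = set(row[0] for row in q.fetchall())
    q.close()
    return result

print("select_distinct", timeit.timeit(select_distinct, number=5))
print("select_all     ", timeit.timeit(select_all, number=5))
```

Both functions return the same set of brands; the only question is where the de-duplication work happens.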

Want more speed? Use array fetch features to get more rows in bigger buffers.

Consequences.

This is not rocket science. SQL can be Slow. Don't Argue: Benchmark. Your Mileage May Vary.

SQL databases do locking, transaction management, backup and recovery and a bunch of things well. SQL databases are helpful and necessary. However, SQL isn't always fast.

SQL means Slow Query Language. You've been told.

Tuesday, September 28, 2010

Why Professional Certification Might Be Good

Sometimes I think we need professional certification in this industry. I supported the ICCP for a long time.

In addition to certification, which requires ongoing educational credits to maintain, there ought to be a process for revoking one's certification, requiring the holder to pass the exams again.

Here are three strikes against two clods who wasted hours on -- perhaps -- the dumbest things possible.

Strike 1. Counting From Zero
I then ponited out that the Microsoft doco is weird because the highest
number allowed by ulong is 18,446,744,073,709,551,615 which ends in an odds
number

I remineded him that 2**64 = 18,446,744,073,709,551,616
Apparently, this was the first time anyone realized how counting from zero works. If they had actually thought about this, they could have tried a smaller example: three bits. 2**3 = 8. When you enumerate the values you get 0, 1, 2, 3, 4, 5, 6, 7. The highest value is 2**3-1. It's not "weird". It's a trivially obvious mathematical fact.

It works like this: n values have numbers from 0 to n-1. Didn't know that? Consider your certification revoked. Even hobbyists know this.
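The smaller example is trivial to check directly in Python, which is the whole point:

```python
# With n bits there are 2**n values, numbered 0 through 2**n - 1.
for bits in (3, 64):
    values = 2 ** bits
    print(bits, "bits:", values, "values, highest is", values - 1)

# The three-bit case, enumerated in full:
print(list(range(2 ** 3)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The 64-bit case prints 18,446,744,073,709,551,615 as the highest value: exactly the "odd" number from the Microsoft documentation.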

Strike 2. Wrong Tools and Wrong Approach

This is more subtle and involves two strikes. We'll look at just one of them.
Then he wanted a spreadsheet of 2 raised to nth power.

I put it together and the numbers just looked weird. I then realized that
when you type a number that contains more than 15 digits in a cell,
Microsoft Excel changes any digits past the fifteenth place to zeroes

What I felt like saying is that Python has built in this concept of "long
integers" which has unlimited precision and it automatically switches to
them
One of the clods knew Python. Instead of writing a trivial loop in Python, apparently, clod #1 proceeded to type numbers into a spreadsheet. The clod didn't compute them -- with formulas or software -- the clod typed the numbers. Typed. Have Python. Elected to type. How did they do the calculations? On a pocket calculator? Oh the shame.

Also, additional penalties for exhaustive enumeration. They sent me the spreadsheet as if it was somehow important that they could enumerate values between 2**0 and 2**135. No summary or rule. Just a mountain of useless numbers.
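The trivial loop -- the one that should have replaced all that typing -- is a couple of lines, and Python's long integers keep every digit exact:

```python
# Powers of two from 2**0 through 2**135, computed exactly. No typing,
# no fifteen-digit truncation.
powers = [2 ** n for n in range(136)]
print(powers[64])   # 18446744073709551616 -- every digit exact
print(powers[135])
```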

Strike 3. Floating Point

This is not news. Nor should it be. Indeed, this should be the first set of questions on the certification exam. If you can't figure out floating point, you can't write software. Period. Please find another job in an industry where you won't waste time on this.

Floating point is not simple, and everyone should study it before they are allowed to charge money for creating software. Running random experiments and exhaustively enumerating values is not studying. That's not even hobbyist stuff. Try actually reading. Starting with the standard. And David Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
contains more than 15 digits in a cell,
Microsoft Excel changes any digits past the fifteenth place to zeroes
This is not "news". The link provided in the email ("Last digits are changed to zeroes when you type long numbers in cells of Excel") indicates a profound lack of understanding.

Apparently they never noticed that this is near 2**50. They never looked up the IEEE floating point representation that -- pretty clearly -- says that there are only 52 bits of useful information. Wikipedia reminds us that this is about 15 decimal digits. Rather than look this up, they chose to be astonished.

These clods were astonished that floating-point numbers have a finite mantissa. Astonished that -- empirically -- they had stumbled on the fact that the mantissa is about 50 bits.
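The finite mantissa is trivial to demonstrate -- no spreadsheet required:

```python
import sys

# A double has a 52-bit stored mantissa (53 significant bits counting the
# implicit leading 1), which works out to about 15 decimal digits.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.dig)        # 15

# Past that precision, distinct integers collapse to the same float --
# exactly the "digits changed to zeroes" behavior Excel exhibits.
print(float(10**16 + 1) == float(10**16))  # True
```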

How much time did they waste on this? More importantly, how can they consider their activities to be "professional"? Unable to count from zero? Using the wrong tools and exhaustively enumerating the obvious? Not realizing the floating-point values have limited precision?

I find it appalling. Their escapades sound like two home hobbyists with their first-ever copy of C#. Not like professionals.

Thursday, September 23, 2010

Comments, Assertions and Unit Tests

See "Commenting the Code". This posting tickled my fancy because it addressed the central issue of "what requires comments outside Python docstrings". All functions, classes, modules and packages require docstrings. That's clear. But which lines of code require additional documentation?

We use Sphinx, so we make extensive use of docstrings. This posting forced me to think about non-docstring commentary. The post makes things a bit more complex than necessary. It enumerated some cases, which is helpful, but didn't see the commonality between them.

The posting lists five cases for comments in the code.
  1. Summarizing the code blocks. Semi-agree. However, many code blocks indicates too few functions or methods. I rarely write a function long enough to have "code blocks". And the few times I did, it became regrettable. We're unwinding a terrible mistake I made regarding an actuarial calculation. It seemed so logical to make it four steps. It's untestable as a 4-step calculation.
  2. Describe every "non-trivial" operation. Hmmm... Hard to discern what's trivial and what's non-trivial. The examples on the original post seem to be a repeat of #1. However, it seems more like this is a repeat of #5.
  3. TODO's. I don't use comments for these. These have to be official ".. todo::" notations that will be picked up by Sphinx. So these have to be in docstrings, not comments.
  4. Structures with more than a couple of elements. The example is a tuple of tuples. I'd prefer to use a namedtuple, since that includes documentation.
  5. Any "doubtful" code. This is -- actually -- pretty clear. When in doubt, write it out. This seems to repeat #2.
One of the other cases in the post was really just a suggestion that comments be "clear as well as short". That's helpful, but not a separate use case for code comments.

So, of the five situations for comments described in the post, I can't distinguish two of them and don't agree with two more.

This leaves me with two use cases for Python code commentary (distinct from docstrings).
  • A "summary" of the blocks in a long-ish method (or function)
  • Any doubtful or "non-trivial" code. I think this is code where the semantics aren't obvious; or code that requires some kind of review of explanation of what the semantics are.
The other situations are better handled through docstrings or named tuples.
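On the namedtuple point: the field names document the structure directly, in a way a bare tuple of tuples with a comment never can. A minimal illustration (the Rate structure here is a made-up example, not from the post):

```python
from collections import namedtuple

# Instead of an anonymous tuple-of-tuples plus a comment explaining which
# position means what, the field names carry the documentation.
Rate = namedtuple("Rate", ["age", "premium", "deductible"])

table = [
    Rate(age=25, premium=100.0, deductible=500.0),
    Rate(age=35, premium=150.0, deductible=500.0),
]
print(table[0].premium)   # 100.0 -- self-describing access, no index arithmetic
```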

Assertions

Comments are interesting and useful, but they aren't real quality assurance.

A slightly stronger form of commentary is the assert statement. Including an assertion formalizes the code into a clear predicate that's actually executable. If the predicate fails, the program was mis-designed or mis-constructed.

Some folks argue that assertions are a lot of overhead. While they are overhead, they aren't a lot of overhead. Assertions in the body of the inner-most loops may be expensive. But most of the really important assertions are in the edge and corner cases which (a) occur rarely, (b) are difficult to design, and (c) are difficult to test.

Since the obscure, oddball cases are rare, cover these with the assert statement in addition to a comment.
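An oddball case covered by both a comment and an assertion might look like this (the function is a made-up example, not from the post):

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    # Edge case: an empty sequence has no meaningful mean.
    # The assertion turns the comment into an executable predicate:
    # if it fires, the caller mis-used this function.
    assert len(values) > 0, "mean() requires a non-empty sequence"
    return sum(values) / len(values)

print(mean([2, 4, 6]))  # 4.0
```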

That's Fine, But My Colleagues are Imbeciles

There are numerous questions on Stack Overflow that amount to "comments don't work". Look at the hundreds of questions that include the keywords public, protected and private. Here's a particularly bad question with a very common answer.
Because you might not be the only developer in your project and the other developers might not know that they shouldn't change it. ...
This seems silly. "other developers might not know" sounds like "other developers won't read the comments" or "other developers will ignore the comments." In short "comments don't work."

I disagree in general. Comments can work. They work particularly well in languages like Python where the source is always available.

For languages like C++ and Java, where the source can be separated and kept secret, comments don't work. In this case, you have to resort to something even stronger.

Unit Tests

Unit tests are perhaps the best form of documentation. If someone refuses to read the comments, abuses a variable that's supposed to be private, and breaks things, then tests will fail. Done.

Further, the unit test source must be given to all the other developers so they can see how the API is supposed to work. A unit test is a living, breathing document that describes how a class, method or function behaves.
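A sketch of a unit test serving as living documentation -- here for a hypothetical class with a read-only total, invented for this example:

```python
import unittest

class Tally:
    """A counter whose total is meant to be read, not written."""
    def __init__(self):
        self._total = 0
    def add(self, amount):
        self._total += amount
    @property
    def total(self):
        # Read-only: there is deliberately no setter.
        return self._total

class TestTally(unittest.TestCase):
    """These tests show other developers how Tally is supposed to be used."""
    def test_add_accumulates(self):
        t = Tally()
        t.add(3)
        t.add(4)
        self.assertEqual(t.total, 7)
    def test_total_is_read_only(self):
        # Abusing the "private" value fails loudly -- no comment required.
        t = Tally()
        with self.assertRaises(AttributeError):
            t.total = 99

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTally)
unittest.TextTestRunner().run(suite)
```

A developer who ignores every comment still can't break the read-only contract without a failing test announcing it.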

Explanatory Power

Docstrings are essential. Tools can process these.

Comments are important for describing what's supposed to happen. There seem to be two situations that call for comments outside docstrings.

Assertions can be comments which are executable. They aren't always as descriptive as English prose, but they are formal and precise.

Unit tests are important for confirming what actually happens. There's really no alternative to unit testing to supplement the documentation.