Thursday, February 17, 2011

TDD -- From SME Spreadsheet to TestCase to Code

In "Unit Test Case, Subject Matter Experts and Requirements" I suggested that it's often pretty easy to get a spreadsheet of full-worked out examples from subject-matter experts. Indeed, if your following TDD, that spreadsheet of examples is solid gold.

Let's consider something relatively simple. Let's say we're working on some fancy calculations. Our users explain until they're blue in the face. We take careful notes. We think we understand. To confirm, we ask for a simple spreadsheet with inputs and outputs.

We get something like the following. The latitudes and longitudes are inputs. The ranges and bearings are outputs. [The math can be seen at "Calculate distance, bearing and more between Latitude/Longitude points".]

Latitude 1Longitude 1Latitude 2Longitude 2rangebearing
50 21 50N004 09 25W42 21 04N071 02 27W2805 nm260 07 38

Only it has a a few more rows with different examples. Equator Crossing. Prime Meridian Crossing. All the usual suspects.

TDD Means Making Test Cases

Step one, then, is to parse the spreadsheet full of examples and create some domain-specific examples. Since it's far, far easier to work with .CSV files, we'll presume that we can save the carefully-crafted spreadsheet as a simple .CSV with the columns shown above.

Step two will be to create working Python code from the domain-specific examples.

The creation of test cases is a matter of building some intermediate representation out of the spreadsheet. This is where plenty of parsing and obscure special-case data handling may be necessary.

from __future__ import division
import csv
from collections import namedtuple
import re

latlon_pat= re.compile("(\d+)\s+(\d+)\s+(\d+)([NSWE])")
def latlon( txt ):
match= latlon_pat.match( txt )
d, m, s, h = match.groups()
return float(d)+float(m)/60+float(s)/3600, h
angle_pat= re.compile("(\d+)\s+(\d+)\s+(\d+)")
def angle( txt ):
match= angle_pat.match( txt )
d, m, s = match.groups()
return float(d)+float(m)/60+float(s)/3600
range_pat= re.compile("(\d+)\s*(\D+)")
def range( txt ):
match= range_pat.match( txt )
d, units = match.groups()
return float(d), units

RangeBearing= namedtuple("RangeBearing","lat1,lon1,lat2,lon2,rng,brg")

def test_iter( filename="sample_data.csv" ):
with open(filename,"r") as source:
rdr= csv.DictReader( source )
for row in rdr:
print row
tc= RangeBearing(
latlon(row['Latitude 1']), latlon(row['Longitude 1']),
latlon(row['Latitude 2']), latlon(row['Longitude 2']),
range(row['range']),
angle(row['bearing'])
)
yield tc

for tc in test_iter():
print tc

This is long, but, it handles a lot of the formatting vagaries that users are prone to.

From Abstract to TestCase

Once we have a generator to build test cases as abstraction examples, generating code for Java or Python or anything else is just a little template-fu.

   
from string import Template
testcase= Template("""
class Test_${name}( unittest.TestCase ):
def setUp( self ):
self.p1= LatLon( lat=GlobeAngle(*$lat1), lon=GlobeAngle(*$lon1) )
self.p2= LatLon( lat=GlobeAngle(*$lat2), lon=GlobeAngle(*$lon2) )
def test_should_compute( self ):
d, brg = range_bearing( p1, p2, R=$units )
self.assertEquals( $dist, int(d) )
self.assertEquals( $brg, map(int,map(round,brg.deg)))
""")
for name, tc in enumerate( test_iter() ):
units= tc.rng[1].upper()
dist= tc.rng[0]
code= testcase.substitute( name=name, dist=dist, units=units, **tc._asdict() )
print code

This shows a simple template with values filled in. Often, we have to generate a hair more than this. A few imports, a "unittest.main()" is usually sufficient to transform a spreadsheet into unit tests that we can confidently use for test-driven development.

Tuesday, February 8, 2011

Unit Test Case, Subject Matter Experts and Requirements

Here's a typical "I don't like TDD" question: the topic is "Does TDD really work for complex projects?"

Part of the question focused on the difficulty of preparing test cases that cover the requirements. In particular, there was some hand-wringing over conflicting and contradictory requirements.

Here's what's worked for me.

Preparation. The users provide the test cases as a spreadsheet showing the business rules. The columns are attributes of some business document or case. The rows are specific test cases. Users can (and often will) do this at the drop of a hat. Often complex, narrative requirements written by business analysts are based on such a spreadsheet.

This is remarkably easy for must users to produce. It's just a spreadsheet (or multiple spreadsheets) with concrete examples. It's often easier for users to make concrete examples than it is for them to write more general business rules.

Automated Test Case Construction

Here's what can easily happen next.

Write a Python script to parse the spreadsheet and extract the cases. There will be some ad-hoc rules, inconsistent test cases, small technical problems. The spreadsheets will be formatted poorly or inconsistently.

Once the cases are parsed, it's easy to then create a Unittest.TestCase template of some kind. Use Jinja2 or even Python's string.Template class to rough out the template for the test case. The specifics get filled into the unit test template.

The outline of test case construction is something like this. Details vary with target language, test case design, and overall test case packaging approach.

t = SomeTemplate()
for case_dict in testCaseParser( "some.xls" ):
code= t.render( **case_dict )
with open(testcaseName(**case_dict ),'w') as result:
result.write( code )

You now have a fully-populated tree of unit test classes, modules and packages built from the end-user source documents.

You have your tests. You can start doing TDD.

Scenarios

One of the earliest problems you'll have is test case spreadsheets that are broken. Wrong column titles, wrong formatting, something wrong. Go meet with the user or expert that built the spreadsheet and get the thing straightened out.

Perhaps there's some business subtlety to this. Or perhaps they're just careless. What's important is that the spreadsheets have to be parsed by simple scripts to create simple unit tests. If you can't arrive at a workable solution, you have Big Issues and it's better to resolve it now than try to press on to implementation with a user or SME that's uncooperative.

Another problem you'll have is that tests will be inconsistent. This will be confusing at first because you've got code that passed one test, and fails another test and you can't tell what the differences between the tests are. You have to go meet with the users or SME's and resolve what the issue is. Why are the tests inconsistent? Often, attributes are missing from the spreadsheet -- attributes they each assumed -- and attributes you didn't have explicitly written down anywhere. Other times there's confusion that needs to be resolved before any programming should begin.

The Big Payoff

When the tests all pass, you're ready for performance and final acceptance testing. Here's where TDD (and having the users own the test cases) pays out well.

Let's say we're running the final acceptance test cases and the users balk at some result. "Can't be right" they say.

What do we do?

Actually, almost nothing. Get the correct answer into a spreadsheet somewhere. The test cases were incomplete. This always happens. Outside TDD, it's called "requirements problem" or "scope creep" or something else. Inside TDD, it's called "test coverage" and some more test cases are required. Either way, test cases are always incomplete.

It may be that they're actually changing an earlier test case. Users get examples wrong, too. Either way (omission or error) we're just fixing the spreadsheets, regenerating the test cases, and starting up the TDD process with the revised suite of test cases.

Bug Fixing

Interestingly, a bug fix after production roll-out is no different from an acceptance test problem. Indeed it's no different from anything that's happened so far.

A user spots a bug. They report it. We ask for the concrete example that exemplifies the correct answer.

We regenerate the test cases from the spreadsheets and start doing development. 80% of the time, the new example is actually a change to an existing example. And since the users built the example spreadsheets with the test data, they can maintain those spreadsheets to clarify the bugs. 20% of the time it's a new requirement. Either way, the test cases are as complete and consistent as the users are capable of producing.

Wednesday, February 2, 2011

Escaping the Relational Schema Trap

We're struggling with our Relational Schema. We're not alone, of course, everyone struggles with the relational model. The technology imposes difficult limitations and we work around them.

There's kind of a 4-step process through which the relational schema erodes into irrelevance. The concept of a schema is not irrelevant. It's the rigid relational schema that's a problem.

Many DBA's will say that the relational model is the ultimate in flexibility. They're right, but they're missing the point. The relational database clearly separates the physical storage from the logical model as seen in tables and columns. It's flexible, but the presence of a rigid relational schema limits the pace of business change.

"Clearly," the DBA says, "you don't know how to use ALTER." I beg to differ. I can use ALTER; however, it doesn't permit the broad, sweeping scope of change that the business demands.

In order to attempt to match the pace of business change, we're using an ORM layer. This allows us to fabricate methods and properties left, right and center. We can tackle some pretty big problems with simple code changes. This, however, is no longer helping.

Straws and Camels

When designing a database, we have to be cognizant of the nature and tempo of change. In highly-regulated, very settled business applications (back-office accounting, for example) the data model is well known. Changes are mostly distinctive reporting changes and the tempo is pretty lethargic. It's the back office. Sorry, but innovation rarely happens there.

Each change is just a another hand-full of straw thrown on the camel's back. It happens fairly slowly. And there aren't many surprises. Hacks, workarounds and technical debt accumulates slowly.

In innovative, novel, experimental businesses, however, the nature and tempo are very different. The changes are disruptive, "what are you saying?" kinds of changes. They are "throw out the bathwater, the babies, the cribs and fire the nursemaid" kinds of changes. The tempo is semi-annual reinvent everything. Hacks, workarounds and technical debt get out of control suddenly.

Important Lesson Learned. When the customer misunderstands the offering and asks for something completely senseless, it's good to listen and try to build that -- even if it wasn't what you were offering. In some cases, the original offering was too complex or contrived. In other cases, the offering didn't create enough value. But when you offer [X] and the customer asks how much it will cost for [Y], you have disruptive, sudden, and surprising database changes.
This is bales of hay through onto an unprepared camel. Backs can get broken.

Coping

One common coping strategy is SQL ALTER statements to fiddle with the logical model. This has to be coupled with CREATE TABLE AS SELECT scripts to do open-heart surgery on the logical model. Married with modified ORM definitions. This requires some careful "schema versioning" techniques.

Another coping strategy is lots of "Expansion" columns in the tables. These can be renamed and repurposed without physical storage changes. The rows haven't physically changed, but the column name morphed from "EXPANSION_INT_01" to "Some_Real_Attribute". This doesn't prevent the CREATE TABLE AS SELECT scripts to do open-heart surgery. It still requires some careful "schema versioning" techniques to be sure that the ORM layer matches the logical schema.

A third -- and perhaps most popular -- coping strategy is manpower. Just having dedicated DBA's and maintenance programmers is a common way to handle this. Some folks object, saying that a large staff isn't a way to "cope with change" but is a basic "cost of doing business".

It's false, by the way, to claim that dedicated DBA's are essential. A solo developer can design and implement a database and application software with no help at all. Indeed, in most organizations, developers design and build databases, then turn them over to DBA's for operational support. If the nature of change is minor and tempo of change is slow, a solo developer can deal perfectly well with the database. A dedicated DBA is someone we add when the developer gets swamped by too much change.

(Some DBA's like to claim that the developers never get normalization or indexing correct. I counter with the observation that some DBA's don't get this right, either. DBA's aren't essential. They're a popular way to cope with the nature and tempo of change.)

In the ORM world, there are schema migration toolkits. Projects like Storm, this list for Django, Embarcadero Change Manager for Oracle, and numerous others attempt to support the schema evolution and change management problem. All of this is a clever way to cope with a problem inherent in our choice of technology.

Chaos Theory

Rather than invent clever coping mechanisms, let's take a step back. If we're inventing technology to work around the fixed relational schema, it might be time to rethink the relational schema.

"Oh noes," DBA's cry, "we must have a fixed logical model otherwise chaos ensues."

Really? How come we're always altering that schema? How come we're always adding tables and restructuring the tables?

"Oh that? That's 'controlled change'," the DBA responds.

No, that's slow chaos.

Here's how it plays out. We have a disruptive change. We negotiate with the DBA's to restructure the database. And the test database. And the QA database. We do the development database without any help from the DBA's. We fix the ORM layers. We unit test the changes.

Then we plan and coordinate the production rollout of this change with the DBA's. Note. We already made the change in development. We're not allowed to make the change in production. The DBA's then suggest design alternatives. Normalization isn't "right". Or there are physical changes that need to be declared in the table definitions. We redo the development database. And the ORM layer. And rerun the unit tests.

Because the production database couldn't be touched -- and we had paying customers -- we copied production data into a development database and started doing "production" in development. Now that we're about to make the official production change, we have two databases. The official database content is out-of-date. The development database is a mixture of live production and test data. Sigh.

Rethinking Schema

If the schema is a problem, perhaps we can live without it. Enter NoSQL databases.

Here's how you start down the slippery slope.

Phase I. You need a fairly radical database change. Rather than wait weeks for the DBA's, you ask for a single "BLOB" column. You take the extra data elements for the radical change, JSON encode them, and store the JSON representation in the BLOB field. Now you have a "subschema" buried inside a single BLOB column.

Since this is a simple ALTER, the DBA's will do it without a lot of negotiation or delay. You have a hybrid database with a mixture of schema and noSQL.

Phase II. You need an even more radical change. Rather than wait weeks for the DBA's, you ask for a few tables that have just a primary key and a BLOB column. You've basically invented a document-structured database inside SQL, bypassing the SQL schema entirely.

Phase III. While waiting for the Phase II changes to be implemented, you convert the customer data from their obscure, stupid format into a simple sequential file of JSON documents and write your own simple map-reduce algorithms in Python. Sure, performance is poor, but you're up and running without any database overheads.

Phase IV. Start looking for alternatives.


This MongoDB looks really nice. PyMongo offers lots of hints and guidance.

At least one person is looking at mango, a MongoDB database adapter for Django. For us, this isn't the best idea. We use OpenAM for identity management, so our Users and Sessions are simply cloned from OpenAM by an authentication backend that gets the user from OpenAM. SQLite works fine for this.

We think we can use Django's ORM and a relational database for User and Session. For everything else, we need to look closely and MongoDB.

Wins and Losses

The big win is the ability to handle disruptive change a little bit more gracefully.

The big loss in switching away from the Django ORM is we lose the built-in admin pages. We have to build admin Forms and view functions. While this is a bit of a burden, we've already customized every model form heavily. Switching from ModelForm to Form and adding the missing fields isn't much additional work.

The biggest issue with document-oriented data models is assuring that the documents comply with some essential or core schema. Schemas are inescapable. The question is more a matter of how the schema limits change. Having a Django Form to validate JSON documents for the "essential" features is far more flexible than having a Django Model class and a mapping to a relational database.

Schema migration becomes a non-issue until we have to expand the essential schema, which changes the validation rules, and may render old documents retroactively invalid. This is not a new problem -- Relational folks cope with this, also -- but if it's the only problem, then we may have streamlined the process of making disruptive business changes.