Thursday, September 30, 2010

SQL Can Be Slow -- Why Do People Doubt This?

Here's a typical problem that results from "SQL Hegemony" -- all data must be in a database, and all access must be via SQL. This can also be called the "SQL Fetish" school of programming.

War Story. On a Data Warehousing project, we had to load and process the organizational hierarchy. SQL doesn't do hierarchies well because they can (and should) involve an join of indefinite depth. One of the DBA's wanted to use a "pure SQL" traversal of the hierarchy.

My opinion was that it was a waste of code. We were writing Java programs. We could -- trivially -- fetch the entire tree into Java objects and work with the hierarchy as a hierarchy.

The DBA finally "won" because of the SQL Hegemony argument -- all access must be in SQL, right? I say "won" because we eventually had to throw all the SQL away and use flat files. A "pure SQL" data warehouse is generally unacceptably slow for loading. Data mart subsets can be done in pure SQL, but loads can't.

Recent Events. "a table called [LOTSADATA] and it has 14.7 million rows. One of the columns in [LOTSADATA] table is BRAND" for which they need to do a select distinct. "The disadvantage of [SELECT DISTINCT] is that the Oracle database engine will do a sort which is an insanely expensive operation.

Question: Are there alternative approaches to obtaining the unique brands in
a table?"

Response 1. Duh. Of course there are alternatives. What are you, stupid? You have programming languages. Use them.

Response 2. You're kidding, right? Why ask me? Why not just run it? How hard can it be to benchmark this? What are you, stupid? Seriously.

Response 3. Oh. SQL Hegemony. Folks are actually arguing about the cost of a query and -- it appears -- no one can actually write the eight lines of code required to demonstrate that SELECT ALL is faster than SELECT DISTINCT.

[Sorry for calling you stupid. You're paralyzed by fear, not stupidity. What if SQL isn't the perfect end-all, do-all language? If SQL isn't perfect for all data processing, what other lies have we been living? Is this the end of organized data processing? The collapse of western civilization?

Indeed, I'm repeatedly shocked that the question even comes up. And I'm more shocked that the "appeal to authority" argument has to be used. It's trivial to measure. It appears that it's easier to ask me than to gather data.]

Edit. SQL Hegemony? Yes. Rather than run a demonstration program, written in Java or C# or Python, they argued about the SQL. Doing this with minimalist SQL didn't seem to make anyone's radar. Why not? SQL Hegemony. Rather than consider real alternatives, everyone was reduced to looking for sneaky SQL tricks.

Benchmarking. Here is what I did. It's 5 lines of code for each case. [How hard can this be? Apparently, SQL hegemony makes it impossible for some organizations to do even this.]
def select_distinct():
q1= db.cursor()
q1.execute( "SELECT DISTINCT BRAND FROM LOTSADATA" )
print q1.fetchall()
q1.close()

def select_all():
q2= db.cursor()
q2.execute( "SELECT ALL BRAND FROM LOTSADATA" )
print set( q2.fetchall() )
q2.close()

Notes.
  • I only simulated 100,000 rows. [I don't have the patience to wait for 15 million rows to be created, loaded and queried.]
  • The table only had four columns.
  • I used SQLite3 -- which is mostly in-memory -- and runs much, much faster than Oracle.
  • The select all is not a specious result based on cache being filled; the results are repeatable in any ordering of the queries.
Results.

select_distinct 0.417096
select_all 0.162827

For this data, the SQL SELECT DISTINCT took almost 3x as long as the SELECT ALL. It's just that simple.

Want more speed? Use array fetch features to get more rows in bigger buffers.

Consequences.

This is not rocket science. SQL can be Slow. Don't Argue: Benchmark. Your Mileage May Vary.

SQL databases do locking, transaction management, backup and recovery and a bunch of things well. SQL databases are helpful and necessary. However, SQL isn't always fast.

SQL means Slow Query Language. You've been told.

Tuesday, September 28, 2010

Why Professional Certification Might Be Good

Sometimes I think we need professional certification in this industry. I supported the ICCP for a long time.

In addition to certification, which requires ongoing educational credits to maintain, there ought to be a process for revoking one's certification, requiring them to pass their exams again.

Here's three strikes against two clods who wasted hours on -- perhaps -- the dumbest things possible.

Strike 1. Counting From Zero
I then ponited out that the Microsoft doco is weird because the highest
number allowed by ulong is 18,446,744,073,709,551,615 which ends in an odds
number

I remineded him that 2**64 = 18,446,744,073,709,551,616
Apparently, this was the first time anyone realized how counting from zero works. If they had actually thought about this, they could have tried a smaller example. For example three bits. 2**3 = 8. When you enumerate the values you get 0, 1, 2, 3, 4, 5, 6, 7. The highest value is 2**3-1. It's not "weird". It's a trivially obvious mathematical fact.

It works like this: n values have numbers from 0 to n-1. Didn't know that? Consider your certification revoked. Even hobbyists know this.

Strike 2. Wrong Tools and Wrong Approach

This is more subtle and involves two strikes. We'll look at just one of them.
Then he wanted a spreadsheet of 2 raised to nth power.

I put it together and the numbers just looked weird. I then realized that
when you type a number that contains more than 15 digits in a cell,
Microsoft Excel changes any digits past the fifteenth place to zeroes

What I felt like saying is that Python has built in this concept of "long
integers" which has unlimited precision and it automatically switches to
them
One of the clods knew Python. Instead of writing a trivial loop in Python, apparently, clod #1 proceeded to type numbers into a spreadsheet. The clod didn't compute them -- with formulas or software -- the clod typed the numbers. Typed. Have Python. Elected to type. How did they do the calculations? On a pocket calculator? Oh the shame.

Also, additional penalties for exhaustive enumeration. They sent me the spreadsheet as if it was somehow important that they could enumerate values between 2**0 and 2**135. No summary or rule. Just a mountain of useless numbers.

Strike 3. Floating Point

This is not news. Nor should it be. Indeed, this should be the first set of questions on the certification exam. If you can't figure out floating point, you can't write software. Period. Please find another job in an industry where you won't waste time on this.

Floating point is not simple, and everyone should study it before they are allowed to charge money for creating software. Running random experiments and exhaustively enumerating values is not studying. That's not even hobbyist stuff. Try actually reading. Starting with the standard. And David Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
contains more than 15 digits in a cell,
Microsoft Excel changes any digits past the fifteenth place to zeroes
This is not "news". The link provided in the email ("Last digits are changed to zeroes when you type long numbers in cells of Excel") indicates a profound lack of understanding.

They could not have noticed that this is near 2**50. They never looked up the IEEE floating point representation that -- pretty clearly -- says that there are only 52 bits of useful information. Wikipedia reminds us that this is about 15 decimal digits. Rather than look this up, they chose to be astonished.

These clods were astonished that floating-point numbers have a finite mantissa. Astonished that -- empirically -- they had stumbled on the fact that the mantissa is about 50 bits.

How much time did they waste on this? More importantly, how can they consider their activities to be "professional"? Unable to count from zero? Using the wrong tools and exhaustively enumerating the obvious? Not realizing the floating-point values have limited precision?

I find it appalling. Their escapades sound like two home hobbyists with their fist ever copy of C#. Not like professionals.

Thursday, September 23, 2010

Comments, Assertions and Unit Tests

See "Commenting the Code". This posting tickled my fancy because it addressed the central issue of "what requires comments outside Python docstrings". All functions, classes, modules and packages require docstrings. That's clear. But which lines of code require additional documentation?

We use Sphinx, so we make extensive use of docstrings. This posting forced me to think about non-docstring commentary. The post makes things a bit more complex than necessary. It enumerated some cases, which is helpful, but didn't see the commonality between them.

The posting lists five cases for comments in the code.
  1. Summarizing the code blocks. Semi-agree. However, many code blocks indicates too few functions or methods. I rarely write a function long enough to have "code blocks". And the few times I did, it became regrettable. We're unwinding a terrible mistake I made regarding an actuarial calculation. It seemed so logical to make it four steps. It's untestable as a 4-step calculation.
  2. Describe every "non-trivial" operation. Hmmm... Hard t0 discern what's trivial and what's non-trivial. The examples on the original post seems to be a repeat of #1. However, it seems more like this is a repeat of #5.
  3. TODO's. I don't use comments for these. These have to be official ".. todo::" notations that will be picked up by Sphinx. So these have to be in docstrings, not comments.
  4. Structures with more than a couple of elements. The example is a tuple of tuples. I'd prefer to use a namedtuple, since that includes documentation.
  5. Any "doubtful" code. This is -- actually -- pretty clear. When in doubt, write it out. This seems to repeat #2.
One of the other cases in the the post was really just a suggestion that comments be "clear as well as short". That's helpful, but not a separate use case for code comments.

So, of the five situations for comments described in the post, I can't distinguish two of them and don't agree with two more.

This leaves me with two use cases for Python code commentary (distinct from docstrings).
  • A "summary" of the blocks in a long-ish method (or function)
  • Any doubtful or "non-trivial" code. I think this is code where the semantics aren't obvious; or code that requires some kind of review of explanation of what the semantics are.
The other situations are better handled through docstrings or named tuples.

Assertions

Comments are interesting and useful, but they aren't real quality assurance.

A slightly stronger form of commentary is the assert statement. Including an assertion formalizes the code into a clear predicate that's actually executable. If the predicate fails, the program was mis-designed or mis-constructed.

Some folks argue that assertions are a lot of overhead. While they are overhead, they aren't a lot of overhead. Assertions in the body of the inner-most, inner-most loops may be expensive. But must of the really important assertions are in the edge and corner cases which (a) occur rarely and (b) are difficult to design and (c) difficult to test.

Since the obscure, oddball cases are rare, cover these with the assert statement in addition to a comment.

That's Fine, But My Colleagues are Imbeciles

There are numerous questions on Stack Overflow that amount to "comments don't work". Look at at the hundreds of question that include the keywords public, protected and private. Here's a particularly bad question with a very common answer.
Because you might not be the only developer in your project and the other developers might not know that they shouldn't change it. ...
This seems silly. "other developers might not know" sounds like "other developers won't read the comments" or "other developers will ignore the comments." In short "comments don't work."

I disagree in general. Comments can work. They work particularly well in languages like Python where the source is always available.

For languages like C++ and Java, where the source can be separated and kept secret, comments don't work. In this case, you have to resort to something even stronger.

Unit Tests

Unit tests are perhaps the best form of documentation. If someone refuses to read the comments, abuses a variable that's supposed to be private, and breaks things, then tests will fail. Done.

Further, the unit test source must be given to all the other developers so they can see how the API is supposed to work. A unit test is a living, breathing document that describes how a class, method or function behaves.

Explanatory Power

Docstrings are essential. Tools can process these.

Comments are important for describing what's supposed to happen. There seem to be two situations that call for comments outside docstrings.

Assertions can be comments which are executable. They aren't always as descriptive and English prose, but they are formal and precise.

Unit tests are important for confirming what actually happens. There's really no alternative to unit testing to supplement the documentation.

Tuesday, September 21, 2010

A Really Bad Idea -- Adding Clutter to A Language

A DBA suggested that I read up on "Practical API Design: Confessions of a Java Framework Architect".

Apparently the DBA had read the phrase "direct compiler support of versioning of APIs" in a review of the book and -- immediately -- become terribly confused.

I can see why a DBA would be confused. From a DBA's point of view all data, all processing and all management-- all of it -- is intimately tied to a single tool. The idea behind Big Relational is to conflate configuration management, quality assurance, programming and the persistent data so that the product is inescapable.

[The idea is so pervasive that not using the RDBMS has to be called a "movement", as in "NoSQL Movement". It's not a new idea -- it's old wine in new bottles -- but Big Relational has become so pervasive that avoiding the database makes some folks feel like renegades.]

Adding to the confusion is the reality that DBA's live in a world where version management is difficult. What is an API version number when applied to the database? Can a table have a version? Can a schema have a version?

[IMO, the answer is yes, database designs -- metadata -- can easily be versioned. There's no support in the database product. But it's easy to do with simple naming conventions.]

For a DBA -- who's mind-set is often twisted into "one product hegemony" and "versioning is hard" -- the phrase "direct compiler support of versioning of APIs" maps to "direct tool/database/everything support of versioning." Nirvana.

All Things in Moderation

A relevant quote from the book is much more sensible than this fragment of a review. "Some parts of the solution should be in the compiler, or at least reflected in the sources, and processed by some annotation processor later."

API versioning is not a good idea for adding to a programming language. At all. It's entirely a management discipline. There's no sensible avenue for "language" support of versioning. It can make sense to carry version information in the source, via annotations or comments. But to augment a language to support management can't work out well in the long run.

Why not?

Rule 1. Programming Languages are Turing Complete. And Nothing More. Syntactic sugar is okay, if it can be proven to be built on the Turing complete core language. Extra "features" like version control are well outside the minimal set of features required to be Turing complete. So far outside that they make a completeness proof hard because there's this extra stuff that doesn't matter to the proof.

Therefore: Don't Add Features. The language is the language. Add features via a toolset or a annotation processor or somewhere else. Your API revision junk will only make the proof of completeness that much more complex; and the proof won't touch the "features".

Rule 2. Today's Management Practice is Only A Fad. Version numbering for API's with a string of Major.Minor.Release.Patch is simply a trendy fad. No one seems to have a consistent understanding of what those numbers mean. Further, some tools (like subversion) simply using monotonically increasing numbers -- no dots.

Someday, someone will come up with an XML Feature Summary (XFS) for describing features and aspects of the the API, and numbers will be dropped as uselessly vague and replaced with a complex namespace of features and feature enumeration and a URI referencing an RDF that identifies the feature set. Numbers will be replaced with URI's.

Therefore: Don't Canonize Today's Management Practice in the Language. When the current practice has faded from memory, we don't want to have to retool our programming languages.

What To Do?

What we do for API version control is -- well -- hard work. Annotations are good. A tool that scrapes out the annotations to create a "profile" of the API might be handy.

In Python (and other dynamic languages) it's a much simpler problem than it is in static languages like Java and C++. Indeed, API version management may be one of the reasons for the slow shift from static to dynamic languages.

If we try to fold in complex language features for API version support, we introduce bugs and problems. Then -- when management practice drifts to a new way of handling API's -- we're stuck with bad language features. We can't simply deprecate them, we have to find a new language that has similar syntax, but lacks the old-bad API management features.

Distutils

Python distutils has a nice "Requires", "Provides" and "Obsoletes" specification that's part of the installation script. This is a handy level of automation: the unit of configuration management (the module) is identified at a high level using simple numbers. More than this is probably ill-advised.

And -- of course -- this isn't part of the Python language. It's just a tool.

Thursday, September 16, 2010

What Innovation Looks Like

Check out "End User 2.0: When Employees Have All The Answers" in InformationWeek. This is about adoption of non-approved technology. Think iPad.

This shows what innovation looks like when it happens.

1. There's no process for innovation.

2. There's no "permission to fail". Folks just fail or succeed without anyone's support or permission.

3. It's disruptive. Many IT departments don't know how to cope with USB drives, iPads and related leading-edge technology. So these things are simply banned. (Ever walked past a sign that says "No Recording Devices Allowed Beyond This Point" with your iPhone?)

Here's one great quote: "Policies around regulatory compliance, reliability, budget approvals, and support all give IT teams reasons to resist technology driven by end users."

Technology innovation is happening. It is disruptive. Therefore, IT tends to resist the disruption.

The best stall tactic: "Security". If IT lifts up security as an issue, they can resist technology innovation effectively.

Other Disruptive Change

This happens everywhere. It isn't just the iPad. All disruptive, innovative change is met with serious resistance.

Agile Methods. Some IT departments resist agile methods because -- obviously -- the lack of a comprehensive and detailed project plan is a problem. Failure to plan is a plan for failure. The idea of building intentional flexibility into predicting the future is rejected. It's too disruptive to the IT chain of command to reduce the need for project managers.

Dynamic or Functional Programming Languages. It was painful to adopt Java (or C#). Adopting another, different language like Python is insanity. Obviously. Anyone in "Big IT" who is a serious Java or C# developer can tell you that a dynamic language is obviously unsuitable for production use. Mostly, the reasons boil down to "it's different"; different is too disruptive.

N0SQL Data Management. Clearly, the relational database is the only form of persistence that can possibly be used. It is perfect in every way. It can be used as a message queue (because adopting an actual message queue is too much work). It can be used for temporary or transient data. It can be used for non-relational objects like XML documents. Any suggestion that we use something other than a database is often met with derision. Clearly, a non-SQL database is disruptive to the orderly flow of data.

Simplified Architecture. [This is code for "No Stored Procedures".] Since stored procedures have been used with mixed success, some folks argue that they should be used more. On the other hand, it seems peculiar to me to intentionally fork application logic into two places -- application code and database. It seems to add complexity with no value. Lots of DBA's try to explain that some logic is "lower-level" or "more closely associated with the data" or "is a more 'essential' business rule." There's no dividing line here that makes stored procedures necessary or useful.

Try to prevent the problems associated with stored procedures and you will receive a $#!+-storm of abuse. Every time. Reducing the use of stored procedures is a disruptive change. An innovation. A bad thing.

[Want proof of the non-essential nature of stored procedures? Watch what happens when to upgrade or replace an application and migrate your data. Did you need the stored procedures? No, you left those behind. You only kept the data.]

Tuesday, September 14, 2010

Sustaining Innovation or Placating Management

Computerworld, July 26, 2010 cover story was "Sustaining Innovation". The magazine was chock-full of thoughts on innovation.

Much of it was good, but some of it is just appeasement to management types who are focused on cost and schedule.

"Establish processes and ownership. Organizations should have a process for taking a promising idea and testing it out..."

Innovation is disruptive. Having a process seeks to minimize disruption. Indeed, most processes are there to block real innovation and constraint the energies into pre-approved areas of innovation.

Which do you want? Innovative change or a nice, controlled process so that things don't change?

The number one item -- in the side bar -- is brilliant. "Give employees the right to fail."

Epic advice. And almost impossible to follow, hence the appeasement item. Permission to fail means -- well -- failures and money "wasted". Cost and schedule shot in an attempt to improve.

Juxtaposition

There's an interesting juxtaposition with another article in the same issue. "When Good Projects Go Bad".

This: "Push for due diligence at the start of a project." This is the perfect way to stifle innovation.

I realize the two articles are almost unrelated, but they're side-by-side in the print edition. Any cowardly project manager knows that we can't really give people permission to fail. We need to have a well-defined process, perform due diligence, and then assure that all projects are either denied up front or are a ringing success.

It's easy to avoid the important lesson on innovation (innovation == failure) and focus on the appeasement words: "process" and "due diligence".

Wednesday, September 8, 2010

Transformation Pipelines

My laptop chartplotter software (GPSNavX) is marvelous for visualizing a route. But, there are elements to route planning that it doesn't handle gracefully.

Specifically, it doesn't provide useful elapsed time calculation at all. While the TTG and ETA (Time to Go and Estimated Time of Arrival) for the next waypoint are essential, they aren't enough. On a sailboat, you need to know the complete sequence of planned arrival times so that you can gauge the overall impact of wind and tide vagaries on the trip.

As the trip progresses, the schedule variance allows informed decision-making. Tack? Hole up for the night? Motor?

GPSNavX has a number of marvelous export capabilities -- as GPX, KML or CSV data. These contain a list of waypoints and little more. We need to enrich this data to produce an overall schedule.

Enriching The Data

Fundamentally, we have a number of enrichment stages. The functional programming features of Python make this complex sequence of enrichment stages into a very tidy, and adaptable application.

    if variance is None: variance= chesapeake
with open(route_file,'rb') as source:
rte= csv.reader(source)
with open(schedule_file,'wb') as target:
rte_rhumb= csv.writer( target )
rte_rhumb.writerow(
["Name", "Lat", "Lon", "Desc",
"Distance (nm)", "True Bearing", "Magnetic Bearing",
"Distance Run", "Elapsed HH:MM", ]
)
for sched in gen_schedule( gen_mag_bearing( gen_rhumb( gen_route_points( rte ) ), variance), speed ):
rte_rhumb.writerow(
[sched.point.name, sched.point.lat, sched.point.lon, sched.point.desc,
sched.distance,
int(sched.true_bearing.deg) if sched.true_bearing is not None else None,
int(sched.magnetic.deg) if sched.magnetic is not None else None,
sched.running, sched.elapsed_hm, ]
)

Essentially, the core of this is a simple composition of generator functions.
for sched in gen_schedule( gen_mag_bearing( gen_rhumb( gen_route_points( rte ) ), variance), speed ):
We have a number of individual transformations to look at. Each of those follows a common pattern. The transformations are:
  • gen_route_points( rte )
  • gen_rhumb( route_points_iter )
  • gen_mag_bearing( rhumb_iter, declination )
  • gen_schedule( rhumb_mag_iter, speed= 5.0 )
There's a bunch of fancy math involved in the rhumb line and distance between two points on the surface of the earth. For my purposes, plane sailing is fine. I don't need great circle routes because I'm not going far.

See Calculate distance, bearing and more between Latitude/Longitude points for a clear and useful treatment of the math. Bowditch's American Practical Navigator, chapter 24, provides alternate methods using table lookup and interpolation.

Seeding the Pipeline

Here, for example, is the first step of the pipeline. Creating a RoutePoint from the four values found in the CSV or GPX route file.
RoutePoint= namedtuple( 'RoutePoint', 'name,lat,lon,desc,point' )

def gen_route_points( rte ):
for name,lat,lon,desc in rte:
point= navigation.LatLon( lat, lon )
yield RoutePoint( name, lat, lon, desc, point )
Since this is a generator function, it can use an iterator and be used by an iterator. The source iterator can be a csv.reader. Or it can be the result of XML parsing -- just so long as it matches the interface specification of being an iterator over a 4-tuple.

Rhumb-Line Calculation

We'll enrich the data. But we won't update an object. We'll stick closely to the philosophy of immutable objects (i.e., named tuples) which are modified by a generator function.
RoutePoint_Rhumb= namedtuple( 'RoutePoint', 'point,distance,bearing' )

def gen_rhumb( route_points_iter ):
p1= route_points_iter.next()
for p2 in route_points_iter:
r, theta= navigation.range_bearing( p1.point, p2.point )
yield RoutePoint_Rhumb( p1, r, theta )
p1= p2
yield RoutePoint_Rhumb( p2, None, None )
The essential calculations are in a separate module, navigation. What we've done, however, is merge information from adjacent values so that we can transform a simple list of points into a list of pairs of points: the from and to for each leg of the trip. Between the two points, we compute the simple rhumb line, the distance and bearing.

True to Magnetic Conversion

We need to enrich our waypoint rhumb-line information with magnetic compass information. The true course needs a declination or variance value added to it. Again, we're just creating new objects from existing objects, using immutable named tuples.
RoutePoint_Rhumb_Magnetic= namedtuple( 'RoutePoint', 'point,distance,true_bearing,magnetic' )

def gen_mag_bearing( rhumb_iter, declination ): # A/k/a Variation
for rp_rhumb in rhumb_iter:
if rp_rhumb.bearing is None:
yield RoutePoint_Rhumb_Magnetic(rp_rhumb.point, None, None, None)
else:
magnetic= rp_rhumb.bearing+declination(rp_rhumb.point)
yield RoutePoint_Rhumb_Magnetic(rp_rhumb.point, rp_rhumb.distance, rp_rhumb.bearing, magnetic )
In this case, we're simply including a declination calculation. While the model is available from IUGG, we can often use averages or approximations. And -- in the Chesapeake -- the approximation is simply to add 11 degrees.

Spot-on accuracy doesn't matter, since we're driving a sailboat. The compass isn't very accurate; the boat motion makes it hard to read precisely; and current as well as leeway have profound effects. It is helpful to have magnetic courses in the schedule instead of true courses. Further, the chartplotter will be computing the final CTS (Course to Steer).

So this stage in the pipeline might be optional.

Distance Run and Elapsed Time

The final distance run and elapsed time is pretty simple. We're creating a new tuple from existing tuples.
SchedulePoint = namedtuple( 'RoutePoint', 'point,distance,true_bearing,magnetic,running,elapsed_min,elapsed_hm' )

def gen_schedule( rhumb_mag_iter, speed= 5.0 ):
distance = 0.0
for rp in rhumb_mag_iter:
if rp.true_bearing is None:
yield SchedulePoint( rp.point, rp.distance, rp.true_bearing, rp.magnetic, None, None, None )
else:
distance += rp.distance
elapsed_min= 60.*distance/speed
h, m = divmod( int(elapsed_min), 60 )
elapsed_hm = "{0:02d}h {1:02d}m".format( h, m )
yield SchedulePoint( rp.point, rp.distance, rp.true_bearing, rp.magnetic, distance, elapsed_min, elapsed_hm )
This gives us a tuple that includes the original way point, the next waypoint, the distance, true bearing, magnetic bearing, total distance run, and elapsed time.

The amount of programming is minimal. The overall design seems reasonably flexible and built from small, easy-to-validate pieces.

Python's functional programming features -- particularly generator functions and named tuples -- seem to make it pleasant to write this kind of transformation pipeline.

Legacy Data Preservation

Extracting legacy data can be really, really hard. However, it's of central importance because data lives forever. Application "logic" and "business rules" come and go.

Today's case study is a dusty old Dell Inspiron Laptop running Windows 98 with Chartview software.

Problem 1. Chartview. No extract or export capability, except to a GPS via a serial port. I guess I could solder up a PC serial connector to a serial-USB interface so my Mac could read the stream of NMEA 0183 messages that contain routes and waypoints. But that seems complex for a one-time transfer.

Problem 2. Windows 98. Won't mount any USB device I own. No solid-state disk, no rotating disk. Nothing. I have the original install CD with all the extra drivers. Didn't help.

Problem 3. Dell Laptop. DVD player, floppy disk drive, and a USB port that Windows 98 doesn't seem to know what to do with. No ethernet, only a modem connection.

How do we preserve the waypoints and routes on this ancient Dell so that we can replace it with a nice, new Standard Horizon CP 300i and MacBook Pro running GPSNavX?

Raw Data

To get the raw data, I pulled the disk drive, mounted it in an IDE-USB enclosure and pulled the relevant routes and waypoints files. Now we have something we can work with.

The file formats are undocumented, but the data's not complex, making it easy to explore. Also, we can look at the old Chartview GUI to see the data and compare it with the raw bytes on the file.

Modern software is more properly normalized, simplifying the conversion. The legacy Chartview route data included each waypoint -- unrelated to the master list of waypoints -- along with bearing and range information, as well as compass deviation and projected speed. Really. A modern GPX file as used by GPSNavX or iNavX only needs the waypoints. Nothing else. New software will correctly calculate range and bearing to next waypoint as well as lookup the magnetic deviation from standard tables. So we don't need to preserve all of the data.

Pass 1

The first pass is to write simple "hex dump" utility in Python to see what's even in the files.

Something like this seems to allow enough flexibility to get a good view of the record sizes and field contents in the file.

def hex_print( bytes, offset=0 ):
for section in xrange(0,len(bytes)+31,32):
block= bytes[section:section+32]
print( ' ', ' '.join( '{0:3d}'.format(x+section) for x in xrange(32) ) )
print( "{0:4d}".format(offset+section), ' '.join( '{0:3d}'.format(ord(x)) for x in block ) )
print( ' ', ' '.join( " {0}".format(x) if 32 <= ord(x) < 128 else ' ' for x in block ) )
print()

def hex_dump( file, size=32 ):
offset= 0
with open(file,'rb') as data:
print( ' ', ' '.join( '{0:3d}'.format(x) for x in xrange(size) ) )
block= data.read(size)
while block:
hex_print( block, offset )
block= data.read(size)
offset += size
Once we have a sense of what's going on, we can use the Python struct module to get the real data.

Pass 2

In the case of Chartview marks, we have a complex, but manageable structure definition. Some of the field sizes are conjectures. It's possible that all those filler bytes are some kind of word or x386 paragraph alignment; it's also possible that I've misinterpreted some of the less relevant numeric fields. The two double-precision values, however, are rock solid.

mark_structure = "=b 4x 6s 45x 502s d d 10x h h h 10x 8s x 6s x 8s x 6s 31x f f 32x"
Mark= namedtuple( 'Mark',
"record, name, text, lat, lon, display_name, enable_mark, anchor_mark,"
"dt1, tm1, dt2, tm2, arrival_radius, max_xte" )

Given this structure and the associated named tuple, we can write a pleasant (and highly reusable) generator function.

def gen_items( file, structure, record_class ):
size= struct.calcsize( structure )
with open(file,'rb') as data:
block= data.read(size)
while len(block) == size:
raw= record_class( *struct.unpack(structure,block) )
yield raw
block= data.read(size)

This makes for a simple application to extract the marks. An application can reformat them into GPX or CSV format.

Something like this is a good starting point.

def print_marks( file ):
for mark in gen_items( file, mark_structure, Mark ):
print( strip(mark.name), mark.lat, mark.lon )

We can easily write a version which includes the formatting required to get the latitudes and longitudes into a format acceptable by GPSNavX or X-Traverse.

Routes

Routes are more complex than marks because they have a header, followed by the details as a sequence of individual waypoints. Since Chartview doesn't normalize these things, each route can have duplicate waypoints, making it very difficult to get them loaded into an application that normalizes properly.

How many WP1's can you have? The answer should be "one". Each additional WP1 is a problem. But it's a small problem. For the WPx points, we simply disambiguate them a route number. There seemed to have been a limit of 15 routes, so we can just expand WPx to WPx-rr, where rr is the route number.

Bottom Line

Data is preserved. Legacy PC and GPS can be chucked (or sold on eBay to a collector).

It's important to note that data outlives application software. This is a universal truth -- data lasts forever, applications come and go. Highly optimized data structures (like the legacy Chartview files) are a bad policy. Highly usable data structures (like GPX files) are more valuable.

Python does a marvelous job of making a potentially horrifying data conversion into something like a few evenings trying to find the key pieces of data in the legacy files. Perhaps the hardest part was tracking down single and double-precision floating-point numbers. But once they were found -- matching known latitudes and longitudes -- it was clear sailing.

Wednesday, September 1, 2010

Using SCons

In looking at Application Lifecycle Management (see "ALM Tools"), I had found that SCons appears to be pretty popular. It's not as famous as all the make variants, or Apache Ant or Apache Maven, but it seems to have a niche in the forest of Build Automation Software.

While it looks nice, parts of SCons are confusing. I struggled until I found a simple use case.

"SCons proved to be more accurate, mostly due to its stateful, content-based signature model.

On the other hand, GNU Make proved to be more resource friendly, especially regard- ing the memory footprint. SCons needs to address this problem to be a viable alternative to Make when building large software projects."
[Also, it appears that a lot of build and test automation have been reframed as "Continuous Integration". Which isn't really a bad thing. But it can be confusing because there are too many categories into which general-purpose tools can be fit.]

While SCons looks cool, I haven't had a huge need for it at work. Working in Python, there's no real "build". Instead our continuous integration boils down to unit testing. Our "build" is an SVN checkin. Our deployment is an SVN checkout and `python setup.py install`.

At some point, I would like to create an SConstruct file that runs our integration test suite. But it's trapped at a low priority.

SCons and Sphinx

I did find an SConscript example that automated a document build using Sphinx. This
sphinx-scons was quite cool. However, it was challenging to customize. The SCons documentation requires real work to understand. I could see the value, but it was a lot of work.

I'm hoping that No Starch Press finds someone to write a tidy introduction to SCons.

SCons and RST and LaTeX (oh, my!)

Sphinx has made me a total fanboi of ReStructured Text. While I know MS Word and iWorks Pages quite well, I have no patience with all the pointing and clicking. Getting consistency is requires consistent pointing and clicking; some people can do it, but some of us find that manual pointing and clicking is sometimes irreproducible. Semantic markup is a huge pain in the neck because we have to stop typing to click on the proper style hint for the various words.

I also know DocBook XML and LaTeX quite well. I've used very cool XML editors including XML Mind XML Editor (which is very nice.) I no longer have any patience with any of these tools because there's too much GUI.

RST is fun because you write in plain text. There are a few directives and a few bits of inline roles for semantic markup. But your work can focus on the content, leaving presentation aside. A command-line tool (with templates) emits HTML or LaTeX or whatever. The style considerations can be (a) secondary and (b) completely consistent.

RST will easily produce complex LaTeX from plain text. What a joy. LaTeX, of course, isn't the goal, it's just an intermediate result that leads to DVI which leads -- eventually -- to a PDF.

Because of the Unicode and font selection on the Mac, I'm a user of XeTeX and XeLaTeX. I have some problems with getting my copy of Blackadder ITC to work, but generally I'm able to write without much fussing around.

SCons has a great deal of the TeX/DVI/PDF tool chain already installed. However, it doesn't have either the rst2latex script or the XeTeX tools.

An SConscript

While my first attempts to understand SCons didn't work out well, looking at RST and XeLaTex was a much better use case.

I wound up with this.

rst2latex = Builder( action="rst2latex.py $SOURCES >$TARGET",
suffix='.tex', src_suffix='.rst',
)
xelatex = Builder( action=["xelatex $SOURCES", "xelatex $SOURCES"],
suffix='.pdf', src_suffix='.tex',
)
env = Environment(ENV=os.environ,
BUILDERS = { 'rst2latex' : rst2latex, 'xelatex':xelatex }
)

env.rst2latex('someDoc')
env.xelatex('someDoc' )
env.rst2latex('anotherDoc')
env.xelatex('anotherDoc')

Getting this to work was quite pleasant. I can see how I could further optimize the document production pipeline by combining the two Builders.

[And yes, the xelatex step is run twice to guarantee that the references are correct.]

Now, I can get away with write, run `scons` and review the resulting PDF. It's fast and it produces a nice-looking PDF with very little work and no irreproducible pointing and clicking.

Given this baseline, I can now dig into SCons for ways to make this slightly simpler.