Tuesday, April 30, 2013

Legacy Code Preservation: Paving the Cowpaths


No discussion of legacy preservation is complete without some "Paving the Cowpaths" stories.

The phrase refers to the way cows meander across the landscape. They reliably follow the same path, and that path tends to wander in ways that seem crazy to us.

Rather than do a survey and move some dirt to lay a straight, efficient road, paving the cowpaths refers to simply using the legacy path without consideration of more efficient alternatives.

There are two nearly identical paving-the-cowpaths stories, separated by three years. We'll only look at one in detail, since the other is simply a copy-and-paste clone.

The Code Base

In both cases, the code base was not something I saw in any detail. In one case, I saw a presentation, and I talked with the author in depth. In the other case, I had the customer assign a programmer to work with me.

In one case they had a fabulous application system that was the backbone of their business. It was lots and lots of VAX Fortran code that did simply everything they needed, and did it exactly the right way. It was highly optimized and encoded deep knowledge about the business.

[The other case wasn't so fabulous, but the outcome is the same.]

Sadly, each gem was entirely written to use flat files. It was relatively inflexible. A new field or new relationship required lots of tweaking of lots of programs to accommodate the revised file layout.

In 1991, the idea of SQL databases was gaining currency. Products like Oracle, Ingres, Informix and many others battled for market share. This particular customer had chosen Ingres as their RDBMS and had decided to convert their essential, foundational applications from flat file to relational database.

The Failure

There was a singular and epic failure to understand relational database concepts.

A SQL table is not a file that's been tarted up with SQL access methods.

A foreign key, it turns out, is actually rather important. Not something to be brushed aside as "too much database mumbo-jumbo."

What they did was preserve all of their legacy processing. Including file operations. They replaced OPEN, CLOSE, READ and WRITE with CONNECT, DISCONNECT, SELECT and UPDATE in a remarkably unthinking way.

This also means that they preserved their legacy programs that made file copies. They rewrote a file copy as a table copy, using SELECT and INSERT loops.

Copying data from one file to another file can be a shabby way to implement a one-to-many relationship. It becomes a one-to-one with many copies. A file copy can be amazingly fast. A SQL table copy can never be as fast as a file copy.
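
The contrast between a file-style copy and a set-based copy can be sketched roughly as follows. This is only an illustration: it uses Python's sqlite3 module as a stand-in for Ingres, and the table names (master, copy_a, copy_b) and columns are hypothetical.

```python
import sqlite3

def copy_row_by_row(conn, source, target):
    """File-style copy: read every row, then write every row, one at a time."""
    rows = conn.execute(f"SELECT id, amount FROM {source}").fetchall()
    for row_id, amount in rows:
        conn.execute(
            f"INSERT INTO {target} (id, amount) VALUES (?, ?)", (row_id, amount)
        )

def copy_set_based(conn, source, target):
    """Relational copy: one statement; the database engine does the looping."""
    conn.execute(
        f"INSERT INTO {target} (id, amount) SELECT id, amount FROM {source}"
    )

conn = sqlite3.connect(":memory:")
for name in ("master", "copy_a", "copy_b"):  # hypothetical table names
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO master VALUES (?, ?)", [(i, i * 1.5) for i in range(1000)]
)

copy_row_by_row(conn, "master", "copy_a")
copy_set_based(conn, "master", "copy_b")
```

Both functions leave the same rows behind, so comparing results says nothing about whether the copy was a sensible thing to do in the first place.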

They can, of course, easily compare the database results with the old flat file results. The structures are nearly identical. This, I think, creates a false sense of security.

My Condolences

In both cases, I was called in to help them "tune" the database to get it to run faster.

I asked about the longest-running parts of the application. I asked about the most business-critical parts of the application. "What's the most important thing that's being blocked by unacceptable slowness?"

It's not possible to get everything to be fast. It is, however, possible to get important things to be fast. Other, less important things can be slow. That's okay.

They talked me through a particularly painful part of the application that was very important and unbelievably slow. It cloned a table, making a small change to each row.

"Oh," I suggested, "you could have used an UPDATE statement with no WHERE clause to touch all rows."

That suggestion, it turns out, was wrong. The copying was essential because the keys were incomplete.

Then it began to dawn on me.

Their legacy application did file copies because they were almost instant. And the filename (and directory path) became part of the key structure.

They were shocked that a SQL table copy could be so amazingly slow. Somehow, the locking and logging that create transactional integrity weren't visible enough.

The really hard part was trying to---gently---determine why they thought it necessary to clone tables.

The answer surfaced slowly. They had simply treated SQL as if it were a file access method. They had not redesigned their applications. They did not understand how primary key and foreign key relationships were supposed to work. They had, essentially, wasted a fair amount of time and money doing a very, very bad thing.
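
A minimal sketch of what completing the key might look like, assuming the file name had been acting as an implicit part of the key. The table name (rate), the scenario column, and the 5% adjustment are all hypothetical, and sqlite3 again stands in for the real RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: 'scenario' plays the role the file name used to play,
# so rows that once lived in separate files can share one table.
conn.execute("""
    CREATE TABLE rate (
        scenario TEXT NOT NULL,
        code     TEXT NOT NULL,
        amount   REAL NOT NULL,
        PRIMARY KEY (scenario, code)
    )
""")
conn.executemany(
    "INSERT INTO rate VALUES (?, ?, ?)",
    [("base", "A", 100.0), ("base", "B", 250.0)],
)

# The "clone with a small change" becomes a single set-based statement.
conn.execute("""
    INSERT INTO rate (scenario, code, amount)
    SELECT 'plus5', code, amount * 1.05 FROM rate WHERE scenario = 'base'
""")

plus5 = dict(conn.execute(
    "SELECT code, amount FROM rate WHERE scenario = 'plus5'"
))
```

With the key complete, there is no second table to create, and the original rows remain available under their own scenario value.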

Preservation

They preserved the relevant business features.

They also preserved irrelevant technical implementation features.

They didn't understand the distinction between business process and technical implementation details.

In effect, they labored under the assumption that all code was precious, even code that was purely technical in nature.

Thursday, April 25, 2013

Legacy Code Preservation: What's the Cost?


It's 1980-something. We're working on a fairly complex system that includes some big machines and three computers. One of the computers has a magnetic tape drive into which it writes a log of interesting events. In the 80's, this was a pretty big deal.

An operational run will produce a log; then we can use customized applications to analyze and reduce the log to something more useful and focused. The first step is to do some data extraction to get the relevant log entries off the tape and into a disk file that engineers can work with.

Recall that the spreadsheet has only been around for a few weeks at this point in the history of computing. Sums and counts require programs. In this case, they are written in Fortran.

So far, so good. My job is to add yet another feature to the data extraction program. It will pull some new, different bits of data off the logs.

The log entries are, of course, fairly complex. This is not different from log scraping in a web server context. Some log entries have to be ignored, others have to be merged. Some have cryptic formats.

The Code Base

The extraction application has been in use (and heavily modified) for a couple of years. Many programmers have touched it. Many.

The data extractor is written in a language called JOVIAL. This is not a problem. It's the language of the large system being built. The engineers are happy to use Fortran for their off-line analysis of the files.

There's a subtlety that arises in this mixed language environment. Any engineer with Fortran skills can whip together an analysis program. But only the favored few programmers know enough JOVIAL to tweak the data extraction program. And they're all busy writing the real software, not supporting analysis and trouble-shooting.

This data extractor program suffers from a lot of "copy-and-paste" programming. Blocks of code are repeated with minor changes. Standard modules are repeated with differences from the official copy that the entire rest of the system uses. Block comments don't nest, so it's hard to remove a large chunk of code which contains a block comment.

Further, it suffers from "Don't Delete Diddly" programming. Large swaths of code are left in place, relegated to a subroutine that never gets used. Other blocks of code are circumvented with a GOTO statement to simply jump over the code.

And, it has a complex history and provenance. In order to debug anything on the complex target system, the logger had to be the first thing up and running. Therefore, the logger specifically predates all other features of the application. It doesn't involve any rational reuse of any other piece of software.

This is the 80's, so version control and forking a new version were simply not done.

My job was to make a minor revision and extract just one certain type of log entry. Effectively a "filter" applied to the log.

After several days of reading this mess, I voted with my feet. I wrote a brand-new, from-scratch, "de novo" program (in Fortran, not JOVIAL) which reads the tape and produces the required log entries.

Why?

It was cheaper than messing with the legacy code base. Less work. Less risk of breaking something. And less long-term cost from continuing to maintain the data extractor.

Grief and Consternation

Discarding the legacy JOVIAL analysis program was a kind of heresy. It was a Bad Thing To Do. It "Raised Questions".

Raised Questions? Really? About what?

Did it raise questions about the sanity of managers who preserved this beast? Or about the sanity of programmers doing copy-and-paste programming?

I had to endure a lengthy lecture on the history of the data extraction program. As if the history somehow made a bad program better.

I had to endure begging. The legacy program should be preserved precisely because it was a legacy. Really. It should be "grandfathered in" somehow. Whatever that means.

Preservation

The original JOVIAL data extractor program still existed. It still ran. It could still be used. The JOVIAL code base and tools (and skilled programmers) remained available.

No one had deleted anything. There was no actual problem.

We had just started to realize that it was time to move on.

I started with a clean, simple Fortran program that read the logs, extracted records, and created files that engineers could work with.

But, by doing that, I guess I had called somebody's baby ugly.

This new Fortran program preserved the essential knowledge from the original JOVIAL program. Indeed, I think that one of the reasons for all the grief was that I had exposed relevant details of the implementation, stripped clean of the historical cruft.

The tape file format and the detailed information on the log file records had gone from closed and embedded in just one program to open and available to more than one program.

The Fortran program exposed the log file details so that anyone could write a short (and more widely readable) Fortran program. This allowed them to avoid the cost and complexity of waiting for someone like me to modify the JOVIAL extraction program.
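
The shape of such a short program can be sketched in Python. The record layout here (a 2-byte type code, a 4-byte timestamp, a 10-byte payload) is purely hypothetical; the real tape format isn't described in this post. The point is only that once the layout is written down in one readable place, a filter is a few lines.

```python
import io
import struct

# Hypothetical layout: 2-byte record type, 4-byte timestamp, 10-byte payload.
RECORD = struct.Struct("<HI10s")

def read_log(stream):
    """Yield (type, timestamp, payload) for each fixed-size record."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of log, or a truncated final record
        yield RECORD.unpack(chunk)

def extract(stream, wanted_type):
    """The 'filter': keep only the records of one type."""
    return (record for record in read_log(stream) if record[0] == wanted_type)

# A tiny in-memory stand-in for the tape.
sample = io.BytesIO(
    RECORD.pack(1, 100, b"startup")
    + RECORD.pack(7, 105, b"fault")
    + RECORD.pack(1, 110, b"shutdown")
)
faults = list(extract(sample, 7))
```

Anyone who needs a different slice of the log changes the test inside extract(), or writes another small function on top of read_log().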

The file format is merely a technical detail. It's the analyses that were of real value. And none of that was in the original JOVIAL program. They remained as separate Fortran programs.

Tuesday, April 23, 2013

Legacy Code Preservation: Are There Quirks?


Let's visit some other conversion activities in the 1970's. The gig was at a company implementing a customized insurance application. The actuaries used a PDP-10 (and Fortran) to compute their various tables and summaries.

I was roped into rewriting an actuarial Fortran program into PL/1 for an IBM 370.

This program, clearly, encodes deep business knowledge. It must be preserved very precisely, since the actuarial calculations are directly tied to the financial expectations for the particular line of business.

The good news about Fortran to PL/1 conversion is that PL/1 offers features (and syntax) that are similar to Fortran. It's not an exact match, but it's close enough to make the conversion relatively risk-free.

There are, of course, issues.

In particular, Fortran IV was not big on the "structured if-then-else" features of Algol-like languages. PL/1, like Pascal, followed on the heels of Algol 60. Fortran didn't follow Algol; Fortran depended on GOTO statements instead of nested IF-THEN-ELSE statements.

This meant that some logic expressions were rather tangled and difficult to fully understand. Patience and care were required to unwind the logic from its tangled nest of Fortran GOTO's into neater PL/1 BEGIN-END blocks.

Test Case

Perhaps the most important gap here was the lack of any kind of definitive test case.

It was the 70's. Testing was---at best---primitive. The languages and tools didn't support very much in the way of automated testing.

Compounding the problem, IT management was so late in getting the project started that we had to do repeated overnighters to get things running. The fog of sleep deprivation doesn't facilitate high quality software.

Further compounding the problem, we didn't really have access to the PDP-10 that the actuaries used. We couldn't run any controlled tests.

And. Bonus.

We were doing "test-in-production". As soon as it worked, that was the official production run. Everything prior to the one that worked was discarded as a test run.

The test strategy was simply to do a side-by-side comparison with the legacy PDP-10 output. While it's tedious to read hundreds of pages of mainframe computer print-out, that was the job.
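
Today, that kind of side-by-side reading could be roughly automated. The sketch below is not what we did then (we read paper); it assumes both outputs are available as lists of text lines, and the function name compare_reports and the relative tolerance are illustrative choices.

```python
import math
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def compare_reports(legacy_lines, new_lines, rel_tol=1e-6):
    """Return (line_number, legacy, new) for lines that differ beyond rel_tol.

    Only the lines both reports have in common are compared.
    """
    differences = []
    for n, (old, new) in enumerate(zip(legacy_lines, new_lines), start=1):
        old_nums = [float(x) for x in NUMBER.findall(old)]
        new_nums = [float(x) for x in NUMBER.findall(new)]
        # The non-numeric text must match once numbers are masked out.
        same_text = NUMBER.sub("#", old) == NUMBER.sub("#", new)
        same_nums = len(old_nums) == len(new_nums) and all(
            math.isclose(a, b, rel_tol=rel_tol)
            for a, b in zip(old_nums, new_nums)
        )
        if not (same_text and same_nums):
            differences.append((n, old, new))
    return differences

legacy = ["AGE 40 RATE 0.012346", "AGE 41 RATE 0.013100"]
converted = ["AGE 40 RATE 0.012345", "AGE 41 RATE 0.013100"]

strict = compare_reports(legacy, converted)
loose = compare_reports(legacy, converted, rel_tol=1e-3)
```

The strict comparison flags the first line; the loose one accepts it. Choosing the tolerance is a business decision about how much disagreement matters, not a technical one.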

Results

For the first attempts, there were significant logic issues. Regions of IF-GOTO that hadn't been properly rewritten into IF-THEN-ELSE.

At some point, the output would disagree. The PDP-10 Fortran, of course, was deemed to be "right."

So it was a matter of discovering what was unique about the case where there was a difference. Lots of deduction and puzzle solving.

Finally, we got down to one really subtle issue.

The numbers were slightly different. Slightly.

What does this slight discrepancy mean?

Is it a bug? Do we have to chase down some math error? It's unlikely to be a math error, since the expressions convert trivially from Fortran to PL/1. And the numbers are close.

Is it a feature? Is there something in Fortran or PL/1 that we simply failed to understand? Unlikely.

Everything else works.

It's a "quirk". It's not a "bug" because it's not clearly wrong. It's not a feature, because we're not going to define it as being clearly right. It's in this middle realm of behavior best described as quirky.

Quirks

What we've uncovered, it turns out, is the difference between Fortran floating point calculations and PL/1's fixed-point decimal calculations. PL/1's compiler reasons out the proper number of decimal places in the intermediate results and generates fixed-point decimal code appropriately.

Decimal hardware, BTW, was part of the IBM 370 system. Decimal-mode arithmetic was often faster than floating-point.

The PL/1 rules have some odd features regarding division and multiplication. A*0.001 and A/1000 have different deduced numbers of decimal places. Other than that, the rules are obvious and mathematically sound.

The PL/1 version provides exact decimal answers. Lots of decimal places exact.

The Fortran version involved approximations. All floating-point calculation must be looked at as an approximation. Many numbers have an exact binary representation. But numbers without an exact binary representation will have tiny errors. The tiny errors are magnified through calculations. Generally, subtracting two nearly-equal floating-point values elevates the erroneous parts of the approximation to lofty heights of visibility.
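
The effect is easy to reproduce in Python. This is only an analogy: Python's float is double-precision binary, not PDP-10 single precision, and the decimal module is not PL/1 fixed-point decimal. The pattern of approximation versus exact decimal results is the same, though.

```python
from decimal import Decimal

# 0.1 has no exact binary representation, so ten of them don't sum to 1.0.
float_sum = sum([0.1] * 10)
decimal_sum = sum([Decimal("0.1")] * 10)

# Subtracting nearly equal values promotes the tiny error to the leading digits.
float_diff = (0.1 + 0.2) - 0.3
decimal_diff = (Decimal("0.1") + Decimal("0.2")) - Decimal("0.3")

print(float_sum == 1.0, decimal_sum == 1)   # False True
print(float_diff == 0.0, decimal_diff == 0) # False True
```

The floating-point difference is tiny, but it is not zero, and it is the entire result of the subtraction.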

Preservation

It was important to preserve the essential actuarial knowledge encoded in Fortran into PL/1.

It was not as important to preserve the quirks of single-precision floating-point math.

Clearly, we have to distinguish between three separate considerations.
  1. Valuable Features: encoded business knowledge.
  2. Implementation Details: technology knowledge.
  3. Quirks: aspects of the implementation that lead to low-value discrepancies in the output.

Thursday, April 18, 2013

Legacy Code Preservation: What's the Story?


Wind back the clock to the late 1970's. Yes, there were computers in those days.

Some of my earliest billable gigs were conversions from old OS to new OS. (Specifically DOS/VSE to OS/370, now called z/OS.) Back when a company owned exactly one computer, all of the in-house customized software had to be rewritten as part of the conversion.

For the most part, this was part of a corporate evolution from an IBM 360-series to 370-series computer. That included revisions of the operating system toward OS/370.

A company's custom software often encoded deep knowledge of business operations. Indeed, back in the day before desktop computers, that software was the business. There was no manual fallback if the one-and-only computer didn't work. Consequently, the entire IT department could be focused on converting the software from old operating system to new.

Every line of code was carefully preserved.

Not all software encoded uniquely valuable knowledge, however.

Flashback

In the days before relational databases, all data was in files. File access required a program. Often a customized piece of programming to extract or transform a file's content.

In old flat-file systems, programs would do the essential add-change-delete operation on a "master" file. In some cases programs would operate on multiple "master" files.

In this specific conversion effort, one program did a kind of join between two files. In effect, it was something like:
SELECT * FROM BIG_TABLE
JOIN OTHER_TABLE ON BIG_TABLE.CODE = OTHER_TABLE.CODE
...

What's interesting about this is the relative cost of access to OTHER_TABLE.

A small subset of OTHER_TABLE rows accounts for most of the rows that join with BIG_TABLE. The rest of the OTHER_TABLE rows were referenced once or a very few times in BIG_TABLE.

Clearly, a cache of the highly-used rows of OTHER_TABLE has a huge performance benefit. The question is, of course, what's the optimal cache from OTHER_TABLE? What keys in OTHER_TABLE are most used in BIG_TABLE?

Modern databases handle this caching seamlessly, silently and automatically. Back in the 70's, we had to tailor this cache to optimize performance on a computer that---by modern standards---was very small and slow.
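
The hand-tailored cache can be sketched like this. It's an illustration under assumptions: OTHER_TABLE is modeled as a dict, BIG_TABLE as a list of codes, and a cache miss stands in for an expensive file read. The function names are hypothetical.

```python
from collections import Counter

def build_cache(big_table_codes, other_table, size):
    """Preload the `size` most-used OTHER_TABLE rows."""
    popular = Counter(big_table_codes).most_common(size)
    return {code: other_table[code] for code, _ in popular}

def join(big_table_codes, other_table, cache):
    """Join each code to its OTHER_TABLE row, counting the slow lookups."""
    joined, slow_lookups = [], 0
    for code in big_table_codes:
        if code in cache:
            row = cache[code]
        else:
            row = other_table[code]  # stands in for an expensive file read
            slow_lookups += 1
        joined.append((code, row))
    return joined, slow_lookups

other_table = {"A": "Alpha", "B": "Beta", "C": "Gamma", "D": "Delta"}
codes = ["A"] * 6 + ["B"] * 3 + ["C", "D"]

cache = build_cache(codes, other_table, size=2)
joined, slow_lookups = join(codes, other_table, cache)
```

With two of the four rows cached, only 2 of the 11 lookups in this sample take the slow path. The whole question is which codes deserve the cache slots.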

The Code Base

In the course of the conversion, I was assigned a script ("JCL" is what they called a shell script on z/OS) that ran two programs and some utility sort steps. Specifically, the sequence of programs did the following:
SELECT CODE, COUNT(*)
FROM BIG_TABLE
GROUP BY CODE
ORDER BY 2

Really? Two largish programs and two utility sort steps for the above bit of SQL?

Yes. In the days before SQL, the above kind of process was a rather complex extract from BIG_TABLE to get all the codes used. That file could be sorted into order by the codes. The codes could then be reduced to counts. The final reduction was then sorted into order by popularity.

This program did not encode "business knowledge." It's purely technical.
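
The four legacy steps (extract, sort, reduce, sort) can be mimicked in a few lines of Python. This is a sketch of the shape of the old process, not a transcription of it; putting the most popular codes first is an assumption.

```python
from itertools import groupby

def popularity(codes):
    """Extract, sort, reduce, sort: the four legacy steps in miniature."""
    extracted = list(codes)                    # step 1: extract the codes
    by_code = sorted(extracted)                # step 2: sort by code
    counts = [                                 # step 3: reduce runs to counts
        (code, sum(1 for _ in group)) for code, group in groupby(by_code)
    ]
    # step 4: sort by popularity, most-used first
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

ranked = popularity(["B", "A", "C", "A", "B", "A"])
print(ranked)  # [('A', 3), ('B', 2), ('C', 1)]
```

Each step is a pass over the whole data set, which is why the original needed two programs and two utility sorts.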

At the time, SQL was not the kind of commodity knowledge it is today. There was no easy way to articulate the fact that the program was purely technical and didn't encode something interesting. However, I eventually made the case that this pair of programs and sorts could be replaced with something simpler (and faster.)

I wrote a program that used a data structure like a Python defaultdict (or a Java TreeMap) and did the operation in one swoop.

Something like the following:
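from collections import defaultdict, namedtuple

# Hypothetical fixed-width layout; the real record had more fields.
BigTable = namedtuple("BigTable", ["code"])
field0 = slice(0, 4)

def BigTable_iter(source):
    """Yield one BigTable row for each fixed-width line."""
    for line in source:
        yield BigTable(line[field0.start:field0.stop])

# Count how often each code is used.
counts = defaultdict(int)
total = 0
with open("big_table") as source:
    reader = BigTable_iter(source)
    for row in reader:
        counts[row.code] += 1
        total += 1

# Invert to frequency -> codes, then report the most popular first.
by_count = defaultdict(list)
for code, count in counts.items():
    by_count[count].append(code)
for frequency in sorted(by_count, reverse=True):
    print(by_count[frequency])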

Except, I did it in COBOL, so the code was much, much longer.

Preservation

This is one end of the spectrum of legacy code preservation.

What was preserved?

Not business knowledge, certainly.

What this example shows is that there are several kinds of software.
  • Unique features of the business, or company or industry.
  • Purely technical features of the implementation.
We need to examine each software conversion using a yardstick that measures the amount of unique business knowledge encoded.

Tuesday, April 16, 2013

Legacy Code Preservation


Rule One: Writing Software is Capturing Knowledge.

Consequence: Converting Software is Preserving Knowledge.

When software is revised for a new framework or operating system or database or when an algorithm is converted to a new language, then we're "converting" (or "migrating") software. We're preserving code, and preserving the knowledge encoded.

For the next few months, I'm going to post some examples of preserving legacy code and how this ties to the knowledge capture issue.

Once we've looked at some examples of business software, we can turn to something a little less concrete: HamCalc.

These examples are presented in historical order. Each example raises questions and outlines elements of a strategy for legacy code preservation.
  • What's the Story? Late 1970's. What user story was encoded in the software?
  • Are There Quirks? Late 1970's. Is the encoded knowledge really a useful feature? Or is it a bug? What if we can't be sure?
  • What's the Cost? Early 1980's. What if the legacy code is complex and expensive? How can we be sure it doesn't encode some valuable knowledge?
  • Paving the Cowpaths. Throughout the 80's. When converting from flat-file to database, how can we distinguish between encoded user stories and encoded technical details? Isn't all code equally valuable? There are several examples; I've combined them into one.
  • Data Warehouse and Legacy Operations. This is a digression on how data warehouse implementation tends to preserve a great deal of legacy functionality. Some of that legacy functionality exists in stored procedures, a programming nightmare.
  • The Bugs are the Features. Can you do software preservation when the user doesn't seem to understand their own use cases?
  • Why Preserve An Abomination? How do we preserve shabby code? How can we separate the user stories from the quirks and bugs? There are several instances; I've used one as an example.
  • How Do We Manage This? The legacy code base was so old that no one could summarize it. It had devolved to a morass of details. With no intellectual handles, how can we talk about the process of converting and what needs to be preserved?
  • Why Preserve the DSL? describes a modern instance of "Test-Driven Reverse Engineering" where the unit test cases were created from the user stories and the legacy code was used merely as supporting details. An entirely new application was written which preserved very little of the legacy code, but met all the user's requirements.
These nine examples include some duplicates. It's really more like a dozen individual case studies. Some are simple duplicates; the name of the customer is changed, but little else.

Thursday, April 11, 2013

This Seems Irrational... But... HamCalc

Step 1.  Look at the original HamCalc.  Even if you aren't interested in Ham radio, it's an epic, evolving achievement in a specialized kind of engineering support.  It's a repository of mountains of mathematical models, some published by the ARRL, others scattered around the internet.

Step 2.  Look closely at HamCalc.  It's all written in GW Basic.  Really.  The more-or-less final update is from 2011 -- it's no longer an active project -- but it's a clever idea that suffers from a horrible constraint imposed by the implementation language.

A long time ago, I was captivated by the idea of rewriting HamCalc as Java Applets.  It seemed like a good idea at the time, but that's a lot of work: 449 programs, 85,000 lines of code.

Recently, I wanted to make some additional use of HamCalc's amazing collection of formulas.

However.  The distribution kit is rather hard to read.  The .BAS files are in the tokenized "binary" format.

I found a Python project to interpret the byte codes into a more useful format. See http://www.danvk.org/wp/gw-basic-program-decoder/  However, it wasn't terribly well written, and didn't prove completely useful.

GW Basic Byte Codes

Look at http://www.chebucto.ns.ca/~af380/GW-BASIC-tokens.html for some basic rules on the file format.

See http://www.antonis.de/qbebooks/gwbasman/index.html for a reasonably clear definition of the language itself. Quirks are, of course, studiously ignored, so there's a lot of ambiguity on edge cases.

The bytes-to-text translation is pretty simple.  The next step -- interpreting GW Basic -- is a bit more complex.
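
A heavily simplified sketch of the bytes-to-text step follows. Treat the details as assumptions to check against the token reference above: it assumes an unprotected file that starts with 0xFF, lines made of a 2-byte next-line pointer, a 2-byte line number, and a zero-terminated body, and a tiny illustrative token table. It ignores encoded numeric constants and multi-byte tokens, both of which a real decoder must handle.

```python
import struct

# Partial, illustrative token table (an assumption; verify before relying on it).
TOKENS = {0x81: "END", 0x89: "GOTO", 0x8F: "REM", 0x91: "PRINT"}

def decode(data, tokens=TOKENS):
    """Yield text lines from a tokenized program image (simplified)."""
    if data[:1] != b"\xff":
        raise ValueError("not an unprotected tokenized file")
    pos = 1
    while pos + 4 <= len(data):
        next_line, number = struct.unpack_from("<HH", data, pos)
        if next_line == 0:
            break  # a zero pointer marks the end of the program
        pos += 4
        end = data.index(b"\x00", pos)  # naive: assumes no zero bytes in the body
        text = "".join(
            tokens.get(b, chr(b) if 0x20 <= b < 0x7F else f"<{b:02X}>")
            for b in data[pos:end]
        )
        yield f"{number} {text}"
        pos = end + 1

# A hand-built two-line program image for illustration.
image = (
    b"\xff"
    + struct.pack("<HH", 0x1234, 10) + b'\x91 "HI"\x00'
    + struct.pack("<HH", 0x1240, 20) + b"\x81\x00"
    + b"\x00\x00\x1a"
)
listing = list(decode(image))
```

Unknown bytes are shown as hex in angle brackets, which makes the gaps in the token table easy to spot when run against a real .BAS file.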

Future

The irrational thing is that I'm captivated by the idea of preserving this legacy gift from the authors in another, more useful language. Indeed, the idea of a community of "HamCalc Ports to Other Languages" appeals to me. This base of knowledge is best preserved by being made open so that it can be rewritten into other commonly-used languages.

There's a subset version here: http://www.softpedia.com/get/Science-CAD/HamCalc.shtml and here http://www.dxzone.com/dx11432/hamcalc-v1-3.html. This is just a few of the calculations, carefully rebuilt to include nice versions of the ASCII-art graphics that are central to the original presentation.

The hard part of preserving HamCalc is the absolute lack of any test cases of any kind.

I think the project should work like this.

  1. Publish the complete plain text source decoded from the tokenized binary format. It will likely be somewhere on http://www.itmaybeahack.com/ or perhaps a Dropbox.
  2. Publish the index of programs and features as a cross-reference to the various programs. This should include the various links and references and documentation snippets that populate the code and output. This forms the backbone of the documentation as well as the unit testing.
  3. Do a patient (and relatively lame) translation to Python3.2 to break HamCalc into two tiers: the calculation library and a simple UI veneer using the stdio features of the print() and input() functions. The idea is to do a minimalist rewrite of the core feature set so that a GUI can be laminated onto a working calculation library.
  4. Work out test cases for the initial suite of 449 legacy programs oriented toward the calculation layer, avoiding the UI behavior. The idea isn't 100% code coverage. The idea is to pick the relevant logic paths based on the more obvious use cases.
A sophisticated GUI is clearly something that was part of the original vision. But the limitations of GW Basic and tiny computers of that era assured that the UI and calculation were inextricably intertwingled.

If we can separate the two, we can provide a useful library that others can build on.
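
The two-tier split, plus a test aimed only at the calculation tier, might look something like this. The wavelength calculation is a stand-in chosen for illustration, not code taken from HamCalc, and the function names are hypothetical.

```python
# --- calculation tier: pure functions, no I/O ---
SPEED_OF_LIGHT_M_PER_S = 299_792_458

def wavelength_m(frequency_mhz):
    """Free-space wavelength in meters for a frequency in MHz."""
    if frequency_mhz <= 0:
        raise ValueError("frequency must be positive")
    return SPEED_OF_LIGHT_M_PER_S / (frequency_mhz * 1_000_000)

# --- UI tier: a thin veneer over the library ---
def main(read=input, write=print):
    """Prompt for a frequency and report the wavelength."""
    frequency = float(read("Frequency (MHz)? "))
    write(f"Wavelength: {wavelength_m(frequency):.2f} m")

# --- test case for the calculation tier only ---
def test_wavelength():
    assert abs(wavelength_m(299.792458) - 1.0) < 1e-12
```

Calling main() runs the console version. Because main() accepts its read and write functions as parameters, a GUI (or a test) can supply its own without touching the calculation.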

Maybe I should organize http://www.hamcalc.org/ as the jumping-off point for this effort?