Moved. See All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Thursday, May 30, 2013

Legacy Preservation and "Code Modernization"

The correct marketing term is "Code Modernization".

There are a large number of companies in the Code Modernization business.

They appear to offer automated "modernization" of code.

I would suggest going slowly toward automated modernization. I'm not easily convinced that any automated tool can preserve what's meaningful and ignore the parts which are quirks, bugs or legacy cruft that needs to be disposed of.

Indeed, I'm rather sure that an automated modernization will actually be deleterious.

Without some care, cruft from the legacy code could be canonized into an incomprehensible "feature" in the modern implementation. This will eventually become it's own weird legacy quirk.

It's often best to rethink and rewrite the essential parts of the legacy. Why?

Rule One: Writing Software is Capturing Knowledge.

Consequence: Converting Software is Preserving Knowledge.

Knowledge is something that people use. As with modern Agile software development, we need to focus on the connections among people. We need to use code as part of the channels of communication.

Read the Agile Manifesto:

There are four principles:

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

A detailed investigation into the use cases of legacy software, the unique knowledge encoded, and the quirks and cruft requires thinking about who uses the software and what they expect.

Working software (i.e., readable code) is central. An automated modernization that doesn't properly handle quirks and cruft may create "working" software in the sense that it compiles. But software that people can't read and understand doesn't "work" in the more global sense of encoding captured knowledge.

Collaboration is defeated by automated modernization. The users may have features they don't like or additional features they need. Doing the functionality improvements side-by-side with code modernization makes the most sense. Indeed, it often leads to a proper rewrite, which is the best strategy.

It's difficult for users to envision new software that corrects long-standing quirks. When an Agile process makes software available incrementally (e.g., the release cycle of Scrum), then the users learn what they should have requested. An automated modernization cannot easily respond to the way that users learn through each release.

Tuesday, May 28, 2013

HamCalc -- Preserving the Legacy


The response to modernizing HamCalc was overwhelming.

Apparently there are a fair number of people who also think that HamCalc is a treasure to be preserved, improved and added-to.

If you're interested, start here:

You can ask to be a member of this wiki, and I can add you. This gives us a place to share ideas.

I think the most important aspect of this project is to be welcoming and supportive of recreational and amateur programmers. The explanatory notes and documentation need to be clear, with proper references and footnotes. The final application programs need to be simple.

The point of HamCalc is not to build a large, complex application. The point of HamCalc is to be a repository of small, simple solutions that can be easily understood and repackaged. In my estimation, no one simply "uses" HamCalc. They read, extend, modify and make local copies of their private extensions to HamCalc.

Toward this end, we need to allow for alternate implementations. Two people should be able to carve out the same piece of code and provide different implementations that provide consistent results but reflect different optimizations for speed and resource use.

Here's the code repository.

This will let folks who are interested get started. What I think might happen is that people will carve out pieces that interest them and convert bits of HamCalc. There are 449 programs, each one of which is a potential nugget of goodness.

Thursday, May 23, 2013

Legacy Code Preservation: Language Incompatibility and Technology Evolution

It's important to address language or platform incompatibility as consequences of technology modernization. The reason why we have to do manual conversions of software is because of the language incompatibility issue. We must convert manually when no tool can do the conversion.

There are several layers to this.
  • Platform Incompatibility. This means that the supporting libraries for the language are incompatible between versions. This is relatively rare; language libraries are almost always backward compatible. When they aren't, the problem can often be masked with a "shim" or little bit software to "wrap" the new libraries to make them work like the old libraries. Adding the shim is -- generally -- a terrible idea. Why preserve the old version's weird features and quirks? Why add the complexity (and bugs and quirks) of the shim?
  • Support or Framework Incompatibility. A common "Support" incompatibility is a database; there are many other examples. SQL, for example, has a standardized core, but is not consistently implemented and vendor extensions are common. Any large framework will have compatibility issues among versions and platforms.
  • OS Incompatibility. Most POSIX-compliant OS's (Linux, Mac OS X, etc.) are reasonably compatible. Windows throws a monkey-wrench into the works. In some cases, a language offers a library to make the programs in that language OS-agnostic. An OS-unique feature of an application is a disturbing thing to convert. Is the OS-unique feature an essential feature of the application? In some cases, the OS-unique feature stems from specialized drivers for media support (sound, images, video, etc.) This media compatibility issue leads to complex OS-agnostic support or leads to the use of third-party OS-agnostic libraries.
  • Language Incompatibility. This is usually an absolute block to automated conversion. Languages are designed not to be compatible at a conceptual or semantic level. Automated translation from one programming language to another is difficult and in some cases essentially impossible without some kind of supremely sophisticated artificial intelligence effort. If languages were compatible at a conceptual level, we'd have universal translation among programming languages.
When we look at our case studies, we see the following:
  • What's the Story? OS conversion; the language remained more-or-less the same.
  • Are There Quirks? OS and Language conversion: Fortran to PL/1.
  • What's the Cost? Language conversion: JOVIAL to Fortran.
  • Paving the Cowpaths. Persistence framework conversion: flat files to RDBMS.
  • Data Warehouse and Legacy Operations. This often involves OS, language and persistence conversion.
  • The Bugs are the Features. This was a mental problem, not a technical one.
  • Why Preserve An Abomination? OS, language and persistence conversion: Basic to Java.
  • How Do We Manage This? OS, language and persistence conversion: COBOL to Java.
  • Why Preserve the DSL? Language and persistence conversion: C to Java.
In the case where there's a language conversion, the effort simply becomes new development. The "conversion" or "modernization" concept is there merely to make managers feel that value is being preserved.
In the rare case where the language was not converted, deep questions about user stories vs. technical implementation needed to be asked and answered clearly and completely. When they were not asked (or answered) the conversion did not go well.

Automated Language Translation

The idea of automated language conversion is an "attractive nuisance". ( Not only is it generally impossible, it reduces or eliminates the value of the captured knowledge.

Assume you have some program P_1 in language L_1. It captures some knowledge, K, about the problem domain, and encodes that knowledge in a more-or-less readable and meaningful format.
We want to 'automagically' create a new program P_2 in language L_2. Since the two languages employ different concepts, different data structures, different programming paradigms, the conversion doesn't happen at a "high level". This is not a matter of changing the print statement to the print() function. This is a matter of "understanding" the program, P_1 and then creating a new program, P_2. that performs the "same" functions from the user's point of view.

Choice #1 is to create a very high-level technical "specification" that's language-independent. Then, a translator compiles that high-level specification into the new language. In essence, we've "decompiled" from P_1 to P_S and then compiled P_S to P_2, using an intermediate specification language, L_S. The high-level specification language L_S must contain both languages, L_1 and L_2, as features.
There are examples of elements of this. C++ is compiled to C. Eiffel is often compiled to C. We can think of C++ as a specification language that's translated to C.

Further, we know that "control structure" (IF-THEN-ELSE, WHILE, GOTO) can all be mapped to each other. There's an elegant graph-theoretic proof that a program which is a morass of GOTO's can be revised into IF-THEN-ELSE and WHILE loops. Clearly, then, the converse is possible.

While we can go from C++ to C, can we go from C to C++? At least superficially, yes. But that's only true because C++ is defined to be a superset of C. So that example is really poor. We'll ignore C++ as a higher-level language.

Let's look at Eiffel. We can go from Eiffel to C. Can we go from C to Eiffel? Not really. Eiffel lacks the GOTO, which C supports. Also, C has unconstrained pointer coercion (or casting) which Eiffel lacks. In order to "decompile" C to Eiffel, we'd need to "understand" the C programming and essentially rewrite it into a neutral version in Eiffel which could be then translated to another implementation language.

Making the problem worse, C has murky semantics for some constructs. a[i++]= i; for example, is poorly-defined and can do a wide variety of things.

Semantic Loss

Choice #1--to create a very high-level technical "specification"--can't be done automatically.

Choice #2 is to create a very low-level implementation of the program P_1 by compiling it into machine instructions (or JVM instructions or Python byte codes or Forth words). This low-level language is L_M. Given a program in L_M, we want to restructure those machine instructions into a new program, P_2, in the new language, L_2.

It's important to observe that the translation from P_1 to machine code L_M may involve some loss of semantic information. A machine-language "AND" instruction might be part of a P_1 logical "and" operation or part of a P_1 bit mask operation. The context and semantic background is lost.
Without the semantic information, P_2 may not reflect the original knowledge captured in P_1.
Note that this difficulty is the same as choice #1--creating a higher-level specification.

We can't easily "decompile" code into a summary or understanding or description. Indeed, for some languages, we're pretty sure we don't want to try to automatically decompile it. Some legacy C code is so obscure and riddled with potential confusion that it probably should be rewritten rather than decompiled.

Here's a concrete example from HamCalc.
700 A=2:B=1:T=P:X=0
750 IF T/N=INT(T/N)THEN X=X+1:PN(X)=N:T=T/N:GOTO 730
760 A=3:B=2
770 NEXT N

The point is to find prime factors of P, building the array PN with the X factors.

Note that line 750 executes a GOTO back to the FOR statement. What -- precisely -- does this mean? And how can be be automagically decompiled into a specification suitable for compilation into another language?

This, it turns out, is also an example of a place where HamCalc is not a repository of profoundly useful programming. See for more sophisticated algorithms.

Knowledge Capture

It appears that knowledge capture requires thinking.

There's no automatic translation among programming languages, data structures or programming paradigms.

The only viable translation method is manual conversion:
  1. Understand the source program.
  2. Create unit test cases.
  3. Develop a new program that passes the unit test cases.

Tuesday, May 21, 2013

Legacy Code Preservation: Some Patterns

After looking at this suite of examples, we can see some patterns emerging. There seem to be several operating principles.
  1. The Data Matters. In many cases, the data is the only thing that actually matters. The legacy application knowledge may be obsolete, or so riddled with quirks as to be useless. The legacy knowledge may involve so much technical detail---no user story---that it's irrelevant when a newer, better technology is used.
  2. User Stories Matter; Legacy Technology Doesn't Matter. It is essential to distinguish the legacy technology from the meaningful user stories. Once the two are teased apart, the technology can be replaced and the user stories preserved. A cool DSL may be helpful, and needs to be preserved, or may be a distraction from the real solution to the real problem.
  3. Understanding the New Technology Is Central. Misusing the new technology simply creates another horrifying legacy.
  4. Testing is Essential. Legacy code cannot be preserved without test cases. Any effort that doesn't include automated comparisons between legacy and converted is just new development.
  5. Discarding is Acceptable. Unless the legacy code has a seriously brilliant and unique algorithm, most business applications are largely disposable. It may be less expensive to simply do new development using the legacy code as a kind of overly-detailed specification. Calling it "conversion" to make managers feel good about "preserving" an "asset" is acceptable. The project is the same as new development, only the words change to protect the egos of those not really involved.
  6. Quirks are Painful. They might be bugs or they might be features. It will be difficult to tell.
How do these principles match against our various case study projects?
  • What's the Story? The applications were technical, and could be discarded.
  • Are There Quirks? Without a test case, we could not be sure of our conversion. So we accepted the quirks.
  • What's the Cost? The application was technical, and could be discarded.
  • Paving the Cowpaths. New Technology was misused, the result was a disaster.
  • Data Warehouse and Legacy Operations. The legacy software encoding knowledge can be split haphazardly into database and application software buckets. The user stories matter. The technology doesn't matter.
  • The Bugs are the Features. The user stories matter. If you can't articulate them, you're going to struggle doing your conversion.
  • Why Preserve An Abomination? When the code is shabby and has bugs, you have to sort out the quirks that will be carried forward and the junk that will be discarded.
  • How Do We Manage This? The user stories matter. The data matters. Focus on these two can help prioritize.
  • Why Preserve the DSL? The user stories and test cases lead to a successful outcome. While the customer may feel like a conversion was being performed, it was really new development using legacy code as a specification.
With modern languages and tools, legacy code conversion is quite simple. The impediments are simply managerial in nature. No one wants to have a carefully maintained piece of software declared a liability and discarded. Everyone wants to think of it as an asset that will be carefully preserved.

Thursday, May 16, 2013

Legacy Code Preservation: Why Preserve the DSL?

\A Domain-Specific Language (DSL) can provide some intellectual leverage. We can always write long and convoluted programs in a general-purpose programming language (like Python, Java or C). Sometimes it can make more sense to invent a new domain-specific language and implement the solution in that language.

Sometimes, even well-written, highly portable programming becomes a legacy. I once converted a large, well-written program from C to Java. The organization had no skills in C and didn't want to build these skills.

They wanted their legacy C program rewritten into Java and then extended to cover some additional features.

The timeframe for this exercise is the sometime after 2010. This is important because automated unit test and test driven development are common knowledge. We're not fighting an uphill battle just to figure out how to compare legacy with new.

In essence, we're going to be doing "Test Driven Reverse Engineering." Creating test cases, seeing what the legacy software does and then creating new software that does the same thing.
We're preserving exactly those features which we can encode as unit test cases.

The Code Base

In this case, there was an interesting wrinkle in the code base.

The application included a small Domain-Specific Language (DSL) that was used to define processing rules.

There were only a dozen or so rules. The idea was the various combinations of features could be included or excluded via this little DSL. The application included a simple parser that validated the DSL and configured the rest of the application to do the actual work.

The DSL itself is of no value. No one in the user organization knew it was there. The file hadn't been touched in years. It was just a configuration that could have been meaningfully built as source code in C.

The dozen or so rules are extremely important. But the syntax of the DSL was disposable.
It was relatively simple to create class definitions that -- in a limited way -- mirrored the DSL. The configuration could then be translated into first-class Java.
class Configuration {
List theRules= new LinkedList();
Configuration() {
        new Simple_Rule( this_condition, SomeOption, AnotherOption );
        new Simple_Rule( that_condition, SomeOption );
        new Exception_Rule( some_condition, some_exception );

Things that had been keywords in the old DSL became objects or classes full of static declarations that could be used like objects.

By making the configuration a separate module, and using a property file to name that module, alternate configurations could be created. As Java code in Java Syntax, validated by the Java compiler.

The Unit Tests

The bulk of the code was reasonably clear C programming. Reasonably. Not completely.

However, I still insisted on a pair of examples of each of the different transactions this program performed. These mostly paralleled the DSL configuration.

Having this suite of test cases made it easy to confirm that the essential features really had been preserved.

The user acceptance testing, however, revealed some failures that were not part of the unit test suite. Since TDD was new to this customer, there was some fumbling while they created new examples that I could turn into proper unit test cases.

The next round of acceptance testing revealed another few cases not covered by the examples they had supplied. By now, the users were in on the joke, and immediately supplied examples. And they also revised a existing examples to correct an error in their test cases.

What Was Preserved

Of the original C software, very little actually remained. The broad outline of processing was all.
The tiny details were defined by the DSL. This was entirely rewritten to be proper Java classes.

The C data structures where entirely replaced by Java classes.

All of the original SQL database access was replaced with an ORM layer.

Further, all of the testing was done with an entirely fresh set of unit tests.

The project was -- actually -- new development. There was no "conversion" going on. The customer felt good about doing a conversion and "preserving" an "asset". However, nothing much was actually preserved.

Tuesday, May 14, 2013

Legacy Code Preservation: How Do We Manage This?

At an insurance company, I encountered an application that had been in place for thirty years.
Classic flat-file, mainframe COBOL. And decades old.

It had never been replaced with a packaged solution. It had never been converted to a SQL database. It had never been rewritten in VB to run on a desktop.

What had happened is that it had grown and morphed organically. Pieces the original application it had been subsumed by other applications. Additional functionality had been grafted on.

After a few decades of staff turnover, no single person could summarize what the applications did. There was no executive overview. No pithy summary. No elevator pitch.

The company had, further, spent money to have consultants "reverse engineer" the COBOL. This meant that the consultants created narrative English-language versions of the COBOL code.

This reverse engineering replaced detailed, disorganized COBOL with detailed, disorganized English. No summaries were produced that could serve as an explanation of the actual valuable parts of the program.

The question of scope and duration was daunting. The conversion would take years to complete. the central question become "How to manage the conversion?"

The Goal

The goal was to preserve the valuable features while migrating the data out of flat files into a proper SQL database. The focus on the data was important.

The technical obstacle was the hellish complexity of the applications and their various shell scripts ("JCL" in the Z/OS mainframe world.)

One approach to overcoming the complexity is to break the overall collection of applications down into just those applications that write to any of the central "master" files. Other applications that read master files or do other processing are less important than those which update the master files.

The master files themselves are easy to identify. The JCL that references these files is easy to identify.

The programs run by those JCL scripts give us clusters of related functionality.

We want to rank the master files by business value. The one with the most valuable data is something we tackle first. The least valuable data we leave for last.

In some cases, we'll identify programs that work with relatively low-value data; programs which are not actually assets. They don't encode any new, useful knowledge. A wise manager can elect to remove them from the software inventory rather than convert them.

Since the conversion can't happen overnight, there needs to be a period of coexistence between the first conversion and the last. And this coexistence means that database tables will get extracted back to flat files so that legacy programs can continue to operate.

Another component of the plan for this conversion was the assembly of test cases. This is critical when refactoring code.

The idea here is to preserve selected files and run them through the application software to create repeatable test cases. One of the existing file-compare utilities can be used to validate the output.

Other Barriers

The human obstacle was triumphant here. People who had worked with this software for their entire career in IT would rather quit than help with the conversion.

Experienced mainframe people could not see how a "little" Linux processor could ever provide the amazing feature set and performance of their beloved mainframe. They made the case that CICS was higher performance and more scalable than any web-based application platform.

And that lead to an impasse where one camp refused to consider any migration except to COBOL and CICS. The other camp simply wanted to write a web application and be done with it.

The COBOL/CICS group were either confused on in denial of how quick and simple to can be to build default web applications around stable data models. In this case, the relational database version of the flat files would not be difficult to concoct. Once built, 80% of the application programming would be default add-change-delete transactions. The other 20% would be transactions that included the remnants of useful knowledge encoded in the legacy COBOL.

More time would be spent "studying the alternatives" than were required to build working prototypes.


The real question is one of what needs to be preserved.

Clearly, the data is central to the business.

The larger question, then, is how much of the COBOL processing was really essential processing?

How much of the COBOL was technical workaround to implement things that are one-liners in SQL?
How much of the COBOL is workaround for other bugs in other applications? How many programs fix broken interface files? How many programs provide data quality inspections?

How much of the COBOL is actually unique? A substantial fraction of the legacy code was irrelevant because a package replaced it. Another substantial fraction implemented a "Customer Relationship Management" (CRM) application for which a package might have been a better choice than a software conversion.

How much of the legacy code contain quirks? How much code would we would have to understand and consider repairing because it actually contains a long-standing bug?

Perhaps the only thing of value was the data.

And perhaps the reason for the human obstacle was an realization that the cost to convert exceeded the value being preserved. It's difficult to have your life's work simply discarded.

Thursday, May 9, 2013

Legacy Code Preservation: Why Preserve An Abomination?

By the early aughts (2001-2005) Visual Basic had gone from state of the art to a legacy application language. Code written in VB was being replaced with something more modern (generally Java.)

Having worked with COBOL and Fortran legacy programs, it's easy to describe this legacy VB code as an abomination. Several customers. Several applications. Not a simple size of one. Abomination.

The VB programming I've been involved in preserving has been uniformly shabby. It seems reflect an IT department that threw warm bodies at a problem until it appeared solved and did nothing more. No code reviews. No cleanup. Just random acts of maintenance.

I'm sure there's are many good VB programmers. But I haven't seen their work product yet.
And now, a customer is paying consultants like me to clean up their shabby VB and replace it with Java. Or a web site. Or both.

A Code Base

One particular example of abominable VB was a hodgepodge of copy-and-paste programming, GOTO's and other poorly-used features.

The application printed insurance-like summaries of benefits. In order to do this, it extracted a great deal of data from a database. It relied on a collection of stored procedures and an intimate connection with a massive "calculation" module that derived the actual benefits, which the VB application summarized and reported.

The new application architecture was designed to separate the database from the calculations. The printing of letters to summarize benefits would be yet another separate part of a web site.

Instead of being done on one specific model of dot-matrix printer, the letter would be generated as a PDF that could be displayed or downloaded or printed. Pretty conventional stuff by modern standards. A huge revision considering the legacy programming.

The intimate connections between the database, the calculation module and the letter-writing module would have to be narrowed considerably. A formal list of specific pieces of information would have to replace the no-holds-barred access in the VB modules and database.


Essentially, the VB code that produced the letters encoded some business knowledge.

Much of knowledge was encoded in the calculation module, also.

This needed to be refactored so that the business knowledge was focused on the calculation module.
The letter writing had to be stripped down to something that worked like this.
  1. Query some initial stuff from the database.
  2. Determine which letter template to use.
  3. Query the rest of the stuff from the database based on the specific situation and template letter.
  4. Fill in the blanks in the template. Generate the PDF.
The legacy VB code, of course, had to be studied carefully to locate all of the business process and all of the data sources.

And all of the quirks had to be explained. In this case, there are unfixed bugs that had to be preserved in order to say the new output matched the old output. In a few cases, we had to report that the old application was clearly wrong and actually fix the bugs. Interestingly, there was no arguing about this. The bugs we found were all pretty well-known bugs.

An "Interface" module was defined. All of the processing from the VB letter-writing programs was pushed into the interface. The letter-writing was refactored down to template fill-ins.

The interface became the subject of some architectural debate. It was the essential encoded knowledge from the original VB programs. Is it part of letter-writing? Or is it really part of the core calculation module?

Eventually, it was pushed into the calculation module and the "interface" could be removed, leaving a very clean interface between letter-writing and calculation.

I suspect (but I don't know) that the Java calculation module was quite the mishmash of stuff extracted from numerous VB programs. Hopefully, those programmers had proper unit test cases and could refactor the calculation module to get it into some sensible, maintainable form.

Tuesday, May 7, 2013

Legacy Code Preservation: The Bugs Are The Features

The extreme end of "paving the cowpaths" are people for whom the bug list is also the feature list.
This is a very strange phenomenon, rarely seen, but still relevant to this review.

In this particular case, the legacy application was some kind of publishing tool. It used MS-Word documents with appropriate style tags, and built documents in HTML and PDF formats from the MS-Word document. Badly.

To begin with, problems with MS-Word's style tags are very difficult to diagnose. Fail to put the proper style tag around a word or phrase and your MS-Word source file looks great, but it doesn't produce the right HTML or PDF.

More importantly, the PDF files that got created were somewhat broken, having bugs with embedded fonts and weirdness with downloading, saving locally and printing.

And--of course--the vendor was long out of business.

What to do?

Feature Review

In order to replace this broken publishing app, we need to identify what features are essential in the HTML and PDF output. This shouldn't be rocket science.

For instance, there may be inter-document links that should be magically revised when a document moves to a new URL. Or, there must be embedded spreadsheets. Or there have to be fill-in forms that can be printed.

The editor who worked with this tool could not---even after repeated requests---provide a list of features.

Could not or would not, it didn't matter.

Instead, they had a list of 20 or so specific bugs that needed to be fixed.

When we tried to talk about locating a better publishing package, all the editor could bring to the table was this list of bugs.

For example: "The downloaded document has to have the proper flags set so that it uses local fonts."
We suggested that perhaps this should include "Or it has all the fonts embedded correctly?"
We were told---firmly---"No. The local fonts must be used or the document can't be saved and won't print."

Trying to explain that this font bugs doesn't even exist in other packages that create proper PDF's went nowhere. There was a steadfast refusal to understand that the bugs were not timeless features of PDF creation. They were bugs.

Several meetings got sidelined with the Bug List Review.

Eventually, the editor had to be reassigned to something less relevant and visible. A more rational editor was put into the position to work with a technology team to bring in a new package and convert the legacy documents.

The Confusion

It's not easy to see where the confusion between feature and bug comes from.

Why did the desirable feature set become a murky unknown? An editor should be able to locate the list of styles actually used and what those styles did in the resulting documents.

The confusion and bizarre behavior could possibly stem from the stress or terror.

Perhaps learning a new package was too stressful.

Perhaps idea of converting several dozen complex documents from MS-Word to some other markup was terrifying. So terrifying that roadblocks needed to be put in place to prevent meaningful discussion of the conversion effort.

Where was the business knowledge encoded? What needed to be preserved?

Not the software. It was junk.

There was business knowledge represented in the documents in the obvious way. But there was also some business knowledge encoded in the markup that established links and references, emphasis, spreadsheets, forms or whatever other features were being used.

Any semantic markup encodes additional knowledge above and beyond the words being marked up. The semantic markup was what needed to be preserved during the conversion. Maybe this was the source of the terror and confusion.

Thursday, May 2, 2013

Legacy Code Preservation: Data Warehouse and Legacy Operations

A data warehouse preserves data.

It can be argued that a data warehouse preserves only data. This, however, is false.

To an extent, a data warehouse must also preserve processing details.

Indeed, a data warehouse exemplifies knowledge capture because the data and its processing steps are both captured.

The ETL process that prepares data for loading into the warehouse is tied to specific source applications that provide data in a known form and a known processing state. A warehouse isn't populated with random data. It's populated with data that is at a known, consistent state.

For example, when loading financial data, the various accounting applications (like the General Ledger) must be updated with precisely the same data that's captured for data warehouse processing. Failure to assure consistency between ledger and warehouse makes it difficult to believe that the warehouse data is correct.

Preserving Details

In some cases, legacy applications have a tangled architecture. Code can be repeated because of copy-and-paste programming. This can make it difficult to be sure that a data warehouse properly captures data in a consistent state.

What's distressingly common is to have a "code" or "status" field where the first or last position has been co-opted to have additional meanings. A "9" in the last position of a product number may be a flag for special processing.

These cryptic flags and indicators are difficult to identify in the first place. They are often scattered throughout the application code base. Sometimes they reflect work-arounds to handle highly-specialized situations. Other times, they're pervasive changes that were done via cryptic flags rather than make a first-class change to a file format.

When populating a warehouse, these codes and flags and secret processing handshakes need to be found and properly normalized. This may mean that an ETL program will recapitulate different pieces of special-case logic that's scattered around a number of legacy programs.
This is the essence of knowledge capture.

It also drives up the cost and expense of maintaining the ETL pipeline that feeds the data warehouse.
After all, the source application can make processing changes that aren't properly reflected in the ETL processing pipeline.

As if this isn't bad enough, many organization permit technology that makes processing even more obscure.

The Evils of Stored Procedures

In far too many cases, software architectures place code into two locations.
  • Application programs.
  • Data bases.
Putting code into a database is simply a mistake. There's no rational justification. None.

The irrational justifications include the following farcical claims.
  • Stored Procedures are faster.
    Not really. There's no reason why they should be faster, and simple benchmark measurements show that application programs outside the database will be as fast or faster than stored procedures. A process running outside the database doesn't compete for database resources the same way the stored procedure engine does.
  • Some processing is essential to data integrity.
    This is absurd, since it presumes that the folks writing stored procedures are trustworthy and folks writing non-stored procedure applications are a lying bunch of thieving scoundrels who will break the data integrity rules if given half the chance.
Let's look at this second justification.

The argument has two variants.
  1. Some logic is so essential to interpreting the contents of the database that it cannot meaningfully be packaged any other way.
    This makes the claim that all sharable programming technology (Java packages, Python modules, etc.) simply don't work, and the database is the only effective way to share code.
  2. Some logic is so essential to correct status of the database, that no application developer can be trusted to touch it.
    This presumes that application developers are willing to cut corners and break rules and force bad data into an otherwise pristine database. Data integrity problems come from those "other" developers. The DBA's can't trust anyone except the stored procedure author.
When confronted with other ways to share logic, the stored procedure folks fall back on "faster" or possibly the "Us vs. Them" nature of the second variant.

Stored Procedure Consequences

Stored procedures really are code. They should not be separated from the rest of the code base.

Stored procedures are maintained with different tools and through different organizations and processes. This leads to conflict and confusion.

It can also lead to weird secrecy.

A stored procedure can be difficult to extract from the database. It may require privileges and help from DBA's to locate the unencrypted original source text.

In a huge organization, it can take weeks to track down the right DBA to reveal the content of the stored procedure.

Why the secrecy?

Once exposed, of course, the stored procedure can then be rewritten as proper code, eliminating the stored procedure.

The proper question to ask is "Why is critical business knowledge encoded in so many different places?" Why not just application code? Why also try to encode some knowledge in database stored procedures? How does this bifurcation help make the origination more efficient?