
Thursday, July 25, 2013

Database Conversion or Schema Migration

I was told a horror story about a bad database schema migration. Let's call the author of the horror story HPL.

HPL related a tale of woeful conversion problems from a mismanaged schema migration.

While I could feel HPL's pain, the reasons they gave for it were wrong. They hadn't quite absorbed the lessons of the experience. Consequently, HPL sounded like someone doomed to repeat the mistake, or—worse—to unlearn the lessons entirely.

Here's HPL's most distressing comment.
"we can't migrate over the weekend and be done w/ it."
Apparently, the horror of a weekend migration was somehow desirable to HPL. Who wants a lost weekend? And who wants to put all of the eggs in a single basket?

Anyone who's done more than one "lost weekend migration"—and who's also honest—knows that they don't go well. There are always subsets of data that (a) don't get converted properly and (b) have to get swept under the carpet in order to claim to meet the schedule.

It's a standard situation to have less than 100% of the data successfully converted and still call the effort a success. If 100% was not required, why lose a weekend over it?

Good Plans and Bad Plans

From far wiser people than me, I learned an important lesson in schema migration.

These Wiser Heads ran a "conversion" business. They moved data and applications from platform to platform. They knew a lot about database schema migrations. A lot.

Their standard plan was to build a schema migration script (usually a sequence of apps) that could be run to convert the database (or files or whatever) from old to new schema as often as was necessary.

I'll repeat that.

As often as was necessary.

They debugged the script to get to an acceptable level of conversion. The data conversion (or schema migration) was perfectly repeatable. Of course, they longed for 100% conversion; but pragmatically, the legacy software had bad data. So some fraction would not convert. And once that fraction was found, the schema migration applications could be modified to treat the non-convertible data in some intelligent way.
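
A minimal sketch of what such a repeatable conversion pass might look like, assuming a CSV extract of the legacy data; the file layout, column names and convert_row() mapping are hypothetical, not the Wiser Heads' actual tooling.

import csv
import logging

logger = logging.getLogger("conversion")

def convert_row(row):
    """Map one legacy row to the new schema; raise ValueError on bad data."""
    return {
        "id": int(row["LEGACY_ID"]),
        "name": row["CUST_NAME"].strip().title(),
    }

def run_conversion(legacy_path, new_path):
    """One complete pass from legacy to new. Running it again simply
    replaces the previous output, so it can be run as often as necessary."""
    good, bad = 0, 0
    with open(legacy_path, newline="") as source, \
         open(new_path, "w", newline="") as target:
        reader = csv.DictReader(source)
        writer = csv.DictWriter(target, fieldnames=["id", "name"])
        writer.writeheader()
        for row in reader:
            try:
                writer.writerow(convert_row(row))
                good += 1
            except (ValueError, KeyError) as problem:
                # The non-convertible fraction: log it, refine the script, re-run.
                logger.warning("rejected %r: %s", row, problem)
                bad += 1
    return good, bad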

Their stated goal was to convert data and run parallel testing with that converted data as often as necessary to create confidence that the new data was as perfect a conversion as was possible. At some point, the confidence became certainty and the parallel testing was deemed complete. Since they were parallel testing with live data, the decision amounted to a formalized "commissioning" of the new application. And by then, the new application was already being used.

There are bad ways to do schema migration, of course. HPL listed many.

Horrible Mistakes

The horror story from HPL included this:
"For the migrated tables, create views in the old system and create instead of triggers on those views to ship data to the new system."
It appears that they used views and triggers to create a new-system "facade" over the legacy system. Apparently, they wanted both suites of application software to coexist. Not a good approach to schema migration. It appears that they were trying to share one database between two application schemas.

This seems like it's doomed. Unless they're all geniuses.

Wiser Heads would have strongly suggested that the new system use an extract of the old system's data.

HPL goes on to complain,
"Sometimes we can take over a column or 2 and sometimes we can only take over some of the data in the table".  
HPL emphasizes this point with "This is not that far fetched". I'm not sure why the emphasis was needed.

This is not "far fetched". It doesn't need emphasis. It's not really much of a problem, either. It's a standard part of schema migration. Extracting a copy of the data makes this quite easy. Triggers and views to create some kind of active SQL-based Facade is what created the complexity. Not the number of columns involved.

HPL summarizes,
"So you end up w/ [many] tables/views triggers all moving data back and forth and faking stuff out"
Back and forth. A fundamental mistake. A copy can be much easier to manage. One-way data movement: Legacy to New.

HPL concludes with a litany of errors of various types: performance, change management, file system issues, error logging and auditing. Blah blah blah. Yes, it was a nightmare. I feel their pain.

What About Coexistence? 

It appears that HPL was involved in a project where the entire old and new applications were supposed to somehow coexist during the conversion.

It appeared that they failed to do any kind of partitioning.

Coexistence is not a trivial exercise. Nor is it a monolith where the entire legacy application suite must coexist with the entire new schema and the entire new application suite.

Pragmatically, coexistence usually means that some portion of the legacy must be kept running while some other portion is modernized. This means that coexistence requires the application portfolio to be partitioned.

Step 1: Some suite of functionality is migrated. That means data from the legacy database/file system is copied to the new one. That also means some data from the new system is copied back into the legacy database/file system. Copied.

Step 2: Some other suite of functionality is migrated. As functionality is moved, less and less data is copied back to the legacy.

At some point, this copying back is of no value and can be discontinued.
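
Here is a minimal sketch of that one-way copy-back idea, using sqlite3 and made-up table names as stand-ins for the legacy and new databases; the schema and names are purely illustrative.

import sqlite3

def copy_orders_back(new_db, legacy_db):
    """Copy converted orders from the new schema back into the legacy
    schema so unmigrated legacy functions keep working. One direction
    per partition; no triggers, no views, no back-and-forth."""
    rows = new_db.execute(
        "SELECT order_id, customer_id, total FROM new_orders"
    ).fetchall()
    legacy_db.executemany(
        "INSERT OR REPLACE INTO legacy_orders (ord_no, cust_no, amt) VALUES (?, ?, ?)",
        rows,
    )
    legacy_db.commit()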

What About Timing?

This copying clearly requires some coordination. It's not done haphazardly.

Does it require "real time" data movement? i.e. triggers and views?

Rarely is real time movement required. This is the point behind partitioning wisely. Partitioning includes timing considerations as well as data quality and functionality considerations.

It's remotely possible that timing and partitioning are so pathological that data is required in both legacy and new applications concurrently. This is no reason to throw the baby out with the bathwater. This is nothing more than an indication that the data is being copied back to the legacy application close to real time.

This also means performance must be part of the test plan. As well as error handling and diagnostic logging. None of this is particularly difficult. It simply requires care.

Lessons Learned

HPL appeared to make the claim that schema migration is super hard. Or maybe that coexistence is really hard.

Worse, HPL's horror story may be designed to indicate that a horrifying lost weekend is the only way to do schema migration.

Any or all of these are the wrong lessons to learn.

I think there are several more valuable lessons here.

  1. Schema migration can and should be done incrementally. It's ideally tackled as an Agile project using Scrum techniques. It's okay to have release cycles that are just days apart as each phase of the conversion is run in parallel and tested to the user's satisfaction.
  2. Coexistence requires partitioning to copy any data back to unconverted legacy components. Triggers and views and coexistence of entire suites of software make a difficult problem harder.
  3. The conversion script is just another first-class application. The same quality features apply to the conversion as to every other component of the app suite.
  4. The conversion must be trivially repeatable. It must be the kind of thing that can be run as often as necessary to move legacy data to the new schema. 

Tuesday, July 23, 2013

Almost a good idea

AppleWorks (formerly ClarisWorks) is software that's been dead meat since 2007.

See http://en.wikipedia.org/wiki/AppleWorks#End_of_Appleworks

Which is fine unless you have an old computer with old applications that still works. For example, a 2002-vintage iMac G4 http://www.imachistory.com/2002/ still works. Slowly.

When someone jumps 11 years to a new iMac, they find that their 2002 iMac with 2007 apps has files which are essentially unreadable by modern applications.

How can someone jump a decade and preserve their content?

1.  iWork Pages is cheap. Really. $19.99.  I could have used it to convert their files to their new iMac and then told them to ignore the app. Pages can be hard to learn. For someone jumping from 2007-vintage apps, it's probably too much. However, they can use TextEdit once the files are converted to RTF format.
2.  iWork for iCloud may be a better idea. But they have to wait a while for it to come out. And they want their files now.
3.  Try to write a data extractor.

Here are some places to start.
Documentation on the AppleWorks file format does not seem to exist. It's a very weird void, utterly bereft of information.

In the long run $19.99 for a throw-away copy of Pages is probably the smartest solution.

However, if you're perhaps deranged, you can track down the content through a simple brute-force analysis of the file. This is Python3 code to scrape the content out of a .CWK file.

import argparse
import struct
import sys
import os
from io import open

class CWK:
    """Analyzes a .CWK file; must be given a file opened in "rb" mode.
    """
    DSET = b"DSET"
    BOBO = b"BOBO"
    ETBL = b"ETBL"

    def __init__( self, open_file ):
        self.the_file= open_file
        self.data= open_file.read()

    def header( self ):
        self.version= self.data[0:4]
        #print( self.version[:3] )
        bobo= self.data[4:8]
        assert bobo == self.BOBO
        version_prev= self.data[8:12]
        #print( version_prev[:3] )
        return self.version

    def margins( self ):
        self.height_page= struct.unpack( ">h", self.data[30:32] )
        self.width_page= struct.unpack( ">h", self.data[32:34] )
        self.margin_1= struct.unpack( ">h", self.data[34:36] )
        self.margin_2= struct.unpack( ">h", self.data[36:38] )
        self.margin_3= struct.unpack( ">h", self.data[38:40] )
        self.margin_4= struct.unpack( ">h", self.data[40:42] )
        self.margin_5= struct.unpack( ">h", self.data[42:44] )
        self.margin_6= struct.unpack( ">h", self.data[44:46] )
        self.height_page_inner= struct.unpack( ">h", self.data[46:48] )
        self.width_page_inner= struct.unpack( ">h", self.data[48:50] )

    def dset_iter( self ):
        """First DSET appears to have content.

        This DSET parsing may not be completely correct. 
        But it finds the first DSET, which includes all
        of the content except for headers and footers.

        It seems wrong to simply search for DSET; some part of the
        resource directory should point to this or provide an offset to it.
        """
        for i in range(len(self.data)-4):
            if self.data[i:i+4] == self.DSET:
                    #print( "DSET", i, hex(i) )
                    pos= i+4
                    for b in range(5): # Really? Always 5?
                        size, count= struct.unpack( ">Ih", self.data[pos:pos+6] )
                        pos += size+4
                    #print( self.data[i:pos] )
                    yield pos
    def content_iter( self, position ):
        """A given DSET may have multiple contiguous blocks of text."""
        done= False
        while not done:
            size= struct.unpack( ">I", self.data[position:position+4] )[0]
            content= self.data[position+4:position+4+size].decode("MacRoman")
            #print( "ENDING", repr(self.data[position+4+size-1]) )
            if self.data[position+4+size-1] == 0:
                yield content[:-1]
                done= True
                break
            else:
                yield content
                position += size+4

The function invoked from the command line is this.

def convert( *file_list ):
    for f in file_list:
        base, ext = os.path.splitext( f )
        new_file= base+".txt"
        print( '"Converting {0} to {1}"'.format(f,new_file) )
        with open(f,'rb') as source:
            cwk= CWK( source )
            cwk.header()
            with open(new_file,'w',encoding="MacRoman") as target:
                position = next( cwk.dset_iter() )
                for content in cwk.content_iter(position):
                    # print( content.encode("ASCII",errors="backslashreplace") )
                    target.write( content )
        atime, mtime = os.path.getatime(f), os.path.getmtime(f)
        os.utime( new_file, (atime,mtime) )
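
The post imports argparse but doesn't show the command-line wiring; one plausible way to hook convert() up would be something like this (a guess, not the original code).

def main():
    # Hypothetical argparse wiring for the convert() function above.
    parser = argparse.ArgumentParser(description="Scrape text content from .CWK files")
    parser.add_argument("files", nargs="+", help=".cwk files to convert")
    args = parser.parse_args()
    convert(*args.files)

if __name__ == "__main__":
    main()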


This is brute-force. But. It seemed to work. Buying Pages would have been less work and probably produced better results.

This does have the advantage of producing files with the original date stamps.  Other than that, it seems an exercise in futility because there's so little documentation.

What's potentially cool about this is the sane way that Python3 handles bytes as input. Particularly pleasant is the way we can transform the file-system sequence of bytes into proper Python strings with a very simple bytes.decode().
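
For instance, a tiny illustration of that decode step (the byte values here are just an example of MacRoman-encoded text, not taken from a real .CWK file):

raw = b"r\x8esum\x8e"          # bytes as they sit in the old file
text = raw.decode("MacRoman")  # -> 'résumé'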

Thursday, July 18, 2013

NoSQL Befuddlement: DML and Persistence

It may be helpful to look back at 'How Managers Say "No"' which is about breaking the RDBMS Hegemony.

I got an email in which the simple concepts of "data manipulation" and "persistence" had become entangled with SQL DML to a degree that the conversation failed to make sense to me.

They had been studying Pandas and had started to realize that the RDBMS and SQL were not an essential feature of all data processing software.

I'll repeat that with some emphasis to show what I found alarming.
They had started to realize that the RDBMS and SQL were not an essential feature of all data processing.
Started to realize.

They were so entrenched in RDBMS thinking that the very idea of persistent data outside the RDBMS was novel to them.

They asked me about extending their growing realization to encompass the other SQL DML operations: INSERT, UPDATE and DELETE. Clearly, those four verbs (SELECT plus the three they asked about) were all the data manipulation they could conceive of.

This request meant several things, all of which are unnerving.
  1. They were sure—absolutely sure—that SQL DML was essential for all persistent data. Could they really not conceive of read-only data? After all, a tool like Pandas is clearly focused on read-only processing. What part of that was confusing to them?
  2. They couldn't discuss persistence outside the narrow framework of SQL DML. It appears that they had forgotten about the file system entirely.
  3. They conflated data manipulation and persistence, seeing them as one thing.
After some back-and-forth it appeared that they were looking for something so strange that I was unable to proceed. We'll turn to this, below.

Persistence and Manipulation

We have lots of persistent data and lots of manipulation. Lots. So many that it's hard to understand what they were asking for.

Here are some places to start looking for hints on persistence.

http://docs.python.org/3/library/persistence.html

http://docs.python.org/3/library/archiving.html

http://docs.python.org/3/library/fileformats.html

http://docs.python.org/3/library/netdata.html

http://docs.python.org/3/library/markup.html

http://docs.python.org/3/library/mm.html

This list might provide some utterly random hints as to how persistent data is processed outside of the narrow confines of the RDBMS.

For manipulation... Well... Almost the entire Python library is about data manipulation. Everything except itertools is about stateful objects and how to change state ("manipulate the data.")

Since the above lists are random, I asked them for any hint as to what their proper use cases might be. It's very difficult to provide generic hand-waving answers to questions about concepts as fundamental as state and persistence. State and persistence pervade all of data processing. Failure to grasp the idea of persistence outside the database almost seems like a failure to grasp persistence in the first place.

The Crazy Request

Their original request was—to me—incomprehensible. As far as I can tell, they appeared to want the following.

I'm guessing they were hoping for some kind of matrix showing how DML or CRUD mapped to other non-RDBMS persistence libraries.

So, it would be something like this.

SQL                        | OS                        | Pandas                      | JSON                                | CSV
CREATE                     | file()                    | some pandas request         | json.dump()                         | csv.writer()
INSERT                     | file.write()              | depends on the requirements | could be anything                   | csv.writerow()
UPDATE                     | file.seek(); file.write() | doesn't make sense          | not something that generalizes well | depends on the requirements
DELETE                     | file.seek(); file.write() | inappropriate for analysis  | depends on the requirements         | hard to make this up without more details
APPEND -- not part of DML  | file.write()              | depends on requirements     | could be anything                   | csv.writerow()

The point here is that data manipulation, state and persistence are intimately tied to the application's requirements and processing.

All of which presumes you are persisting stateful objects. It is entirely possible that you're persisting immutable objects, and state change comes from appending new relationships, not changing any objects.
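
As a sketch of that idea, here's append-only persistence with one JSON document per line; the event layout and helper names are invented for illustration.

import json

def append_event(path, event):
    """Persist an immutable event by appending it; nothing is ever updated in place."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")

def current_state(path, key):
    """Derive the current state of one object by folding its event history."""
    state = {}
    with open(path, encoding="utf-8") as log:
        for line in log:
            event = json.loads(line)
            if event["key"] == key:
                state.update(event["changes"])
    return state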

The SQL reductionist view isn't really all that helpful. Indeed, it appears to have been deeply misleading.

The Log File

Here's an example that seems to completely violate the spirit of their request. This is ordinary processing that doesn't fit the SQL DML mold very well at all.

Let's look at log file processing.
  1. Logs can be persisted as simple files in simple directories. Compressed archives are even better than simple files.
  2. For DML, a log file is append-only. There is no insert, update or delete.
  3. For retrieval, a query-like algorithm can be elegantly simple. 
Without any brain-cramping, one can create simple map-reduce style processing for log files. See "Map Reduce -- How Cool is that?" for a snippet of Python code that turns each row of an Apache log file into a record-like tuple. It also shows how to scan multiple files and directories in simple map-reduce style loops.
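
In that spirit, the directory- and file-scanning loops might look something like this (a reconstruction assuming the logs sit under a logs/ directory and that some are gzipped; not the code from the referenced post):

import fnmatch
import gzip
import os

def log_files(top="logs", pattern="*.log*"):
    """Map over the directory tree, yielding the names of the log files."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)

def log_rows(names):
    """Map over the files, yielding individual rows, transparently
    handling compressed archives."""
    for name in names:
        opener = gzip.open if name.endswith(".gz") else open
        with opener(name, "rt", encoding="ascii", errors="replace") as source:
            for row in source:
                yield row

The rows can then be fed to a row parser like the log_reader() shown later in this post.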

Interestingly, we would probably lose considerable performance if we tried to load a log file into an RDBMS table. Why? The RDBMS file for a table that represents a given log file is much, much larger than the original file. Reading a log file directly involves far fewer physical I/O operations than reading the table.

Here's something that I can't answer for them without digging into their requirements.

A "filter" could be considered as a DELETE.  Or a DELETE can be used to implement a filter. Indeed, the SQL DELETE may work by changing a row's status, meaning the the SQL DELETE operation is actually a filter that rejects deleted records from future queries.

Which is it? Filter or Delete? This little conundrum seems to violate the spirit of their request, also.

Python Code

Here's an example of using persistence to filter the "raw" log files. We keep the relevant events and write these in a more regular, easier-to-parse format. Or, perhaps, we delete the irrelevant records. In this case, we'll use a CSV file (with quotes and commas) to speed up future parsing.

We might have something like this:


import re
import csv

log_row_pat= re.compile( r'(\d+\.\d+\.\d+\.\d+) (\S+?) (\S+?) (\[[^\]]+?]) ("[^"]*?") (\S+?) (\S+?) ("[^"]*?") ("[^"]*?")' )

def log_reader( row_source ):
    """Yield the fields of each row that matches the Apache log pattern."""
    for row in row_source:
        m= log_row_pat.match( row )
        if m is not None:
            yield m.groups()

def some_filter( source ):
    """Keep only the relevant rows; some_condition() stands for whatever
    business rule defines "relevant"."""
    for row in source:
        if some_condition(row):
            yield row

# subset_file and source_file are the output and input file names.
with open( subset_file, "w", newline="" ) as target:
    with open( source_file ) as source:
        rdr= log_reader( source )
        wtr= csv.writer( target )
        wtr.writerows( some_filter( rdr ) )

This is amazingly fast and very simple. It uses minimal memory and results in a subset file that can be used for further analysis.

Is the filter operation really a DELETE?

This should not be new; it should not even be interesting.

As far as I can tell, they were asking me to show them how data processing can be done outside a relational database. This seems obvious beyond repeating. Obvious to the point where it's hard to imagine what knowledge gap needs to be filled.

Conclusion

Persistence is not a thing you haphazardly laminate onto an application as an afterthought.

Data Manipulation is not a reductionist thing that has exactly four verbs and no more.

Persistence—like security, auditability, testability, maintainability—and all the quality attributes—is not a checklist item that you install or decline.

Without tangible, specific use cases, it's impossible to engage in general hand-waving about data manipulation and persistence. The answers don't generalize well and depend in a very specific way on the nature of the problem and the use cases.

Tuesday, July 16, 2013

How Managers Say "No": The RDBMS Hegemony Example

Got an email looking for help in attempting to break through the RDBMS Hegemony. It's a little confusing, but this is the important part of how management says "no".
"Their response was nice but can you flush [sic] it out more"
[First: the word is "flesh": "flesh it out." Repeat after me: "Flesh it out," "Flesh it out," "Flesh it out." Flesh. Put flesh on the bones. No wonder your presentation went nowhere, either you or the manager or both need help. English as a second language is only an excuse if you never read anything in English.]

There's a specific suggestion for this "more". But it indicates a profound failure to grasp the true nature of the problem. It amounts to a drowning person asking us to throw them a different colored brick. It's a brick! You want a life preserver! "No," they insist, "I want a brick to build steps to climb out."

Yes, RDBMS Hegemony is a real problem. I've talked about it before "Hadoop and SQL/Relational Hegemony". Others have noted it: "NoSQL and NewSQL overturning the relational database hegemony". You can read more concrete details in articles like this: "Introduction to Non-Relational Data Storage using Hbase".

RDBMS Hegemony is most visible when every single in-house project seems to involve the database. And some of those uses of the database are clearly inappropriate.

For example, trying to mash relatively free-form "documents" into an RDBMS is simple craziness. Documents—you know, the stuff created by word processors—are largely unstructured or at best semi-structured. For most RDBMS's, they're represented as Binary Large Objects (BLOBs). To make it possible to process them, you can decorate each document with "metadata" or tags and populate a bunch of RDBMS attributes. Which is fine for the first few queries. Then you realize you need more metadata. Then you need more flexible metadata. Then you need interrelated metadata to properly reflect the interrelationships among the documents. Maybe you flirt with a formal ontology. Then you eventually realize you really should have started with document storage, not a BLOB in an RDBMS.

Yes, some companies offer combo products that do both. The point is this: avoiding the RDBMS pitfall in the first place would have been a big time and money saver. Google Exists. The RDBMS is not the best choice for all problems.

The problem is this:
  • Getting away from RDBMS Hegemony requires management thinking and action.
  • Management thinking is a form of pain.
  • Management action is a form of pain. 
  • Managers hate pain.
In short, the only way to make progress away from the RDBMS is to create or expose existing pain. Or make it possible for the manager to avoid pain entirely.

Let's look at the various approaches.

Doing A "Presentation"

The email hinted at a conversation or presentation on the problem of RDBMS Hegemony. 
"I finally convinced my current client that RDBMS's are expensive in terms of adding another layer to the archtiecture [sic] and then trying to maintain it."
It's not clear from the email what the details of this conversation or presentation were, but it clearly involved the two key technical points (1) the RDBMS has specific use cases, and (2) not all applications fit those use cases.

However. Those two key technical points involve no real management pain.

Real pain comes from cost. And since the RDBMS license is usually site-wide, there's no obvious cost to the technology.

The labor cost for DBA support, similarly, is site-wide and already in the budget. So there's no obvious cost to the labor.

No cost means no pain. No pain means no change.

Asking a manager to think, however, causes actual pain. Managers want technical people to do the thinking for them.

Asking a manager to consider the future means they may have to take action in the future. That's potential pain. 

Either way, a management presentation on database hegemony is pure pain. No useful action will ever come from a simple, direct encapsulation of how the RDBMS is not really the universal data tool. Management said "no" by asking for more information.

We'll return to the "more information" part below.

It was good to start the conversation.

It's good to continue the conversation. But the specific request was silliness.

Exposing the Existing Pain

What's more important than a hypothetical conversation is showing how the RDBMS is causing pain right now. It's easier to convince managers of the hidden cost of the RDBMS by exposing existing actual pain in the current environment. And it has to be a level of pain that exceeds the pain of thinking and taking action.

What's most clear is a specific and avoidable labor cost. Ideally, this specific—and avoidable—labor cost will obviously be associated with something obviously database-related. It must be obvious or it won't yield a technology-related management understanding. If it's not obvious, management will say "no", by asking for more data; they'll claim it's people or process or measurement error.

The best place to look for avoidable labor is break-fix problem reports, bugs and enhancements. Another good source of avoidable cost is schema migration: waiting for the DBAs to add columns to a table, or add tables to a database.

If you can point to specific trouble tickets that come from wrong use of an RDBMS, then you might be able to get a manager to think about it.

The Airtight Case

Your goal on breaking RDBMS Hegemony is to have a case that is "airtight". Ideally, so airtight that the manager in question sits up, takes notice, and demands that a project be created to rip out the database and save the company all that cost. Ideally, their action at the end of the presentation is to ask how long it will take to realize the savings.

Ideally.

It is actually pretty easy to make an airtight case. There are often a lot of trouble tickets and project delays due to overuse and misuse of the RDBMS.

However.

Few managers will actually agree to remove the RDBMS from an application that's limping along. Your case may be airtight, and compelling, and backed with solid financials, but that's rarely going to result in actual action.

"If it ain't broke, don't fix it," is often applied to projects with very high thresholds for broken. Very high.

This is another way management says "no". By claiming that the costs are acceptable or the risk of change is unacceptable. Even more farcical claims will often be made in favor of the status quo. They may ask for more cost data, but it's just an elaborate "no".

It's important to make the airtight case.

It's important to accept the "no" gracefully.

Management Rewards

When you look at the management reward structure, project managers and their ilk are happiest when they have a backlog of huge, long-running projects that involve no thinking and no action. Giant development efforts with stable requirements, unchallenging users, mature technology and staff who don't mind multiple-hour status meetings.

A manager with a huge long-running project feels valuable. When the requirements, people and technology are stable, then thinking is effectively prevented.

Suggesting that technology choices are not stable introduces thinking. Thinking is pain. The first response to pain is "no". Usually in the form of "get more data."

Making a technology choice may require that a manager facilitate a conversation which selects among competing technology choices. That involves action. And possible thinking.

Real Management Pain. The response? Some form of "no".

Worse. (And it does get worse.)

Technology selection often becomes highly political. The out-of-favor project managers won't get projects approved because of "risky technology." More Management Pain.

War story. Years ago, I watched the Big Strategic Initiative shot down in flames because it didn't have OS/370 as the platform. The "HIPPO" (Highest Paid Person's Opinion) was that Unix was "too new" and that meant risk. Unix predates OS/370 by many years. When it comes to politics, facts are secondary.

Since no manager wants to think about potential future pain, no manager is going to look outside the box. Indeed, they're often unwilling to look at the edge of the box. The worst are unwilling to admit there is a box.

The "risk" claim is usually used to say "no" to new technology. Or. To say "no" to going back to existing, well-established technology. Switching from database BLOBs to the underlying OS file system can turn into a bizzaro-world conversation where management is sure that the underlying OS file system is somehow less trustworthy than RDBMS BLOBs. The idea that the RDBMS is using the underlying file system for persistence isn't a compelling argument.

It's important to challenge technology choices for every new project every time.

It's necessary to accept the "no" gracefully.

The "stop using the database for everything" idea takes a while to sink in.

Proof Of Concept

The only way to avoid management pain (and the inaction that comes from pain avoidance) is to make the technology choice a fait accompli.

You have to actually build something that actually works and passes unit tests and everything.

Once you have something which works, the RDBMS "question" will have been answered. But—and this is very important—it will involve no management thought or action. By avoiding pain, you also default into a kind of management buy-in.

War Story

The vendors send us archives of spreadsheets. (Really.) We could unpack them and load them into the RDBMS. But. Sadly. The spreadsheets aren't consistent. We either have a constant schema migration problem, adding yet another column for each spreadsheet, or we have to get rid of the RDBMS notion of a maximalist schema. We don't want the schema to be an "at most" definition; we'd need the schema to be an "at least" that tolerates irregularity.

It turns out that the RDBMS is utterly useless anyway. We're barely using any SQL features. The vendor data is read-only. We can't UPDATE, INSERT or DELETE under any circumstances. The delete action is really a ROLLBACK when we reject their file and a CREATE when they send us a new one.

We're not using any RDBMS features, either. We're not using long-running locks for our transactions; we're using low-level OS locks when creating and removing files. We're not auditing database actions; we're doing our own application logging on several levels.

All that's left are backups and restores. File system backups and restores. It turns out that a simple directory tree handles the vendor-supplied spreadsheet issue gracefully. No RDBMS used.

We had—of course—originally designed a lot of fancy RDBMS tables for loading up the vendor-supplied spreadsheets. Until we were confronted with reality and the inconsistent data formats.

We quietly stopped using the RDBMS for the vendor-supplied data. We wrote some libraries to read the spreadsheets directly. We wrote application code that had methods with names like "query" and "select" and "fetch" to give a SQL-like feel to the code.
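
A guess at the shape of that SQL-flavored facade, using CSV extracts as a stand-in for the vendor spreadsheets; the class and method names are illustrative, not the actual library.

import csv

class SheetTable:
    """Read-only, SQL-flavored access to one vendor-supplied extract."""
    def __init__(self, path):
        self.path = path

    def select(self, **criteria):
        """Yield the rows that match all of the column=value criteria."""
        with open(self.path, newline="") as source:
            for row in csv.DictReader(source):
                if all(row.get(col) == value for col, value in criteria.items()):
                    yield row

    def fetch(self, **criteria):
        """Return the first matching row, or None."""
        return next(self.select(**criteria), None)

# Hypothetical usage:
# orders = SheetTable("vendor/2013/orders.csv")
# row = orders.fetch(order_id="12345")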

Management didn't need to say "no" by asking for more information. They couldn't say no because (a) it was the right thing to do and (b) it was already done. It was cheaper to do it than to talk about doing it.

Failure To See The Problem

The original email continued to say this:
"how you can achieve RDBMS like behavior w/out an actual RDBMS"
What? Or perhaps: Why?

If you need RDBMS-like behavior, then you need an RDBMS. That request makes precious little sense as written. So. Let's dig around in the email for context clues to see what they really meant.
"consider limting [sic] it to
1) CREATE TABLE
2) INSERT
3) UPDATE
    An update requires a unique key. Let's limit the key to contain only 1 column.
4) DELETE
    A delete requires a unique key. Let's limit the key to contain only 1 column."
Oh. Apparently they really are totally fixated on SQL DML.

It appears that they're unable to conceive of anything outside the SQL DML box.

As noted in the above example, INSERT, UPDATE and DELETE are not generic, universal, always-present use cases. For a fairly broad number of "big data" applications, they're not really part of the problem.

The idea that SQL DML CRUD processing forms a core or foundational set of generic, universal, always-present use cases is part of their conceptual confusion. They're deeply inside the SQL box wondering how they can get rid of SQL.

Back to the drowning person metaphor. 

It's actually not like a drowning person asking for a different colored brick because they're building steps to walk out.

It's like a person who fell face down in a puddle claiming they're drowning in the first place. The brick vs. life preserver question isn't relevant. They need to stand up and look around. They're not drowning. They're not even in very deep water.

They've been lying face-down in the puddle so long, they think it's as wide as the ocean and as deep as a well. They've been down so long it looks like up.

Outside the SQL Box

To get outside the SQL box means to actually stop using SQL even for metaphoric conversations about data manipulation, persistence, transactions, auditing, security and anything that seems relevant to data processing.

To FLESH OUT ["flesh", the word is "flesh"] the conversation on breaking the SQL Hegemony, you can't use hypothetical hand-waving. You need tangible real-world requirements. You need something concrete, finite and specific so that you can have a head-to-head benchmark shootout (in principle) between an RDBMS and something not an RDBMS.

You may never actually build the RDBMS version for comparison. But you need to create good logging and measurement hooks around your first noSQL application. The kind of logging and measurement you'd use for a benchmark. The kind of logging and measurement that will prove it actually works outside the RDBMS. And it works well: reliably and inexpensively. 

This is entirely about asking for forgiveness instead of asking for permission.  

Managers can't give permission, it involves too much pain.

They can offer forgiveness because it requires neither thinking nor action.

Tuesday, July 9, 2013

HamCalc Quirk of the Week

The HamCalc program binhop is one of those little nuggets of beauty that might be helpful or might be useless. Or. Perhaps there's some useful stuff commingled with quirky stuff.

For folks in agriculture or manufacturing, I'm hoping that the calculation could be helpful. Although it's also likely that folks don't spend much time designing hoppered bins; they just buy something out of a catalog.

The binhop program is 151 lines long. Of those lines, 13 appear to be orphaned code that doesn't belong there. They look like part of a hoppered bin design calculation that was split off into a separate program, binvol.

It's easy to grep for all 46 instances of (?:GOTO|GOSUB|THEN)\s+\d\d\d0 to locate references to line numbers. We're only interested in line numbers from 1050 to 1170. Of course, there are none.
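
Here's a small sketch of that grep in Python, scanning the source for GOTO/GOSUB/THEN targets and flagging any that land in the orphaned 1050-1170 block; the file name is hypothetical.

import re

target_pat = re.compile(r"(?:GOTO|GOSUB|THEN)\s+(\d+)")

def orphan_references(path, low=1050, high=1170):
    """Yield source lines whose jump targets fall inside the orphaned range."""
    with open(path, errors="replace") as source:
        for line in source:
            for match in target_pat.finditer(line):
                if low <= int(match.group(1)) <= high:
                    yield line.rstrip()

# list(orphan_references("binhop.bas")) comes back empty: nothing jumps into the orphaned block.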

But. 

This is GW-Basic. Maybe there's a language or implementation quirk that makes this code somehow get executed? It seems highly doubtful. After all, it's only 151 lines of code. It's relatively easy to read and understand what's there. 

This kind of quirk demonstrates that an "automated" code conversion is rarely going to be helpful. An automated conversion of orphaned code means that the good stuff is diluted by the bad stuff.

I spent most of my time with this program fumbling through the alternative use cases to see what it does. It doesn't do much. But it's important to be sure that this calculation isn't part of the use cases. It doesn't get a unit test.

More Oddness

This program has another cute quirk that is less brain-scrambling than orphaned code.


1020 :REM'
1030 IF C$="SIDE" THEN N=ATN(H/(F-D))*180/PI:RETURN
1040 IF C$="CENTER" THEN N=ATN(2*H/(F-D))*180/PI:RETURN

What if C$ is neither SIDE nor CENTER? There's no "otherwise" case expressed or implied. It just "falls through" to the next line of code.

In this case, the next line of code just happens to be the orphaned code starting on line 1050. This will do some calculations on variables which merely have their GW-Basic default values of 0. Since line 1170 ends with a RETURN, the program will appear to "work". It won't crash outright. It just executes a bunch of useless statements.

In other programs with similar structure, the following line of code is a RETURN (or a STOP or even an END in one case.)

Since C$ is only referenced in five lines of code, it's easy to be certain that it can only have one of the two values. Of the five references, one is a PRINT statement. Two are the above IF statements. The other two are assignment statements. 

Here are the two assignment statements:

650 Y$=INKEY$:IF Y$=""THEN 650
660 IF Y$="1"THEN GOSUB 1340:C$="SIDE":R=1:RETURN
670 IF Y$="2"THEN GOSUB 1340:C$="CENTER":R=0:RETURN
680 BEEP
690 GOTO 650

It's clear that C$ is restricted to a domain of just two values. The lack of a formal "otherwise" case in the 1020-1040 block of code is just a weird little quirk. 

Implementation


By adding documentation and test cases, we've bloated the good bits up to 336 lines of code.


This reflects two decisions. It seemed sensible to (nearly) duplicate some similar blocks of code rather than try to use a single block of code peppered with IF-statements. IF-statements raise the cyclomatic complexity. Lines of code just make it longer. Also, we've combined all of binvol into binhop.

In the long run, I know nothing about the subject matter. Nor did I even do the minimal amount of online research to confirm the formulae in the program. I'm secretly hoping that someone who actually understands this subject area will revise and correct the code to make it more useful and complete.