Data warehouse processing can involve a fair amount of "big batch" programs. Loading 40,000 rows of econometric data in a single swoop, updating dimensions and loading facts, for example.
When you get data from customers and vendors, you have endless file-format problems. To assure that things will work, each of these big batch programs has at least two operating modes.
- Validate. Go through all the motions. Except. Don't commit any changes to the database; don't make any filesystem changes. (i.e., write the new files, but don't do the final renames to make the files current.)
- Load. Go through all the motions including a complete commit to the database and any filesystem changes.
What's the difference between the two modes? Clearly, one is a subclass of the other.
- Load can be the superclass. The Validate subclass simply replaces the save methods stubs that do nothing.
- Validate can be the superclass. The Load subclass simply implements the save method stubs with methods that do something useful.
What Doesn't Work
This design has a smell. The smell is that we can't easily extend the overall processing to include an additional feature.
This design has the persistence feature set as the inheritance axis or dimension. This is kind of limited. We really want a different feature set for inheritance.
Consider a Validate for two dimensions (Company and Time) that loads econometric facts. It has stub "save" methods.
We subclass the Validate to create the proper Load for these two dimensions and one fact. We replace the stub save methods with proper database commits.
After the actuaries think for a while, suddenly we have a file which includes an additional dimension (i.e., business location) or an additional fact (i.e., econometric data at a different level of granularity). What now? If we subclass Validate to add the dimension or fact, we have a problem. We have to repeat the Load subclass methods for the new, extended Load. Oops.
If we subclass Load to add the dimension or fact, we have a problem. We have to repeat the Validate stubs in the new extended Load to make it into a Validate. Oops.
It's difficult to predict inheritance vs. delegation design problems.
The hand-waving advice is to consider the essential features of the object. This isn't too helpful. Often, we're so focused on the database design that persistence seems essential.
Experience shows, however, that some things are not essential. Persistence, for example, is one of those things that should always be delegated.
Another thing that should always be delegated is the more general problem of representation: JSON, XML, etc., should rely on delegation since this is never essential. There's always another representation for data. Representation is always independent of the object's essential internal state changes.
In my case, I've got about a dozen implementations using a clunky inheritance that had some copy-and-paste programming. Oops.
I'm trying to reduce that technical debt by rewriting each to be a proper delegation. With good unit test coverage, there's no real technical risk. Just tedious fixing the same mistake that I rushed into production twelve separate times.
Really. Colossally dumb.