
Tuesday, June 29, 2010

Creating Complexity Where None Existed

I read a 482-word treatise that amounted to these four words: "sales and delivery disagree".
A more useful summary is "Sales and Delivery have different views of the order".

It started out calling the standard sales-delivery differences a "Conflict" requiring "Resolution". The description was so hopelessly enmeshed in the conflict that it code-named sales and delivery "Flintstones" and "Rubbles", as if they might see their names in the email and object. [Or -- what's more likely -- the author refused to see the forest for the drama among the trees.]

What?

Sales and delivery are in perpetual conflict, and no "resolution" is possible. I assume this "resolution" comes from living in a fantasy world where the order-to-fulfillment and fulfillment-to-invoice processes somehow agree at each step, and the invoice always matches the order in every particular.

If this were actually true, either sales or delivery would be redundant and could be eliminated from the organization.

Fantastic Software

I'm guessing that someone fantasized about an order-to-invoice process and wrote software that didn't reflect any of the actual issues that occur when trying to deliver services. Now, of course, reality doesn't match the fantasy software and someone wants a "solution".

Part of finding that solution appears to be an effort to document (482 words!) this "drama" and "conflict" between sales and delivery.

Here's what I observed.
  1. Take a standard process of perfectly typical complexity and fantasize about it, writing completely useless software.
  2. Document the process as though it's a titanic struggle between two evil empires of vast and malicious sociopaths, with innocent little IT stuck in the middle as these worlds collide. Assign code names to sales and delivery to make the conflict seem larger and more important than it is.
  3. Start layering in yet more complexity: a "conflict resolution algorithm" and other buzzwords that aren't part of the problem.
  4. Start researching these peripheral issues.
That turns a standard business process into something so complex that one could spend years doing nothing useful. A make-work project of epic proportions.

Thursday, June 24, 2010

TDD and Python

First, let me say that TDD rocks.

Few things are as much fun as (1) writing a test script for a feature, and then (2) debugging the feature incrementally until it passes the test. It's fun because a great deal of hand-wringing and over-thinking is taken off the table.

To paraphrase Obi-Wan Kenobi:

Use The Test, Luke.

The essence of TDD is a pleasant two-step process: write tests, write code.
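
Here's the two-step in miniature. This is only a sketch: rate_per_second is a made-up feature, not from any real project.

    import unittest

    # Step 1: write the test. rate_per_second doesn't exist yet, so
    # this test fails the first time it runs.
    class TestRatePerSecond(unittest.TestCase):
        def test_one_hour_of_rows(self):
            self.assertEqual(rate_per_second(rows=3600, seconds=3600), 1.0)

    # Step 2: write just enough code to make the test pass.
    def rate_per_second(rows, seconds):
        return float(rows) / seconds

    if __name__ == "__main__":
        unittest.main()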

However, leaving things at this simplistic level isn't appropriate.

Code Quality

Most folks describe TDD as a 3-step process. I like to call this "red-green-gold" (the Lithuanian-flag version of TDD).
  1. Tests don't pass (red).
  2. Tests pass (green).
  3. Refactor the code until things look good (gold).
The point here is that once you have tests that pass, you can trivially engage in refactoring and other engineering tasks to improve the overall quality of the code. You can optimize or make it more readable or more reusable without breaking it.

Even this isn't quite right.

Test Quality

The issue with a too-simplistic view of TDD is that we walk a fine line.
  • Over-engineering the tests.
  • Under-engineering the tests.
We can -- trivially -- fall into the trap of wringing our hands over every potential nuance of our new piece of code. We can be stalled writing tests. Often we hear complaints from folks who fall into this trap. They spend too much time writing tests and indict all of TDD because they dove into details too early in the process.

We can -- equally easily -- fall into the trap of failing to write suitably robust tests for our software.

TDD is really a 3+1 step process.
  1. Write tests, which don't pass (Red).
  2. Write code until tests pass (Green).
  3. (a) Clean up the code to improve its quality. (b) Expand the tests to an appropriate level of robustness.
The operative word here is "appropriate".
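
Continuing the made-up rate_per_second example from the sketch above, step 3 might look like this. Which edge cases count as "appropriate" is exactly the judgment call being described.

    import unittest

    # Step 3(a): clean up the code -- here, an explicit guard.
    def rate_per_second(rows, seconds):
        if seconds <= 0:
            raise ValueError("seconds must be positive")
        return float(rows) / seconds

    # Step 3(b): expand the tests to an appropriate level of robustness.
    class TestRateRobustness(unittest.TestCase):
        def test_zero_duration_rejected(self):
            with self.assertRaises(ValueError):
                rate_per_second(rows=100, seconds=0)

        def test_fractional_rate(self):
            self.assertAlmostEqual(rate_per_second(rows=1, seconds=3), 1.0 / 3.0)

    if __name__ == "__main__":
        unittest.main()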

Costs and Benefits

Some modules -- because of risk or complexity or visibility -- require extensive testing. Some modules don't require this.

Interestingly, portability -- even in Python -- requires some care in testing. It turns out that MySQL and SQLite are not completely identical in their behavior.

Omitting an ORDER BY clause in a query can "work by accident" in one database and fail in another. So we need appropriate testing to ferret out these RDBMS-specific issues. Until we have that testing, we have an application that works in SQLite but fails in MySQL.
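
One way to catch this is a test that states the expected order explicitly. A sketch against the standard-library SQLite driver; the event table and its values are invented for illustration:

    import sqlite3
    import unittest

    class TestOrderingIsExplicit(unittest.TestCase):
        # Without ORDER BY, row order is an accident of the storage
        # engine: SQLite may happen to return insertion order where
        # MySQL returns something else entirely.

        def setUp(self):
            self.db = sqlite3.connect(":memory:")
            self.db.execute("CREATE TABLE event (id INTEGER, name TEXT)")
            self.db.executemany(
                "INSERT INTO event VALUES (?, ?)",
                [(2, "b"), (3, "c"), (1, "a")],
            )

        def test_order_is_stated_not_assumed(self):
            # The explicit ORDER BY makes this expectation portable.
            rows = self.db.execute("SELECT id FROM event ORDER BY id").fetchall()
            self.assertEqual(rows, [(1,), (2,), (3,)])

    if __name__ == "__main__":
        unittest.main()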

The initial gut reaction can sometimes be "TDD failed us".

But this isn't true. TDD actually helped us by (1) identifying code which passed on one platform and failed on another, and (2) leading us to beef up all tests which depend on ordering. Pleasantly, there aren't many.

Wednesday, June 16, 2010

Adobe's Feckless Updater

Consider this dialog box:

The application was modified. It can't be updated.

Why not just replace it? Replacing a modified application seems to be a perfectly sensible use case.

But no, rather than doing something useful, it shows a dialog box. I guess no one thought through this use case and asked what -- if anything -- the Actor actually cares about. Doing installs is not one of my goals as an actor. Managing installations is not one of my goals. I want to (a) read PDFs and (b) have everything else handled automatically.

Monday, June 14, 2010

Sales Person with Principles

My MacBook has an 80GB drive with less than 2GB available. A few times I've totally filled the disk and had to spend time judiciously searching out and removing old files. Sigh.

I have (and use) external hard drives, but it seems to violate some kind of "laptop" principle to be tethered to the desk. Backups, yes; general writing, no.

I use a 4GB thumb drive for much of my writing. That allows me to travel with just my work laptop without synchronizing files. It also allows me to work with poor or no connectivity.

Today, I went to my local Apple Store to talk about a new MacBook Pro. The sales person had a bunch of ways to preserve my old machine using removable hard drives and my MobileMe account. After a long conversation, I had to beg them to sell me a new computer. It appears they'd rather preserve my investment than wring more money out of me.

What? Make do with "good enough"? Improve slightly to solve the actual problem I actually have? Spend the least to get the most? That seems downright un-American!

The sales person was pretty sure I didn't need a MacBook Pro. A MacBook was good enough. (I have an older FireWire video camera, so the Pro is necessary.)

And that's not all. On-line purchases from http://store.apple.com/ allow more customization than in-store purchases. Since my needs were unusual, the sales person sent me home empty-handed rather than sell me a product that wasn't exactly what I needed. We went through the on-line ordering a few different ways to explore what I actually need.

It's odd to meet principled sales people.

Friday, June 11, 2010

Sagan-esque Data Volumes

About once a week a question shows up on Stack Overflow that involves loading a database with truly epic volumes of data. For example "billions of rows in a single table for a month".

Billions of rows per month works out to a minimum insert rate of about 385 rows per second, sustained around the clock.

Also, this quote is killer: "data for the past 5 years". That's a minimum of 60 billion rows.
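
The arithmetic is worth spelling out, taking one billion rows per month as the floor:

    rows_per_month = 10 ** 9
    seconds_per_month = 30 * 24 * 60 * 60      # 2,592,000

    print(rows_per_month / seconds_per_month)  # about 385 rows per second, sustained
    print(rows_per_month * 12 * 5)             # 60,000,000,000 rows over 5 years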

This is a really, really poor use of an RDBMS. It requires some kind of well-planned hierarchy of storage and analytic solutions. Simple load-and-query can't work.

Goal

The question is "What's the goal?" Some of the Stack Overflow questions lack essential use cases, making it impossible to determine what these folks are trying to do.

What's certain, however, is that no human being is going to do an ad-hoc SQL query on 60 billion rows of anything. Analysis of data volumes like that will involve fairly narrow and specific queries.

Analysis of a subset may involve ad-hoc SQL queries. But the whole data set isn't really useful -- as a whole. It's useful when sliced and diced.

Heresy

At this point, many DBAs pronounce me Heretic and Apostate. Anyone who suggests that a SQL database is (a) slow and (b) biased toward ad-hoc queries must have fallen from the true path.

First, SQL is slow. A flat file is always faster. Try it. For reasonably well-structured data -- arriving at a sustained rate of 385 rows per second -- only a concurrent pipeline of flat-file processing can keep up. The dimensional conformance and fact-table preparation have to be done in flat files, with no database loads involved at all.
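
"Try it" is easy to take literally. Here's one hedged way to run the comparison on your own machine; the absolute numbers will vary, but the shape of the result shouldn't.

    import sqlite3
    import time

    ROWS = 100000
    data = [("key%d" % i, i) for i in range(ROWS)]

    # Append the rows to a flat file.
    start = time.time()
    with open("facts.csv", "w") as target:
        for key, value in data:
            target.write("%s,%d\n" % (key, value))
    flat_time = time.time() - start

    # Load the same rows into a database table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE fact (key TEXT, value INTEGER)")
    start = time.time()
    db.executemany("INSERT INTO fact VALUES (?, ?)", data)
    db.commit()
    db_time = time.time() - start

    print("flat file: %.3fs  database: %.3fs" % (flat_time, db_time))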

Second, SQL is for ad-hoc processing. Most applications with embedded SQL don't involve queries the user types at a command line. Rather, they use SQL specifically to divorce the application from the physical data model. The idea is that SQL offers an ad-hoc scale of flexibility in exchange for glacial processing speed.

Acquisition

The first step is to acquire the data in some storage that will handle 60 billion rows. Even if the rows are small, this is a big pile of disk. Super-large files are a mistake, so this means a directory tree of many smaller files.

Ideally, some "sharding" algorithm is used so that perhaps a dozen files are in use concurrently, each getting 30 or so rows per second. This is a more sensible processing pace, achievable by heavily loaded devices.

Data acquisition is -- itself -- a highly parallelized operation. The rows must be immediately separated into pipelines. Each pipeline must be a multi-processing sequence of dimension conformance operations. At the end of each pipeline, a standardized row with all of the dimension FK's emerges and is appended to a file. Some flushing and periodic close-reopen operations will probably be reliable enough.
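
A minimal sketch of the routing and append steps. The names here are invented, and conform_dimensions stands in for the whole multi-processing conformance pipeline:

    import hashlib

    NUM_SHARDS = 12  # a dozen files in use concurrently, ~30 rows/second each

    def shard_for(key, num_shards=NUM_SHARDS):
        # A stable hash, so the same key always routes to the same file.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return int(digest, 16) % num_shards

    def conform_dimensions(raw_row, lookups):
        # Stand-in for the conformance stage: swap each raw value for
        # the surrogate key assigned in the master-dimension database.
        return dict((name, lookups[name][value])
                    for name, value in raw_row.items())

    def append_fact(conformed_row, shard_files):
        # End of the pipeline: a standardized row, all dimension FK's,
        # appended to one of the shard files.
        shard = shard_for(conformed_row["customer"])
        line = ",".join(str(value) for value in conformed_row.values())
        shard_files[shard].write(line + "\n")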

The dimension values can be built into a database. The facts, however, have to reside in flat files.

Analysis

In the unlikely case that someone thinks they want to analyze all 60 billion rows, there are two things to be done. First, talk them out of it. Second, write special-purpose flat-file analyzers which do concurrent map-reduce operations on all of the various source files.

In the more likely use cases, folks want a subset of the data. In this case, there's a three-part process.
  1. Grab the relevant dimensions. They're in a master-dimension database, being constantly updated by the ongoing loads.
  2. Filter the facts. This is a massively parallel map-reduce process that extracts the relevant rows from the fact files and creates a "data mart" fact file (sketched below).
  3. Load a datamart. This has dimensions and facts. The facts may have to be summarized to create appropriate sums and counts.
This subset datamart can be turned over to people for further slicing and dicing or whatever it is that they do with 60 billion rows of data.
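
A sketch of step 2, using the multiprocessing module; the file layout and column positions are invented for illustration:

    import csv
    from multiprocessing import Pool

    def filter_one_file(args):
        # Map step: scan one fact file, keeping only rows whose
        # dimension FK (column 0, by assumption) is in the wanted set.
        path, wanted_keys = args
        with open(path) as source:
            return [row for row in csv.reader(source) if row[0] in wanted_keys]

    def build_datamart_facts(fact_paths, wanted_keys):
        # Reduce step: run the scans in parallel and concatenate the
        # survivors into one data-mart-sized set of fact rows.
        pool = Pool()
        parts = pool.map(filter_one_file,
                         [(path, wanted_keys) for path in fact_paths])
        pool.close()
        pool.join()
        return [row for part in parts for row in part]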

Wednesday, June 9, 2010

The Users Just Want "Search" -- What's So Hard?

Great article on "Search" from back in '08 in Forbes. "Why Google Isn't Enough", by Dan Woods. He's talking about "Enterprise Search": why in-house Google-style search is really hard and often unsatisfying.

Here's the cool quote.
enterprise search systems also index and navigate information that may reside in databases, content management systems and other structured or semi-structured repositories. The contents may include not only text documents, but also spreadsheets, presentations, XML documents and so on. Even text documents may include some amount of structure, perhaps stored in an XML format.
Everyone thinks (hopes) that the mere presence of data is sufficient. The fact that it's structured doesn't seem to influence their hopes.

The complication is simple -- and harsh. Many enterprise databases are really bad. Really, really epically bad. So bad as to be incomprehensible to a search engine.

Explanations

How many spreadsheets or reports "stand alone" as tidy, complete, usable documents?

Almost none.

You create a budget for a project. It seems clear enough. Then the project director wants to know if the labor costs are "burdened or unburdened". So the column labeled "cost" has to be further qualified. And "burdened" costs need to be detailed as to which -- exact -- overheads are included.

So a search engine might find your spreadsheet. If a person can't interpret the data, neither can a search engine.

Star Schema Nuance

You can build a clever star schema from source data. But what you find is that your sources have nuanced definitions. Each field isn't directly mappable because it includes one or more subtleties.

Customer name and address. Seems simple enough. But... is that mailing address or shipping address or billing address? Phone number. Seems simple. Fax, Voice, Mobile, Land-line, corporate switch-board, direct? Sigh. So much detail.
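
In schema terms, each "simple" field becomes a set of qualified values. A contrived illustration:

    # A contrived customer record: every "simple" field needs a
    # qualifier before it can be mapped into the star schema.
    customer = {
        "name": "Example Corp",
        "address": {
            "mailing": "PO Box 12, Smallville",
            "shipping": "4 Dock St, Smallville",
            "billing": "100 Main St, Metropolis",
        },
        "phone": {
            "voice": "555-0100",
            "fax": "555-0101",
            "mobile": "555-0102",
        },
    }

    # "The customer's address" isn't one answer; it depends on the qualifier.
    print(customer["address"]["billing"])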

Of course the users "just want search".

Sadly, they've created data so subtle and nuanced that they can't have search.

Thursday, June 3, 2010

Buzz in the general public regarding software bugs

I got this the other day: "there seems to be a lot of buzz out in the general public regarding software bugs".

Attached to this was an article from The Economist in 2003, plus one from 2010. To me, two articles seven years apart doesn't seem like a "lot" of buzz. But what do I know?

Further, it did not come from someone outside the software/IT industry. It came from a DBA. I guess the presence of this email in my inbox must mean some DBAs are surprised that there are bugs. Perhaps they were surprised to see "bug" in a general-interest magazine.

They also forwarded a link to http://www.glitchthebook.com/. This looks more interesting than a writer for The Economist (http://www.economist.com/) telling a general audience what every professional should already know.

I guess it could be interesting when someone notices "bug" in a general-interest magazine.

Hidden Cost Hogwash

I object, however, to this "hidden cost" hogwash. Bugs have an explicit, obvious, direct cost. There may be "hidden costs", but they are largely irrelevant and pale in comparison to the direct costs.

What we need are articles not on the "hidden cost" of bugs, but on the bugs themselves. In particular, there are two kinds of bugs whose actual costs we need to look at: "hidden bugs" and "compound bugs".
  • Hidden Bugs. These are bugs lurking just below the user-interface level. They're present, and they're often worked around by UI hacks. Hidden bugs are more costly than visible bugs, and complex multi-layered, multi-component architectures are packed with them.
  • Compound Bugs. These are hidden bugs where the workaround also has a bug. The interface file has an intermittent glitch, so the web services are cluttered with try: statements. The try: statements, themselves, harbor bugs, so we then have to add assert statements and declare it "defensive programming". The net effect is simply to log something that was provided to the interface incorrectly. Sigh. (A sketch of the pattern follows this list.)
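
Here's a minimal, hypothetical sketch of a compound bug. Nothing here is from a real system; parse and the "interface record" format are invented for illustration.

    import logging

    logger = logging.getLogger("interface")

    def parse(line):
        # The real parser: assume the intermittent glitch shows up here
        # as a ValueError on a malformed record.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("malformed record: %r" % line)
        return key, value

    def read_interface_record(line):
        # The workaround: a try: block that swallows the glitch -- and,
        # with it, every genuinely bad record. That silent None is the
        # workaround's own bug; the downstream asserts and "defensive
        # programming" exist only to cope with it. The net effect is
        # simply to log data that reached the interface incorrectly.
        try:
            return parse(line)
        except ValueError:
            logger.warning("skipping record: %r", line)
            return None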
We shouldn't waste time talking about the "hidden costs" of glitches when we aren't even sure what the actual up-front costs are. If we knew the costs, we'd spend a bit more on the software to prevent the bugs in the first place.

We also shouldn't be surprised to see "bug" in a general-interest magazine.