Monday, August 30, 2010

Oracle Analytics and Functional Programming

As a hypothetical, I was asked about doing Oracle Analytic functions in Python.

[Sidebar: Politics. The question came from a DBA in a C# shop. That's why it's just hypothetical. Politically, they could never make use of this information. In the C# world, Functional Programming is often scary and evil. It does exist -- obviously -- but it's called "Oracle Analytics", which makes it acceptable. If functional programming were called "functional programming" it would be unacceptable.]

[Sidebar: SQL. A big apples-to-oranges issue here is that the Oracle analytic functions can be composed in subqueries to write complex reporting queries. When you implement these functions in Python, you appear to lose some SQL "features" for report-writing. Actually, you move the SQL functional programming into Python functional programming. Everyone complains when they have to give up SQL. The true haterz say that using Python's functional programming for database processing subverts the database and leads to moral decay and the end of Western Civilization. Godwin's Law then applies.]

The specific functions put to me were RANK, FIRST, LAST, ROW_NUMBER and DENSE_RANK. All of these are relatively simple functional programming examples.

GROUP BY

First -- and foremost -- SQL GROUP-BY can be slow. No one ever wants to hear this. The true haterz will claim that it's supposed to be faster and provide a bunch of non-reasons. Please benchmark.

Every database has to provide a GROUP-BY implementation that's perfectly general; it has to sort. A procedural program can sometimes do the same operation much more quickly because it doesn't have to be perfectly general; we can make different memory use tradeoffs than the database does.

For a fast GROUP-BY, use a hash map of some kind. Read all the rows in the simplest, fastest array fetch possible. Here's how you can do a blindingly fast SUM/GROUP-BY.

from collections import defaultdict
from decimal import Decimal

# Accumulate a total per key; missing keys start at Decimal(0).
groups = defaultdict( Decimal )
for row in some_query_result_set:
    groups[row['key']] += row['value']  # assumes the value is Decimal-compatible

Writing code like this is based on the assumption that the number of groups is small enough to fit into memory. That assumption allows us to avoid a sort. The database can't make this assumption, and can't easily use this kind of data structure.
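
Here's a sketch of feeding this from a DB-API connection. The connection, table, and column names are hypothetical, and it assumes a driver whose cursors are iterable and whose amounts come back as exact Decimal values.

from collections import defaultdict
from decimal import Decimal

cursor = connection.cursor()
cursor.execute( "SELECT key_col, amount_col FROM some_table" )
groups = defaultdict( Decimal )
for key, value in cursor:  # row-at-a-time tuple fetch
    groups[key] += value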

Functional Programming

The analytic functions are nice applications of a functional style of programming. ROW_NUMBER is already part of Python: it's the built-in enumerate function.
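
For example, pairing each row with a 1-based row number (`some_query` and `process` are stand-ins):

for row_number, row in enumerate( some_query, 1 ):  # start at 1 to match SQL
    process( row_number, row )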

We can use enumerate to implement FIRST.

def first( the_limit, some_query ):
    for number, row in enumerate( some_query ):
        if number >= the_limit:
            break
        yield row

This first function can then be used in another loop.

for row in first( 100, some_query ):
    process( row )

LAST is more complex because there's no easy way to skip ahead. Instead we have to buffer some rows.

def last( the_limit, some_query ):
    queue = []
    for row in some_query:
        if len(queue) == the_limit:
            queue.pop( 0 )
        queue.append( row )
    for row in queue:
        yield row

This can be sped up a little by using an explicit iterator object and breaking the `for row in some_query` loop into two phases to hoist the nearly-constant if-statement.
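
Here's a sketch of that two-phase version; the behavior is the same, but the size check runs only while the buffer is filling.

def last( the_limit, some_query ):
    query_iter = iter( some_query )
    queue = []
    # Phase 1: fill the buffer, checking the size on each row.
    for row in query_iter:
        queue.append( row )
        if len(queue) == the_limit:
            break
    # Phase 2: the buffer stays full; pop-and-append with no check.
    for row in query_iter:
        queue.pop( 0 )
        queue.append( row )
    for row in queue:
        yield row

(In Python 2.6 and later, collections.deque with a maxlen argument does the same bookkeeping in C.)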

These can be composed with Python's sorted function.

for row in first( 10, sorted( some_query, key=lambda row: row['key'] ) ):
    process( row )

This is elegant, but may only draw a tie in a race against the database. Python has the potential advantage of in-memory sorting. Oracle, however, is clever enough to use in-memory sorting on small sets of data, offsetting Python's advantage.

More Complex Functions

The more complex functions of RANK and DENSE_RANK require two-phase processing of rows. If we assume that we can fit the rows in memory, this isn't very complex at all. Indeed, the rank function is just a glorified order-by of a subset of the original data. However, it does require a potentially slow sort step.

ranked = enumerate( sorted( some_query, key=lambda row: row['key'] ) )
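
This pairs each row with its position in key order. Note that enumerate counts from 0, so add one -- or pass a start of 1 -- to match SQL's 1-based ranks:

for rank, row in ranked:
    process( rank + 1, row )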

Okay. So RANK isn't so complex after all. Functional programming rocks.

DENSE_RANK is confusing to me, but it looks like the key phrase is "Rows with equal values for the ranking criteria receive the same rank." This means that the simple built-in enumerate isn't appropriate, and we need a key-aware enumeration.

def dense_rank( some_query, key ):
    query_iter = iter( some_query )
    rank = 1
    current = next( query_iter )
    yield rank, current
    for row in query_iter:
        # Only a change in key value advances the rank.
        if key(current) != key(row):
            rank += 1
            current = row
        yield rank, row

Composition

One of the strong suits of SQL is that it allows us to define a functional-programming composition. Rather than write a lot of looping, we specify a series of functions which are composed and applied to our data.

For example, using FIRST and DENSE_RANK can be done like this.

for rank, row in first( 10, dense_rank( some_ordered_query, key=lambda row: row['key'] ) ):
    process( row )

This functional programming composition is -- BTW -- precisely what SQL specifies. SQL describes incremental processing of each row through a kind of pipeline that does map, filter, reduce and sort algorithms on the row.

The ORDER-BY clause is a sort. (Logically, SQL applies it last; an implementation is free to sort early in the pipeline.)

The WHERE clause is an initial filter. It may involve a mapping if there are calculations in the various parts of the WHERE clause.

The GROUP-BY clause is a reduction into groups.

The HAVING clause is a second filter, applied to the groups. It may involve a mapping if there are calculations in the various parts of the HAVING clause.

Finally the SELECT clause is a mapping that does calculations on the resulting collection of rows.

The analytic functions, like subqueries, are simply more complex mapping operations that involve other query pipelines.
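
Here is that pipeline spelled out as a sketch in Python; the column names, predicates, and thresholds are all hypothetical.

from itertools import groupby

def query_pipeline( some_query ):
    # WHERE: an initial filter, with any calculations as a mapping.
    filtered = ( row for row in some_query if row['amount'] > 0 )
    # A sort, here preparing the rows for grouping.
    ordered = sorted( filtered, key=lambda row: row['key'] )
    # GROUP-BY: a reduction into groups.
    grouped = (
        (key, sum( row['amount'] for row in rows ))
        for key, rows in groupby( ordered, key=lambda row: row['key'] )
    )
    # HAVING: a second filter, applied to the groups.
    large = ( (key, total) for key, total in grouped if total > 100 )
    # SELECT: a final mapping over the resulting groups.
    return ( { 'key': key, 'total': total } for key, total in large )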

Wednesday, August 25, 2010

Making a bad problem worse

Imagine that you're a beer distributor who provides "just-in-time" beer by type. You don't take orders for a specific brand, you take orders by type: stout, lager, India pale ale, etc. You resolve the bill based on what you actually delivered.

This can be kind of complex. However, there's no call for being crazy about it. Yet, before we get to the end of this saga, it will get crazy.

Your fulfillment (or "order-to-ship") process is rather complex. There's the order as placed and there's the actual products that got shipped. Your invoicing is also rather complex because you have to reconcile the order with the fulfillment and the final delivery.

Recently, I saw a piece of a "database" design for software used to support this kind of business. The design was so dysfunctional that there were heated arguments about how to proceed.

Note that the real business with the dysfunctional database deals in intellectual property, not beer. The fact that they deal in IP instead of tangible assets seems to make them easily confused. One thing I've heard -- but haven't seen -- is that they update the order during the fulfillment process. After all, the customer's order should match the invoice, right? Yes, but. In this business, the fulfillment will diverge from the order; it makes more sense to create a mapping from invoice to order than to rewrite the order.

Not the Crazy Part

The hellish thing that I saw was the many-to-many association between order and beer type. A many-to-many isn't bad. What they did, however, was really bad. But not crazy. Not yet.

It's a many-to-many table. Order has one or more Beer Types. A Beer Type can appear on any number of Orders. Could be simple.

In the world of atoms (tangible goods, not services) there's a pretty standard model where an order is a composite object with multiple line items. Each line item has a reference to a product. For this business, each line item would have a reference to a product type, instead of a specific product.

In a sensible software solution, there'd also be an invoice as a composite object; separate from the order. Only a customer can change an order. The invoice, however, would grow and change throughout the fulfillment process. The invoice, like the order, would have multiple line items. Each invoice line item would reference two things: the product actually delivered, and the order line item that this product fulfilled. This could include some "justification" or "rationale" showing how the fulfillment matches the order.
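
As a sketch of that separation -- the class and attribute names here are hypothetical:

class OrderLineItem( object ):
    """A line of the customer's order: a product type, not a product."""
    def __init__( self, product_type, quantity ):
        self.product_type = product_type
        self.quantity = quantity

class Order( object ):
    """Changed only by the customer."""
    def __init__( self, line_items ):
        self.line_items = list( line_items )

class InvoiceLineItem( object ):
    """What was actually delivered, tied back to the order line it fulfills."""
    def __init__( self, product, order_line_item, rationale="" ):
        self.product = product
        self.order_line_item = order_line_item
        self.rationale = rationale  # how this delivery matches the order

class Invoice( object ):
    """Grows and changes throughout the fulfillment process."""
    def __init__( self, order ):
        self.order = order
        self.line_items = []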

Because the real business didn't separate order and invoice -- and instead tried to massage the order to also be an invoice -- what they had was a table with flags and 10 (ten, yes ten) business rules that resolved whether or not this type of beer was part of the order.

I'll summarize. The many-to-many table had two columns with flag values and ten business rules to interpret those flag values to determine what was ordered and what was fulfilled. Two columns of flags. Ten rules. But that's not the crazy part.

Bad Data

A database that requires ten business rules and procedural processing to interpret the data is bad. It gets worse, however.

One of the ten business rules is a tie-breaker. The process that fulfilled orders was so badly broken that it could (and did) create multiple, conflicting invoice-to-type association rows. I was shocked: multiple, conflicting invoice-to-type association rows. Rule 10 was "in the event of a tie, there's 'bad data', pick a row and keep going."

There's "bad data"? Keep going? I would think this would be a show-stopper. Whomever wrote the application that created the bad data needs career guidance (guidance as in, "you're fired".)

It's Broken, But...

Clearly, any database that requires ten procedural business rules is not much of a database. A SQL query cannot be used to produce either order or invoice. To fetch an order or an invoice requires a procedure so complex that the organization cannot even figure out what programming language to use.

A procedure so complex that the organization cannot even figure out what programming language to use. Really.

The DBAs say it can be done as a stored procedure. And they have a worse plan, too. We'll get to that.

The programmers want to do this in C# because -- clearly -- the database is broken.

If you can't agree on the implementation, you've got big, big problems.

The Crazy Part

The pitch from one of the DBAs was to add yet more complexity. There are two flag columns and ten business rules to resolve the nuances of order and fulfillment. This is a mistake which requires that someone sit down and work out a fix.

Instead, the DBA pitched using Oracle's analytic functions to make the complex procedural processing look like "ordinary" database processing.

Wait, what?

That's right. Take a database design so complex that it's dysfunctional and add complexity to it.

Call me crazy but anyone who uses Oracle analytic functions on this problem now has two problems. They've got an analytic layer that only one DBA understands. This wraps a broken many-to-many table that (apparently) no one understands.

None of this reflects the actual business model very well, does it?

Bottom Line

If the database does not (1) reflect the actual business model and (2) work in simple SQL, it's broken. Adding technology to a broken database makes it more complex but leaves it essentially broken.

Stop. Adding. Complexity.

Tuesday, August 17, 2010

Is It Worth Describing Obscure Features?

I'm rewriting Building Skills in Python. Python 2.7 is out. As more libraries make the move, Python 3.1 is getting more and more viable.

I'm looking closely at the Decorators chapter (Part 3, Chapter 6).

And I'm struggling with classmethod. It's a first-class part of Python. And I'm sure there are folks who find it useful.

But I'm struggling to find a "simple" (i.e., under 12 lines of code) example where it might be useful.

Indeed, I'm starting to suffer from a growing feeling that this is one language feature that I can gracefully demote from in-depth coverage to a mere mention.

There is a relevant Stack Overflow question: What are Class methods in Python for? But the examples there aren't terribly compelling. Perhaps I have a blind-spot here because I've never seen a big need for writing metaclasses. Or perhaps because I don't see a need for creating alternate constructors -- the options in __init__() seem to cover almost all my needs.
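
For reference, the example usually offered is an alternate constructor, along these lines (the class is hypothetical):

class Interval( object ):
    def __init__( self, low, high ):
        self.low, self.high = low, high
    @classmethod
    def from_midpoint( cls, mid, width ):
        # cls, not Interval, so that subclasses construct themselves.
        return cls( mid - width/2.0, mid + width/2.0 )

Even this collapses easily enough into keyword options on __init__(), which is rather the point.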

Monday, August 9, 2010

End User Programming -- Solution or Nuisance?

The question of "customization" and "extension" is a common one. For example, a requirement like "non-technical users should be able to extend it with simple scripts".

This is -- at best -- little more than an attractive nuisance. At worst, of course, it turns into a maintenance nightmare and the reason for killing the project. Let's look at the ways this can end badly.

1. Your "end users" can't or don't customize it with small scripts. Bottom line: you wrote a scripting interface which only you can use. Not all bad, but hardly worth the effort. Here's one version of rule: "Make it possible for programmers to write in English and you will find that programmers cannot write in English."

2. Your "end users" turn into the worst kind of hackers and break things left, right and center. Bottom line: you spend more time debugging the messes created by scripting.

3. The real world use cases turn out to be more complex than you bargained for. You spend endless time extending and modifying the scripting support to try and balance limited power (to prevent problems) against enough sophistication (to handle unforeseen use cases.)

"Well," you argue, "those are extreme cases. There has to be a middle road where the user's behave properly." No, not really. This middle road doesn't exist. Why? Programming is hard. A phrase like "simple scripts" is contradictory.

War Story I -- The SAS Hacks

At one organization, the accountants had created a monster. Their chart of accounts in the General Ledger (GL) was not consistent across all of the organizations. Consequently, financial analysis programs required careful rearrangement of certain ledger totals to make them reflect the business reality, not the legacy GL mistakes.

So they wrote some "simple" business rules in SAS to do GL extracts. The extracts processed by these rules were useful for financial reporting because the legacy GL work-arounds were properly allocated.

Our job was to write a proper financial reporting tool that correctly extracted and reported on ledger data. It was going to be a proper data warehouse. The facts were amounts, the various dimensions were time, account, organization, bank product line, etc.

However, because of the SAS hacks, we could not get a single straight answer to any question. Everything had to be explained by a "here's some code" answer. Their business rules numbered over 9,000 individual simple statements.

Question: How is 9,000 of anything "simple"? Answer: it isn't. The volume makes it complex.

Further -- of course -- the business rules were actually more complex than claimed. The superficial claim was that the rules implemented line-of-business rules (separate from organization). Some statistics showed that the rules likely encoded two independent dimensions for each amount.

The "simple" rules were used to create huge complexity. It's human nature. The world is complex; we want to write software which reflects the world as we encounter it.

War Story II -- The MS Access Hack

At another organization, we had helped the end users buy really nice reporting tools from Cognos (before they became part of IBM) and Business Objects (before they became part of SAP). The tools were -- at the time -- the state of the art in flexible end-user reporting. Cool drag-and-drop interfaces, flexible scripting add-on capabilities. Really sweet.

Did the users actually use this power?

No.

They wrote dumb, large, detail-oriented reports. They extracted the bulk data into MS-Access Databases. Then they hacked around in MS-Access to produce the reports they wanted.

Why? (1) They understood Access; learning to use Cognos or BO tools wasn't going to happen. (2) They wanted real power; the limited scripting capabilities weren't solving their problems.

The reason we became involved was because their Chief Access Hack (CAH™) got another job and no one could figure out how the Access stuff worked.

After two weeks of reverse engineering, I found a bunch of business rules that should have been in the data warehouse. We had a nice meeting with their executive; we wanted to talk about the CAH and what they were going to do to prevent this kind of MS-Access nightmare in the future.

The executive reminded us that business rules are subtle and complex. Data warehouse software development is slow and clumsy. The users will always be several steps ahead of the warehouse processing. The MS-Access hacking is how they coped with the limitations of the warehouse.

Bottom Lines

Software is complex. We can't create a language suitable for "non-technical users". Either the language is too technical for our users -- and it isn't used -- or the users become technical and demand more features.

People learn. They're only non-technical users on day 1. If they start using the scripting language, they immediately start on the path to become technical users.

The real world is complex. There are no "simple scripts" except in a few textbook cases. When we first set out to build our simple scripting environment, we start with a few sensible use cases. Then unforeseen use cases crop up and we either tell people they can't use our system [Really, would you turn them down?] or we have to add complexity.

Architectural Implications

Because of this complexity issue, resist some of the urges to refactor. Not all refactoring is good refactoring. There is A Limit to Reuse. Specifically, any attempt to plug scripting into a larger, standardized application rapidly becomes hard to manage. It's better to provide the necessary libraries and let people build an application around their script than to plug their script into a standardized application.