Thursday, January 27, 2011

FAERIE DUST™

Here's how to recognize a Faerie Dust request:
  1. We have identified a problem. It can be with almost anything: scalability, reliability, auditability, any Quality Measure.
  2. We're pursuing a specific technology. Typically, something that has the lowest impact on our architecture.
  3. We can't address anything other than this specific technology variation -- we can't change the application software or buy hardware.
Once we're in the Faerie Dust realm, what can we do?

Laughing doesn't help. They have a serious problem, they need a solution. The fact that they won't address the cause isn't completely relevant -- we have to work on the denial, anger, negotiation, depression cycle first. Hopefully skipping past the anger, or assuring the anger is directed elsewhere.

Helping doesn't help. If we join the quest for their Faerie Dust, what will we accomplish? We'll burn billable hours to -- eventually -- reach an equivocal non-solution with a complex write-up and recommendations that won't be implemented.

Not helping doesn't help. If we obstinately refuse to join the quest for the Faerie Dust... well... then we've done nothing. We haven't advanced their understanding of their problem.

What's left? Is there a middle road that allows us to join the Faerie Dust quest, but still point out the side roads, other monsters and other treasures along the way?

Perhaps there is, but it would require a kind of saintly patient persistence. We would have to start with an enumeration of problem causes, prioritize them, and then focus on their selected bit of Faerie Dust. My idea is that enumerating the possible causes allows us to identify the missed opportunities, and the possible magnitude of fixing something essential (algorithm or data structure) instead of throwing up window-dressing to cover problems in something inessential (reducing the time required for a table scan).

Example

Here's a concrete example of Faerie Dust.
  1. Pick a data model that doesn't fit the use cases. i.e., lumped many discrete details into a single text field that has "rich semantic content". Work around this mistake by using wild-card matches.
  2. Complained about performance and dug into nuanced details of LIKE clause and full-text search. Lots of study time spent on LIKE clause processing and how to improvement performance.
  3. Refused to discuss the actual use case or the mismatch between data structures and requirements.
The design didn't match the use cases. Faerie Dust won't help.

Tuesday, January 25, 2011

Wild-Card (LIKE-clause) searches are slow. What to do?

Patient: "Doctor, doctor, it hurts when I do this."
Doctor: "Then don't do that."

I got an email with hundreds of words of content. This part made sense: "...doing wild card searches using Oracle's database engine and are wondering why is it so slow and how do they make it go faster."

The rest made very little sense at all. The programmer in question immediately dove into nuances of indexing, Oracle pattern matching, Oracle Text Query and other technical questions. The entire focus was on the technical ins-and-outs.

Not a single word on why wildcards were even being used in the first place. Wildcards appear to solve a business problem; the business problem was never mentioned.

Use Case for Wildcards

After some back-and-forth, the use case emerged. We'll address it below. Essentially, the invoices have names (really) that have "rich semantic content". These invoice names have the form "{customer} {time period} {offering}".

Apparently, the use case is "slice-and-dice" queries. All invoices for a given customer; all invoices in a given time period; all invoices for a given offering; various combinations.

Really. Rather than provide discrete dimensions and use a star schema, they've (a) combined all attributes into a single free-text field and (b) used wild-card searches and now (c) want to complain about it.

We'll return to this use case below.

Basic Rules

Here's are the two rules.

Wild Cards Are The Last Resort For Human-Friendly Search.

Outside Human-Friendly Search, Wild Cards Are Useless.

Let's look at rule 1: Wild Cards Are The Last Resort For Human-Friendly Search.

When a person enters a search string on a web page, we have two choices.
  1. Trust them to enter the exact field as it appears in the database
  2. Presume that people are fallible and cannot be trusted to enter the exact field.
In case #1 (exact match) we might be using an account number, shipping number, an invoice number or some kind of surrogate key. In this case, we do simple equality checks. If the user can't get it right, bummer. In many cases, this is appropriate to prevent snooping.

In case #2 (partial match), we're forced to use a some kind of SQL LIKE clause for the human-friendly search. We have several implementation choices, some in the database, some out of the database. Some in-the-database solutions benefit from clever indexing. Many in-the-database solutions are pretty slow.

Yes, an out-of-the-database solution may actually be faster. Until we benchmark, we can't know. There's no trivial rule that says the database always does search faster. For real speed, we may have to resort to a hybrid solution.

Search Optimization

We might create a small RESTful server for our searchable text fields. This is a cache; the server should handle CRUD rules to assure cache coherence. This search server can uses a Regular Expression engine, or perhaps compute Levenshtein distances or whatever makes sense to optimize user-oriented search.

If we're searching in larger chunks of text, we might want to use a commercial full-text search.

What's essential about this plan is that we're looking at application-specific optimizations. People need flexibility for specific reasons. It's important to look at the actual use cases where a person cannot make an exact match lookup. What problems do they have?

An application may have to deal with customer names. These are often difficult to spell consistently. (Is it "AT&T" or "ATT"?) For this kind of thing Levenshtein Distance might make more sense than wild-card searches.

An application may have to deal with time periods. "2010", "2Q 2010", "July 2010", etc. This is best handled by decomposing time periods into discrete fields and doing appropriate exact match on the specific, relevant fields. The issue is that there are a lot of formulations and some text parsing can be better than a form with a million drop-downs.

An application may have to deal with oddly-named offerings. Marketing calls it one thing. Sales folks call it another. The customer's invoice may call it a third, and the help desk may not use any of those phrases. This may benefit from wild-cards.

Note that we're looking at the business issues. Not the technology issues.

Design Errors

The proper use for LIKE is only to optimize the human-friendly search. Nothing else. Which brings us to rule 2, Outside Human-Friendly Search, Wild Cards are Useless.

Outside human search, every wild-card in a SQL statement indicates a serious database design error. Serious? Error? Yes.

LIKE clauses outside human search indicate a failure to create a design in first normal form (1NF). A field which is used in a LIKE clause has multiple parts, and should have been decomposed into pieces.

Decomposing a multi-part attribute isn't always trivial. There are two cases.
  1. Simple, regular format or punctuation. For example, SSN, US Phones or ZIP codes: 123-45-6789 or (123)555-1234 or 12345-1234.
  2. Complex, irregular format or punctuation. In this case, we have disjoint subtypes in a single table. Most manufacturing part numbers suffer from this.
In case 1, we have two choices: fully decompose or denormalize. In case 2, we can only denormalize because the rules are irregular.

The decomposition solution does not have to lead to a hideous user interface. We can have a web page with a single text field for phone numbers. We can parse that string and decompose the phone number into area code, exchange and number for purposes of database storage. We don't have to thoughtlessly force the users to decompose a field that they don't see as being in three parts.

The denormalization solution means that we have to do some calculation when we accept the input value. We save the full field, plus we extract the various sub-fields based on whatever hellish, complex rules we're faced with.

Implementation Choices

Whenever we have a single text field with "rich semantic content" (i.e., combines multiple disjoint attributes like customer, time period and offering) what we're seeing is a clever way to push database design onto the users. The expectation is that IT will (1) understand the use cases, (2) provide a proper design and (3) optimize performance around that design.

A big text field and wild-card search (and the attendant email traffic) indicates an explicit unwillingness to discuss the real use cases, unwillingness to do design, and a lame hope that somehow wild-card searches can magically be made faster through magical indexing or other super-natural techniques.

The "rich semantic content" field can be decomposed one of two ways.
  • In the GUI. Add drop-downs so users pick the customer, time period, and product offering information.
  • In the Application. Parse the big text field into smaller text fields that don't require wild-card search.
There isn't any magic. If wild-card searches are too slow, they have to be replaced.

Benefits?

The benefit of decomposing (or denormalizing) a complex field is that we can eliminate LIKE processing and wild-cards. Instead of "LONG_TEXT_FIELD LIKE '%2Q 2010%'", we can do "DATE.QUARTER=2 AND DATE.YEAR=2010".

All the technical folderol related to indexing and full-text search and database regular expression engines goes right out the window.

The cost is that we have to "wrap" the INSERT and UPDATE processing in a class definition that does the denormalization. That's what a data model layer is for: these kinds of business rules. The insert/update cost, BTW, will be microscopic compared to the number of SELECTs. The extra time spent at INSERT will be handsomely amortized over all the simplified SELECT operations.

Monday, January 24, 2011

Python Jobs

Looked interesting enough for someone to email it to me. Still can't figure out why they sent it. Posting this feels like advertising, so perhaps I should charge them a promotional fee.

Web Services - Mobile Developer

"A minimum of 2 years of Web Service development, preferably using an established Python-based framework (Django or Pylons).
Demonstrated proficiency with Python is a must."


Maybe they sent it as confirmation that Python is somehow a "real" technology.

Tuesday, January 11, 2011

Client-Server Partitioning

I have slowly grown to love RESTful web services.

I was asked about a nearly empty section in the code repository labeled "Java client".

"Yes," I said, "it's a place-holder for a Java package that includes classes to wrap our RESTful web services."

"Really?" I was asked, "Why? We use FLEX for the client, not Java Applets."

"Today we use FLEX. In the past, we weren't sure. We have a complete Python client library. Indeed, the original concept was to support our customer's building their own web site to use our web services. The Java package would plug into a J2EE web app. No one wanted that, so we built FLEX clients instead."

Then they said that the super-clean separation between all these clients and the RESTful server was taking "flexibility to a whole new level."

I pointed them to this. EWD340: The Humble Programmer. It has this killer quote:
The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague.
I find that this really helps keep the focus on simplicity. I suppose that it leads to flexibility, but that's not the real point. The real point was simplicity.

Thursday, January 6, 2011

Java PHP Python -- Which is "Faster In General"?

Sigh. What a difficult question. There are numerous incarnations on StackOverflow. All nearly unanswerable. The worst part is questions where they add the "in general" qualifier. Which is "faster in general" is essentially impossible to answer. And yet, the question persists.

There are three rules for figuring out which is faster.

And there are three significant problems that make these rules inescapable.

Rule One. Languages don't have speeds. Implementations have speeds.

Info on benchmarking. The idea of a benchmark is to have a single, standard suite of source code, which can be used to compare compilers, run-time libraries or hardware.

Having a standard suite of source is essential because it provides a basis for comparison. A single benchmark source is the fixed reference. We don't compare the top of the Empire State Building with the top of the Stratosphere in Las Vegas without specifying whether we care about height above the ground or height above sea level. There has to be some fixed point of comparison, some common attribute, or the measurements devolve into mere numerosity.

Once we have a basis for comparison (one single body of source code), the other attributes are degrees of freedom; the measurements we make will include the other attributes. This will allow a rational statement of what the experimental results where. We can then compare these various free attributes against each other. For details look at something like the Java Micro Benchmark.

Rule Two. Statistics Aren't a Panacea.

The reason there's no "in general" comparison among languages is because there are too many degrees of freedom to make any kind of rational comparison. We can make irrational comparisons, but that's the trap of numerosity -- throwing numbers around. 1250 vs. 1149, 1300 vs. 3177. What do they mean? Height above ground? Height above sea level? What's being measured?

There's a huge problem with claiming that statistics will yield an answer to which language implementation is faster "in general". We need some population that we can sample and measure. Problem 1: What the population are we measuring? It can't be "programs": we can't compare grep against Apache httpd. Those two programs have almost no common features.

What makes the population of programs difficult to define is the language differences. If we're trying to compare PHP, Python and Java, we need to find a program which somehow -- magically -- is common across all three languages.

The Basis For Comparison

Finding common programs degenerates into Problem 2: what programs could be comparable? For example, we have the Tomcat application, written in Java. We wouldn't want to write Tomcat in Python (since Tomcat is a Java Servlet container). We could probably write something Tomcat-like in PHP, but why try? So we can't just grab programs randomly.

At this point, we devolve to subjectivity. We need to find some kind of problem domain in which these languages overlap. This gets icky. Clearly, big servers aren't a good problem domain. Almost as clearly, command-line applications aren't the best idea. PHP does run from the command-line, but it's always contrived-looking because it doesn't exploit PHP's strengths. So we wind up looking at web applications because that's where PHP excels.

Web applications? Subjective? Correct. PHP is a language plus a web application framework bundled together. Java and Python -- by themselves -- are just languages and require a framework. Which Java (and Python) framework is identical to PHP's framework? Spring, Struts, Django, Pylons? None of these reflects a code base that's even remotely similar. Maybe Java JSP is similar enough to PHP. For Python there are several implementations. Sigh.

Crappy Program Problem

We can't easily compare programs because we're really comparing implementations of an algorithm. This leads to Problem 3: we picked a poor algorithm or did a lousy job of implementing it in the target language.

In order to be "comparable", we don't want to exploit highly-optimized or unique features of a language. So we tried to be generic. This is fraught with risks.

For example, Java and PHP don't have list comprehensions. Do we forbid them from our Python behchmark? In Python, everything is a reference, values cannot be copied. If we pick an algorithm implementation which depends on copying objects, Java may appear to excel. If we pick an algorithm implementation which depends on sharing references, Python may appear to excel.

Somehow we have to get past language differences and programmer mistakes. What to do?

Synthetic Benchmarks

Since we can't easily find comparable programs -- as whole programs -- we're left with the need to create some kind of benchmark based on language primitives. Statements or expressions or something. We can try to follow the Whetstone/Dhrystone approach of analyzing a bunch of programs to find the primitive constructs and their relative frequency.

Here's the plan. We'll take 100 PHP programs, 100 Java programs and 100 Python programs and analyze them to find the relative frequency of different kinds of statements. What then?

The goal is to create one source that reflects the statements actually used in the 300 programs we analyzed. In three different languages. Hmmm... Okay. We'll need to create a magical mapping among the statement constructs in the three languages. Well, that's hard. The languages aren't terribly comparable. A Python expression using a List Comprehension is the same thing a multi-statement Java loop. Rats.

The languages aren't very comparable at the statement level at all. And if we force them to be comparable, we're not comparing real programs, but an artificial mapping.

Virtual Machine Benchmarks

Since we can't compare the languages at the program level or the statement level, what's left? Clearly, the underlying interpreter is what we care about.

We're really comparing the Java Virtual Machine, the PHP interpreter and the Python interpreter. That's what we really care about.

And life is simple because we can compare Java, The Project Zero PHP Interpreter based on the JVM and Jython. We can look at "compiled" PHP, Java Class Files and Python .PYC files to find the VM primitives used by each language and then -- what? Compare the run-time of the various VM primitives? No, that's silly, since the run-times are all JVM run-times.

What We're Left With

The very best we can can do is to compare the statistical distribution of the VM instructions created by Java, PHP or Jython compilers. We could note that maybe PHP or Python uses too many "slow" VM instructions, where Java used more "fast" VM instructions. That would be an "in general" comparison. Right?

See? You can measure anything.

In this case, the compiler itself is a degree of freedom. Sadly, we're not comparing languages "in general". We're comparing the bytecodes created by various compilers. We're actually comparing compilers and compiler optimizations of the bytecode. Sigh.

That's not what we were hoping for. We were hoping for some kind of "in general" comparison of the language, not the JVM compiler.

Java has pretty sophisticated optimization. Python, however, eschews optimization. PHP has it's own weird issues. See this paper from Rob Nicholson from the CodeZero project on how to implement PHP in the JVM. PHP doesn't fit the JVM as well as Python does. So there's a weird bias.

Rule Three. Benchmarking Is Hard.

There is no "in general" comparison of programming languages. All that we can do is benchmark something specific.

It works like this.
  1. Stop quibbling about language performance "in general".
  2. Find something specific and concrete we plan to implement.
  3. Actually write the performance-critical piece in Java, PHP, Python, Ruby, whatever. Yes. Build it several times. Really. We don't want to use "language-independent" or "common" features. We want to optimize ruthlessly -- use the language the way it was meant to be used. -- use the various unique-to-the language features correctly and completely.
  4. Actually run the performance-critical piece to get actual timings.
  5. Since run-time libraries and hardware are degrees of freedom, we have to use multiple run-time libraries, multiple compiler optimization settings and multiple hardware configurations to make a proper decision on which language to use for our specific problem.
Now we know something about our specific problem domain and the available languages. That's the best we can do.

We can only compare a specific problem, with a specific algorithm. That's the basis for all benchmark comparisons. Since each implementation was well-done and properly optimized, the degree of freedom is the language -- and the run-time implementation of that language -- and the selected OS and hardware.

Search For Expertise

I'm looking for a unbiased Python expert to help with a book I'm working on. We need "an unbiased python expert with a keen eye for detail."

The role is technical reviewer. I've never done this before, but it appears that the tech reviewer is a paid position somewhere in a publication pipeline along with copy editing and other production steps. It doesn't sound like a full-time job, since the chapters trickle through the pipeline slowly.

I'm guessing that you'd get to correct the misstatements and run all the code examples.

Tuesday, January 4, 2011

Integration Testing, unittest and Python 2.7

Many folks use Python's unittest module for integration testing. It sometimes leads to whining and hand-wringing, but it is very effective.

Ordinary "unit" tests use mocks and focus on a class or a module more-or-less in isolation. The purists say "complete isolation". But that's sometimes unrealistic. A class that's part of a State design pattern or a Strategy design pattern is often so trivial that "pure" unit tests aren't really very enlightening. Using the stateful class along with it's state class hierarchy is usually far more interesting and helpful.

I argue that it's still "unit" testing because parts of the application have been extracted from their processing context.

OS and Module Mocks

Some folks will hand-wring over the use of OS API's or other library API's in a unit test. They feel that the OS API's should be completely mocked in order to call it a "unit" test. In most cases, simplistic mocks can be created, but for most use cases, the mock grows to hellish complexity.

For example, we have a large number of tests which depend on Python's csv module. While we could mock csv to produce row after row of mocked data, it seems much simpler to trust that our infrastructure works, use csv and simply provide files of test data in CSV format. The file is part of the test fixture and is locked up in the source code repository.

Is this "unit" testing when we're integrated with some of the underlying modules?

We don't separately test csv. We simply trust that it library modules have their own tests. If we're willing to trust the already-supplied tests for csv, why not use it in our tests? Why Mock something we've decided to trust?

[For that matter, we also trust Python, the OS and the unittest module itself. Why draw lines between Python, unittest and csv?]

ORM and RDBMS Mocks

A similar kind of worry comes with an ORM layer. Somehow, the "trust" factor starts to break down here. Folks want to mock the ORM layer. Or -- worse -- they want to try and mock the RDBMS layer so that they're testing the application and ORM layer. This is another artificial distinction between stuff we'll trust (the RDBMS layer) and stuff we feel we must write additional tests for (the ORM layer).

It's so much nicer to adopt the Django philosophy of building a test database as part of the test fixture. The ORM and RDBMS are integrated into the test itself. The thing that's "mocked" is the data which gets pre-loaded into the database.

Outside of Django, this can be a bit complex to set up properly. You need to create a temporary database, execute the SQL DDL, load the fixture data. This is an annoying bit of code to write the first time. However, it has handsome rewards because the "unit" testing includes the ORM and RDBMS at a higher level where it's easier to work with.

Integration Testing

We use the unittest module to do integration testing also. In Python 2.6, this involved a fair amount of work. We had to start our RESTful server, run the unit test, and then kill the server. Because of the way the subprocess module works, we can't be completely sure the server is running, so we simply cheat and use a time.sleep(12) to wait for the DB to be built and loaded.

Python 2.7 adds module-level test cases. It checks for setUpModule and tearDownModule functions. I've been gleefully revising all our unit tests to make ready for this. Our previous testing needed pervasive (but minor) rewrites to refactor the database setup and tear down and provide proper names.

Once that's in place, our big test shell script can be simplified down to a little Python unittest module. This module will build a unittest.Suite from all the other test modules (there are dozens) and simply execute the suite as a single, integrated whole.