
Tuesday, July 27, 2010

NoSQL -- Old Wine, New Bottle

Got an email with links about NoSQL. Links like "Going NoSQL with MongoDB". This -- like many such articles -- includes the phrase "the NoSQL movement" as if there's something new going on. Thank goodness Ted Neward includes quotes around "new". This isn't new. And doubly good, Neward doesn't use words like "excitement".

Some folks like to link to http://en.wikipedia.org/wiki/NoSQL. This is misleading, of course, since avoiding SQL isn't new or even that interesting. If you're going to treat avoiding SQL specially, then you should have NoProceduralProgramming, NoFunctionalProgramming, NoAssembler, NoShellScript and NoHTML movements, also.

Why stop there? Why not have a NoDumbAssArchitecture movement, too?

If you want to see dumb, breathless stuff, however, use Google and search for "nosql excitement". You'd think that the file system was new technology. In particular, posts like "NOSQL Movement - Excited with the coexistence of Divergent Thoughts" seem silly.

Unless -- I guess -- you've been solving all data management problems with a relational database. I guess when you discover that you don't have to use the hammer, then it's exciting to see that everything isn't simply a nail, either.

If avoiding the hegemony of SQL seems important, or even interesting, perhaps you've been living in a cave. Seriously. The file system has always been there and has always worked nicely for lots of problems. My 2002-era Ralph Kimball Data Warehouse Toolkit books are very clear that large, high-volume data warehouses are mostly flat files. Data marts are SQL databases suitable for ad-hoc SQL queries. But the RDBMS isn't always the best place for large volumes of data.

Bottom Line

NoSQL isn't new or even very interesting.

Consequences

If you're an architect, but you're not looking at alternatives to the RDBMS -- and running benchmarks to compare the choices -- you're not really doing architectural work. You're probably a glorified programmer and should consider working in a place that doesn't stifle you by imposing a "one world -- one architecture" viewpoint.

If you're a manager and think that "everything in SQL" is a risk-reducer, you need to actually talk to your people. If you think that your people's skills are limited to SQL, you're doing your team (and your customers) a disservice. Consider a skill upgrade of your own. Your team can learn other non-RDBMS technologies. Perhaps you should stop stifling them.

If you're a DBA and you know -- for a fact -- that the relational database is perfect and complete, you should perhaps pause a moment and consider the things relational databases don't do well. Graph-theory problems and hierarchies require fairly complex workarounds. Even a many-to-many relationship requires an extra association table. Perhaps those things are the signs of force-fitting data into the RDBMS model.
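
Here's a minimal sketch of those two workarounds, using Python's built-in sqlite3 module. The table and column names (author, book, book_author, category) are made up for illustration. It shows the association table that exists only to carry a many-to-many link, and the recursive query needed to walk a hierarchy, next to a nested Python structure that holds the same hierarchy directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    -- The association table exists only to carry the many-to-many link.
    CREATE TABLE book_author (
        book_id INTEGER REFERENCES book(id),
        author_id INTEGER REFERENCES author(id)
    );
    -- A hierarchy stored as parent pointers.
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
""")
conn.executemany("INSERT INTO author VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO book VALUES (?, ?)", [(1, "Files"), (2, "Tables")])
conn.executemany("INSERT INTO book_author VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [(1, "root", None), (2, "data", 1), (3, "graphs", 2)],
)

# Two joins just to ask "who wrote this book?"
authors_of_files = [
    row[0]
    for row in conn.execute(
        """SELECT a.name FROM author a
           JOIN book_author ba ON ba.author_id = a.id
           JOIN book b ON b.id = ba.book_id
           WHERE b.title = ? ORDER BY a.name""",
        ("Files",),
    )
]

# Walking the hierarchy needs a recursive common table expression.
descendants = [
    row[0]
    for row in conn.execute(
        """WITH RECURSIVE sub(id, name) AS (
               SELECT id, name FROM category WHERE name = 'root'
               UNION ALL
               SELECT c.id, c.name FROM category c JOIN sub ON c.parent_id = sub.id
           )
           SELECT name FROM sub WHERE name != 'root' ORDER BY id"""
    )
]

# The same hierarchy as a nested Python structure needs no workaround.
tree = {"root": {"data": {"graphs": {}}}}

print(authors_of_files)
print(descendants)
```

None of this says the relational version is wrong. It's only meant to show how much machinery surrounds a fairly small idea.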

Thursday, July 22, 2010

Scrum Made Difficult

Here's a great post called "Toward a Catalog of Scrum Smells". This lists some "Management Smells": specifically doing clumsy, ineffective things and calling it "Scrum".

I found this in a StackOverflow question titled "Any stories where trying to apply Scrum went wrong?"


What's interesting to me is that (1) Scrum works -- or you'd have more horror stories -- and (2) people do it wrong all the time.

Most of the Scrum-Done-Badly smells amount to management-as-usual. Rather than empower the developers, managers insist on long, stupid status meetings the purpose of which is to inform management. Rather than trust the developers to get things done, managers insist on detailed plans of little value to the developers.

Monday, July 12, 2010

Complexity and Simplicity

Here's an interesting -- and common -- question.

"... any tools that I could use to create a web scraper that I could use to interact with a .aspx website?

I want to build a tool that will read an input file (e.g. an excel spreadsheet) containing a list of property parcel numbers, and for each parcel number:
- connect to the property appraiser's website (which happens to be the .aspx page),
- enter the parcel number,
- scrape selected data (which is contained in a table on the search results page)
- store the scraped data in an output file (e.g. in the excel file that contains the input list)
- repeat the process for each parcel number"

The follow-on is interesting, also.

"I've created an excel macro which does the above with 'simple / plain vanilla html pages using the WebQuery feature, but it can't interact with an .aspx page."

Let's consider some of the complexities and simplicities that are present here.

Solution-Speak

First, and most important, this is written in solution-speak. It's an IT habit, and it's very hard to break. The input appears to be a spreadsheet. It may not actually be a spreadsheet, but this description essentially forces the solution to be built around the spreadsheet. The source may be another web page or some other file format. Since the problem is written in solution-speak, we don't know and can't -- easily -- explore the alternatives.

Let's assume that the source actually is a spreadsheet. And that this is the real source; it's maintained manually by the person who really "owns" the data.

The "update-in-place" nature of the question ("e.g. in the excel file that contains the input list") constrains the solution. This tends to add complexity because it somehow seems simpler to update a file in place.

What's actually simpler is often a process that creates a next revision of the file, leaving the first one intact and read-only. It's actually simpler because the "revert" strategy -- in case of problems -- is trivial. Simply delete the new file, fix the data (or the software) and run things again. Backup and history are simpler when creating a new file, also.
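
As a rough sketch of the "next revision" idea, here's one way it might look in Python. The function names and the ".r1", ".r2" naming scheme are assumptions made for this example, not anything from the original question. The source file is only ever read; each run writes a fresh file beside it.

```python
import csv
from pathlib import Path


def next_revision_path(source: Path) -> Path:
    """Return the first unused name like parcels.r1.csv, parcels.r2.csv, ..."""
    n = 1
    while True:
        candidate = source.with_name(f"{source.stem}.r{n}{source.suffix}")
        if not candidate.exists():
            return candidate
        n += 1


def write_next_revision(source: Path, transform) -> Path:
    """Read source, apply transform to each row, write a new file.

    The source file is opened for reading only and is never modified.
    """
    target = next_revision_path(source)
    with source.open(newline="") as src, target.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = None
        for row in reader:
            new_row = transform(row)
            if writer is None:
                # Take the output columns from the first transformed row.
                writer = csv.DictWriter(dst, fieldnames=list(new_row))
                writer.writeheader()
            writer.writerow(new_row)
    return target
```

With this shape, "revert" is just deleting the file that was returned. The original never changed, so there's nothing to restore.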

Technology Choices

Since it's written in solution-speak, many technology choices have been made that might be inappropriate.

First, it appears that Excel is the "database" of choice. This is a terrible thing, but very, very common. The person has a problem. They tried to solve it with a spreadsheet. Now they have two problems.

A spreadsheet has a great GUI, but -- sadly -- leads to weird, inconsistent, undisciplined and generally "out-of-control" data. It doesn't have to create a mess, but it's hard to constrain it to prevent creating a mess.

Alternatives

This problem is ubiquitous and -- often -- trivial to solve if we cut Excel out of the picture.

Consider this workflow.
  1. A small Python program uses xlrd to read this "list of property parcel numbers" and creates a simple CSV file. Excel is now officially out of the picture. If this process can't run (because the spreadsheet got tweaked) we can produce elegant reports with row and column information so that the person creating the spreadsheet can fix their problem. Let's say this is 20 lines of code, assuming the spreadsheet is hellishly complex.
  2. Some small Python programs read the CSV file and use urllib2 to "connect to the property appraiser's website (which happens to be the .aspx page), enter the parcel number", do the POST and retrieve the resulting page. This can be written to a file for future reference purposes. Numerous problems will be encountered here every time an appraiser's web site changes. It's best to keep these separate, since there may be several, each unique to an appraiser. There's no reason to generalize. Each of these is under 20 lines of code. Often under a dozen.
  3. Some small Python programs read the resulting pages and use Beautiful Soup to parse the resulting HTML. Again, numerous problems will be encountered here every time an appraiser's web site changes. It's pleasant to keep this separate from posting the query since this is just parsing result pages and doing nothing more. Easy to tweak and fix to keep up with changes. However, because of the potential complexity of each page, these might be complex. Let's pretend they're 20+ lines of code.
  4. A small Python program merges the original "list of property parcel numbers" and parsed results into a new .CSV file. With a double-click, this will be loaded back into Excel to make it look like the file was updated more-or-less in place. This should be about a dozen lines of code.
Since each step is separate, each can be written, tested and debugged separately. Once they work, some kind of master script can sequence through all four steps. That master script should be under a dozen lines of code.
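
Steps 3 and 4 above might be sketched like this. To keep the example runnable with nothing installed, it uses the standard library's html.parser as a stand-in for Beautiful Soup. The page layout (a two-column table of label and value), the sample parcel numbers and the function names are all assumptions invented for illustration; a real appraiser's page will differ.

```python
import csv
import io
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def parse_result_page(html):
    """Step 3: turn a saved result page into a {label: value} dict."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) >= 2}


def merge(parcels, results, out):
    """Step 4: write the parcel list plus scraped columns as new CSV."""
    columns = sorted({key for data in results.values() for key in data})
    writer = csv.writer(out)
    writer.writerow(["parcel"] + columns)
    for parcel in parcels:
        data = results.get(parcel, {})
        writer.writerow([parcel] + [data.get(c, "") for c in columns])


# Hypothetical saved page for one parcel; the second parcel has no result yet.
page = (
    "<table><tr><td>Owner</td><td>J. Smith</td></tr>"
    "<tr><td>Value</td><td>125000</td></tr></table>"
)
results = {"12-345": parse_result_page(page)}
out = io.StringIO()
merge(["12-345", "67-890"], results, out)
print(out.getvalue())
```

Parsing and merging stay in separate functions for the same reason the steps stay in separate scripts: when a site changes its page, only the parser needs attention.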

Design Patterns

One important design pattern is to get out of "Office Product" mode as early as possible. Office Products (like Excel) are fine for people, but dreadful for automation. They're too complex.

Another important design pattern is to decompose the problem into small scripts that can be run independently. Each step creates a work result that can be viewed and used for debugging. The files aren't big and can be deleted when the final work product is created. But an overly automated system is very, very hard to debug.
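
A master script along these lines might look like the sketch below. It's an illustration under assumptions: the steps are plain functions rather than separate scripts, and the two placeholder steps (to_upper, add_line_numbers) are invented stand-ins for the real read, POST, parse and merge steps. The point is that every step leaves a numbered file behind.

```python
from pathlib import Path


def run_steps(steps, source, work_dir):
    """Run each (name, function) step in order.

    Every step reads the previous step's file and writes its own, so a
    failure leaves the earlier work results on disk for inspection.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    current = source
    for number, (name, step) in enumerate(steps, start=1):
        target = work_dir / f"{number:02d}_{name}.txt"
        step(current, target)
        current = target
    return current


# Placeholder steps, standing in for the real ones.
def to_upper(src, dst):
    dst.write_text(src.read_text().upper())


def add_line_numbers(src, dst):
    lines = src.read_text().splitlines()
    dst.write_text("\n".join(f"{i}: {line}" for i, line in enumerate(lines, 1)))
```

If step two blows up, the file from step one is still sitting in the work directory, and step two can be rerun against it alone.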

Another design pattern is to separate the various web services requests (in this case a form POST) by destination web site. Each site has unique security and validation considerations. It's too complex to write a super-universal, uber-form-filler-outer. It's easier to write a bunch of specific RESTful web services requests that are tailored to the unique problems present in each site.
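
Here's a small sketch of what "one request builder per site" could look like. It uses urllib.request, the Python 3 successor to urllib2. The site names, URLs and form field names are hypothetical. The second builder assumes a site that wants a hidden field (ASP.NET pages commonly carry one named __VIEWSTATE) echoed back with the POST; whether a given appraiser's page needs that is something to check against the real form.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical sites, URLs and form field names, invented for illustration.


def county_a_request(parcel):
    """County A: a plain form with a single field."""
    body = urlencode({"parcel": parcel}).encode("ascii")
    return Request("http://county-a.example/search.aspx", data=body)


def county_b_request(parcel, viewstate):
    """County B: assumed to need a hidden field copied from the search page."""
    body = urlencode({"txtParcel": parcel, "__VIEWSTATE": viewstate}).encode("ascii")
    return Request("http://county-b.example/lookup.aspx", data=body)


# Building a request sends nothing; urllib.request.urlopen(request) would.
request = county_a_request("12-345")
print(request.get_method(), request.full_url, request.data)
```

Two short functions with nothing shared between them are easier to fix than one clever function with a flag for every site's quirks.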

Finally, it's important to avoid "update in place". It's hard to do well, and it's a pain in the neck when something goes wrong and you want to fall back to the previous version of the database.

Thursday, July 1, 2010

Finding Simplicity

In Creating Complexity Where None Existed, I noted that it's possible to create complexity out of thin air.

Indeed, by wallowing in the supposed drama, one can turn the differences between sales and service delivery into a hopelessly complex situation. A focus on a manufactured "conflict" leads to the following question: "What are the standard techniques for conflict resolution?"

Standard Techniques. Conflict Resolution.

First, there's no "conflict". Sales offers something. The customer may or may not understand that offer. The customer commits to something. Sales may or may not understand what the customer thinks they're buying. And delivery has to fill in these gaps between what sales offered and what the customer thought they were buying.

It's very simple. I called the lawn service, asking someone to mow my lawn. But they didn't trim my hedge. No one asked me if I had a hedge. I didn't have one when I called. I had it put in after I placed the order for the services. Is this "conflict"? Does it require "resolution"? Or, does it require that the folks selling the services on the phone and the folks delivering the service have some smarter way of coping with the inevitable differences of understanding?

Reality

The trick to avoiding 482 words of drama (using code names!) to describe sales and delivery is simple. Get Out Of Fantasy Land.

In the real world, sales has one view of the order, and delivery has a different view.

This is not news. Accountants cope with this variability all the time. They ask people to create a "budget" or a "plan". And then they measure actual expenditures against the budget. Budgets change. Actuals don't match the budget. This is not "conflict". There's nothing to "resolve". There's planned and actual and they're different.

The real sales promise, memorialized in an "order" is one thing. This order can change, of course, making things complex.

The delivery on that promise, memorialized in an "invoice", is another thing. The delivery may be done in "scramble" mode; folks struggling to balance ability to deliver against promises made. Or, the delivery may be done at a more leisurely pace; the work being fit into a schedule as time and resources permit.

Ideally -- of course -- order and invoice match. In reality, they don't always match.

Before the invoice is sent, someone needs to be sure it matches reality; someone has to affirm that the work was actually done. It may not be what the customer ordered, in which case there will be issues to resolve.

But there is no "conflict". There's no "drama". We don't need to assign code names ("Flintstones", "Rubbles") to sales and delivery.

Above all, we don't need to impose a weird legacy software world-view on something as simple as order, delivery and invoice. Sales has orders. Delivery has invoices. Hopefully, sales order changes get to delivery in time to adjust what's really happening. Someone has to look at the mismatches and exceptions to determine what the consequences are. Maybe the customer gets a credit. Maybe someone in sales is over-promising. Maybe someone in delivery is under-delivering.
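
Looking at the mismatches can be as plain as the sketch below. The order ids and amounts are invented sample data, and keying both sides by a shared order id is an assumption. It just lists every place where the order and the invoice disagree, which is the list someone then has to read.

```python
def reconcile(orders, invoices):
    """Compare ordered and invoiced amounts per order id.

    Returns (order_id, ordered, invoiced) for every mismatch, including
    orders with no invoice and invoices with no order.
    """
    exceptions = []
    for order_id in sorted(set(orders) | set(invoices)):
        ordered = orders.get(order_id)
        invoiced = invoices.get(order_id)
        if ordered != invoiced:
            exceptions.append((order_id, ordered, invoiced))
    return exceptions


# Invented sample data.
orders = {"A1": 100, "A2": 250, "A3": 75}
invoices = {"A1": 100, "A2": 200, "A4": 40}
for row in reconcile(orders, invoices):
    print(row)
```

That's planned versus actual. No code names required.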