Bio and Publications

Tuesday, February 16, 2016

SQL Hegemony and Document Databases

A surpassingly strange question is this: "How do I get the data out of MongoDB into a spreadsheet?"

The variation is "How can we load the MongoDB data into a relational database?"

I'm always perplexed by this question. It has a subtext that I find baffling. The subtext is this "all databases are relational, right?"

In order to ask the question, one has to be laboring under the assumption that the only difference between MongoDB and a relational database is the clever sticker on your laptop. Mongo folks have a little green Mango leaf. Postgres has a blue/gray elephant.

This assumption is remarkably hard to overcome.

THEM: "How can we move this mongo data into a spreadsheet?"
ME: "What?"
THEM: "You know. Get a bulk CSV extract."
ME: "Of complex, nested documents?"
THEM: "Nested documents?"
ME: "Mongo database documents include arrays and -- well -- subdocuments. They're not in first normal form. They don't fit the spreadsheet data model."
THEM: "Whatever. Every database has a bulk unload into CSV. How do you do that in Mongo?"
ME: "You can't represent a mongo document in rows and columns."
THEM: (Thumping desk for emphasis.) "Relational Theory is explicit. ALL DATA CAN BE REDUCED TO ROWS AND COLUMNS!"
ME: "Right. Through a process of normalization. The Mongo data you're looking at isn't normalized. You'd have to normalize it into a relational table model. Then you could write a customized extract focused on that relational model."
THEM: "That's absurd."

At this point, all we can do is give them the minimal pymongo MongoClient code block. Hands-on queries seem to be the only way to make progress.

from pymongo import MongoClient
from pprint import pprint
with MongoClient("mongodb://somehost:27017") as mongo:
    collection = mongo.database.collection
    for document in collection.find():
         pprint(document)

Explanations seem to wind up in a weird circular pattern where they keep repeating their relational assumptions. Not much seems to work: diagrams, hand-waving, links to tutorials are all implicitly rejected because they don't confirm SQL bias.

A few days later they call asking how they are supposed to work with a document that has complex nested fields inside it.

This could be the beginning of wisdom. Or it could be the beginning of a lengthy reiteration of SQL Hegemony talking points and desk thumping.

THEM: "The document has an array of values."
ME: "Correct."
THEM: "What's that mean?"
ME: "It means there are multiple occurrences of the child object within each parent object."
THEM: "I can see that. What does it mean?"
ME: (Rising inflection.) "The parent is associated with multiple instances of the child."
THEM: "Don't patronize me! Stop using mongo mumbo-jumbo. Just a simple explanation is all I want."
ME: "One Parent. Many Children."
THEM: "That's stupid. One-to-many absolutely requires a foreign key. The children don't even have keys. Mongo must have hidden keys somewhere. How can I see the keys on the children in this so-called 'array' structure? How can expose the underlying implementation?"

The best I can do is show them an approach to normalizing some of the data in their collection.

from pymongo import MongoClient
from pprint import pprint
with MongoClient("mongodb://your_host:27017") as mongo:
    collection = mongo.your_database.your_collection
    for document in collection.find():
         for child in parent['child_array']:
              print( document['parent_field'], child['child_field'] )

This leads to endless confusion when some documents lack a particular field. The Python document.get('field') is an elegant way to handle optional fields. I like to warn them that they should not rely on this. Sometimes document['field'] is appropriate because the field really is mandatory. If it's missing, there are serious problems. Of course, the simple get() method doesn't work for optional nested documents. For this, we need document.get('field', {}). And for optional arrays, we can use document.get('field', []).

Interestingly we sometimes have confusion over {} for document and [] for array. I chalk that up to folks who are too used to very wordy SQL and Java. I save the questions for my next book on Python.

At some point, the "optional" items may be more significant than this. Perhaps an if statement is required to handle business rules that are reflected as different document structures in a single collection.

This leads to yet more desk-thumping. It's accompanied with the laughable claim that a "real" database doesn't rely on if statements to distinguish variant subentities that are persisted in a single table. The presence of SQL ifnull() functions, case expressions, and application code with if statements apparently doesn't exist. Or -- when it is pointed out -- isn't the same thing as writing an if statement to handle variant document subentities in a Mongo database.

It appears to take about two weeks to successfully challenge entrenched relational assumptions. Even then, we have to go over some of the basics of optional fields and arrays more than once.

Tuesday, February 9, 2016

The Spike Solution

I've had a lingering question about the "spike solution" phrase. What's the etymology of this?

For a long time, I thought of spike as in "nailing down some details."

Today, I found this: http://agiledictionary.com/209/spike/

I was wrong. I'm happy to see this explanation.

The phrase appears to come from spike (or piton) used in technical rock climbing. Strictly speaking, from "traditional" climbing with removable gear.

Since I started rock climbing about a year ago, I now have a deeper sense of what this "spike" thing really means.

Setting a spike makes safe progress possible. Once the spike is set, the climber has defined where the route is. A carabiner can be attached and the belay line placed there.

For people like me -- old and fat -- rock climbing is an indoor activity using a fixed "top-rope". The climbers call this "sport" climbing. There are tiers to this activity:

  • Sport climbing is the most accessible level of climbing. Indoor sport climbing involves plastic holds bolted to a concrete wall. It's pretty easy. It's as much a problem-solving exercise as it is a physical exertion. 
  • Outdoor sport climbing involves fixed bolts. I've seen this from the ground. It looks like it might be fun. 
  • Traditional climbing involves removable gear like spikes, pitons, and cams. I've seen people training for this in indoor rock gyms. I've even belayed for folks doing a kind of hybrid training ascending a crack on top rope. This means no plastic holds -- just a crack in the concrete wall.  I'm not sure I'll ever be able to do this.
  • Alpine climbing, different from trail-head climbing. This is the kind of thing that may involve high altitudes, varied terrain, and long distances.

I don't have any direct experience with spikes, nor even placing carabiners on fixed bolts.  I've seen some of this, though. I aspire to being able to do indoor "lead climbing". This is a common sport technique where the belay line is clipped to fixed bolts as the climber ascends. Someone who does lead climbing isn't relying on a previously set top rope.
Lead Climbing in Red Rock Canyon


For more information you'll need to look for the phrase "trad gear rack".

http://mojagear.com/learn/2014/05/16/building-a-trad-rack/

http://www.alpineinstitute.com/articles/expert-tips/how-to-select-an-alpine-rack/

As with software, there's a lot of jargon in rock climbing.  "Pro" -- for example -- means protection. It's what you place to keep from getting hurt. A spike or a cam is a piece of pro.

I'm much happier using the term "spike solution" now that I see that my wintertime hobby fits with my day job.

And. Just to be complete, I need to point out that sailors use spikes all the time. For a sailor a spike is a tool used to untie (and tie) knots in line. See The Marlinspike Sailor for examples of what sailors do with line.

The problem with trying to find a sailing-related etymology is that (1) the term doesn't seem to be used, and (2) a sailor's "spike solution" would be a relatively permanently rigged arrangement of line. Sailors use the term "jury-rigged" to describe something temporary or experimental. The best part of marine terminology is that sailors speak their own language. Often there is no clear etymology to any other language.

Software folks, however, are masters of borrowing existing terms. "Spike" being an example of borrowing from technical rock climbing.

Tuesday, February 2, 2016

Why I don't want to share your screen -- OR -- What I learned from stackoverflow

I know it sounds arrogant, but I don't want to share your screen to sort out a Python programming problem. I have two reasons and I think one of them is a good one.

It's both pedagogical and personal. 

Personally, I'm often left breathless by demos. Watching the cursor fly around the screen is -- well -- dizzying. What was I supposed to be watching? Who's IM messages are popping up? What meeting reminders are you ignoring?

It may seem helpful to wave the cursor around, and show me your whole desktop world. And for some people, the discussion may actually be helpful. Sometimes they have an epiphany while they're explaining stuff to me. That's good. For me, it's bewildering. Sorry. I'm only going to read the visible fragments of your emails in the background window.

From a pedagogical perspective, there's this point:

I think that it's very important to learn how to focus on the details that matter.

This breaks down into several related skills:

  1. I think everyone needs to be able to copy and paste text. Screenshot images are hard to work with. On Stack Overflow, a 4-space indent is mandatory. It's not hard. A surprising number of programmers struggle with it.
  2. Articulate the actual problem. "Doesn't work" really is not sensible. I think it's important to insist on a concrete statement of the problem. Asking me to deduce it while looking at your screen isn't building any of your skills. 
  3. Find the relevant portion of the Python traceback. Yes, that's hard. But it's part of coding. Asking me to read the traceback doesn't build your skills.
  4. Find the relevant portions of the code that's broken. Again, when I pinpoint the line of code from reading the traceback, your skills haven't grown. I'm well aware that it's confusing when there's a long traceback from a framework that only seems to include your module 6 levels in. If you aspire to mastering code, that has to be part of your aspiration.
  5. Hypothesize a root cause. This is perhaps the hardest skill. The confirmation bias problem leads many people to write wrong code and complain that it's "broken" in a vague way. During screen sharing they scroll past their assumptions as if they're always correct. I have sympathy. But, it's essential to understand the semantics of alanguage. More importantly, it's essential to learn to judge where our assumptions might deviate from reality. Overcoming confirmation bias is hard. Maybe a long conversation is the only way to realize this; I hope not.
  6. Experiment. Python offers the >>> prompt at which you can experiment. Use it. This is the best way to explore your assumptions and see what the actual language semantics are.
Maybe I'm just being hypersensitive, but there's little to really talk about. If we could focus on the relevant code, perhaps through copy-and-paste, I can help. Otherwise, I feel like I'm just watching helplessly while an amusement park ride spins me around for a while, leaving me dizzy and confused. And not having offered any concrete help.