Tuesday, February 16, 2016

SQL Hegemony and Document Databases

A surpassingly strange question is this: "How do I get the data out of MongoDB into a spreadsheet?"

The variation is "How can we load the MongoDB data into a relational database?"

I'm always perplexed by this question. It has a subtext that I find baffling. The subtext is this "all databases are relational, right?"

In order to ask the question, one has to be laboring under the assumption that the only difference between MongoDB and a relational database is the clever sticker on your laptop. Mongo folks have a little green Mango leaf. Postgres has a blue/gray elephant.

This assumption is remarkably hard to overcome.

THEM: "How can we move this mongo data into a spreadsheet?"
ME: "What?"
THEM: "You know. Get a bulk CSV extract."
ME: "Of complex, nested documents?"
THEM: "Nested documents?"
ME: "Mongo database documents include arrays and -- well -- subdocuments. They're not in first normal form. They don't fit the spreadsheet data model."
THEM: "Whatever. Every database has a bulk unload into CSV. How do you do that in Mongo?"
ME: "You can't represent a mongo document in rows and columns."
THEM: (Thumping desk for emphasis.) "Relational Theory is explicit. ALL DATA CAN BE REDUCED TO ROWS AND COLUMNS!"
ME: "Right. Through a process of normalization. The Mongo data you're looking at isn't normalized. You'd have to normalize it into a relational table model. Then you could write a customized extract focused on that relational model."
THEM: "That's absurd."

At this point, all we can do is give them the minimal pymongo MongoClient code block. Hands-on queries seem to be the only way to make progress.

from pymongo import MongoClient
from pprint import pprint
with MongoClient("mongodb://somehost:27017") as mongo:
    collection = mongo.database.collection
    for document in collection.find():
         pprint(document)

Explanations seem to wind up in a weird circular pattern where they keep repeating their relational assumptions. Not much seems to work: diagrams, hand-waving, links to tutorials are all implicitly rejected because they don't confirm SQL bias.

A few days later they call asking how they are supposed to work with a document that has complex nested fields inside it.

This could be the beginning of wisdom. Or it could be the beginning of a lengthy reiteration of SQL Hegemony talking points and desk thumping.

THEM: "The document has an array of values."
ME: "Correct."
THEM: "What's that mean?"
ME: "It means there are multiple occurrences of the child object within each parent object."
THEM: "I can see that. What does it mean?"
ME: (Rising inflection.) "The parent is associated with multiple instances of the child."
THEM: "Don't patronize me! Stop using mongo mumbo-jumbo. Just a simple explanation is all I want."
ME: "One Parent. Many Children."
THEM: "That's stupid. One-to-many absolutely requires a foreign key. The children don't even have keys. Mongo must have hidden keys somewhere. How can I see the keys on the children in this so-called 'array' structure? How can expose the underlying implementation?"

The best I can do is show them an approach to normalizing some of the data in their collection.

from pymongo import MongoClient
from pprint import pprint
with MongoClient("mongodb://your_host:27017") as mongo:
    collection = mongo.your_database.your_collection
    for document in collection.find():
         for child in parent['child_array']:
              print( document['parent_field'], child['child_field'] )

This leads to endless confusion when some documents lack a particular field. The Python document.get('field') is an elegant way to handle optional fields. I like to warn them that they should not rely on this. Sometimes document['field'] is appropriate because the field really is mandatory. If it's missing, there are serious problems. Of course, the simple get() method doesn't work for optional nested documents. For this, we need document.get('field', {}). And for optional arrays, we can use document.get('field', []).

Interestingly we sometimes have confusion over {} for document and [] for array. I chalk that up to folks who are too used to very wordy SQL and Java. I save the questions for my next book on Python.

At some point, the "optional" items may be more significant than this. Perhaps an if statement is required to handle business rules that are reflected as different document structures in a single collection.

This leads to yet more desk-thumping. It's accompanied with the laughable claim that a "real" database doesn't rely on if statements to distinguish variant subentities that are persisted in a single table. The presence of SQL ifnull() functions, case expressions, and application code with if statements apparently doesn't exist. Or -- when it is pointed out -- isn't the same thing as writing an if statement to handle variant document subentities in a Mongo database.

It appears to take about two weeks to successfully challenge entrenched relational assumptions. Even then, we have to go over some of the basics of optional fields and arrays more than once.