Tuesday, January 19, 2016

SQL Hegemony -- the "Pivot Table" problem

As far as I can tell, the Pivot Table Problem™ only exists for people who have actively put on blinders so that they can only see data one way.

This leads to the following.

The context appears to be millions of rows of data. Hundreds of columns.  It appears that someone we'll call DesKtop tried to load a spreadsheet to "pivot" the data. And the spreadsheet -- of course -- breaks because it's too much data.

DesKtop then calls the DBA.

DesKtop: "We need to load a table and use the database to create a spread-sheet like pivot table."

DBA: "Okay. Cool. It's complex SQL, though. Check this out..." They look at the Oracle 11g PIVOT and UNPIVOT.

DesKtop: "Oh my. That's really complex. Okay, I guess that's the only choice, right?"

DBA: "Right, it is our only possible choice."


[I heard about this from DBA who sent me a "humble brag" about something that can only be done with hyper-complex SQL query. DBA had found a Python tutorial on Pandas that mentioned pivoting. The humble brag point was this: the Python stuff was just as hyper-complex as the SQL. Apparently, DBA conflated the entire tutorial with the one line of code that was the pivot example.]

If you restructure the data into (row key, column key, cell value) triples, you don't have a Pivot Table Problem™ any more. You have a SELECT reduction() GROUP BY row vs. SELECT reduction GROUP BY column kind of query. There's no "pivot". Maybe it's a conceptual pivot but there's no hyper-complex SQL.

It requires a non-trivial loader to transform data that's in row order and explode it into triples. This isn't the kind of thing a program like Oracle's SQL*Loader or other bulk loader does particularly well. In Python (without using Pandas) we can expand the data into triples like this:

for row in reader:
    for column in column_names:
        new_row = row['key'], column, row[column]

The idea here is that we're using something like a csv DictReader. We have a list of column names we'd like to pivot. In many row-oriented data sets, there are columns we might like to ignore. For example, the row key column itself shouldn't be exploded into a (row key, column key, row key) triple.

This restructuring idea applies in full force to doing Python-based reduction of the data. Forget loading a database in the first place.

by_column = defaultdict(list)
for row in reader:
    for column in column_names:

We've summarized each column's data into a list of values. This is the "GROUP BY" part of the SQL. Now we can do reductions on the values in each column-based list.

from statistics import mean
for column in by_column:
    print(column, sum(by_column[column]), mean(by_column[column]))

We've done sum and mean reductions on the values in each column. We can -- of course -- layer in mapping and filtering if that's required.

This works well for millions of individual cells of data. We can comfortably hold several hundred million individual values in memory in a 32Gb desktop computer. You may notice the fan kicks on when this is running.

If this turns out to require too much storage, then the reductions can be computed item-by-item rather than simply accumulating a list of values. This a hair more complex, but not in an interesting way.

sum_by_column = defaultdict(int)
count_by_column = defaultdict(int)

for row in reader:
    for column in column_names:
        sum_by_column[column] += row[column]
        count_by_column[column] += 1

Minima and Maxima are a trifle trickier. We don't want to initialize them to None and have an if current_min is None statement executed millions of times. We have to create an iterator and process the first row specially, using it to initialize all of the values. The remaining rows can then be processed free of any initialization question.

row_iter= iter(reader)
first = next(row_iter)
for column in column_names:
    min_by_column[column]= row[column]
    max_by_column[column]= row[column]
for row in row_iter:
    for column in column_names:
        min_by_column[column] = min( min_by_column[column], row[column])
        max_by_column[column] = max(max_by_column[column], row[column])

I like to call this the Head-Tail design pattern.

The DBA and DesKtop appear to be married to SQL. Even when it appears to be an ineffective solution to their problem.