Tuesday, February 9, 2016

The Spike Solution

I've had a lingering question about the "spike solution" phrase. What's the etymology of this?

For a long time, I thought of spike as in "nailing down some details."

Today, I found this: http://agiledictionary.com/209/spike/

I was wrong. I'm happy to see this explanation.

The phrase appears to come from spike (or piton) used in technical rock climbing. Strictly speaking, from "traditional" climbing with removable gear.

Since I started rock climbing about a year ago, I now have a deeper sense of what this "spike" thing really means.

Setting a spike makes safe progress possible. Once the spike is set, the climber has defined where the route is. A carabiner can be attached and the belay line placed there.

For people like me -- old and fat -- rock climbing is an indoor activity using a fixed "top-rope". The climbers call this "sport" climbing. There are tiers to this activity:

  • Sport climbing is the most accessible level of climbing. Indoor sport climbing involves plastic holds bolted to a concrete wall. It's pretty easy. It's as much a problem-solving exercise as it is a physical exertion. 
  • Outdoor sport climbing involves fixed bolts. I've seen this from the ground. It looks like it might be fun. 
  • Traditional climbing involves removable gear like spikes, pitons, and cams. I've seen people training for this in indoor rock gyms. I've even belayed for folks doing a kind of hybrid training ascending a crack on top rope. This means no plastic holds -- just a crack in the concrete wall.  I'm not sure I'll ever be able to do this.
  • Alpine climbing, different from trail-head climbing. This is the kind of thing that may involve high altitudes, varied terrain, and long distances.

I don't have any direct experience with spikes, nor even placing carabiners on fixed bolts.  I've seen some of this, though. I aspire to being able to do indoor "lead climbing". This is a common sport technique where the belay line is clipped to fixed bolts as the climber ascends. Someone who does lead climbing isn't relying on a previously set top rope.
Lead Climbing in Red Rock Canyon

For more information you'll need to look for the phrase "trad gear rack".



As with software, there's a lot of jargon in rock climbing.  "Pro" -- for example -- means protection. It's what you place to keep from getting hurt. A spike or a cam is a piece of pro.

I'm much happier using the term "spike solution" now that I see that my wintertime hobby fits with my day job.

And. Just to be complete, I need to point out that sailors use spikes all the time. For a sailor a spike is a tool used to untie (and tie) knots in line. See The Marlinspike Sailor for examples of what sailors do with line.

The problem with trying to find a sailing-related etymology is that (1) the term doesn't seem to be used, and (2) a sailor's "spike solution" would be a relatively permanently rigged arrangement of line. Sailors use the term "jury-rigged" to describe something temporary or experimental. The best part of marine terminology is that sailors speak their own language. Often there is no clear etymology to any other language.

Software folks, however, are masters of borrowing existing terms. "Spike" being an example of borrowing from technical rock climbing.

Tuesday, February 2, 2016

Why I don't want to share your screen -- OR -- What I learned from stackoverflow

I know it sounds arrogant, but I don't want to share your screen to sort out a Python programming problem. I have two reasons and I think one of them is a good one.

It's both pedagogical and personal. 

Personally, I'm often left breathless by demos. Watching the cursor fly around the screen is -- well -- dizzying. What was I supposed to be watching? Who's IM messages are popping up? What meeting reminders are you ignoring?

It may seem helpful to wave the cursor around, and show me your whole desktop world. And for some people, the discussion may actually be helpful. Sometimes they have an epiphany while they're explaining stuff to me. That's good. For me, it's bewildering. Sorry. I'm only going to read the visible fragments of your emails in the background window.

From a pedagogical perspective, there's this point:

I think that it's very important to learn how to focus on the details that matter.

This breaks down into several related skills:

  1. I think everyone needs to be able to copy and paste text. Screenshot images are hard to work with. On Stack Overflow, a 4-space indent is mandatory. It's not hard. A surprising number of programmers struggle with it.
  2. Articulate the actual problem. "Doesn't work" really is not sensible. I think it's important to insist on a concrete statement of the problem. Asking me to deduce it while looking at your screen isn't building any of your skills. 
  3. Find the relevant portion of the Python traceback. Yes, that's hard. But it's part of coding. Asking me to read the traceback doesn't build your skills.
  4. Find the relevant portions of the code that's broken. Again, when I pinpoint the line of code from reading the traceback, your skills haven't grown. I'm well aware that it's confusing when there's a long traceback from a framework that only seems to include your module 6 levels in. If you aspire to mastering code, that has to be part of your aspiration.
  5. Hypothesize a root cause. This is perhaps the hardest skill. The confirmation bias problem leads many people to write wrong code and complain that it's "broken" in a vague way. During screen sharing they scroll past their assumptions as if they're always correct. I have sympathy. But, it's essential to understand the semantics of alanguage. More importantly, it's essential to learn to judge where our assumptions might deviate from reality. Overcoming confirmation bias is hard. Maybe a long conversation is the only way to realize this; I hope not.
  6. Experiment. Python offers the >>> prompt at which you can experiment. Use it. This is the best way to explore your assumptions and see what the actual language semantics are.
Maybe I'm just being hypersensitive, but there's little to really talk about. If we could focus on the relevant code, perhaps through copy-and-paste, I can help. Otherwise, I feel like I'm just watching helplessly while an amusement park ride spins me around for a while, leaving me dizzy and confused. And not having offered any concrete help.

Tuesday, January 19, 2016

SQL Hegemony -- the "Pivot Table" problem

As far as I can tell, the Pivot Table Problem™ only exists for people who have actively put on blinders so that they can only see data one way.

This leads to the following.

The context appears to be millions of rows of data. Hundreds of columns.  It appears that someone we'll call DesKtop tried to load a spreadsheet to "pivot" the data. And the spreadsheet -- of course -- breaks because it's too much data.

DesKtop then calls the DBA.

DesKtop: "We need to load a table and use the database to create a spread-sheet like pivot table."

DBA: "Okay. Cool. It's complex SQL, though. Check this out..." They look at the Oracle 11g PIVOT and UNPIVOT.

DesKtop: "Oh my. That's really complex. Okay, I guess that's the only choice, right?"

DBA: "Right, it is our only possible choice."


[I heard about this from DBA who sent me a "humble brag" about something that can only be done with hyper-complex SQL query. DBA had found a Python tutorial on Pandas that mentioned pivoting. The humble brag point was this: the Python stuff was just as hyper-complex as the SQL. Apparently, DBA conflated the entire tutorial with the one line of code that was the pivot example.]

If you restructure the data into (row key, column key, cell value) triples, you don't have a Pivot Table Problem™ any more. You have a SELECT reduction() GROUP BY row vs. SELECT reduction GROUP BY column kind of query. There's no "pivot". Maybe it's a conceptual pivot but there's no hyper-complex SQL.

It requires a non-trivial loader to transform data that's in row order and explode it into triples. This isn't the kind of thing a program like Oracle's SQL*Loader or other bulk loader does particularly well. In Python (without using Pandas) we can expand the data into triples like this:

for row in reader:
    for column in column_names:
        new_row = row['key'], column, row[column]

The idea here is that we're using something like a csv DictReader. We have a list of column names we'd like to pivot. In many row-oriented data sets, there are columns we might like to ignore. For example, the row key column itself shouldn't be exploded into a (row key, column key, row key) triple.

This restructuring idea applies in full force to doing Python-based reduction of the data. Forget loading a database in the first place.

by_column = defaultdict(list)
for row in reader:
    for column in column_names:

We've summarized each column's data into a list of values. This is the "GROUP BY" part of the SQL. Now we can do reductions on the values in each column-based list.

from statistics import mean
for column in by_column:
    print(column, sum(by_column[column]), mean(by_column[column]))

We've done sum and mean reductions on the values in each column. We can -- of course -- layer in mapping and filtering if that's required.

This works well for millions of individual cells of data. We can comfortably hold several hundred million individual values in memory in a 32Gb desktop computer. You may notice the fan kicks on when this is running.

If this turns out to require too much storage, then the reductions can be computed item-by-item rather than simply accumulating a list of values. This a hair more complex, but not in an interesting way.

sum_by_column = defaultdict(int)
count_by_column = defaultdict(int)

for row in reader:
    for column in column_names:
        sum_by_column[column] += row[column]
        count_by_column[column] += 1

Minima and Maxima are a trifle trickier. We don't want to initialize them to None and have an if current_min is None statement executed millions of times. We have to create an iterator and process the first row specially, using it to initialize all of the values. The remaining rows can then be processed free of any initialization question.

row_iter= iter(reader)
first = next(row_iter)
for column in column_names:
    min_by_column[column]= row[column]
    max_by_column[column]= row[column]
for row in row_iter:
    for column in column_names:
        min_by_column[column] = min( min_by_column[column], row[column])
        max_by_column[column] = max(max_by_column[column], row[column])

I like to call this the Head-Tail design pattern.

The DBA and DesKtop appear to be married to SQL. Even when it appears to be an ineffective solution to their problem.

Tuesday, January 12, 2016

"Learn Python" is growing


I've seen companies making sincere enterprise-wide commitments to doing all data analysis in Python. There's no reason for quants and analysts to struggle with Java.

I field one or two questions a week from folks pushing the in-house envelope on data acquisition.

I also field questions on language basics.

It's a very exciting thing.

I'm glad I started down this road 15 years ago.

Tuesday, December 29, 2015

SQL Hegemony -- a sad state of affairs

It appears that there are people who don't recognize SQL as a tradeoff.

Here's a complex two-part question that can only come from folks who firmly believe in the magic of SQL.
The sentence that got my attention was "Python has basically made SQL obsolete as a language for data structure manipulation". My question would be about scaling.  If [we? you?] have 30 million rows in a table, would Python still be better than straight up SQL? The other question would be about the amount of time to come up to speed. It just seems easier to learn SQL than Python.
Also, in working with legacy DBA's who are starting to learn Cassandra, I see similar magical thinking. Somehow, Oracle's behavior can become a baseline in some people's minds. When Cassandra's column database shows different behavior, there are DBA's who are surprisingly quick to portray Cassandra as "wrong" or "confusing." Worse, they'll waste a lot of time insisting that Cassandra is misusing the term "key" because Cassandra's idempotency policy means multiple INSERTS with the same primary key are handled differently from Oracle. Labeling Cassandra as "wrong" is a similar problem to the question.

Let's unpack the "SQL is better" question and see why this seems so sad.

I'm not going to address the quote ("Python has basically made SQL obsolete...") since that wasn't part of the question. That's just background. And everyone seems to agree on this. The question appears to be related to clinging to SQL in spite of Python's advantages.

But first, I have to note that the question violates some pretty serious rules of engagement.

The Rules for Questions

Asking hand-waving hypotheticals is generally a pretty bad practice. Sometimes, I'm completely intolerant, and refuse to engage. In this case, I felt compelled to respond, in spite if the vacuousity of the question. 

First, of course, "better" is undefined in the question. That essentially ends any conversation.

Second, there's no code. It's very hard to discuss anything without code. All the hand-waving is essentially meaningless because when code finally does show up, it will fit into some edge or corner not properly covered by hand-waving.

Third, there's no possibility of code. There's nothing resembling a tangible use case or scenario that can be turned into code for comparison purposes.

Also,  the question seems to be creating a false dichotomy between SQL and Python. This is a more subtle issue, and we'll look at this, too.

Python Better Than SQL

We can assign a number of potential meanings to "better". Some other phrases -- "30 million rows in a table" and "about scaling" -- could be dismissed as mere noise. Perhaps they're hints.

Let's assume it's about size of storage. Can Python deal with 30 million rows of data? Since we don't know the row size, there is no actual answer. Without transactions or activities of some kind, we're similarly bereft of the kinds of details that lead to a sensible answer.

Let's say we're limited to 32Gb of memory. If the row size is up to 1Kb, we can fit all of the data in memory. We're pretty much done with size and speed.  Python wins for the canonical CRUD operations.

Python wins because any code we write will be completely customized for the data we're given. We're freed from generalized SQL type conversion complexity, ODBC driver folderol, storage management overheads, SQL language parsing work. Just the data manipulation. No lock escalation or read consistency consideration. Done.

But wait. Not so fast, what about loading 32Gb into memory?

What about it? The problem is so delightfully vague that we have no clue what "loading" might mean. Oracle takes a while to mount a database and do other startup things. Python can open a file and slurp in the data pretty quickly. If you want to amortize the loading time, you can have smarter loader that brings in data incrementally.

def load(data, key_col):
    with data.open() as source:
        rdr = csv.reader(source)
        table = { row[key_col]: row for row in rdr }
    return table

def CRUD(table, key_col, update_col):
    row = tuple(random_text() for i in range(10))

    # INSERT INTO table(col,col,...) VALUES(val,val,...)
    table[row[key_col]]= row

    # SELECT * FROM TABLE WHERE key_col = value
    found = table[row[key_col]]
    #print( found )

    # UPDATE TABLE SET update_col = "value" WHERE key_col = value
    table[row[key_col]][update_col] = "special text"

    # DELETE FROM TABLE WHERE key_col = value
    del table[row[key_col]]

    # Is it gone?
    assert row[key_col] not in table

Rather than go for 30 million rows on this little laptop (with only 8Gb RAM), we'll load 30,000 rows each of which is about 150 characters. Small. The point, however, is this:

load 0.133, CRUD 0.176

We can load 30,000 rows of data in 133 ms.  We can do 1,000 sets of CRUD operations in 176 ms. The load time scales with total number of bytes, row size × number of rows. The CRUD operation time will barely move no matter how many rows or how big the rows are.

The problem with this kind of benchmark is that it plays to SQL's strengths. It makes SQL look like the benchmark. We're forced to show how some non-SQL language can also do what SQL does. And that's silly.

What About Bigger?

Let's pretend the number was supposed to be 30 billion rows of data. Something that clearly can't fit into memory. Wait. Traditional SQL databases struggle with this, too. Let's press on. 30 billion rows of data. Each row is at least 1K in size. 3Tb of storage. Can Python do this?

Recall that the question gives us no help in reasoning about "better".

What's the representation? 3Tb has got to be a implemented as collection of smaller files. All of the files must have a common format. Let's posit CSV. We don't really want all of this storage on a single server. We want to farm this out to several hosts. And we probably want to layer in some redundancy in case one of those hosts fails.

Okay. It might not be obvious, but we're describing the HDFS from Hadoop. We could -- without too much trouble -- implement an HDFS surrogate that has very limited functionality in Python. We can use SFTP to smear two copies of each file among a fixed-size farm of servers. Very hard-wired, unlike Hadoop.

Then the reading part of our imagined app will scroll through the collection of CSV-formatted files on each processor. We'd have to implement a Hadoop map-reduce in Python. Again. Not very difficult if we eliminate some features and stick to a very basic version map-reduce. We can coordinate the reductions by implementing a simple REST-based master-reducer that accepts the reductions from the other processors and does the final reduce.

Now we have a lot of Python language overheads. Have we failed at "better" because we polluted the solution with a fake Hadoop?


The SQL folks had to install, configure, and manage a SQL database that handled 3Tb of storage. The Python folks installed Python. Installed their fake Hadoop. Then they used a few clever abstractions to write delightfully simple map and reduce functions. Python still handles the extremely large amount of data faster than SQL. Also, it does this without some RDBMS features.

Which leads us to the second part of the question. Expressivity.

Easier to Learn

From the Question: "It just seems easier to learn SQL than Python".

This is pretty much meaningless noise. Less meaningful than the rest of the question. Having taught both, I'm confident in saying that SQL can be pretty confusing.


More importantly.

There's no rational basis for comparison.

SQL DML is a very tiny language with only a few concepts. It's not a Turing-complete programming language.

What's important is this:

We have to embed SQL in another language.

You can't actually DO anything in SQL by itself. You need another language.

In the old days, we actually wrote SQL in the middle of some other programming language source. A pre-processor replaced SQL with the other language's code. Now we use ODBC/JDBC or other drivers to execute SQL from within another language. The embedding isn't quite so literal as it once was. But it's still embedding.

The SQL vs. Programming Language is not an "either-or" situation. We never have a stark choice between SQL or "some other language." We always have to learn "some other language." Always.

That "other language" might be PL/SQL or TSQL or whatever scripting tool of choice comes bundled with the database. It isn't SQL, it's another Turing-complete language that shares SQL syntax.

Since "some other language" is required, the real question is "is there value in also learning SQL?" Or -- most importantly -- "What's the value in spreading the knowledge representation around among multiple languages?"

In some contexts, SQL can act as a lingua franca, allowing a kind of uniform access to data irrespective of the application programming language.

In most contexts, however, the SQL -- in isolation -- is incomplete. There is application processing that has semantic significance. The "do everything in stored procedures" crowd spend too much time in raging denial that application logic is still used to wrap their stored procedures.  No matter how enthusiastic one embraces stored procedures, application code still exists, and still implements semantically significant operations.

SQL is merely a short-hand notation for potentially complex algorithms. It's an optimization. SQL elects for universality via abstraction. It can't cover efficiency or scalability. We have to bind in a representation and access algorithm to compare SQL performance with another language's performance. Or scalability.

By itself, SQL is useless. So there's a false dichotomy implied by the question.

The Head-To-Head Problem

Above, I provided code that demonstrates SQL CRUD operations in Python. This is, of course, silly. It presumes that SQL is the benchmark standard which Python must meet.

What if we lift up Python as the benchmark that SQL has to meet?


We can trivially write things in Python which cannot be expressed in SQL at all.  E.g., Compute the 1000th Fibonacci Number. For fun, go to https://projecteuler.net/archives and pick any problem and try to solve it in SQL. Try to even frame the problem in a way that the solution can be expressed in SQL. SQL has profound limitations.

Okay. That's sort of like cheating.

Let's not raise the bar quite so high, then. Here's today's problem.

I got a spreadsheet with 100's of rows of student evaluations. It may have come from Survey Monkey. Or not. It doesn't matter.

Most of the columns are some kind of Agree-Disagree scale. Other columns are comments or usernames, or stuff in an open-ended domain.

Note that I don't know which columns. And I don't care. And I don't need to care.

Here's how we tackle this in Python. It can be done in SQL. That's the point. It's not impossible. It's just kind of complex. Especially because the data loading either requires converting the data to a sequence of INSERT statements or we have to use a "loader" which lives outside the SQL language.

from collections import Counter
def summarize(data):
    with data.open() as source:
        rdr = csv.DictReader(source)
        summaries = {name: Counter() for name in rdr.fieldnames}
        for row in rdr:
            for key, value in row.items():
                summaries[key][value] += 1
    for key in sorted(summaries):
        summary= summaries[key]
        if len(summary) == 5:
            print(key, summary)
            print(key, "More than 5 values")

This is the kind of thing that people do in Python that demonstrates the limitations of SQL.  We've summarized all columns doing a count/group-by in one pass through the data. We've build Counter objects for each column name in the file. Each Counter object will collect a complete histogram for a given column. We'll do all of the columns at once.

This is scalable to millions or billions of rows and runs delightfully quickly. Doing something similar with SELECT COUNT(*) FROM TABLE GROUP BY SOMETHING is remarkably slow.  Databases are forced to do a lot of on-disk sorting and temporary file creation. The Python Counter lives in memory and works at in-memory speeds. Even for millions of rows of data.


Please define "better". Be explicit on what your goals are: speed, ACID, reliability, whatever.

Please provide code. Or provide use cases that map directly to code.

Please stop clinging to SQL. Be realistic.

Please consider the basics: Does it capture knowledge effectively? Is it expressive?

Please don't create dichotomies where none exist.

Tuesday, December 22, 2015

Coming Soon: Python for Secret Agents Part II

I guess it's like a movie franchise or a series of novels. The first one was popular. So, write a second story with similar characters.

You can check find part I here: http://www.amazon.com/gp/product/B00N2RWMMW/ref=dp-kindle-redirect?ie=UTF8&btkr=1 and here: https://www.packtpub.com/hardware-and-creative/python-secret-agents

Part II will be available soon. New missions. New ways to gather and analyze intelligence information assets.

I should probably read some Ian Fleming or Robert Ludlum boos to get some ideas for more exciting missions.

I'm more a fan of John le Carré stories which are less high-tech and more about ordinary selling out.

I'm also a fan of the history of Agent Garbo and Operation Mincement. These are things that are really interesting uses of data, intelligence, and misdirection.

Tuesday, December 15, 2015

Writing About Code -- Or -- Why I love RST

I blog. I write books. I write code. There are profound tool-chain issues in all three of these. Mostly, I'm tired of shabby "What You See Is All You Get" editing.

First. I use this blogger site as well as a Jive-based site at work. They're handy. But. There are a lot of issues. A lot. Web-based editing leaves a lot to be desired.

Second. Books. Packt requires MS-Word for drafts. The idea here is that authors, editors, and reviewers should all use a single tool. I push the boundaries by using Libre Office and Open Office. This works out most of the time, since these tools will absorb the MS-office style sheet that Packt uses. It doesn't work out well for typesetting math, but the technical editors are good about tracking down the formulae when they get lost in the conversions. These over-wrought do-too-much word processing nightmares leave a lot to be desired.

Third. Code. I use ActiveState Komodo Edit.  Both at work and outside of work. This rocks.

Web-Based Editing Fail

What's wrong with Jive or Blogger? The stark contrast between JavaScript-based text edit tools and HTML. It's either too little control or too much detail.

The JS-based editors are fine for simple, running text. They're actually kind of nice for that. Simple styles. Maybe a heading here or there.

Code? Ugh. Epic Fail.

It gets worse.

I've become a real fan of semantic markup. DocBook has a rich set of constructs available.  RST, similarly, has a short list of text roles that can be expanded to include the same kind of rich markup as DocBook. Sphinx leverages these roles to allow very sophisticated references to code from text. LaTeX has a great deal of semantic markup.

Web-based editors lack any of this. We have HTML. We have HTML microformats available. But. For a JavaScript web editor, we're really asking for a lot. More than seems possible for a quick download.

Desktop Tool Fail

What's wrong with desktop tools? We have very rich style sheets available. We should be able to define a useful set of styles and produce a useful document. Right?

Sadly, it's not easy.

First, the desktop tools are extremely tolerant of totally messed-up markup. They're focus is explicitly on making it look acceptable. It doesn't have to be well-structured. It just has to look good.

Second, and more important, the file formats are almost utterly opaque. Yes. There are standards now. Yes. It's all just XML. No. It's still nearly impossible to process. Try it.

Most word-processing documents feel like XML serializations of in-memory data structures. It's possible to locate the relevant document text in there somewhere. It's not like they're being intentionally obscure. But they're obscure.

Third, and most important, is the reliance on either complex GUI gestures (pointing and clicking and what-not) or complex keyboard "shortcuts" and stand-ins for GUI gestures. It might be possible to use that row of F-keys to define some kinds of short-cuts that might be helpful. But there's a lot of semantic markup and only a dozen keys, some of which have common interpretations for help, copy, paste, turn off the keyboard lights, play music, etc.

The Literate Programming ideal is to have the words and the code existing cheek by jowls. No big separation. No hyper-complex tooling. To me, this means sensible pure-text in-line markup.

Text Markup

I find that I really like RST markup. The more I write, the more I like it.

I really like the idea of writing code/documentation in a simple, uniform code-centric tooling. The pure-text world using RST pure-text markup is delightfully simple. 
  1. Write stuff. Words. Code. Whatever. Use RST markup to segregate the formal language (e.g. Python) from the natural language (e.g., English in my case.)
  2. Click on some icon the right side of the screen (or maybe use an F-key) to run the test suite.
  3. Click on some icon (or hit a key) to produce prettified HTML page from python3 -m pylit3 doc.py doc.rst; rst2html.py doc.rst doc.html. Having a simple toolchain to emit doc from code (or emit code from doc) is a delight.
The genesis for this blog post was an at-work blog post (in Jive) that had a code error in it. Because of Jive's code markup features (using non-breaking spaces everywhere) there's no easy copy-and-paste to check syntax. It's nearly impossible to get the code off the web page in a form that's useful.

If people can't copy-and-paste the code, the blog posts are approximately worthless. Sigh.

If I rewrite the whole thing into RST, I lose the Jive-friendly markup. Now it looks out-of-place, but is technically correct.

Either. Or.

Exclusive Xor.

Ugh. Does this mean I have to think about gathering the Jive .CSS files, and create a version of those that's compatible with the classes and ID's that Docutils uses?  I have some doubts about making this work, since the classes and ID's might have overlaps that cause problems.

Or. Do I have to publish on some small web-server at work, and use the <iframe> tag to include RST-built content on the main intranet? This probably works the best. But it leads to a multi-step dance of writing, publishing on a private server, and then using a iframe on the main intranet site. It seems needlessly complex.