Tuesday, July 21, 2015

A Surprising Confusion

Well, it was surprising to me.

And it should not have been a surprise.

This is something I need to recognize as a standard confusion. And rewrite some training material to better address this.

The question comes up when SQL hackers move beyond simple queries and canned desktop tools into "Big Data" analytics. The pure SQL lifestyle (using spreadsheets, or Business Objects, or SAS) leads to an understanding of data that's biased toward working with collections in an autonomous way.

Outside the SELECT clause, everything's a group or a set or some kind of collection. Even in spreadsheet world, a lot of Big Data folks slap summary expressions on the top of a column to show a sum or a count without too much worry or concern.

But when they start wrestling with Python for loops, and the atomic elements which comprise a set (or list or dict), then there's a bit of confusion that creeps in.

An important skill is building a list (or set or dict) from atomic elements. We'll often have code that looks like this:

some_list = []
for source_object in some_source_of_objects:
    if some_filter(source_object):
        useful_object = transform(source_object)
        some_list.append(useful_object)

This is, of course, simply a list comprehension. In some case, we might have a process that breaks one of the rules of using a generator and doesn't work out perfectly cleanly as a comprehension. This is somewhat more advanced topic.

The transformation step is what seems to causes confusion. Or -- more properly -- it's the disconnect between the transformation calculations on atomic items and the group-level processing to accumulate a collection from individual items.

The use of some_list.append() and some_list[index] and some_list is something that folks can't -- trivially -- articulate. The course material isn't clarifying this for them. And (worse) leaping into list comprehensions doesn't seem to help.

These are particularly difficult to explain if the long version isn't clear.

some_list = [transform(source_object) for source_object in some_source_of_objects if some_filter(source_object)]

and

some_list = list( map(transform, filter(some_filter, some_source_of_objects)) )

I'm going to have to build some revised course material that zeroes in on the atomic vs. collection concepts. What we do with an item (singular) and what we do with a list of items (plural).

I've been working with highly experienced programmers too long. I've lost sight of the n00b questions.

The goal is to get to the small data map-reduce. We have some folks who can make big data work, but the big data Hadoop architecture isn't ideal for all problems. We have to help people distinguish between big data and small data, and switch gears when appropriate. Since Python does both very nicely, we think we can build enough education to school up business experts to also make a more informed technology choice.