Tuesday, March 29, 2016

The Data Structures and Algorithms Problem

Here's a snippet of an email:
In big data / data science, the curse of dimensionality keeps showing up over and over. A good place to start is the wiki article “curse of dimensionality.” The issue seems to be that a lot of these big data / data science people have not taken the time to study fundamental data structures.
There was more about Foundations of Multidimensional and Metric Data Structures by Hanan Samet being too detailed. And Stack Overflow being too high-level.  And more hand-wringing after that, too.

The email was pleading for some book or series of blog posts that would somehow educate data science folks on more fundamental issues of data structures and algorithms. Perhaps it would get them to drop some dimensions when doing k-NN problems, or perhaps to exploit some other data structure that didn't involve hundreds of columns.
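For what it's worth, the effect the email is worried about can be shown in a few lines. This is a minimal sketch and not anything from the email: the function name relative_contrast, the uniformly random points, and the sample sizes are all my assumptions. It measures how much farther the farthest point is than the nearest one from a random query point. As the number of dimensions grows, that contrast tends to shrink, which is one reason k-NN gets less meaningful with hundreds of columns.

```python
import math
import random


def relative_contrast(dimensions, n_points=200, seed=42):
    """Return (farthest - nearest) / nearest for one random query point.

    Points are drawn uniformly from the unit hypercube; this is an
    illustrative assumption, not a model of any real data set.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        # math.dist() requires Python 3.8 or later.
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    for d in (2, 20, 200):
        print(d, round(relative_contrast(d), 2))
```

With few dimensions the nearest neighbor is much closer than the farthest point. With many dimensions the two distances are nearly the same, so "nearest" stops carrying much information.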

I think.

I'm guessing because -- like a lot of hand-waving emails -- it didn't involve code. And yes, I'm very bigoted about the distinction between code and hand-waving.

If there is a lack of awareness of appropriate data structures, the real place to start is The Algorithm Design Manual by Steven Skiena.

I harbor my doubts that this is the real problem, however. I think that the broad spectrum of computing applications leads to a lot of specialization. I don't think that it's really prudent to try and think of generalists who can handle deep data science issues as well as algorithm design and performance issues. No one expects them to write JavaScript and tinker with CSS so that the web site which presents the results looks good.

I actually think the real problem is that some folks expect too much from their data scientists.

In fantasy land the rock stars are full stack developers who can span the entire spectrum from OS to CSS. In the real world, developers have different strengths and interests. In some cases, "full stack" means mediocre skills in a lot of areas.

Here's a more useful response: Bridging the Gap Between Data Science and DevOps. I don't think the problem is "big data / data science people have not taken the time to study fundamental data structures". I think the problem is that big data is a cooperative venture. It takes a team to solve a problem.

Tuesday, March 15, 2016

PacktPub Looking For Python Projects

Do you have a good project? Do you want to write?

The acquisition folks at Packt are looking for this:

"... demonstrate 4-5 projects over the course of the chapters in order to demonstrate how to build scalable Python projects from scratch. These projects cover some of the most important concepts in Python and the common problems that a Python programmer faces on a day-to-day basis..."

I'm busy already. And most of my examples are owned by my employer. I'm not sure the exceptions are interesting enough.

You get to work with a really good publication team. I've been thrilled.

See https://www.packtpub.com/books/info/packt/contact-us Drop Shaon Basu's name.

Tuesday, March 8, 2016

The Composite Builder Pattern, an Example of Declarative Programming [Update]

I'm calling this the Composite Builder pattern. This may have other names, but I haven't seen them. It could simply be lack of research into prior art. I suspect this isn't very new. But I thought it was a cool way to do some declarative Python programming.

Here's the concept.

class TheCompositeThing(Builder):
    attribute1 = SomeItem("arg0")
    attribute2 = AnotherItem("arg1")
    more_attributes = MoreItems("more args")

The idea is that when we create an instance of TheCompositeThing, we get a complex object, built from various data sources.  We want to use this in the following kind of context:

with some_config_path.open() as config:
    the_thing = TheCompositeThing().substitute(config)

We want to open some configuration file -- something that's unique to an environment -- and populate the complex object in one smooth motion. Once we have the complex object, it can then be used in some way, perhaps serialized as a JSON or YAML document.

Each Item has a get() method that accepts the configuration as input. These do some computation to return a useful result. In some cases, the computation is a kind of degenerate case:

class LiteralItem(Item):
    def __init__(self, value):
        self.value = value
    def get(self, config):
        return self.value

This shows how we jam a literal value into the output. Other values might involve elaborate computations, or lookups in the configuration, or a combination of the two.
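Here's a sketch of what the non-degenerate cases might look like. The ConfigItem and JoinedItem classes are hypothetical names I'm making up for illustration, and the sketch assumes the configuration has already been parsed into a plain dict rather than being an open file.

```python
class Item:
    """Base class: each Item computes one value from the configuration."""

    def get(self, config):
        raise NotImplementedError


class LiteralItem(Item):
    """The degenerate case: ignore the configuration entirely."""

    def __init__(self, value):
        self.value = value

    def get(self, config):
        return self.value


class ConfigItem(Item):
    """Hypothetical: look up a single key in the parsed configuration."""

    def __init__(self, key, default=None):
        self.key = key
        self.default = default

    def get(self, config):
        return config.get(self.key, self.default)


class JoinedItem(Item):
    """Hypothetical: combine several lookups with a small computation."""

    def __init__(self, *keys, sep="-"):
        self.keys = keys
        self.sep = sep

    def get(self, config):
        return self.sep.join(str(config[key]) for key in self.keys)


config = {"env": "prod", "app": "billing"}  # assumed, already-parsed config
print(LiteralItem(42).get(config))           # 42
print(ConfigItem("env").get(config))         # prod
print(JoinedItem("app", "env").get(config))  # billing-prod
```

Each Item sees only the configuration, never another Item, which is what keeps the computations independent.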

Why Use a Declarative Style?

This declarative style can be handy when each of the Items in TheCompositeThing involves rather complex, but completely independent computations. There's no dependency here, so the substitute() method can fill in the attributes in any order. Or -- perhaps -- not fill the attributes until they're actually requested. This pattern allows eager or lazy calculation of the attributes.

This pattern applies to building complex AWS CloudFormation templates as an example. We often need to make a global tweak to a large number of templates so that we can rebuild a server farm. There's little or no dependency among the Item values being filled in. There's no strange "ripple effect" of a change in one place also showing up in another place because of an obscure dependency between items.

We can extend this to have a kind of pipeline with each stage created in a declarative style. In this more complex situation, we'll have several tiers of Items that fill in the composite object. The first-stage Items depend on one source. The second-stage Items depend on the first-stage Items.

class Stage1(Builder):
    item_1 = Stage_1_Item("arg")
    item_2 = Stage_1_More("another")

class Stage2(Builder):
    item_a = Stage_2_Item("some_arg")
    item_b = Stage_2_Another(355, 113)

We can then create a Stage1 object from external configuration or inputs. We can create the derived Stage2 object from the Stage1 object.
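As a sketch of how the two stages could chain together: the FromConfig and Derived item classes, the example keys, and this particular Builder.substitute() are all assumptions of mine, one possible arrangement among several. The first stage is populated from a dict; the second stage is populated from the finished first-stage object.

```python
class Item:
    """Base class: compute one value from some source object."""

    def get(self, source):
        raise NotImplementedError


class Builder:
    """One possible (eager) Builder: fill in every declared Item."""

    def substitute(self, source):
        for name, item in vars(type(self)).items():
            if isinstance(item, Item):
                setattr(self, name, item.get(source))
        return self


class FromConfig(Item):
    """Hypothetical first-stage Item: read one key from a dict."""

    def __init__(self, key):
        self.key = key

    def get(self, source):
        return source[self.key]


class Derived(Item):
    """Hypothetical second-stage Item: compute from the previous stage."""

    def __init__(self, function):
        self.function = function

    def get(self, source):
        return self.function(source)


class Stage1(Builder):
    host = FromConfig("host")
    port = FromConfig("port")


class Stage2(Builder):
    url = Derived(lambda s1: "http://{}:{}/".format(s1.host, s1.port))
    is_default_port = Derived(lambda s1: s1.port == 80)


stage_1 = Stage1().substitute({"host": "example.com", "port": 8080})
stage_2 = Stage2().substitute(stage_1)
print(stage_2.url)              # http://example.com:8080/
print(stage_2.is_default_port)  # False
```

The second-stage Items depend only on the completed Stage1 object, never on each other, so they can still be reordered freely.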

And yes. This seems like useless metaprogramming.  We could -- more simply -- do something like this:

class Stage2:
    def __init__(self, stage_1, config):
        self.item_a = Stage_2_Item("some_arg", stage_1, config)
        self.item_b = Stage_2_Another(355, 113, stage_1, config)

We've eagerly computed the attributes during __init__() processing.

Or perhaps this:

class Stage2:
    def __init__(self, stage_1, config):
        self.stage_1 = stage_1
        self.config = config
    @property
    def item_a(self):
        return Stage_2_Item("some_arg", self.stage_1, self.config)
    @property
    def item_b(self):
        return Stage_2_Another(355, 113, self.stage_1, self.config)

Here we've been lazy and only computed attribute values as they are requested.


We've looked at three ways to build composite objects:
  1. As independent attributes with a flexible but terse implementation.
  2. As attributes during __init__() using sequential code that doesn't assure independence.
  3. As properties using wordy code. 
What's the value proposition? Why is this declarative technique interesting?

I find that the Composite Builder pattern is handy because it gives me the following benefits.
  • The attributes must be built independently. We can -- without a second thought -- rearrange the attributes and not worry about one calculation interfering with another attribute. 
  • The attributes can be built eagerly or lazily. Details don't matter. We don't expose the implementation details via __init__ or @property.
  • The class definition becomes a configuration item. A support technician without deep Python knowledge can edit the definition of TheCompositeThing successfully.
I think this kind of lazy, declarative programming is useful for some applications. It's ideal in those cases where we need to isolate a number of computations from each other to allow the software to evolve without breaking.

It may be a stretch, but I think this shows the Dependency Inversion Principle. To an extent, we've moved all of the dependencies to the visible list of attributes within these classes. The Item classes do not depend on each other; they depend on configuration or perhaps previous stage composite objects. Since there are no methods involved in the class definition, we can change the class freely. Each subclass of Builder is more like a configuration item than it is like code. In Python, particularly, we can change the class freely without the agony of a rebuild.

A Build Implementation

We're reluctant to provide a concrete implementation for the above examples because it could go anywhere. It could be done eagerly or lazily. One choice for a lazy implementation is to use a substitute() method. Another choice is to use the __init__() method.

We might do something like this:

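def substitute(self, config):
    class_dict = self.__class__.__dict__
    for name in class_dict:
        # Skip special names like __module__ and __doc__.
        if name.startswith('__') and name.endswith('__'):
            continue
        setattr(self, name, class_dict[name].get(config))
    # Return self so that TheCompositeThing().substitute(config) yields the object.
    return self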

This allows us to lazily build the composite object by stepping through the dictionary defined at the class level and filling in values for each item. This could be done via __getattr__() also.
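Here's a sketch of the __getattr__() variation. There's a wrinkle: Python only calls __getattr__() when ordinary lookup fails, and Items declared in the class body are found by ordinary lookup. This sketch gets around that by moving the Items into a private _items mapping with __init_subclass__() (which needs Python 3.6 or later). The names LazyBuilder, _items, _config, and CountingItem are my assumptions for illustration, not part of any established design.

```python
class Item:
    """Base class: each Item computes one value from the configuration."""

    def get(self, config):
        raise NotImplementedError


class CountingItem(Item):
    """Hypothetical Item that records how often it is computed."""

    def __init__(self, key):
        self.key = key
        self.calls = 0

    def get(self, config):
        self.calls += 1
        return config[self.key]


class LazyBuilder:
    _items = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Move declared Items out of the class namespace so ordinary
        # lookup fails and attribute access falls through to __getattr__().
        cls._items = {
            name: value
            for name, value in list(vars(cls).items())
            if isinstance(value, Item)
        }
        for name in cls._items:
            delattr(cls, name)

    def substitute(self, config):
        # Only remember the configuration; compute nothing yet.
        self._config = config
        return self

    def __getattr__(self, name):
        try:
            item = self._items[name]
        except KeyError:
            raise AttributeError(name) from None
        value = item.get(self._config)
        # Cache on the instance so later lookups bypass __getattr__().
        setattr(self, name, value)
        return value


class Thing(LazyBuilder):
    a = CountingItem("a")
    b = CountingItem("b")


thing = Thing().substitute({"a": 1, "b": 2})
print(thing.a, thing.a)         # 1 1
print(Thing._items["a"].calls)  # 1: computed once, then cached
print(Thing._items["b"].calls)  # 0: never requested, never computed
```

The class definitions look the same as in the eager version; only the Builder superclass changes. That's the point of not exposing the eager-or-lazy choice in the declarations.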

Tuesday, March 1, 2016

Dexy and word-processing toolchains

See http://www.dexy.it

Wow. That seems cool.

I write. A lot.

I've tried a lot of tool chains. A lot. And I mean non-trivial "try". Whole books.  Hundreds of pages.

LEO + my own HTML Templates. A lot of fun at first.  An outliner that generates RST is a very, very handy thing for technical writing.

An XML editor (I forget which one. Maybe http://www.xmlmind.com/xmleditor/?) with the DocBook XML and XSLT tool-chain. This produced HTML from the XML. I think it could also produce LaTeX. Again, the outlining and structuring were kind of handy. What was particularly cool was the diverse semantic markup tags available in DocBook. Getting the tag containment right was a large pain even with a handy GUI editor.

Plain Text using RST "the hard way;" i.e., without LEO. This isn't too bad, it turns out. The outlining features of LEO -- while fun -- aren't essential. A simple RST toolchain is easy to concoct. I used SCons to rebuild HTML and LaTeX from the RST.

LaTeX. Once I had the base LaTeX from RST, I could then edit that to produce an even richer document using lots of LaTeX add-on packages. I use MacTeX and it is truly great. The downside of LaTeX -- for me -- was no trivial way to go back to HTML from the LaTeX.  There are some complex back-and-forth toolchains, but it's easier to just produce a PDF. PDF looks great, but wasn't really my goal.

RST with Sphinx. Wow. This is elegant. I often produce chapter drafts here, and then copy and paste the HTML into the word processors preferred by the publishing industry.

[They insist on applying their goofy markup templates to the text. It's a subset of meaningful semantic markup used by Sphinx, but somehow their toolchain must start with .DOCX files, and nothing else will do.]

Dexy was cool right up until I read this: "Dexy is a Python package (Python 2.6-2.7 only)".

Ouch. Web.py Utilities include DBUtils which won't install under Python 3.5. So that put the kibosh on Dexy. Sadface.