
Tuesday, December 29, 2015

SQL Hegemony -- a sad state of affairs

It appears that there are people who don't recognize SQL as a tradeoff.

Here's a complex two-part question that can only come from folks who firmly believe in the magic of SQL.
The sentence that got my attention was "Python has basically made SQL obsolete as a language for data structure manipulation". My question would be about scaling.  If [we? you?] have 30 million rows in a table, would Python still be better than straight up SQL? The other question would be about the amount of time to come up to speed. It just seems easier to learn SQL than Python.
Also, in working with legacy DBA's who are starting to learn Cassandra, I see similar magical thinking. Somehow, Oracle's behavior can become a baseline in some people's minds. When Cassandra's column database shows different behavior, there are DBA's who are surprisingly quick to portray Cassandra as "wrong" or "confusing." Worse, they'll waste a lot of time insisting that Cassandra is misusing the term "key" because Cassandra's idempotency policy means multiple INSERTS with the same primary key are handled differently from Oracle. Labeling Cassandra as "wrong" is a similar problem to the question.

Let's unpack the "SQL is better" question and see why this seems so sad.

I'm not going to address the quote ("Python has basically made SQL obsolete...") since that wasn't part of the question. That's just background. And everyone seems to agree on this. The question appears to be related to clinging to SQL in spite of Python's advantages.

But first, I have to note that the question violates some pretty serious rules of engagement.

The Rules for Questions

Asking hand-waving hypotheticals is generally a pretty bad practice. Sometimes, I'm completely intolerant, and refuse to engage. In this case, I felt compelled to respond, in spite of the vacuity of the question.

First, of course, "better" is undefined in the question. That essentially ends any conversation.

Second, there's no code. It's very hard to discuss anything without code. All the hand-waving is essentially meaningless because when code finally does show up, it will fit into some edge or corner not properly covered by hand-waving.

Third, there's no possibility of code. There's nothing resembling a tangible use case or scenario that can be turned into code for comparison purposes.

Also,  the question seems to be creating a false dichotomy between SQL and Python. This is a more subtle issue, and we'll look at this, too.

Python Better Than SQL

We can assign a number of potential meanings to "better". Some other phrases -- "30 million rows in a table" and "about scaling" -- could be dismissed as mere noise. Perhaps they're hints.

Let's assume it's about size of storage. Can Python deal with 30 million rows of data? Since we don't know the row size, there is no actual answer. Without transactions or activities of some kind, we're similarly bereft of the kinds of details that lead to a sensible answer.

Let's say we're limited to 32Gb of memory. If the row size is up to 1Kb, we can fit all of the data in memory. We're pretty much done with size and speed.  Python wins for the canonical CRUD operations.

Python wins because any code we write will be completely customized for the data we're given. We're freed from generalized SQL type conversion complexity, ODBC driver folderol, storage management overheads, and SQL language parsing work. Just the data manipulation. No lock escalation or read consistency considerations. Done.

But wait. Not so fast. What about loading 32Gb into memory?

What about it? The problem is so delightfully vague that we have no clue what "loading" might mean. Oracle takes a while to mount a database and do other startup things. Python can open a file and slurp in the data pretty quickly. If you want to amortize the loading time, you can have a smarter loader that brings in data incrementally.

import csv
import random
import string

def random_text(size=15):
    # Stand-in helper (assumed): build a short random string for demo rows.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(size))

def load(data, key_col):
    # Load the whole CSV file into a dict keyed by the value in key_col.
    with data.open() as source:
        rdr = csv.reader(source)
        table = { row[key_col]: row for row in rdr }
    return table

def CRUD(table, key_col, update_col):
    # Build the new row as a list so the UPDATE below can mutate it.
    row = [random_text() for i in range(10)]

    # INSERT INTO table(col,col,...) VALUES(val,val,...)
    table[row[key_col]] = row

    # SELECT * FROM TABLE WHERE key_col = value
    found = table[row[key_col]]
    # print(found)

    # UPDATE TABLE SET update_col = "value" WHERE key_col = value
    table[row[key_col]][update_col] = "special text"

    # DELETE FROM TABLE WHERE key_col = value
    del table[row[key_col]]

    # Is it gone?
    assert row[key_col] not in table

Rather than go for 30 million rows on this little laptop (with only 8Gb RAM), we'll load 30,000 rows each of which is about 150 characters. Small. The point, however, is this:

load 0.133, CRUD 0.176

We can load 30,000 rows of data in 133 ms.  We can do 1,000 sets of CRUD operations in 176 ms. The load time scales with total number of bytes, row size × number of rows. The CRUD operation time will barely move no matter how many rows or how big the rows are.
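
For what it's worth, numbers like those can come from a trivial harness. Here's a minimal sketch, assuming the load() and CRUD() functions above and a hypothetical data.csv path:

import time
from pathlib import Path

def benchmark(path=Path("data.csv"), key_col=0, update_col=3, passes=1000):
    start = time.perf_counter()
    table = load(path, key_col)
    load_time = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(passes):
        CRUD(table, key_col, update_col)
    crud_time = time.perf_counter() - start

    print("load {0:.3f}, CRUD {1:.3f}".format(load_time, crud_time))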

The problem with this kind of benchmark is that it plays to SQL's strengths. It makes SQL look like the benchmark. We're forced to show how some non-SQL language can also do what SQL does. And that's silly.

What About Bigger?

Let's pretend the number was supposed to be 30 billion rows of data. Something that clearly can't fit into memory. Wait. Traditional SQL databases struggle with this, too. Let's press on. 30 billion rows of data. Each row is at least 1K in size. 3Tb of storage. Can Python do this?

Recall that the question gives us no help in reasoning about "better".

What's the representation? 3Tb has to be implemented as a collection of smaller files. All of the files must have a common format. Let's posit CSV. We don't really want all of this storage on a single server. We want to farm this out to several hosts. And we probably want to layer in some redundancy in case one of those hosts fails.

Okay. It might not be obvious, but we're describing the HDFS from Hadoop. We could -- without too much trouble -- implement an HDFS surrogate that has very limited functionality in Python. We can use SFTP to smear two copies of each file among a fixed-size farm of servers. Very hard-wired, unlike Hadoop.

Then the reading part of our imagined app will scroll through the collection of CSV-formatted files on each processor. We'd have to implement a Hadoop-style map-reduce in Python. Again, not very difficult if we eliminate some features and stick to a very basic version of map-reduce. We can coordinate the reductions by implementing a simple REST-based master-reducer that accepts the reductions from the other processors and does the final reduce.
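
To make that concrete, here's a minimal single-host sketch of the map-reduce core -- hypothetical shard files and a simple count-by-key reduction; the SFTP distribution and the REST-based coordination are left out:

import csv
from collections import Counter
from pathlib import Path

def map_shard(path, key_col):
    # Map phase: reduce one CSV shard to partial counts by key.
    counts = Counter()
    with path.open() as source:
        for row in csv.reader(source):
            counts[row[key_col]] += 1
    return counts

def reduce_all(partials):
    # Reduce phase: merge the per-shard partial counts.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical usage: every shard of the big dataset lives under ./shards/
grand_total = reduce_all(map_shard(p, key_col=0) for p in Path("shards").glob("*.csv"))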

Now we have a lot of Python language overheads. Have we failed at "better" because we polluted the solution with a fake Hadoop?

No.

The SQL folks had to install, configure, and manage a SQL database that handled 3Tb of storage. The Python folks installed Python. Installed their fake Hadoop. Then they used a few clever abstractions to write delightfully simple map and reduce functions. Python still handles the extremely large amount of data faster than SQL. Also, it does this without some RDBMS features.

Which leads us to the second part of the question. Expressivity.

Easier to Learn

From the Question: "It just seems easier to learn SQL than Python".

This is pretty much meaningless noise. Less meaningful than the rest of the question. Having taught both, I'm confident in saying that SQL can be pretty confusing.

But.

More importantly.

There's no rational basis for comparison.

SQL DML is a very tiny language with only a few concepts. It's not a Turing-complete programming language.

What's important is this:

We have to embed SQL in another language.

You can't actually DO anything in SQL by itself. You need another language.

In the old days, we actually wrote SQL in the middle of some other programming language source. A pre-processor replaced SQL with the other language's code. Now we use ODBC/JDBC or other drivers to execute SQL from within another language. The embedding isn't quite so literal as it once was. But it's still embedding.
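
Here's what that embedding looks like today, using Python's standard-library sqlite3 driver as one example; any DB-API driver follows the same pattern:

import sqlite3

# The SQL is just a string handed to the driver from the host language.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE log (name TEXT, value INTEGER)")
connection.execute("INSERT INTO log(name, value) VALUES (?, ?)", ("demo", 42))

for name, value in connection.execute("SELECT name, value FROM log"):
    print(name, value)   # All the surrounding control flow is Python, not SQL.

connection.close()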

The SQL vs. Programming Language is not an "either-or" situation. We never have a stark choice between SQL or "some other language." We always have to learn "some other language." Always.

That "other language" might be PL/SQL or TSQL or whatever scripting tool of choice comes bundled with the database. It isn't SQL, it's another Turing-complete language that shares SQL syntax.

Since "some other language" is required, the real question is "is there value in also learning SQL?" Or -- most importantly -- "What's the value in spreading the knowledge representation around among multiple languages?"

In some contexts, SQL can act as a lingua franca, allowing a kind of uniform access to data irrespective of the application programming language.

In most contexts, however, the SQL -- in isolation -- is incomplete. There is application processing that has semantic significance. The "do everything in stored procedures" crowd spend too much time in raging denial that application logic is still used to wrap their stored procedures. No matter how enthusiastically one embraces stored procedures, application code still exists, and still implements semantically significant operations.

SQL is merely a short-hand notation for potentially complex algorithms. It's an optimization. SQL elects for universality via abstraction. It can't cover efficiency or scalability. We have to bind in a representation and access algorithm to compare SQL performance with another language's performance. Or scalability.

By itself, SQL is useless. So there's a false dichotomy implied by the question.

The Head-To-Head Problem

Above, I provided code that demonstrates SQL CRUD operations in Python. This is, of course, silly. It presumes that SQL is the benchmark standard which Python must meet.

What if we lift up Python as the benchmark that SQL has to meet?

Ooops.

We can trivially write things in Python which cannot be expressed in SQL at all. For example, compute the 1000th Fibonacci number. For fun, go to https://projecteuler.net/archives, pick any problem, and try to solve it in SQL. Try to even frame the problem in a way that the solution can be expressed in SQL. SQL has profound limitations.
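
A minimal sketch of that first example, just to show how little Python it takes:

def fibonacci(n):
    # Return the n-th Fibonacci number, with F(0) == 0 and F(1) == 1.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(1000))  # a 209-digit number; Python integers handle it natively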

Okay. That's sort of like cheating.

Let's not raise the bar quite so high, then. Here's today's problem.

I got a spreadsheet with 100's of rows of student evaluations. It may have come from Survey Monkey. Or not. It doesn't matter.

Most of the columns are some kind of Agree-Disagree scale. Other columns are comments or usernames, or stuff in an open-ended domain.

Note that I don't know which columns. And I don't care. And I don't need to care.

Here's how we tackle this in Python. It can be done in SQL. That's the point. It's not impossible. It's just kind of complex, especially because loading the data either requires converting it to a sequence of INSERT statements or using a "loader" which lives outside the SQL language.

import csv
from collections import Counter

def summarize(data):
    # One pass: build a histogram (Counter) for every column in the file.
    with data.open() as source:
        rdr = csv.DictReader(source)
        summaries = {name: Counter() for name in rdr.fieldnames}
        for row in rdr:
            for key, value in row.items():
                summaries[key][value] += 1
    for key in sorted(summaries):
        summary = summaries[key]
        if len(summary) == 5:
            # Five distinct values: looks like an Agree-Disagree column.
            print(key, summary)
        else:
            print(key, "More than 5 values")

This is the kind of thing that people do in Python that demonstrates the limitations of SQL. We've summarized all columns, doing a count/group-by in one pass through the data. We've built Counter objects for each column name in the file. Each Counter object will collect a complete histogram for a given column. We'll do all of the columns at once.

This is scalable to millions or billions of rows and runs delightfully quickly. Doing something similar with SELECT SOMETHING, COUNT(*) FROM TABLE GROUP BY SOMETHING is remarkably slow. Databases are forced to do a lot of on-disk sorting and temporary file creation. The Python Counter lives in memory and works at in-memory speeds. Even for millions of rows of data.

Summary

Please define "better". Be explicit on what your goals are: speed, ACID, reliability, whatever.

Please provide code. Or provide use cases that map directly to code.

Please stop clinging to SQL. Be realistic.

Please consider the basics: Does it capture knowledge effectively? Is it expressive?

Please don't create dichotomies where none exist.




Tuesday, December 22, 2015

Coming Soon: Python for Secret Agents Part II

I guess it's like a movie franchise or a series of novels. The first one was popular. So, write a second story with similar characters.

You can find Part I here: http://www.amazon.com/gp/product/B00N2RWMMW/ref=dp-kindle-redirect?ie=UTF8&btkr=1 and here: https://www.packtpub.com/hardware-and-creative/python-secret-agents

Part II will be available soon. New missions. New ways to gather and analyze intelligence information assets.

I should probably read some Ian Fleming or Robert Ludlum books to get some ideas for more exciting missions.

I'm more a fan of John le Carré stories which are less high-tech and more about ordinary selling out.

I'm also a fan of the history of Agent Garbo and Operation Mincemeat. These are really interesting uses of data, intelligence, and misdirection.

Tuesday, December 15, 2015

Writing About Code -- Or -- Why I love RST

I blog. I write books. I write code. There are profound tool-chain issues in all three of these. Mostly, I'm tired of shabby "What You See Is All You Get" editing.

First. I use this blogger site as well as a Jive-based site at work. They're handy. But. There are a lot of issues. A lot. Web-based editing leaves a lot to be desired.

Second. Books. Packt requires MS-Word for drafts. The idea here is that authors, editors, and reviewers should all use a single tool. I push the boundaries by using Libre Office and Open Office. This works out most of the time, since these tools will absorb the MS-office style sheet that Packt uses. It doesn't work out well for typesetting math, but the technical editors are good about tracking down the formulae when they get lost in the conversions. These over-wrought do-too-much word processing nightmares leave a lot to be desired.

Third. Code. I use ActiveState Komodo Edit.  Both at work and outside of work. This rocks.

Web-Based Editing Fail

What's wrong with Jive or Blogger? The stark contrast between JavaScript-based text edit tools and HTML. It's either too little control or too much detail.

The JS-based editors are fine for simple, running text. They're actually kind of nice for that. Simple styles. Maybe a heading here or there.

Code? Ugh. Epic Fail.

It gets worse.

I've become a real fan of semantic markup. DocBook has a rich set of constructs available.  RST, similarly, has a short list of text roles that can be expanded to include the same kind of rich markup as DocBook. Sphinx leverages these roles to allow very sophisticated references to code from text. LaTeX has a great deal of semantic markup.

Web-based editors lack any of this. We have HTML. We have HTML microformats available. But. For a JavaScript web editor, we're really asking for a lot. More than seems possible for a quick download.

Desktop Tool Fail

What's wrong with desktop tools? We have very rich style sheets available. We should be able to define a useful set of styles and produce a useful document. Right?

Sadly, it's not easy.

First, the desktop tools are extremely tolerant of totally messed-up markup. Their focus is explicitly on making it look acceptable. It doesn't have to be well-structured. It just has to look good.

Second, and more important, the file formats are almost utterly opaque. Yes. There are standards now. Yes. It's all just XML. No. It's still nearly impossible to process. Try it.

Most word-processing documents feel like XML serializations of in-memory data structures. It's possible to locate the relevant document text in there somewhere. It's not like they're being intentionally obscure. But they're obscure.

Third, and most important, is the reliance on either complex GUI gestures (pointing and clicking and what-not) or complex keyboard "shortcuts" and stand-ins for GUI gestures. It might be possible to use that row of F-keys to define some kinds of short-cuts that might be helpful. But there's a lot of semantic markup and only a dozen keys, some of which have common interpretations for help, copy, paste, turn off the keyboard lights, play music, etc.

The Literate Programming ideal is to have the words and the code existing cheek by jowl. No big separation. No hyper-complex tooling. To me, this means sensible pure-text in-line markup.

Text Markup

I find that I really like RST markup. The more I write, the more I like it.

I really like the idea of writing code/documentation in a simple, uniform code-centric tooling. The pure-text world using RST pure-text markup is delightfully simple. 
  1. Write stuff. Words. Code. Whatever. Use RST markup to segregate the formal language (e.g. Python) from the natural language (e.g., English in my case.)
  2. Click on some icon on the right side of the screen (or maybe use an F-key) to run the test suite.
  3. Click on some icon (or hit a key) to produce prettified HTML page from python3 -m pylit3 doc.py doc.rst; rst2html.py doc.rst doc.html. Having a simple toolchain to emit doc from code (or emit code from doc) is a delight.
The genesis for this blog post was an at-work blog post (in Jive) that had a code error in it. Because of Jive's code markup features (using non-breaking spaces everywhere) there's no easy copy-and-paste to check syntax. It's nearly impossible to get the code off the web page in a form that's useful.

If people can't copy-and-paste the code, the blog posts are approximately worthless. Sigh.

If I rewrite the whole thing into RST, I lose the Jive-friendly markup. Now it looks out-of-place, but is technically correct.

Either. Or.

Exclusive Xor.

Ugh. Does this mean I have to think about gathering the Jive .CSS files and creating a version of those that's compatible with the classes and ID's that Docutils uses? I have some doubts about making this work, since the classes and ID's might have overlaps that cause problems.

Or. Do I have to publish on some small web-server at work, and use the <iframe> tag to include RST-built content on the main intranet? This probably works the best. But it leads to a multi-step dance of writing, publishing on a private server, and then using an iframe on the main intranet site. It seems needlessly complex.

Tuesday, December 8, 2015

Lynda and Educational Content

Just found http://www.lynda.com.

Unlike random YouTube videos, these are professionally edited.

Not everything on YouTube is poorly edited. Some are really good.

Having done a few webcasts for O'Reilly (and I have another scheduled for January 2016,) I know that my "you knows" -- you know -- and my "umms" are -- umm -- annoying.

I know professionals -- actors, pastors, lawyers -- who can extemporize really well. And it raises the bar a lot.

But the idea of having an editor clean up the "you knows" is appealing.

Tuesday, November 24, 2015

Coding Camp vs. Computer Science


Step 1, read this: "Dear GeekWire: A coding bootcamp is not a replacement for a computer science degree".   It's short, it won't hurt.

I got this comment.

"The world runs in legacy code and the cs degrees focus on leading edge 
Most of what is learned in cs [is] never used in the mainstream of business 
Much of computer work is repetitive and uninviting to upwardly mobile people who generally are moving up not improving the breed"

I disagree.  A lot.

"The world runs in legacy code." First, this is reductionist: everything that's been pushed to GitHub is now a "legacy". 
  • Does "legacy" mean "old, bad code?" If so, only CS grads will be equipped to make that judgement. 
  • Does "legacy" mean "COBOL?" If so, only CS grads will be able to articulate the problems with COBOL and make a rational plan to replace it with Microservices. 
  • Does "legacy" mean "not very interesting?" We'll return to this.
"CS degrees focus on leading edge." Not really true at all. The foundations of CS: data structures and algorithms, logic, and computability, haven't changed much since the days of Alan Turing and John von Neumann. They're highly relevant and form the core of a sensible curriculum.  

The "leading edge" would be some Java 1.8 nonsense or some Angular JS hokum. The kind of thing that comes and goes. The point of CS education is to make languages and language features just another thing, not something special and unique. A little CS background allows a programmer to lump all SQL databases into a broad category and deal with them sensibly. A Code Camp grad who only knows SQLite may have trouble seeing that Oracle is superficially different but fundamentally similar.

"cs is never used in the mainstream of business." True for some businesses. This is completely true for those businesses where "legacy" means "not very interesting." 

There is a great deal of not very interesting legacy code that fails to leverage a data structure more advanced than the flat file. This code is a liability, not an asset. The managers that let this happen probably didn't have a strong CS background and hired Code Camp graduates (because they're inexpensive) and created a huge pile of very bad code.

I've met these people and worked at these companies. It's a bad thing. The "leadership" that created such a huge pile of wasteful code needs to be fired. "All that bad code evolved during the 70's and 80's" isn't a very good excuse. A large amount of not interesting code can be replaced with a small amount of interesting code quickly and with almost zero risk.

Any company that's unable to pursue new lines of business because -- you know -- we've always done X and it's expensive to pivot to Y is deranged. They're merely holding onto their niche because they're paralyzed by fear of innovation=failure.

"Much of computer work is repetitive".  False. It's made repetitive by unimaginative management types who like to manage repetitive work. If you've done it twice, you need to be prepared to distinguish coincidence from pattern. When you've done it three times, that's a pattern, and you need to automate it. If you do it a fourth time, you're missing the opportunity to automate, wasting money instead of investing it.

"Much of computer work is ... uninviting to upwardly mobile people" Only in places where repetitive is permitted to exist.  If repetitive is not permitted, upward mobility will be the norm for the innovators.

"people who generally are moving up not improving the breed". I get this. The smart people move on. All we have left in this company are Code Camp graduates and their managers who value repetitive work and large volumes of not interesting code. 

Improving the Breed means what? 

Hiring CS graduates instead of Code Camp kiddies.




Navigation: Latitude, Longitude, Haversine, and all that

For a few years, I was a tech nomad. See Team Red Cruising for some stories of life on a sailboat. Warning: it's pretty dull.

As a tech nomad, I lived and died (literally) by my ability to navigate. Modern GPS devices make the dying part relatively unlikely. So, let's not oversell the danger aspect of this.

The prudent mariner plans a long voyage with a great deal of respect for the many things which can go wrong. One aspect of this is to create a "Float Plan". Read more about it here: http://floatplancentral.cgaux.org.

The idea is to create a summary of the voyage, provide that summary to trusted shore crew, and then check in periodically so that the shore crew can confirm that you're making progress safely. Failure to check in is an indicator of a problem, and action needs to be taken. We use a SPOT Messenger to check in at noon (and sometimes at waypoints.)

Creating a float plan involved an extract of the waypoints from our navigation software (GPS NavX). I would enrich the list of waypoints with estimated travel time between the points.  Folding in a departure time would lead to a schedule that could be tracked. I also include some navigation hints in the form of a bearing between waypoints so we know which way to steer to find the next point.

The travel time is the distance (in  nautical miles) coupled with an assumption about speed (5 knots.) It's a really simple thing. But the core haversine calculation is not a first-class part of any spreadsheet app. Because of the degrees-to-radians conversions required, and the common practice of annotating degrees with a lot of internal punctuation (38°54ʹ57″ 077°13ʹ36″), it becomes right awkward to simply implement this as a spreadsheet.
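
For reference, the core calculation is tiny once it's written in a real programming language. Here's a minimal sketch (not the navtools implementation itself), assuming a spherical earth and returning nautical miles:

from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, radius=3440.1):
    # Great-circle distance between two points, in nautical miles by default.
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return radius * 2 * asin(sqrt(a))

# Two made-up waypoints on the Chesapeake, as decimal degrees.
print(haversine(38.978, -76.493, 38.330, -76.454))  # roughly 39 nautical miles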

Some clever software has a good planning mode. The chartplotter on the boat can do a respectable job of estimating time between waypoints. But. It's not connected to a computer or the internet. So we can't upload that information in the form of a float plan. The idea of copying the data from the chart plotter to a spreadsheet is fraught with errors.

Navtools

Enter navtools. This is a library that I use to transform a route into a .csv with distances and bearings that I can use to create a useful float plan. I can add an estimated arrival time calculation so that a change to departure time creates the entire check-in schedule.

This isn't a sophisticated GUI app. It's just enough software to transform a GPS NavX extract file into a more useful form. The GUI was a spreadsheet (i.e., Numbers.) From this we created a PDF with the details.

Practically, we don't have good connectivity on the boat. So we would create a number of alternative plans ("leave tomorrow", "leave the day after", "leave next Monday", etc.), go ashore, find a coffee shop, and email the various plans to ourselves. They could sit in our inbox, waiting for weather and tide to be favorable.

Then, when the weather and tides were finally aligned, we could forward the relevant details to our trusted shore crew. This was a quick spurt of cell phone connectivity to forward an email. It worked out well. When the scheduled departure time arrived, we'd coax Mr. Lehman to life, raise the anchor and away.

Literate Programming

This is an exercise in literate programming. The code that's executed and the HTML documentation are both derived from source ReStructured Text (RST) documents. The documentation for the navigation module includes the math along with the code that implements the math.

I have to say that I'm enthralled with the intimate connection between requirements, design, and implementation that literate programming embodies.

I'm excited to (finally) publish the thing to GitHub. See https://github.com/slott56/navtools.  I'm looking at some other projects that require the navtools module. What I wind up doing is copying and pasting the navigation calculation module into other projects. I had something like three separate copies on my laptop. It was time to fold all of the features together, delete the clones, and focus on one authoritative copy going forward.

I still have to remove some crufty old code. One step at a time. First, get all the tests to pass. Then expunge the old code. Then make progress on the other projects that leverage the navtools.navigation module.

Tuesday, November 17, 2015

Events: PyCon 2016, OSCon 2016

Many years ago ('07?) I went to my first PyCon. My situation changed and I didn't get to another PyCon until last year.

The story is a kind of major dumbosity. In '07 I could expense the trip as education. In '08, I'd lost that feature of my employment. After that I was actively figuring out how to be self-employed as a writer and technomad, and completely took my eye off the various kinds of tax deductions and sponsorship opportunities that I might have leveraged. It was too complex, arbitrary, and bewildering for me.

PyCon is an energizing event.  I can't say enough good things about attending session after session on Python and the Python-related ecosystem. In particular, it's a joy to see people pitching their solutions to complex problems.

Here's a reminder: https://us.pycon.org/2016/

Since I do some work for O'Reilly media -- if a pair of webcasts count as work -- I think I want to see if I can finagle my way into OSCon, also.


I think I can leverage some material from Functional Python Programming to create an interesting tutorial.  My webcast on the five kinds of Python functions can expand into a bunch of hands-on-keyboard exercises to build examples of each kind of callable thingy. 

Proposals are in. Waiting for comments. Fingers crossed.

Tuesday, November 10, 2015

Formatting Strings and the str.format() family of functions -- Python 3.4 Notes

I have to be clear that I am obsessed with the str.format() family of functions. I've happily left the string % operator behind. I recently re-discovered the vars() function.

My current go-to technique for providing debugging information is this:

print( "note: local={local!r}, this={this!r}, that={that!r}".format_map(vars)) )

I find this to be handy and expressive. It can be replaced with logging.debug() without a second thought. I can readily expand what's being dumped because all locals are provided by vars().

I also like this as a quick and dirty starting point for a class:

def __repr__(self):
    return "{__class__.__name__}(**{state!r})".format(__class__=self.__class__, state=vars(self))

This captures the name and state. But. There are nicer things we can do. One of the easiest is to use a helper function to reformat the current state in keyword parameter syntax, like this:

def args(obj):
    return ", ".join( "{k}={v!r}".format(k=k,v=v) for k,v in vars(obj).items())

This allows us to dump an object's state in a slightly nicer format. We can replace vars(self) with args(self) in our __repr__ method. We've dumped the state of an object with very little class-specific code. We can focus on the problem domain without having to wrestle with Python considerations.
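
Putting the pieces together, here's a hypothetical class (not from any real project) whose __repr__ leans on that args() helper:

class Waypoint:
    def __init__(self, name, lat, lon):
        self.name = name
        self.lat = lat
        self.lon = lon

    def __repr__(self):
        # args() is the helper function defined above.
        return "{__class__.__name__}({state})".format(
            __class__=self.__class__, state=args(self))

print(Waypoint("Green 5", 38.978, -76.493))
# e.g. Waypoint(name='Green 5', lat=38.978, lon=-76.493)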

Format Specifications

The use of !r for formatting is important. I've (frequently) messed up and used things like :s where data might be None. I've discovered that -- starting in Python 3.4 -- the :s format is unhappy with None objects. Here's the exhaustive enumeration of cases. 

>>> "{0} {1}".format("s",None)
's None'
>>> "{0:s} {1:s}".format("s",None)
Traceback (most recent call last):
  File "", line 1, in 
    "{0:s} {1:s}".format("s",None)
TypeError: non-empty format string passed to object.__format__
>>> "{0!s} {1!s}".format("s",None)
's None'
>>> "{0!r} {1!r}".format("s",None)
"'s' None"

Many things are implicitly converted to strings. This happens in a lot of places. Python is riddled with str() function evaluations. But they aren't everywhere. Python 3.3 had one that was removed for Python 3.4 and up.

Bottom Line: be careful where you use :s formatting.  It may do less than you think it should do.

Tuesday, November 3, 2015

Needlessly Redundant Overcommunication and DevOps

At the "day job" I use a Windows laptop. It was essential for a project I might have started, but didn't. So now I'm stuck with it until the budgetary gods deem that it's been paid for and I can request something more useful.  Mostly, however, Windows is fine. It doesn't behave too badly and most of the awful "features" are concealed by Python's libraries.

This is context for a strange interaction today. It seems to exemplify DevOps and the cruddy laptop problem.

The goofy Microsoft Office Communicator -- the one that's so often used instead of a good chat program like Slack or HipChat -- pinged.  The message went something like this.

"I sent you an email just now. Can you read it and reply?"

I was stunned. Too stunned to save the text.  This is either someone being aggressive almost to a point that hints at rudeness, or someone vague on how email works. Let's assume the second option. I can only reply, "I agree with you, that is how email works."

The email was a kind of vague question about server provisioning.  It was something along the lines of

"Do we provision our own server with Ansible or Chef? Or is there a team to provision servers for us? ..."

It went on to describe details of a fantasy world where someone would write Chef scripts for them. The rest of the email mostly ignored the first question.

The Real Question

If you're familiar with DevOps as a concept, then server provisioning is -- like most problems -- something that the developers need to solve. Technical Support folks may provide tools (Ansible, for example) to help build the server, but there aren't a room full of support people waiting for your story ("make me a server") to appear on their Kanban board.

Indeed, there was never the kind of support implied in the email, even in non-DevOps organizations. In a "traditional" Dev-vs.-Ops organization, the folks that built servers were (a) overbooked, (b) uninterested in the details of our particular problem, or (c) only grudgingly willing to let us use an existing server that didn't quite fit our requirements. They rarely built servers for us.

Reason A, of course, is business as usual. Unless we're the Hippo (Highest Paid Person in the Organization,) there's always some other project that's somehow more important than whatever foolishness we're engaged in. How many times have we been told that "The STARS Project is tying up all our resources. It will be 90 days before..."? Gotcha. The bad part about this situation is when the person paying the bills says to me "You need to make them respond." How -- precisely -- do you propose that I change the internal reward system of the ops people?

We could label this as a passive-aggressive approach. They're waiting for us to establish a schedule so that they can shoot it down. Or maybe that's reading way too much into the situation. Maybe they're really just overbooked.

Regarding reason B. Years ago, I had a hilarious interaction where we sent a stream of emails explaining our server requirements. The emails were not exactly ignored. But. When we asked about the status of our servers, the person responsible for the team brought a yellow pad and wrote down the requirements. I read the email to them. Without a trace of embarrassment, they wrote down what I was reading from an email.  (It was long enough ago, that we didn't have laptops, and I had a hard-copy of the email. They refused the hard-copy. I had to read it. Really.)

Were they clueless about how email works? Or. Was this a kind of passive-aggressive approach to architecture where our input was discounted to zero because it didn't count until they wrote it on their yellow pad? The behavior was bizarre.

Something similar happened with another organization. We made server recommendations. They didn't like the server recommendations. Not because the recommendations seemed wrong, but because we didn't have a formal sciency-seeming methodology for fantasizing about servers that were required to support the fantasy software which hadn't been written yet. They felt it necessary to complain. And when we talked with hardware vendors, they felt it necessary to customize the cheap commodity servers.
[It got weirder. They were convinced that a server farm needed to be designed from the bottom up. I endured a lecture on how a properly sciency-seeming methodology started by deciding on L1 and L2 cache sizing and bus timing, then worked through memory allocation. I slowly grew to see that they had no clue what they were talking about when buying commodity servers by the rack-full for software that didn't exist yet.]
We all know about reason C. The reason for DevOps is to avoid being stuffed into a kind of random server where there are upgrades that we all have to agree on. Or -- worse -- a server that can't be upgraded because no one will agree. A single app team vetoes all changes.

"We can't install Anaconda 3 because we know that Python 3 is incompatible with Python 2"...

What?

I stopped understanding at that point. It seemed like the rest of the answer amounted to "having the second Anaconda on a separate path could lead to problems. It can't be proven that no problems will arise, so we'll assume that -- somehow -- PATH settings will get altered randomly and a Python 2 job will crash because it accidentally had the wrong PATH and accidentally ran with Python 3."

It was impossible to explain that this is a non-problem. Their response was "But we can't be sure." That's the last resort of someone who refuses to change. And it's the final answer. Even if you do a proof-of-concept, they'll find reasons to doubt the POC's results because they can't be sure the POC mirrors production.

The Real Answer

The answer to the original Ping and the Email was "You're going to do this yourself."  I included links to four or five corporate missives on Chef, Ansible, DevOps, and how to fill in the form for a cloud server.

I have my doubts -- though -- that this would be seen as helpful.

They may not be happy because they don't get to use Communicator and Email and someone else's Kanban board to get this done. They don't get to ask someone else what they're doing and why they're not getting it done on time. They don't get to second-guess their technical decisions. They actually have to do it. And that may not work out well.

The truly passive-aggressive don't seem to do things by themselves. It appears to me that they spend a lot of time looking for reasons to stall. Either they need to get more information, or get organized, or they need some kind of more official "permission" to proceed. Lacking any further information, I chalk it up to them only feeling successful when they've found the flaws in what someone else did.

It's challenging sometimes to make it clear that a rambling email asking for someone else to help is going nowhere. A Communicator ping followed by an email isn't actually getting anything done. It's essentially stalling, waiting for more information, getting organized, or waiting for permission. Overcommunication can become a stalling tactic or maybe a way to avoid responsibility.

I'm stuck with a cruddy laptop because the budget gods have laid down some laws that don't make a lick of technical sense. I think that the short-sighted "use it until it physically wears out" might be more costly than "find the right tool, we'll recycle the old one appropriately." In the same way, the shared server world view is clearly costly.  We shouldn't share a server "because it's there."

The move to DevOps allows us to build a server rather than discuss building a server.

I want a DevOps parallel for my developer workstation. I don't want permission or authorization. I don't want to overcommunicate with the budget gods. I want a workstation unencumbered by permission-seeking.

Tuesday, October 27, 2015

The Internet of Things

Wunderbar.

A whole bunch of nicely integrated data collection modules.

I prefer to hack around with Arduino.  I'm not sure why -- perhaps it's the lure of building approximately from scratch.

But this is very cool. No soldering. Just start gathering data.

I have a half-built Arduino-based device to measure the position of the steering quadrant on a sailboat. I really need to take the next few steps and finalize the design so that I can order a few boards from Fritzing and try it out for real. I've had it in pieces here and there for about 3 years. The open issue was (and still is) a digital potentiometer that sets the output voltage level. I think I have the right chip for this. I think I have the wrong resistors that adjust the voltage into the proper range. The response curve for the parts I rigged up (years ago) wasn't linear enough.

Then I moved. And moved again. And wrote a bunch of books on Python. And I'm about to move again. I need to finish this and get it off my desk. Literally.

The good news is that I took careful notes. Including pictures. So I can break out the boards and mess around a bit. I have three breadboards covered with jumpers, LED's, buttons, and stuff all piled up around the laptop.

The Wunderbar has a light/color/proximity sensor. I've built just the proximity sensor with an Arduino. Reporting the output as a resistance that can be used on 12V boat systems was the stumbling block for me.

After the next move... (Something I've said before.)

Tuesday, October 20, 2015

Why Computer Science for All is good for all

An Open Letter from the Nation’s Tech and Business Leaders: Why Computer Science for All is good for all.

"These are the skills and competencies that will power the growth of every industry..."

Civic leaders and educators need to be in on this. And professionals who have skills to share need to be in on this also. It's not limited to New York City. It's a nationwide (perhaps world-wide) need for skills. There are a lot of talented people. Some of them haven't had the right sequence of opportunities to realize their talents.

Wednesday, October 14, 2015

Chapters to Edit: What do I do instead?

I'm starting to get chapters back from the technical reviewers. This is an important part of the writing process: correcting my mistakes and clarifying things that confused the reviewers.

Packt has had a uniformly excellent cadre of technical reviewers. At this point, I've worked with something like a dozen people on four different books. It's been great (for me) to get detailed, specific feedback point by point.

Instead of working on my reviewed chapters, however, I'm browsing. It's Python Week.


I'll get to the chapters Thursday, I think.

Tuesday, October 13, 2015

Wait, there's more Python goodness from Packt

This just in...

Here's a link to the actual Python Week page, with all the deals there for the week: https://www.packtpub.com/packt/offers/pythonweek

They also have a week of free Python books too, which change daily: https://www.packtpub.com/packt/offers/free-learning/

Feel free to ruthlessly exploit their largess and build your personal technical library.


Tuesday, October 6, 2015

Today's Milestone: Refactoring and Django Migrations

Once upon a time, when today's old folks were young, we'd debate the two project strategies: Hard Part Do Later (HPDL) vs. Hard Part First (HPF).

The HPDL folks argued that you could pick away at the hard part until -- eventually -- it wasn't hard any more. This doesn't often work out well in practice, but a lot of people like it. Sometimes the attempt to avoid the hard part makes it harder.

The HPF folks, on the other hand, recognized that solving the hard problem correctly, may make the easy problems even easier. It may not, but either way, the hard part was done.

The debate would shift to what -- exactly -- constituted the hard part. Generally, what one person finds hard, another person has already done several times before. It's the part that no one has done before that eventually surfaces as being truly hard.

Young kids today (get off my lawn!) often try to make the case that an Agile approach finesses the "hard part" problem. We define a Minimally Viable Product (MVP) and we (magically) don't have to worry about doing the hard part first or last.

They're wrong.

If the MVP happens to include the hard part, we're back at HPF. If the MVP tries to avoid the hard part, we're looking at HPDL.

The Novelty Factor

Agile methods don't change things. We still have to Confront the Novelty (CTN™). Either it's new technology or it's a new problem domain or a new solution to an existing problem domain. Something must be novel, or we wouldn't be writing software, we'd be downloading it.

I'm a HPF person. If you set the hard part aside to do later, all the things you do instead become constraints, limiting your choices for solving the hard part that comes later. In some rare cases, you can decompose the hard part and solve it in pieces. The decomposition is simply Hard Part First through Decomposition (HPFtD™) followed by Prioritize the Pieces (PtP™) and another round of Hard Part First.

Today, we're at a big milestone in the HPF journey.

The application's data model is simple. However.

The application has a complex pipeline of processing to get from source data to the useful data model.

A strict (and dumb) MVP approach would skip building the complex pipeline and assume that it was magically implemented somehow.

A slightly smarter MVP approach uses some kind of technical spike solution to handle the complex pipeline. We do that manually until we get past MVP and decide to implement the pipeline in something more final and complete.

My HPF strategy tackles the complex pipeline because we have to build it anyway and it's hard. We don't have to build all of it. Just enough to lay out the happy path.

The milestone?

It's time to totally refactor because -- even doing the hard part first -- we have the wrong things in the wrong places. Django application boundaries generally follow the "resources". It's a lot like designing a RESTful API. Define the resources, cluster them together in some kind of ontology that provides a meaningful hierarchy.

Until -- of course -- you get past the problem domain novelty and realize that some portion of the hierarchy is going to become really lopsided. It needs to be restructured so we have a flat group of applications.

Wait. What?

Flatten?

Yes.

When we have a Django application model that's got eleventy-kabillion classes, it's too big. Think the magic number 7±2: there's a limit to our ability to grasp a complex model.

Originally, we thought we'd have apps "A", "B", and "C". However. "A" turned out to be more complex than it seemed when we initially partitioned the apps. Based on the way the classes are named and clustered in the model file, it's clear that an internal structure is struggling to emerge. There are too many comments and high-level organizational hints in the docstrings.

It looks like this might be the model that's emerging:
  • Former A
    • A1
    • Conceptual A2
      • A2a
      • A2b
    • A3
  • B
  • C
This means that there will be classes in A3 that depend on separate apps A2a and A2b. Further, A2 is really just a concept that unifies the design; it doesn't need to be implemented as a proper app. Both A2a and A2b depend on A1. A3 depends on A2a, A2b, and A1.  

Ugh. Refactoring. And the associated migrations. 

Django allows us to have nested apps. But. Do we really want to go there? Is a nested collection of packages really all that helpful? 

Or.

Would it be better to flatten the whole thing, and simply annotate the dependencies among apps?

The Zen Of Python suggests that Flat is Better than Nested.

The hidden benefit of Flat is that the Liskov Substitution Principle is actually a bit easier to exploit. Yes, we have a tangled web of dependencies, but we're slightly less constrained when all of the Django apps are peers. Yes, many things will depend on the A1 app, but that will be less of a problem than the current pile of classes is.
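
As a rough illustration of what "flat peers with annotated dependencies" looks like in practice -- hypothetical app and model names, not the real project:

# settings.py -- all apps are peers; dependencies live in the models, not the tree.
INSTALLED_APPS = [
    "a1",   # foundation models
    "a2a",  # depends on a1
    "a2b",  # depends on a1
    "a3",   # depends on a1, a2a, a2b
    "b",
    "c",
]

# a3/models.py -- an A3 model reaches across peer apps via explicit foreign keys.
from django.db import models

class Summary(models.Model):
    source = models.ForeignKey("a2a.Measurement", on_delete=models.CASCADE)
    context = models.ForeignKey("a2b.Context", on_delete=models.CASCADE)
    owner = models.ForeignKey("a1.Owner", on_delete=models.CASCADE)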

The important part here is to start again. This means I need to discard the spike database and discard the history of migrations to date. I always hate disrupting my development database, since it has test cases I know and remember.

That's the disruptive milestone for me: discarding the old database and starting again.

Tuesday, September 29, 2015

Python 3.5 and the Upgrade Strategy

Start here: https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-484

While new syntax is important, remember your audience in pitching the upgrade from Python 2.7. You may need to pander to people who aren't programmers or don't really know Python.

When selling the upgrade, it can help to focus on the objective measures.
  1. Performance. When anyone asks why we should disturb our precious Python 2 ecosystem, point out the performance improvements. Begin with Python 3.2, 3.3, 3.4, and then 3.5 improvements. The union of these is an impressive list. Faster is better, right?
  2. New Libraries. For some folks who don't know Python well, it helps to give them a concrete list of features you absolutely require. Seriously. Enumerate all the new libraries from Python 3.2, ..., 3.5. It's a big list. Some of them have been backported, so this list isn't a complete win. You may not really need all of them, but use them to bolster your case.
  3. Other Cleanups. These are important for folks who use Python daily, but aren't too impressive to manager types who aren't deeply into the language details.
    1. The fact that Python 3 handles class/type better than Python 2 isn't impressive to anyone who hasn't dealt with it. 
    2. The fact that Python 3 handles Unicode better than Python 2 isn't going to impress too many people, either. 
    3. The print statement issue will cause some managers to claim that the upgrade is "risky". 
    4. The division issue is a complete win. Weirdly, nay-sayers will claim (a) just use float() a lot, (b) just add +0.0 a lot, or (c) just add from __future__ import division a lot. How are these workarounds better? No clue. Be prepared to make the case that the dumb workarounds are... well... dumb. (A small sketch of the difference follows this list.)
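
A tiny sketch of what actually changes, for anyone who hasn't been bitten by it:

# Python 2: 7 / 2 == 3 (silent floor division). Python 3: true division.
print(7 / 2)    # 3.5 in Python 3
print(7 // 2)   # 3 -- explicit floor division when that's what you actually mean

# The workarounds the nay-sayers propose for staying on Python 2:
print(float(7) / 2)    # (a) sprinkle float() everywhere
print((7 + 0.0) / 2)   # (b) sprinkle +0.0 everywhere
# (c) put "from __future__ import division" at the top of every module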
You can also lift up the new type hints and http://mypy-lang.org. If you do, be prepared for snark from the Java/Scala crowd. These folks will (wrongly) claim that a partial type proof is useless, and static type checking is mandatory. This is a difficult discussion to have because the "type safety is important" crowd don't seem to recognize the awful gyrations they're forced into so they can write generic code that's type-agnostic. All Python code is type-agnostic; the type checking just confirms some design constraints. The presence of differing strategies -- type-specific code vs. generic type-agnostic code -- means that neither is right, and the argument is moot.
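
If you do bring it up, a short example helps show how modest the change is. The annotations below are optional and ignored at runtime; a checker such as mypy reads them separately:

from typing import List

def mean(values: List[float]) -> float:
    # Ordinary Python, plus PEP 484 annotations.
    return sum(values) / len(values)

print(mean([3.0, 5.0, 7.0]))   # 5.0 -- runs the same with or without mypy
# mypy would flag mean("oops") as an error before the code ever runs.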

Don't focus on async/await. Yes, it's first on the Python web site, but, it can be a tough sell.

Performance

The easy sell is this impressive list of optimizations.

3.2 
  • Peephole optimizer improvements
  • Serializing and unserializing data using the pickle module is now several times faster.
  • The Timsort algorithm used in list.sort() and sorted() now runs faster and uses less memory when called with a key function. 
  • JSON decoding performance is improved and memory consumption is reduced whenever the same string is repeated for multiple keys. 
  • Recursive locks (created with the threading.RLock() API) now benefit from a C implementation which makes them as fast as regular locks, and between 10x and 15x faster than their previous pure Python implementation.
  • The fast-search algorithm in stringlib is now used by the split(), splitlines() and replace() methods on bytes, bytearray and str objects. Likewise, the algorithm is also used by rfind(), rindex(), rsplit() and rpartition().
  • Integer to string conversions now work two “digits” at a time, reducing the number of division and modulo operations.
  • Several other minor optimizations. 
    • Set differencing now runs faster when one operand is much larger than the other
    • The array.repeat() method has a faster implementation
    • The BaseHTTPRequestHandler has more efficient buffering
    • The operator.attrgetter() function has been sped-up
    • ConfigParser loads multi-line arguments a bit faster
3.3 
  • Some operations on Unicode strings have been optimized
  • UTF-8 is now 2x to 4x faster. UTF-16 encoding is now up to 10x faster.
3.4
  • The UTF-32 decoder is now 3x to 4x faster.
  • The cost of hash collisions for sets is now reduced. 
  • The interpreter starts about 30% faster. 
  • bz2.BZ2File is now as fast or faster than the Python2 version for most cases. lzma.LZMAFile has also been optimized.
  • random.getrandbits() is 20%-40% faster for small integers.
  • By taking advantage of the new storage format for strings, pickling of strings is now significantly faster.
  • A performance issue in io.FileIO.readall() has been solved. 
  • html.escape() is now 10x faster.
3.5
  • The os.walk() function has been sped up by 3 to 5 times on POSIX systems, and by 7 to 20 times on Windows. 
  • Construction of bytes(int) (filled by zero bytes) is faster and uses less memory for large objects.
  • Some operations on ipaddress IPv4Network and IPv6Network have been massively sped up,
  • Pickling of ipaddress objects was optimized to produce significantly smaller output. 
  • Many operations on io.BytesIO are now 50% to 100% faster.
  • The marshal.dumps() function is now faster: 65-85% with versions 3 and 4, 20-25% with versions 0 to 2 on typical data, and up to 5 times in best cases. 
  • The UTF-32 encoder is now 3 to 7 times faster. 
  • Regular expressions are now parsed up to 10% faster.
  • The json.dumps() function was optimized.
  • The PyObject_IsInstance() and PyObject_IsSubclass() functions have been sped up.
  • Method caching was slightly improved, yielding up to 5% performance improvement in some benchmarks. 
  • Objects from random module now use two times less memory on 64-bit builds. 
  • The property() getter calls are up to 25% faster.
  • Instantiation of fractions.Fraction is now up to 30% faster.
  • String methods find(), rfind(), split(), partition() and in string operator are now significantly faster for searching 1-character substrings.
I think this list can help move an organization away from Python 2 and toward Python 3. This list and a lot of lobbying from folks who know what the improvements are.

Library

Here's the library upgrade list, FWIW.
The details of the improvements can be overwhelming.

The dozen new modules, however, might help overcome organizational inertia to make progress on ditching Python2. I've been making heavy use of statistics. I need to make better use of pathlib in future projects.

Tuesday, September 22, 2015

Python Tutor

Read This: http://radar.oreilly.com/2015/08/learning-programming-at-scale.html

The core visualization tool (pythontutor.com) can be helpful for many people. The shared environments seem like a cool idea, also, but I don't have any specific comments on the other tools.

While this looks very cool, I'm not a huge fan of this kind of step-by-step visualization. It uses very clear graphics and looks very clever, but it has some limitations. I think that some aspects of "visualization" can be misleading. Following an execution path for a specific initial condition can obscure events and conditions that aren't on the happy path. It's not clear how a group of statements establishes a more general condition.

I'm a fan of formal post-conditions. From these, we can postulate a statement, and work out the weakest precondition for the statement. As we work through this exercise, we create a formal proof and a program. It's very elegant. And it covers the general case, not specific examples.
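
Here's a tiny worked example of that style of reasoning, using a made-up goal:

# Goal (post-condition): m == max(a, b)
# Candidate statement:   m = a   -- weakest precondition: a >= b
# Candidate statement:   m = b   -- weakest precondition: b >= a
# The two preconditions cover every case, which gives us the program:

def maximum(a, b):
    if a >= b:
        m = a      # here a >= b, so m == max(a, b)
    else:
        m = b      # here b > a, so m == max(a, b)
    return m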

Most importantly, this effort depends on having formal semantics for each statement. To write code, we need a concise definition of the general state change made by each statement in a language. We're looking at the general case for each statement rather than following a specific initial condition through a statement.

Sidebar.
In C, what does this do? a[i++] = ++i; There is no formal definition. The statement specifies three state changes, but how are they ordered? No matter what initial values for a[] and i we provide, this is still pretty murky. A debugger only reveals the specific implementation being debugged.
Visualization may help some people understand the state change created by a statement. Some people do learn things by watching this kind of "debugger" mode. In particular, this may help because it has much better graphics than the built-in character-mode debugger.

This idea works best with programs that already make sense: programs that are well designed. Programs that make orderly progress from some initial state to the desired final state.

Programs written by learners may not be all that clean. Realistically, they may be inept. They may even reach the far end of the spectrum and be downright bad.

While this tool is graphically gorgeous, it's still a debugger. It wallows around in an internal world in which the formal semantics can get obscured. The general case can't easily be shown.

We have a forest and trees problem here. A debugger (or other statement-by-statement visualization tool) emphasizes each individual tree. The larger structures of glades, thickets, groves, stands, brakes, and coppices are lost to view.

The humble while statement (especially one with an internal if-break) can be extremely difficult to understand as a single statement. If we break down the statement-by-statement execution, the presence of two termination conditions (one on the while clause and one on the if clause) can be obscured because a visualization must follow a specific initial condition.
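
Here's a hedged sketch of the kind of loop I mean; the names are invented. Tracing it with one input exercises only one of the two exits.

def find_first_negative(values, limit):
    """Scan at most limit values, stopping early at the first negative."""
    index = 0
    while index < limit:          # termination condition #1
        if values[index] < 0:     # termination condition #2
            break
        index += 1
    return index

print(find_first_negative([3, 1, -2, 7], limit=4))   # 2: stopped by the break
print(find_first_negative([3, 1, 2, 7], limit=4))    # 4: stopped by the while clause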

With really well-written tutorials -- and some necessary metadata -- a super-visualizer might be able to highlight the non-happy-path logic that exists.  This alternate path viewing could be helpful for showing how complex logic works (and doesn't work.)

With programs written by learners -- programs which are inept and won't have appropriate metadata -- a super-visualizer would need to reason very carefully about the code to determine what happy path and non-happy-path kinds of logic are present. It would have to locate and highlight

  • contradictory elif clauses, 
  • gaps among elif clauses, 
  • missing else clauses, 
  • hellishly complex else clauses, 
  • break conditions, 
  • continue conditions, as well as 
  • exception handling.
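
As a concrete (and entirely hypothetical) illustration of "gaps among elif clauses": nothing below handles a score of exactly 70, and nothing signals the omission.

def grade(score):
    # Learner-style code with a gap: a score of exactly 70
    # falls through every branch and silently returns None.
    if score > 90:
        return "A"
    elif 70 < score <= 90:
        return "B"
    elif score < 70:
        return "C"

print(grade(70))    # None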

For truly bad programs, the super-visualizer may be stumped as to what is intended. Indeed, it may be impossible to determine how it can be displayed meaningfully to show alternatives and show how the specific code generalizes into a final outcome.

def this_program_terminates(some_code):
    ...  # details omitted

def demo():
    while this_program_terminates(demo):
        print("w00t w00t")

What does this do? How can any visualizer aid the student to show problems?

To take this one step further, I think this kind of thing might also be hazardous to learning how the functional programming features of Python work. I think that exposing the underlying mechanics of a generator expression might be more confusing than simply treating it as a "lazy list."
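
The "lazy list" view fits in a few lines: nothing is computed until something consumes the expression.

squares = (n * n for n in range(5))   # nothing computed yet
print(next(squares))                  # 0, computed on demand
print(next(squares))                  # 1
print(list(squares))                  # [4, 9, 16], the rest on demand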

The visualization tool is very nice. But it isn't perfect. Something that supports reasoning about the general post-conditions established by a statement would be more useful than a step-by-step debugger with great graphics.

Tuesday, September 15, 2015

Exploratory Data Analysis in Functional-Style Python

Here are some tricks to working with log file extracts. We're looking at some Enterprise Splunk extracts. We can fiddle around with Splunk, trying to explore the data. Or we can get a simple extract and fiddle around with the data in Python.

Running different experiments in Python seems to be more effective than trying to do this kind of exploratory fiddling in Splunk. Primarily because there aren't any boundaries on what we can do with the data. We can create very sophisticated statistical models all in one place.

Theoretically, we can do a lot of exploration in Splunk. It has a variety of reporting and analytical features.

But...

Using Splunk presumes we know what we're looking for. In many cases, we don't know what we're looking for: we're exploring. We may have some indication that a few RESTful API transactions are slow, but little more than that. How do we proceed?

Step one is to get raw data in CSV format. Now what?

Reading Raw Data

We'll start by wrapping a csv.DictReader object with some additional functions.

Object-Oriented Purists will object to this strategy. "Why not just extend DictReader?" they ask. I don't have a great answer. I lean toward functional programming and the resulting orthogonality of components. With a purely OO approach, we have to use more complex-seeming mixins to achieve this.

Our general framework for processing logs is this.

import csv

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)

This allows us to read the CSV-formatted Splunk extract. We can iterate through rows in the reader. Here's trick #1. It's not really very tricky, but I like it.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    for row in rdr:
        print( "{host} {ResponseTime} {source} {Service}".format_map(row) )

We can -- to a limited extent -- report raw data in a helpful format. If we want to dress up the output, we can change the format string. Maybe "{host:30s} {ResponseTime:8s} {source:s}" or something like that.

Filtering

A common situation is that we've extracted too much, and only need to see a subset. We can change the Splunk filter, but, we hate to overcommit before we've finished our exploration. It's far easier to filter in Python. Once we've learned what we need, we can finalize in Splunk.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log')
    for row in rdr_perf_log:
        print( "{host} {ResponseTime} {Service}".format_map(row) )

We've injected a generator expression that will filter the source rows, allowing us to work with a meaningful subset.

Projection

In some cases, we'll have additional columns of source data that we don't really want to use. We'll eliminate this data by making a projection of each row.

In principle, Splunk never produces an empty column. However, RESTful API logs may lead to data sets with a huge number of unique column titles based on surrogate keys that are part of request URI's. These columns will have one row of data from the one request that used that surrogate key. For every other row, there's nothing useful in that column. Life is much simpler if we remove the empty columns from each row.

We can do this with a generator expression, also, but it gets a bit long. A generator function is somewhat easier to read.

def project(reader):
    for row in reader:
        yield {k:v for k,v in row.items() if v}

We've built a new row dictionary from a subset of the items in the original reader. We can use this to wrap the output of our filter.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log')
    for row in project(rdr_perf_log):
        print( "{host} {ResponseTime} {Service}".format_map(row) )

This will reduce the number of unused columns visible inside the body of the for statement.

Notation Change

The row['source'] notation will get clunky. It's much nicer to work with a types.SimpleNamespace than a dictionary. This allows us to use row.source.

Here's a cool trick to create something more useful.

rdr_ns = (types.SimpleNamespace(**row) for row in reader)

We can fold this into our sequence of steps like this.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log')
    rdr_proj = project(rdr_perf_log)
    rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj)
    for row in rdr_ns:
        print( "{host} {ResponseTime} {Service}".format_map(vars(row)) )

Note the small change to our format_map() method. We've added the vars() function to extract a dictionary from the attributes of a SimpleNamespace.

We could write this as a function to preserve syntactic symmetry with other functions.

def ns_reader(reader):
    return (types.SimpleNamespace(**row) for row in reader)

Indeed, we could write this as a lambda construct which is used like a function.

ns_reader = lambda reader: (types.SimpleNamespace(**row) for row in reader)

While the ns_reader() function and the ns_reader() lambda are used the same way, it's slightly harder to write a docstring and doctest unit test for a lambda. For this reason, the lambda should probably be avoided.
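
Here's roughly what the function version looks like once we give it a docstring and a doctest; the lambda has no tidy place to put either.

import types

def ns_reader(reader):
    """Convert each dict-like row into a SimpleNamespace.

    >>> rows = ns_reader([{'host': 'app01', 'ResponseTime': '0.25'}])
    >>> next(rows).host
    'app01'
    """
    return (types.SimpleNamespace(**row) for row in reader)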

We can use map(lambda row: types.SimpleNamespace(**row), reader). Some folks prefer this over the generator expression.

We could use a proper for statement with an internal yield statement, but there doesn't seem to be any benefit from making a big statement out of a small thing.

We have a lot of choices because Python offers so many functional programming features. We don't often see Python touted as a functional language. Yet, we have a variety of ways to handle a simple mapping.

Mappings: Conversions and Derived Data

We'll often have a list of data conversions that are pretty obvious. Plus, we'll have a growing list of derived data items. The derived items will be dynamic and are based on different hypotheses we're testing. Each time we have an experiment or question, we might change the derived data.

Each of these steps -- filtering, projection, conversion, and derivation -- is a stage in the "map" portion of a map-reduce pipeline. We could create a number of smaller functions and apply them with map(). However, because we're updating a stateful object in place, we can't use the general map() function cleanly. If we wanted to achieve a more pure functional programming style, we'd use an immutable namedtuple instead of a mutable SimpleNamespace.
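
For the record, here's a hedged sketch of that more pure alternative. It consumes the projected dictionary rows directly and builds a new immutable row instead of updating one in place. The field names echo the examples in this post; the LogRow name is my own invention.

from collections import namedtuple
import datetime

LogRow = namedtuple("LogRow", ["host", "service", "time", "response_time"])

def convert_immutable(reader):
    """Build a new immutable row from each projected dictionary row."""
    for row in reader:
        yield LogRow(
            host=row["host"],
            service=row["Service"],
            time=datetime.datetime.strptime(row["Time"], "%Y-%m-%dT%H:%M:%S.%f%Z"),
            response_time=float(row["ResponseTime"]),
        )

For exploratory work, though, the mutable SimpleNamespace is easier to tinker with; the conversion function looks like this.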

def convert(reader):
    for row in reader:
        row._time = datetime.datetime.strptime(row.Time, "%Y-%m-%dT%H:%M:%S.%f%Z")
        row.response_time = float(row.ResponseTime)
        yield row

As we explore, we'll adjust the body of this conversion function. Perhaps we'll start with some minimal set of conversions and derivations. We'll extend this with some "are these right?" kind of things. We'll take some out when we discover that they don't work.

Our overall processing looks like this:

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log')
    rdr_proj = project(rdr_perf_log)
    rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj)
    rdr_converted = convert(rdr_ns)
    for row in rdr_converted:
        row.start_time = row._time - datetime.timedelta(seconds=row.response_time)
        row.service = some_mapping(row.Service)
        print( "{host:30s} {start_time:%H:%M:%S} {response_time:6.3f} {service}".format_map(vars(row)) )

Note that change in the body of our for statement. Our convert() function produces values we're sure of. We've added some additional variables inside the for loop that we're not 100% sure of. We'll see if they're helpful (or even correct) before updating the convert() function.

Reductions

When it comes to reductions, we can adopt a slightly different style of processing. We need to refactor our previous example, and turn it into a generator function.

def converted_log(some_file):
    with open(some_file) as source:
        rdr = csv.DictReader(source)
        rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log')
        rdr_proj = project(rdr_perf_log)
        rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj)
        rdr_converted = convert(rdr_ns)
        for row in rdr_converted:
            row.start_time = row._time - datetime.timedelta(seconds=row.response_time)
            row.service = some_mapping(row.Service)
            yield row

We've replaced the print() with a yield.

Here's the other part of this refactoring.

for row in converted_log("somefile.csv"):
    print( "{host:30s} {start_time:%H:%M:%S} {response_time:6.3f} {service}".format_map(vars(row)) )

Ideally, all of our programming looks like this. We use a generator function to produce data. The final display of the data is kept entirely separate. This allows us to refactor and change the processing much more freely.

Now we can do things like collect rows into Counter() objects, or perhaps compute some statistics. We might use a defaultdict(list) to group rows by service.

from collections import defaultdict
import statistics

by_service = defaultdict(list)
for row in converted_log("somefile.csv"):
    by_service[row.service].append(row.response_time)
for svc in sorted(by_service):
    m = statistics.mean( by_service[svc] )
    print( "{svc:15s} {m:.2f}".format_map(vars()) )

We've decided to create concrete list objects here. We can use itertools to group the response times by service. It looks like proper functional programming, but the implementation points up some  limitations in the Pythonic form of functional programming. Either we have to sort the data (creating a list object) or we have to create lists as we group the data. In order to do several different statistics, it's often easier to group data by creating concrete lists.

Rather than simply printing a row object, we're now doing two things.
  1. Create some local variables, like svc and m. We can easily add variance or other measures.
  2. Use the vars() function with no arguments, which creates a dictionary out of the local variables.
This use of vars() with no arguments -- which behaves like locals() -- is a handy trick. It allows us to simply create any local variables we want and include them in the formatted output. We can hack in as many different kinds of statistical measures as we think might be relevant.
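
For comparison, here's roughly what the itertools.groupby() version would look like. It relies on the converted_log() function defined above, and we still wind up materializing a concrete list for each group.

import statistics
from itertools import groupby
from operator import attrgetter

rows = sorted(converted_log("somefile.csv"), key=attrgetter("service"))
for svc, group in groupby(rows, key=attrgetter("service")):
    times = [row.response_time for row in group]    # a concrete list anyway
    m = statistics.mean(times)
    print( "{svc:15s} {m:.2f}".format_map(vars()) )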

Now that our essential processing loop is for row in converted_log("somefile.csv"), we can explore a lot of processing alternatives in a tiny, easy-to-modify script. We can explore a number of hypotheses to determine why some RESTful API transactions are slow and others are fast.

Tuesday, September 1, 2015

Audio Synth in Python 3.4, Part II

See Audio Synth.

At first, I imagined the problem was going to be PyAudio. This package has a bunch of installers. But the installers don't recognize Python 3.4, so none of them work for me. The common fallback plan is to install from source, but I couldn't find the source. That looks like a problem.

Once I spotted this: "% git clone http://people.csail.mit.edu/hubert/git/pyaudio.git", things were much better.  I built the PortAudio library. I installed PyAudio for Python3.4. Things are working. Noises are happening.

Next step is actual synth.

In the past, I have played with pysynth because it has some examples of wave-table additive synth. That's very handy. The examples are hard to follow because a lot of the synth ideas are conflated into small functions.

Complication: The pysynth package is Python2. It lacks even the simple from __future__ import print_function to make it attempt Python3 compatibility.

The pysynth.play_wav module could be a handy wrapper around various audio playback technologies, including pyaudio. It has to be tweaked, however, to make it work with Python3.4. I really need to clone the project, make the changes, and put in a pull request.

The pysynth.pysynth and pysynth.pysynth_beeper modules are helpful for seeing how wave tables work.  How much rework to make these work with Python3.4? And how much reverse engineering to understand the math?

I've since found pyo, which is also Python 2. See the AjaxSoundStudio pages for details. This may be a better example of wave tables. But it's still Python2. More investigation to follow.

The good news is that there's some forward motion.

Tuesday, August 25, 2015

Visual studio and Python

Why write Python in Visual Studio?

That's what I want to know, too.

IntelliSense? ActiveState Komodo does this. And it does it very well considering the potential complexity of trying to determine what identifiers are possibly valid in a dynamic language.

Debugger? No thanks. I haven't used it yet. [I should probably blog on the perils of debuggers.]

Project Management? GitHub seems to be it. Some IDE integration might be helpful, but the three common command-line operations -- git pull, git commit, and git push -- seem to cover an awful lot of bases.

I've been asked about Python IDEs -- more than once -- and my answer remains the same:
The IDE Doesn't Matter. 

One of the more shocking tech decisions I've seen is the development manager who bragged on the benefits of VB. The entire benefit was this: Visual Studio made the otherwise awful VB language acceptable.

The Visual Studio IDE was great. And it made up for the awful language.

Seriously.

The development manager went on to claim that until Eclipse had all the features of Visual Studio, they were sure that Java was not usable. To them, the IDE was the only decision criterion. As though code somehow doesn't have a long tail of support, analysis, and reverse engineering.

Tuesday, August 18, 2015

Audio Synth [Updated]

I learned about synthesizers in the '70's using a Moog analog device. Epic coolness.

Nowadays, everything is digital. We use wave tables and (relatively) simple additive synth techniques.

I made the mistake of reading about Arduino wave table synthesis:

http://learning.codasign.com/index.php?title=Wavetable_Synthesis

http://makezine.com/projects/make-35/advanced-arduino-sound-synthesis/

http://playground.arduino.cc/Main/ArduinoSynth

The idea of an Arduino alarm that uses a chime instead of a harsh buzz is exciting. The tough part about this is building the wave tables.

What a perfect place to use Python: we can build wave tables that can be pushed down to the Arduino. And test them in the Python world to adjust the frequency spectrum and the complex envelope issues around the various partials.

See http://computermusicresource.com/Simple.bell.tutorial.html
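
Building the table itself needs nothing beyond the standard library. Here's a hedged sketch of the additive-synth idea; the partial numbers and amplitudes are made up, not taken from the bell tutorial. Playing the result back to check it is another matter.

import math

def wave_table(partials, size=256):
    """Build one cycle of a waveform by additive synthesis.

    partials is a list of (harmonic, amplitude) pairs. The result is
    size unsigned 8-bit samples (0..255), ready to push to an Arduino.
    """
    samples = []
    for i in range(size):
        theta = 2 * math.pi * i / size
        samples.append(sum(amp * math.sin(h * theta) for h, amp in partials))
    peak = max(abs(s) for s in samples) or 1
    return [int(round(127 + 127 * s / peak)) for s in samples]

# Made-up, vaguely bell-like spectrum.
table = wave_table([(1, 1.0), (2, 0.6), (3, 0.4), (5, 0.25)])
print(table[:16])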

Except.

Python3.4 doesn't have PyAudio support.

Yet.

Sigh. Before I can work with Arduino wave tables, I'll have to start by figuring out how to build PyAudio for Python 3.4 on Mac OS X.

Look here: http://people.csail.mit.edu/hubert/git/pyaudio.git for the code.

Look here for the secret to building this on Mac OS X: https://stackoverflow.com/questions/2893193/building-portaudio-and-pyaudio-on-mac-running-snow-leopard-arch-issues/2906040#2906040.

Summary.

  1. Get pyaudio source.
  2. Inside pyaudio, create a portaudio-v19 directory. Get the portaudio source and put it there.
  3. Inside pyaudio/pyaudio, do ./config; make and sudo make install
  4. Inside pyaudio, do python3.4 setup.py install --static-link

Tuesday, August 4, 2015

Mocking and Unit Testing and Test-Driven Development

Mocking is essential to unit testing.

However.

It's also annoyingly difficult to get right.

If we aren't 100% perfectly clear on what we're mocking, we will merely canonize any dumb assumptions into mock objects that don't really work. They work in the sense that they don't crash, but they don't properly test the application objects since they repeat some (bad) assumptions.

When there are doubts, it seems like we have to proceed cautiously. And act like we're breaking some of the test-first test-driven-development rules.

Note. We're not really breaking the rules. Some folks, however, will argue that test-driven development means literally every action you take should be driven by tests. Does this include morning coffee or rotating your monitor into portrait mode? Clearly not. What about technical spikes?

Our position is this.
  1. Set a spike early and often. 
  2. Once you have reason to believe that this crazy thing might work, you can formalize the spike with tests. And mock objects.
  3. Now you can write the rest of the app by creating tests and fitting code around those tests.
The important part here is not to create mocks until you really understand what you're doing.

Book Examples

Now comes the tricky part: Writing a book.

Clearly every example must have a unit test of some kind. I use doctest heavily for this. Each example is in a doctest test string.

The code for a chapter might look like this.


test_hello_world = '''
>>> print( 'hello world')
hello world
'''

__test__ = { n:v for n,v in vars().items() 
    if n.startswith('test_') }

if __name__ == '__main__':
    import doctest
    doctest.testmod()

We've used the doctest feature that looks for a dictionary assigned to a variable named __test__. The values from this dictionary are tests that get run as if they were docstrings found inside modules, functions, or classes.

This is delightfully simple. Expostulate. Exemplify. Copy and Paste the example into a script for test purposes and Exhibit in the text.

Until we get to external services. And RESTful API requests, and the like. These are right awkward to mock. Mostly because a mocked unittest is singularly uninformative.

Let's say we're writing about making a RESTful API request to http://www.data.gov. The results of the request are very interesting. The mechanics of making the request are an important example of how REST API's work. And how CKAN-powered web sites work in general.

But if we replace urllib.request with a mock urllib, the unit test amounts to a check that we called urlopen() with the proper parameters. Important for a lot of practical software development, but also uninformative for folks who download the code associated with the book.
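
Concretely, the mocked version amounts to something like this sketch. The module and function names (ch_3_example, get_datasets) are hypothetical stand-ins, not the book's actual code; the test assumes get_datasets() calls urllib.request.urlopen() and parses the JSON response.

import unittest
from unittest.mock import MagicMock, patch

class Test_get_datasets(unittest.TestCase):
    def test_should_call_urlopen(self):
        response = MagicMock()
        response.read.return_value = b'{"result": []}'
        with patch('urllib.request.urlopen', return_value=response) as urlopen:
            from ch_3_example import get_datasets
            result = get_datasets('http://www.data.gov/')
        urlopen.assert_called_with('http://www.data.gov/')
        self.assertEqual(result, {"result": []})

The same complaint applies: most of these lines are mock plumbing, and only the two lines inside the with statement touch the subject matter.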

It appears that I have four options:

  1. Grin and bear it. Not all examples have to be wonderfully detailed.
  2. Stick with the spike version. Don't mock things. The results may vary and some of the tests might fail on the editor's desktop.
  3. Skip the test.
  4. Write multiple versions of the test: a "with real internet" version and a "with corporate firewall proxy blockers in place" version that uses mocks and works everywhere.
So far, I've leveraged the first three heavily. The fourth is awkward. We wind up with code like this:

class Test_get_whois(unittest.TestCase):
    def test_should_get_subprocess(self):
        subprocess = MagicMock()
        subprocess.check_output.return_value = b'\nwords\n'
        with patch.dict('sys.modules', subprocess=subprocess):
            import subprocess
            from ch_2_ex_4 import get_whois
            result = get_whois('1.2.3.4')
        self.assertEqual( result, ['', 'words'] )
        subprocess.check_output.assert_called_with(['whois', '1.2.3.4'])


This is not a lot of code for enterprise software development purposes. It's a bit weak, in fact, since it only tests the Happy Path.

But for a book example, it seems to be heavy on the mock module and light on the subject of interest.
Indeed, I defy anyone to figure out what the expository value of this is, since it has only 2 lines of relevant code wrapped in 8 lines of boilerplate required to mock a module successfully.

I'm not unhappy with the unittest.mock module in any way. It's great for mocking modules; I think the boilerplate is acceptable considering what kind of a wrenching change we're making to the runtime environment for the unit under test.

This fails at explication.

I'm waffling over how to handle some of these more complex test cases. In the past, I've skipped cases, and used the doctest Ellipsis feature to work through variant outputs. I think I'll continue to do that, since the mocking code seems to be less helpful for the readers, and too focused on purely technical need of proving that all the code is perfectly correct.
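
For the record, the Ellipsis feature looks like this: the variable part of the output (here, an object's memory address) is swallowed by the ... marker. This assumes the module is run directly, so the class reprs as __main__.Dataset.

test_variant_output = '''
>>> class Dataset:
...     pass
>>> Dataset()    # doctest: +ELLIPSIS
<__main__.Dataset object at 0x...>
'''

Dropped into the __test__ dictionary shown earlier, this passes no matter what address the object lands at.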

Tuesday, July 28, 2015

Amazon Reviews

Step 1. Go to amazon.com and look for one (or more) of my Python books.

Step 2. Have you read it?

  •     Yes: Thanks! Consider posting a review.
  •     No: Hmmm.
That's all. Consider doing this for other authors, also. 

Social media is its own weird economy. The currency seems to be evidence of eyeballs landing on content.