Bio and Publications

Tuesday, August 29, 2017

The Pipeline Question when Bashing the Bash

Background: https://medium.com/capital-one-developers/bashing-the-bash-replacing-shell-scripts-with-python-d8d201bc0989

And this
The answer to this is interesting because there are two kinds of parallelism. I like to call them architectural and incidental (or casual).  I'll look at architectural parallelism first, because it's what we often think about. Then the incidental parallelism, which I'm convinced is a blight.

Architectural Parallelism

The OS provides big-picture, architectural parallelism. This isn't -- necessarily -- a thing we want to push down into Python applications. There are some tradeoffs here.

One example of big architectural parallelism are big map-reduce processes where the mapping and reducing can (and should) proceed in parallel. There are some constraints around this, and we'll touch on them below.

Another common example is a cluster of microservices that are deployed on the same server. In many cases, each microservice decomposes into a cluster of processes that work in parallel and have a very, very long life. We might have an NGINX front-end for static content and a Python-based Flask back-end for dynamic content.  We might want the OS init process to start these, and we define them in init.d. In other cases, we allocate them to web-based servers where load-balancing handles the details of restarting.

In the map-reduce example, the shell's pipe makes sense. We can define it with a shell script like this: source | map | reduce.  It's hard to beat this for succinct clarity.

In the Ngnix + Flask case, they may talk using a named pipe that outlives the two processes. Conceptually, they work as nginx | flask run.

In some cases, we have log analysis and alerting that are part of microservices management. We can pile this into the processing stream with a conceptual pipeline of nginx | flask run | log reduce | alert. The log reduce filters and reduces the log to find those events that require an alert. If any data makes it into the alert process, it sends the text for human intervention.

There are some distinguishing features.
  • They tend to be resource hogs. Either it's a big map-reduce processing request that uses a lot of CPU and memory resources. Or it's a log-running server.
  • The data being transported is bytes with a very inexpensive (almost free) serialization. When we think of map-reduce, these processes often work with text as input and output. There may be more complex data structures involved in the reduce, but the cost of serialization is an important concern. When we think of web requests, the request, response, and log pipeline is bytes more-or-less by definition. 
  • The parallelism is at the process level because each element does a lot of work and the isolation is beneficial.
  • The compute high-value results for actual users.
The OS does this. The complexity is that each OS does this differently. The Python subprocess module (and related projects outside the standard library) provide an elegant mapping into Python. 

It's not built-in to the language. I think that it's because details vary so widely by OS. I think trying to build this into the language leads to a bulky featyre that's not widely-enough used.

Incidental Parallelism

This is -- to me -- a blight. Here's a typical kind of thing we see in the middle of a longer, more complex shell script.

data=`grep pattern file | cut args | sort | head`
# the interesting processing on $data

Computing a value that's assigned to data is a high-cost, low-value step. It creates an intermediate result that's only part of the shell script, and not really the final result. The parallelism feature of the shell's | operator isn't of any profound value since only a tiny bit of data is passed from step to step.

This can be rewritten into Python, but the resulting code won't be a one-liner. It will be longer. It will also be much, much faster. However, the speed difference is rarely relevant if this kind of processing step inside a larger, iterative process.

A trivial rewrite of just one line of code misses the point. The goal is to refactor the script so that this line of code because a simple part of the processing and uses first-class Python data structures. The reason for doing cut and sort operations is generally because the data structure wasn't optimized for the job. A priority queue might have been a better choice, and would have amortized sorting properly and eliminated the need for separate cut and head operations.

This kind of computation can (and should) be done in a single process. The shell pipeline legacy implementation is little more than a short-hand for passing arguments and results among (simple) functions.

We can rewrite this as nested functions.

with Path(file).open() as source:
    head(sorted(cut_mapping(args, grep_filter(pattern, file))))

This will do the same thing. The gigantic benefits of this kind of rewrite involves eliminating two kinds of overheads.
  • The fork/exec to spawn subprocesses. A single process will be faster.
  • The serialization and deserialization of intermediate results. Avoiding serialization will be faster.
When we rewrite bash to Python, we are able to leverage Python's data structures to write processing that expressive, succinct, and efficient.

This kind of rewriting will also lead to refactoring the adjacent lines of the script -- the interesting processing -- into Python code also. This refactoring can lead to further simplifications and speedups.

The Two Cases

There seem to be two cases of parallelism:
  • Big and Architectural. There are many Python packages that provides these features. Look at plumbum, pipes, and joblib for examples. Since the OS implementation details vary so much, it's hard to imagine making this part of the language.
  • Small and Incidental.  The incidental parallelism is clever, but inefficient. In many cases, it doesn't seem to create significant value. It seems to be a kind of handy little workaround. It has costs that I find to outstrip the value. 
When replacing the bash with Python, some of the parallelism is architectural, and needs to be preserved. Careful engineering choices will be required. The rest is incidental and needs to be discarded.

Tuesday, August 8, 2017

Refusing to Code. Or. How to help the incurious?

The emphasis on code is important. Code defines the behavior of systems -- for the most part Once upon a time, we used clever mechanical designs, or discrete electronic components. The InternetofThings idea exists because high-powered general-purposes CPU's are ubiquitous.

A DevOps mantra is "infrastructure as code". The entire deployment is automated, from the allocation of processors and storage down to pining the health-check endpoint to be sure it's live. Blue-Green deployments, traffic switching, etc., and etc. These all require lots of code and as little manual intervention as possible. 

The gold standard is to use tools to visualize state, make a decision, and use tools to take action. Lots of code.

When I meet the anti-code people, it's confusing.

Outside my narrow realm of tech, anti-code is fine. I have a sailboat, I meet lots of non-tech people who can't code, won't code, and aren't sure what code is.

But when I meet people who claim they want to be data science folks but refuse to code, I'm baffled.

Step 1 was to "learn more" about data science or something like that. I suggested some of the ML tutorials available for Python. Why? It appears that Scikit Learn is the gold standard for ML applications. http://scikit-learn.org/stable/tutorial/index.html

Because they didn't want to code, they insisted on doing things in Excel. Really.

Step 2 was to figure out some simulated annealing process -- in Excel. They had one of the central textbooks on ML algorithms. And they had a spreadsheet. They had some question that can only arise from avoiding open-source code. I suggested they use the open source code available to everyone. Or perhaps find a more modern tutorial like this: http://katrinaeg.com/simulated-annealing.html

Because they don't want to code, they used the fact that scipy.optimize.anneal() was deprecated to indict Python. I almost wish I'd saved all the emails over why basin hopping was unacceptable. The reasoning involved having an old textbook that covered annealing in depth, and not wanting to actually read the code for basin hopping. Or something. 

Step 3 was to grab a Kaggle problem and start working on it. This is too large for a spreadsheet. Indeed, the data sets push the envelope on what can be done on a Windows laptop because the dataframes tend to be quite large. It requires installing Scikit learn, which means installing Anaconda from Continuum. There's no reasonable alternative.

The Kaggle exercise may also involve buying a new laptop or renting time on a cloud-based server that's big enough to handle the data set. ML processing takes time, and GPU acceleration can be a huge help. All of this, however, presumes that there's code to run.

Because they don't want to code, this bled into an amazing number of unproductive directions.  There's some kind of classic "do everything except what you need to do" behavior. I'm sure it has a name. It's more than "work avoidance." It's a kind of active negation of the goals. It was impossible to discern what was actually going on or how I was supposed to help.

I suggested a Trello board. 

The Trello board devolved into dozens of individual lists, each list had one card. Seriously. The card/list thing became a way of avoiding progress. There were cards for considering the implications of installing Anaconda. The cards turned into hand-wringing discussions and weird status updates and memo-to-self notes, instead of actual actions.

Bottom line? 

No code. 

In the middle of the Kaggle something-or-other board, a card appeared asking for comments on some code. :yay2: Something I can actually help with.

The code was bad. And precious. I blogged about this phenomenon earlier. The code can't be changed because it was so hard to create. It was really bad, and riddled with bizarre things that make it look like they'd never seen code before.

Use pylint? This got a grudging kind of reluctant cleanup. But huge_variable_names_with_lots_of_useless_clauses aren't flagged by Pylint. They're still bad, and reading other code would show how atypical these names are. Unless, of course, you hate code; then reading code is not going to happen.

My new model for their behavior? They hate code. So when they do it, they do it badly. Intentionally badly. And because it was so painful, it's precious. (I'm probably wrong, and there's probably a lot more to this, but it seems to fit the observed behavior.)

It gets worse (or better, depending on your attitude.)

Another Trello card appears wondering what [a, b] * 2 or some such Pythonic thing might mean. Um. What?

It appears that they can't find the Standard Library description of the built-in data types and their operators. As if chapter four was deleted from their copy, or something.

The "can't find" seems unlikely. It's pretty prominent. I would think that anyone aspiring to learn Python would see the "keep this under your pillow" admonition on the standard library docs and perhaps glance through the first five sections to see what the fuss was about. Unless they hate code.

I'm left with "won't find."  Perhaps they're refusing to use the documentation? Are they also refusing to use Python's internal help? It's not great, but you can try a bunch of things and get steered around from topic to topic, eventually, you have to find something useful.

Apply my new model: they hate code and Python help() is code.

Do they really hate code that much? I now think they do. I think they truly and deeply hate losing manual, personal. hands-on control over things. If it's not a spreadsheet -- where they typed each cell personally -- it's reviled. (Or feared? Let's not go too far here.)

Test the hypothesis. Ask if they used help().

Answer: Yes. They had tried three things (exactly three) and none of those three had a satisfactory explanation. The help() function did not work. Indeed, two of the things they tried had the same result, and the third reported a syntax error. So they stopped.

They tried three things and stopped.

Okay, then. They hate code. And -- Bonus! -- They refuse to explore. Somehow they're also able to insist they must learn to code. Will the self-beatings continue until the attitude improves?

It's difficult to offer meaningful help under these circumstances. I don't see the value in being someone's personal Google, since that only reinforces the two core refusals to use code or explore by typing code to see what happened.

I like to think that coding is a core life skill. Like cooking. You don't have to become a chef, but you have to know how to handle food. You don't have to create elaborate, scalable meshes of microservices. But you have to be able to find the data types and operators on your own.

And I don't know how to coach someone who is so incurious that three attempts with help() is the limit. Done at three. Count it as a failure and stop trying. "Try something different" seems vague, but it's all I've got. Anything more feels isomorphic to "Here's the link, attached is an audio file of me reading the words out loud for you." 

Other Entries Other Blogs

https://medium.com/@s_lott

Plus, of course, lots of other stuff from lots of other folks. Enjoy.


Tuesday, August 1, 2017

JSON vs. XML: The battle for format supremacy may be wasted energy - SD Times

http://sdtimes.com/json-vs-xml-battle-format-supremacy-may-wasted-energy/

This article seems silly. Perhaps I missed something important.

I'm not sure who's still litigating the JSON vs. XML, but it seems like it's more-or-less done.

XHTML/XML for HTML things.

JSON for everything else.

Maybe there are people still wringing their hands over this. AFAIK, the last folks using SOAP/XML services are commercial and governmental agencies where change tends to happen very slowly.

I remember when Sun Microsystems was a company and had the Java Composite Applications Suite. Very XML. That was -- perhaps -- ten years ago. Since then, I think the problem has been solved. I'm not sure who's battling for supremacy or why.