Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Tuesday, August 29, 2017

The Pipeline Question when Bashing the Bash

Background: https://medium.com/capital-one-developers/bashing-the-bash-replacing-shell-scripts-with-python-d8d201bc0989

And this
The answer to this is interesting because there are two kinds of parallelism. I like to call them architectural and incidental (or casual).  I'll look at architectural parallelism first, because it's what we often think about. Then the incidental parallelism, which I'm convinced is a blight.

Architectural Parallelism

The OS provides big-picture, architectural parallelism. This isn't -- necessarily -- a thing we want to push down into Python applications. There are some tradeoffs here.

One example of big architectural parallelism are big map-reduce processes where the mapping and reducing can (and should) proceed in parallel. There are some constraints around this, and we'll touch on them below.

Another common example is a cluster of microservices that are deployed on the same server. In many cases, each microservice decomposes into a cluster of processes that work in parallel and have a very, very long life. We might have an NGINX front-end for static content and a Python-based Flask back-end for dynamic content.  We might want the OS init process to start these, and we define them in init.d. In other cases, we allocate them to web-based servers where load-balancing handles the details of restarting.

In the map-reduce example, the shell's pipe makes sense. We can define it with a shell script like this: source | map | reduce.  It's hard to beat this for succinct clarity.

In the Ngnix + Flask case, they may talk using a named pipe that outlives the two processes. Conceptually, they work as nginx | flask run.

In some cases, we have log analysis and alerting that are part of microservices management. We can pile this into the processing stream with a conceptual pipeline of nginx | flask run | log reduce | alert. The log reduce filters and reduces the log to find those events that require an alert. If any data makes it into the alert process, it sends the text for human intervention.

There are some distinguishing features.
  • They tend to be resource hogs. Either it's a big map-reduce processing request that uses a lot of CPU and memory resources. Or it's a log-running server.
  • The data being transported is bytes with a very inexpensive (almost free) serialization. When we think of map-reduce, these processes often work with text as input and output. There may be more complex data structures involved in the reduce, but the cost of serialization is an important concern. When we think of web requests, the request, response, and log pipeline is bytes more-or-less by definition. 
  • The parallelism is at the process level because each element does a lot of work and the isolation is beneficial.
  • The compute high-value results for actual users.
The OS does this. The complexity is that each OS does this differently. The Python subprocess module (and related projects outside the standard library) provide an elegant mapping into Python. 

It's not built-in to the language. I think that it's because details vary so widely by OS. I think trying to build this into the language leads to a bulky featyre that's not widely-enough used.

Incidental Parallelism

This is -- to me -- a blight. Here's a typical kind of thing we see in the middle of a longer, more complex shell script.

data=`grep pattern file | cut args | sort | head`
# the interesting processing on $data

Computing a value that's assigned to data is a high-cost, low-value step. It creates an intermediate result that's only part of the shell script, and not really the final result. The parallelism feature of the shell's | operator isn't of any profound value since only a tiny bit of data is passed from step to step.

This can be rewritten into Python, but the resulting code won't be a one-liner. It will be longer. It will also be much, much faster. However, the speed difference is rarely relevant if this kind of processing step inside a larger, iterative process.

A trivial rewrite of just one line of code misses the point. The goal is to refactor the script so that this line of code because a simple part of the processing and uses first-class Python data structures. The reason for doing cut and sort operations is generally because the data structure wasn't optimized for the job. A priority queue might have been a better choice, and would have amortized sorting properly and eliminated the need for separate cut and head operations.

This kind of computation can (and should) be done in a single process. The shell pipeline legacy implementation is little more than a short-hand for passing arguments and results among (simple) functions.

We can rewrite this as nested functions.

with Path(file).open() as source:
    head(sorted(cut_mapping(args, grep_filter(pattern, file))))

This will do the same thing. The gigantic benefits of this kind of rewrite involves eliminating two kinds of overheads.
  • The fork/exec to spawn subprocesses. A single process will be faster.
  • The serialization and deserialization of intermediate results. Avoiding serialization will be faster.
When we rewrite bash to Python, we are able to leverage Python's data structures to write processing that expressive, succinct, and efficient.

This kind of rewriting will also lead to refactoring the adjacent lines of the script -- the interesting processing -- into Python code also. This refactoring can lead to further simplifications and speedups.

The Two Cases

There seem to be two cases of parallelism:
  • Big and Architectural. There are many Python packages that provides these features. Look at plumbum, pipes, and joblib for examples. Since the OS implementation details vary so much, it's hard to imagine making this part of the language.
  • Small and Incidental.  The incidental parallelism is clever, but inefficient. In many cases, it doesn't seem to create significant value. It seems to be a kind of handy little workaround. It has costs that I find to outstrip the value. 
When replacing the bash with Python, some of the parallelism is architectural, and needs to be preserved. Careful engineering choices will be required. The rest is incidental and needs to be discarded.

Tuesday, August 8, 2017

Refusing to Code. Or. How to help the incurious?

The emphasis on code is important. Code defines the behavior of systems -- for the most part Once upon a time, we used clever mechanical designs, or discrete electronic components. The InternetofThings idea exists because high-powered general-purposes CPU's are ubiquitous.

A DevOps mantra is "infrastructure as code". The entire deployment is automated, from the allocation of processors and storage down to pining the health-check endpoint to be sure it's live. Blue-Green deployments, traffic switching, etc., and etc. These all require lots of code and as little manual intervention as possible. 

The gold standard is to use tools to visualize state, make a decision, and use tools to take action. Lots of code.

When I meet the anti-code people, it's confusing.

Outside my narrow realm of tech, anti-code is fine. I have a sailboat, I meet lots of non-tech people who can't code, won't code, and aren't sure what code is.

But when I meet people who claim they want to be data science folks but refuse to code, I'm baffled.

Step 1 was to "learn more" about data science or something like that. I suggested some of the ML tutorials available for Python. Why? It appears that Scikit Learn is the gold standard for ML applications. http://scikit-learn.org/stable/tutorial/index.html

Because they didn't want to code, they insisted on doing things in Excel. Really.

Step 2 was to figure out some simulated annealing process -- in Excel. They had one of the central textbooks on ML algorithms. And they had a spreadsheet. They had some question that can only arise from avoiding open-source code. I suggested they use the open source code available to everyone. Or perhaps find a more modern tutorial like this: http://katrinaeg.com/simulated-annealing.html

Because they don't want to code, they used the fact that scipy.optimize.anneal() was deprecated to indict Python. I almost wish I'd saved all the emails over why basin hopping was unacceptable. The reasoning involved having an old textbook that covered annealing in depth, and not wanting to actually read the code for basin hopping. Or something. 

Step 3 was to grab a Kaggle problem and start working on it. This is too large for a spreadsheet. Indeed, the data sets push the envelope on what can be done on a Windows laptop because the dataframes tend to be quite large. It requires installing Scikit learn, which means installing Anaconda from Continuum. There's no reasonable alternative.

The Kaggle exercise may also involve buying a new laptop or renting time on a cloud-based server that's big enough to handle the data set. ML processing takes time, and GPU acceleration can be a huge help. All of this, however, presumes that there's code to run.

Because they don't want to code, this bled into an amazing number of unproductive directions.  There's some kind of classic "do everything except what you need to do" behavior. I'm sure it has a name. It's more than "work avoidance." It's a kind of active negation of the goals. It was impossible to discern what was actually going on or how I was supposed to help.

I suggested a Trello board. 

The Trello board devolved into dozens of individual lists, each list had one card. Seriously. The card/list thing became a way of avoiding progress. There were cards for considering the implications of installing Anaconda. The cards turned into hand-wringing discussions and weird status updates and memo-to-self notes, instead of actual actions.

Bottom line? 

No code. 

In the middle of the Kaggle something-or-other board, a card appeared asking for comments on some code. :yay2: Something I can actually help with.

The code was bad. And precious. I blogged about this phenomenon earlier. The code can't be changed because it was so hard to create. It was really bad, and riddled with bizarre things that make it look like they'd never seen code before.

Use pylint? This got a grudging kind of reluctant cleanup. But huge_variable_names_with_lots_of_useless_clauses aren't flagged by Pylint. They're still bad, and reading other code would show how atypical these names are. Unless, of course, you hate code; then reading code is not going to happen.

My new model for their behavior? They hate code. So when they do it, they do it badly. Intentionally badly. And because it was so painful, it's precious. (I'm probably wrong, and there's probably a lot more to this, but it seems to fit the observed behavior.)

It gets worse (or better, depending on your attitude.)

Another Trello card appears wondering what [a, b] * 2 or some such Pythonic thing might mean. Um. What?

It appears that they can't find the Standard Library description of the built-in data types and their operators. As if chapter four was deleted from their copy, or something.

The "can't find" seems unlikely. It's pretty prominent. I would think that anyone aspiring to learn Python would see the "keep this under your pillow" admonition on the standard library docs and perhaps glance through the first five sections to see what the fuss was about. Unless they hate code.

I'm left with "won't find."  Perhaps they're refusing to use the documentation? Are they also refusing to use Python's internal help? It's not great, but you can try a bunch of things and get steered around from topic to topic, eventually, you have to find something useful.

Apply my new model: they hate code and Python help() is code.

Do they really hate code that much? I now think they do. I think they truly and deeply hate losing manual, personal. hands-on control over things. If it's not a spreadsheet -- where they typed each cell personally -- it's reviled. (Or feared? Let's not go too far here.)

Test the hypothesis. Ask if they used help().

Answer: Yes. They had tried three things (exactly three) and none of those three had a satisfactory explanation. The help() function did not work. Indeed, two of the things they tried had the same result, and the third reported a syntax error. So they stopped.

They tried three things and stopped.

Okay, then. They hate code. And -- Bonus! -- They refuse to explore. Somehow they're also able to insist they must learn to code. Will the self-beatings continue until the attitude improves?

It's difficult to offer meaningful help under these circumstances. I don't see the value in being someone's personal Google, since that only reinforces the two core refusals to use code or explore by typing code to see what happened.

I like to think that coding is a core life skill. Like cooking. You don't have to become a chef, but you have to know how to handle food. You don't have to create elaborate, scalable meshes of microservices. But you have to be able to find the data types and operators on your own.

And I don't know how to coach someone who is so incurious that three attempts with help() is the limit. Done at three. Count it as a failure and stop trying. "Try something different" seems vague, but it's all I've got. Anything more feels isomorphic to "Here's the link, attached is an audio file of me reading the words out loud for you." 

Other Entries Other Blogs

https://medium.com/@s_lott

Plus, of course, lots of other stuff from lots of other folks. Enjoy.


Tuesday, August 1, 2017

JSON vs. XML: The battle for format supremacy may be wasted energy - SD Times

http://sdtimes.com/json-vs-xml-battle-format-supremacy-may-wasted-energy/

This article seems silly. Perhaps I missed something important.

I'm not sure who's still litigating the JSON vs. XML, but it seems like it's more-or-less done.

XHTML/XML for HTML things.

JSON for everything else.

Maybe there are people still wringing their hands over this. AFAIK, the last folks using SOAP/XML services are commercial and governmental agencies where change tends to happen very slowly.

I remember when Sun Microsystems was a company and had the Java Composite Applications Suite. Very XML. That was -- perhaps -- ten years ago. Since then, I think the problem has been solved. I'm not sure who's battling for supremacy or why.

Tuesday, July 25, 2017

The "My Code Is Precious To Me" Conundrum

I suspect some people sweat so hard over each line of code that it becomes precious. Valuable. An investment wrung from their very soul. Or something.

When they ask for comments, it becomes difficult.

The Pull Request context can be challenging. There the code is, beaten into submission after Herculean toils, and -- well -- it's not really very good. The review isn't a pleasant validation with some suggested rewrites of the docstrings to remove dangling participles (up with which I will not put.) Perhaps the code makes a fundamentally flawed assumption and proceeds from there to create larger and larger problems until it's really too awful to save.

How do you break the news?

I get non-PR requests for code reviews once in a while. The sincere effort at self-improvement is worthy of praise. It's outside any formal PR process; outside formal project efforts. It's good to ask for help like that.

The code, on the other hand, has to go.

I'm lucky that the people I work with daily can create -- and discard -- a half-dozen working examples in the space of an hour.

I'm unlucky that people who ask for code review advice can't even think rationally about starting again with different assumptions. They'd rather argue about their assumptions than simply create new code based on different (often fewer) assumptions.

I've seen some simple unit conversion problems turned into horrible messes. The first such terrifying thing was a data query filter based on year-month with a rolling 13-month window. Somehow, this turned into dozens and dozens of lines of ineffective code, filled with wrong edge cases.

Similar things happen with hour-minute windows. Lots of wrong code. Muddled confusion. Herculean efforts doing the wrong thing. Herculean.

Both year-month and hour-minute problems are units conversion. Year-month is months in base 12. Hour-minute is minutes in base 60. Technically, they're mixed bases, simple polynomials in two terms. It's a multiply and an add. 12y+m, where 0 ≤ m < 12. Maybe an extra subtract 1 is involved.

The entire algorithm is a multiply and an add. There shouldn't very many lines of code involved. In some cases, there's an additional conversion from integer minutes to float hours. Which is a multiply by a constant (1/720.) Or integer months to float years after an epochal year (another add with a negative number and multiply by 1/12.)

I think it's common that ineffective code need to be replaced. Maybe it's sad that it has to get replaced *after* being written? I don't think so. All code gets rewritten. Some just gets written sooner.

I think that some people may need some life-coaching as well as code reviews.

Perhaps they should be encouraged to participate in a design walk-through before sweating their precious life's blood into code that doesn't solve the problem at hand.

Tuesday, July 18, 2017

Yet Another Python Problem List

This was a cool thing to see in my Twitter feed:

Dan Bader (@dbader_org)
"Why Python Is Not My Favorite Language" zenhack.net/2016/12/25/why…

More Problems with Python. Here's the short list.

1. Encapsulation (Inheritance, really.)
2. With Statement
3. Decorators
4. Duck Typing (and Documentation)
5. Types

I like these kinds of posts because they surface problems that are way, way out at the fringes of Python. What's important to me is that most of the language is fine, but the syntaxes for a few things are sometimes irksome. Also important to me is that it's almost never the deeper semantics; it seems to be entirely a matter of syntax.

The really big problem is people who take the presence of a list like this as a reason dismiss Python in its entirety because they found a few blog posts identifying specific enhancements. That "Python must be bad because people are proposing improvements" is madding. And dismayingly common.

Even in a Python-heavy workplace, there are Java and Node.js people who have opinions shaped by little lists like these. The "semantic whitespace" argument coming from JavaScript people is ludicrous, but there they are: JavaScript has a murky relationship with semi-colons and they're complaining about whitespace. Minifying isn't a virtue. It's a hack. Really.

My point in general is not to say this list is wrong. It's to say that these points are minor. In many cases, I don't disagree that these can be seen as problems. But I don't think they're toweringly important.

1. The body of first point seems to be more about inheritance and accidentally overiding something that shouldn't have been overridden. Java (and C++) folks like to use private for this. Python lets you read the source. I vote for reading the source.

2. Yep. There are other ways to do this. Clever approach. I still prefer with statements.

3. I'm not sold on the syntax change being super helpful.

4. People write bad documentation about their duck types. Good point. People need to be more clear.

5. Agree. A lot of projects need to implement type hints to make it more useful.

Tuesday, July 11, 2017

Extracting Data Subsets and Design By Composition

The request was murky. It evolved over time to this:
Create a function file_record_selection(train.csv, 2, 100, train_2_100.csv)
First parameter: input file name (train.csv)
Second parameter: first record to include (2)
Third parameter: last record to include (100)
Fourth parameter: output file name (train_2_100.csv)
Fundamentally, this is a bad way to think about things. I want to cover some superficial problems first, though.

First superficial dig. It evolved to this. In fairness to people without a technical background, getting to tight, implementable requirements are is difficult. Sadly the first hand-waving garbage was from a DBA. It evolved to this. The early drafts made no sense.

Second superficial whining. The specification -- as written -- is extraordinarily shabby. This seems to be written by someone who's never read a function definition in the Python documentation before. Something I know is not the case. How can someone who is marginally able to code also unable to write a description of a function? In this case, the "marginally able to code" may be a hint that some folks struggle with abstraction: the world is a lot of unique details; patterns don't emerge from related details.

Third. Starting from record 2, seems to show that they don't get the idea that indexes start with zero. They've seen Python. They've written code. They've posted code to the web for comments. And they are still baffled by the start value of indices.

Let's move on to the more interesting topic, functional composition. 

Functional Composition

The actual data file is a .GZ archive. So there's a tiny problem with looking at .CSV extracts from the gzip. Specifically, we're exploding a file all over the hard drive for no real benefit. It's often faster to read the zipped file: it may involve fewer physical I/O operations. The .GZ is small; the computation overhead to decompress may be less than the time waiting for I/O.

To get to functional composition we have to start by decomposing the problem. Then we can build the solution from the pieces. To do this, we'll borrow the interface segregation (ISP) design principle from OO programming.

Here's an application of ISP: Avoid Persistence. It's easier to add persistence than to remove it. This leads peeling off three further tiers of file processing: Physical Format, Logical Layout, and Essential Entities.

We shouldn't write a .CSV file unless it's somehow required. For example, if there are multiple clients for a subset. In this case, the problem domain is exploratory data analysis (EDA) and saving .CSV subsets is unlikely to be helpful. The principle still applies: don't start with persistence in mind. What are the Essential Entities?

This leads away from trying to work with filenames, also. It's better to work with files. And we shouldn't work with file names as strings, we should use pathlib.Path. All consequences of peeling off layers from the interfaces.

Replacing names with files means the overall function is really this. A composition. 

file_record_selection = (lambda source, start, stop, target: 
    file_write(target, file_read_selection(source, start, stop))
)

We applied the ISP again, to avoid opening a named .CSV file. We can work with an open file-like objects, instead of a file names. This doesn't change the overall form of the functions, but it changes the types. Here are the two functions that are part of the composition:

from typing import *
import typing
Record = Any
def file_write(target: typing.TextIO, records: Iterable[Record]):
    pass
def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    pass

We've left the record type unspecified, mostly because we don't know what it just yet. The definition of Record reflects the Essential Entities, and we'll defer that decision until later. CSV readers can produce either dictionaries or lists, so it's not a complex decision; but we can defer it.

The .GZ processing defines the physical format. The content which was zipped was a .CSV file, which defines the logical layout.

Separating physical format, logical layout, and essential entity, gets us code like the following:

with gzip.open('file.gz') as source:
    reader = csv.DictReader(source)  # Iterator[Record]
    for line in file_read_selection(reader, start, stop):
        print(line)

We've opened the .GZ for reading. Wrapped a CSV parser around that. Wrapped our selection filter around that. We didn't write the CSV output because -- actually -- that's not required. The core requirement was to examine the input.

We can, if we want, provide two variations of the file_write() function and use a composition like the file_record_selection() function with the write-to-a-file and print-to-the-console variants. Pragmatically, the print-to-the-console is all we really need.

In the above example, the Record type can be formalized as  List[Text].  If we want to use csv.DictReader instead, then the Record type becomes Dict[Text, Text].

Further Decomposition

There's a further level of decomposition: the essential design pattern is Pagination. In Python parlance, it's a slice operation. We could use itertools to replace the entirety of file_read_selection() with itertools.takewhile() and itertools.dropwhile(). The problem with these methods is they don't short-circuit: they read the entire file.

In this instance, it's helpful to have something like this for paginating an iterable with a start and stop value.

for n, r in enumerate(reader):
    if n < start: continue
    if n = stop: break
    yield r

This covers the bases with a short-circuit design that saves a little bit of time when looking at the first few records of a file. It's not great for looking at the last few records, however. Currently, the "tail" use case doesn't seem to be relevant. If it was, we might want to create an index of the line offsets to allow arbitrary access. Or use a simple buffer of the required size.

If we were really ambitious, we'd use the Slice class definition to make it easy to specify start, stop, and step values. This would allow us to pick every 8th item from the file without too much trouble.

The Slice class doesn't, however support selection of a randomized subset. What we really want is a paginator like this:

def paginator(iterable, start: int, stop: int, selection: Callable[[int], bool]):
    for n, r in enumerate(iterable):
        if n < start: continue
        if n == stop: break
        if selection(n): yield r

file_read_selection = lambda source, start, stop: paginator(source, start, stop, lambda n: True)

file_read_slice = lambda source, start, stop, step: paginator(source, start, stop, lambda n: n%step == 0)

The required file_read_selection() is built from smaller pieces. This function, in turn, is used to build file_record_selection() via functional composition. We can use this for randomized selection, also.

Here are functions with type hints instead of lambdas.

def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: True)

def file_read_slice(source: csv.DictReader, start: int, stop: int, step: int)  -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: n%step == 0)

Specifying type for a generic iterable and the matching result iterable seems to require a type variable like this:

T = TypeVar('T')
def paginator(iterable: Iterable[T], ...) -> Iterable[T]:

This type hint suggests we can make wide reuse of this function. That's a pleasant side-effect of functional composition. Reuse can stem from stripping away the various interface details to decompose the problem to essential elements.

TL;DR

What's essential here is Design By Composition. And decomposition to make that possible.

We got there by stepping away from file names to file objects. We segregated Physical Format and Logical Layout, also. Each application of the Interface Segregation Principle leads to further decomposition. We unbundled the pagination from the file I/O. We have a number of smaller functions. The original feature is built from a composition of functions.

Each function can be comfortably tested as a separate unit. Each function can be reused.

Changing the features is a matter of changing the combination of functions. This can mean adding new functions and creating new combinations.