Tuesday, February 24, 2015

Functional Python Programming

New from Packt Publishing: Functional Python Programming.

Also here on Amazon.

The fun part is covering generator functions, iterators, and higher-order functions in some real depth. There's a lot of powerful programming techniques available.

What's challenging is reconciling Python's approach to FP with languages that are purely functional like Haskell and OCaml and others. Years ago, I saw some discussion in Stack Overflow that Python simply wasn't a proper functional programming language because it lacked some features. I'm vague on specifics (perhaps there weren't any) but the gaps between Python and FP are narrow.

As far as I can tell, the single biggest features missing are non-strict evaluation coupled with an optimizer that can rearrange expressions to optimize performance. This feature pair also tends to also produce nice tail-call optimization of recursions.

Languages which are totally non-strict (or fully lazy) need to introduce monads so that some ordering can be enforced in the cases where ordering really does matter.

Since Python is strict (with only minor exceptions) monads aren't needed. But we also sacrifice some optimization capability because we can't reorder Python's strict expressions. I'm not sure this is a gap which is so huge that we can indict Python as being non-functional or not suitable for a functional approach. I think the lack of an optimizing compiler is a little more than an interesting factoid.

An interesting problem that compiled functional languages have is resolving data types properly. It's a problem that all statically-typed languages share. In order to write a really generic algorithm, we either have to rely on a huge type hierarchy or sophisticated type pattern-matching rules. Python eschews this problem by making all code generic with respect to type. If we've applied a function to an inappropriate object, we find out through unit testing that we have TypeError exceptions.

I think we can (and should) borrow functional programming design patterns and reimplement them in Python. This is quite easy and doesn't involve too much work or overhead. For example, the yield from statement allows us to do manual tail-call optimization rather than trusting that the compiler will recognize the code pattern.

Tuesday, February 17, 2015

Yet Another Complaint about Python in General, SciPy in Particular

The context is an ongoing question about optimization -- not my strong suit -- and the SciPy algorithms for this. See Scipy.optimization.anneal Problems for some additional confusion over simple things.

The new quote is this:
However, firing up Python, NumPy, SciPy and figuring out which solver to use is not convoluted? Keep on writing code and over engineering as opposed to using the minimum tech in order to get the job. After all, we are professionals.
It appears that using a packaged, proven optimizer is somehow "convoluted." Apparently, the Anaconda product never surfaced in a Google search. This seems to indicate that perhaps (a) Google was never used or (b) the author didn't get to page 4 of the search results, or (c) the author never tried another search beyond the single-word "scipy".

I'm guessing they did not google "Python simulated annealing" -- the actual subject -- because there are a fairly large number of existing solutions to this. Lots and lots of lecture notes and tutorials. It seems to be a rich area full of tutorials on both optimization and Python. Reading a few of these would probably have addressed all of the concerns.

Anaconda, BTW, appears to be an amazing product. It seems to be the gold standard for data science. (I know of organizations that have in-house variations on this theme They bundle Python plus numerous extra packages and a variety of installers for Mac OS X, Windows and Linux.)

The "Keep on writing code" complaint is peculiar. The optimization examples in SciPy seem to involve less than a half-dozen lines of code. Reading a CSV file can be digested down to four lines of code.

import cvs
with open("constrains.csv", newline="") as source;
    rdr= DictReader(source)
    data = list(rdr)

I can only guess that the threshold for "over engineering" is a dozen lines of code. Fewer lines are acceptable; more are bad.

I don't know what "using the minimum tech in order to get the job" means, but the context included an example spreadsheet that was somehow a solution to an instance of a problem. I'm guessing from this that "minimum tech" means "spreadsheet."

Read this: When spreadsheets go bad. There are a lot of war stories like this. (For information on the  original quote, read 'What is meant by "Now you have two problems"?')

I regret not asking follow-up questions.

The more complete story is this: rather than actually leverage SciPy, the author of the quote appears to be fixated on rewriting a classic Simulated Annealing textbook example into a spreadsheet because reasons. One of which is that more modern algorithms in SciPy aren't actually classic simulated annealing. The newer algorithms may be better, but since they're not literally from the textbook, this is a problem.

And my suggestion -- just use SciPy -- was dismissed as "convoluted", "over-engineering", and -- I guess -- unprofessional.

Tuesday, February 10, 2015

Thursday, February 5, 2015

Idempotence, Hysteresis and Determinism

Three terms that seem to cause confusion: Idempotence, Hysteresis and Deterministic. The subject came up during my webcast on the Five Kinds of Python Functions. We can use all three terms to describe a function. Two of them are relevant to common design questions in software. The third is a higher-order architectural consideration, and not really part of writing a function definition in Python.

Idempotent -- narrowly defined --  means that f(f(x)) = f(x). In computer science, the meaning is stretched so that we can distinguish functions like random number generators from other functions. A random number generator (Python's random.random(), for example) is not idempotent. It returns a different result each time it's called. Many other functions are idempotent because they always return the same result given the same arguments. The functions in Python's os module are not idempotent. The results change based on external events.

Hysteresis is memory of previous events. A random number generator may have some hidden hysteresis so that it can return the next value in it's random sequence. Note that os.random() is explicitly based on /dev/random; it involves hysteresis where entropy is accumulated. A function that has an internal memoization cache has hysteresis; it will compute subsequent results more quickly after having memorized previous results.

Most functions are simply idempotent and don't generally involve any hysteresis. The result value is a  fixed mapping from the argument values.

A memoization cache preserves idempotence while adding hysteresis. The functools.lru_cache decorator, for example, adds hysteresis to an otherwise idempotent function. The result value still reflects a fixed mapping.

A random number generator cannot have idempotence; it will leverage hysteresis. We should think of a random number generator as an iterator through a sequence of numbers. Given a seed, the sequence is entirely fixed. Between any two numbers, it's very difficult to predict one from the other without also knowing the hidden seed value.

Unrelated Concept

We use idempotence and hysteresis to describe programming language features. These features are entirely deterministic.  There's no room for arbitrary, undefined, unpredictable behavior. Any non-determinism needs to be very tightly boxed in by otherwise deterministic programming constructs.

When writing a Python function, we assume that the language is deterministic. Without this assumption, it's hard to conceive of what the language would be like. What if statements were executed out of the order we wrote them?

External events -- interrupts and the like -- are non-deterministic. They can happen at any time, with no relationship to what's going on inside the software world. Inside the software world, however, we expect that everything is deterministic. This means that a server must always cope with non-deterministic request ordering. Once request processing starts, however, we generally rely on essential non-deterministic software to process the results perfectly consistently.

An important example of bounded non-determinism is in Dijksta's hypothetical programming language described in A Discipline of Programming. Here there is explicit non-determinism among the "elif" and "eldo" clauses. The selection among true alternatives was specifically non-deterministic. Indeed, an evil demon would always strive to select the worst possible choice. There was no "first one that's true" kind of silliness that would allow certain kind of logic errors to survive.

A multiprocessing application leverages the OS to keep all processes separate. Each process can then operate deterministically. Any non-determistic OS events are concealed from the process by the normal OS libraries that generally queue up events into buffers.

A multithreaded application, however, has to use some kind of explicit synchronization to handle the inherent non-determinism of thread scheduling. Thread libraries make no guarantees about the exact sequence of operations between threads; the execution is non-deterministic between threads.

For real fun, read about the non-deterministic memory write orders. The Data Race article from Intel is illuminating. Google "non-deterministic memory write order" for interesting reading on how processors have gotten -- perhaps -- too complex to be completely trustworthy.

This is different, also, from "arbitrary." A word that describes how the C language deals with constructs like a[i]= i++;. There are two unrelated state changes that happen in this statement. The order of those two things is best described as "arbitrary." It's deterministic. But it's not well defined by the language. Depending on the compiler and optimization settings, it will be entirely fixed. A function that uses these constructs could be idempotent. However, the outcome can vary from compiler to compiler and optimization setting to optimization setting. This kind of thing is devoutly to be avoided. It's presence is a less-than-ideal language design choice; writing software around a bad language feature is simply inexcusable.