Tuesday, February 24, 2015

Functional Python Programming

New from Packt Publishing: Functional Python Programming.

Also here on Amazon.

The fun part is covering generator functions, iterators, and higher-order functions in some real depth. There are a lot of powerful programming techniques available.

What's challenging is reconciling Python's approach to FP with more purely functional languages like Haskell and OCaml. Years ago, I saw some discussion on Stack Overflow that Python simply wasn't a proper functional programming language because it lacked some features. I'm vague on the specifics (perhaps there weren't any), but the gaps between Python and FP are narrow.

As far as I can tell, the single biggest feature missing is non-strict evaluation coupled with an optimizer that can rearrange expressions to improve performance. This feature pair also tends to produce nice tail-call optimization of recursions.

Languages which are totally non-strict (or fully lazy) need to introduce monads so that some ordering can be enforced in the cases where ordering really does matter.

Since Python is strict (with only minor exceptions) monads aren't needed. But we also sacrifice some optimization capability because we can't reorder Python's strict expressions. I'm not sure this gap is so huge that we can indict Python as non-functional or unsuitable for a functional approach. I think the lack of an optimizing compiler is a little more than an interesting factoid.

An interesting problem that compiled functional languages have is resolving data types properly. It's a problem that all statically-typed languages share. In order to write a really generic algorithm, we either have to rely on a huge type hierarchy or sophisticated type pattern-matching rules. Python eschews this problem by making all code generic with respect to type. If we've applied a function to an inappropriate object, we find out through unit testing that we have TypeError exceptions.
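Here's a small sketch of my own (not from the book) showing the trade-off: the function is generic for free, and a misuse surfaces as a TypeError when the tests exercise it.

def product(items):
    # No type declarations: works for anything that supports "*".
    result = 1
    for item in items:
        result = result * item
    return result

product([2, 3, 4])     # 24 -- ints work
product([2.5, 4.0])    # 10.0 -- so do floats, with no extra declarations
product(["a", "b"])    # TypeError -- the wrong type shows up at test time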

I think we can (and should) borrow functional programming design patterns and reimplement them in Python. This is quite easy and doesn't involve too much work or overhead. For example, the yield from statement allows us to do manual tail-call optimization rather than trusting that the compiler will recognize the code pattern.
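As a sketch of the kind of rewrite I mean (the flatten() and sum_to() examples here are mine, not from the book): yield from lets a recursive generator delegate cleanly, and where a recursion is a simple tail call we convert it to a loop by hand, since CPython won't do it for us.

def flatten(nested):
    # Recursive generator: yield from delegates to the recursive call,
    # streaming results out without building intermediate lists.
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def sum_to(n, total=0):
    # Tail-recursive in spirit, but each call costs a CPython stack frame.
    if n == 0:
        return total
    return sum_to(n - 1, total + n)

def sum_to_loop(n):
    # The manual "tail-call optimization": the same logic as a loop.
    total = 0
    while n > 0:
        total, n = total + n, n - 1
    return total

list(flatten([1, [2, [3, 4]], 5]))   # [1, 2, 3, 4, 5]
sum_to(100) == sum_to_loop(100)      # True, but only the loop survives n = 100000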

Tuesday, February 17, 2015

Yet Another Complaint about Python in General, SciPy in Particular

The context is an ongoing question about optimization -- not my strong suit -- and the SciPy algorithms for this. See Scipy.optimization.anneal Problems for some additional confusion over simple things.

The new quote is this:
However, firing up Python, NumPy, SciPy and figuring out which solver to use is not convoluted? Keep on writing code and over engineering as opposed to using the minimum tech in order to get the job. After all, we are professionals.
It appears that using a packaged, proven optimizer is somehow "convoluted." Apparently, the Anaconda product never surfaced in a Google search. This seems to indicate that perhaps (a) Google was never used or (b) the author didn't get to page 4 of the search results, or (c) the author never tried another search beyond the single-word "scipy".

I'm guessing they did not google "Python simulated annealing" -- the actual subject -- because there are a fairly large number of existing solutions to this. Lots and lots of lecture notes and tutorials. It seems to be a rich area full of tutorials on both optimization and Python. Reading a few of these would probably have addressed all of the concerns.

Anaconda, BTW, appears to be an amazing product. It seems to be the gold standard for data science. (I know of organizations that have in-house variations on this theme. They bundle Python plus numerous extra packages and a variety of installers for Mac OS X, Windows, and Linux.)

The "Keep on writing code" complaint is peculiar. The optimization examples in SciPy seem to involve less than a half-dozen lines of code. Reading a CSV file can be digested down to four lines of code.

import csv
with open("constraints.csv", newline="") as source:
    rdr = csv.DictReader(source)
    data = list(rdr)

I can only guess that the threshold for "over engineering" is a dozen lines of code. Fewer lines are acceptable; more are bad.

I don't know what "using the minimum tech in order to get the job" means, but the context included an example spreadsheet that was somehow a solution to an instance of a problem. I'm guessing from this that "minimum tech" means "spreadsheet."

Read this: When spreadsheets go bad. There are a lot of war stories like this. (For information on the  original quote, read 'What is meant by "Now you have two problems"?')

I regret not asking follow-up questions.

The more complete story is this: rather than actually leverage SciPy, the author of the quote appears to be fixated on rewriting a classic Simulated Annealing textbook example into a spreadsheet because reasons. One of which is that more modern algorithms in SciPy aren't actually classic simulated annealing. The newer algorithms may be better, but since they're not literally from the textbook, this is a problem.

And my suggestion -- just use SciPy -- was dismissed as "convoluted", "over-engineering", and -- I guess -- unprofessional.

Thursday, February 5, 2015

Idempotence, Hysteresis and Determinism

Three terms that seem to cause confusion: Idempotence, Hysteresis, and Determinism. The subject came up during my webcast on the Five Kinds of Python Functions. We can use all three terms to describe a function. Two of them are relevant to common design questions in software. The third is a higher-order architectural consideration, and not really part of writing a function definition in Python.

Idempotent -- narrowly defined --  means that f(f(x)) = f(x). In computer science, the meaning is stretched so that we can distinguish functions like random number generators from other functions. A random number generator (Python's random.random(), for example) is not idempotent. It returns a different result each time it's called. Many other functions are idempotent because they always return the same result given the same arguments. The functions in Python's os module are not idempotent. The results change based on external events.
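A quick sketch of the distinction:

import math
import random

abs(abs(-5)) == abs(-5)               # True: the narrow f(f(x)) == f(x) sense
math.sqrt(2) == math.sqrt(2)          # True: same arguments, same answer, always
random.random() == random.random()    # False (almost surely): hidden state changes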

Hysteresis is memory of previous events. A random number generator may have some hidden hysteresis so that it can return the next value in its random sequence. Note that os.urandom() is explicitly based on the OS entropy source (/dev/urandom on Unix); it involves hysteresis where entropy is accumulated. A function that has an internal memoization cache has hysteresis; it will compute subsequent results more quickly after having memorized previous results.

Most functions are simply idempotent and don't generally involve any hysteresis. The result value is a  fixed mapping from the argument values.

A memoization cache preserves idempotence while adding hysteresis. The functools.lru_cache decorator, for example, adds hysteresis to an otherwise idempotent function. The result value still reflects a fixed mapping.
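A minimal sketch of that combination:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Idempotent: fib(30) is 832040 no matter how many times we ask.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

fib(30)            # first call fills the cache
fib(30)            # same value, answered from memory
fib.cache_info()   # the hysteresis: hits and misses remembered between calls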

A random number generator cannot have idempotence; it will leverage hysteresis. We should think of a random number generator as an iterator through a sequence of numbers. Given a seed, the sequence is entirely fixed. Between any two numbers, it's very difficult to predict one from the other without also knowing the hidden seed value.
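Seeding makes the "fixed sequence" point concrete:

import random

gen_a = random.Random(42)
gen_b = random.Random(42)

# Same seed, same fixed sequence: determinism behind the hysteresis.
[gen_a.random() for _ in range(3)] == [gen_b.random() for _ in range(3)]   # True

# But a single generator never repeats itself call-to-call: no idempotence.
gen_a.random() == gen_a.random()   # False (almost surely)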

Unrelated Concept

We use idempotence and hysteresis to describe programming language features. These features are entirely deterministic.  There's no room for arbitrary, undefined, unpredictable behavior. Any non-determinism needs to be very tightly boxed in by otherwise deterministic programming constructs.

When writing a Python function, we assume that the language is deterministic. Without this assumption, it's hard to conceive of what the language would be like. What if statements were executed out of the order we wrote them?

External events -- interrupts and the like -- are non-deterministic. They can happen at any time, with no relationship to what's going on inside the software world. Inside the software world, however, we expect that everything is deterministic. This means that a server must always cope with non-deterministic request ordering. Once request processing starts, however, we generally rely on essentially deterministic software to process the results perfectly consistently.

An important example of bounded non-determinism is in Dijkstra's hypothetical programming language described in A Discipline of Programming. Here there is explicit non-determinism among the "elif" and "eldo" clauses. The selection among true alternatives was specifically non-deterministic. Indeed, an evil demon would always strive to select the worst possible choice. There was no "first one that's true" kind of silliness that would allow certain kinds of logic errors to survive.

A multiprocessing application leverages the OS to keep all processes separate. Each process can then operate deterministically. Any non-deterministic OS events are concealed from the process by the normal OS libraries, which generally queue up events into buffers.

A multithreaded application, however, has to use some kind of explicit synchronization to handle the inherent non-determinism of thread scheduling. Thread libraries make no guarantees about the exact sequence of operations between threads; the execution is non-deterministic between threads.
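A sketch of the usual remedy (my own minimal example): make the non-determinism harmless by synchronizing access to the shared state.

import threading

counter = 0
lock = threading.Lock()

def worker(repeat=100000):
    global counter
    for _ in range(repeat):
        # Without the lock, this read-modify-write can interleave with other
        # threads in a non-deterministic order and silently lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # 400000 on every run: the lock restores a deterministic result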

For real fun, read about the non-deterministic memory write orders. The Data Race article from Intel is illuminating. Google "non-deterministic memory write order" for interesting reading on how processors have gotten -- perhaps -- too complex to be completely trustworthy.

This is different, also, from "arbitrary," a word that describes how the C language deals with constructs like a[i] = i++;. There are two unrelated state changes in this statement, and the order of those two things is best described as "arbitrary." It's deterministic, but it's not well defined by the language. Depending on the compiler and optimization settings, it will be entirely fixed. A function that uses these constructs could be idempotent; however, the outcome can vary from compiler to compiler and optimization setting to optimization setting. This kind of thing is devoutly to be avoided. Its presence is a less-than-ideal language design choice; writing software around a bad language feature is simply inexcusable.

Thursday, January 29, 2015

Bottle vs. Flask vs. Django vs. a dozen others

There are times when a "micro framework" is actually useful. I wasn't easily convinced that this could be true. Big framework or die trying. Right?

Maybe not so right.

My primary example of a micro framework's value is a quick demo site to show how some API's are going to be consumed.

I've been avoiding an Angular.js app. We're going to be supporting the user experience with a sophisticated Angular.js app. But, as a back-end developer, I don't want to try to write demo pages in the proper app framework. There are too many ways to screw this up because I'll miss some key UX feature. Instead, I want to write fake pages that show a considerably simplified version of consuming an API. Sort of "suggestion" pages to clarify how the API's fit together.

To make it even more complex than necessary, I'm not interested in learning Angular.js, and I'm not interested in figuring out how it works. Running node.js, bower, npm, etc., etc., is too much.

[And Angular.js is just one version of the front-end. There's also mobile web, mobile native, tablet native, and God-alone-only-knows-what-all-else, to name a few. Each unique.]

Several sprints back, I slapped together two fake forms using bootstrap for layout and hosted them with Bottle. The navigation and framework look were simply copied from a screenshot and provided as static graphics. Lame, but acceptable. All that matters is getting the proper logo to show up.

The problem is that the sequence of API requests has grown since then. The demo grew to where we need a session so that alternative sequences will provide proper parameters to the APIs. We're beyond "Happy Path" interactions and into "what-if" demonstrations to show how to get (or avoid) a 404.

Bottle started with the significant advantage of fitting entirely into a single .py module. The barrier to entry was very low. But then the forms got awkwardly complex and Jinja2 was required. Now that sessions are required, the single-module benefit isn't as enticing as it once was.

I've been forced to upgrade from Bottle to Flask. This exercise points out that I should have started with Flask in the first place. Few things are small enough for Bottle. In some ways, the two are vaguely similar. The @route() annotation being the biggest similarity. In other ways, of course, the two are wildly different. There's only a single Flask, but we can easily combine multiple Bottles into a larger, more comprehensive site. I like the composability of Bottles, and wish Flask had this.

The Flask Blueprints might be a good stand-in for composing multiple Bottles. Currently, though, each functional cluster of API's has its own unique feature set. The bigger issue is updating the configuration to track the API's through the testing and QA servers as they march toward completion. Since they don't move in lock-step, the configuration is complex and dynamic.
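A hedged sketch of what that Blueprint composition might look like, with placeholder clusters standing in for the real API groups:

from flask import Blueprint, Flask

# One Blueprint per functional cluster of API demos -- roughly the role
# each separate Bottle application played before.
search_demo = Blueprint("search_demo", __name__)

@search_demo.route("/search")
def search_form():
    return "fake search page"

orders_demo = Blueprint("orders_demo", __name__)

@orders_demo.route("/orders")
def orders_form():
    return "fake orders page"

app = Flask(__name__)
app.register_blueprint(search_demo, url_prefix="/demo")
app.register_blueprint(orders_demo, url_prefix="/demo")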

The transparent access to session information is a wonderful thing. I built quick-and-dirty session management in Bottle. It used shelve and a "simple" cookie. But it rapidly devolved into a lot of code to check for the cookie and persist the cookie. Each superficially simple Bottle @route() needed a bunch of additional functionality.
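Roughly the shape of the problem (a reconstruction from memory, not the actual demo code): every Bottle route had to bracket its real work with cookie and shelve housekeeping.

import shelve
import uuid
from bottle import request, response, route

@route("/step1")
def step1():
    # Every route repeats this dance: find or mint the cookie, load the
    # session state, do the real work, then persist the state again.
    session_id = request.get_cookie("session_id") or str(uuid.uuid4())
    response.set_cookie("session_id", session_id)
    with shelve.open("sessions") as store:
        state = store.get(session_id, {})
        state["last_step"] = "step1"   # the actual demo work hides in here
        store[session_id] = state
    return "ok"

In Flask, the equivalent route just reads and writes session like a dictionary (given a secret_key on the app), and all of the cookie bookkeeping disappears.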

The whole point was to quickly show how the API's fit together. Not reinvent Yet Another Web Framework Based On Jinja2 and Werkzeug.

Django seems like too much for this job. We don't have a model; and that's the exact point. Lacking a database model doesn't completely break Django, but it makes a large number of Django features moot. We just have some forms that get filled in for different kinds of events and transactions and searches and stuff. And we need a simple way to manage stateful sessions.

Omitted from consideration are the other dozen-or-so frameworks listed here: http://codecondo.com/14-minimal-web-frameworks-for-python/. This is a great resource for comparing and contrasting the various choices. Indeed, this was how I found Bottle to begin with.

Tuesday, January 20, 2015

Webcast Wednesday

Be there: http://www.oreilly.com/pub/e/3255

Of course, I've got too many slides. 58 slides for a 60 minute presentation. That's really about 2 hours of material. Unless people have questions, then it's a half-day seminar.

Seriously.

I think I've gone waaaay too far on this. But it's my first one, and I'd hate to burn through all eight slides, take a few questions and be done too soon.

If this goes well, perhaps I'll see if I can come up with other 1-hour topics.

I worry a great deal about rehashing the obvious.

On the other hand, I'm working with a room full of newbies, and I think I could spend several hours on each of their questions.

And straightening out their confusions.

Case in point. Not directly related to the webcast.

One of my colleagues had seen a webcast which described Python's &, |, and ~ operators, comparing them with and, or, and not.

I'm not 100% sure, but... I think that this webcast -- I'm getting this second-hand; it's just hearsay -- showed that there's an important equivalence between and and &.

This is true, but hopelessly obscure. Since & has a higher priority than the comparison operators, there will be serious confusion when one fails to parenthesize properly.

Examples like this abound:

>>> 3 == 3 & 4 < 5
False
>>> (3 == 3) & (4 < 5)
True

Further, the fact that & can't short-circuit had become confusing to the colleague. I figured out some of what was going on when trying to field some seemingly irrelevant questions on "Why are some operators more efficient?" and "How do you know which to use?"

Um. That's not really the point. There's no confusion if you set the bit-fiddling operators aside.

The point is that and, or, not, and the if-else conditional expression live in their own domain of boolean values. The fact that &, |, ^, and ~ will also operate on boolean values is a kind of weird duplication, not a useful feature. The arithmetic operators also work on booleans. Weirdly.

The Python rules are the rules: results depend on the operand types, so it makes sense for True & True to yield True, and it would be wrong in that sense for True & True to be 1. But it would also fit the concept of these operators a little better if they always coerced bool to int, the way * and + do: True + True == 2.
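A quick interactive check of the current behavior:

>>> True & True
True
>>> True + True
2
>>> True * False
0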

Why can't it be true for & and |? It would reduce potential confusion. 

I'm sure the person who implemented __and__(), __or__(), __xor__(), and __invert__() was happy to create a parallel universe between and and &. I'm not sure I agree.

And perhaps I should have a webcast on Python logic. It seems like a rehash of fundamentals to me. But I have colleagues confused by fundamentals. So perhaps I'm way wrong about what's fundamental and what's useful information.


Thursday, January 15, 2015

Chapter 12 Alternate Example - Normalization and Decorators

In the forthcoming Functional Python Programming (https://www.packtpub.com/application-development/functional-python-programming) I was pressured by one of the technical reviewers to create a better example of composite function creation with decorators.

This was a difficult request. First, of course, "better" is poorly defined. More importantly, the example in the book is extensive and includes the edge-of-the-envelope "don't do this in real code" parts, too. It's important to be thorough. Finally, it's real-world data cleansing code. It's important to be pragmatic, but it's kind of boring. I really do beat it into submission, showing simple decorators, parameterized decorators, and crazy obscurely bad decorators.

In this case, "better" might simply mean "less thorough."

But, perhaps "better" means "less focused on cleansing and more focused on something else."

On Decoration

The essence of the chapter -- and the extensive example -- is that we can use decorators as higher-order functions to build composite functions.

Here's an alternative example. This will combine z-score normalization with a reduction function. Let's say we're doing calculations that require us to normalize a set of data points before using them in some reduction.

Normalizing is the process of scaling a value by the mean and standard deviation of the collection. Chapter 4 covers this in some detail. Reductions like creating a sum are the subject of Chapter 6. I won't rehash the details of these topics in this blog post.

Here's another use of decorators to create a composite function.

from functools import wraps

def normalize(mean, stdev):
    # Build a z-score function from the given mean and standard deviation.
    z_score = lambda x: (x - mean) / stdev
    def concrete_decorator(function):
        @wraps(function)
        def wrapped(data_arg):
            # Normalize the argument values before applying the decorated
            # reduction function.
            z = map(z_score, data_arg)
            return function(z)
        return wrapped
    return concrete_decorator

The essential feature of the @normalize(mean, stdev) decorator is to apply the normalization to the vector of argument values to the original function. We can use it like this.

>>> d = [2, 4, 4, 4, 5, 5, 7, 9]
>>> from Chapter_4.ch04_ex4 import mean, stdev
>>> m_d, s_d = mean(d), stdev(d)
>>> @normalize(m_d, s_d)
... def norm_list(d):
...     return list(d)
...
>>> norm_list(d)
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

We've created a norm_list() function that applies normalization to the given values. This function is a composite of normalization plus list().

Clearly, parts of this are deranged. We can't even define the norm_list() function until we have mean and standard deviation parameters for the samples. This doesn't seem appropriate.

Here's a slightly more interesting composite function. This combines normalization with sum().

>>> @normalize(m_d, s_d)
... def norm_sum(d):
...     return sum(d)
...
>>> norm_sum(d)
0.0

We've defined the normalized sum function and applied it to a vector of values. The normalization has parameters applied. Those parameters are relatively static compared with the parameters given to the composite function.

It's still a bit creepy because we can't define norm_sum() until we have the mean and standard deviation.

It's not clear to me that a more mathematical example is going to be better. Indeed, the limitation on decorators seems to be this:

  • The original (decorated) function can have lots of parameters;
  • The functions being composed by the decorator must either have no parameters, or have very static "configuration" parameters.
If we try to compose functions in a more general way -- all of the functions have parameters -- we're in for problems. That's why the data cleansing pipeline seems to be the ideal use for decorators.