Tuesday, January 26, 2021

The Awkwardness of Fundamental Definitions

A pragmatic description a language (like Python) from axiomatic -- or really axiom-like -- foundations is exasperatingly hard. I don't think I have any answers, but I sure do have a lot of challenges. I think there's a path that involves a lot of "don't look down -- just edge along the face of this cliff a few more feet and it's going to be okay."

The actual language foundations, and a more useful conceptual foundation don't always match up.

My specific example is trying to use a subset of Python to get started with.

One of the reasons we write software is to see useful results. So you need output, i.e., the print() function. On the other end of this, you might want to accept inputs. But. But... You could finesse that by simply assigning literal values instead of reading something. 

This leads to the print() function and expressions is a kind of minimal language to see the essence of programming. (I know this isn't an original thought. I'm setting up the conflict.)

Do we explain functions in general when we're explaining this subset of Python? Or do we stick with arithmetic? 

print("Hello, world!")
print("355/113=", 355/113)

Maybe the F-string?  Assignment? Or do we really need assignment? (Spoiler alert, no.)

What about other functions? Python has a bunch of built-ins. The math and random modules let us build small games without much intellectual overhead.

And how much can we explain about functions? The whole mathematical notion of function as mapping from domain to range: does that count? Or nah? What about Python's flexible argument and parameter handling? 

Exceptions? Do we have to explain them? Or do we shrug at division by zero and kind of ignore it? If we talk about exceptions, do we have to talk about stack frames and tracebacks?

At the foundation a language is all variable bindings and function evaluation. But. Do you explain any of that? And if so, how?

The additional complication is multiple authors, technical reviews, editors, and everyone else involved. There's a complex web of varying opinions on what's foundational and how much explanation is required. It's new to one person, so it should be up font. But, a lot of up-front material is boring. Everything in programming involves a tricky nuance somewhere; how much of that nuance is relevant now and how much is later and how much is digressive?

Trust

To an extent, there's a trust relationship between writer and reader. It may help to build this if the writer can provide a "trust me on this," scenario that (eventually) results in a more complete explanation. The logical conclusion here is that there's no place in the book for "too advanced, buy my next book." Instead, every difficult and nuanced thing would need to (eventually) be explained. That seems impossible, though.

With editors, co-authors, and reviewers, the trust relationship is exactly wrong. Everything needs to be challenged and clarified.

tl;dr Writing is sometimes hard. And that's the expectation. It's a narrow fairway surrounded by shoals. 

Tuesday, January 5, 2021

Diagrams and UML notation

When I started in this business I was given a flowcharting template.

See https://americanhistory.si.edu/collections/object-groups/flowcharting-templates. I'm pretty sure I had one of these: https://americanhistory.si.edu/collections/search/object/nmah_690078.

Since then, things have changed a little.

I fondly recall using the Rational Rose (and an earlier tool that did Rumbaugh OMT diagrams) to create object models.

But these were expensive.

After much searching, I found ArgoUML. This was my go-to-diagrammer of choice for many years. It's available here: https://argouml-tigris-org.github.io.

Then, I wound up using yuml at https://yuml.me/diagram/scruffy/class/draw. This was very nice because there was a source text version of the diagram. It was a high-level code-like description that would lead to a handy picture you could include in documentation.

Heavenly.

Recently, I spent time using draw.io. Start here https://draw.io. You have plain-text source version of the diagram that's Git-friendly. I liked that. It has a lot of UML features, which is very nice, also.

But now. 

I'm using plantUML, and I think it's pretty handy. https://plantuml.com. It's a big-old JAR file that converts text to a diagram. There's no GUI component to this. You describe the image in a source-code like way. Run it through the tool, and you get a picture you can paste into documentation. Like yuml, it has an easy-to-understand high-level text description. I strongly suspect I could walk a Python AST and emit plantUML source as an intermediate language from which pictures can be created.

The Python-Markup tool (https://python-markdown.github.io/extensions/) has a third-party PlantUML plug-in. PyCharm can leverage this to draw while you're editing in the markdown window.

The Fiddly Bits

It's a little fiddly to get all the parts organized properly, but, it really, really does work. You can write technical documentation, with pictures.

  • Add the Markdown tool to PyCharm.
  • In the preferences for the PyCharm Markdown tool, install and enable PlantUML.
  • You can usually use conda to install graphviz as well as installing the plantuml-markdown tools. You can manually run the markdown_py application to create the HTML copies of the .md files. 
  • Update your OS environment settings to set the GRAPHVIZ_DOT environment variable to name the conda virtual environment where graphviz was installed. For macOS and Linux users update the ~/.zshrc (or ~/.bashrc) file, depending on which shell is in use. Windows users have my heartfelt sympathy; maybe set the system environment variables.
  • You may also need to create a plantuml shell script that's on your PATH. I put it in /usr/local/bin

See https://github.com/mikitex70/plantuml-markdown for details on installation.

After all this fussing around, it worked delightfully. I'm a convert to PlantUML.

I suggest the following in each diagram.

skinparam monochrome true
skinparam handwritten false
skinparam shadowing false
hide class circle

You may want to set a more global configuration, but I sometimes want to change the handwritten parameter to true for "draft" diagrams, separate from final. 

tl;dr

You can integrate plantUML with PyCharm to draw pictures while you're editing in the markdown window.

You do have to trust plantUML to draw more-or-less what you want. There are limits, and if you don't like what plantUML is doing, switch to draw.io. If you are flexible, however, it's really, really good.


Thursday, December 24, 2020

How to avoid writing a clickbait headline. (Click for details.)

It's hard to write shameless promotional material.

I already wrote the books, isn't that bold enough?

It isn't, though. 

Packt’s Head of Product, Oli Huggins, said: “We believe in helping to serve and support the global developer community. By selling our eBooks and Videos for $5, we hope to unlock exciting new opportunities for developers who, in other situations, wouldn’t have access to our products. A key part of our mission is to unlock new opportunities for developers, help them discover new technologies, and help put software to work in new ways."

Packt has curated some of our best titles together for the Python Programmers community:

I'm delighted to be part of the promotion. It's Christmas Eve. Even if you don't celebrate this specific holiday, the passing of the winter solstice is a time of renewal. 

Tuesday, December 8, 2020

Inelegant Python

See https://www.geeksforgeeks.org/find-the-smallest-positive-number-missing-from-an-unsorted-array/ This seems to be a popular coding interview problem. 

The Python code shown on the site seems almost maliciously misleading.

The full problem is this:

You are given an unsorted array with both positive and negative elements. You have to find the smallest positive number missing from the array in O(n) time using constant extra space. You can modify the original array.

Here's a common-enough Python solution.

def smallest_missing_1(x):
    """
    >>> smallest_missing_1([2, 3, 7, 6, 8, -1, -10, 15])
    1
    >>> smallest_missing_1([2, 3, -7, 6, 8, 1, -10, 15])
    4
    >>> smallest_missing_1([1, 1, 0, -1, -2])
    2
    """
    try:
        return min(set(range(1, max(x)+2))-set(x))
    except ValueError as ex:
        # min() arg is an empty sequence
        return 1

Some folks don't like the try/except to detect the edge case of all negative values. If  max(x) <= 0, then the exception will be raised, and we could use an if statement for a LBYL solution.

What's more important is this solution violates the constant extra space constraint. It builds two sets, which isn't a simple constant size object; it depends on the size of the input object.

To avoid the sets, we'll use a generator.

def smallest_missing_2(x):
    """
    >>> smallest_missing_2([2, 3, 7, 6, 8, -1, -10, 15])
    1
    >>> smallest_missing_2([2, 3, -7, 6, 8, 1, -10, 15])
    4
    >>> smallest_missing_2([1, 1, 0, -1, -2])
    2
    """
    try:
        return next(n for n in range(1, max(x)+2) if n not in x)
    except StopIteration as ex:
        # next() found an empty sequence
        return 1

This violates the O(n) constraint with the repeated use of the in operator.

To get to O(n) and no extra storage, we're forced to (a) demand the input is a mutable list, so we can (b) reuse the input list object to track which numbers are present and which are absent. This reuse of a mutable list is atypical for Python programming. Indeed, it seems like a bad idea. 

Consistent with the spirit of the problem, we're constrained to making arithmetic changes to the values in the original list, x, to track the state of the computation. The idea is that the value x[i] will have *both* an original input value, and the presence (or absence) of some value, p+i, in the sequence. 

One traditional approach is to use the sign as a way to carry this extra bit of information. That's why negative numbers are thrown in to the input data. They make the sign business super confusing. Also. That's why zero is excluded. Conventional integer math doesn't have a negative zero, confounding the problem with array slots that have numbers that make sign processing icky.

I'm not a fan of using the sign for this. I'd prefer to use Least Significant Bits (LSB's) because we have a fairly large number of available LSB's. And. We can trivially ignore zero values and their habit of not having useful signs. Unless the list has 2**62 elements, a little shifting won't hurt any.

Here's a solution that would *rarely* be used in normal Python work. Maybe on a Circuit Playground Express MicroPython. But not anywhere else.

from typing import List

def smallest_missing_3(x: List[int]) -> int:
    """
    >>> smallest_missing_3([2, 3, 7, 6, 8, -1, -10, 15])
    1
    >>> smallest_missing_3([2, 3, -7, 6, 8, 1, -10, 15])
    4
    >>> smallest_missing_3([1, 1, 0, -1, -2])
    2
    """
    # Purge negative numbers. Scale the other numbers.
    for i in range(len(x)):
        if x[i] < 0:
            x[i] = 0
        else:
            x[i] = x[i] << 1
    # Set LSB on positions which are filled; ignoring None values.
    # This can raise an index out-of-bounds exception, which we silence.
    for v in filter(None, (scaled >> 1 for scaled in x)):
        try:
            x[v-1] = x[v-1] | 1
        except IndexError:
            pass
    # Find first value with LSB not set.
    for i in range(len(x)):
        if x[i] & 1 == 0:
            return i+1

This is pretty atypical Python code. I'm kind of shocked folks would use something like this as an interview question. It's rather complex and requires some very old-school programming tricks to make the whole thing remotely palatable. 

The index out-of-bounds is particularly nasty. It means there's a number, n, that's greater than len(x). This is worrisome, but, it also means any gap MUST be less than this large number n. For this reason, we can silence array index errors.

I would not be able to simply stand up in a conference room and solve this without some additional direction. The "making arithmetic changes to the values in the original list" secret is something I knew about and did -- when I was younger -- but I haven't done that kind of thing in decades.

Tuesday, November 10, 2020

Mind the Gap -- mypy's slight lag behind Python 3.9

Working on a new book. Fun stuff. It's going to cover Python 3.9. 

I'm adding the type hints material. And that means PEP 585. Which means type hints for generics can use the generic types. We can use list[int] instead of List[int] and avoid from typing import List.

It's all fun.

Until...

Wait... What?

When I run mypy, it doesn't like the PEP 585 changes. 

I make sure I'm running Python 3.9.0 everywhere. I make sure I have mypy 0.790. I'm baffled. 

Then. I find this.

https://github.com/python/mypy/issues/7907

Ah. It appears that mypy is not completely up-to-speed with Python. 

And this makes perfect sense. 

What I think it means is that I'll have to tweak all of the examples when mypy also supports PEP 585. 

For now, I'm sticking with strict checks and the 0.790 release of mypy.

Tuesday, November 3, 2020

"Python doesn’t do tail recursion" -- wait, what?

Yes, that's what the email said.

I was -- well -- shocked. Too shocked to be polite. Content Warning: much snark follows.

BLUF: Tail Recursion is not Tail Recursion Optimization.

Eventually, it became clear they were worried about tail recursion optimization. Maybe I'm too focused on words, but I think words matter. The initial claim was so clearly wrong, I had to challenge it. It took three *more* emails to get to the optimization point.

Hence my overload of snark. Three emails to discover they didn't see the word "optimization."

Tail Recursion

Here's an example. (I wish questions included example code instead of weird assertions that are clearly false.)

def f(x: int) -> int:
    if x == 0: return 1
    return x*f(x-1)

This function demonstrates tail recursion. The "tail" of the function involves a recursive reference to the function. An optimizing compiler can optimize the recursion to make it not recursive.

If Python doesn't do tail recursion, this function would not work.

Since it works, Python does tail recursion.

Python limits recursion so that you crash cleanly before you're out of memory. 

This is important. Older languages would run out of memory for stack frames. Instead of reporting a recursion problem, they'd just crash. Out-of-memory. Sorry. No drop to `pdb`. Nothing. This is bad. 

In the old Mac OS (≤9) the stack and heap memory grew from opposite ends of the available RAM. If they collided, it was a stack overflow, and the running app was crashed.

Here's how the stack limitation plays out in practice:

Python 3.9.0 | packaged by conda-forge | (default, Oct 10 2020, 20:36:05) 
[Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(x: int) -> int:
...     if x == 0: return 1
...     return x*f(x-1)
... 
>>> f(999)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in f
  File "<stdin>", line 3, in f
  File "<stdin>", line 3, in f
  [Previous line repeated 995 more times]
  File "<stdin>", line 2, in f
RecursionError: maximum recursion depth exceeded in comparison
>>> f(997)
403597244616453902342926527652907402110903352461393891430307973735196631901068423726375883385358710213700663198466197719394411318126551961682447808198924968051643330792790545975658652366984953410102994729193397927862707860663312211428139657339199210288839829245965893084586772188847949801354437616450752245066665598898009417557796737695167521343249479413631414534202184726421479392615730781173164526393982880263279118925406206180689438683308644696334133955184235540077242451165903811018277198321800315958279899941381566151490917379981054549852483223292752438981198080270888254399197574536460570473430371595872403169486757166154294941258045311241382930836862005052391967478429035362983199050663230586866257612402804942403442331663944341683350732204123565349869446216232111598995678724462182568501131746383857706790400107507266739002631612931124112227909672935742104968533278074796000335855930432060517447195226436187301231195091058916141500005034486568847599649004940677693185232218378659444854645703908824934015144550035704605317977378620311855095356769488892217130200011250491151641531090120083765159221969755314437880209281708574493693840125338722070514029362985801732618715060934298236579096167095859504053310608725711198457200226544350445941157734863407428532431126485686678788466148681975019174010453297639004006819520704463840773528691224455265229985489764356909675383800245969276679872407757924211918488179598530382266647547907226165479802976547896939656888813256826539067915695278878516257396920983511389029563101112325372395464739783143361362879872578550147571168136083391354242735142803988735616917749898060073075542403509536490539404444972668319521415425667918323473675966566332390993259591959049424070380861864682206986463729281557338747466546627859206287571996491606797979064142819469589200812679026561288124087136359830959867034513441434850212864818601504529520195828528045600869420646442863720485414968365312690523835026508545772659712105161137693595262919371358840019473383802028344531181679417716563013501242477291139042422814166369601152223293596957527530934652046662174154235850073391729650007182794396630407081318880947107940245036774649857429379220776637356890211596540009349092255988047909417594778375705723841918167663026277009033939654785671715045122185315730249393616044737902170116980736000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
>>> 

This shows how a large number of operations (around 1,000) exceeds an internal limit. The exception is not out-of-memory, it's too much recursion.

A slightly smaller number (997 in this example) worked.  999 didn't work because it exceeded the threshold.

Manual Tail Recursion Optimization

New word: Optimization. New concept.

If the tail recursion were optimized into a loop, we'd see code that behaves as if we wrote this:

>>> from math import prod
>>> def f1(x: int) -> int:
...     return prod(range(1, x+1))
    

This unwinds the tail recursion into a loop. A generator, range(1, x+1), creates a sequence of values, which are reduced into a product. This doesn't involve recursion or stack frames. Indeed, because it's a generator, it involves very little memory.

And it works for  numbers well over 1,000. Evidence (if any were needed) that tail recursion optimization is *not* being done in Python.

We'll repeat this for those who feel the need to send me crazy-sounding emails.

Automatic Tail Recursion Optimization

There is no automatic transformation from tail recursion to loop in Python.

I'll repeat that for the folks who send me emails claiming tail recursion isn't supported. (Everyone else probably has a pretty good grip on this.)

There is no automatic transformation from tail recursion to loop. 

Tail recursion works. It has a limit.

Manually optimized tail recursion (to create a loop) also works. It doesn't have the same limit.

Stated yet another way: unlike some languages, Python does not optimize for you. You must do the rewrite yourself.

While I would have thought these ideas (tail recursion and tail recursion optimization) were different. I was wrong. Hopefully, this blog post will help folks read *all* the words.  

I'm also pretty sure it's covered in here: https://www.packtpub.com/product/functional-python-programming-second-edition/9781788627061.

Tuesday, October 27, 2020

Python 3.9 Rearranging typing and collections.abc

This is glorious.

There was this sort of awkward shoving match between typing and collections.abc. Both had generic type definitions and it was -- often -- unclear how to use them.

See PEP 585. Now they are all unified into a much happier family. 

And. We wind up writing things like this.

import collections.abc
import typing
import sys

if sys.version_info >= (3, 9):
    BucketCollection = collections.abc.Collection[Sample]
else:
    BucketCollection = typing.Collection[Sample]

Now we can have code that passes tests for 3.8 and 3.9. And at some point we can cut off the 3.8 support and delete the ancient alternative definition.

I'm delighted to be able to move forward with a much simpler future in which collections are in the collections.abc and other, more foundational stuff is in typing.