Tuesday, January 18, 2022

How to Test a Random Number Generator

Nowadays, we don't have the same compelling reasons to test a random number generator. The intervening decades have seen a lot of fruitful research. Good algorithms.

Looking back to my 1968 self, however, I still feel a need to work out the solution to an old problem. See The Old Old Days -- ca. 1968 for some background on this.

What could I have done on that ancient NCE Fortran -- with four digit integers -- to create random numbers? Step 1 was to stop using the middle-squared generator. It doesn't work.

Step 2 is to find a Linear Congruential Generator that works. LCG's have a (relatively) simple form:

\[X_{n+1} = (X_n \times a + c) \bmod m\]

In this case, the modulo value, m, is 10,000. What's left is step 3: find a and c parameters.

To find suitable parameters, we need a battery of empirical tests. Most of them are extensions to the following class:

from collections import Counter
from typing import Hashable

class Chi2Test:
    """The base class for empirical PRNG tests based on chi-squared testing."""
    
    #: The actual distribution, created by ``test()``.
    actual_fq : dict[Hashable, int]
    
    #: The expected distribution, created by ``__init__()``.
    expected_fq: dict[Hashable, int]
    
    #: The lower and upper bound on acceptable chi-squared values.
    expected_chi_2_range: tuple[float, float]
    
    def __init__(self):
        """
        A subclass will override this to call ``super().__init__()`` and then
        create the expected distribution.
        """
        self._chi2 = None
    
    def test(self):
        """
        A subclass will override this to call ``super().test()`` and then
        create an actual distribution, usually with a distinct seed value.
        """
        self._chi2 = None
        
    @property
    def chi2(self) -> float:
        """Return chi-squared metric between actual and expected observations."""
        if self._chi2 is None:
            a_e = (
                (self.actual_fq[k], self.expected_fq[k]) 
                for k in self.expected_fq 
                if self.expected_fq[k] > 0
            )
            v = sum((a-e)**2/e for a, e in a_e)
            self._chi2 = v
        return self._chi2

    @property
    def pass_test(self) -> bool:
        return self.expected_chi_2_range[0] <= self.chi2 <= self.expected_chi_2_range[1]

This defines the essence of a chi-squared test. There's another test that isn't based on chi-squared: the serial correlation test, where a correlation coefficient is computed between adjacent pairs of samples. We'll ignore this special case for now and focus on the battery of chi-squared tests; a sketch of the correlation computation follows for reference.
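For a sense of what that test measures, here's a minimal sketch (not part of the battery below) of a serial correlation coefficient over adjacent pairs:

def serial_correlation(samples: list[int]) -> float:
    """Pearson correlation between each sample and its successor; near 0 looks random."""
    xs, ys = samples[:-1], samples[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5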

Linear Congruential Pseudo-Random Number Generator

We'll also need an LC PRNG that's constrained to 4 decimal digits.

It looks like this:

class LCM4:
    """Constrained by the NCE Fortran 4-digit integer type."""
    def __init__(self, a: int, c: int) -> None:
        self.a = a
        self.c = c
    def seed(self, v: int) -> None:
        self.v = v
    def random(self) -> int:
        self.v = (self.a*self.v % 10_000 + self.c) % 10_000
        return self.v

This mirrors the old NCE Fortran on the IBM 1620 computer. 4 decimal digits. No more. 

We can use this to generate a pile of samples that can be evaluated. I'm a fan of using generators because they're so efficient. The use of a set to watch for duplicates while building a list seems weird, but it's very fast.

from typing import Callable, Hashable, Iterator

# Assumption: a pile size of 3,200 samples, matching the discussion below.
N_SAMPLES = 3_200

def lcg_samples(rng: LCM4, seed: int, n_samples: int = N_SAMPLES) -> list[int]:
    """
    Generate a bunch of sample values. A repeat implies a cycle, and we'll stop early.

    >>> lcg_samples(LCM4(1621, 3), 1234)[:12]
    [317, 3860, 7063, 9126, 3249, 6632, 475, 9978, 4341, 6764, 4447, 8590]

    """
    rng.seed(seed)
    def until_dup(f: Callable[..., Hashable], n_samples: int) -> Iterator[Hashable]:
        seen: set[Hashable] = set()
        while (v := f()) not in seen and len(seen) < n_samples:
            seen.add(v)
            yield v
    return list(until_dup(rng.random, n_samples))

This function builds a list of values for us. We can then subject the set of samples to a battery of tests. We'll look at one test as an example for the others. They're each devilishly clever and require a little bit of coding smarts to get them to work correctly and quickly.

Frequency Test

Here's one of the tests in the battery of chi-squared tests. This is the frequency test that examines values to see if they have the right number of occurrences. We pick a domain, d, and parcel numbers out into this domain. We use \(\lfloor\frac{d \times X_{n}}{10,000}\rfloor\) because this tends to leverage the left-most digits, which are somewhat more random than the right-most digits.

class FQTest(Chi2Test):
    expected_chi_2_range = (7.261, 25.00)

    def __init__(self, d: int = 16, size_samples: int = 6_400) -> None:
        super().__init__()
        #: Size of the domain
        self.d = d
        #: Number of samples expected
        self.size_samples = size_samples
        #: Frequency for Chi-squared comparison
        self.expected_fq = {e: int(self.size_samples/self.d) for e in range(self.d)}
    
    def test(self, sequence: list[int]) -> None:
        super().test()
        self.actual_fq = Counter(int(self.d*s/10_000) for s in sequence)

We can apply this test to some samples, compare with the expectation, and save the chi-squared value. This lets us look at LCM parameters to see if the generator creates suitably random values.

The essential test protocol is this:

samples = lcg_samples(LCM4(1621, 3), seed=1234)
fqt = FQTest()
fqt.test(samples)
fqt.chi2

The protocol creates some samples and applies the frequency test. The next step is to examine the chi-squared value to see if it's in the allowable range, \(7.261 \leq \chi^2 \leq 25\).
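The pass_test property wraps that range check, so the protocol can end with a simple assertion:

assert fqt.pass_test, f"chi-squared {fqt.chi2:.2f} outside {fqt.expected_chi_2_range}"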

The search space

Superficially, it seems like there could be 10,000 choices of a and 10,000 choices of c parameter values for this PRNG. That's 100 million combinations. It takes a bit of processing to look at all of those. 

Looking more deeply, the values of c are often small numbers relatively prime to 10,000: 1, 3, 7, 11, or some such. That really cuts down on the search. The values of a have a number of other constraints with respect to the modulo value. Because 10,000's prime factors are 2 and 5 (and it's divisible by 4), a - 1 should be a multiple of 20; values like \(20k + 1\) will work. Sensible combinations are defined by the following domain:

combinations = [
    (a, c)
    for c in (1, 3, 7, 11,)
    for a in range(21, 10_000, 20)
]

This is just under 2,000 distinct combinations (499 values of a times 4 values of c), something we can compute on our laptop.

The problem with evaluating these is that each combination's testing is compute-intensive. This means we want to use as many cores of our machine as we have available. We don't want to process each combination serially on a single core. A thread pool isn't going to help much, because CPython's Global Interpreter Lock keeps CPU-bound threads from running in parallel.

Because the OS will scatter separate processes among all the cores, we need a process pool.

Here's how to spread the work among the cores:

    from concurrent.futures import ProcessPoolExecutor, as_completed
    from rich.progress import Progress

    combinations = [
        (a, c)
        for c in (1, 3, 7, 11)
        for a in range(21, 10_000, 20)
    ]

    with Progress() as progress:
        setup_task = progress.add_task("setup ...", total=len(combinations))
        finish_task = progress.add_task("finish...", total=len(combinations))

        with ProcessPoolExecutor(max_workers=8) as pool:
            futures = [
                pool.submit(evaluate, (a, c))
                for a, c in progress.track(combinations, task_id=setup_task, total=len(combinations))
            ]
            results = [
                f.result()
                for f in progress.track(as_completed(futures), task_id=finish_task, total=len(combinations))
            ]

This will occupy *all* the cores of the computer executing the `evaluate()` function. This function applies the battery of tests to each combination of a and c. We can then check the results for combinations where the chi-squared results for each test are in the acceptable ranges for the test.
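The evaluate() function isn't shown here. A simplified sketch -- using only the frequency test and the seed values described below -- might look like this:

def evaluate(params: tuple[int, int]) -> tuple[int, int, float, float]:
    """Sketch: min and max frequency-test chi-squared for one (a, c) pair.
    The full version runs the whole battery of tests."""
    a, c = params
    chi2_values = []
    for seed in range(1, 256, 11):
        fqt = FQTest()
        fqt.test(lcg_samples(LCM4(a, c), seed))
        chi2_values.append(fqt.chi2)
    return a, c, min(chi2_values), max(chi2_values)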

It's fun.

TL;DR

Using a=1621 and c=3 generates acceptable random numbers using 4 decimal digits.

Here's some output using only a subset of the tests.

(rngtest2) % python lcmfinder.py
setup ... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
finish... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
2361  1  11.46  14.22  46.64  63.76   2.30  11.33   2.16   2.16 
 981  3  10.28  15.24  52.56  66.32   2.28  11.08  10.47  10.47 
1221  3  10.19  14.12  48.72  62.08   3.03  10.08   2.59   2.59 
1621  3  11.70  14.91  47.12  69.52   2.23   9.69   0.86   0.86 

The output shows the a and c values followed by the minimum and maximum chi-squared values for each test. The chi-squared values are in pairs for the frequency test, serial pairs test, gap test, and poker test. 

Each test uses about two dozen seed values to generate piles of 3,200 samples and subject each pile of samples to a battery of tests. The seed values, BTW, are range(1, 256, 11); kind of arbitrary. Once I find the short list of candidates, I can test with more seeds. There are only 10,000 seed values, so, this can be done in finite time.

For example, a=1621, c=3, had chi-squared values between 11.70 and 14.91 for the frequency test. Well within the 7.261 to 25.0 range required. The remaining numbers show that it passed the other tests, also.

For completeness, I intend to implement the remaining half-dozen or so tests. Then I need to make sure the sphinx-produced documentation looks good. I've done this before. http://slott.itmaybeahack.com/_static/rngtest/rngdoc.html It's kind of an obsession, I think.

Looking back to my 1968 self, this would have been better than the middle-squared nonsense that left me struggling with games that behaved badly.

Tuesday, January 11, 2022

The Old Days -- ca. 2000 -- Empirical Tests of Random Numbers (Python and Chi-Square Testing)

See The Old Days -- ca. 1974 Random Numbers Before Python for some background.

We'll get to Python after reminiscing about the olden days. I want to provide some back story on why sympy has had a huge impact on ordinary hacks like myself.

What we're talking about is how we struggled with randomness before

  1. /dev/random
  2. The Mersenne Twister Pseudo-Random Number Generator (PRNG)

Pre-1997, we performed empirical tests of PRNG's to find one that was random enough for our application. Maybe we were doing random samples of data to compare statistical measures. Maybe we were writing a game. What was important was a way to create a sequence of values that passed a battery of statistical tests.

See https://link.springer.com/chapter/10.1007%2F978-1-4612-1690-2_7 for the kind of material we salivated over. 

While there are an infinite number of bad algorithms, some math reveals that the Linear Congruential Generator (LCG) is simple and effective. Each new number is based on the previous number: \(X_{n+1} = (X_n \times a + c) \bmod m\). There's a multiply and an add, modulo some big number. The actual samples are often a subset of the bits in \(X_{n}\). 

After the Mersenne Twister became widely used, we essentially stopped looking at alternative random number algorithms. Before then -- well -- things weren't so good.

Here are some classics that I tested.

  • The ACM Collected Algorithms (CALGO) number 294 is a random-number generator. This is so obsolete, I have trouble finding links to it. It was a 28-bit generator.
  • The ACM Collected Algorithms (CALGO) number 266 has code still available. See toms/266
  • The Cheney-Kincaid generator is available. See random.f plus dependencies.

These formed a kind of benchmark I used when looking at Python's built-in Mersenne Twister.

Nowadays, you can find a great list of LCG PRNG's at https://en.wikipedia.org/wiki/Linear_congruential_generator

Python Empirical Testing

One of the early questions I had was whether or not the random module in Python stacked up against these older RNG's that I was a little more familiar with.

So, I wrote a big, fancy random number testing tool in Python. 

When? Around 2000. I started this in the Python 1.6 and 2.1 era. I have files showing results from Python 2.3 (#2, Jul 30 2003). This is about when I stopped fooling around with this and moved on to trusting that Python really did work and was -- perhaps -- the best approach to working with randomly-sampled data for statistical work. 

The OO design for the test classes was Lavish Over The Top (LOTT™) OO:

  • Too Many Methods
  • Too Many Superclasses
  • No Duck Typing

We won't look at that code. It's regrettable and stems from trying to make Python into C++.

What I do want to look at is the essential Chi-Squared test methodology. This is some cool stuff.

Comparing Expected and Actual

The chi-squared metric is a way to compare actual and expected distributions. You can read about it on your own time. It's a way to establish if data is random or there's something else going on that's not random. i.e., a trend or a bias. 
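The metric itself is simple: sum the squared differences between the actual counts \(a_i\) and the expected counts \(e_i\), scaled by the expected counts:

\[\chi^2 = \sum_{i} \frac{(a_i - e_i)^2}{e_i}\]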

The empirical tests for PRNG's that Knuth defines all come with chi-squared values that bracket acceptable levels of randomness. For the purposes of writing a working set of tests the magic chi-squared values supplied by Knuth are fine. Magical. But fine. Really. Trust them.

If you make modifications, you'd use your statistics text-book. You'd open to the back where it had a Chi-Squared table. That table gave you chi-squared values for a given degree of freedom and a given probability of being random.

Or, you could look for the NIST handbook online. It has a section on chi-squared testing. See https://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm. Same drill. Degrees of freedom and probability map to a chi-squared threshold.

But.

Where do these magical Chi-Squared values come from? This gets interesting in a useless-sidebar kind of way.

Chi-Squared Values

There's a really, really terse summary of chi-squared numbers here: https://www.danielsoper.com/statcalc/formulas.aspx?id=11. This is all you need to know. It may be too terse to help you learn about it, but it's a handy reference.

We need to evaluate two functions: partial gamma and gamma. These are defined as integrals. And they're nasty levels of complexity. Nasty.

This kind of nasty:

\[\gamma (s,z)=\int _{0}^{z}t^{s-1} e^{-t} dt\].

\[\Gamma (z)=\int _{0}^{\infty} t^{z-1} e^{-t} dt\].

These are not easy things to evaluate. Back to the ACM Collected Algorithms (CALGO) to find ways to evaluate these integrals. There are algorithms in CALGO 435 and 654, expressed as Fortran, for evaluating these. That's not all, of course: we also need Stirling Numbers and Bernoulli Numbers. So there's a lot going on here.

A lot of this can be transliterated from Fortran. The resulting code is frankly quite ugly, and requires extensive test cases. Fortran with GOTO's requires some cleverness to unwind the conceptual for/while/if constructs.

OR.

Enter Sympy

In the 20+ years since I implemented my empirical PRNG tests "the hard way," sympy has come of age.

Check this out

from sympy import Sum, rf
from sympy.abc import k, s, z
from sympy.functions import exp
from sympy import oo
Sum(z**s * exp(-z) * z**k / rf(s, k+1), (k, 0, oo)).simplify()

I could use this in Jupyter Lab to display a computation for the partial gamma function.

\[z^{s}e^{-z}\sum _{k=0}^{\infty }{\dfrac {z^{k}}{s^{\overline {k+1}}}}\]

This requires a fancy Rising Factorial computation, the \(s^{\overline {k+1}}\) term. This is available in sympy as the rf(s, k+1) expression.
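For example, rf(s, 3) is \(s(s+1)(s+2)\):

from sympy import rf
from sympy.abc import s
rf(s, 3).expand()   # s**3 + 3*s**2 + 2*s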

It turns out that sympy offers lowergamma() and gamma() as first-class functions. I don't even need to work through the closed-form simplifications.

I could do this...

from sympy import integrate
from sympy.abc import t

def gammap(s: float, z: float) -> float:
    return (z**s * exp(-z) * Sum(z**k / rf(s, k+1), (k, 0, oo))).evalf()

def gamma(z: float) -> float:
    return integrate(t**(z-1) * exp(-t), (t, 0, oo)).doit()

It works well. And it provides elegant documentation. But I don't need to do any of that. I can write this, instead:

from sympy import lowergamma, gamma

def chi2P(chi2: float, degF: int) -> float:
    return lowergamma(degF/2, chi2/2) / gamma(degF/2)

This is used to compute the probability of seeing a chi-squared value. 

For the frequency test, as an example, we partition the random numbers into 16 bins. This gives us 15 degrees of freedom. We want chi-squared values between 7.2578125 and 25.0.
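As a sanity check, chi2P() turns those boundary values back into probabilities near the 5% tails (approximate values shown in the comments):

print(chi2P(7.261, 15).evalf())   # approximately 0.05
print(chi2P(25.0, 15).evalf())    # approximately 0.95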

Or.

Given a chi-squared value of 6.0, the computed probability is about 0.02 -- suspiciously low, below the 0.05 level we've decided signifies mostly random. The data is "too random"; that is to say, it's too close to the ideal distribution to be trusted.

The established practice was to look up a chi-squared value because you couldn't easily compute the probability of that value. With sympy, we can compute the probability. It's slow, so we have to optimize this carefully and not compute probabilities more frequently than necessary.

We can, for example, compute chi-squared values for a number of seeds, take the max and min of these and compute the probability of those two boundary values. This will bracket the probability that the pseudo random number generator is producing suitably random numbers.
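In code, that bracketing might look like this, where run_one() is a hypothetical stand-in for whichever test is being evaluated:

chi2_values = [run_one(seed) for seed in range(1, 256, 11)]   # hypothetical helper
low = chi2P(min(chi2_values), 15)    # probability at the low end of the bracket
high = chi2P(max(chi2_values), 15)   # probability at the high end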

This also applies to any process we're measuring with results that might vary randomly or might indicate a consistent problem that requires evaluation.

Using sympy eliminates the complexity of understanding these beautifully hand-crafted antique algorithms. It acts as a kind of super-compiler. From Math to an intermediate AST to a concrete implementation.

Tuesday, January 4, 2022

The Old Days -- ca. 1974 -- Random Numbers before Python

See "The Old Old Days -- ca. 1968" for my first exposure to an actual computer. Nothing about Python there. But. It's what motivated me to get started learning to code -- I was fascinated by games that involved randomization. Games with cards or dice.

After filling in a little background, I will get to the Python part of this. First, however, I want to compare the olden days with what we have now.

From 1969 to 1974 I had access to the high school's IBM 1620. This means programming in IBM's SPS assembler, or using the NCE Load-and-Go Fortran compiler. See https://www.cs.utexas.edu/users/EWD/transcriptions/EWD00xx/EWD37.html for a scathing review of the problems with this machine.

See http://www.bitsavers.org/pdf/ibm/1620/GC20-1603-10_1620_Catalog_of_Programs_Jan71.pdf -- page 36 has the entry for the NCE Load-and-Go Fortran compiler.

That's a quick overview of my earliest programming language. What's essential here is the NCE Fortran used 4-digit integers.

I'll repeat that for those skimming, and wondering what the Python connection is.

Four. Digit. Integers.

That's four decimal digits. Decimal digits required at least 4 hardware bits. IBM 1620 digits also had flags and signs, so, there were maybe 6 bits per digit. 24 bits of hardware used to represent just under 14 bits of useful information.
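(Four decimal digits cover \(10^4\) values, and \(\log_2 10^4 \approx 13.3\), hence "just under 14 bits.")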

My interest is in simulation and randomness. So. I have this question of how to create random sequences of numbers limited to 4-digit integers.

PRNG Algorithms

There are a number of classic Pseudo-Random Number Generator (PRNG) algorithms from the early days before Mersenne Twister took over in 1997.

We used to be super-careful to emphasize the letter P in PRNG because the numbers weren't really random. They just behaved randomish. This is contrasted with real randomness, also known as entropy. For example, the /dev/random device driver has a fair amount of entropy. I think it's comparable to a person throwing dice across a table. I think it's as random as a noise-generating diode with a sample-and-hold circuit to pluck out random values from the noise.

Pre-Mersenne-Twister -- pre-1997 -- we worried a lot about random number generation. See Knuth, Donald E. The Art of Computer Programming, Volume 2, Seminumerical Algorithms, Addison-Wesley, 1969. Section 3.3.2. covers empirical testing of random number generators. Section 3.3.1. covers the Chi-squared test for fit between actual and expected frequency distributions.

Back in the olden days, it was stylish to perform an empirical test (or ten) to confirm we really had "good" random numbers. The built-in libraries that came with your compiler or OS could not be trusted without evidence.

One of the classic (bad) PRNG's is the "Middle-Squared" method. See https://en.wikipedia.org/wiki/Middle-square_method. I learned about this in the 70's. And used it in the old NCE Fortran. 

With Four. Digit. Integers.

Did I mention that the Fortran compiler used four decimal digits for integers? That means plucking the middle two digits out of a four-digit number. How random can that be?

Not very. With only two effective digits of state, the longest possible sequence is 100 numbers -- and that's only if, by some miracle, you found a seed with the right properties.

Nowadays I can, in Python, do a quick middle-squared analysis for all 100 seed values.

This kind of thing.

def csqr4(value: int) -> list[int]:
    """The 4 decimal digit center-squared PRNG."""
    sequence = []
    while value not in sequence:
        sequence.append(value)
        value = (((value**2) // 10) % 100)
    return sequence

Which you can run and see that all of my early attempts at games and simulations were doomed. The seed values of 76, 42, and 69 provided kind of long sequences of almost random-seeming numbers. Otherwise, pfft, this was junk computer science. 50% of the seeds provide 5 or fewer numbers before repeating. 
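A quick survey over all 100 seeds takes only a couple of lines (a sketch built on the csqr4() function above):

lengths = {seed: len(csqr4(seed)) for seed in range(100)}
print(sorted(lengths.items(), key=lambda kv: kv[1], reverse=True)[:5])
print(sum(1 for n in lengths.values() if n <= 5), "seeds repeat within 5 values")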

For blackjack, a few random numbers for shuffling might be enough. For other games, the lack of randomness made the outcomes trivially predictable. 

What's funny is how far the state of the art has moved since then.

  1. Hardware now has vastly more than the 1620's 20,000 decimal digits (about 10K bytes) of storage.
  2. Software with algorithms that are really, really clever.

It's hard to overstate these two advances, particularly the second one. I'll return to the algorithm thing a lot in the next few posts.

My focus was on games and randomization. Ideally, simple stuff. But... under the hood, it's not simple. I've spent some time (not much, and not in much depth) looking at the tip of this iceberg. 

It served me as an incentive to dive just a little more deeply into a topic, like math or a programming language or a statistical tool.

Tuesday, December 28, 2021

The Old Old Days -- ca. 1968

As the olds do, I reminisce sometimes. Not often. Let me rewind the memory tapes back to 1967 or '68.

(What a dumb metaphor for folks who have never used serial storage.)

This isn't -- directly -- about Python. But it may help folks who live at the edge of programming find a project that motivates them to learn to code.

This was my first exposure to an actual computer. I think it was an old IBM 650 that our high-school had. I wasn't in high school yet, but there was an open house, and they fired this thing up.

What was the demo program?

The thing was running a version of blackjack. 

It would clatter and type something like 

 J S
 9 C  7 D
>

And wait for your input. The first line was the dealer's up card; the second line was your two cards. You could then enter your plan -- hit, stand, split, or double. Subsequent lines would play out your hand.

And yes, I recall much of the interaction to this day. (Not all the details. It was a long time ago.)

This shaped my understanding of "computing as simulation." 

It also caused me to become interested in random numbers and the idea of generating pseudo-random numbers with digital computers. I'll be posting some more thoughts on random numbers and -- I promise -- there will be Python code.

Tuesday, November 23, 2021

Processing Apple Numbers Files

Apple's freebie tools -- Pages, Numbers, Keynote, Garage Band, etc. -- are wonderful things. I really like Numbers. I'm tolerant of Pages. I've used Pages to write books and publish them to the Apple Bookstore. (Shameless Plug: Pivot to Python.)

These tools have a significant problem. Protobuf.

Some History

Once upon a time, Numbers used an XML-based format. This was back in '09, I think. At some point, version 10 of Numbers (2013?) switched to Protobuf.

I had already unwound XLSX and ODS files, which are XML. I had also unwound Numbers '09 in XML. I had a sense of what a spreadsheet needed to look like.

The switch to Protobuf also meant using Snappy compression. Back in 2014? I worked out my own version of the Snappy decompression algorithm in pure Python. I think I knew about python-snappy but didn't want the complex binary dependency. I wrote my own instead.

I found the iWorkFileFormat project. From this, and a lot of prior knowledge about the XML formats, I worked out a way to unpack the protobuf bytes into Python objects. I didn't leverage the formal protobuf definitions; instead I lazily mapped the objects to a dictionary of keys and bytes. If a field had a complex internal structure, I parsed the subset of bytes.

(I vaguely recall the Protobuf definitions are in XCode somewhere. But. I didn't want to write a protobuf compiler to make a pure-Python implementation. See the protobuf project for what I was looking for, but didn't have at the time.)

Which brings us to today's discovery.

State of the Art

Someone has taken the steps necessary to properly unpack Numbers files. See numbers-parser. This has first-class snappy and protobuf processing. It installs cleanly. It has an issue, and I may try to work on it.

I'm rewriting my own Stingray Reader with intent to dispose of my own XLSX, ODS, and Numbers processing. These can (and should) be imported separately. It's a huge simplification to stand on the shoulders of giants and write a dumb Facade over their work.

Ideally, all the various spreadsheet parsing folks would adopt some kind of standard API. This could be analogous to the database API used by SQL processing in Python. The folks with https://www.excelpython.org or http://www.python-excel.org might be a place to start, since they list a number of packages.

The bonus part? Seeing my name in the Credits for numbers-parser. That was delightful.

At some point, I need to make a coherent pitch for a common API which permits an external JSON Schema as part of extracting data from spreadsheets.

First. I need to get Stingray Reader into a more final form.

Tuesday, November 16, 2021

Reading Spreadsheets with Stingray Reader and Type Hinting

See Spreadsheets, COBOL, and schema-driven file processing, etc. for some history on this project.

Also, see Stingray-Reader for the current state of this effort.

(This started almost 20 years ago; I've been refining and revising a lot.)

Big Lesson Up Front

Python is very purely driven by the idea that everything you write is generic with respect to type. Adding type hints narrows the type domain, removing the concept of "generic".

Generally, this is good.

But not universally.

Duck Typing -- and Python's generic approach to types -- is made visible via Protocols and Generics.

An Ugly Type Hinting Problem

One type hint complication arises when writing code that really is generic. Decorators are a canonical example of functions that are generic with respect to the function being decorated. This, then, leads to kind of complicated-looking type hints.

See the mypy page on declaring decorators. The use of a TypeVar to show how a decorator's argument type matches the return type is a big help. Not all decorators follow the simple pattern, but many do.

from typing import Any, Callable, TypeVar, cast
F = TypeVar('F', bound=Callable[..., Any])
def myDecorator(function: F) -> F:
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        return function(*args, **kwargs)
    return cast(F, wrapper)
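A quick check shows the decorated function keeps its original signature as far as mypy is concerned (add() is just an illustrative function):

@myDecorator
def add(a: int, b: int) -> int:
    return a + b

total: int = add(2, 3)  # mypy still sees add as (int, int) -> int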

The Stingray Reader problem is that a number of abstractions are generic with respect to an underlying instance object.

If we're working with CSV files, the instance is a tuple[str].

If we're working with ND JSON objects, the instance is some JSON type.

If we're working with some Workbook (e.g., via xlrd, openpyxl, or pyexcel) then, the instance is defined by one of these external libraries.

If we're working with COBOL files, then the instances may be str or may be bytes. The typing.AnyStr type provides a useful generic definition.

Traditional OO Design Is The Problem

Once upon a time, in the dark days, we had exactly one design choice: inheritance. 

Actually, we had two, but so many writers get focused on "explaining" OO programming that they tend to gloss over composition. They focus on the sort-of novel concept of inheritance.

And this leads to folks arguing that inheritance shouldn't be thought of as central. Which is a kind of moot argument over what we're doing when we're writing about OO design. We have to cover both. Inheritance has more drama, so it becomes a bit more visible than composition. Indeed, inheritance creates a number of design constraints, and that's where the drama surfaces.

Any discussion of design patterns tends to be more balanced. Many patterns -- like Strategy and State -- are compositional patterns. Inheritance actually plays a relatively minor role until you reach interesting boundary cases.

Specifically.

What do you do when you have a Strategy class hierarchy and ONE of those strategies has a unique type hint for a parameter? Most of the classes use one type. One unique subclass needs a distinct type. For example, this outlier among the Strategy alternatives uses a str parameter instead of float.

Do you push that type distinction up to the top of the hierarchy? Maybe define it as edge_case: Optional[Union[str, float]] = None?

You can't simply change the parameter's value in one subclass with impunity. mypy will catch you at this, and tell you you have Liskov Substitution problems.
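A tiny, hypothetical example (not from Stingray Reader) shows the complaint:

class Strategy:
    def apply(self, value: float) -> float:
        return value

class Outlier(Strategy):
    # mypy flags this override: the argument type is incompatible with the
    # supertype, a Liskov Substitution violation.
    def apply(self, value: str) -> float:
        return float(value)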

Traditionally, we would often take this to mean that we have a larger problem here. We have a leaky abstraction. Some implementation details are surfacing in a bad way and we need more abstract classes.

It's A Protocol ("Duck Typing")

When I started rewriting Stingray Reader, I started with a fair number of abstract classes. These classes were supposed to have widely varying implementations, but common semantics. 

Applying a schema definition to a CSV file means that data values can be converted from strings to something more useful,

Applying a schema to a JSON file means doing a validation pass to be sure the loaded object meets the schema's expectations.

Applying a schema to a Workbook file is a kind of hybrid between CSV processing and JSON processing. The workbook's values will have been unpacked by the interface module. Each row will look like a list[Any] that can be subject to JSON schema validation. 

Applying a schema to COBOL means using the schema details to figure out how to unpack the bytes. This is suddenly a lot more complex than the other cases.

The concepts of inheritance and composition aren't really applicable. 

This is something even more open-ended. It's a protocol. 

We want a common interface and common semantics. But. We're not really going to leverage any common code. 

This unwinds a lot of abstract superclasses, replacing them with Protocol class definitions.

class Workbook(abc.ABC):
    @abc.abstractmethod
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Union[str, bytes, int, float]]]:  # ... and so on
        ...

The above is not universally useful. Liskov Substitution has to apply. In some cases, we don't have a tidy set of relationships. Here's the alternative

class Workbook(Protocol):
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Any]]:
        ...

This gives us the ability to define classes that adhere to the Workbook Protocol but don't have a simple, strict subclass-superclass-Liskov substitution relationship.

It's A Generic Protocol

It turns out, this isn't quite right. What's really required is a Generic[Type], not the simple Protocol.

class Workbook(Generic[Instance]):
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Instance]]:
        ...

This lets us create Workbook variants that are highly type-specific, but not narrowly constrained by inheritance rules.

This type hinting technique describes Python code that really is generic with respect to implementation type details. It allows a single Facade to contain a number of implementations.
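As a concrete illustration -- hypothetical class names, trimmed to just row_iter() -- a CSV-flavored variant binds the Instance type to str:

from typing import Generic, Iterator, TypeVar

Instance = TypeVar("Instance")

class Workbook(Generic[Instance]):
    def row_iter(self) -> Iterator[list[Instance]]:
        ...

class CSVWorkbook(Workbook[str]):
    """Rows are lists of strings, straight from the csv module."""
    def __init__(self, rows: list[list[str]]) -> None:
        self.rows = rows
    def row_iter(self) -> Iterator[list[str]]:
        return iter(self.rows)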

Tuesday, November 2, 2021

Welcome to Python: Some hints for ways to explain how truly bad the language is

As an author with many books on Python, I'm captivated by people's hot takes on why Python is so epically bad. Really Bad. Uselessly Bad. Profoundly Broken. etc.

I'll provide some hints on topics that get repeated a lot. If you really need to write a blog post about how bad Python is, please try to take a unique approach on any of these common complaints.  If you have a blog post half-written, skip to the tl;dr section to see if your ideas are truly unique.

Whitespace

Please don't waste time complaining about having to use whitespace in your code. I'm sure it's a burden on your soul to configure your editor to indent in groups of four spaces. I'm sorry it's so painful. But. Python isn't the only language with whitespace.

The shell scripting language has semantic whitespace. (It's not used for indentation, but please try cat$HOME/.bashrc (without any spaces) and tell me what happens.) Spaces matter in a lot of languages.

Even in C, some whitespace is semantic. The rest of the whitespace is for humans to read your code.

If you're *sure* that indentation is a fatal problem, please provide an example in the language of your choice where the {}'s or the case/esac was *required* because ordinary, readable indentation didn't -- somehow -- express the nesting.

The example can be the basis for a Python Enhancement Proposal (PEP) to fix the whitespace problem you've identified.

The self Instance Variable

Using self everywhere is simpler than using this in those obscure special cases where it's ambiguous. Python developers are sure that being uniformly explicit is a terrible burden on your soul. If you really feel that obscure special cases are required, consider writing a pre-processor to sort out the special cases for us.

I'm sure there's a way to inject another level of name resolution into the local v. global choices. Maybe local-self-global or self-local-global could be introduced. 

Please include examples. From this a Python Enhancement Proposal can be drafted to clarify what the improvement is.

No Formal Constants

Python doesn't waste too much time on keywords, like const, to alter the behavior of assignment. Instead, we tend to rely on tools to check our code.

Other languages have compilers to look for assignment to consts. Python has tools like flake8, pyflakes, pylint, and others, to look for this kind of thing. Conventionally, variables at the module level with ALL_CAPS names are likely to be constants. Multiple assignment statements would be a problem. Got it.

"Why can't the language check?" you ask. Python doesn't normally have a separate compile pass to pre-check the code. But. As I said above, you can use tools to create a pre-checking pass. That's what most of us do.

"But what if someone accidentally overwrites a constant?" you insist. Many folks would suggest some better documentation to explain the consequences an clarify how unit tests will fail when this happens. 

"Why should I write unit tests to be sure a constant wasn't changed?" you demand. I'm not really insisting on it. But you said you had developers who would "accidentally" overwrite a constant in an assignment statement, and you couldn't use tools like pylint to check for it. So. I suggested another choice. If you don't like that, use enums. Or write documentation and explain which global items can be changed and which can't be changed.

BTW. If you have global variables that are NOT constants, consider this a code smell. 

If you really need a mixture of constants and variables as module globals, you can use the enum module to create named attribute values of a class definition. You get constants and a namespace. It's pretty sweet.
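For example (a sketch, not anyone's production code):

from enum import IntEnum

class Limits(IntEnum):
    MAX_RETRIES = 3
    TIMEOUT_SECONDS = 30

# Limits.MAX_RETRIES = 4 raises AttributeError: enum members can't be reassigned.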

Lack of Privacy

It appears to be an article of faith that a private keyword is unconditionally required.

Looking at the history of OO languages, it looks like private was introduced with C++. Not every OO language has the same notion of private that C++ has. CLU has no concept of private. Smalltalk considers instance variables equivalent to C++ protected, not private. Eiffel has a particularly sophisticated feature exportation that doesn't involve a trivial private/public distinction.

Since many languages that aren't C++ or Java have a variety of approaches, it appears private isn't universal. The next question, then: is it necessary?

It really helps to have a concrete example of a place where a private method or attribute was absolutely essential. And it helps to do this in a way that a leading _ on the variable name -- every time it's used -- is more confusing than a keyword like private somewhere else in the code.

It also helps when the example does not involve a hypothetical Idiot Developer who (a) doesn't read the documentation and (b) doesn't understand the _leading_underscore variable, and can still manage to use the class. It's not that this developer doesn't exist; it's questionable whether or not a complex language feature is better than a little time spent on a code review. 

It helps when the example does not include the mysterious Evil Genius Developer who (a) reads the documentation, and (b) leverages the _leading_underscore variable to format one of the OS disks or something. This is far-fetched because the Evil Genius Developer had access to the Python source, and didn't need a sophisticated subclassing subterfuge. They could simply edit the code to remove the magical privacy features.

No Declarations

Python is not the only language where variables don't have type declarations. In some languages, there are implied types associated with certain kinds of names. In other languages, there are naming conventions to help a reader understand what's going on.

It's an article of faith that variable declarations are essential. C programmers will insist that a void * pointer is still helpful even though the thing to which it points is left specifically undeclared. 

C (and C++) let you cast a pointer to -- well -- anything. With resulting spectacular run-time crashes. Java has some limitations on casting. Python doesn't have casting. An object is a member of a class and that's the end of that. There's no wiggle-room to push it up or down the class hierarchy.

Since Python isn't the only language without variable declarations, it raises the question: are they necessary?

It really helps to have a concrete example of a place where a variable declaration was absolutely essential for preventing some kind of behavior that could not be prevented with a pylint check or a unit test. While I think it's impossible to find a situation that's untestable and can only be detected by careful scrutiny of the source, I welcome the counter-example that proves me wrong.

And. Please avoid this example.

for data in some_list_of_int:
    if data == 42:
        print("data is int")
for data in some_list_of_str:
    if data == "bletch":
        print("data is str")

This requires reusing a variable name. Not really a good look for code. If you have an example where there's a problem that's not fixed by better variable names, I'm looking forward to it.

This will change the world of Python type annotations. It will become an epic PEP.

Murky Call-By-Value Semantics

Python doesn't have primitive types. There are no call-by-value semantics. It's not that the semantics are confusing: they don't exist. Everything is a reference. It seems simpler to avoid the special case of a few classes of objects that don't have classes.

The complex special cases surrounding unique semantics for bytes or ints or strings or something requires an example. Since this likely involves a lot of hand-waving about performance (e.g., primitive types are faster for certain things) then benchmarking is also required. Sorry to make you do all that work, but the layer of complexity requires some justification.

No Compiler (or All Errors are Runtime Errors)

This isn't completely true. Even without a "compiler" there are a lot of ways to check for errors prior to runtime. Tools like flake8, pyflakes, pylint, and mypy can check code for a number of common problems. Unit tests are another common way to look for problems. 

Code that passes a unit test suite and crashes at runtime doesn't seem to be a language problem. It seems to be a unit testing problem.

"I prefer the compiler/IDE/something else find my errors," you say. Think of pylint as the compiler. Many Python IDE's actually do some static analysis. If you think unit tests aren't appropriate for finding and preventing problems, perhaps programming isn't your calling.

tl;dr

You may have some unique insight. If you do, please share that.

If on the other hand, you're writing about these topics, please realize that Python has been around for over 30 years. These topics are not new. For the following, please try to provide something unique:

  • Whitespace
  • The self Instance Variable
  • No Formal Constants 
  • Lack of Privacy
  • No Declarations
  • Murky Call-By-Value Semantics
  • No Compiler (or All Errors Are Run-Time Errors)

It helps to provide a distinctive spin on these problems. It helps even more when you provide a concrete example. It really helps to write up a Python Enhancement Proposal. Otherwise, we can seem dismissive of Yet Another Repetitive Rant On Whitespace (YARROW).