Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy site, and will likely be dropped five years after the last post in Jan 2023.

Tuesday, November 29, 2022

Functional Programming and Finite State Automata (FSA)

When I talk about functional programming in Python, folks like to look for places where functional programming isn't appropriate. They latch onto finite-state automata (FSA) because the "state" of an automaton doesn't seem to fit with the stateless objects used in functional programming.

This is a false dichotomy. 

It's emphatically false in Python, where we don't have a purely functional language.

(In a purely functional language, monads can help make FSAs behave properly and keep the evaluation order from being optimized away. The use of recursion to consume an iterable and make state transitions is sometimes hard to visualize. We don't have these constraints in Python.)

Let's look at a trivial kind of FSA: the parity computation. We want to know whether a given value has an odd or an even number of 1-bits. Step 1 is to expand an integer into bits.

from collections.abc import Iterable


def bits(n: int) -> Iterable[int]:
    if n < 0:
        raise ValueError(f"{n} must be >= 0")
    while n > 0:
        n, bit = divmod(n, 2)
        yield bit

This will transform a number into a sequence of bits. (They're in order from LSB to MSB, which is the reverse order of the bin() function.)

>>> list(bits(42))
[0, 1, 0, 1, 0, 1]

Given a sequence of bits, is there an odd number or an even number of 1-bits? This is the parity question. The parity FSA is often depicted as a two-state diagram:

When the parity is in the even state, a 1-bit causes a transition to the odd state. When the parity is in the odd state, a 1-bit causes a transition back to the even state. A 0-bit leaves the state unchanged.

Clearly, this demands the State design pattern, right?

An OO Implementation

Here's a detailed OO implementation using the State design pattern.

class Parity:
    def signal(self, bit: int) -> "Parity":
        ...


class EvenParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return OddParity()
        else:
            return self


class OddParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return EvenParity()
        else:
            return self


class ParityCheck:
    def __init__(self):
        self.parity = EvenParity()

    def check(self, message: Iterable[int]) -> None:
        for bit in message:
            self.parity = self.parity.signal(bit)

    @property
    def even_parity(self) -> bool:
        return isinstance(self.parity, EvenParity)

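For example, checking the bits of 42 (three 1-bits, so the parity is odd):

>>> pc = ParityCheck()
>>> pc.check(bits(42))
>>> pc.even_parity
False
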
Each of the Parity subclasses implements one of the states of the FSA. The lonely signal() method implements state-specific behavior. In this case, it's a transition to another state. In more complex examples it may involve side-effects like updating a mutable data structure to log progress.

This mapping from the states of the diagram to classes is pretty pleasant. Folks really like to implement each state as a distinct class. It somehow feels really solid.

It's important to note the loneliness of the lonely signal() method. It's all by itself in that big, empty class.

Hint. This could be a function.

It's also important to note that this kind of design is subject to odd, unpleasant design tweaks. Ideally, the transition is *only* done by the lonely signal() method. Nothing stops the unscrupulous programmer from putting state transitions in other methods. Sigh.

We'll look at more complex kinds of state transitions later. In UML state chart diagrams, states may also have entry actions and exit actions, a bit more complex behavior than we're showing in this example.

A Functional Implementation

What's the alternative? Instead of modeling state as an object with methods for behavior, we can model state as a function. Each state is a function that computes the next state.

def even(bit: int) -> ParityF:
    if bit % 2 == 1:
        return odd
    else:
        return even


def odd(bit: int) -> ParityF:
    if bit % 2 == 1:
        return even
    else:
        return odd


def parity_check(message: Iterable[int], init: ParityF | None = None) -> ParityF:
    parity = init or even
    for bit in message:
        parity = parity(bit)
    return parity


def even_parity(p: ParityF) -> bool:
    return p is even

Each state is modeled by a function.

The parity_check() function examines each bit, applies the current state function (either even() or odd()) to compute the next state, and saves this as the value of the parity variable.

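Checking the same 42 value:

>>> final = parity_check(bits(42))
>>> even_parity(final)
False
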
What's the ParityF type? This:

from typing import Protocol


class ParityF(Protocol):
    def __call__(self, bit: int) -> "ParityF":
        ...

This uses a Protocol to define a type with a recursive cycle in it. It would be more fun to use something like ParityF = Callable[[int], "ParityF"], but that's not (yet) supported.

Some Extensions

What if we need each state to have more attributes?

Python functions have attributes. Like this: even.some_value = 2; odd.some_value = 1. We can add all the attributes we require.

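For example (the name attribute here is just an arbitrary illustration, not part of the original design):

>>> even.name = "even"
>>> odd.name = "odd"
>>> parity_check(bits(42)).name
'odd'
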
What about other functions that happen on entry to a state or exit from a state? This is trickier. My preference is to use a class as a namespace that contains a number of related functions.

class Even:
    @staticmethod
    def __call__(bit: int) -> ParityF:
        if bit % 2 == 1:
            odd.enter()
            return odd
        else:
            return even

    @staticmethod
    def enter() -> None:
        print("even")

even = Even()

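The post shows only the Even half. A matching Odd class -- my sketch, not part of the original -- follows the same pattern:

class Odd:
    @staticmethod
    def __call__(bit: int) -> ParityF:
        # Sketch of the Odd state, mirroring the Even class above:
        # a 1-bit transitions back to the even state.
        if bit % 2 == 1:
            even.enter()
            return even
        else:
            return odd

    @staticmethod
    def enter() -> None:
        print("odd")

odd = Odd()
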
This seems to work out well, and keeps each state's material in a single namespace. It uses static methods to follow the same design principle as the previous example -- these are pure functions, collected into the class only to provide a namespace so we can use odd.enter() or even.enter().

TL;DR

The State design pattern isn't required to implement an FSA.

Tuesday, November 22, 2022

Testing with PySpark

This isn't about details of pySpark. This is about the philosophy of testing when working with a large, complex framework, like pySpark, pandas, numpy, or whatever. 

BLUF

Use data subsets. 

Write unit tests for the functions that process the data.

Don't test pyspark itself. Test the code you write.

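As a minimal sketch of what that looks like in practice (the keep_large_orders function, the column names, and the threshold are hypothetical; this assumes pyspark and a local Java runtime are installed):

from pyspark.sql import DataFrame, SparkSession


def keep_large_orders(df: DataFrame, threshold: int) -> DataFrame:
    # Hypothetical data-wrangling step under test.
    return df.filter(f"amount >= {threshold}")


def test_keep_large_orders_with_a_tiny_subset():
    # A handful of rows is enough to show the code works; no full training set.
    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    try:
        df = spark.createDataFrame(
            [("a", 50), ("b", 150), ("c", 250)],
            ["order_id", "amount"],
        )
        result = keep_large_orders(df, 100)
        assert {row.order_id for row in result.collect()} == {"b", "c"}
    finally:
        spark.stop()
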
Some History

I've worked with folks -- data scientists specifically -- without a deep background in software engineering.

When we said their model-building applications needed a test case, they supplied the test case they used to validate the model.

Essentially, their test script ran the entire training set. Built the model. Did extensive statistical testing on the resulting decisions made by the model. The test case asserted that the stats were "good." In fact, they recapitulated the entire model review process that had gone on in the data science community to get the model from "someone's idea" to a "central piece of the business." 

The test case ran for hours and required a huge server loaded up with GPUs. It cost a fortune to run. And. It tended to timeout the deployment pipeline.

This isn't what we mean by "test." Our mistake.

We had to explain that a unit test demonstrates the code works. That was all. It shouldn't involve the full training set of data and the full training process with all the hyperparameter tuning and hours of compute time. We don't need to revalidate your model. We want to know the code won't crash. We'd like 100% code coverage. But the objective is little more than show it won't crash when we deploy it.

It was difficult to talk them down from full training sets. They couldn't see the value in testing code in isolation. A phrase like "just enough data to prove the thing could plausibly work with real data" seemed to resonate. 

A few folks complained that a numpy array with a few rows didn't really show very much. We had to explain (more than once) that we didn't really want to know all the algorithmic and performance nuances. We mostly wanted to know it wouldn't crash when we applied it to production data. We agreed with them the test case didn't show much. We weren't qualified to revalidate the model; we were only qualified to run their training process for them. If they had done enough work to be sure we *could* run it.

(It was a bank. Software deployments have rules. An AI model-building app is still an app. It still goes through the same CI/CD pipeline as demand deposit account software changes. It's a batch job, really, just a bit more internally sophisticated than the thing that clears checks.)

Some Structure

I lean toward the following tiers of testing:

  1. Unit tests of every class and function. 100% code coverage here. I suggest using the pytest and pytest-cov packages to track test coverage and make sure every line of code has some test case. For a few particularly tricky things, covering every logic path is better than simply covering lines of code. In many cases, covering every line of code will tend to touch every logic path anyway, and it seems less burdensome.
  2. Use hypothesis for the more sensitive numeric functions. In “data wrangling” applications there may not be too many of these. In the machine learning and model application software, there may be more sophisticated math that benefits from property-based testing with hypothesis.
  3. Write larger integration tests that mimic pyspark processing, using multiple functions or classes to be sure they work together correctly, but without the added complication of actually using pySpark. This means creating mocks for some of the libraries using unittest.mock objects. (A minimal sketch appears after this list.) This is a fair bit of work, but it pays handsome dividends when debugging. For well-understood pyspark APIs, it should be easy to provide mocked results for the app components under test to use. For the less well-understood parts, the time spent building a mock will often provide useful insight into how (and why) it works the way it does. In rare cases, building the mock suggests a better design that's easier to test.
  4. Finally. Write a few overall acceptance tests that use your modules and also start and run a small pyspark instance from the command line. For this, I really like using behave, and writing the acceptance testing cases using the Gherkin language. This enforces a very formal “Given-When-Then” structure on the test scenarios, and allows you to write in English. You can share the Gherkin with users and other stakeholders to be sure they agree on what the application should do.

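For the mocked tier, the same hypothetical keep_large_orders function from the earlier sketch can be tested without starting pySpark at all:

from unittest.mock import Mock


def keep_large_orders(df, threshold):
    # Repeated from the sketch above; still a hypothetical example.
    return df.filter(f"amount >= {threshold}")


def test_keep_large_orders_calls_filter():
    # The Mock stands in for a pyspark DataFrame; we test our code, not the framework.
    df = Mock(name="orders DataFrame")
    result = keep_large_orders(df, 100)
    df.filter.assert_called_once_with("amount >= 100")
    assert result is df.filter.return_value
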
Why?

Each tier of testing builds up a larger, and more complete picture of the overall application. 

More important, we don't emphasize running pySpark and testing it. It already works. It has its own tests. We need to test the stuff we wrote, not the framework.

We need to test our code in isolation.

We need to test integrated code with mocked pySpark.

Once we're sure our code is likely to work, the next step is confirmation that the important parts do work with pySpark. For life-critical applications, the integration tests will need to touch 100% of the logic paths. For data analytics, extensive integration testing is a lot of cost for relatively little benefit.

Even for data analytics, testing is a lot of work. The alternative is hope and prayer. I suggest starting with small unit tests, and expanding from there.

Tuesday, November 15, 2022

Generators as Stacks of Operations

See https://towardsdatascience.com/building-generator-pipelines-in-python-8931535792ff 

I'm delighted by this article. 

I was shown only the first, horrible, example. I think the idea was to push back on the idea of complex generators. I fumed.

Then I read the entire article.

Now I'm fuming at someone who posted the first example -- apparently having failed to read the rest of the post.

This idea of building a stack of iterators is very, very good.

The example (using simple operations) can be misleading. A follow-on example doing something like file parsing might be helpful. But, if you go too far, you wind up writing an entire book about Functional Programming in Python.
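
For instance, a sketch of the file-parsing idea (the comma-separated name,size file format is invented for illustration):

from typing import Iterator, TextIO


def sizes(source: TextIO) -> Iterator[int]:
    # A stack of generator expressions; nothing is evaluated until the
    # final iterator is consumed.
    lines = (line.rstrip("\n") for line in source)
    non_blank = (line for line in lines if line)
    fields = (line.split(",") for line in non_blank)
    return (int(row[1]) for row in fields)

Used like this:

>>> import io
>>> doc = io.StringIO("a.txt,100\n\nb.txt,250\n")
>>> list(sizes(doc))
[100, 250]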

Tuesday, November 8, 2022

Fighting Against Over-Engineering

I've been trying to help some folks who have a "search" algorithm that's slow. 

They know it's slow -- that's pretty obvious.

They're -- unfortunately -- sure that asyncio will help. That's not an obvious conclusion. It involves no useful research. Indeed, that's a kind of magical thinking. Which leads me to consider the process of over-engineering.

The Problem

Over-engineering is essentially a technique for burning brain-calories on planning to build something instead of building something.

The distinction is "planning" vs. "doing."

Lots of folks subscribe to Methodology Magic Thinking (MMT™). The core tenet of MMT is that some methodology is good, and more methodology is better.

The classic waterfall methodology expects requirements, design, code, test, and what-not, all flowing downhill. A series of waterfalls.

The more modern agile-fall methodology expects requirements, design, code, test, and what-not all being done in tiny MVP slices.

Why is this bad?

It's bad because it falls apart when confronted with really difficult algorithm and data structure problems.

What Breaks?

The thing that breaks is the "learn about the technology" or "learn about the problem domain" work that we need to do. We like to pretend we understand the technology -- in spite of the obvious fact that we're rarely experts. We're smart. We're capable. But. We're not experts.

This applies to both the solution technology (i.e., language, persistence, framework, etc.) and the problem domain. 

When we have a process that takes *forever* to run, we've got a bad algorithm/data structure, and we don't know what to do.

We need to explore.

And.

Managers rarely permit exploration.

They have a schedule. The waterfall comes with a schedule. The agilefall sprints have timelines. And these are rarely negotiable. 

What Are Some Wrong Things to Do?

One wrong thing to do is to pick some technology and dig in hard. The asyncio module is not magical pixie dust. It doesn't make arbitrary bad code run faster. This is true in general. Picking a solution technology isn't right. Exploring alternatives -- emphasis on the plural -- is essential.

Another wrong thing to do is demand yet more process. More design docs. More preliminary analysis docs. More preliminary study. More over-engineering.

This is unhelpful. There are too many intellectual vacuums. And nature abhors a vacuum. So random ideas get sucked in. Some expertise in the language/tool/framework is required. Some expertise in the problem domain is required. Avoid assumptions.

What Should We Do?

We have to step back from the technology trap. We're not experts: we need to learn more. Which means exploring more. Which means putting time in the schedule for this.

We have to understand the problem domain better. We're not experts: we need to learn more. Which means putting time in the schedule for this.

We have to step back from the "deliverable code" trap. Each line of code is not a precious gift from some eternal god of code. It's an idea. And since the thing doesn't run well, it's provably a bad idea.

Code needs to be deleted. And rewritten. And rewritten again. And benchmarked.

Frustration

I like fixing bad code. I like helping people fix bad code.

I can't -- however -- work with folks who can't delete the old bad code.

It's unfortunate when they reach out and then block progress with a number of constraints that amount to "We can't focus on this; we can't make changes rapidly. Indeed, we're unlikely to make any changes."

The only way to learn is to become an expert in something. This takes time. Minimizing that time means working with focus and working rapidly.

Instead of working rapidly, they want magical pixie dust that makes things faster. They want me to tell them where the "Turbo Boost" button is hidden.

Sigh.