Bio and Publications

Tuesday, December 20, 2022

Christmas Book Offers

Apple Books

Pivot to Python

A Guide for professionals and skilled beginners

https://books.apple.com/us/book/pivot-to-python/id1586977675 

I've recently updated this to fix some cosmetic problems with title pages, the table of contents and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program and want to pivot to programming in Python. Quickly.

But wait, there's more. 

Unlearning SQL

When your only tool is a hammer, every problem looks like a nail

https://books.apple.com/us/book/unlearning-sql/id6443164060

Many folks know some Python, but struggle with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems effectively.

Google Play

Also available on Google Play. Here's Unlearning SQL:

https://play.google.com/store/books/details?id=23WAEAAAQBAJ

I've made a clone of Pivot to Python, also.

https://play.google.com/store/books/details/Steven_F_Lott_Unlearning_SQL?id=23WAEAAAQBAJ&hl=en_US&gl=US

Both books are (intentionally) short to help experts make rapid progress.

Tuesday, December 13, 2022

On Algorithm Design

Some background: FAERIE DUST™, Obstinate Idiocy, Obstinate Idiocy, Expanded, and even Permutations, Combinations and Frustrations. I want to set up algorithm design as the diametric opposite of Obstinate Stupidity. To do that, let's look at Obstinate Stupidity.

The theme? 

We did something wrong, and we don't want to fix it.

I emphasize this because it takes many forms. Another common variant is "We can't afford to continue the way we are, but we can't afford the time to fix it, either." Sometimes, "Management wants this fixed, but we don't have any budget." You know how it is.

The current go-round is someone who has an algorithm of intensely stupid (and largely irrelevant) complexity. See My algorithm performs badly, do I need asyncio?

The situation is touchy. They have pre-reasoned an answer -- asyncio -- and they're looking for (a) confirmation that they're obviously right and (b) help rewriting something needlessly complex to somehow be faster even when it's compute-bound. Specifically, they want Faerie Dust.

Frivolous Complexity

How do I know it has needless, frivolous complexity?

Here are two symptoms.

  1. The problem has a lot of context. In this case, there's a hierarchy. The hierarchy may seem irrelevant, but it has a mind-numbingly complex back-story that they can't seem to ignore or abstract out of the essential problem. There's a (large) number of details that don't really explain what the hierarchy means or why it has to be preserved, but somehow make it essential.
  2. The problem can only be described by repeating the legacy algorithm. 

Let's dwell on this second symptom for a moment. We have two competing issues:

  • The legacy algorithm is too slow. AND,
  • There's no other way to describe the problem.

This should make it clear they are looking at asyncio as a kind of Faerie Dust that will magically make the bad algorithm good. Without fixing the bad algorithm.

I want to emphasize the existence of details which can neither be explained nor removed. The hierarchy must be there simply because it must be there. Bizarre complications to walk the hierarchy are, therefore, essential even if no one can explain them.

Algorithm Design

To actually improve the processing they need a new algorithm.

I can't emphasize this enough: they need a new algorithm. (This often means a new data structure.)

"Tuning" in any form is nothing more than nibbling around the edges to make a bad algorithm run 15% faster.

Rewriting may replace \(\textbf{O}(2^n)\) with \(\textbf{O}(n \log n)\). This would be dramatically better. From seconds to milliseconds. You know, 1,000 times faster.

There's a disciplined approach to this. Here are the steps.

  1. Write the post-condition for the processing as a whole.
  2. Write code that achieves the post-condition. (This may involve decomposing the big problem into sub-problems, each of which is approached by the same two-step process.)

The intensely painful part of this is creating the post-condition.

I suggested they "write an assert statement that must be true when the algorithm has completed, and computed the right answer."

Hahahah.

What an idiot I was.

They didn't know how to write an assert statement. And at this point, they stopped. Brick Wall. Dead in the water. Cannot proceed. Done. Failed.

The assert statement has become the end-of-the-line. They can't (or won't) do that. And they won't ask about it.

Me: "Do you have a question?"

Them: "I have to think before I can even begin to ask a question."

Me: "How about think less and ask more. Do you have trouble writing the word assert? What's stopping you?"

Them: [silence]

Okay.

Post-Conditions

The post-condition is true when you're done. Let's look at my favorite, M must be the maximum of A and B.

\[M \geq A \textbf{ and } M \geq B\]

This becomes an assert statement through (what seemed obvious to me, but boy was I wrong) the following kind of translation.

assert M >= A and M >= B, f"Algorithm Failed {M=} {A=} {B=}"

Again, I acknowledge I was wrong to think creating an assert statement from a post condition was in any way clear. It's absolutely bewilderingly impossible.

It's also important to note that the above condition is incomplete. The value \(M = A+B\) will also satisfy the condition. We need to test our test cases to be sure they really do what we want.

We really need to be more complete on what the domain of values for \(M\) is.

\[(M = A \textbf{ or } M = B) \textbf{ and } M \geq A \textbf{ and } M \geq B\]

We could rewrite this slightly to be

\[M \in \{A, B \} \textbf{ and } M \geq A \textbf{ and } M \geq B\]

This version directly suggests a potential set comprehension to compute the result:

M = {m for m in {A, B} if m >= A and m >= B}.pop()

This is the advantage of writing post-conditions. They often map to code.

You can even try it as pseudo-SQL if that helps you get past the assert statement.

SELECT M FROM (TABLE INT(X); A; B) WHERE M >= A AND M >= B

I made up a TABLE INT(X); A; B to describe a two-row table with candidate solutions. I'm sure SQL folks have other sorts of "interim table" constructs they like.

The point is to write down the final condition. 

I'll repeat that because the folks I was trying to work with refused to understand the assert statement.

Write down the final condition.

The Current Problem's Post-Condition

The problem at hand seems to involve a result set, \(R\), pulled from nodes of some hierarchy, \(H\), \(R \subseteq H\). Each element of the hierarchy, \(h \in H\), has a set of strings, \(s(h)\). It appears that a target string, \(t\), must be a member of \(s(r)\) for each \(r \in R\). I think.

Note that the hierarchy is nothing more than a collection of identified collections of strings. The parent-childness doesn't seem to matter for the search algorithm. Within the result set, there's some importance to the tier of the hierarchy, \(t(h)\), and a node from tier 1 means all others are ignored or something. Can't be sure. (The endless backstory on the hierarchy was little more than a review of the algorithm to query it.)

If any of this is true, it would be a fairly straightforward map() or filter() that could be parallelized with dask or concurrent.futures.
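
To make that concrete, here's a minimal sketch, assuming the hierarchy can be flattened into a mapping from node id to its set of strings. Every name here (hierarchy, TARGET, matches()) is mine, invented for illustration, not taken from their code.

from concurrent.futures import ProcessPoolExecutor

TARGET = "b"
hierarchy = {"n1": {"a", "b"}, "n2": {"b", "c"}, "n3": {"x"}}

def matches(strings: set[str]) -> bool:
    # True when the node's string set contains the target string.
    return TARGET in strings

if __name__ == "__main__":
    # Sequential: a plain filter, written as a set comprehension.
    result = {node for node, strings in hierarchy.items() if matches(strings)}

    # Parallel: the same predicate fanned out over worker processes.
    with ProcessPoolExecutor() as pool:
        keep = pool.map(matches, hierarchy.values())
        result_parallel = {node for node, k in zip(hierarchy, keep) if k}

    assert result == result_parallel == {"n1", "n2"}

The sequential comprehension and the process-pool version share one tiny predicate; the parallelism is a one-line change once the post-condition is nailed down.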

But we can't know if this really is the post-condition until someone in a position to know writes the post-condition.

Things To Do

The post-condition defines the results of test cases. The assert statement becomes part of the pytest test cases, in a kind of direct copy-and-paste process that shifts it from design aid to test result condition.

Currently, the algorithm they have seems to have no test cases. They can't write a condition to describe correct answers, which suggests they actually don't know what's correct.

If they wrote test cases, they might be able to visualize an assert statement that confirms the test worked. Might. It appears to be asking a lot to write test cases for the legacy algorithm.

Indeed, if they wrote a conditional expression that described the results of any working example, they'd have taken giant steps toward the necessary assert statement. But that's asking a lot, it appears.

And Then What?

Once you have a target condition, you can then design code to satisfy some (or all) of the target condition. Dijkstra's A Discipline of Programming has a thorough description of the "weakest precondition" operator. It works like this:

  1. Imagine a statement that might satisfy some or all of your post-condition.
  2. Substitute the effect of the statement into the post-condition. 
  3. What's left is the weakest pre-condition for that statement to work. It's often the post-condition for a statement that must precede the statement you wrote.

You write the program from the desired post-condition moving forward until you get a weakest pre-condition of True. Back to front. From goal to initialization.

Post-condition gives you statements. Statements have pre-conditions. You iterate, writing conditions, statements, and more conditions.

(You can also spot useless code because the pre-condition matches the post-condition.)

For the silly "maximum" problem?

Try M := A as a statement. This only works if A >= B. That's the pre-condition that is derived from substituting M = A into the post-condition.

Try M := B as a statement. This only works if B >= A. That's the pre-condition that is derived from substituting M = B into the post-condition.

These two pre-conditions describe an if-elif statement. 
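
Here's where that lands for the maximum example: a minimal sketch of the derived if-elif, with the post-condition kept as an assert.

def maximum(A: int, B: int) -> int:
    if A >= B:      # weakest pre-condition for M := A
        M = A
    elif B >= A:    # weakest pre-condition for M := B
        M = B
    assert M in {A, B} and M >= A and M >= B, f"Algorithm Failed {M=} {A=} {B=}"
    return M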

Note that this feels weirdly arbitrary and exploratory. It's a kind of empiricism where we try statements and see if they're helpful. There don't need to be any constraints. The post-condition is all that's required to explore the space of statements that might work, or at least might help.

Of course, we're not stupid. And we're lazy. We don't search the infinite space of statements. We can often imagine the statements without a lot of complex work. The formal weakest pre-condition process is necessary to confirm our intuition. Or to assert that something is free of astonishing side-effects.

It all depends on one thing: a clear, formal statement of the post-condition.

Since I made the mistake of describing the post-condition as a line of code, we've hit some kind of brick wall related to "I won't write code." Or "I don't want to be seen writing code." Or "I don't want you to critique my code."

Dunno.

Tuesday, December 6, 2022

My algorithm performs badly, do I need asyncio?

Real Question (somewhat abbreviated): "My algorithm performs badly, do I need asyncio?"

Short answer: No.

Long answer: Sigh. No. Do you need a slap upside the head?

Here's how it plays out:

Q: "We figured that if we 'parallelize' it, then we can apply multiple cores, and it will run 4x as fast."

Me: "What kind of I/O are you doing?"

Q: "None, really. It's compute-intensive."

Me: "Async is for I/O. A function can be computing while other functions are waiting for I/O to complete."

Q: "Right. We can have lots of them, so they each get a core."

Me: "Listen, please. A function can be computing. That's "A". Singular. One. Take a step back from the asyncio package. What are you trying to do?"

Q: "Make things faster."

Me: "Take a breath. Make what faster?"

Q: "A slow algorithm."

Me: 

Q: "Do you want to know what we're trying do?"

Me: 

Q: "First, we query the database to get categories. Then we query the database to get details for the categories. Then we query the database to organize the categories into a hierarchy. Except for certain categories which are special. So we have if-statements to handle the special cases."

Me: "That's I/O intensive."

Q: "That's not the part that's slow."

Me: 

Q: "Context is important. I feel the need to describe all of the background."

Me: "That's trivia. It's as important as your mother's maiden name. What's the problem?"

Q: "The problem is we don't know how to use asyncio to use multiple cores."

Me: "Do you know how to divide by zero?"

Q: "No. It's absurd."

Me: "We already talked about asyncio for compute-intensive processing. Same level of absurd as dividing by zero. What are you trying to do?"

Q: "We have some for loops that compute a result slowly. We want to parallelize them."

Me: "Every for statement that computes a collection is a generator expression. Every generator expression can be made into a list, set, or dictionary comprehension. Start there."

Q: "But what if the for statement has a super-complex body with lots of conditions?"

Me: "Then you might have to take a step back and redesign the algorithm. What does it do?"

Q: <code> "See all these for statements and if-statements?"

Me: "What does it do? What's the final condition?"

Q: "A set of valid answers."

Me: "Define valid."

Q: "What do you mean? 'Define valid?' It's a set that's valid!"

Me: "Write a condition that defines whether or not a result set is valid. Don't hand-wave, write the condition."

Q: "That's impossible. The algorithm is too complex."

Me: "How do you test this process? How do you create test data? How do you know an answer it produces is correct?"

Q:

Me: "That's the fundamental problem. You need to have a well-defined post-condition. Logic. An assert statement that defines all correct answers. From that you can work backwards into an algorithm. You may not need parallelism; you may simply have a wrong data structure somewhere in <code>."

Q: "Can you point out the wrong data structure?"

Me: 

Q: "What? Why won't you? You read the code, you can point out the problems."

Me: 

Q: "Do I have to do all the work?"

Me:

Tuesday, November 29, 2022

Functional Programming and Finite State Automata (FSA)

When I talk about functional programming in Python, folks like to look for places where functional programming isn't appropriate. They latch onto finite-state automata (FSA) because the "state" of an automaton doesn't seem to fit with the stateless objects used in functional programming.

This is a false dichotomy. 

It's emphatically false in Python, where we don't have a purely functional language.

(In a purely functional language, monads can help make FSA's behave properly and avoid optimization. The use of recursion to consume an iterable and make state transitions is sometimes hard to visualize. We don't have these constraints.)

Let's look at a trivial kind of FSA: the parity computation. We want to know whether a given value has an odd or an even number of 1-bits. Step 1 is to expand an integer into bits.

from collections.abc import Iterable

def bits(n: int) -> Iterable[int]:
    if n < 0:
        raise ValueError(f"{n} must be >= 0")
    while n > 0:
        n, bit = divmod(n, 2)
        yield bit

This will transform a number into a sequence of bits. (They're in order from LSB to MSB, which is the reverse order of the bin() function.)

>>> list(bits(42))
[0, 1, 0, 1, 0, 1]

Given a sequence of bits, is there an odd number or an even number of 1-bits? This is the parity question. The parity FSA is often depicted as a two-state diagram:

When the parity is in the even state, a 1-bit transitions to the odd state. When the parity is in the odd state, a 1-bit transitions to the even state. A 0-bit leaves either state unchanged.

Clearly, this demands the State design pattern, right?

An OO Implementation

Here's a detailed OO implementation using the State design pattern.

 
from collections.abc import Iterable


class Parity:
    def signal(self, bit: int) -> "Parity":
        ...


class EvenParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return OddParity()
        else:
            return self


class OddParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return EvenParity()
        else:
            return self


class ParityCheck:
    def __init__(self):
        self.parity = EvenParity()

    def check(self, message: Iterable[int]) -> None:
        for bit in message:
            self.parity = self.parity.signal(bit)

    @property
    def even_parity(self) -> bool:
        return isinstance(self.parity, EvenParity)

Each of the Parity subclasses implements one of the states of the FSA. The lonely signal() method implements state-specific behavior. In this case, it's a transition to another state. In more complex examples it may involve side-effects like updating a mutable data structure to log progress.
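
A quick usage sketch, assuming the bits() generator from earlier in this post: 42 has three 1-bits, so the parity comes out odd.

>>> pc = ParityCheck()
>>> pc.check(bits(42))
>>> pc.even_parity
False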

This mapping from state to diagram to class is pretty pleasant. Folks really like to implement each state as a distinct class. It somehow feels really solid.

It's important to note the loneliness of the lonely signal() method. It's all by itself in that big, empty class.

Hint. This could be a function.

It's also important to note that this kind of design is subject to odd, unpleasant design tweaks. Ideally, the transition is *only* done by the lonely signal() method. Nothing stops the unscrupulous programmer from putting state transitions in other methods. Sigh.

We'll look at more complex kinds of state transitions later. In UML state chart diagrams, states may also have entry actions and exit actions, a bit more complex behavior than what we're showing in this example.

A Functional Implementation

What's the alternative? Instead of modeling state as an object with methods for behavior, we can model state as a function. The state is a function that transitions to the next state.

def even(bit: int) -> ParityF:
    if bit % 2 == 1:
        return odd
    else:
        return even


def odd(bit: int) -> ParityF:
    if bit % 2 == 1:
        return even
    else:
        return odd


def parity_check(message: Iterable[int], init: ParityF | None = None) -> ParityF:
    parity = init or even
    for bit in message:
        parity = parity(bit)
    return parity


def even_parity(p: ParityF) -> bool:
    return p is even

Each state is modeled by a function.

The parity_check() function examines each bit and applies the current state function (either even() or odd()) to compute the next state, saving this as the value of the parity variable.

What's the ParityF type? This:

from typing import Protocol


class ParityF(Protocol):
    def __call__(self, bit: int) -> "ParityF":
        ...

This uses a Protocol to define a type with a recursive cycle in it. It would be more fun to use something like ParityF = Callable[[int], "ParityF"], but that's not (yet) supported.
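
A quick check of the functional version, again assuming bits() from above is in the same module (42 has three 1-bits, 3 has two):

>>> even_parity(parity_check(bits(42)))
False
>>> even_parity(parity_check(bits(3)))
True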

Some Extensions

What if we need each state to have more attributes?

Python functions have attributes. Like this: even.some_value = 2; odd.some_value = 1. We can add all the attributes we require.

What about other functions that happen on entry to a state or exit from a state? This is trickier. My preference is to use a class as a namespace that contains a number of related functions.

class Even:
    @staticmethod
    def __call__(bit: int) -> ParityF:
        if bit % 2 == 1:
            odd.enter()
            return odd
        else:
            return even

    @staticmethod
    def enter() -> None:
        print("even")

even = Even()

# The Odd class (not shown) is the mirror image, ending with odd = Odd().

This seems to work out well, and keeps all the state-specific material in a single namespace. It uses static methods to follow the same design principle as the previous example -- these are pure functions, collected into the class only to provide a namespace, so we can use odd.enter() or even.enter().

TL;DR

The State design pattern isn't required to implement an FSA.

Tuesday, November 22, 2022

Testing with PySpark

This isn't about details of pySpark. This is about the philosophy of testing when working with a large, complex framework, like pySpark, pandas, numpy, or whatever. 

BLUF

Use data subsets. 

Write unit tests for the functions that process the data.

Don't test pyspark itself. Test the code you write.

Some History

I've worked with folks -- data scientists specifically -- without a deep background in software engineering.

When we said their model-building applications needed a test case, they supplied the test case they used to validate the model.

Essentially, their test script ran the entire training set. Built the model. Did extensive statistical testing on the resulting decisions made by the model. The test case asserted that the stats were "good." In fact, they recapitulated the entire model review process that had gone on in the data science community to get the model from "someone's idea" to a "central piece of the business." 

The test case ran for hours and required a huge server loaded up with GPUs. It cost a fortune to run. And. It tended to timeout the deployment pipeline.

This isn't what we mean by "test." Our mistake.

We had to explain that a unit test demonstrates the code works. That was all. It shouldn't involve the full training set of data and the full training process with all the hyperparameter tuning and hours of compute time. We don't need to revalidate your model. We want to know the code won't crash. We'd like 100% code coverage. But the objective is little more than show it won't crash when we deploy it.

It was difficult to talk them down from full training sets. They couldn't see the value in testing code in isolation. A phrase like "just enough data to prove the thing could plausibly work with real data" seemed to resonate. 

A few folks complained that a numpy array with a few rows didn't really show very much. We had to explain (more than once) that we didn't really want to know all the algorithmic and performance nuances. We mostly wanted to know it wouldn't crash when we applied it to production data. We agreed with them the test case didn't show much. We weren't qualified to revalidate the model; we were only qualified to run their training process for them. If they had done enough work to be sure we *could* run it.

(It was a bank. Software deployments have rules. An AI model-building app is still an app. It still goes through the same CI/CD pipeline as demand deposit account software changes. It's a batch job, really, just a bit more internally sophisticated than the thing that clears checks.)

Some Structure

I lean toward the following tiers of testing:

  1. Unit tests of every class and function. 100% code coverage here. I suggest using the pytest and pytest-cov packages to track testing and make sure every line of code has some test case. For a few particularly tricky things, covering every logic path is better than simply covering lines of code. In some cases, covering every line of code will tend to touch every logic path anyway, and it seems less burdensome.
  2. Use hypothesis for the more sensitive numeric functions. In “data wrangling” applications there may not be too many of these. In the machine learning and model application software, there may be more sophisticated math that benefits from hypothesis testing.
  3. Write larger integration tests that mimic pyspark processing, using multiple functions or classes to be sure they work together correctly, but without the added complication of actually using pySpark. This means creating mocks for some of the libraries using unittest.mock objects. This is a fair bit of work, but it pays handsome dividends when debugging. For well-understood pyspark APIs, it should be easy to provide mocked results for the app components under test to use. For the less well-understood parts, the time spent building a mock will often provide useful insight into how (and why) it works the way it does. In rare cases, building the mock suggests a better design that's easier to test.
  4. Finally. Write a few overall acceptance tests that use your modules and also start and run a small pyspark instance from the command line. For this, I really like using behave, and writing the acceptance testing cases using the Gherkin language. This enforces a very formal “Given-When-Then” structure on the test scenarios, and allows you to write in English. You can share the Gherkin with users and other stakeholders to be sure they agree on what the application should do.

Why?

Each tier of testing builds up a larger, and more complete picture of the overall application. 

More important, we don't emphasize running pySpark and testing it. It already works. It has its own tests. We need to test the stuff we wrote, not the framework.

We need to test our code in isolation.

We need to test integrated code with mocked pySpark.
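
Here's a minimal sketch of that middle tier. The keep_adults() function and the "age >= 18" rule are made up for illustration; the only pySpark API assumed is DataFrame.filter() accepting a SQL-style string expression, and a mock stands in for the real DataFrame.

from unittest.mock import Mock

def keep_adults(df):
    """Our code under test: delegates row selection to the DataFrame."""
    return df.filter("age >= 18")

def test_keep_adults_delegates_filter():
    mock_df = Mock(name="DataFrame")
    result = keep_adults(mock_df)
    mock_df.filter.assert_called_once_with("age >= 18")   # our code asked for the filter
    assert result is mock_df.filter.return_value          # and returned what pySpark gave back

No cluster, no real data, milliseconds to run.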

Once we're sure our code is likely to work, the next step is confirmation that the important parts do work with pySpark. For life-critical applications, the integration tests will need to touch 100% of the logic paths. For data analytics, extensive integration testing is a lot of cost for relatively little benefit.

Even for data analytics, testing is a lot of work. The alternative is hope and prayer. I suggest starting with small unit tests, and expanding from there.

Tuesday, November 15, 2022

Generators as Stacks of Operations

See https://towardsdatascience.com/building-generator-pipelines-in-python-8931535792ff 

I'm delighted by this article. 

I was shown only the first, horrible, example. I think the idea was to push back on the idea of complex generators. I fumed.

Then I read the entire article.

Now I'm fuming at someone who posted the first example -- apparently having failed to read the rest of the post.

This idea of building a stack of iterators is very, very good.

The example (using simple operations) can be misleading. A follow-on example doing something like file parsing might be helpful. But, if you go too far, you wind up writing an entire book about Functional Programming in Python.
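
Here's a tiny sketch of the kind of follow-on I have in mind (my own made-up data, not the article's): each generator in the stack does one small step of parsing "key=value" lines.

lines = ["a=1", "b=2", "ignore me", "c=3"]
stripped = (line.strip() for line in lines)
pairs = (line.partition("=") for line in stripped if "=" in line)
records = {key: int(value) for key, _, value in pairs}
print(records)  # {'a': 1, 'b': 2, 'c': 3}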

Tuesday, November 8, 2022

Fighting Against Over-Engineering

I've been trying to help some folks who have a "search" algorithm that's slow. 

They know it's slow -- that's pretty obvious.

They're -- unfortunately -- sure that asyncio will help. That's not an obvious conclusion. It involves no useful research. Indeed, that's a kind of magical thinking. Which leads me to consider the process of over-engineering.

The Problem

Over-engineering is essentially a technique for burning brain-calories on planning to build something instead of building something.

The distinction is "planning" vs. "doing."

Lots of folks subscribe to Methodology Magic Thinking (MMT™). The core tenet of MMT is that some  methodology is good, and more methodology is better. 

The classic waterfall methodology expects requirements, design, code, test, and what-not, all flowing downhill. A series of waterfalls.

The more modern agile-fall methodology expects requirements, design, code, test, and what-not all being done in tiny MVP slices.

Why is this bad?

It's bad because it falls apart when confronted with really difficult algorithm and data structure problems.

What Breaks?

The thing that breaks is the "learn about the technology" or "learn about the problem domain" things that we need to do. We like to pretend we understand the technology -- in spite of the obvious information that we're rarely experts. We're smart. We're capable. But. We're not experts.

This applies to both the solution technology (i.e., language, persistence, framework, etc.) and the problem domain. 

When we have a process that takes *forever* to run, we've got a bad algorithm/data structure, and we don't know what to do.

We need to explore.

And.

Managers rarely permit exploration.

They have a schedule. The waterfall comes with a schedule. The agile-fall sprints have timelines. And these are rarely negotiable.

What Are Some Wrong Things to Do?

One wrong thing to do is to pick some technology and dig in hard. The asyncio module is not magical pixie dust. It doesn't make arbitrary bad code run faster. This is true in general. Picking a solution technology isn't right. Exploring alternatives -- emphasis on the plural -- is essential.

Another wrong thing to do is demand yet more process. More design docs. More preliminary analysis docs. More preliminary study. More over-engineering.

This is unhelpful. There are too many intellectual vacuums. And nature abhors a vacuum. So random ideas get sucked in. Some expertise in the language/tool/framework is required. Some expertise in the problem domain is required. Avoid assumptions.

What Should We Do?

We have to step back from the technology trap. We're not experts: we need to learn more. Which means exploring more. Which means putting time in the schedule for this.

We have to understand the problem domain better. We're not experts: we need to learn more. Which means putting time in the schedule for this.

We have to step back from the "deliverable code" trap. Each line of code is not a precious gift from some eternal god of code. It's an idea. And since the thing doesn't run well, it's provably a bad idea.

Code needs to be deleted. And rewritten. And rewritten again. And benchmarked.

Frustration

I like fixing bad code. I like helping people fix bad code.

I can't -- however -- work with folks who can't delete the old bad code.

It's unfortunate when they reach out and then block progress with a number of constraints that amount to "We can't focus on this; we can't make changes rapidly. Indeed, we're unlikely to make any changes."

The only way to learn is to become an expert in something. This takes time. Minimizing the time means working with focus and working rapidly.

Instead of working rapidly, they want magical pixie dust that makes things faster. They want me to tell them where the "Turbo Boost" button is hidden.

Sigh.

Tuesday, October 25, 2022

Some Functional Programming in Python material

This is bonus content for the forthcoming Functional Python Programming 3rd edition book. It didn't make it into the book because -- well -- it was just too much of the wrong kind of detail.

See this "Tough TCO" document for some thoughts on Tail-Call Optimization that can be particularly difficult. This isn't terribly original, but I think it's helpful for folks working through more complex problems from a functional perspective.

"Why a PDF?" I've been working with with LaTeX, and the switching to other ways of editing and presenting code seemed like too much work.

Tuesday, August 23, 2022

Books! Books! More Channels!

I started with the Apple Books platform because it's an easy default for me. 

Pivot to Python

A Guide for professionals and skilled beginners

https://books.apple.com/us/book/pivot-to-python/id1586977675 

I've recently updated this to fix some cosmetic problems with title pages, the table of contents and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program and want to pivot to programming in Python. Quickly.

But wait, there's more. 

Unlearning SQL

When your only tool is a hammer, every problem looks like a nail

https://books.apple.com/us/book/unlearning-sql/id6443164060

Many folks know some Python, but struggle with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems effectively.

Google Play

Now, I'm duplicating the books on Google Play. Here's Unlearning SQL:

https://play.google.com/store/books/details?id=23WAEAAAQBAJ

I've made a clone of Pivot to Python, also.

https://play.google.com/store/books/details/Steven_F_Lott_Unlearning_SQL?id=23WAEAAAQBAJ&hl=en_US&gl=US

Both books are (intentionally) short to help experts make rapid progress.

Tuesday, August 16, 2022

Enterprise Python -- Some initial thoughts

In the long run, I think there's a small book here. See 8 reasons Python will rule the enterprise — and 8 reasons it won’t | InfoWorld. The conclusion, "Teams need to migrate slowly into the future, and adopting more Python is a way to do that," seems to be sensible. Some of the cautionary tales along the way, however, don't make as much sense.

TL;DR. There are no reasons to avoid Python. Indeed, the 8 points suggest that Python is perhaps a smart decision. 

I want to focus on the negatives part of this because some of them are wrong. I think there's a "technology hegemony" viewpoint where everything in an enterprise must be exactly the same. This tends to prevent creative solutions to problems and mires an enterprise in fighting problems that are inherent in bad technology choices. Also, I think there's an "enterprises are run by idiots" subtext.

1. Popularity. Really this is about having a polyglot software portfolio. The reasoning appears to be that a polyglot software portfolio is impossible to maintain because (1) no one can learn an old language, and (2) software will never be rewritten from an obsolete language to a modern language. If these are both true, it appears the organization is full of idiots. The notion that a polyglot tech stack must devolve into chaos seems to ignore the endless chain of management decisions that are required to create chaos. Leaving obsolete tech in place isn't a consequence of the tech, or the tech's lack of compatibility; it's a management decision to enshrine bad ideas, frozen in amber, forever.

2. Scripting Languages. Specifically, the spreadsheet is already the de facto scripting language of choice, and nothing can be done about it. Nothing. No one can learn to use Jupyter Lab to do business analytics. If this is true, it appears that the organization is full of idiots. Python will not replace all spreadsheets. A pandas data frame will replace an opaque macro-filled nightmare with code that can be unit tested. Imagine unit testing a spreadsheet. Consider the possibilities of expanding business analysis work to include a few test cases; not 100% code coverage, but a few test cases to confirm the analytical process was implemented consistently. 

3. Dynamic Languages. Specifically, dynamic languages are useless for reliable software because there's no comprehensive type checking across some interfaces. Which begs the question of why there are software failures in statically typed languages. More importantly, complaining about dynamic languages raises important questions about integration and acceptance testing procedures in an organization in general. All languages require extensive test suites for all developed code. All languages benefit from static analysis. Sometimes the compiler does this, sometimes external tools do the linting. Sometimes folks use both the compiler and linters to check types. If we are sure dynamic languages will break, are we equally sure statically typed languages cannot break? Or, do we take steps to prevent problems? I think we tend to take a lot of steps to make sure software works.

4. Tooling. I can't figure this point out. But somehow C++ or Java have better tools for managing large source code bases. There are no details behind this claim, so I'm left to guess. I would suggest that the "incremental recompilation" problem of large C++ (and Java) code bases is its own nightmare. Folks go to great lengths to architect C++ so that an implementation change does not require recompilation of everything. While this could be seen as "evolved to handle the jobs that enterprise coders need done", I submit that there's a deeper problem here, and stepping away from the compiler is a better solution than complex architectures. See Lakos Large-Scale C++ Software Design for some architectural features that don't solve any enterprise problem, but solve the scaling problem of big C++ applications. This bumps into the micro-services/monolith discussion, and the question of carefully testing each interface. None of which has anything to do with Python specifically.

5. Machine Learning and Data Science. These are fads, apparently. I'm not sure I can respond to this, since it has little to do with Python. Of course, Python has one of the most complete data science toolsets, so perhaps avoiding data science makes it easier to avoid Python.

6. Rapid Growth. The growth of Python is rapid, and there's no promise of endless backwards compatibility. This is a consequence of active development and learning. I think it's better than the endless backwards compatibility that leads to JavaScript's list of WATs. Or the endless confusion between java.util.Date and Joda-Time. The idea that no one will ever look at the Enterprise code base for Common Vulnerabilities and Exposures seems to indicate a lack of concern for reliability or security. Since the entire compiled code base has to be checked for vulnerabilities, why not also check the Python code base for ongoing upgrades and changes and enhancements? Is code really written once and never looked at again? If so, it sounds like an organization run by idiots.

7. Python Shipped With Some OS's. There's a long story of woe that stems from relying on the OS-supplied Python. The lesson learned here is Never Rely on the OS Python; Always Install Your Own. This doesn't seem like a reason to avoid Python in the enterprise. It seems like an important lesson learned for all software that's not part of the OS: always install your own. I've been using Miniconda to spin up Python environments and absolutely love it.

8. Open Source Software. Agreed. Nothing to do with Python specifically. Everything to do with tech stack and architecture. The question of using Open Source in the first place doesn't seem difficult. It's a well-established way to reduce start-up costs for software development.

Of the eight points, two seem to be completely generic issues. Yes, Machine Learning is new, and yes, choices must be made. The two questions around scripting and dynamic languages seem specious; all programming requires careful design and testing. The Python shipped with the OS is a non-concern; the lesson learned is clear.

We have three remaining points:

  1. Polyglot Portfolio (from pursuing popularity). This is already the case in most enterprises, and needs to be managed through aggressive retirement of old software. It may have taken decades to build that old app, but it often takes months to rewrite it in a new language. The legacy app provides acceptance test cases; it's often filled with the cruft and detritus of old decisions.
  2. Tooling. Agreed. Tooling is important. Not sure that Java or C++ have a real edge here, but, tooling is important.
  3. Growth and Change. Python's rapid evolution requires active management. An enterprise must adopt a YBYO (You Build it You Own it) attitude so that every level of management is aware of the components they're responsible for. CVE's are checked, Python PEP's are checked. Tools like tox or nox are used to build (and rebuild) virtual environments.

If these seem like a high bar, perhaps there are deeper issues in the enterprise. If adding Yet Another Language is a problem, then it's time to start retiring some languages. If Adding Another Tool is a problem, it's worth examining the existing tool chain to see why it's such a burden. If the idea of change is terrifying, perhaps the ongoing change is not being watched carefully enough.

Tuesday, August 9, 2022

Tragedy Averted

I almost made a terrible blunder.

See https://github.com/slott56/py-web-tool for some background. This is a "Literate Programming" tool. I started fooling around with this kind of thing back in '05 (maybe even earlier.) This is not the blunder. The whole idea of literate programming is not very popular. I'm a fan of Jupyter{Book} as the state of the art in sophisticated literate programming, if you're interested in it.

In my case, I started this project so long ago, I used docutils. This was long before Sphinx arrived on the scene. I never updated my little project to use Sphinx. The point was to have a kind of pure literate programming tool that could work with a variety of markup languages, including (but not limited to) RST.

Recently, I learned about PlantUML. The idea of a text description of a diagram is appealing. I don't really need to draw it; I just need to specify what's in it and let graphviz do the rest. This tool is very, very cool. You can capture ideas quickly. You can refine and expand on ideas until you reach a point where code makes more sense than a picture of code. 

For some things, you can gather data and draw a picture of things *as they are*. This is particularly valuable for cloud-based infrastructure where a few queries leads to PlantUML source that is depicted very nicely.

Which leads to the idea of Literate Programming including UML diagrams. 

Doesn't sound too difficult. I can create an extension to docutils to introduce a UML directive. The resulting RST would look like this:

..  uml::

    left to right direction
    skinparam actorStyle awesome

    actor "Developer" as Dev
    rectangle PyWeb {
        usecase "Tangle Source" as UC_Tangle
        usecase "Weave Document" as UC_Weave
    }
    rectangle IDE {
        usecase "Create WEB" as UC_Create
        usecase "Run Tests" as UC_Test
    }
    Dev --> UC_Tangle
    Dev --> UC_Weave
    Dev --> UC_Create
    Dev --> UC_Test

    UC_Test --> UC_Tangle

It could be handy to have the diagrams as part of the documentation that tangles the working code. One source for all of it.

I started down the path of researching docutils extensions. Got pretty far. Far enough that I had an empty repository and everything. I was about ready to start creating spike solutions.

Then.

[music cue] *duh duh duuuuuuh*

I found that Sphinx already has an extension for PlantUML. I almost started reading the code to see how it worked.

Then I realized how dumb that was. It already works. Why read the code? Why not install it?

I had a choice to make.

  1. Continue building my own docutils plug-in.
  2. Switch to Sphinx.

Some complications:

  • My Literate Programming tool produces RST that *may* not be compatible with Sphinx.
  • It's yet another dependency in a tool that started out with zero dependencies. I've added pytest and tox. What next? 

What to do?

I have to say that Git is amazing. I can make a branch for the spike. If it works, pull request. If it doesn't work, delete the branch. This continues to be game-changing to me. I'm old. I remember when we had to back up the whole project directory tree before making this kind of change.

It worked. My tool's RST (with one exception) worked perfectly with Sphinx. The one exception was an obscure directive, .. class:: name, used to provide an HTML class name for the following block. This always should have been the docutils .. container:: name directive. With this fix, we're good to go.
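
For reference, wiring the PlantUML extension into Sphinx is a small conf.py change. A sketch, assuming the sphinxcontrib-plantuml package is installed; the jar path is a placeholder, not my actual setup.

# conf.py
extensions = [
    "sphinxcontrib.plantuml",
]
plantuml = "java -jar /usr/local/lib/plantuml.jar"  # placeholder path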

I'm happy I avoided the trap of reimplementing something. Instead of that, I upgraded from "bare" docutils with my own CSS to Sphinx with its sophisticated templates and HTML Themes.

Tuesday, August 2, 2022

Books! Books! Books!

First, there's 

Pivot to Python

A Guide for professionals and skilled beginners

https://books.apple.com/us/book/pivot-to-python/id1586977675 

I've recently updated this to fix some cosmetic problems with title pages, the table of contents and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program and want to pivot to programming in Python. Quickly.

But wait, there's more. 

Unlearning SQL

When your only tool is a hammer, every problem looks like a nail

https://books.apple.com/us/book/unlearning-sql/id6443164060

This is all new. It's written for folks who know Python, and are struggling with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems.

Tuesday, July 26, 2022

Bashing the Bash -- The shell is awful and what you can do about it

A presentation I did recently.

https://github.com/slott56/bashing-the-bash

Folks were polite and didn't have too many questions. I guess they fundamentally agreed: the shell is awful, we can use it for a few things.

Safe Shell Scripts Stay Simple: Set the environment, Start the application.

The Seven S's of shell scripting.

Many many thanks to Code & Supply for hosting me.

Tuesday, July 19, 2022

I've got a great Proof-of-Concept. How do I go forward with it?

This is the best part about Python -- you can build something quickly. And it really works.

But. 

What are the next steps?

While there are a *lot* of possibilities, I'm focused on an "enterprise work group" application that involves a clever web service/RESTful API built in Flask. Maybe with NLP.

Let me catalog a bunch of things you might want to think about to "productionize" your great idea. Here's a short list to get started.

  • File System Organization
  • Virtual Environments
  • Unit Testing
  • Integration Testing
  • Acceptance Testing
  • Static Analysis
  • Tool Chain
  • Documentation

Let's dive into each one of these. Then we'll look at Flask deployments.

File System Organization

When you've gotten something to work, the directory in which it works is sometimes not organized ideally. There are a lot of ways to do this, but what seems to work well is a structure like the following.

- Some parent directory. Often in Git
  - src -- your code is here
  - tests -- your tests are here
  - docs -- your documentation will be here
  - requirements.txt -- the list of packages to install, with exact, pinned version numbers
  - requirements-dev.txt -- the list of packages used for maintenance and development
  - environment.yml -- another list of packages, in conda format
  - pyproject.toml -- this has your tox setup in it
  - Makefile -- sometimes helpful

Note that a lot of packages you see have a setup.py. This is **only** needed if you're going to open source your code. For enterprise projects, this is not the first thing you will focus on. Ignore it, for now.

Virtual Environments

When you're developing in Python you may not even worry about virtual environments. You have Python. It works. You downloaded NLP and Flask. You put things together and they work.

The trick here is the Python ecosystem is vast, and you have (without really observing it closely) likely downloaded a lot of projects. Projects that depend on projects. 

You can't trust your current environment to be reliable or repeatable. You'll need to use a virtual environment manager of some kind.

Python's built-in virtual environment manager venv is readily available and works nicely. See https://docs.python.org/3/library/venv.html  It's my second choice. 

My first choice is conda. Start with miniconda: https://docs.conda.io/en/latest/miniconda.html. Use this to assemble your environment and retest your application to be sure you've got everything.

You'll be creating (and destroying) virtual environments until you get it right. They're cheap. They don't impact your code in any way. Feel free to make mistakes.

When it works, build conda's environment.yml file and the requirements.txt files. These will rebuild the environment. You'll use them with tox for testing.

If you don't use conda, you'll omit the environment.yml.  Nothing else will change.

Unit Testing

Of course, you'll need automated unit tests. You'll want 100% code coverage. You *really* want 100% logic path coverage, but that's aspirational. 100% code coverage is a lot of work and uncovers enough problems that the extra testing for all logic paths seems unhelpful.

You have two built-in unit testing toolsets: doctest and unittest. I like doctest. https://docs.python.org/3/library/doctest.html

You'll want to get pytest and the pytest-cov add-on package. https://docs.pytest.org/en/6.2.x/contents.html  https://pytest-cov.readthedocs.io/en/latest/.  

Your test modules go in the tests directory. You know you've done it right when you can use the pytest command at the command line and it finds (and runs) all your tests. 
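
A minimal sketch of what such a test module might look like; the mean() function and the analytics package name are hypothetical stand-ins for your own code.

# tests/test_mean.py
from analytics.stats import mean

def test_mean_simple() -> None:
    assert mean([1.0, 2.0, 3.0]) == 2.0

def test_mean_single_value() -> None:
    assert mean([42.0]) == 42.0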

This is part of your requirements-dev.txt file.

Integration Testing

This is unit testing without so many mocks. I recommend using pytest for this, also. The difference is that your "fixtures" will be much more complex. Files. Databases. Flask Clients. Certificates. Maybe starting multiple services. All kinds of things that have a complex setup and perhaps a complex teardown, also.

See https://docs.pytest.org/en/6.2.x/fixture.html#yield-fixtures-recommended for good ways to handle this more complex setup and teardown.
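
A minimal sketch of a yield fixture with explicit setup and teardown; the config file name and contents are invented, and tmp_path is pytest's own built-in fixture.

import pytest

@pytest.fixture
def config_file(tmp_path):
    path = tmp_path / "app_config.toml"
    path.write_text('[service]\nport = 8080\n')   # setup: build a real file
    yield path                                    # the test runs here
    path.unlink()                                 # teardown (tmp_path cleanup would also catch this)

def test_reads_config(config_file):
    assert "port = 8080" in config_file.read_text()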

Acceptance Testing

Depending on the community of users, it may be necessary to provide automated acceptance tests. For this, I recommend behave. https://behave.readthedocs.io/en/stable/ You're can write the test cases in the Gherkin language. This language is open-ended, and many stakeholders can contribute to the test cases. It's not easy to get consensus sometimes, and a more formal Gherkin test case lets people debate, come to an agreement, and prioritize the features and scenarios they need to see.

This is part of your requirements-dev.txt file.

Static Analysis

This is an extra layer of checking to be sure best practices are being followed. There are a variety of tools for this. You *always* want to process your code through black: https://black.readthedocs.io/en/stable/

Some folks love isort for putting the imports into a canonical order.  https://pycqa.github.io/isort/

Flake8 should be used to be sure there are no obviously bad programming practices. https://flake8.pycqa.org/en/latest/

I'm a huge fan of type hints. I consider mypy to be essential. https://mypy.readthedocs.io/en/stable/  I prefer "--strict" mode, but that can be a high bar. 

Tool Chain

You can try to manage this with make. But don't.

Download tox, instead.  https://tox.wiki/en/latest/index.html  

The point of tox is to combine virtual environment setup with testing in that virtual environment. You can -- without too much pain -- define multiple virtual environments. You can then test the various releases of the various packages your project depends on in various combinations. This is how to manage a clean upgrade. 

1. Figure out the new versions.

2. Setup tox to test existing and new.

3. Run tox.

I often set the tox commands to run black first, then unit testing, then static analysis, ending with mypy --strict.

When the code is reformatted by black, it's technically a build failure. (You should have run black manually before running tox.) When tox works cleanly, you're ready to commit and push and pull request and merge.

Documentation

Not an after-thought.

For human documents, use Sphinx. https://www.sphinx-doc.org/en/master/ 

Put docstrings in every package, every module, every class, every method, and every function. Summarize *what* and *why*. (Don't explain *how*: people can read your code.) 
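
A made-up example of that docstring style: the summary says what and why, and leaves the how to the code.

def normalize(text: str) -> str:
    """
    Fold case and strip surrounding whitespace (what).

    Downstream matching must treat "  Spark " and "spark" as the
    same token (why).
    """
    return text.strip().lower()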

Use the autodoc feature to create the API reference documentation from the code. Start with this.

Later, you can write a README, and some explanations, and installation instructions, and all the things other people expect to see.

For a RESTful API, be sure to write an OpenAPI specification and be sure to test against that spec. https://www.openapis.org. While a lot of the examples are complicated, you can easily use a small subset to describe your documents, the validation rules, and the transactions. You can add the security details later. They're part of your web server, but they don't need an extensive OpenAPI documentation at the beginning.

Flask Deployments

Some folks like to define a flask application that can be installed in the Python virtual environment. This means the components are on the default sys.path without any "extra" effort. (It's a fair amount of effort to begin with. I'm not sure it's worth it.)

When you run a flask app, you'll be using some kind of engine. NGINX, uWSGI, GUnicorn, etc. (GUnicorn is very nice. https://gunicorn.org). 

See https://flask.palletsprojects.com/en/2.0.x/deploying/wsgi-standalone/.

In all cases, these engines will "wrap" your Flask application. You'll want to make your application visible by setting the PYTHONPATH environment variable, naming your src directory. Do not run from your project's directory.

You will have the engine running in some distinct /opt/the_app or /Users/the_app or /usr/home/the_app or some such directory, unrelated to where the code lives. You'll use GUnicorn's command-line options to locate your app, wherever it lives on the filesystem. GUnicorn will use PYTHONPATH to find your app. Since web servers often run as nobody, you'll need to make sure your code base is readable. But. Not. Writable.

Tuesday, July 12, 2022

The Enterprise COBOL Conundrum

Enterprise COBOL is both a liability and an asset. There's tangible value hidden in the code.

See https://github.com/slott56/looking-at-cobol  

I've tweaked the presentation a little. 

The essential ingredients in coping with COBOL are these:

  • Use something like Stingray Reader to parse COBOL DDE's and process the data in the native format.
  • Analyze the Job Control Language (JCL) to work out the directed acyclic graph (DAG) that leads to file and database updates. These "master" files and databases are the data artifacts that matter most. This is the value-creating processing. There aren't many of these files. 
  • Create a process to clone those files, and write Python data access modules to process the data. This is a two-way process. You'll be shipping files from your Z/OS world to another server running Python. In some cases, files will need to come back to Z/OS to permit legacy processing to continue. 
  • Work backwards through the DAG to understand the COBOL apps that update the master files. These can be rewritten as Python apps that consume transactions and update master files/databases. Transfer transaction files out of Z/OS to a server doing the Python processing. Either update a shared database or send updated master files back to Z/OS if there's further processing that needs an updated master. 
  • Continue working backwards through the DAG, replacing COBOL with Python until you've found source files for the transactions. Expect to find transaction validation programs as well as transaction analytics or reporting. The validations are useful; the analytics and reporting can be replaced with simpler, more modern tools.
  • When there's no more legacy processing that depends on a given master file or database, then the Z/OS can be formally decommissioned. Have a party.

This is relatively low risk work. It's high value. The COBOL code encodes enterprise knowledge. Preserving this knowledge in a more modern language is a value-maintaining exercise. Indeed, the improved clarity may be a value-creating exercise.

Tuesday, July 5, 2022

Revised Understanding --> Revised Data Structures --> Revised Type Hints

My literate programming tool, pyWeb, has moved to version 3.1 -- supporting modern Python.

Next up, version 3.2. This is a massive reworking of the data structures involved. The rework lets me use Jinja2 for templates. There's a lot of fiddliness to getting the end-of-line spacing right. Jinja has the following:

{% for construct in container -%}
{{construct}}
{%- endfor %} 

The easy-to-overlook hyphens suppress spacing, allowing the construct to be spread onto multiple lines without introducing extra newlines into the output. This makes it a little easier to debug the templates.

It now works. But. Until I get past strict type checks, there's no reason for calling it done.

Found 94 errors in 1 file (checked 3 source files)

The bulk of the remaining problems seem to be new methods where I forgot to include a type hint. The more pernicious problems are places where I have inconsistent hints and Liskov substitution problems. The worst are places where I had a last-minute change and switched from str to int and did not actually follow through and make the required changes.

The biggest issue?

When building an AST, it's common to have a union of a wide variety of types. This union often has a discriminator value to separate NamedChunk from OutputChunk. This is "type narrowing" and there are a variety of approaches. I think my best choice is a TypeGuard declaration. This is new to me, so I've got to do some learning before I can properly define the required type guard function(s). (See https://mypy.readthedocs.io/en/stable/type_narrowing.html#user-defined-type-guards)
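
Here's a minimal sketch of what I think the type guard looks like (typing.TypeGuard, Python 3.10+); the Chunk hierarchy shown is a simplified stand-in for the real pyWeb node classes.

from typing import TypeGuard

class Chunk: ...
class NamedChunk(Chunk): ...
class OutputChunk(Chunk): ...

def is_named(chunk: Chunk) -> TypeGuard[NamedChunk]:
    return isinstance(chunk, NamedChunk)

def weave(chunk: Chunk) -> None:
    if is_named(chunk):
        # mypy narrows chunk to NamedChunk inside this branch
        ...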

I'm looking forward (eagerly) to finishing the cleanup. 

The problem is that I'm -- also -- working on the updates to Functional Python Programming. The PyWeb project is a way to relax my brain from editing the book. 

Which means the pyWeb updates have to wait for Chapter 4 and 5 edits. (Sigh.)

Tuesday, June 28, 2022

Massive Rework of Data Structures

As noted in My Shifting Understanding and A Terrible Design Mistake, I had a design that focused on serialization instead of proper modeling of the objects in question.

Specifically, I didn't start with a suitable abstract syntax tree (AST) structure. I started with an algorithmic view of "weaving" and "tangling" to transform a WEB of definitions into documentation and code. The weaving and tangling are two of the three distinct serializations of a common AST. 

The third serialization is the common source format that underpins the WEB of definitions. Here's an example that contains a number of definitions and a tangled output file.

Fast Exponentiation
===================

A classic divide-and-conquer algorithm.

@d fast exp @{
def fast_exp(n: int, p: int) -> int:
    match p:
        case 0: 
            return 1
        case _ if p % 2 == 0:
            t = fast_exp(n, p // 2)
            return t * t
        case _ if p % 2 == 1:
            return n * fast_exp(n, p - 1)
@| fast_exp
@}

With a test case.

@d test case @{
>>> fast_exp(2, 30)
1073741824
@}

@o example.py @{
@< fast exp @>

__test__ = {
    "test 1": '''
@< test case @>
    '''
}
@| __test__
@}

Use ``python -m doctest`` to test.

Macros
------

@m

Names
-----

@u

This example uses RST as the markup language for the woven document. A tool can turn this simplified document into complete RST with appropriate wrappers around the code blocks. The tool can also weave the example.py file from the source document.

The author can focus on exposition, explaining the algorithm. The reader gets the key points without the clutter of programming language overheads and complications.

The compiler gets a tangled source.

The key point is to have a tool that's (mostly) agnostic with respect to programming language and markup language. Being fully agnostic isn't possible, of course. The @d name @{code@} constructs are transformed into markup blocks of some sophistication. The @<name@> becomes a hyperlink, with suitable markup. Similarly, the cross reference-generating commands, @m and @u, generate a fair amount of markup content. 

I now have Jinja templates to do this in RST. I'll also have to provide LaTeX and HTML. Further, I need to provide generic LaTeX along with LaTeX I can use with PacktPub's LaTeX publishing pipeline. But let's not look too far down the road. First things first.

TL;DR

Here's today's progress measurement.

==================== 67 failed, 13 passed, 1 error in 1.53s ====================

This comforts me a great deal. Some elements of the original structure still work. There are two kinds of failures: new test fixtures that require TestCase.setUp() methods, and tests for features that are no longer part of the design.

In order to get the refactoring to a place where it would even run, I had to incorporate some legacy methods that -- it appears -- will eventually become dead code. It's not totally dead, yet, because I'm still mid-way through the refactoring. 

But. I'm no longer beating back and forth trying to see if I've got a better design. I'm now on the downwind broad reach of finding and fixing the 67 test cases that are broken. 

Tuesday, June 21, 2022

My Shifting Understanding and A Terrible Design Mistake

I've been fascinated by Literate Programming forever. 

I have two utterly divergent takes on this.

See https://github.com/slott56/PyLit-3 for one.

See https://github.com/slott56/py-web-tool for another.

And yet, I've still done a really bad design job. Before we get to the design, a little bit of back story.

Back Story

Why two separate literate programming projects? Because it's not clear what's best. It's a field without too many boundaries and a lot of questions about the value produced.

PyLit I found, forked, and upgraded to Python 3. I didn't design it. It's far more clever than something I'd design.

Py-Web-Tool is something I wrote based on using a whole bunch of tools that follow along behind the original WEB tools. Nothing to do with web servers or web.py.

The Problem Domain

The design problem is, in retrospect, pretty obvious. I set it out here as a cautionary tale.

I'm looking at the markup languages for doing literate programming. The idea is to have named blocks of code in your document, presented in an order that makes sense to your reader. A tool will "weave" a document from your source. It will also "tangle" source code by rearranging the code snippets from presentation order into compiler-friendly order.

This means you can present your core algorithm first, even though it's buried in the middle of some module in the middle of your package. 

The presentation order is *not* tied to the order needed by your language's toolchain.

For languages like C this is huge freedom. For Python, it's not such a gigantic win.

The source material is a "web" of code and information about the code. A web file may look like this:

Important insight.

@d core feature you need to know about first @{
    def somecode() -> None:
        pass
@}

And see how this fits into a larger context?

@d something more expansive @{
def this() -> None:
    pass
    
def that() -> None:
    pass
    
@<core feature you need to know about first@>
@}

See how that works?

This is easy to write and (relatively) easy to read. The @<core feature you need to know about first@> becomes a hyperlink in the published documentation. So you can flip between the sections. It's physically expanded inline to tangle the code, but you don't often need to look at the tangled code.

The Design Question

The essential Literate Programming tool is a compiler with two outputs:

  • The "woven" document with markup and such
  • The "tangled" code files which are code, largely untouched, but reordered.

We've got four related problems.

  1. Parsing the input
  2. An AST we can process
  3. Emitting tangled output from the AST
  4. Emitting woven output from the AST

Or, we can look at it as three classic problems: deserialization, AST representation, and serialization. Additionally, we have two distinct serialization alternatives.

What did I do?

I tackled serialization first. Came up with a cool bunch of classes and methods to serialize the two kinds of documents.

Then I wrote the deserialization (or parsing) of the source WEB file. This is pretty easy, since the markup is designed to be as trivial as possible. 

The representation is little more than glue between the two.

What a mistake.

A Wrong Answer

Focusing on serialization was an epic mistake.

I want to try using Jinja2 for the markup templates instead of string.Template.

However. 

My AST was such a bad hack job it was essentially impossible to use it. It was a quagmire of inconsistent ad-hoc methods to solve a specific serialization issue.

As I started down the Jinja road, I found I needed to be able to build an AST without the overhead of parsing.

Which caused me to realize that the AST was -- while structurally sensible -- far from the simple ideal.

What's the ideal?

The Right Answer

This ideal AST is something that lets me build test fixtures like this:

example = Web(
   chunks=[
       TextChunk("\n"),
       NamedCodeChunk(name="core feature you need to know about first", lines=["def someconme() -> None: ...", "pass"])),
       TextChunk("\nAnd see how this fits into a larger context?\n"),
       NamedCodeChunk(name="something more expansive", lines=[etc. etc.])
   ]
)

Here's my test for usability: I can build the AST "manually" without a parser. 

The parser can build one, also, but I can build it as a sensible, readable, first-class Python object.

This has pointed me to a better design for the overall constructs of the WEB source document. Bonus. It's helping me define Jinja templates that can render this as a sensible woven document.

Tangling does not need Jinja. It's simpler. And -- by convention -- the tangled code does not have anything injected into it. The woven code is in a markup language (Markdown, RST, HTML, LaTeX, ASCII DOC, whatever) and some markup is required to create hyperlinks and code sections. Jinja is super helpful here. 
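
Here's the flavor of the weaving side. The class and attribute names are invented stand-ins for the real ones; code chunks become RST hyperlink targets and literal blocks, text chunks pass through untouched.

from dataclasses import dataclass
from jinja2 import Template

@dataclass
class TextChunk:
    text: str

@dataclass
class NamedCodeChunk:
    name: str
    lines: list[str]

# Code chunks become an RST hyperlink target plus a literal block.
weave_rst = Template(
    "{% for chunk in chunks %}"
    "{% if chunk.lines is defined %}"
    ".. _`{{ chunk.name }}`:\n\n::\n\n"
    "{% for line in chunk.lines %}    {{ line }}\n{% endfor %}"
    "{% else %}"
    "{{ chunk.text }}"
    "{% endif %}"
    "{% endfor %}"
)

chunks = [
    TextChunk("Important insight.\n\n"),
    NamedCodeChunk(
        name="core feature you need to know about first",
        lines=["def somecode() -> None:", "    pass"],
    ),
]
print(weave_rst.render(chunks=chunks))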

TL;DR

The essence of the problem is rarely serialization or deserialization.  It's the internal representation.


Tuesday, June 14, 2022

A LaTeX Thing I Did -- And A ToDo:

When writing about code in LaTeX, the essential strategy is to use an environment to format the code so it stands out from surrounding text. There are a few of these environments available as LaTeX add-on packages. The three popular ones are:

  • verbatim
  • listings
  • minted

These are nice for making code readable and distinct from the surrounding text.

A common way to talk about the code is to use inline verbatim \verb|code| sections. I prefer inline \lstinline|code|, but, my editor prefers \verb. (I have trouble getting all the moving parts of minted installed properly, so I use listings.)

Also. And more important. 

There's the \lstinputlisting[language=Python, firstline=2, lastline=12]{some_module.py} command. This lets an author incorporate examples from working, tested modules. Minted doesn't seem to have this, but it might work with an \input command. Don't know. Haven't tried.

Let's talk about workflow.

Workflow

The idea behind these tools is you have code and after that, you write about the code. I call this code first.

Doing this means you can include code snippets from a file.

Which is okay, but, there's another point of view: you have a document that contains the code. This is closer to the Literate Programming POV. I call this document first. I've got all the code in the document you're reading, I've just broken it up and spread it around in an order to serve my purpose as a writer, not serve the limitations of a parser or compiler.

There is a development environment -- WEB -- to create code that can be run through the Weave and Tangle tools to create working code and usable documentation. This is appealing in many ways. 

For now, I'm settling for the following workflow:

  1. Write the document with code samples. Use \lstlisting environment with explicit unique labels for each snippet. The idea is to focus on the documentation with explanations.
  2. Write a Jinja template that references the code samples. This is a lot of {{extract['lst:listing_1']}} kind of references. There's a bit more that can go in here, we'll return to the templates in a moment.
  3. Run a tool to extract all the \lstlisting environments to a dictionary with the label as the key and the block of text as the value. This serializes nicely as a JSON (or TOML or YAML) file. It can even be pickled, but I prefer to be able to look at the file to see what's in it.
  4. The tool to populate the template is a kind of trivial thing to build a Jinja environment, load up the template, fill in the code samples, and write the result.
  5. I can then use tox (and doctest and pytest and mypy) to test the resulting module to be sure it works.

This tangles code from a source document. There's no weave step, since the source is already designed for publication. This does require me to make changes to the LaTeX document I'm writing and run a make test command to extract, tangle, and test. This is not a huge burden. Indeed, it's easy to implement in PyCharm, because the latest release of PyCharm understands Makefiles and tox. Since each chapter is a distinct environment, I can use tox -e ch01 to limit the testing to only the chapter I'm working on.

I like this because it lets me focus on explanation, not implementation details. It helps me make sure that all the code in the book is fully tested. 
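
Steps 3 and 4 are tiny. Here's a sketch of the idea; the regex is simplified, the file names are placeholders, and the real tool handles more LaTeX detail than this.

# Pull labeled lstlisting blocks out of a chapter, save them as JSON,
# and pour them into a Jinja template. Names here are placeholders.
import json
import re
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

LISTING = re.compile(
    r"\\begin\{lstlisting\}\[.*?label=(?P<label>[^,\]]+).*?\]\n"
    r"(?P<body>.*?)"
    r"\\end\{lstlisting\}",
    re.DOTALL,
)

def extract(chapter: Path) -> dict[str, str]:
    """Map each listing label to its block of code."""
    return {
        match.group("label"): match.group("body")
        for match in LISTING.finditer(chapter.read_text())
    }

if __name__ == "__main__":
    extracted = extract(Path("ch01.tex"))
    Path("ch01_extract.json").write_text(json.dumps(extracted, indent=2))
    env = Environment(loader=FileSystemLoader("templates"))
    module = env.get_template("ch01_example.py.jinja").render(extract=extracted)
    Path("ch01_example.py").write_text(module)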

The Templates

The template files for an example module have these three kinds of code blocks:

  1. Ordinary Listings. These fall into two subclasses.
    1. Complete function or class definitions.
    2. Lines of code taken out of context.
  2. REPL Examples. 

These have three different testing requirements. We'll start with the "complete function or class definitions."  For these, the template might look like the following

{{extract['lst:listing_1']}}

def test_listing_1() -> None:
    assert listing_1(42)
    assert not listing_1(None)

This has both the reference to the code in the text of the book and a test case for the code.

For lines of code out of context, we have to be more careful. We might have this.

def some_example(arg: int) -> bool:
    {{extract['lst:listing_2']}}

def test_listing_2() -> None:
    assert listing_2(42)
    assert not listing_2(None)

This is similar to a complete definition, but it has a fiddly indentation that needs to be properly managed, also. Jinja's generally good about not inserting spaces. The template, however, is full of what could appear to be syntax errors, so the code editor could have a conniption with all those {} blocks of code. They happen to be valid Python set literals, so, they're tolerated. PyCharm's type checking hates them.

The REPL examples, look like this.

REPL_listing_3 = """
{{extract['lst:listing_3']}}
"""

I collect these into a __test__ variable to make them easy for doctest to find. The extra fussiness of  a __test__ variable isn't needed, but it provides a handy audit for me to make sure everything has a home.

The following line of code is in most (not all) templates.

__test__ = {
    name: value
    for name, value in globals().items() 
    if name.startswith("REPL")
}

This will locate all of the global variables with names starting with REPL and put them in the __test__ mapping. The REPL names then become the test case names, making any test failures easier to spot.

My Goal

I do have some Literate Programming tools that I might be able to leverage to make myself a Weaver that produces useful LaTeX my publisher can work with. I should do this because it would be slightly simpler. The problem is my Web/Weave/Tangle tooling has a bunch of dumb assumptions about the weave and tangle outputs; a problem I really need to fix.

See py-web-tool.

The idea here is to mimic other WEB-based tooling. These are the two primary applications:

  • Weave. This makes documentation in a fairly transparent way from the source. There are a bunch of substitutions required to fill in HTML or LaTeX or Markdown or RST around the generic source. Right now, this is pretty inept and almost impossible to configure.
  • Tangle. This makes code from the source. The point here is the final source file is not necessarily built in any obvious order. It's a tangle of things from the documentation, put into the order required by parser or compiler or build system or whatever.

The weaving requires a better way to provide the various templates that fill in missing bits. Markdown, for example, works well with fenced blocks. RST uses a code directive that leads to an extra level of indentation that needs to be carefully excised. Further, most markup languages have a mountain of cruft that goes around the content. This is unpleasantly complex, and very much subject to odd little changes that don't track against the content, but are part of the evolution of the markup language.

My going-in assumption on tangling was the document contained all the code. All of it. Without question or exception. For C/C++ this means all the fiddly little pre-processor directives that add no semantic clarity yet must be in the code file. This means the preprocessor nonsense had to be relegated to an appendix of "yet more code that just has to be there."

After writing a tangler to pull code from a book into a variety of contexts, I'm thinking I need to have a tangler that works with a template engine. I think there would be the following two use cases:

  • No-Template Case. The WEB source is complete. This works well for a lot of languages that don't have the kind of cruft that C/C++ has. It generally means a WEB source document will contain definition(s) for the final code file(s) as a bunch of references to the previously-explained bits. For C/C++, this final presentation can include the fiddly bits of preprocessor cruft.
  • Template Case. A template is combined with the source to create the tangled output. This is what I have now for pulling book content into a context where it is testable. For the most part, the template files are quite small because the book includes test cases in the form of REPL blocks. This presents a bit of a problem because it breaks the "all in one place" principle of a WEB project. I have a WEB source file with the visible content plus one or more templates with invisible content.

What I like about this approach is that it reduces some of the cruftiness of the various tools. 

I think my py-web-tool might be expanded to handle my expanded understanding of literate programming. 

I have a book to finish, first, though. Then I can look at improving my workflow. (And yes, this is backwards from a properly Agile approach.)

Tuesday, April 12, 2022

Pelican and Static Web Content

In Static Site Blues I was wringing my hands over ways to convert a ton of content from two different proprietary tools (the very old iWeb, and the merely old Sandvox) into something I could work with.

After a bit of fiddling around, I'm delighted with Pelican.

First, of course, I had to extract all the iWeb and Sandvox content. This was emphatically not fun. While both used XML, they used it in subtly different ways. Apple's frameworks serialize internal state as XML in a way that preserves a lot of semantic details. It also preserves endless irrelevant details.

I wound up with a Markdown data structure definition, plus a higher-level "content model" with sites, pages, blogs, blog entries and images. Plus the iWeb extractor and the Sandvox extractor. It's a lot of code, much of which lacks solid unit test cases. It worked -- once -- and I was tolerant of the results.

I also wound up writing tools to walk the resulting tree of Markdown files doing some post-extraction cleanup. There's a lot of cleanup that should be done.

But.

I can now add to the blog with the state of my voyaging. I've been able to keep Team Red Cruising up to date.

Eventually (i.e., when the boat is laid up for Hurricane Season) I may make an effort to clean up the older content and make it more consistent. In particular, I need to add some annotations around anchorages to make it possible to locate all of the legs of all of the journeys. Since the HTML is what most people can see, that means a class identifier for lat-lon pairs. 

As it is, the blog entries are *mostly* markdown. Getting images and blockquotes even close to readable requires dropping to HTML to make direct use of the bootstrap CSS. This also requires some comprehensive cleanup to properly use the Bootstrap classes. (I think I may have introduced some misspelled CSS classes into the HTML that aren't doing anything.)

For now, however, it works. I'm still tweaking small things that require republishing *all* the HTML. 

Tuesday, March 1, 2022

Static Site Blues

I have a very large, static site with 10+ years of stuff about my boat. Most of it is pretty boring. http://www.itmaybeahack.com/TeamRedCruising/

I started with iWeb. It was very -- well -- 2000-ish look and feel. Too many pastels and lines and borders.

In 2012, I switched to Sandvox. I lived on a boat back then and didn't have reliable internet. Using blogger.com, for example, required a sincere commitment to bandwidth. I moved ashore in 2014 and returned to the boat in 2020.

Sandvox's creator seems to be out-of-business.

What's next?

Give up on these fancy editors and switch to a static site generator. Write markdown. Run the tool. Upload when in a coffee shop with Wi-Fi. 

What site generator?

See https://www.fullstackpython.com/static-site-generator.html for some suggestions.

There are five parts to this effort.

  1. Extract the goodness from iWeb and Sandvox. I knew this would be real work. iWeb's site has too much Javascript to be easy to parse. I have to navigate the underlying XML database. Sandvox is much easier to deal with: their published site is clean, static HTML with useful classes and ids in their tags.
  2. Reformat the source material into Markdown. I've grudgingly grown to accept Markdown, even though RST is clearly superior. Some tools work with RST and I may pandoc the entire thing over to RST from Markdown. For now, though, the content seems to be captured.
  3. Fixup internal links and cross references. This is a godawful problem. Media links -- in particular -- seem to be a nightmare. Since iWeb resolves things via Javascript, the HTML is opaque. Fortunately, the database's internal cross-references aren't horrible. Maybe this was exacerbated by a poor choice of generators. 
  4. Convert to HTML for a local server. Validate.
  5. Convert to HTML for the target server. Upload to a staging server and validate again. This requires a coffee shop. Not doing this with my phone's data plan.

Steps 1 and 2 aren't too bad. I've extracted serviceable markdown from the iWeb database and the published Sandvox site. The material parallels the Site/Blog/Page structure of the originals. The markdown seems to be mostly error-free. (Some images have the caption in the wrong place, ![caption](link) isn't as memorable as I'd like.) 

Step 3, the internal links and cross-references, has been a difficult problem, it turns out. I can, mostly, associate media with postings. I can also find all the cross-references among postings and fix those up. The question that arises is how to reference media from a blog post?

Mynt

I started with mynt. And had to bail. It's clever and very simple. Too simple for blog posts that have a lot of associated media assets.

The issue is what to write in the markdown to refer to the images that go with a specific blog post. I resorted to a master _Media directory. Which means each posting has ![caption](../../../../_Media/image.png) in it. This is semi-manageable. But exasperating in bulk. 

What scrambled my brain is the way a mynt posting becomes a directory, with an index.html. Clearly, the media could be adjacent to the index.html. But. I can't figure out how to get mynt's generator to put the media into each post's published directory. It seems like each post should not be a markdown file. 

Also, I can trivially change the base URL when generating, but I can't change the domain. When I publish, I want to swap domains *only*, leaving the base URL alone. I tried. It's too much fooling around.

Pelican

Next up. Pelican. We'll see if I can get my media and blog posts neatly organized. This http://chdoig.github.io/create-pelican-blog.html seems encouraging. I think I should have started here first. Lektor is another possibility.

Since my legacy sites have RSS feeds, it may be sensible to turn Pelican loose on the RSS and (perhaps) skip steps 1, 2, and 3, entirely.


Tuesday, February 15, 2022

LaTeX Mysteries and an algorithmicx thing I learned.

I've been an on-and-off user of LaTeX since the very, very beginning. Back in the dark days when the one laser printer that could render the images was in a closely-guarded secret location to prevent everyone from using it and exhausting the (expensive) toner cartridges.

A consequence of this is I think the various algorithm environments are a ton of fun. Pseudo-code with math embedded in it. It's marvelous. It's a pain in the neck with this clunky blogging package, so I can't easily show off the coolness. But. You can go to https://www.overleaf.com/learn/latex/Algorithms to see some examples.

None of which have try/except blocks. Not a thing.

Why not? I suspect it's because "algorithmic" meant "Algol-60" for years. The language didn't have exceptions and so, the presentation of algorithms continues to this day without exceptions. 

What can one do?

This.

\algblock{Try}{EndTry}
\algcblock[Try]{Try}{Except}{EndTry}
\algcblockdefx{Try}{Except}{EndTry}
   [1][Exception]{\textbf{except} \texttt{#1}}

\algrenewtext{Try}{\textbf{try}}

This will extend the notation to add \Try, \Except, and \EndTry commands. I think I've done it all more-or-less correctly. I'm vague on where the \algnotext{EndTry} goes, but it seems to be needed in each \Try block to silence the \EndTry.
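
For what it's worth, here's how I expect the extended notation to look in use, assuming the definitions above; the statement bodies are placeholders, and the \algnotext placement is my best guess.

\begin{algorithmic}
\Try
    \State $v \gets parse(text)$
\Except[ValueError]
    \State $v \gets 0$
\algnotext{EndTry}
\EndTry
\end{algorithmic}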

As far as I know, I'm the only person who seems to care. There seems to be little about this anywhere online. I'm guessing it's because the basics work perfectly, and no one wants this kind of weird add-on.

Tuesday, February 8, 2022

Desktop Notifications and EPIC DESIGN FAIL

I was asked to review code that -- well -- was evil.

Not like "shabby" or "non-pythonic". Nothing so simple as that.

We'll get to the evil in a moment. First, we have to suffer two horrible indignities.

1. Busy Waiting

2. Undefined Post-Conditions.

We'll beat all three issues to death separately, starting with busy waiting.

Busy Waiting

The Busy Waiting is a sleep-loop. If you're not familiar, it's this:

while not the_thing_has_happened() and not timed_out():  # both predicates are placeholders
    time.sleep(2)

Which is often a dumb design. Busy waiting is polling: a lot of pointless activity while waiting for something else to happen.

There are dozens of message-passing and event-passing frameworks. Any of those is better than this.

Folks complain "Why install ZMQ when I could instead write a busy-waiting loop?"

Why indeed?

For me, the primary reason is to avoid polling at fixed intervals, and instead wait for the notification. 

The asyncio module, confusing as it is, is better than polling. Because it dispatches events properly.
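
For contrast, here's a minimal sketch of the event-driven version. Every name in it is invented for the example; the point is that we wait on a notification instead of polling for it.

import asyncio

async def worker(done: asyncio.Event) -> None:
    await asyncio.sleep(1.0)   # stands in for the real work
    done.set()                 # the notification

async def main() -> None:
    done = asyncio.Event()
    task = asyncio.create_task(worker(done))
    try:
        # Either this returns, or it raises TimeoutError -- never both.
        await asyncio.wait_for(done.wait(), timeout=5.0)
        print("the thing happened")
    except asyncio.TimeoutError:
        print("timed out; the thing did not happen")
    await task

if __name__ == "__main__":
    asyncio.run(main())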

This is minor compared with the undefined post-conditions.

Undefined Post-Conditions

With this crap design, there are two events. There's a race between them. One will win. The other will be silently lost forever.

If "something has not happened" is false, the thing has happened. Yay. The while statement ends.

If "something has not happened" is true and the timeout occurs, then Boo. The while statement ends.

Note that there are two unrelated post-conditions: the thing has happened OR the timeout occurred. Is it possible for both to happen? (hint: yes.)

Ideally, the timeout and the thing happening are well-separated in time.

Heh.

Otherwise, they're coincident, and it's a coin-toss as to which one will lead to completion of the while statement. 

The code I was asked to review made no provision for this unhappy coincidence. 

Which leads us to the pure evil.

Pure Evil

What's pure evil about this is the very clear statement that there are not enough desktop notification apps, and there's a need for another.

I asked for justification. Got a stony silence.

They might claim "It's only a little script that runs in the Terminal Window," which is garbage. There are already lots and lots of desktop apps looking for asynchronous notification of events.

Email is one of them.

Do we really need another email-like message queue?  

(Hint: "My email is a lot of junk I ignore" is a personal problem, not a software product description. Consider learning how to create filters before writing yet another desktop app.)

Some enterprises use Slack for notifications. 

What makes it even worse (I said it was pure evil) was a hint about the context. They were doing batch data prep for some kind of analytics/Machine Learning thing. 

They were writing this as if Luigi and related Workflow managers didn't exist.

Did they not know? If they were going to invent their own, they were off to a really bad start. Really bad.