All of the content has been moved, reformatted, and lightly edited to clean up a few of the long-standing problems. It now lives at https://slott56.github.io/. Nothing new will be posted here; all the new stuff will go to https://slott56.github.io/.
Rants on the daily grind of building software. This has been moved to https://slott56.github.io. Fix your bookmarks.
Check out DZone's Coding zone: https://dzone.com/coding. Hover over the "Coding" drop-down menu.
Notice anything lacking?
I'll give you a hint: Python.
They have "Frameworks", "Java", "Javascript", "Languages" and "Tools".
The "Languages" seems to be general programming, and the posts include Java, JavaScript and numerous other languages.
It leaves me with questions about the basis for the "Zones". It can't be popularity, since Java has slipped behind Python. Maybe there are other criteria; I wonder what they could be. Sponsorships? Or the historical "Javalobby" web site?
It also leaves me with the urge to suggest they radically rethink their approach to content management. The "Zones" don't seem to have crisp definitions. If they do, it would help to share them. If they don't, perhaps they should be assigned more dynamically as content is reviewed and included on the site.
https://books.apple.com/us/book/pivot-to-python/id1586977675
I've recently updated this to fix some cosmetic problems with title pages, the table of contents, and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program and want to pivot to programming in Python. Quickly.
But wait, there's more.
https://books.apple.com/us/book/unlearning-sql/id6443164060
Many folks know some Python, but struggle with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems effectively.
Also available on Google Play. Here's Unlearning SQL:
https://play.google.com/store/books/details?id=23WAEAAAQBAJ
I've made a clone of Pivot to Python, also.
Both books are (intentionally) short to help experts make rapid progress.
Some background: FAERIE DUST™, Obstinate Idiocy, Obstinate Idiocy, Expanded, and even Permutations, Combinations and Frustrations. I want to set up algorithm design as the diametric opposite of Obstinate Stupidity. To do that, let's look at Obstinate Stupidity.
The theme?
We did something wrong, and we don't want to fix it.
I emphasize this because it takes many forms. Another common variant is "We can't afford to continue the way we are, but we can't afford the time to fix it, either." Sometimes, "Management wants this fixed, but we don't have any budget." You know how it is.
The current go-round is someone who has an algorithm of intensely stupid (and largely irrelevant) complexity. See My algorithm performs badly, do I need asyncio?.
The situation is touchy. They have pre-reasoned an answer -- asyncio -- and they're looking for (a) confirmation that they're obviously right and (b) help rewriting something needlessly complex to somehow be faster even when it's compute-bound. Specifically, they want Faerie Dust.
How do I know it has needless, frivolous complexity?
Here are two symptoms.
Let's dwell on this second symptom for a moment. We have two competing issues:
This should make it clear they are looking at asyncio as a kind of Faerie Dust that will magically make the bad algorithm good. Without fixing the bad algorithm.
I want to emphasize the existence of details which can neither be explained nor removed. The hierarchy must be there simply because it must be there. Bizarre complications to walk the hierarchy are, therefore, essential even if no one can explain them.
To actually improve the processing they need a new algorithm.
I can't emphasize this enough: they need a new algorithm. (This often means a new data structure.)
"Tuning" in any form is nothing more than nibbling around the edges to make a bad algorithm run 15% faster.
Rewriting may replace \(\textbf{O}(2^n)\) with \(\textbf{O}(n \log n)\). This would be dramatically better. From seconds to milliseconds. You know, 1,000% faster.
There's a disciplined approach to this. Here are the steps.
The intensely painful part of this is creating the post-condition.
I suggested they "write an assert statement that must be true when the algorithm has completed, and computed the right answer."
Hahahah.
What an idiot I was.
They didn't know how to write an assert statement. And at this point, they stopped. Brick Wall. Dead in the water. Cannot proceed. Done. Failed.
The assert statement has become the end-of-the-line. They can't (or won't) do that. And they won't ask about it.
Me: "Do you have a question?"
Them: "I have to think before I can even begin to ask a question."
Me: "How about think less and ask more. Do you have trouble writing the word assert? What's stopping you?"
Them: [silence]
Okay.
The post-condition is true when you're done. Let's look at my favorite, M must be the maximum of A and B.
\[M \geq A \textbf{ and } M \geq B\]
This becomes an assert statement through (what seems to me, but boy was I wrong) the following kind of translation.
assert M >= A and M >= B, f"Algorithm Failed {M=} {A=} {B=}"
Again, I acknowledge I was wrong to think creating an assert statement from a post-condition was in any way clear. It's absolutely bewilderingly impossible.
It's also important to note that the above condition is incomplete. The value \(M = A+B\) will also satisfy the condition. We need to test our test cases to be sure they really do what we want.
We really need to be more complete on what the domain of values for \(M\) is.
\[(M = A \textbf{ or } M = B) \textbf{ and } M \geq A \textbf{ and } M \geq B\]
We could rewrite this slightly to be
\[M \in \{A, B \} \textbf{ and } M \geq A \textbf{ and } M \geq B\]
This version directly suggests a potential set comprehension to compute the result:
M = {m for m in {A, B} if m >= A and m >= B}.pop()
This is the advantage of writing post-conditions. They often map to code.
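Putting the condition and the comprehension together as a runnable sketch (the values of A and B are invented for the example):

```python
# Post-condition for "M is the maximum of A and B":
#   M in {A, B} and M >= A and M >= B
A, B = 3, 7

# The post-condition maps directly to a set comprehension.
M = {m for m in {A, B} if m >= A and m >= B}.pop()

# The very same condition doubles as a runtime check.
assert M in {A, B} and M >= A and M >= B, f"Algorithm Failed {M=} {A=} {B=}"
```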
You can even try it as pseudo-SQL if that helps you get past the assert statement.
SELECT M FROM (TABLE INT(X); A; B) WHERE M >= A AND M >= B
I made up a TABLE INT(X); A; B to describe a two-row table with candidate solutions. I'm sure SQL folks have other sorts of "interim table" constructs they like.
The point is to write down the final condition.
I'll repeat that because the folks I was trying to work with refused to understand the assert statement.
Write down the final condition.
The problem at hand seems to involve a result set, \(R\), pulled from nodes of some hierarchy, \(H\), \(R \subseteq H\). Each element of the hierarchy, \(h \in H\), has a set of strings, \(s(h)\). It appears that a target string, \(t\), must be a member of each result's string set: \(t \in s(r)\) for all \(r \in R\). I think.
Note that the hierarchy is nothing more than a collection of identified collections of strings. The parent-childness doesn't seem to matter for the search algorithm. Within the result set, there's some importance to the tier of the hierarchy, \(t(h)\), and a node from tier 1 means all others are ignored or something. Can't be sure. (The endless backstory on the hierarchy was little more than a review of the algorithm to query it.)
If any of this is true, it would be a fairly straightforward map() or filter() that could be parallelized with dask or concurrent.futures.
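If that reading of the problem is right, the whole search is a filter. Here's a sketch under that assumption; the hierarchy data and node names are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: the hierarchy flattened to node -> s(h), a set of strings.
hierarchy = {
    "n1": {"alpha", "beta"},
    "n2": {"gamma"},
    "n3": {"alpha", "gamma"},
}

target = "alpha"

# Sequential version: the result set R is just a filter over H.
R = {node for node, strings in hierarchy.items() if target in strings}

# Parallel version: map the membership test across nodes.
with ThreadPoolExecutor() as pool:
    flags = list(pool.map(lambda item: (item[0], target in item[1]),
                          hierarchy.items()))
R_parallel = {node for node, hit in flags if hit}

assert R == R_parallel
```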
But we can't know if this really is the post-condition until someone in a position to know writes the post-condition.
The post-condition defines the results of test cases. The assert statement becomes part of the pytest test cases. It's a kind of direct copy-and-paste process to shift from design aid to test result condition.
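Here's what that copy-and-paste looks like as a pytest-style test. The maximum() function is an invented stand-in for the real algorithm, used only to show the shape:

```python
def maximum(a: int, b: int) -> int:
    """Invented example algorithm under test."""
    return a if a >= b else b

def test_maximum_post_condition() -> None:
    a, b = 3, 7
    m = maximum(a, b)
    # The post-condition, pasted directly from the design work:
    assert m in {a, b} and m >= a and m >= b, f"Algorithm Failed {m=} {a=} {b=}"
```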
Currently, the algorithm they have seems to have no test cases. They can't write a condition to describe correct answers, which suggests they actually don't know what's correct.
If they wrote test cases, they might be able to visualize an assert statement that confirms the test worked. Might. It appears to be asking a lot to write test cases for the legacy algorithm.
Indeed, if they wrote a conditional expression that described the results of any working example, they'd have taken giant steps toward the necessary assert statement. But that's asking a lot, it appears.
Once you have a target condition, you can then design code to satisfy some (or all) of the target condition. Dijkstra's A Discipline of Programming has a thorough description of the "weakest precondition" operator. It works like this:
You write the program from the desired post-condition moving forward until you get a weakest pre-condition of True. Back to front. From goal to initialization.
Post-condition gives you statements. Statements have pre-conditions. You iterate, writing conditions, statements, and more conditions.
(You can also spot useless code because the pre-condition matches the post-condition.)
For the silly "maximum" problem?
Try M := A as a statement. This only works if A >= B. That's the pre-condition that is derived from substituting M = A into the post-condition.
Try M := B as a statement. This only works if B >= A. That's the pre-condition that is derived from substituting M = B into the post-condition.
These two pre-conditions describe an if-elif statement.
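Assembling those two guarded statements gives the derived code. This is a sketch of where the derivation lands, not anything from the original post:

```python
def maximum(A: int, B: int) -> int:
    if A >= B:      # weakest pre-condition for M := A
        M = A
    elif B >= A:    # weakest pre-condition for M := B
        M = B
    # The post-condition holds on every path out of the if-elif.
    assert M in {A, B} and M >= A and M >= B
    return M
```

Note that the two guards overlap when A == B; either branch satisfies the post-condition in that case.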
Note that this feels weirdly arbitrary and exploratory. It's a kind of empiricism where we try statements and see if they're helpful. There don't need to be any constraints. The post-condition is all that's required to explore the space of statements that might work, or at least might help.
Of course, we're not stupid. And we're lazy. We don't search the infinite space of statements. We can often imagine the statements without a lot of complex work. The formal weakest pre-condition process is necessary to confirm our intuition. Or to assert that something is free of astonishing side-effects.
It all depends on one thing: a clear, formal statement of the post-condition.
Since I made the mistake of describing the post-condition as a line of code, we've hit some kind of brick wall related to "I won't write code." Or "I don't want to be seen writing code." Or "I don't want you to critique my code."
Dunno.
Real Question (somewhat abbreviated): "My algorithm performs badly, do I need asyncio?"
Short answer: No.
Long answer: Sigh. No. Do you need a slap upside the head?
Here's how it plays out:
Q: "We figured that if we 'parallelize' it, then we can apply multiple cores, and it will run 4x as fast."
Me: "What kind of I/O are you doing?"
Q: "None, really. It's compute-intensive."
Me: "Async is for I/O. A function can be computing while other functions are waiting for I/O to complete."
Q: "Right. We can have lots of them, so they each get a core."
Me: "Listen, please. A function can be computing. That's 'A'. Singular. One. Take a step back from the asyncio package. What are you trying to do?"
Q: "Make things faster."
Me: "Take a breath. Make what faster?"
Q: "A slow algorithm."
Me:
Q: "Do you want to know what we're trying do?"
Me:
Q: "First, we query the database to get categories. Then we query the database to get details for the categories. Then we query the database to organize the categories into a hierarchy. Except for certain categories which are special. So we have if-statements to handle the special cases."
Me: "That's I/O intensive."
Q: "That's not the part that's slow."
Me:
Q: "Context is important. I feel the need to describe all of the background."
Me: "That's trivia. It's as important as your mother's maiden name. What's the problem?"
Q: "The problem is we don't know how to use asyncio to use multiple cores."
Me: "Do you know how to divide by zero?"
Q: "No. It's absurd."
Me: "We already talked about asyncio for compute-intensive processing. Same level of absurd as dividing by zero. What are you trying to do?"
Q: "We have some for loops that compute a result slowly. We want to parallelize them."
Me: "Every for statement that computes a collection is a generator expression. Every generator expression can be made into a list, set, or dictionary comprehension. Start there."
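As an aside, the rewrite suggested in that reply looks like this (the loop body is invented for illustration):

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]

# A for statement that computes a collection...
squares = []
for x in data:
    if x % 2 == 0:
        squares.append(x * x)

# ...is a generator expression, which can become a list comprehension.
squares2 = [x * x for x in data if x % 2 == 0]

assert squares == squares2
```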
Q: "But what if the for statement has a super-complex body with lots of conditions?"
Me: "Then you might have to take a step back and redesign the algorithm. What does it do?"
Q: <code> "See all these for statements and if-statements?"
Me: "What does it do? What's the final condition?"
Q: "A set of valid answers."
Me: "Define valid."
Q: "What do you mean? 'Define valid?' It's a set that's valid!"
Me: "Write a condition that defines whether or not a result set is valid. Don't hand-wave, write the condition."
Q: "That's impossible. The algorithm is too complex."
Me: "How do you test this process? How do you create test data? How do you know an answer it produces is correct?"
Q:
Me: "That's the fundamental problem. You need to have a well-defined post-condition. Logic. An assert statement that defines all correct answers. From that you can work backwards into an algorithm. You may not need parallelism; you may simply have a wrong data structure somewhere in <code>."
Q: "Can you point out the wrong data structure?"
Me:
Q: "What? Why won't you? You read the code, you can point out the problems."
Me:
Q: "Do I have to do all the work?"
Me:
When I talk about functional programming in Python, folks like to look for places where functional programming isn't appropriate. They latch onto finite-state automata (FSA) because the "state" of an automaton doesn't seem to fit with the stateless objects used in functional programming.
This is a false dichotomy.
It's emphatically false in Python, where we don't have a purely functional language.
(In a purely functional language, monads can help make FSA's behave properly and avoid optimization. The use of a recursion to consume an iterable and make state transitions is sometimes hard to visualize. We don't have these constraints.)
Let's look at a trivial kind of FSA: the parity computation. We want to know how many 1-bits are in a given value. Step 1 is to expand an integer into bits.
from collections.abc import Iterable

def bits(n: int) -> Iterable[int]:
    if n < 0:
        raise ValueError(f"{n} must be >= 0")
    while n > 0:
        n, bit = divmod(n, 2)
        yield bit
This will transform a number into a sequence of bits. (They're in order from LSB to MSB, which is the reverse order of the bin() function.)
>>> list(bits(42))
[0, 1, 0, 1, 0, 1]
Given a sequence of bits, is there an odd number or an even number? This is the parity question. The parity FSA is often depicted like this:
When the parity is in the even state, a 1-bit transitions to the odd state. When the parity is in the odd state, a 1-bit transitions to the even state.
Clearly, this demands the State design pattern, right?
Here's a detailed OO implementation using the State design pattern.
class Parity:
    def signal(self, bit: int) -> "Parity":
        ...

class EvenParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return OddParity()
        else:
            return self

class OddParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return EvenParity()
        else:
            return self

class ParityCheck:
    def __init__(self):
        self.parity = EvenParity()

    def check(self, message: Iterable[int]) -> None:
        for bit in message:
            self.parity = self.parity.signal(bit)

    @property
    def even_parity(self) -> bool:
        return isinstance(self.parity, EvenParity)
Each of the Parity subclasses implements one of the states of the FSA. The lonely signal() method implements state-specific behavior. In this case, it's a transition to another state. In more complex examples it may involve side-effects like updating a mutable data structure to log progress.
This mapping from state to diagram to class is pretty pleasant. Folks really like to implement each state as a distinct class. It somehow feels really solid.
It's important to note the loneliness of the lonely signal() method. It's all by itself in that big, empty class.
Hint. This could be a function.
It's also important to note that this kind of design is subject to odd, unpleasant design tweaks. Ideally, the transition is *only* done by the lonely signal() method. Nothing stops the unscrupulous programmer from putting state transitions in other methods. Sigh.
We'll look at more complex kinds of state transitions later. In UML state chart diagrams, states may also have entry actions and exit actions, a bit more complex behavior than we're showing in this example.
What's the alternative? Instead of modeling state as an object with methods for behavior, we can model state as a function. The state is a function that transitions to the next state.
def even(bit: int) -> ParityF:
    if bit % 2 == 1:
        return odd
    else:
        return even

def odd(bit: int) -> ParityF:
    if bit % 2 == 1:
        return even
    else:
        return odd

def parity_check(message: Iterable[int], init: ParityF | None = None) -> ParityF:
    parity = init or even
    for bit in message:
        parity = parity(bit)
    return parity

def even_parity(p: ParityF) -> bool:
    return p is even
Each state is modeled by a function.
The parity_check() function examines each bit, and applies the current state function (either even() or odd()) to compute the next state, saving this as the value of the parity variable.
What's the ParityF type? This:
from typing import Protocol

class ParityF(Protocol):
    def __call__(self, bit: int) -> "ParityF":
        ...
This uses a Protocol to define a type with a recursive cycle in it. It would be more fun to use something like ParityF = Callable[[int], "ParityF"], but that's not (yet) supported.
What if we need each state to have more attributes?
Python functions have attributes. Like this: even.some_value = 2; odd.some_value = 1. We can add all the attributes we require.
What about other functions that happen on entry to a state or exit from a state? This is trickier. My preference is to use a class as a namespace that contains a number of related functions.
class Even:
    @staticmethod
    def __call__(bit: int) -> ParityF:
        if bit % 2 == 1:
            odd.enter()
            return odd
        else:
            return even

    @staticmethod
    def enter() -> None:
        print("even")

even = Even()
This seems to work out well, and keeps the state-specific material in a single namespace. It uses static methods to follow the same design principle as the previous example -- these are pure functions, collected into the class only to provide a namespace so we can use odd.enter() or even.enter().
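For completeness, here's a runnable sketch with a matching Odd namespace. The Odd class isn't shown in the original; it's assumed to mirror Even:

```python
class Even:
    @staticmethod
    def __call__(bit: int):
        if bit % 2 == 1:
            odd.enter()   # entry action for the odd state
            return odd
        return even

    @staticmethod
    def enter() -> None:
        print("even")

class Odd:
    @staticmethod
    def __call__(bit: int):
        if bit % 2 == 1:
            even.enter()  # entry action for the even state
            return even
        return odd

    @staticmethod
    def enter() -> None:
        print("odd")

even = Even()
odd = Odd()

# Walk the bits of 42 (LSB first: 0, 1, 0, 1, 0, 1) through the states.
state = even
for bit in [0, 1, 0, 1, 0, 1]:
    state = state(bit)
```

Since 42 has three 1-bits, the walk ends in the odd state.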
The State design pattern isn't required to implement a FSA.
This isn't about details of pySpark. This is about the philosophy of testing when working with a large, complex framework, like pySpark, pandas, numpy, or whatever.
Use data subsets.
Write unit tests for the functions that process the data.
Don't test pyspark itself. Test the code you write.
I've worked with folks -- data scientists specifically -- without a deep background in software engineering.
When we said their model-building applications needed a test case, they supplied the test case they used to validate the model.
Essentially, their test script ran the entire training set. Built the model. Did extensive statistical testing on the resulting decisions made by the model. The test case asserted that the stats were "good." In fact, they recapitulated the entire model review process that had gone on in the data science community to get the model from "someone's idea" to a "central piece of the business."
The test case ran for hours and required a huge server loaded up with GPUs. It cost a fortune to run. And. It tended to timeout the deployment pipeline.
This isn't what we mean by "test." Our mistake.
We had to explain that a unit test demonstrates the code works. That was all. It shouldn't involve the full training set of data and the full training process with all the hyperparameter tuning and hours of compute time. We don't need to revalidate your model. We want to know the code won't crash. We'd like 100% code coverage. But the objective is little more than show it won't crash when we deploy it.
It was difficult to talk them down from full training sets. They couldn't see the value in testing code in isolation. A phrase like "just enough data to prove the thing could plausibly work with real data" seemed to resonate.
A few folks complained that a numpy array with a few rows didn't really show very much. We had to explain (more than once) that we didn't really want to know all the algorithmic and performance nuances. We mostly wanted to know it wouldn't crash when we applied it to production data. We agreed with them the test case didn't show much. We weren't qualified to revalidate the model; we were only qualified to run their training process for them. If they had done enough work to be sure we *could* run it.
(It was a bank. Software deployments have rules. An AI model-building app is still an app. It still goes through the same CI/CD pipeline as demand deposit account software changes. It's a batch job, really, just a bit more internally sophisticated than the thing that clears checks.)
I lean toward the following tiers of testing:

- Unit testing. Use the pytest and pytest-cov packages to track testing and make sure every line of code has some test case. For a few particularly tricky things, covering every logic path is better than simply covering lines of code. In some cases, every line of code will tend to touch every logic path, but line coverage seems less burdensome. Use hypothesis for the more sensitive numeric functions. In "data wrangling" applications there may not be too many of these. In machine learning and model application software, there may be more sophisticated math that benefits from hypothesis testing.
- Integration testing. Exercise the pyspark processing, using multiple functions or classes to be sure they work together correctly, but without the added complication of actually using pySpark. This means creating mocks for some of the libraries using unittest.mock objects. This is a fair bit of work, but it pays handsome dividends when debugging. For well-understood pyspark APIs, it should be easy to provide mocked results for the app components under test to use. For the less well-understood parts, the time spent building a mock will often provide useful insight into how (and why) it works the way it does. In rare cases, building the mock suggests a better design that's easier to test.
- Acceptance testing. Run the application against a real pyspark instance from the command line. For this, I really like using behave, and writing the acceptance test cases using the Gherkin language. This enforces a very formal "Given-When-Then" structure on the test scenarios, and allows you to write in English. You can share the Gherkin with users and other stakeholders to be sure they agree on what the application should do.
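The mocking idea in the integration tier can be sketched like this. The load_active() function, its table name, and its filter condition are all hypothetical, invented only to show the shape of such a test:

```python
from unittest.mock import Mock

# Hypothetical application function: it takes a SparkSession-like object,
# reads a table, and filters it.
def load_active(spark, table: str):
    df = spark.read.table(table)
    return df.filter("active = true")

# Unit/integration test: mock only the pySpark surface the function touches.
spark = Mock()
filtered = Mock(name="filtered_df")
spark.read.table.return_value.filter.return_value = filtered

result = load_active(spark, "customers")

# We test OUR code's interaction with the framework, not the framework itself.
spark.read.table.assert_called_once_with("customers")
spark.read.table.return_value.filter.assert_called_once_with("active = true")
assert result is filtered
```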
Why?
Each tier of testing builds up a larger, and more complete picture of the overall application.
More important, we don't emphasize running pySpark and testing it. It already works. It has its own tests. We need to test the stuff we wrote, not the framework.
We need to test our code in isolation.
We need to test integrated code with mocked pySpark.
Once we're sure our code is likely to work, the next step is confirmation that the important parts do work with pySpark. For life-critical applications, the integration tests will need to touch 100% of the logic paths. For data analytics, extensive integration testing is a lot of cost for relatively little benefit.
Even for data analytics, testing is a lot of work. The alternative is hope and prayer. I suggest starting with small unit tests, and expanding from there.