Tuesday, November 22, 2022

Testing with PySpark

This isn't about details of pySpark. This is about the philosophy of testing when working with a large, complex framework, like pySpark, pandas, numpy, or whatever. 


Use data subsets. 

Write unit tests for the functions that process the data.

Don't test pyspark itself. Test the code you write.

Some History

I've worked with folks -- data scientists specifically -- without a deep background in software engineering.

When we said their model-building applications needed a test case, they supplied the test case they used to validate the model.

Essentially, their test script ran the entire training set. Built the model. Did extensive statistical testing on the resulting decisions made by the model. The test case asserted that the stats were "good." In fact, they recapitulated the entire model review process that had gone on in the data science community to get the model from "someone's idea" to a "central piece of the business." 

The test case ran for hours and required a huge server loaded up with GPUs. It cost a fortune to run. And. It tended to timeout the deployment pipeline.

This isn't what we mean by "test." Our mistake.

We had to explain that a unit test demonstrates the code works. That was all. It shouldn't involve the full training set of data and the full training process with all the hyperparameter tuning and hours of compute time. We don't need to revalidate your model. We want to know the code won't crash. We'd like 100% code coverage. But the objective is little more than show it won't crash when we deploy it.

It was difficult to talk them down from full training sets. They couldn't see the value in testing code in isolation. A phrase like "just enough data to prove the thing could plausibly work with real data" seemed to resonate. 

A few folks complained that a numpy array with a few rows didn't really show very much. We had to explain (more than once) that we didn't really want to know all the algorithmic and performance nuances. We mostly wanted to know it wouldn't crash when we applied it to production data. We agreed with them the test case didn't show much. We weren't qualified to revalidate the model; we were only qualified to run their training process for them. If they had done enough work to be sure we *could* run it.

(It was a bank. Software deployments have rules. An AI model-building app is still an app. It still goes through the same CI/CD pipeline as demand deposit account software changes. It's a batch job, really, just a bit more internally sophisticated than the thing that clears checks.)

Some Structure

I lean toward the following tiers of testing:

  1. Unit tests of every class and function. 100% code coverage here. I suggest using pytest and pytest-cov packages to tracking testing and make sure every line of code has some test case. For a few particularly tricky things, every logic path is better than simply testing lines of code. In some cases, every line of code will tend to touch every logic path, but seems less burdensome.
  2. Use hypothesis for the more sensitive numeric functions. In “data wrangling” applications there may not be too many of these. In the machine learning and model application software, there may be more sophisticated math that benefits from hypothesis testing.
  3. Write larger integration tests that mimic pyspark processing, using multiple functions or classes to be sure they work together correctly, but without the added complication of actually using pySpark. This means creating mocks for some of the libraries using unittest.mock objects. This is a fair bit of work, but it pays handsome dividends when debugging. For well-understood pyspark APIs, it should be easy to provide mocked results for the app components under test to use. For the less well-understood parts, the time spent building a mock will often provide useful insight into how (and why) it works the way it does. In rare cases, building the mock suggests a better design that's easier to test.
  4. Finally. Write a few overall acceptance tests that use your modules and also start and run a small pyspark instance from the command line. For this, I really like using behave, and writing the acceptance testing cases using the Gherkin language. This enforces a very formal “Given-When-Then” structure on the test scenarios, and allows you to write in English. You can share the Gherkin with users and other stakeholders to be sure they agree on what the application should do.


Each tier of testing builds up a larger, and more complete picture of the overall application. 

More important, we don't emphasize running pySpark and testing it. It already works. It has it's own tests. We need to test the stuff we wrote, not the framework.

We need to test our code in isolation.

We need to test integrated code with mocked pySpark.

Once we're sure our code is likely to work, the next step is confirmation that the important parts do work with pySpark. For life-critical applications, the integration tests will need to touch 100% of the logic paths. For data analytics, extensive integration testing is a lot of cost for relatively little benefit.

Even for data analytics, testing is a lot of work. The alternative is hope and prayer. I suggest starting with small unit tests, and expanding from there.

Tuesday, November 15, 2022

Generators as Stacks of Operations

See https://towardsdatascience.com/building-generator-pipelines-in-python-8931535792ff 

I'm delighted by this article. 

I was shown only the first, horrible, example. I think the idea was to push back on the idea of complex generators. I fumed.

Then I read the entire article.

Now I'm fuming at someone who posted the first example -- apparently having failed to read the rest of the post.

This idea of building a stack of iterators is very, very good.

The example (using simple operations) can be misleading. A follow-on example doing something like file parsing might be helpful. But, if you go too far, you wind up writing an entire book about Functional Programming in Python.

Tuesday, November 8, 2022

Fighting Against Over-Engineering

I've been trying to help some folks who have a "search" algorithm that's slow. 

They know it's slow -- that's pretty obvious.

They're -- unfortunately -- sure that asyncio will help. That's not an obvious conclusion. It involves no useful research. Indeed, that's a kind of magical thinking. Which leads me to consider the process of over-engineering.

The Problem

Over-engineering is essentially a technique for burning brain-calories on planning to build something instead of building something.

The distinction is "planning" vs. "doing."

Lots of folks subscribe to Methodology Magic Thinking (MMT™). The core tenet of MMT is that some  methodology is good, and more methodology is better. 

The classic waterfall methodology expects requirements, design, code, test, and what-not, all flowing downhill. A series of waterfalls.

The more modern agile-fall methodology expects requirements, design, code, test, and what-not all being done in tiny MVP slices.

Why is this bad?

Its bad because it falls apart when confronted with really difficult algorithm and data structure problems.

What Breaks?

The thing that breaks is the "learn about the technology" or "learn about the problem domain" things that we need to do. We like to pretend with understand the technology -- in spite of the obvious information that we're rarely experts. We're smart. We're capable. But. We're not experts.

This applies to both the solution technology (i.e., language, persistence, framework, etc.) and the problem domain. 

When we have a process that takes *forever* to run, we've got a bad algorithm/data structure, and we don't know what to do.

We need to explore.


Managers rarely permit exploration.

They have a schedule. The waterfall comes with a schedule. The agilefall sprints have timelines. And these are rarely negotiable. 

What Are Some Wrong Things to Do?

One wrong thing to do is to pick some technology and dig in hard. The asyncio module is not magical pixie dust. It doesn't make arbitrary bad code run faster. This is true in general. Picking a solution technology isn't right. Exploring alternatives -- emphasis on the plural -- is essential.

Another wrong thing to do is demand yet more process. More design docs. More preliminary analysis docs. More preliminary study. More over-engineering.

This is unhelpful. There are too many intellectual vacuums. And nature abhors a vacuum. So random ideas get sucked in. Some expertise in the language/tool/framework is required. Some expertise in the problem domain is required. Avoid assumptions.

What Should We Do?

We have to step back from the technology trap. We're not experts: we need to learn more. Which means exploring more. Which means putting time in the schedule for this.

We have to understand the problem domain better. We're not experts: we need to learn more. Which means putting time in the schedule for this.

We have to step back from the "deliverable code" trap. Each line of code is not a precious gift from some eternal god of code. It's an idea. And since the thing doesn't run well, it's provably a bad idea.

Code needs to be deleted. And rewritten. And rewritten again. And benchmarked.


I like fixing bad code. I like helping people fix bad code.

I can't -- however -- work with folks who can't delete the old bad code.

It's unfortunate when they reach out and then block progress with a number of constraints that amount to "We can't focus on this; we can't make changes rapidly. Indeed, we're unlikely to make any changes."

The only way to learn is to become an expert is something. This takes time. To minimize the time means work with focus and work rapidly.

Instead of working rapidly, they want magical pixie dust that makes things faster. They want me to tell them were the "Turbo Boost" button is hidden.


Tuesday, October 25, 2022

Some Functional Programming in Python material

This is bonus content for the forthcoming Functional Python Programming 3rd edition book. It didn't make it into the book because -- well -- it was just too much of the wrong kind of detail.

See this "Tough TCO" document for some thoughts on Tail-Call Optimization that can be particularly difficult. This isn't terribly original, but I think it's helpful for folks working through more complex problems from a functional perspective.

"Why a PDF?" I've been working with with LaTeX, and the switching to other ways of editing and presenting code seemed like too much work.

Tuesday, August 23, 2022

Books! Books! More Channels!

I started with the Apple Books platform because it's an easy default for me. 

Pivot to Python

A Guide for professionals and skilled beginners


I've recently updated this to fix some cosmetic problems with title pages, the table of contents and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program, they want to pivot to programming in Python. Quickly.

But wait, there's more. 

Unlearning SQL

When your only tool is a hammer, every problem looks like a nail


Many folks know some Python, but struggle with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems effectively.

Google Play

Now, I'm duplicating the books on Google Play. Here's Unlearning SQL:


I've made a clone of Pivot to Python, also.


Both books are (intentionally) short to help experts make rapid progress.

Tuesday, August 16, 2022

Enterprise Python -- Some initial thoughts

In the long run, I think there's a small book here. See 8 reasons Python will rule the enterprise — and 8 reasons it won’t | InfoWorld. The conclusion, "Teams need to migrate slowly into the future, and adopting more Python is a way to do that," seems to be sensible. Some of the cautionary tales along the way, however, don't make as much sense.

TL;DR. There are no reasons to avoid Python. Indeed, the 8 points suggest that Python is perhaps a smart decision. 

I want to focus on the negatives part of this because some of them are wrong. I think there's a "technology hegemony" viewpoint where everything in an enterprise must be exactly the same. This tends to prevent creative solutions to problems and mires an enterprise into fighting problems that are inherent in bad technology choices. Also, I think there's an enterprises are run by idiots subtext.

1. Popularity. Really this is about having polyglot software portfolio. The reasoning appears to be that a polyglot software portfolio is impossible to maintain because (1) no one can learn an old language, and (2) software will never be rewritten from an obsolete language to a modern language. If these are both true, it appears the  organization is full of idiots. The notion that a polyglot tech stack must devolve into chaos seems to ignore the endless chain of management decisions that are required to create chaos. Leaving obsolete tech in place isn't a consequence of the tech, or the tech's lack of compatibility, it's a management decision to enshrine bad ideas, frozen in amber, forever.

2. Scripting Languages. Specifically, the spreadsheet is already the de facto scripting language of choice, and nothing can be done about it. Nothing. No one can learn to use Jupyter Lab to do business analytics. If this is true, it appears that the organization is full of idiots. Python will not replace all spreadsheets. A pandas data frame will replace an opaque macro-filled nightmare with code that can be unit tested. Imagine unit testing a spreadsheet. Consider the possibilities of expanding business analysis work to include a few test cases; not 100% code coverage, but a few test cases to confirm the analytical process was implemented consistently. 

3. Dynamic Languages. Specifically, dynamic languages are useless for reliable software because there's no comprehensive type checking across some interfaces. Which begs the question of why there are software failures in statically typed languages. More importantly, complaining about dynamic languages raises important questions about integration and acceptance testing procedures in an organization in general. All languages require extensive test suites for all developed code. All languages benefit from static analysis. Sometimes the compiler does this, sometimes external tools do the linting. Sometimes folks use both the compiler and linters to check types. If we are sure dynamic languages will break, are we equally sure statically typed languages cannot break? Or, do we take steps to prevent problems? I think we tend to take a lot of steps to make sure software works.

4. Tooling. I can't figure this point out. But somehow C++ or Java have better tools for managing large source code bases. There are no details behind this claim, so I'm left to guess. I would suggest that the "incremental recompilation" problem of large C++ (and Java) code bases is its own nightmare. Folks go to great lengths to architect C++ so that an implementation change does not require recompilation of everything. While this could be seen as "evolved to handle the jobs that enterprise coders need done", I submit that there's a deeper problem here, and stepping away from the compiler is a better solution than complex architectures. See Lakos Large-Scale C++ Software Design for some architectural features that don't solve any enterprise problem, but solve the scaling problem of big C++ applications. This bumps into the micro-services/monolith discussion, and the question of carefully testing each interface. None of which has anything to do with Python specifically.

5. Machine Learning and Data Science. These are fads, apparently. I'm not sure I can respond to this, since it has little to do with Python. Of course, Python has one of the most complete data science toolsets, so perhaps avoiding data science makes it easier to avoid Python.

6. Rapid Growth. The growth of Python is rapid, and there's no promise of endless backwards compatibility. This is a consequence of active development and learning. I think it's better than the endless backwards compatibility that leads to JavaScript's list of WATs. Or the endless confusion between java.util.date and Joda-Time. The idea that no one will ever look at the Enterprise code base for Common Vulnerabilities and Exposures seems to indicate a lack of concern for reliability or security. Since the entire compiled code base has to be checked for vulnerabilities, why not also check the Python code base for ongoing upgrades and changes and enhancements? Is code really written once and never looked at again? If so, it sounds like an organization run by idiots.

7. Python Shipped With Some OS's. There's a long story of woe that stems from relying on the OS-supplied Python. The lesson learned here is Never Rely on the OS Python; Always Install Your Own. This doesn't seem like a reason to avoid Python in the enterprise. It seems like an important lesson learned for all software that's not part of the OS: always install your own. I've been using Miniconda to spin up Python environments and absolutely love it.

8. Open Source Software. Agreed. Nothing to do with Python specifically. Everything to do with tech stack and architecture. The question of using Open Source in the first place doesn't seem difficult. It's a well-established way to reduce start-up costs for software development.

Of the eight points, two seem to be completely generic issues. Yes, Machine Learning is new, and yes, choices must be made. The two questions around scripting and dynamic languages seem specious; all programming requires careful design and testing. The Python shipped with the OS is a non-concern; the lesson learned is clear.

We have have three remaining points:

  1. Polyglot Portfolio (From pursuing popularity.) This is already the case in most Enterprises, and needs to be managed through aggressive retiring of old software. I may have taken decades to build that old app, but it often takes months to rewrite it in a new language. The legacy app provides acceptance test cases; it's often filled with cruft and detritus of old decisions. 
  2. Tooling. Agreed. Tooling is important. Not sure that Java or C++ have a real edge here, but, tooling is important.
  3. Growth and Change. Python's rapid evolution requires active management. An enterprise must adopt a YBYO (You Build it You Own it) attitude so that every level of management is aware of the components they're responsible for. CVE's are checked, Python PEP's are checked. Tools like tox or nox are used to build (and rebuild) virtual environments.
If these seem like a high bar, perhaps there are deeper issues in the enterprise. If adding Yet Another Language is a problem, then it's time to start retiring some languages. If Adding Another Tool is a problem, it's worth examining the existing tool chain to see why it's such a burden. If the idea of change is terrifying, perhaps the ongoing change is not being watched carefully enough.

Tuesday, August 9, 2022

Tragedy Averted

I almost made a terrible blunder.

See https://github.com/slott56/py-web-tool for some background. This is a "Literate Programming" tool. I started fooling around with this kind of thing back in '05 (maybe even earlier.) This is not the blunder. The whole idea of literate programming is not very popular. I'm a fan of Jupyter{Book} as the state of the art in sophisticated literate programming, if you're interested in it.

In my case, I started this project so long ago, I used docutils. This was long before Sphinx arrived on the scene. I never updated my little project to use Sphinx. The point was to have a kind of pure literate programming tool that could work with a variety of markup languages, including (but not limited to) RST.

Recently, I learned about PlantUML. The idea of a text description of a diagram is appealing. I don't really need to draw it; I just need to specify what's in it and let graphviz do the rest. This tool is very, very cool. You can capture ideas quickly. You can refine and expand on ideas until you reach a point where code makes more sense than a picture of code. 

For some things, you can gather data and draw a picture of things *as they are*. This is particularly valuable for cloud-based infrastructure where a few queries leads to PlantUML source that is depicted very nicely.

Which leads to the idea of Literate Programming including UML diagrams. 

Doesn't sound too difficult. I can create an extension to docutils to introduce a UML directive. The resulting RST would look like this:

..  uml::

    left to right direction
    skinparam actorStyle awesome

    actor "Developer" as Dev
    rectangle PyWeb {
        usecase "Tangle Source" as UC_Tangle
        usecase "Weave Document" as UC_Weave
    rectangle IDE {
        usecase "Create WEB" as UC_Create
        usecase "Run Tests" as UC_Test
    Dev --> UC_Tangle
    Dev --> UC_Weave
    Dev --> UC_Create
    Dev --> UC_Test

    UC_Test --> UC_Tangle

This could be handy to have the diagrams as part of the documentation that tangles the working the code. One source for all of it. 

I started down the path of researching docutils extensions. Got pretty far. Far enough that I had an empty repository and everything. I was about ready to start creating spike solutions.


[music cue] *duh duh duuuuuuh*

I found that Sphinx already has an extension for PlantUML. I almost started reading the code to see how it worked.

Then I realized how dumb that was. It already works. Why read the code? Why not install it?

I had a choice to make.

  1. Continue building my own docutils plug-in.
  2. Switch to Sphinx.

Some complications:

  • My Literate Programming tool produces RST that *may* not be compatible with Sphinx.
  • It's yet another dependency in a tool that started out with zero dependencies. I've added pytest and tox. What next? 

What to do?

I have to say that Git is amazing. I can make a branch for the spike. If it works, pull request. If it doesn't work, delete the branch. This continues to be game-changing to me. I'm old. I remember when we had to back up the whole project directory tree before making this kind of change.

It worked. My tool's RST (with one exception) worked perfectly with Sphinx. The one exception was an obscure directive, .. class:: name, used to provide an HTML class name for the following block. This always should have been the docutils .. container:: name directive. With this fix, we're good to go.

I'm happy I avoided the trap of reimplementing something. Instead of that, I upgraded from "bare" docutils with my own CSS to Sphinx with it's sophisticated templates and HTML Themes.