Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy site and will likely be dropped five years after the last post, in Jan 2023.

Tuesday, December 28, 2021

The Old Old Days -- ca. 1968

As the olds do, I reminisce sometimes. Not often. Let me rewind the memory tapes back to 1967 or '68.

(What a dumb metaphor for folks who have never used serial storage.)

This isn't -- directly -- about Python. But it may help folks who live at the edge of programming find a project that motivates them to learn to code.

This was my first exposure to an actual computer. I think it was an old IBM 650 that our high-school had. I wasn't in high school yet, but there was an open house, and they fired this thing up.

What was the demo program?

The thing was running a version of blackjack. 

It would clatter and type something like 

 J S
 9 C  7 D
>

And wait for your input. The first line was the dealer's card and your two cards. You could then enter your plan -- hit, stand, split, or double. Subsequent lines would play out your hand.

And yes, I recall much of the interaction to this day. (Not all the details. It was a long time ago.)

This shaped my understanding of "computing as simulation." 

It also caused me to become interested in random numbers and the idea of generating pseudo-random numbers with digital computers. I'll be posting some more thoughts on random numbers and -- I promise -- there will be Python code.

Tuesday, November 23, 2021

Processing Apple Numbers Files

Apple's freebie tools -- Pages, Numbers, Keynote, Garage Band, etc. -- are wonderful things. I really like Numbers. I'm tolerant of Pages. I've used Pages to write books and publish them to the Apple Bookstore. (Shameless Plug: Pivot to Python.)

These tools have a significant problem. Protobuf.

Some History

Once upon a time, Numbers used an XML-based format. This was back in '09, I think. At some point, version 10 of Numbers (2013?) switched to Protobuf.

I had already unwound XLSX and ODS files, which are XML. I had also unwound Numbers '09 in XML. I had a sense of what a spreadsheet needed to look like.

The switch to Protobuf also meant using Snappy compression. Back in 2014 (I think), I worked out my own version of the Snappy decompression algorithm in pure Python. I think I knew about python-snappy but didn't want the complex binary dependency. I wrote my own instead.

I found the iWorkFileFormat project. From this, and a lot of prior knowledge about the XML formats, I worked out a way to unpack the protobuf bytes into Python objects. I didn't leverage the formal protobuf definitions; instead I lazily mapped the objects to a dictionary of keys and bytes. If a field had a complex internal structure, I parsed the subset of bytes.
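To give a sense of that lazy mapping, here's a minimal sketch (assumed names; not the actual Stingray Reader code) that walks protobuf's wire format and maps field numbers to raw payload bytes:

def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, new position)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def lazy_fields(message: bytes) -> dict[int, list[bytes]]:
    """Map field numbers to raw payload bytes, deferring deeper parsing."""
    fields: dict[int, list[bytes]] = {}
    pos = 0
    while pos < len(message):
        key, pos = read_varint(message, pos)
        field, wire_type = key >> 3, key & 0x7
        if wire_type == 0:  # varint
            value, pos = read_varint(message, pos)
            payload = value.to_bytes((value.bit_length() + 7) // 8 or 1, "little")
        elif wire_type == 1:  # fixed 64-bit
            payload, pos = message[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: strings, bytes, sub-messages
            length, pos = read_varint(message, pos)
            payload, pos = message[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            payload, pos = message[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.setdefault(field, []).append(payload)
    return fields

A field with a complex internal structure shows up as wire type 2; its payload bytes can be fed back through lazy_fields() on demand.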

(I vaguely recall the Protobuf definitions are in XCode somewhere. But. I didn't want to write a protobuf compiler to make a pure-Python implementation. See the protobuf project for what I was looking for, but didn't have at the time.)

Which brings us to today's discovery.

State of the Art

Someone has taken the steps necessary to properly unpack Numbers files. See numbers-parser. This has first-class snappy and protobuf processing. It installs cleanly. It has an issue, and I may try to work on it.

I'm rewriting my own Stingray Reader with intent to dispose of my own XLSX, ODS, and Numbers processing. These can (and should) be imported separately. It's a huge simplification to stand on the shoulders of giants and write a dumb Facade over their work.

Ideally, all the various spreadsheet parsing folks would adopt some kind of standard API. This could be analogous to the database API used by SQL processing in Python. The folks with https://www.excelpython.org or http://www.python-excel.org might be a place to start, since they list a number of packages.

The bonus part? Seeing my name in the Credits for numbers-parser. That was delightful.

At some point, I need to make a coherent pitch for a common API which permits external JSON Schema as part of extracting data from spreadsheets.

First. I need to get Stingray Reader into a more final form.

Tuesday, November 16, 2021

Reading Spreadsheets with Stingray Reader and Type Hinting

See Spreadsheets, COBOL, and schema-driven file processing, etc. for some history on this project.

Also, see Stingray-Reader for the current state of this effort.

(This started almost 20 years ago; I've been refining and revising it a lot.)

Big Lesson Up Front

Python is very purely driven by the idea that everything you write is generic with respect to type. Adding type hints narrows the type domain, removing the concept of "generic".

Generally, this is good.

But not universally.

Duck Typing -- and Python's generic approach to types -- is made visible via Protocols and Generics.

An Ugly Type Hinting Problem

One type hint complication arises when writing code that really is generic. Decorators are a canonical example of functions that are generic with respect to the function being decorated. This, then, leads to some complicated-looking type hints.

See the mypy page on declaring decorators. The use of a TypeVar to show how a decorator's argument type matches the return type is a big help. Not all decorators follow the simple pattern, but many do.

from typing import Any, Callable, TypeVar

F = TypeVar('F', bound=Callable[..., Any])

def myDecorator(function: F) -> F:
    ...  # build and return a wrapper with the same signature
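
As a concrete (illustrative) sketch, here's a logging decorator that fits this pattern; the logged name is just for the example:

import functools
from typing import Any, Callable, TypeVar, cast

F = TypeVar('F', bound=Callable[..., Any])

def logged(function: F) -> F:
    @functools.wraps(function)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        print(f"calling {function.__name__}")
        return function(*args, **kwargs)
    return cast(F, wrapper)

The decorated function keeps its original signature as far as mypy is concerned, which is the whole point of the TypeVar.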

The Stingray Reader problem is that a number of abstractions are generic with respect to an underlying instance object.

If we're working with CSV files, the instance is a tuple[str, ...].

If we're working with ND JSON objects, the instance is some JSON type.

If we're working with some Workbook (e.g., via xlrd, openpyxl, or pyexcel) then, the instance is defined by one of these external libraries.

If we're working with COBOL files, then the instances may be str or may be bytes. The typing.AnyStr type provides a useful generic definition.
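
A tiny illustration of the idea (a sketch, not Stingray Reader's actual API): one AnyStr-generic function serves both the text and bytes cases.

from typing import AnyStr

def field(row: AnyStr, start: int, size: int) -> AnyStr:
    """Extract a fixed-width field: str in, str out; bytes in, bytes out."""
    return row[start:start + size]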

Traditional OO Design Is The Problem

Once upon a time, in the dark days, we had exactly one design choice: inheritance. 

Actually, we had two, but so many writers get focused on "explaining" OO programming that they tend to gloss over composition. They focus on the sort-of novel concept of inheritance.

And this leads to folks arguing that inheritance shouldn't be thought of as central. Which is a kind of moot argument over what we're doing when we're writing about OO design. We have to cover both. Inheritance has more drama, so it becomes a bit more visible than composition. Indeed, inheritance creates a number of design constraints, and that's where the drama surfaces.

Any discussion of design patterns tends to be more balanced. Many patterns -- like Strategy and State -- are compositional patterns. Inheritance actually plays a relatively minor role until you reach interesting boundary cases.

Specifically.

What do you do when you have a Strategy class hierarchy and ONE of those strategies has a unique type hint for a parameter? Most of the classes use one type. One unique subclass needs a distinct type. For example, this outlier among the Strategy alternatives uses a str parameter instead of float.

Do you push that type distinction up to the top of the hierarchy? Maybe define it as edge_case: Optional[Union[str, float]] = None?

You can't simply change the parameter's value in one subclass with impunity. mypy will catch you at this, and tell you you have Liskov Substitution problems.

Traditionally, we would often take this to mean that we have a larger problem here. We have a leaky abstraction. Some implementation details are surfacing in a bad way and we need more abstract classes.

It's A Protocol ("Duck Typing")

When I started rewriting Stingray Reader, I started with a fair number of abstract classes. These classes were supposed to have widely varying implementations, but common semantics. 

Applying a schema definition to a CSV file means that data values can be converted from strings to something more useful.

Applying a schema to a JSON file means doing a validation pass to be sure the loaded object meets the schema's expectations.

Applying a schema to a Workbook file is a kind of hybrid between CSV processing and JSON processing. The workbook's values will have been unpacked by the interface module. Each row will look like a list[Any] that can be subject to JSON schema validation. 

Applying a schema to COBOL means using the schema details to figure out how to unpack the bytes. This is suddenly a lot more complex than the other cases.

The concepts of inheritance and composition aren't really applicable. 

This is something even more open-ended. It's a protocol. 

We want a common interface and common semantics. But. We're not really going to leverage any common code. 

This unwinds a lot of abstract superclasses, replacing them with Protocol class definitions.

import abc
from typing import Iterator, Union

# Schema and Sheet are Stingray Reader's own types.
class Workbook(abc.ABC):
    @abc.abstractmethod
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Union[str, bytes, int, float]]]:
        ...

The above is not universally useful. Liskov Substitution has to apply. In some cases, we don't have a tidy set of relationships. Here's the alternative:

from typing import Any, Iterator, Protocol

class Workbook(Protocol):
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Any]]:
        ...

This gives us the ability to define classes that adhere to the Workbook Protocol but don't have a simple, strict subclass-superclass-Liskov substitution relationship.

It's A Generic Protocol

It turns out, this isn't quite right. What's really required is a Generic[Type], not the simple Protocol.

from typing import Generic, Iterator, TypeVar

Instance = TypeVar("Instance")

class Workbook(Generic[Instance]):
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Instance]]:
        ...

This lets us create Workbook variants that are highly type-specific, but not narrowly constrained by inheritance rules.

This type hinting technique describes Python code that really is generic with respect to implementation type details. It allows a single Facade to contain a number of implementations.
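
For example, concrete variants can bind the Instance type without sharing any implementation. (A sketch; Schema and Sheet stand in for the real Stingray Reader types.)

class CSVWorkbook(Workbook[str]):
    """Rows are lists of strings, straight from csv.reader."""
    ...

class COBOLWorkbook(Workbook[bytes]):
    """Rows are lists of raw byte fields, unpacked via the schema."""
    ...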

Tuesday, November 2, 2021

Welcome to Python: Some hints for ways to explain how truly bad the language is

As an author with many books on Python, I'm captivated by people's hot takes on why Python is so epically bad. Really Bad. Uselessly Bad. Profoundly Broken. etc.

I'll provide some hints on topics that get repeated a lot. If you really need to write a blog post about how bad Python is, please try to take a unique approach on any of these common complaints.  If you have a blog post half-written, skip to the tl;dr section to see if your ideas are truly unique.

Whitespace

Please don't waste time complaining about having to use whitespace in your code. I'm sure it's a burden on your soul to configure your editor to indent in groups of four spaces. I'm sorry it's so painful. But. Python isn't the only language with whitespace.

The shell scripting language has semantic whitespace. (It's not used for indentation, but try cat$HOME/.bashrc, without any spaces, and tell me what happens.) Spaces matter in a lot of languages.

Even in C, some whitespace is semantic. The rest of the whitespace is for humans to read your code.

If you're *sure* that indentation is a fatal problem, please provide an example in the language of your choice where the {}'s or the case/esac was *required* because ordinary, readable indentation didn't -- somehow -- express the nesting.

The example can be the basis for a Python Enhancement Proposal (PEP) to fix the whitespace problem you've identified.

The self Instance Variable

Using self everywhere is simpler than using this in those obscure special cases where it's ambiguous. Python developers are sure that being uniformly explicit is a terrible burden on your soul. If you really feel that obscure special cases are required, consider writing a pre-processor to sort out the special cases for us.

I'm sure there's a way to inject another level of name resolution into the local v. global choices. Maybe local-self-global or self-local-global could be introduced. 

Please include examples. From this a Python Enhancement Proposal can be drafted to clarify what the improvement is.

No Formal Constants

Python doesn't waste too much time on keywords, like const, to alter the behavior of assignment. Instead, we tend to rely on tools to check our code.

Other languages have compilers to look for assignment to consts. Python has tools like flake8, pyflakes, pylint, and others, to look for this kind of thing. Conventionally, variables at the module level with ALL_CAPS names are likely to be constants. Multiple assignment statements would be a problem. Got it.

"Why can't the language check?" you ask. Python doesn't normally have a separate compile pass to pre-check the code. But. As I said above, you can use tools to create a pre-checking pass. That's what most of us do.

"But what if someone accidentally overwrites a constant?" you insist. Many folks would suggest some better documentation to explain the consequences an clarify how unit tests will fail when this happens. 

"Why should I write unit tests to be sure a constant wasn't changed?" you demand. I'm not really insisting on it. But you said you had developers who would "accidentally" overwrite a constant in an assignment statement, and you couldn't use tools like pylint to check for it. So. I suggested another choice. If you don't like that, use enums. Or write documentation and explain which global items can be changed and which can't be changed.

BTW. If you have global variables that are NOT constants, consider this a code smell. 

If you really need a mixture of constants and variables as module globals, you can use the enum module to create named attribute values of a class definition. You get constants and a namespace. It's pretty sweet.
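
A minimal sketch (hypothetical names):

from enum import Enum

class Limits(Enum):
    MAX_RETRIES = 3
    TIMEOUT_SECONDS = 2.5

Limits.MAX_RETRIES.value       # 3 -- a constant, inside a namespace
# Limits.MAX_RETRIES = 99     # AttributeError: cannot reassign member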

Lack of Privacy

It appears to be an article of faith that a private keyword is unconditionally required.

Looking at the history of OO languages, private seems to have been introduced with C++. Not every OO language has the same notion of private that C++ has. CLU has no concept of private. Smalltalk considers instance variables equivalent to C++ protected, not private. Eiffel has a particularly sophisticated feature export mechanism that doesn't involve a trivial private/public distinction.

Since many languages that aren't C++ or Java have a variety of approaches, it appears private isn't required. The next question, then, is it necessary?

It really helps to have a concrete example of a place where a private method or attribute was absolutely essential. And it helps to do this in a way that a leading _ on the variable name -- every time it's used -- is more confusing than a keyword like private somewhere else in the code.

It also helps when the example does not involve a hypothetical Idiot Developer who (a) doesn't read the documentation and (b) doesn't understand the _leading_underscore variable, and can still manage to use the class. It's not that this developer doesn't exist; it's questionable whether or not a complex language feature is better than a little time spent on a code review. 

It helps when the example does not include the mysterious Evil Genius Developer who (a) reads the documentation, and (b) leverages the _leading_underscore variable to format one of the OS disks or something. This is far-fetched because the Evil Genius Developer had access to the Python source, and didn't need a sophisticated subclassing subterfuge. They could simply edit the code to remove the magical privacy features.

No Declarations

Python is not the only language where variables don't have type declarations. In some languages, there are implied types associated with certain kinds of names. In other languages, there are naming conventions to help a reader understand what's going on.

It's an article of faith that variable declarations are essential. C programmers will insist that a void * pointer is still helpful even though the thing to which it points is left specifically undeclared. 

C (and C++) let you cast a pointer to -- well -- anything. With resulting spectacular run-time crashes. Java has some limitations on casting. Python doesn't have casting. An object is a member of a class and that's the end of that. There's no wiggle-room to push it up or down the class hierarchy.

Since Python isn't the only language without variable declarations, it raises the question: are they necessary?

It really helps to have a concrete example of a place where a variable declaration was absolutely essential for preventing some kind of behavior that could not be prevented with a pylint check or a unit test. While I think it's impossible to find a situation that's untestable and can only be detected by careful scrutiny of the source, I welcome the counter-example that proves me wrong.

And. Please avoid this example.

for data in some_list_of_int:
    if data == 42:
        print("data is int")
for data in some_list_of_str:
    if data == "bletch":
        print("data is str")

This requires reusing a variable name. Not really a good look for code. If you have an example where there's a problem that's not fixed by better variable names, I'm looking forward to it.

This will change the world of Python type annotations. It will become an epic PEP.

Murky Call-By-Value Semantics

Python doesn't have primitive types. There are no call-by-value semantics. It's not that the semantics are confusing: they don't exist. Everything is a reference. It seems simpler to avoid the special case of a few classes of objects that don't have classes.

The complex special cases surrounding unique semantics for bytes or ints or strings or something requires an example. Since this likely involves a lot of hand-waving about performance (e.g., primitive types are faster for certain things) then benchmarking is also required. Sorry to make you do all that work, but the layer of complexity requires some justification.

No Compiler (or All Errors are Runtime Errors)

This isn't completely true. Even without a "compiler" there are a lot of ways to check for errors prior to runtime. Tools like flake8, pyflakes, pylint, and mypy can check code for a number of common problems. Unit tests are another common way to look for problems. 

Code that passes a unit test suite and crashes at runtime doesn't seem to be a language problem. It seems to be a unit testing problem.

"I prefer the compiler/IDE/something else find my errors," you say. Think of pylint as the compiler. Many Python IDE's actually do some static analysis. If you think unit tests aren't appropriate for finding and preventing problems, perhaps programming isn't your calling.

tl;dr

You may have some unique insight. If you do, please share that.

If, on the other hand, you're writing about these topics, please realize that Python has been around for over 30 years. These topics are not new. For the following, please try to provide something unique:

  • Whitespace
  • The self Instance Variable
  • No Formal Constants 
  • Lack of Privacy
  • No Declarations
  • Murky Call-By-Value Semantics
  • No Compiler (or All Errors Are Run-Time Errors)

It helps to provide a distinctive spin on these problems. It helps even more when you provide a concrete example. It really helps to write up a Python Enhancement Proposal. Otherwise, we can seem dismissive of Yet Another Repetitive Rant On Whitespace (YARROW).

Tuesday, October 26, 2021

Python is a Bad Programming Language. Wait, wut?

It may help to read Python is a Bad Programming Language, but it's not very useful. 

I shouldn't be tempted by click-bait headlines. But.  I am drawn in by bad articles on Python.

In particular, any post claiming Python is deficient causes me to look for the concrete PEP's that fix the problems.

Interestingly, there never seem to be any PEP's in any article that bashes Python. This post is yet another example of complaining without offering any practical solutions. 

BLUF

The article has a complaining tone, but I can't figure out some of the complaints. It lifts up a confusing collection of features from other languages as if these features are somehow universally desirable. No justification is provided. The author seems to rely exclusively on Stack Overflow answers for information about Python. There are no PEP's proposed to fix Python. There aren't even any examples.

Point-by-Point

I will try to address each point. It's difficult, because some of the points are hard to discern. There's a lot of "Who thought that was a good idea?" which isn't really a specific point that can be refuted. It's a kind of rhetorical flourish that seems to work best with folks that already agree.

Let's start.

A Fragmented Language

This is the result of profound confusion. It's hard to find anyone recommending Python 2 anywhere. The supplied link is 9 years old, making this comment extremely misleading.  (I'm being charitable. A nine-year old link on Stack Overflow requires some curation. This is not a Python problem.)

Ugly Object-Orientation

The inconsistent use of this in C++ and Java is lifted up as somehow good. The consistent use of the self instance variable in Python is somehow less good; perhaps because it's consistent.

"See how I have to both declare and initialize them in the constructor? Another example of Python stupidity." Um. No, I don't actually see you declare them anywhere. I guess you're unaware of what declare means in languages like C++ and why declare isn't a thing in Python.

Somehow using the private keyword is better than __ name mangling. I'm unclear on why it's better; it's simply stated in a way that makes it sound like a long keyword used once is better because it's better. No additional reason or justification is offered. The idea of using __ to emphasize the privacy is somehow inconceivable.
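
For the record, a short sketch of what the __ convention actually does (hypothetical class):

class Account:
    def __init__(self) -> None:
        self._balance = 0    # single underscore: a documented convention
        self.__audit = []    # double underscore: mangled to _Account__audit

a = Account()
a._balance           # accessible; linters and code review flag misuse
a._Account__audit    # mangling is a collision guard for subclasses, not a vault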

The private and protected keywords are in C++, C#, and Java to optimize recompilation. To an extent, this also permits distribution of libraries in the form of "headers" and obfuscated binaries. None of this makes sense in a Python context. A single example of how the private keyword would be helpful in Python is missing from the original post. There are huge complications of the protected keyword, also; these make the keywords more trouble than they are worth, and any example needs to cover these issues, too.

"In general, when you point out any flaws in their language, Python developers will act hostile and condescending." Sorry, this complaint in the original post sounds hostile and condescending. I'll try to ignore the tone and stick to what content I can find.

Whitespace

"...how is using whitespace any better than curly braces?" has an answer. But. Somehow it can't be chased down and included in the original post. Whitespace (like name mangling) is described as wrong because it's wrong, with no further justification provided.

An example where braces seem to be essential for sorting out syntax would be nice. The entire Python community is waiting for any example where braces were necessary and the indentation wasn't already clear.  

"And only in Python will the difference between tabs and spaces cause the interpreter to have a heart attack." Um. A syntax error is a heart attack? I wish I was able to type code without syntax errors. I am humbled thinking about the idea of seeing syntax errors so rarely. I have my editor set up to use spaces instead of tabs, and haven't had a problem in 20 years of using Python. 

Dynamic Typing

The opening quote, "Dynamic typing is bad," is stated as if it's axiomatic. The rest of the paragraph seems like vitriol rather than justification. "Some Python programmers have realized that dynamic typing is bad" requires some justification; a link to some documentation to support the claim would be helpful. An example would be good.

I can only assume that code like this is important and needs to be flagged by the compiler or something.

for data in some_list:
    if data == 42:
        print("data is int")
for data in some_other_list:
    if data == "wait":
        print("see the type of data changed.")
        

This seems like poor programming to begin with. Expecting the compiler to reject this seems weak. It seems better to not reuse variable names in the first place.

Constants

Not sure what the point is here. There's no justification for demanding the inconsistent behavior of a one-time-only assignment statement. There's no reference to how folks can use enums to define constant-like names and values.

The concluding paragraph "The Emperor Has No Clothes" is some kind of summary. It says "Python will only grow in popularity as more and more software is written in it," which does seem to be true. I think that might be the single most useful sentence.

What Have We Learned?

First, reading a few Stack Overflow posts can be misleading. Python now is not Python from nine years ago.

  1. Everyone says to use Python3. Really. If you have found a Python2 tutorial, stop now. Don't follow it. 
  2. The consistent use of the self variable seems simpler than trying to understand the rules for the this variable.
  3. Variables aren't declared, they're assigned values. It's as simple as it can be and avoids the clutter of variable declarations.
  4. We can read the source; the complexities of private (or protected) instance variables don't really help.
  5. Python's use of whitespace is very simple; most people can indent their code correctly. Anyone who's tried to debug C++ code that's correctly indented but missing a (nearly invisible) } will agree that the indentation is easier to get right.
  6. AFAICT, the reason dynamic typing might be bad is when a function or class reuses the same variable name for multiple different types of data. It seems wrong to reuse a variable name for multiple types. A small effort at inspecting the code can prevent this.
  7. Constants are easily implemented via enum. But. They appear to be useless in a dynamic language where the source is trivially available to be changed. I'm not sure why they seem important to people. And this article provides no help there.

Bottom line: Without concrete PEPs to fix things, or examples of what better might look like, this is click-bait whining. 

Starting from C# or Java to locate deficiencies is just as wrong as starting from Dartmouth Basic or FORTH as the standard against which Python is measured.

Tuesday, October 19, 2021

Code so bad it causes me physical pain

Here's the code.

def get_categories(file):
    """
    Get categories.
    """
    verify_file(file)

    categories = set()

    with open(file, "r") as cat_file:
        while line := cat_file.readline().rstrip():
            categories.add(line)

    return categories

To me this was terrible. Truly and deeply horrifying. Let me count the ways.

  1. The docstring repeats the name of the function providing no additional information. 
  2. The verify_file() function checks are pure, useless LBYL code. It seemed designed to map a lot of detailed exceptions to a RuntimeError. Which is misleading.
  3. The use of while and readline() to iterate through the lines of a file is -- I guess -- reasonable if we're working in Pascal or Modula-2. But we're not. Use of the walrus operator isn't really getting any bonus points because -- well -- this is terrible.
  4. While pathlib is used elsewhere in this module, it's not used here. This function works with a filename string, assigned to the file parameter.

Actually, taking a step back, it's not that the author is being malicious. They just missed all the features of files and sets. And -- somehow -- were able to learn about the walrus operator while never figuring out how files work.

This is something like:

from pathlib import Path

source = Path("some_file.txt")
with source.open() as source_file:
    categories = {line.rstrip() for line in source_file}

And that's it. 

It Gets Worse

This was part of some category mapping application. 

They've got a CSV file with some string values. And they want to map those string values to summary category values. 

Most folks think of a dictionary for a mapping from one string to another string.

The code I was sent -- I kid you not -- used a list of two-tuples. I'll repeat that for those who are skimming. It used A LIST OF TWO-TUPLES INSTEAD OF A DICTIONARY. It used a colossally bad search through an unsorted list of tuples to find matches. (The only search that would have been worse was random probes instead of iteration.)

It really did.

I can't even show you that code, it's such a horrifyingly bad design.

They had a question. Was the looping over a list of two-tuples ineffective? That's why they asked for help. 

It was like they had never heard of a dictionary. Nor seen a tutorial with a dictionary. Nor read a book that mentioned dictionaries. They had managed to learn enough Python to see the walrus operator without hearing of dictionaries.

A list of two-tuples, when provided to the dict() function, will make a dictionary. They were ignorant of this.

A dictionary does O(1) lookups, avoiding the loop over a list of two-tuples. This was a mystery to them.
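
To be concrete, the one-line fix (with hypothetical data):

pairs = [("espresso", "coffee"), ("oolong", "tea")]
mapping = dict(pairs)      # one call turns the list of two-tuples into a dict
mapping["espresso"]        # 'coffee' -- an O(1) hash lookup, no scanning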

When someone doesn't know the Python dictionary exists, what is the appropriate response?

How can you politely say "Find another tutorial and do the ENTIRE thing all of it!"

That's Not All

There's this nugget of "You can't be serious."

category_counts = {element: 0 for element in categories}

And

category_counts[category] += 1

Yes. They used a dictionary to count instances of the categories. They did not understand collections.defaultdict or collections.Counter. But they understood a dictionary well enough to use it here. But not use it elsewhere for the central functionality of the app.
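
For reference, a minimal sketch with collections.Counter (raw_categories is a hypothetical iterable of the category values seen in the data):

from collections import Counter

category_counts = Counter()
for category in raw_categories:
    category_counts[category] += 1   # no {element: 0, ...} seeding required

# Or, more simply:
category_counts = Counter(raw_categories)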

So. They couldn't use a dictionary, but could use a dictionary.

They couldn't use the csv module, so they wrote their own (bad) CSV parser. 

It's almost impossible to write a polite code review.

Tuesday, October 12, 2021

Legacy Software is a Sticky Mess

I'll get to legacy software. First, however, some backstory on observability.

Sailors will sometimes create "Float Plans". Like aircraft flight plans, they have an itinerary to make it slightly easier to find us when something goes wrong. Unlike airspace, which is tightly controlled by the FAA, the seas are more-or-less chaos.

The practice, then, is to create a float plan, give it to a trusted shore-side party, go out sailing, check in periodically, and cancel the whole thing when you're done sailing. If you miss a check-in, they can call an appropriate Search-And-Rescue agency like the US Coast Guard or BASRA or local cops with jurisdiction over a lake or river.

How much detail should be in this plan? For a long or complex trip, it doesn't seem sensible to say "Going to the Bahamas" as your float plan. That's a little thin on details. The bare minimum is to provide an Estimated Time of Arrival (ETA). But. When you summarize 36 hours of sailing to a single ETA, you invite observability problems. It's a sailboat, and you could be becalmed. Things are fine, you're just going to be late. 

Late, of course, is relative. Simply being late means you missed your ETA. If you're becalmed to the point where you're running low on supplies, then this can become a bit of a problem.

The general policy followed by SAR is to allow several hours past the ETA before activating SAR resources. (The US Coast Guard announces overdue mariners on the VHF radio so others can keep a lookout for them and render assistance.)

If you have a one-check-in plan that summarizes 36 hours of sailing with a single ETA, you're going to be waiting for many hours after the ETA for help. So. Total systems failure after the first hour means 35 hours of drifting before someone will even alert SAR folks. And then the SAR folks will wait several hours after the ETA in case you're only slow.

What seems better is to have a sequence of waypoints with ETA's at each waypoint. That way you have incremental evidence of success or failure, and you're not waiting a LOOOONG time for your one-and-only ETA to pass without a check-in.

This leads us to software. And legacy software.

Creating the Plan

To create a sensible plan, you have waypoints as Latitude, Longitude pairs. These are angles on a sphere, not distances on a plane, so computing the length of a leg isn't a simple hypotenuse. 

It is a lot like a hypotenuse. For short distances, we can assume the earth is more-or-less flat. We can then use a relatively simple conversion (cosine of the latitude) to compress the longitudes toward the poles. We can convert lat and lon to distances and use a hypotenuse and get a really close answer.

import math

# LatLon, Angle, and the NM/KM/MI unit constants come from the navtools package.
def range_bearing(p1: LatLon, p2: LatLon, R: float = NM) -> tuple[float, Angle]:
    """Rhumb-line course from :py:data:`p1` to :py:data:`p2`.

    See :ref:`calc.range_bearing`.
    This is the equirectangular approximation.
    Without even the minimal corrections for non-spherical Earth.

    :param p1: a :py:class:`LatLon` starting point
    :param p2: a :py:class:`LatLon` ending point
    :param R: radius of the earth in appropriate units;
        default is nautical miles.
        Values include :py:data:`KM` for kilometers,
        :py:data:`MI` for statute miles and :py:data:`NM` for nautical miles.
    :returns: 2-tuple of range and bearing from p1 to p2.

    """
    d_NS = R * (p2.lat.radians - p1.lat.radians)
    d_EW = (
        R
        * math.cos((p2.lat.radians + p1.lat.radians) / 2)
        * (p2.lon.radians - p1.lon.radians)
    )
    d = math.hypot(d_NS, d_EW)
    tc = math.atan2(d_EW, d_NS) % (2 * math.pi)
    theta = Angle(tc)
    return d, theta

This means we can't trivially write down a list of waypoints. We need to do some fancy math to compute distances.

For years and years (since our first "big" trip in 2007), I've used spreadsheets in various forms to work out the waypoints, distances, estimated time enroute (ETE), and ETA.

The math isn't too far beyond what a spreadsheet can do. But. There's a complication.

Complications

File formats are a complication.

There are KML files, GPX files, and CSV files that are used by various pieces of software. This is only the tip of the iceberg, because some Navionics devices have an even more interesting USR file that contains everything in your chartplotter. It's cool. But complicated.

The file formats are -- clearly -- way outside the box for a spreadsheet.

Python to the rescue.

Since I'm a Python hack (and have been since well before 2007) I've got all kinds of file conversion tools. See https://github.com/slott56/navtools

But.

And here's where legacy enters the picture. (Music Cue.)

Fear that rattles in men's ears
And rears its hideous head
Dread ... Death ... in the wind ...

Spreadsheets.

Up until yesterday, the final planning tool was a spreadsheet with waypoints and times. Mac OS X Numbers is GREAT for this. I can pile in boat information, crew information, safety information, the itinerary, and SAR contact details in one spreadsheet, save it as a PDF, and email it to my shore-side contacts.

The BEST part of this was tinkering with the departure time while we waited for weather. We could plug in the day we're leaving, get revised ETA's for the waypoints, push the document, and take off. 

(We use an old Spot Navigator to provide notifications at midnight to show progress. We're going to upgrade to a SpotX so we can send messages a little more flexibly.)

The Legacy Spreadsheet

The legacy spreadsheet has a lot of good UX features. It's really adequate for some user stories. Save as PDF rocks.

However.

For the more advanced route planning, it isn't ideal. Specifically, spreadsheets can be weak on multiple "what-if" scenarios. 

The genesis of spreadsheets (I'm old, I was there, I remember VisiCalc) was "what-if" analysis. Change an assumption and follow the consequences through the lattice of dependent cells. These are hard to save. You can "Save As" to make a copy of the spreadsheet. You can save pages within a single spreadsheet. These are terrible because you can't really make a more fundamental change very easily. You have to make the same change to all the copies in your pile of "what-if" alternatives. 

To be very specific. I often need to plan for different boat speeds. We have a sailboat; wind and water matter a lot. Slow is about 5 knots. Fast is about 6 knots. Our theoretical top speed is 8 knots, but we've rarely seen that without a river flowing along with us. Sailing at that speed means a lot of sail wrestling, something we'd rather not do.

Fine. That's 3 scenarios, one for each speed: 5, 5.5, and 6. No big deal.

Until we add a waypoint. Or move a waypoint. Now we have to reset all three spreadsheets with a different itinerary. Since it's a different number of rows, we have the usual copy-and-paste problems in spreadsheets. 

What's Better?

Jupyter notebooks crush the life out of spreadsheets.

Here's the revised workflow.

  1. Create the route. Use tools like OpenCPN so the route can be exported as a GPX or CSV file.
  2. Use a notebook to parse the route file, creating an internal Route object. (See the parsing sketch after this list.)
  3. Manipulate the Route object, providing different ETA's and speed assumptions. These assumptions lead to multiple cells in the notebook. They can all share details so that one fundamental change leads to lots and lots of recomputation of itineraries. We can include all kinds of headings and markdown notes and thoughts and considerations.
  4. Finalize a route that's part of the plan. Still working in the confines of a longish notebook.
  5. Emit a Markdown file with Vessel Identification, Itinerary, Notes, and SAR Contact sections. Run pandoc to make a PDF. (This is the foundation for the nbconvert utility.)
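
Here's a sketch of step 2 (hypothetical names, assuming a simple GPX 1.1 route file exported from OpenCPN):

import xml.etree.ElementTree as ET
from dataclasses import dataclass

GPX = "{http://www.topografix.com/GPX/1/1}"

@dataclass
class Waypoint:
    lat: float
    lon: float
    name: str

def parse_route(path: str) -> list[Waypoint]:
    """Extract route points from a GPX route file."""
    tree = ET.parse(path)
    return [
        Waypoint(
            lat=float(pt.get("lat")),
            lon=float(pt.get("lon")),
            name=pt.findtext(GPX + "name", default=""),
        )
        for pt in tree.iter(GPX + "rtept")
    ]

From a list of Waypoint objects, the range_bearing() function shown earlier computes the leg distances that drive the ETA calculations.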

This workflow creates two categories of results:

One result is a Notebook with all of the planning details and thoughts and contingencies and considerations. 

The other result(s) are float plan documents as PDF's that can be shared widely.

Why did this take so long?

I used spreadsheets from 2007 to 2021. Why switch now? Some reasons.

Legacy solutions are sticky. This has a lot of consequences. I built up "expertise" in making the legacy work. I had become an "expert" in working around the hinky little problems with multiple what-if scenarios and propagating changes from the route into the what-ifs. For example, I limited the number of what-if scenarios I would consider because more than two got confusing.

New solutions are sometimes invisible. I only learned about Jupyter Notebooks about three years ago. I did not realize how powerful they were. I've since rearranged my thinking.

I've known about RST and Markdown and Pandoc for years. But. Getting from spreadsheet-like flexibility to a Markdown document was never a clear step. Without something like Jupyter Lab.

Pulling it all together

Does it require some kind of catalyst to force change?

Is it a slow accretion of evidence that the legacy software isn't working?

I'm pretty sure I had a long, slow Aha! moment as I realized that the Numbers spreadsheet was a large pain in the ass and a notebook would be simpler. It took a few days of fiddling to become really, really sure Numbers was not working out.

I think one of the biggest issues was a third "what-if" scenario. It was helpful to visualize arrival times. But. It was a huge pain in the neck to fiddle with the spreadsheets to get the right waypoints in there and summarize the alternatives.

I think the lesson here is to avoid automating anything unless you actually are the user.

If an organization wants software, a developer needs to do the job manually to *really* understand what the pain points are. Users develop expertise in the wrong things. And they want automation where the benefits are minor. Automating the spreadsheet-to-PDF is wrong. Replacing the spreadsheet is right.

Tuesday, October 5, 2021

New to Python -- How to manage architecture choices

This is a problem folks new to Python have, and sometimes can't articulate that they have it.

They don't know which package is the "right" one to use. Tox vs. Nox. Click vs. argparse vs. getopt? What's the "best" choice?

  • Response 1. The whole Python ecosystem is chaos and the language is just a "toy". You don't have this many choices with (pick your language: e.g., Go or Rust or Scala). 
  • Response 2. We need a way to make architectural choices that the team understands and can use.

Response 1 is remarkably common. It's hard to argue against. If someone thinks innovation is chaos, they -- perhaps -- shouldn't be in technology to begin with. Innovation IS chaos. That's the essential definition!

However, they may be a project owner (or the manager of an old-school waterfall-style project) or -- worse -- someone responsible for architecture, and complain about chaos. If so, they're not really cut out for managing rapid technological change, and they need to be bypassed.

Yes. Bypassed. Ignore them. Go to their meetings. Nod politely when they rant about chaos. Then build working software. Eventually, they'll grow to understand that a large ecosystem is NOT chaos. Rapid innovation is not chaos. They may come to understand that filters are required to reject some of the noise that comes from innovation.

Response 2 -- How do we make choices?

Glad you asked.

I have seen four common approaches.

  1. HiPPO: The Highest-Paid Person's Opinion.
  2. Tech Oracle.
  3. HashTAG: Hyperconnected And Socially Helpful Tech Advisory Group.
  4. Peer Pressure.

Let's look at each of these.

HiPPO

The Highest-Paid Person's Opinion isn't easy to dismiss. They're an executive or the product owner and they think their position in the company gives them a magical ability to somehow predict the technical shortcomings of a component or a framework or a language. 

Once upon a time, when all components were licensed, someone negotiated contracts for support and training. The contracts (and negotiations) were a Big Sweaty Deal (BSD™). The HiPPO needed to justify all the time and money spent with the vendor. Okay. Sure. Then their opinion on continuing to invest in a losing proposition makes a lot of sense. Since they've already spent money with the vendor, they'd like us to continue to spend money with the vendor, even if the vendor's product isn't really very good.

Those times are past. Most everything is open source nowadays, and we pay for support reluctantly. We often have POC's to choose among alternatives. We can fire a vendor quickly. We don't invest heavily in inking a deal. (In the olden days, I got lots of plane rides and hotels from vendors. Getting to a deal was fun back then.)

The HiPPO needs to be informed that their opinion isn't helpful unless they can back it up with a POC. If they can't supply the POC, then the technical folks will keep arguing until they have competing POC's to help make a technical choice.

With good languages (like Python) and large ecosystems, POC's are cheap insurance to back up an opinion.

Tech Oracle

The Tech Oracle is expected to provide an opinion on everything. In many cases, this can work.

If the architecture is reasonably well known to the Oracle, then picking open-source projects to help build a solution isn't too difficult. Filters like "date of last update" and "volume of changes on GitHub" can be useful ways for the Oracle to locate better components and frameworks.

The Oracle should be producing POC's. This makes it hard for them also to produce production code. Not impossible, but hard. Their role isn't quite the same as other devs, since they have to provide up-front justification before too much real work is invested.

If the Oracle can't provide POC's, that's a bit of a problem. I've met architects who don't code. I couldn't find a way to trust them. Yes, they may know a lot. They're wonderfully articulate. Great slide decks. Good choices of lunch places where they try to influence you. But... I don't trust architects who don't code. Sorry. Personal weakness.

Architecture diagrams are an essential work product in addition to POC's. Usually, they're focused on a specific project, rather than providing general-purpose guidance. Generally, the overall ecosystem moves so quickly that the idea of a general-purpose, one-size-fits-all architecture isn't a good idea.

HashTAG

The Hyperconnected And Socially Helpful Tech Advisory Group is often a really good thing. It's best when there are multiple teams who need to coordinate. It can be slow-ish, however, and time needs to be invested in this. TAG meetings deserve stories on the storyboard. TAG time needs to be prioritized above individual team needs. 

A TAG needs to look at choices, and publish recommendations. That often means reviewing POC's. And that means folks have to take POC's to the TAG for them to weigh in on the difficult-to-quantify "better solution".

These are interesting demos. The TAG should be looking at the same (or similar) functionality from competing POC's to render a final, binding judgement. There needs to be an agenda, strict time-lines for the presentation, and a final -- almost objective -- score-card to show the elements of the final decision.

Decisions are an essential work product. Published. Socialized. Well-known. Easy-to-find. A whole GitHub repo with decisions is essential.

Architecture diagrams are also an essential work product. These should provide general-purpose guidance. A team should be able to start with one of these, eliminate the parts they don't need, plug in their product name, and move forward quickly.

Peer Pressure

This is the HashTAG reduced to a single team. Given a choice, the members of the team need to look at filters the way the Tech Oracle should. They need to weigh things like: More stars on GitHub? Fewer bug reports? Better documentation? And they need to capture the decision in something more than a conversation.

If it's hard to reach consensus, this means the team has to commit to dueling POC's. This needs to be time-boxed work. It's enough of a POC to show how competing libraries or frameworks *could* be used in the implementation. It's important not to run down the road to a candidate implementation. The POC should point the implementers in the right direction.

(A candidate implementation becomes a kind of fait accompli: "I've already built it, we might as well use it." This dilutes consensus in favor of fast coding.)

Ideally, the POC shows what code could look like. It might include benchmarks. Test cases. Concrete things that can be compared -- line by line if necessary -- to show some measurable aspect of "better."

The decision and the diagram are part of the team's legacy. It has to live with the code. The number of decisions that get redebated after a few sprints needs to be minimal. It's never zero, but the team needs to put stories on the board for finalizing tech documentation with architectural decisions, reasons, and links to the POC that backs up the decision.

Wait. What about Python?

This, clearly, has nothing to do with Python.

The vastness and rapidity of change in the Python ecosystem surfaces a need for some kind of formal decision-making process.

But Python isn't the cause of the problem. All open source software moves quickly. A popular language like Python has more potential sources of confusion than a more specialized language/framework like R. 

Embrace the community nature of decision-making. Python is about community building and collective solutions to difficult problems. 

But. All those Proofs of Concept...

Yes, there will be POC's. In the case of a HashTAG or TechOracle, these need to be preserved and maintained and upgraded all the time. It's real work. It's a lot of real work.

Remember, the Python ecosystem moves rapidly. There's a lot of innovation, and it needs to be actively tracked. (Unlike the olden days where a C compiler update was an annual affair buried in an annual OS upgrade.)

This leads to defining projects via project templates. See https://cookiecutter.readthedocs.io/en/1.7.2/ for a good approach to this. You want to create cookie cutters that include enough skeleton code that you can run a complete 100% code coverage unit test.

You can then use tox (or nox) to define your component and framework versions as variant virtual environments. As components evolve, you update the versions and rerun your test suite. You can publish internal update trackers for project teams to make sure they're testing with the latest-and-greatest environments.

You'll also have to watch Python version changes. These can creep up on organizations. The PEP's and the schedules need to be central to folks using Python. See https://endoflife.date/python for a handy visualization.

The Billboard

Enterprise developers all discover that there's no way to share code easily within an enterprise. Everyone is isolated in their teams, and each team winds up reinventing some wheel or other. It's been an ongoing problem since IT organizations grew beyond a single team.

Python is no different. Teams solving related problems don't talk enough. If you have lots of meetings to share things, no real work gets done.

Python uses a Package Index to track popular useful packages. Visit https://pypi.org if you haven't seen it yet. You have a few paths forward in an enterprise.

- Your own PyPI. This is easy and fun. You can have the internal PyPI shadow the global PyPI. 

- Use JFrog Artifactory. https://jfrog.com/artifact-management/ This involves spending money to track in-house artifacts as well as global PyPI artifacts.

- A GitHub Billboard organization. This is an organization that serves as a place to post links to other repos. It needs a little bit of curation. As an implementation, the organization's repositories are a lot of small project advertisements. The degenerate case is a README.md. A better case is the POC repo showing how to use the real project. In the middle is a cookie cutter. This is your in-house advertising. It's relatively easy to search because you're looking at one organization's list of repositories. Each is a pithy, focused summary of another project. Choose names that reflect why someone wants to look more deeply at the project.

The point here is to embrace the chaos that stems from innovation and make it visible.

Tuesday, September 28, 2021

Pivot to Python -- 150 pages of things you might need to know

See http://books.apple.com/us/book/id1586977675 

The Python Programming language is a deep topic. This book provides focused guidance on installing Python, creating virtual environments, and using Jupyter Lab to build foundational skills in using Python. The book covers many built-in data types. There are two small case studies and one larger case study to provide examples of how Python can be used to tackle real-world problems. There are over 100 code samples.

The companion GitHub repository, https://github.com/slott56/pivot-to-python, contains all of the example code.


Tuesday, September 21, 2021

Found an ancient CGI script -- part IV -- OpenAPI specification

See the previous sections, starting with the first on finding an ancient CGI script.

We don't need an OpenAPI specification. But, it is so helpful to formalize the behavior of a web site that it's hard for me to imagine working without it.

In this case, the legacy script only has a few paths, so the OpenAPI specification is relatively small.

openapi: 3.0.1
info:
  title: CGI Conversion
  version: 1.0.0
paths:
  /resources/{type}/:
    get:
      summary: Query Form
      operationId: form
      parameters:
      - name: type
        in: path
        required: true
        schema:
          type: string
      responses:
        200:
          description: Form
          content: {}
    post:
      summary: Add a document
      operationId: update
      requestBody:
        description: document
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Document'
        required: true
      parameters:
      - name: type
        in: path
        required: true
        schema:
          type: string
      responses:
        201:
          description: Created
          content: 
            text/html:
              {}
        405:
          description: Invalid input
          content: 
            text/html:
              {}
  /resources/{type}/{guid}:
    get:
      summary: Find documents
      operationId: find
      parameters:
      - name: type
        in: path
        required: true
        schema:
          type: string
      - name: guid
        in: path
        required: true
        schema:
          type: string
      responses:
        200:
          description: successful operation
          content:
            text/html:
              {}
        404:
          description: Not Found
          content:
            text/html:
              {}

components:
  schemas:
    Document:
      type: object
      properties:
        fname:
          type: string
        lname:
          type: string

This shows the rudiments of the paths and the responses. There are three "successful" kinds of responses, plus two additional error responses that are formally defined.

There is a lot of space in this document for additional documentation and details. Every opportunity should be taken to capture details about the application, what it does now, and what it should do when it's rewritten.

In our example, the form (and resulting data structure) is a degenerate class with a pair of fields. We simply write the repr() string to a file. In a practical application, this will often be a bit more complex. There may be validation rules, some of which are obscure, hidden in odd places in the application code.

What's essential here is continuing the refactoring process to more fully understand the underlying data model and state processing. These features need to be disentangled from HTML output and CGI input.

The OpenAPI spec serves as an important part of the definition of done. It supplements the context diagram with implementation details. In a very real and practical way, this drives the integration test suite. We can transform OpenAPI to Gherkin and use this to test the overall web site. See https://medium.com/capital-one-tech/spec-to-gherkin-to-code-902e346bb9aa for more on this topic.

Tuesday, September 14, 2021

Found an ancient cgi script -- part III -- refactoring

Be sure to see the original script and the test cases in the prior posts.

We need to understand a little about what a web request is. This can help us do the refactoring.

It can help to think of a web server as a function that maps a request to a response. The request is really a composite object with headers, \(h\), method verb, \(v\), and URL, \(u\). Similarly, the response is a composite with headers, \(h\), and content, \(c\).

$$h, c = s(h, v, u)$$

The above is true for idempotent requests; usually, the method verb is GET. 

Some requests make a state change, however, and use method verbs like POST, PUT, PATCH, or DELETE.

$$h, c; \hat S = s(h, v, u; S)$$

There's a state,  \(S\), which is transformed to a new state, \(\hat S\), as part of making the request.

For the most part, CGI scripts are limited to GET and POST methods. The GET method is (ideally) for idempotent, no-state-change requests. The POST should be limited to making state changes. In some cases, there will be an explicit GET-after-POST sequence of operations using an intermediate redirection so the browser's "back" button works properly. 

In too many cases, the rules aren't followed well and there will be state transitions on GET and idempotent POST operations. Sigh.

Multiple Resources

Most web servers will provide content for a number of resource instances. Often they will work with a number of instances of a variety of resource types. The degenerate case is a server providing content for a single instance of a single type.

Each resource comes from the server's universe of resources, \(R\).

$$r \in R$$

Each resource type, \(t(r )\), is part of some overall collection of types that describe the various resources. In some cases we'll identify resources with a path that includes the type of the resource,  \(t(r )\), and an identifier within that type, \(i(r )\), \(\langle t( r ), i( r ) \rangle\). This often maps to a character string "type/name" that's part of a URL's path.

We can think of a response's content as the HTML markup, \(m_h\), around a resource, \(r\), managed by the web server.

$$ c = m_h( r )$$

This is a representation of the resource's state. The HTML representation can have both semantic and style components. We might, for example, have a number of HTML structure elements like <p>, as well as CSS styles. Ideally, the styles don't convey semantic information, but the HTML tags do.

Multiple Services

There are often multiple, closely-related services within a web server. A common design pattern is to have services that vary based on a path item, \(p(u)\), within the url.

$$h, m_h(r); \hat S = s(h, v, u; S) = \begin{cases} s_x(h, v, u; S) & \text{if } p(u) = x \\ s_y(h, v, u; S) & \text{if } p(u) = y \end{cases}$$

There isn't, of course, any formal requirement for a tidy mapping from some element of the path, \(p(u)\), to a type, \(t(r)\), that characterizes a resource, \(r\). Utter chaos is allowed. Thankfully, it's not common.

While there may not be a tidy type-based mapping, there must be a mapping from a triple and a state, \(\langle h, u, v; S \rangle\), to a resource, \(r\). This mapping can be considered a database or filesystem query, \(q(\langle h, u, v; S \rangle)\). The request may also involve a state change. It can help to think of the state as a function that can emit a new state for a request. This implies two low-level processing concepts:

$$ \{ r \in R | q(\langle h, u, v; S \rangle, r) \} $$

And

$$ \hat S = S(\langle h, u, v \rangle) $$

The query processing to locate resources is one aspect of the underlying model. The state change for the universe of resources is another aspect of the underlying model. Each request must return a resource; it may also make a state change.

What's essential, then, is to see how these various \(s_x\) functions are related to the original code. The \(m_h(r)\) function, the \(p(u)\) mappings, and the \(s_{p(u)}(h, v, u; S)\) functions are all separate features that can be disentangled from each other.

Why All The Math?

We need to be utterly ruthless about separating several things that are often jumbled together. (A sketch of these pieces as separate Python functions follows the list.)

  • A web server works with a universe of resources. These can be filesystem objects, database rows, external web services, anything. 
  • Resources have an internal state. Resources may also have internal types (or classes) to define common features.
  • There's at least one function to create an HTML representation of state. This may be partial or ambiguous. It may also be complete and unambiguous.
  • There is at least one function to map a URL to zero or more resources. This can (and often does) result in 404 errors because a resource cannot be found.
  • There may be a function to create a server state from the existing server state and a request. This can result in 403 errors because an operation is forbidden.
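
As a minimal Python sketch -- with invented names, and nothing more than signatures -- the separation might look like this. Each stub corresponds to one of the functions in the math above.

from typing import Iterable

# Invented aliases standing in for the objects in the math.
Resource = dict   # r: one resource's state
State = dict      # S: the universe of resources
Request = tuple   # <h, v, u>: headers, verb, URL

def html_representation(resource: Resource) -> str:
    """m_h(r): render a resource's state as HTML."""
    ...

def query(request: Request, state: State) -> Iterable[Resource]:
    """q(<h, v, u; S>): locate zero or more resources; empty means 404."""
    ...

def transition(request: Request, state: State) -> State:
    """S-hat = S(<h, v, u>): the new server state; may be forbidden (403)."""
    ...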

Additionally, there can be user authentication and authorization rules. The users are simply resources. Authentication is simply a query to locate a user. It may involve using the password as part of the user lookup. Users can have roles. Authorization is a property of a user's role required by a specific query or state change (or both.)

As we noted in the overview, the HTML representation of state is handled (entirely) by Jinja. HTML templates are used. Any non-Jinja HTML processing in legacy CGI code can be deleted.

The mapping from URL to resource may involve several steps. In Flask, some of these steps are handled by the mapping from a URL to a view function. This is often used to partition resources by type. Within a view function, individual resources will be located based on URL mapping.
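
Here's a hedged sketch of that two-step mapping in Flask; the in-memory DATABASE and the route are illustrative assumptions, not the original application's code.

from flask import Flask, abort

app = Flask(__name__)

# Hypothetical stand-in for the filesystem full of resources.
DATABASE = {"example": {"42": {"fname": "First", "lname": "Last"}}}

@app.route("/resources/<type_name>/<name>")
def get_resource(type_name, name):
    # Step 1: the URL rule partitions requests by resource type.
    instances = DATABASE.get(type_name)
    if instances is None:
        abort(404)
    # Step 2: within the view, locate the individual resource.
    resource = instances.get(name)
    if resource is None:
        abort(404)
    return resource  # modern Flask renders a dict as a JSON response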

What do we do?

In our example code, we have a great deal of redundant HTML processing. One sensible option is to separate all of the HTML printing into one or more functions that emit the various kinds of pages.

In our example, the parsing of the path is a single, long nested bunch of if-elif processing. This should be refactored into individual functions. A single, top-level function can decide what the URL pattern and verb mean, and then delegate the processing to a view function. The view function can then use an HTML rendering function to build the resulting page.

One family of URLs results in presentation of a form. Another family of URLs processes the form input. The form data leads to a resource with internal state. The form content should be used to define a Python class. A separate class should read and write files with these Python objects. The forms should be defined at a high level using a module like WTForms.
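
For example, here's a hedged WTForms sketch, assuming the two fields (fname, lname) from the legacy form:

from wtforms import Form, StringField, validators

class NameForm(Form):
    # The two legacy form fields, declared once, with validation rules
    # that were previously buried in the application code.
    fname = StringField("First name", [validators.InputRequired()])
    lname = StringField("Last name", [validators.InputRequired()])

A view function can then build NameForm(request.form) and call validate(), instead of picking through raw CGI field values.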

When rewriting, I find it helps to keep several things separated:

  • A class for the individual resource objects.
  • A form that is one kind of serialization of the resource objects.
  • An HTML page that is another kind of serialization of the resource objects.

While these things are related very closely, they are not isomorphic to each other. Objects may have implementation details or derived values that should not be trivially shown on a form or HTML page.

In our example, the form only has two fields. These should be properly described in a class. The field objects have different types. The types should also be modeled more strictly, not treated casually as a piece of a file path. (What happens if we use a type name of "this/that"?)

Persistent state change is handled with filesystem updates. These, too, are treated informally, without a class to encapsulate the valid operations, and reject invalid operations.
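
Here's a minimal sketch of such a class, assuming the same data directory layout as the script. The FileStore name and the validation rule are inventions for illustration; note how they answer the "this/that" question above.

import re
import uuid
from pathlib import Path

class FileStore:
    """Encapsulates the valid filesystem operations; rejects invalid ones."""

    # Rejects path tricks like "this/that" or "..".
    VALID_NAME = re.compile(r"^[\w-]+$")

    def __init__(self, base="data"):
        self.base = Path(base)
        self.base.mkdir(exist_ok=True)

    def create(self, type_name, document):
        if not self.VALID_NAME.match(type_name):
            raise ValueError(f"invalid type name {type_name!r}")
        type_dir = self.base / type_name
        type_dir.mkdir(exist_ok=True)
        name = str(uuid.uuid4())
        (type_dir / name).write_text(document + "\n")
        return name

    def read(self, type_name, name):
        if not (self.VALID_NAME.match(type_name) and self.VALID_NAME.match(name)):
            raise ValueError("invalid path element")
        return (self.base / type_name / name).read_text()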

Some Examples

Here is one of the HTML output functions.

def html_post_response(type_name, name, data):
    print "Status: 201 CREATED"
    print "Content-Type: text/html"
    print
    print "<!DOCTYPE html>"
    print "<html>"
    print "<head><title>Created New %s</title></head>" % type_name
    print "<body>"
    print "<h1>Created New %s</h1>" % type_name
    print "<p>Path: %s/%s</p>" % (type_name, name)
    print "<p>Content: </p><pre>"
    print data
    print "</pre>"
    # cgi.print_environ()
    print "</body>"
    print "</html>"

There are several functions like this. We aren't wasting any time optimizing all these functions. We're simply segregating them from the rest of the processing. There's a huge amount of redundancy; we'll fix this when we start using jinja templates.

Here's the revised main() function.

def main():
    try:
        os.mkdir("data")
    except OSError:
        pass

    path_elements = os.environ["PATH_INFO"].split("/")
    if path_elements[0] == "" and path_elements[1] == "resources":
        if os.environ["REQUEST_METHOD"] == "POST":
            type_name = path_elements[2]
            base = os.path.join("data", type_name)
            try:
                os.mkdir(base)
            except OSError:
                pass
            name = str(uuid.uuid4())
            full_name = os.path.join(base, name)
            data = cgi.parse(sys.stdin)
            output_file = open(full_name, 'w')
            output_file.write(repr(data))
            output_file.write('\n')
            output_file.close()
            html_post_response(type_name, name, data)

        elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 3:
            type_name = path_elements[2]
            html_get_form_response(type_name)

        elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 4:
            type_name = path_elements[2]
            resource_name = path_elements[3]
            full_name = os.path.join("data", type_name, resource_name)
            input_file = open(full_name, 'r')
            content = input_file.read()
            input_file.close()
            html_get_response(type_name, resource_name, content)

        else:
            html_error_403_response(path_elements)
    else:
        html_error_404_response(path_elements)

This has the HTML output fully segregated from the rest of the processing. We can now see the request parsing and the model processing more clearly. This lets us move further and refactor into yet smaller and more focused functions. We can see file system updates and file path creation as part of the underlying model. 

Since these examples are contrived, the processing is essentially a repr() function call. Not too interesting, but the point is to identify this clearly by refactoring the application to expose it.

Summary

When we start to define the classes to properly model the persistent objects and their state, we'll see that there are zero lines of legacy code that we can keep. 

Zero lines of legacy code have enduring value.

This is not unusual. Indeed, I think it's remarkably common.

Reworking a CGI application should not be called a "migration."

  • There is no "migration" of code from Python 2 to Python 3. The Python 2 code is (almost) entirely useless except to explain the use cases.
  • There is no "migration" of code from CGI to some better framework. Flask (and any of the other web frameworks) are nothing like CGI scripts.

The functionality should be completely rewritten into Python 3 and Flask. The processing concept is preserved. The data is preserved. The code is not preserved.

In some projects, where there are proper classes defined, there may be some code that can be preserved. However, a Python dataclass may do everything a more complex Python2 class definition does with a lot less code. The Python2 code is not sacred. Code should not be preserved because someone thinks it might reduce cost or risk.

The old code is useful for three things.

  • Define the unit test cases.
  • Define the integration test cases.
  • Answer questions about edge cases when writing new code.

This means we won't be using the 2to3 tool to convert any of the code. 

It also means the unit test cases are the new definition of the project. These are the single most valuable part of the work. Given test cases that describe the old application, writing the new app using Flask is relatively easy.

Tuesday, September 7, 2021

Found an ancient cgi script -- part II -- testing

See "We have an ancient Python2 CGI script -- what do we do?" The previous post in this series provides an overview of the process of getting rid of legacy code. 

Here's some code. I know it's painfully long; the point is to provide a super-specific, very concrete example of what to keep and what to discard. (I've omitted the module docstring and the imports.)

try:
    os.mkdir("data")
except OSError:
    pass

path_elements = os.environ["PATH_INFO"].split("/")
if path_elements[0] == "" and path_elements[1] == "resources":
    if os.environ["REQUEST_METHOD"] == "POST":
        type_name = path_elements[2]
        base = os.path.join("data", type_name)
        try:
            os.mkdir(base)
        except OSError:
            pass
        name = str(uuid.uuid4())
        full_name = os.path.join(base, name)
        data = cgi.parse(sys.stdin)
        output_file = open(full_name, 'w')
        output_file.write(repr(data))
        output_file.write('\n')
        output_file.close()

        print "Status: 201 CREATED"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Created New %s</title></head>" % type_name
        print "<body>"
        print "<h1>Created New %s</h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, name)
        print "<p>Content: </p><pre>"
        print data
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 3:
        type_name = path_elements[2]
        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Query %s</title></head>" % (type_name,)
        print "<body><h1>Create new instance of <tt>%s</tt></h1>" % type_name
        print '<form action="/cgi-bin/example.py/resources/%s" method="POST">' % (type_name,)
        print """
          <label for="fname">First name:</label>
          <input type="text" id="fname" name="fname"><br><br>
          <label for="lname">Last name:</label>
          <input type="text" id="lname" name="lname"><br><br>
          <input type="submit" value="Submit">
        """
        print "</form>"
        # cgi.print_environ()
        print "</body>"
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 4:
        type_name = path_elements[2]
        resource_name = path_elements[3]
        full_name = os.path.join("data", type_name, resource_name)
        input_file = open(full_name, 'r')
        content = input_file.read()
        input_file.close()

        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Document %s -- %s</title></head>" % (type_name, resource_name)
        print "<body><h1>Instance of <tt>%s</tt></h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, resource_name)
        print "<p>Content: </p><pre>"
        print content
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    else:
        print "Status: 403 Forbidden"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Forbidden: %s to %s</title></head>"  % (os.environ["REQUEST_METHOD"], path_elements)
        cgi.print_environ()
        print "</html>"
else:
    print "Status: 404 Not Found"
    print "Content-Type: text/html"
    print                               # blank line, end of headers
    print "<!DOCTYPE html>"
    print "<html>"
    print "<head><title>Not Found: %s</title></head>" % (os.environ["PATH_INFO"], )
    print "<h1>Error</h1>"
    print "<b>Resource <tt>%s</tt> not found</b>" % (os.environ["PATH_INFO"], )
    cgi.print_environ()
    print "</html>"

At first glance you might notice (1) there are several resource types located on the URL path, and (2) there are several HTTP methods, also. These features aren't always obvious in a CGI script, and it's one of the reasons why CGI is simply horrible. 

It's not clear from this what -- exactly -- the underlying data model is and what processing is done and what parts are merely CGI and HTML overheads.

This is why refactoring this code is absolutely essential to replacing it.

And.

We can't refactor without test cases.

And (bonus).

We can't have test cases without some vague idea of what this thing purports to do.

Let's tackle this in order. Starting with test cases.

Unit Test Cases

We can't unit test this.

As written, it's a top-level script without so much as a single def or class. This style of programming -- while legitimate Python -- is an epic fail when it comes to testing.

Step 1, then, is to refactor a script file into a module with function(s) or class(es) that can be tested.

def main():
    ... the original script ... 

if __name__ == "__main__":  # pragma: no cover
    main()

For proper testability, there can be at most these two lines of code that are not easily tested. These two (and only these two) are marked with a special comment (# pragma: no cover) so the coverage tool can politely ignore the fact that we won't try to test these two lines.

We can now provide os.environ values that look like a CGI request, and exercise this script with concrete unit test cases.

How many things does it do?

Reading the code is headache-inducing, so a fall-back plan is to count the number of logic paths. Look at if/elif blocks and count those without thinking too deeply about why the code looks the way it looks.

There appear to be five distinct behaviors. Since there are possibilities of unhandled exceptions, there may be as many as 10 things this will do in production.

This leads to a unit test that looks like the following:

import unittest
import urllib
import example_2
import os
import io
import sys

class MyTestCase(unittest.TestCase):
    def setUp(self):
        self.cwd = os.getcwd()
        try:
            os.mkdir("test_path")
        except OSError:
            pass
        os.chdir("test_path")
        self.output = io.BytesIO()
        sys.stdout = self.output
    def tearDown(self):
        sys.stdout = sys.__stdout__
        sys.stdin = sys.__stdin__
        os.chdir(self.cwd)
    def test_path_1(self):
        """No /resources in path"""
        os.environ["PATH_INFO"] = "/not/valid"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 404 Not Found")
    def test_path_2(self):
        """Path /resources but bad method"""
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 403 Forbidden")
    def test_path_3(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "GET"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 200 OK")
        self.assertIn("<form ", out)
    def test_path_5(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "POST"
        os.environ["CONTENT_TYPE"] = "application/x-www-form-urlencoded"
        content = urllib.urlencode({"field1": "value1", "field2": "value2"})
        form_data = io.BytesIO(content)
        os.environ["CONTENT_LENGTH"] = str(len(content))
        sys.stdin = form_data
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 201 CREATED")
        self.assertIn("'field2': ['value2']", out)
        self.assertIn("'field1': ['value1']", out)


if __name__ == '__main__':
    unittest.main()

Does this have 100% code coverage? I'll leave it to the reader to copy-and-paste, add the coverage run command and look at the output. What else is required?

Integration Test Case

We can (barely) do an integration test on this. It's tricky because we don't want to run Apache httpd (or some other server.) We want to run a small Python script to be sure this works.

This means we need to (1) start a server as a separate process, and (2) use urllib to send requests to that separate process. This isn't too difficult. Right now, it's not obviously required. The test cases above run the entire script from end to end, providing what we think are appropriate mock values. Emphasis on "what we think." To be sure, we'll need to actually fire up a separate process. 

As with the unit tests, we need to enumerate all of the expected behaviors. 

Unlike the unit tests, there are (generally) fewer edge cases.

It looks like this.

import unittest
import subprocess
import time
import urllib2

class TestExample_2(unittest.TestCase):
    def setUp(self):
        self.proc = subprocess.Popen(
            ["python2.7", "mock_httpd.py"],
            cwd="previous"
        )
        time.sleep(0.25)
    def tearDown(self):
        self.proc.kill()
        time.sleep(0.1)
    def test(self):
        req = urllib2.Request("http://localhost:8000/cgi-bin/example.py/resources/example")
        result = urllib2.urlopen(req)
        self.assertEqual(result.getcode(), 200)
        self.assertEqual(set(result.info().keys()), set(['date', 'status', 'content-type', 'server']))
        content = result.read()
        self.assertEqual(content.splitlines()[0], "<!DOCTYPE html>")
        self.assertIn("<form ", content)

if __name__ == '__main__':
    unittest.main()

This will start a separate process and then make a request from that process. After the request, it kills the subprocess. 

We've only covered one of the behaviors. A bunch more test cases are required. They're all going to be reasonably similar to the test() method.

Note the mock_httpd.py script. It's a tiny thing that invokes CGI's.

import CGIHTTPServer
import BaseHTTPServer

server_class = BaseHTTPServer.HTTPServer
handler_class = CGIHTTPServer.CGIHTTPRequestHandler

server_address = ('', 8000)
httpd = server_class(server_address, handler_class)
httpd.serve_forever()

This will run any script file in the cgi-bin directory, acting as a kind of mock for Apache httpd or other CGI servers.

Tests Pass, Now What?

We need to formalize our knowledge with some diagrams. This is a Context diagram in PlantUML. It draws a picture that we can use to discuss what this app does and who actually uses it.

@startuml
actor user
usecase post
usecase query
usecase retrieve
user --> post
user --> query
user --> retrieve

usecase 404_not_found
usecase 403_not_permitted
user --> 404_not_found
user --> 403_not_permitted

retrieve <|-- 404_not_found
@enduml

We can also update the Container diagram. There's an "as-is" version and a "to-be" version.

Here's the as-is diagram of any CGI.

@startuml
interface HTTP

node "web server" {
    component httpd as  "Apache httpd"
    interface cgi
    component app
    component python
    python --> app
    folder data
    app --> data
}

HTTP --> httpd
httpd -> cgi
cgi -> python
@enduml

Here's a to-be diagram of a typical (small) Flask application. 

@startuml
interface HTTP

node "web server" {
    component httpd as  "nginx"
    component uwsgi
    interface wsgi
    component python
    component app
    component model
    component flask
    component jinja
    folder data
    folder static
    httpd --> static
    python --> wsgi
    wsgi --> app
    app --> flask
    app --> jinja
    app -> model
    model --> data
}

HTTP --> httpd
httpd -> uwsgi
uwsgi -> python
@enduml

These diagrams can help to clarify how the CGI will be restructured. A complex CGI might have a database or external web services involved. These should be correctly depicted.

The previous post on this subject said we can now refactor this code. The unit tests are required before making any real changes. (Yes, we made one change to promote testability by repackaging a script to be a function.)

We're aiming to disentangle the HTML and CGI overheads from the application and narrow our focus onto the useful things it does.


Tuesday, August 31, 2021

We have an ancient Python2 CGI script -- what do we do?

This was a shocking email: these people have a Python 2 CGI script. They needed advice on a Python 2 to 3 migration.

Here's my advice on a Python 2 CGI script: Throw It Away.

A great deal of the CGI processing is handled by the WSGI standard (see the standard library's wsgiref module), as well as tools like jinja and flask. This means that the ancient Python 2 CGI script has to be disentangled into two parts.

  • All the stuff that deals with CGI and HTML. This isn't valuable and must be deleted.
  • Whatever additional, useful, interesting processing it does for the various user communities. 

The second part -- the useful work -- needs to be preserved. The rest is junk.

With web services there are often at least three communities: the "interactive users", "analysts", and the administrators who keep it running. The names vary a lot with the problem domain. The interactive users may further decompose into anonymous visitors, people with privileges to make changes, and administrators to manage the privileges. There may be multiple flavors of analytical work based on the web transactions that are logged. A lot can go on, and each of these communities has a feature set they require.

The idea here is to look at the project as a rewrite where some of the legacy code may be preserved. It's better to proceed as though this is new development with the legacy code providing examples and test cases. If we look at this as new, we'll start with some diagrams to provide a definition of done.

Step One

Understand the user communities. Create a C4 Context Diagram to show who the users are and what they expect. Ideally, it's small with "users" and "administrators." It may turn out to be big with complex privilege rules to segregate users.

It's hard to get this right. Everyone wants the code "converted". But no one really knows all the things the code does. There's a lot of pressure to ignore this step.

This step creates the definition of done. Without this, there's no way to do anything with the CGI code and make sure that the original features still work.

Step Two

Create a C4 Container Diagram showing the Apache HTTPD (or whatever server you're using) that fires the CGI. Document all the other ancillary things that are going on. Ideally, there's nothing. Ideally, this is a minor, stand-alone server that no one noticed until today. Label this picture "As Is." It will change, but you need a checklist of what's running right now.

(This should be very quick to produce. If it's not, go back to step one and make sure you really understand the context.)

Step Three

Create a C4 Component Diagram, and label it "As Is". This has all the parts of your code base. Be sure you locate all the things in the local site-packages directory that were added onto Python. Ideally, there isn't much, but -- of course -- there could be dozens of add-on libraries.

You will have several lists. One list has all the things in site-packages. If the PYTHONPATH environment variable is used, another list has all the things in the directories named in this environment variable. Plus, all the things named in import statements.

These lists should overlap. Of course someone can install a package that's not used, so the site-packages list should be a superset of the import list.

This is a checklist of things that must be read (and possibly converted) to build the new features.
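
Here's a hedged sketch that harvests the import list with Python's ast module. One caveat: run it under the same major version of Python as the code being scanned, since a Python 3 parser will choke on Python 2 print statements (Python 2.7 also ships an ast module, so this sketch runs under either).

import ast
import glob

def imported_names(source_dir):
    """Collect every module named in an import statement under source_dir."""
    names = set()
    for path in glob.glob(source_dir + "/*.py"):
        with open(path) as source:
            tree = ast.parse(source.read(), filename=path)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module)
    return names

print(sorted(imported_names("cgi-bin")))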

Step Four?

You'll need two suites of fully automated tests. 

  • Unit tests for the Python code. This must have 100% code coverage and will not be easy.
  • Integration tests for the CGI. You will be using the WSGI module instead of Apache HTTPD (or whatever the server was) for this testing. You will NOT integrate with the original web server, because, that interface is no longer supported and is a security nightmare.

Let's break this into two steps.

Step Four

You need automated unit tests. You need to reach 100% code coverage with the unit tests. This is going to be difficult for two reasons. First, the legacy code may not be easy to read or test. Second, Python 2 testing tools are no longer well supported. Many of them still work, but if you encounter problems, the tool will never be fixed.

If you can find a Python 2 version of coverage, and a Python 2 version of pytest, I suggest using this combination to write a test suite, and make sure you have 100% code coverage. 

This is a lot of work, and there's no way around it. Without automated testing, there's no way to prove that you're done and the software can be trusted in production.

You will find bugs. Don't fix them now. Log them by marking the test case with the proper answer different from the answer you're getting.

Step Five

Python has a built-in CGI server you can use. See https://docs.python.org/3/library/http.server.html#http.server.CGIHTTPRequestHandler for a handler that will provide core CGI features from a Python script allowing you to test without the overhead of Apache httpd or some other server.

You need an integration test suite for each user story in the context you created in Step One. No exceptions. Each User. Each Story. A test to show that it works.

You'll likely want to use the CGIHTTPRequestHandler class in the http.server module to create a test server. You'll then create a pytest fixture that starts the web server before a test and then kills the process after the test. It's very important to use subprocess.Popen() to start and stop the target server to be sure the CGI interface works correctly.
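
Here's a hedged sketch of such a fixture. The server script, working directory, and URL are assumptions that match the examples elsewhere in this series; the sleep is a crude wait for the port to open.

import subprocess
import time
import urllib.request

import pytest

@pytest.fixture
def cgi_server():
    # Start the CGI-capable server in a separate process...
    proc = subprocess.Popen(["python2.7", "mock_httpd.py"], cwd="previous")
    time.sleep(0.25)  # crude wait for the port to be ready
    yield proc
    # ...and kill it after the test.
    proc.kill()
    proc.wait()

def test_get_form(cgi_server):
    url = "http://localhost:8000/cgi-bin/example.py/resources/example"
    with urllib.request.urlopen(url) as response:
        assert response.getcode() == 200
        assert b"<form " in response.read()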

It is common to find bugs. Don't fix them now. Log them by marking the test case with the proper answer different from the answer you're getting.

Step Six

Refactor. Now that you have automated tests to prove the legacy CGI script really works, you need to disentangle the Python code into three distinct components.

  1. A Component to parse the request: the methods, cookies, headers, and URL.
  2. A Component that does useful work. This corresponds to the "model" and "control" part of the MVC design pattern. 
  3. A Component that builds the response: the status, headers, and content. 

In many CGI scripts, there is often a hopeless jumble of bad code. Because you have tests in Step Four and Step Five, you can refactor and confirm the tests still pass.

If the code is already nicely structured, this step is easy. Don't plan on it being easy.

One goal is to eventually replace HTML page output creation with jinja. Similarly, another goal is to eventually replace parsing the request with flask. All of the remaining CGI-related features get pushed into a wsgi-compatible plug-in to a web server.

The component that does the useful work will have some underlying data model (resources, files, downloads, computations, something) and some control (post, get, different paths, queries.) We'd like to clean this up, too. For now, it can be one module.

After refactoring, you'll have a new working application. You'll have a new top-level CGI script that uses the built-in wsgiref module to do request and response processing. This is temporary, but is required to pass the integration test suite.

You may want to create an intermediate Component diagram to describe the new structure of the code.

Step Seven

Write an OpenAPI specification for the revised application. See https://swagger.io/specification/ for more information. Add the path processing so openapi.json (or openapi.yaml) will produce the specification. This means updating unit and integration tests to add this feature. 

While this is new development, it is absolutely essential for building any kind of web service. It will implement the Context diagram, and most of the Container diagram. It will describe significant portions of the Component diagram, also. It is not optional. It's very likely this was not part of the legacy application.

Some of the document structures described in the OpenAPI specification will be based on the data model and control components factored out of the legacy code. It's essential to get these details right in the OpenAPI specification and the unit tests.

This may expose problems in the CGI's legacy behavior. Don't fix it now. Instead document the features that don't fit with modern API's. Don't be afraid to use # TODO comments to show what should be fixed.

Step Eight

Use the 2to3 tool to convert ONLY the model and control components. Do not convert request parsing and response processing components; they will be discarded. This may involve additional redesign and rewrites depending on how bad the old code was.

Convert the unit tests for ONLY the model and control components.

Get the unit tests for the model and control to work in Python 3. This is the foundation for the new web site. Update the C4 container, component, and code diagrams. Since there's no request handling or HTML processing, don't worry about code coverage for the project as a whole. Only get the model and control to have 100% coverage.

Do not start writing view functions or HTML templates until the underlying model and control module works. This is the foundation of the application. It is not tied to HTTP, but must exist and be tested independently.

Step Nine

Using Flask as a framework and the OpenAPI specification for the web application, build the view functions to exercise all the features of the application. Build Jinja templates for the HTML output. Use proper cookie management from Flask, discarding any legacy cookie management from the CGI. Use proper header parsing rules in Flask, discarding any legacy header processing.

Rewrite the remaining unit tests manually. These unit tests will now use the Flask test client. The goal is to get back to 100% code coverage.
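
A minimal sketch, assuming the new application object lives in a hypothetical myapp module and that the new app preserves the legacy status codes:

from myapp import app  # hypothetical module holding the Flask application

def test_get_form():
    client = app.test_client()
    response = client.get("/resources/example")
    assert response.status_code == 200
    assert b"<form " in response.data

def test_post_creates_resource():
    client = app.test_client()
    response = client.post(
        "/resources/example",
        data={"fname": "First", "lname": "Last"},
    )
    assert response.status_code == 201

No separate server process is needed here; the Flask test client exercises the view functions directly.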

Update the C4 container, component, and code diagrams.

Step Ten

There are an untold number of ways to deploy a Flask application. Pick something simple and secure. Do some test deployments to be sure you understand how this works. As one example, you can continue to use Apache httpd. As another example, some people prefer Gunicorn, others prefer to use NGINX. There's lots of advice in the Flask project on ways to deploy Flask applications.

Do not reuse the Apache httpd and CGI interface. This was terrible. 

Step Eleven

Create a pyproject.toml file that includes a tox section so that you have a fully-automated integration capability. You can automate the CI/CD pipeline. Once the new app is in production, you can archive the old code and never use it again for anything. Ever. 

Step Twelve

Fix the bugs you found in Steps Four, Five, and Seven. You will be creating a new release with new, improved features.

tl;dr

This is a lot of work. There's no real alternative. CGI scripts need a lot of rework.

Tuesday, August 24, 2021

Spreadsheets, COBOL, and Schema-Driven File Processing

I need to rewrite Stingray Reader. This project handles a certain amount of file processing using a schema to assure the Logical Layout is understood.  It handles several common Physical Formats:

  • CSV files where the format is extended by the various dialects options.
  • COBOL files in ASCII or EBCDIC.

The project's code can be applied to text files where a regular expression can yield a row-level dictionary object. Web server log files, for example, are in first normal form, but have irregular punctuation that CSV can't handle. 

It can also be applied to NDJSON files (see http://ndjson.org or https://jsonlines.org) without too much work. This also means it can be applied to YAML files. I suspect it can also be applied to TOML files as a distinct physical format.

The complication in the Stingray Reader is that COBOL files aren't really in first normal form. They can have repeating groups of fields that CSV files don't (generally) have. And the initial data model in the project wasn't really up to handling this cleanly. The repeating group logic was patched in.

Further complicating this particular project was the history of its evolution. It started as a way to grub through hellishly complex CSV files. You know, the files where there are no headings, or the headings are 8 lines long, or the files where there are a lot of lines before the proper headings for the data. It handled all of those not-first-normal-form issues that arise in CSV world.

I didn't (initially) understand JSON Schema (https://json-schema.org) and did not leverage it properly as an intermediate representation for CSV as well as COBOL layouts. It arose as a kind of after-thought. There are a lot of todo's related to applying JSON Schema to the problem. 

Recently, I learned about Lowrance USR files. See https://github.com/slott56/navtools in general and https://github.com/slott56/navtools/blob/master/navtools/lowrance_usr.py for details. 

It turns out that the USR file could be described, reasonably well, with a Stingray schema. More to the point, it should be describable by a Stingray schema, and the application to extract waypoints or routes should look a lot like a CSV reader.

Consequences

There are a bunch of things I need to do.

First, and foremost, I need to unwind some of the COBOL field extraction logic. It's a right awful mess because of the way I hacked in OCCURS DEPENDING ON. The USR files also have numerous instances of arrays with a boundary defined by other content of the file. This is a JSON Schema Extension (not a weird COBOL special case) and I need to use proper JSON schema extensions and attribute cross-references.

Of course, the OCCURS DEPENDING ON clauses can nest, leading to quite complex navigation through a dynamically-sized collection of bytes. This is not done terribly well in the current version, and involves leaving little state reminders around to "simplify" some of the coding.

The field extractions for COBOL apply to binary files and should be able to leverage the Python struct module to decode individual fields. We should be able to also extract data from USR files. The schema can be in pure JSON or it can be in Python as an internal data structure. This is a new feature and (in principle) can be applied to a variety of binary files that are in (approximately) first normal form. 

(It may also be sensible to extend the struct module to handle some EBCDIC conversions: int, float, packed-decimal, numeric string, and alphanumeric string.)
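
Here's a tiny sketch of the struct idea. The record layout is invented; in the rewrite, a schema would presumably generate format strings like this one rather than having them hand-coded.

import struct

# Hypothetical fixed layout: big-endian int, double, 12-byte character field.
RECORD = struct.Struct(">i d 12s")

raw = RECORD.pack(42, 3.14, b"waypoint-001")
count, value, name = RECORD.unpack(raw)
print(count, value, name.rstrip(b"\x00"))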

Once we can handle COBOL and USR file occurs-depending-on with some JSON Schema extensions, we can then work on ways to convert source material (including JSON Schema) to the internal representation of a schema.

  1. CSV headers -> JSON Schema has an API that has worked in the past. The trivial cases of first-line-is-degenerate-schema and schema-in-a-separate-file are pleasant. The more complex case of skip-a-bunch-of-prefix-lines is a bit more involved, but isn't much of a rewrite. This recovers the original feature of handling CSV files in all their various incarnations and dialects with more formally defined schema. It means that CSV with type conversions can be handled. (A sketch of the trivial case follows this list.)
  2. Parse COBOL DDE -> JSON Schema. The COBOL parser is a bit of a hacky mess. A better lexical scanner would simplify things slightly. Because the field extraction logic will be rebuilt, we'll also have the original feature of being able to directly decode Z/OS EBCDIC files in Python.

This feels ambitious because the original design was so weak.
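
Here's a minimal sketch of the trivial first-line-is-degenerate-schema case mentioned above, assuming every column is a string:

import csv
import io

def header_to_schema(file_obj):
    """Build a degenerate JSON Schema from a CSV file's first line."""
    headers = next(csv.reader(file_obj))
    return {
        "type": "object",
        "properties": {name: {"type": "string"} for name in headers},
    }

sample = io.StringIO("fname,lname\nFirst,Last\n")
print(header_to_schema(sample))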