Tuesday, August 31, 2021

We have an ancient Python2 CGI script -- what do we do?

This was a shocking email: the people have a Python 2 CGI script. They needed advice on Python 2 to 3 migration.

Here's my advice on a Python 2 CGI script: Throw It Away.

A great deal of the CGI processing is part of the wsgi module, as well as tools like jinja and flask. This means that the ancient Python 2 CGI script has to be disentangled into two parts.

  • All the stuff that deals with CGI and HTML. This isn't valuable and must be deleted.
  • Whatever additional, useful, interesting processing it does for the various user communities. 

The second part -- the useful work -- needs to be preserved. The rest is junk.

With web services there are often at least three communities: the "interactive users", "analysts", and the administrators who keep it running. The names vary a lot with the problem domain. The interactive users may further decompose into anonymous visitors, people with privileges to make changes, and administrators to manage the privileges. There may be multiple flavors of analytical work based on the web transactions that are logged. A lot can go on, and each of these communities has a feature set they require.

The idea here is to look at the project as a rewrite where some of the legacy code may be preserved. It's better to proceed as though this is new development with the legacy code providing examples and test cases. If we look at this as new, we'll start with some diagrams to provide a definition of done.

Step One

Understand the user communities. Create a 4C Context Diagram to show who the users are and what the expect. Ideally, it's small with "users" and "administrators." It may turn out to be big with complex privilege rules to segregate users.

It's hard to get this right. Everyone wants the code "converted". But no one really knows all the things the code does. There's a lot of pressure to ignore this step.

This step creates the definition of done. Without this, there's no way to do anything with the CGI code and make sure that the original features still work.

Step Two

Create a 4C Container Diagram showing the Apache HTTPD (or whatever server you're using) that fires the CGI. Document all other ancillary things are going on. Ideally, there's nothing. Ideally, this is a minor, stand-alone server that no one noticed until today. Label this picture "As Is." It will change, but you need a checklist of what's running right now. 

(This should be very quick to produce. If it's not, go back to step one and make sure you really understand the context.)

Step Three

Create a 4C Component Diagram, and label it "As Is". This has all the parts of your code base. Be sure you locate all the things in the local site-packages directory that were added onto Python. Ideally, there isn't much, but -- of course -- there could be dozens of add-on libraries.

You will have several lists. One list has all the things in site-packages. If the PYTHONPATH environment variable is used, all the things in the directories named in this environment variable. Plus. All the things named in import statements.

These lists should overlap. Of course someone can install a package that's not used, so the site-packages list should be a superset of the import list.

This is a checklist of things that must be read (and possibly converted) to build the new features.

Step Four?

You'll need two suites of fully automated tests. 

  • Unit tests for the Python code. This must have 100% code coverage and will not be easy.
  • Integration tests for the CGI. You will be using the WSGI module instead of Apache HTTPD (or whatever the server was) for this testing. You will NOT integrate with the original web server, because, that interface is no longer supported and is a security nightmare.

Let's break this into two steps.

Step Four

You need automated unit tests. You need to reach at last 100% code coverage for the unit tests. This is going to be difficult for two reasons. First, the legacy code may not be easy to read or test. Second, Python 2 testing tools are no longer well supported. Many of them still work, but if you encounter problems, the tool will never be fixed.

If you can find a Python 2 version of coverage, and a Python 2 version of pytest, I suggest using this combination to write a test suite, and make sure you have 100% code coverage. 

This is a lot of work, and there's no way around it. Without automated testing, there's no way to prove that you're done and the software can be trusted in production.

You will find bugs. Don't fix them now. Log them by marking the test case with the proper answer different from the answer you're getting.

Step Five

Python has a built-in CGI server you can use. See https://docs.python.org/3/library/http.server.html#http.server.CGIHTTPRequestHandler for a handler that will provide core CGI features from a Python script allowing you to test without the overhead of Apache httpd or some other server.

You need an integration test suite for each user stories in the context you created in Step One. No exceptions. Each User. Each Story. A test to show that it works.

You'll likely want to use the CGIHTTPRequestHandler class in the http.server module to create a test server. You'll then create a pytest fixture that starts the web server before a test and then kills the process after the test. It's very important to use subprocess.Popen() to start and stop the target server to be sure the CGI interface works correctly.

It is common to find bugs. Don't fix them now. Log them by marking the test case with the proper answer different from the answer you're getting.

Step Six

Refactor. Now that you have automated tests to prove the legacy CGI script really works, you need to disentangle the Python code into three distinct components.

  1. A Component to parse the request: the methods, cookies, headers, and URL.
  2. A Component that does useful work. This corresponds to the "model" and "control" part of the MVC design pattern. 
  3. A Component that builds the response: the status, headers, and content. 

In many CGI scripts, there is often a hopeless jumble of bad code. Because you have tests in Step Four and Step Five, you can refactor and confirm the tests still pass.

If the code is already nicely structured, this step is easy. Don't plan on it being easy.

One goal is to eventually replace HTML page output creation with jinja. Similarly, another goal is to eventually replace parsing the request with flask. All of the remaining CGI-related features get pushed into a wsgi-compatible plug-in to a web server.

The component that does the useful work will have some underlying data model (resources, files, downloads, computations, something) and some control (post, get, different paths, queries.) We'd like to clean this up, too. For now, it can be one module.

After refactoring, you'll have a new working application. You'll have a new top-level CGI script that uses the built-in wsgi module to do request and response processing. This is temporary, but is required to pass the integration test suite. 

You may want to create an intermediate Component diagram to describe the new structure of the code.

Step Seven

Write an OpenAPI specification for the revised application. See https://swagger.io/specification/ for more information. Add the path processing so openapi.json (or openapi.yaml) will produce the specification. This means updating unit and integration tests to add this feature. 

While this is new development, it is absolutely essential for building any kind of web service. It will implement the Context diagram, and most of the Container diagram. It will describe significant portions of the Component diagram, also. It is not optional. It's very likely this was not part of the legacy application.

Some of the document structures described in the OpenAPI specification will be based on the data model and control components factored out of the legacy code. It's essential to get these details write in the OpenAPI specification and the unit tests. 

This may expose problems in the CGI's legacy behavior. Don't fix it now. Instead document the features that don't fit with modern API's. Don't be afraid to use # TODO comments to show what should be fixed.

Step Eight

Use the 2to3 tool to convert ONLY the model and control components. Do not convert request parsing and response processing components; they will be discarded. This may involve additional redesign and rewrites depending on how bad the old code was.

Convert the unit tests for ONLY the model and control components components.

Get the unit tests for the model and control to work in Python 3. This is the foundation for the new web site. Update the C4 container, component, and code diagrams. Since there's no request handling or HTML processing, don't worry about code coverage for the project as a whole. Only get the model and control to have 100% coverage.

Do not start writing view functions or HTML templates until underlying model and control module works. This is the foundation of the application. It is not tied to HTTP, but must exist and be tested independently.

Step Nine

Using Flask as a framework and the OpenAPI specification for the web application, build the view functions to exercise all the features of the application. Build Jinja templates for the HTML output. Use proper cookie management from Flask, discarding any legacy cookie management from the CGI. Use proper header parsing rules in Flask, discarding any legacy header processing.

Rewrite the remaining unit tests manually. These unit tests will now use the Flask test client. The goal is to get back to 100% code coverage.

Update the C4 container, component, and code diagrams.

Step Ten

There are untold number of ways to deploy a Flask application. Pick something simple and secure. Do some test deployments to be sure you understand how this works. As one example, you can continue to use Apache httpd. As another example, some people prefer GUnicorn, others prefer to use NGINX. There's lots of advice in the Flask project on ways to deploy Flask applications.

Do not reuse the Apache httpd and CGI interface. This was terrible. 

Step Eleven

Create a pyproject.toml file that includes a tox section so that you have a fully-automated integration capability. You can automate the CI/CD pipeline. Once the new app is in production, you can archive the old code and never use it again for anything. Ever. 

Step Twelve

Fix the bugs you found in Steps Four, Five, and Seven. You will be creating a new release with new, improved features.

tl;dr

This is a lot of work. There's no real alternative. CGI scripts need a lot of rework.

Tuesday, August 24, 2021

Spreadsheets, COBOL, and Schema-Driven File Processing

I need to rewrite Stingray Reader. This project handles a certain amount of file processing using a schema to assure the Logical Layout is understood.  It handles several common Physical Formats:

  • CSV files where the format is extended by the various dialects options.
  • COBOL files in ASCII or EBCDIC.

The project's code can be applied to text files where a regular expression can yield a row-level dictionary object. Web server log files, for example, are in first normal form, but have irregular punctuation that CSV can't handle. 

It can also be applied to NDJSON files (see http://ndjson.org or https://jsonlines.org) without too much work. This also means it can be applied to YAML files. I suspect it can also be applied to TOML files as a distinct physical format.

The complication in the Singran Reader is that COBOL files aren't really in first normal form. They can have repeating groups of fields that CSV files don't (generally) have. And the initial data model in the project wasn't really up to handling this cleanly. The repeating group logic was patched in.

Further complicating this particular project was the history of its evolution. It started as a way to grub through hellishly complex CSV files. You know, the files where there are no headings, or the headings are 8 lines long, or the files where there are a lot of lines before the proper headings for the data. It handled all of those not-first-normal-form issues that arise in CSV world.

I didn't (initially) understand JSON Schema (https://json-schema.org) and did not leverage it properly as an intermediate representation for CSV as well as COBOL layouts. It arose as a kind of after-thought. There are a lot of todo's related to applying JSON Schema to the problem. 

Recently, I learned about Lowrance USR files. See https://github.com/slott56/navtools in general and https://github.com/slott56/navtools/blob/master/navtools/lowrance_usr.py for details. 

It turns out that the USR file could be described, reasonably well, with a Stingray schema. More to the point, it should be describable by a Stingray schema, and the application to extract waypoints or routes should look a lot like a CSV reader.

Consequences

There are a bunch of things I need to do.

First, and foremost, I need to unwind some of the COBOL field extraction logic. It's a right awful mess because of the way I hacked in OCCURS DEPENDING ON. The USR files also have numerous instances of arrays with a boundary defined by other content of the file. This is a JSON Schema Extension (not a weird COBOL special case) and I need to use proper JSON schema extensions and attribute cross-references.

Of course, the OCCURS DEPENDING ON clauses can nest, leading to quite complex navigation through a dynamically-sized collection of bytes. This is not done terribly well in the current version, and involves leaving little state reminders around to "simplify" some of the coding.

The field extractions for COBOL apply to binary files and should be able to leverage the Python struct module to decode individual fields. We should be able to also extract data from USR files. The schema can be in pure JSON or it can be in Python as an internal data structure. This is a new feature and (in principle) can be applied to a variety of binary files that are in (approximately) first normal form. 

(It may also be sensible to extend the struct module to handle some EBCDIC conversions: int, float, packed-decimal, numeric string, and alphanumeric string.)

Once we can handle COBOL and USR file occurs-depending-on with some JSON Schema extensions, we can then work on ways to convert source material (including JSON Schema) to the internal representation of a schema.

  1. CSV headers -> JSON Schema has an API that has worked in the past. The trivial case of first-line-is-degenerate-schema and schema-in-a-separate-file are pleasant. The more complex cases of skip-a-bunch-of-prefix-lines is a bit more complex, but isn't much of a rewrite. This recovers the original feature of handling CSV files in all their various incarnations and dialects with more formally defined schema. It means that CSV with type conversions can be handled.
  2. Parse COBOL DDE  -> JSON Schema. The COBOL parser is a bit of a hacky mess. A better lexical scanner would simplify things slightly. Because the field extraction logic will be rebuilt, we'll also have the original feature of being able to directly decode Z/OS EBCDIC files in Python.
This feels ambitious because the original design was so weak.

Tuesday, August 17, 2021

I Have Code That Didn't Work. What Now?

I don't get many of these "I have code that doesn't work" requests. But I do see them once in a great while.

It might be something like the following two-part explanation with a following question.

Tuesday, August 10, 2021

Why Python Is Weird For C++ Developers -- Some Thoughts

See 9 Reasons Why Python Is Weird For C++ Developers.

I'm often bothered by inter-language comparisons. Mostly because programming languages -- except in the most abstract way -- aren't really very comparable. At the Turing Machine level the finite state automata are comparable, but that reductionist view (intentionally) eliminates all the expressive power from a given language.

Let's look at the reasons in some detail. A few of them actually are interesting.

  1. Whitespace. I'm dismissive of this as an interesting difference. When I read code in EVERY other programming language, I'm immediately aware that programmers can indent. Indeed, I've seen C and C++ code were {}'s were omitted, but the code was indented properly, making it devilishly hard to debug. My experience is that folks get the indentation right BEFORE the get the {}'s right.
  2. Syntax. In this article, it's the lack of {}'s. Again, I'm dismissive because I've actually helped folks learning C++ who had the indentation right and the {}'s wrong. This is ony "weird" if you're absolutely and completely convinced that {}'s are somehow a divine requirement that transcends all human attempts at interpretation. With Unicode, we're in a position to separate set membership from block-of-code and start using multiple variants on {}'s.  I'd vote for if a > b 【m = a】else 【m = b】 using 【】for code blocks.
  3. Class Variables. This points out an inherent ambiguity of C++. Most of the time, most things are not "static". They're "automatic" that is, associated with the instance. The auto keyword, however, is rarely used, and is mostly assumed. Python (outside dataclasses) is more consistent. All things inside the class statement are "static": part of the class. In the case of dataclasses, this simple rule is broken, which can be confusing. But. This wasn't mentioned.
  4. Pointer and Reference Transparency. This is simple confusion. All Python is handled by reference all the time. C++ is an absolute mess of "primitive" types that don't use references and objects that do use references. Java is just as bad. And I want to emphasize bad. Python is perfectly consistent, and -- I would suggest -- the opposite of weird. But. The article is describing things from a C++ perspective, as if C++ were somehow not weird. I suggest this isn't a great approach.
  5. Private Class Members. This is summarized as "better encapsulation and control" without a concrete example. It's hard to provide a concrete example because the Pythonic approach works so well. The only use case for "private" that I've been able to understand is when you're concealing the entire implementation from all scrutiny. That is, you have a proprietary implementation with an encrypted JAR file and you want to avoid revealing it to protect some intellectual property. Since Python is source, this can't happen, and we say "We're all adults here." Flag it with a leading _ and we'll recognize it as part of an implementation detail that might change. 
  6. Self vs. this. Not sure what this is but the phrase "only major programming language" is something that relies on Java and C++ being near the top of the TIOBE index. I suspect we can find a lot of languages that use neither "self" nor "this". I'm not sure exactly how this is weird, but, I get that it's different.
  7. Multiple Return Values. This seems like an intentional refusal to understand how tuples and tuple unpacking work. Again, this seems to make C++ the yardstick when C++ is clearly kind of weird here.
  8. No Strong Data Types. This seems like another refusal to understand Python. In this case, it feels like it's a refusal understand that objects are strongly typed in Python and variables are transient labels attached to objects. The mypy tool will try to associate a type with a variable and will warn you about a = "string" followed by a = 42. Perhaps I'm not understanding, but the portrayal of C++ rules as "not weird" seems like it's being taken too far. 
  9. No Constants. This isn't completely true. Some folks use enums to provide enumerated numeric constant values in the rare cases where this might matter. Using global variables as constants actually works out fine in practice. Most tools will look for ALL_CAPS names on the left of an = sign; and if this occurs more than once will raise a warning. If you have really stupid fellow programmers who can't understand how some variables shouldn't be reused, you can easily write a script to walk the AST looking for references to global variables and warn your colleagues that there are rules and they're not following them. This is part and parcel of the "We're all adults here" approach. If folks can't figure out how constants work, you need to collaborate more fully with other developers to help them understand this.

I'm unhappy with lifting up C++ quirks as if they're somehow really important. I don't think C++ is a terribly helpful language. The need for explicit memory management, for example, is a terrible problem. The explicit distinction between primitives and objects is also terrible.

While compare-and-contrast with Python might be helpful for C++ expatriates, I think this article has it exactly backwards. I think the following list couuld be more useful.

  • Python frees you from counting {}'s. Just indent. It's easier.
  • Python has simple rules for class/instance variables (except in the case of dataclasses and named tuples.) Also: if it starts with self. it's an instance variable.
  • Python is all references without the horrifying complexity of primitive types.
  • We're all adults here. Don't stress yourself out over privacy or constants. Document your code, instead. Write a unit test case or two. Use mypy. Use black.
  • Tuple unpacking and the fact that tuples are often implied works out very nicely to create very clean code.
  • Data types are part of the object. There's no magical "cast" capability to process a block of bytes as if they're some other type. 
These are advantages of Python. And disadvantages of C++. I think it's better to talk about what Python has than what Python lacks when measured against a terribly complex language like C++.


Tuesday, August 3, 2021

Writing Interactive Compute-Intensive Programs for Web Browsers

Fascinating. The reference to the classic Mac OS with non-preemptive multi-tasking is quite cool. The concept fits nicely with Python's async/await coroutines that need to collaborate with a periodic OS request to permit interaction with streams of events from another source (i.e., a foreground window.)