Tuesday, September 14, 2021

Found an ancient cgi script -- part III -- refactoring

Be sure to see the original script and the test cases in the prior posts.

We need to understand a little about what a web request is. This can help us do the refactoring.

It can help to think of a web server as a function that maps a request to a response. The request is really a composite object with headers, \(h\), a method verb, \(v\), and a URL, \(u\). Similarly, the response is a composite with headers, \(h\), and content, \(c\).

$$h, c = s(h, v, u)$$

The above is true for idempotent requests; the method verb is usually GET.

Some requests make a state change, however, and use method verbs like POST, PUT, PATCH, or DELETE.

$$h, c; \hat S = s(h, v, u; S)$$

There's a state,  \(S\), which is transformed to a new state, \(\hat S\), as part of making the request.

For the most part, CGI scripts are limited to GET and POST methods. The GET method is (ideally) for idempotent, no-state-change requests. The POST should be limited to making state changes. In some cases, there will be an explicit POST-then-GET sequence of operations using an intermediate redirect (the Post/Redirect/Get pattern) so the browser's "back" button works properly. 

In too many cases, the rules aren't followed well, and there will be state transitions on GET requests and idempotent POST operations. Sigh.
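
In Python terms, here's a minimal sketch of the server-as-function idea. The names are illustrative, not from any real script:

def server(headers, verb, url, state):
    # h, c; S-hat = s(h, v, u; S)
    if verb == "GET":
        # idempotent: the state passes through unchanged
        return {"Status": "200 OK"}, "representation of " + url, state
    # state-changing verbs produce a new state
    new_state = dict(state)
    new_state[url] = "created"
    return {"Status": "201 CREATED"}, "created " + url, new_state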

Multiple Resources

Most web servers will provide content for a number of resource instances. Often they will work with a number of instances of a variety of resource types. The degenerate case is a server providing content for a single instance of a single type.

Each resource comes from the server's universe of resources, \(R\).

$$r \in R$$

Each resource type, \(t(r)\), is part of some overall collection of types that describe the various resources. In some cases, we'll identify a resource with a path that includes the type of the resource, \(t(r)\), and an identifier within that type, \(i(r)\): \(\langle t(r), i(r) \rangle\). This often maps to a character string "type/name" that's part of a URL's path.

We can think of a response's content as the HTML markup, \(m_h\), around a resource, \(r\), managed by the web server.

$$ c = m_h(r) $$

This is a representation of the resource's state. The HTML representation can have both semantic and style components. We might, for example, have a number of HTML structure elements like <p>, as well as CSS styles. Ideally, the styles don't convey semantic information, but the HTML tags do.

Multiple Services

There are often multiple, closely-related services within a web server. A common design pattern is to have services that vary based on a path item, \(p(u)\), within the url.

$$ h, m_h(r); \hat S = s(h, v, u; S) = \begin{cases} s_x(h, v, u; S) & \text{if } p(u) = x \\ s_y(h, v, u; S) & \text{if } p(u) = y \end{cases} $$

There isn't, of course, any formal requirement for a tidy mapping from some element of the path, \(p(u)\), to a type, \(t(r)\), that characterizes a resource, \(r\). Utter chaos is allowed. Thankfully, it's not common.
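
Here's a minimal sketch of the dispatch, with hypothetical service functions standing in for \(s_x\) and \(s_y\):

def p(url):
    # the distinguishing path element, p(u)
    return url.split("/")[1]

def service_x(headers, verb, url, state):
    return {"Status": "200 OK"}, "<p>x</p>", state

def service_y(headers, verb, url, state):
    return {"Status": "200 OK"}, "<p>y</p>", state

HANDLERS = {"x": service_x, "y": service_y}

def service(headers, verb, url, state):
    # choose the sub-service based on p(u)
    return HANDLERS[p(url)](headers, verb, url, state)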

While there may not be a tidy type-based mapping, there must be a mapping from a request triple and a state, \(\langle h, u, v; S \rangle\), to a resource, \(r\). This mapping can be considered a database or filesystem query, \(q(\langle h, u, v; S \rangle, r)\). The request may also involve a state change. It can help to think of the state as a function that can emit a new state for a request. This implies two low-level processing concepts:

$$ \begin{aligned} &\{ r \in R \mid q(\langle h, u, v; S \rangle, r) \} \\ &\hat S = S(\langle h, u, v \rangle) \end{aligned} $$

The query processing to locate resources is one aspect of the underlying model. The state change for the universe of resources is another aspect of the underlying model. Each request must return a resource; it may also make a state change.

What's essential, then, is to see how these various \(s_x\) functions are related to the original code. The \(m_h(r)\) function, the \(p(u)\) mappings, and the \(s_{p(u)}(h, v, u; S)\) functions are all separate features that can be disentangled from each other.

Why All The Math?

We need to be utterly ruthless about separating several things that are often jumbled together.

  • A web server works with a universe of resources. These can be filesystem objects, database rows, external web services, anything. 
  • Resources have an internal state. Resources may also have internal types (or classes) to define common features.
  • There's at least one function to create an HTML representation of state. This may be partial or ambiguous. It may also be complete and unambiguous.
  • There is at least one function to map a URL to zero or more resources. This can (and often does) result in 404 errors because a resource cannot be found.
  • There may be a function to create a server state from the existing server state and a request. This can result in 403 errors because an operation is forbidden.

Additionally, there can be user authentication and authorization rules. The users are simply resources. Authentication is simply a query to locate a user. It may involve using the password as part of the user lookup. Users can have roles. Authorization is a property of a user's role required by a specific query or state change (or both.)

As we noted in the overview, the HTML representation of state is handled (entirely) by Jinja. HTML templates are used. Any non-Jinja HTML processing in legacy CGI code can be deleted.

The mapping from URL to resource may involve several steps. In Flask, some of these steps are handled by the mapping from a URL to a view function. This is often used to partition resources by type. Within a view function, individual resources will be located based on URL mapping.
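
Here's a sketch of what that looks like in Flask. The route paths come from the example script; the view names are assumptions:

from flask import Flask

app = Flask(__name__)

@app.route("/resources/<type_name>", methods=["GET"])
def new_form(type_name):
    # the type partition: present the creation form for one type
    return f"form for {type_name}"

@app.route("/resources/<type_name>/<name>", methods=["GET"])
def get_resource(type_name, name):
    # locate an individual resource within the type
    return f"resource {type_name}/{name}"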

What do we do?

In our example code, we have a great deal of redundant HTML processing. One sensible option is to separate all of the HTML printing into one or more functions that emit the various kinds of pages.

In our example, the parsing of the path is a single, long nested bunch of if-elif processing. This should be refactored into individual functions. A single, top-level function can decide what the URL pattern and verb mean, and then delegate the processing to a view function. The view function can then use an HTML rendering function to build the resulting page.

One family of URLs results in presentation of a form. Another family of URLs processes the form input. The form data leads to a resource with internal state. The form content should be used to define a Python class. A separate class should read and write files with these Python objects. The forms should be defined at a high level using a module like WTForms, as sketched below.
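
A sketch of the form declared with WTForms, assuming the two fields (fname and lname) from the legacy HTML:

from wtforms import Form, StringField, validators

class NameForm(Form):
    # the two fields hard-coded in the legacy HTML
    fname = StringField("First name", [validators.InputRequired()])
    lname = StringField("Last name", [validators.InputRequired()])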

When rewriting, I find it helps to keep several things separated:

  • A class for the individual resource objects.
  • A  form that is one kind of serialization of the resource objects.
  • An HTML page that is another kind of serialization of the resource objects.

While these things are related very closely, they are not isomorphic to each other. Objects may have implementation details or derived values that should not be trivially shown on a form or HTML page.

In our example, the form only has two fields. These should be properly described in a class. The field objects have different types. The types should also be modeled more strictly, not treated casually as a piece of a file path. (What happens if we use a type name of "this/that"?)
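
A sketch of the resource class behind that form; the class name is a placeholder:

from dataclasses import dataclass

@dataclass
class Person:
    """The resource's internal state: the two form fields."""
    fname: str
    lname: str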

Persistent state change is handled with filesystem updates. These, too, are treated informally, without a class to encapsulate the valid operations and reject invalid ones.
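
Here's a sketch of a class to encapsulate those filesystem operations; the name and the validation rule are assumptions:

import json
import uuid
from pathlib import Path

class ResourceStore:
    def __init__(self, base="data"):
        self.base = Path(base)

    def create(self, type_name, document):
        # reject path injection like "this/that"
        if not type_name.isidentifier():
            raise ValueError("invalid type name: " + type_name)
        directory = self.base / type_name
        directory.mkdir(parents=True, exist_ok=True)
        name = str(uuid.uuid4())
        (directory / name).write_text(json.dumps(document))
        return name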

Some Examples

Here is one of the HTML output functions.

def html_post_response(type_name, name, data):
    print "Status: 201 CREATED"
    print "Content-Type: text/html"
    print
    print "<!DOCTYPE html>"
    print "<html>"
    print "<head><title>Created New %s</title></head>" % type_name
    print "<body>"
    print "<h1>Created New %s</h1>" % type_name
    print "<p>Path: %s/%s</p>" % (type_name, name)
    print "<p>Content: </p><pre>"
    print data
    print "</pre>"
    # cgi.print_environ()
    print "</body>"
    print "</html>"

There are several functions like this. We aren't wasting any time optimizing all these functions. We're simply segregating them from the rest of the processing. There's a huge amount of redundancy; we'll fix this when we start using Jinja templates.
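
For a preview, here's a sketch of the same page built with a Jinja template instead of print statements:

from jinja2 import Template

POST_RESPONSE = Template("""\
<!DOCTYPE html>
<html>
<head><title>Created New {{ type_name }}</title></head>
<body>
<h1>Created New {{ type_name }}</h1>
<p>Path: {{ type_name }}/{{ name }}</p>
<p>Content: </p><pre>{{ data }}</pre>
</body>
</html>
""")

def html_post_content(type_name, name, data):
    return POST_RESPONSE.render(type_name=type_name, name=name, data=data)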

Here's the revised main() function.

def main():
    try:
        os.mkdir("data")
    except OSError:
        pass

    path_elements = os.environ["PATH_INFO"].split("/")
    if path_elements[0] == "" and path_elements[1] == "resources":
        if os.environ["REQUEST_METHOD"] == "POST":
            type_name = path_elements[2]
            base = os.path.join("data", type_name)
            try:
                os.mkdir(base)
            except OSError:
                pass
            name = str(uuid.uuid4())
            full_name = os.path.join(base, name)
            data = cgi.parse(sys.stdin)
            output_file = open(full_name, 'w')
            output_file.write(repr(data))
            output_file.write('\n')
            output_file.close()
            html_post_response(type_name, name, data)

        elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 3:
            type_name = path_elements[2]
            html_get_form_response(type_name)

        elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 4:
            type_name = path_elements[2]
            resource_name = path_elements[3]
            full_name = os.path.join("data", type_name, resource_name)
            input_file = open(full_name, 'r')
            content = input_file.read()
            input_file.close()
            html_get_response(type_name, resource_name, content)

        else:
            html_error_403_response(path_elements)
    else:
        html_error_404_response(path_elements)

This has the HTML output fully segregated from the rest of the processing. We can now see the request parsing and the model processing more clearly. This lets us move further and refactor into yet smaller and more focused functions. We can see file system updates and file path creation as part of the underlying model. 

Since these examples are contrived, the processing is essentially a repr() function call. Not too interesting, but the point is to identify this clearly by refactoring the application to expose it.

Summary

When we start to define the classes to properly model the persistent objects and their state, we'll see that there are zero lines of legacy code that we can keep. 

Zero lines of legacy code have enduring value.

This is not unusual. Indeed, I think it's remarkably common.

Reworking a CGI application should not be called a "migration."

  • There is no "migration" of code from Python 2 to Python 3. The Python 2 code is (almost) entirely useless except to explain the use cases.
  • There is no "migration" of code from CGI to some better framework. Flask (and any of the other web frameworks) are nothing like CGI scripts.

The functionality should be completely rewritten into Python 3 and Flask. The processing concept is preserved. The data is preserved. The code is not preserved.

In some projects, where there are proper classes defined, there may be some code that can be preserved. However, a Python dataclass may do everything a more complex Python 2 class definition does with a lot less code. The Python 2 code is not sacred. Code should not be preserved because someone thinks it might reduce cost or risk.

The old code is useful for three things.

  • Define the unit test cases.
  • Define the integration test cases.
  • Answer questions about edge cases when writing new code.

This means we won't be using the 2to3 tool to convert any of the code. 

It also means the unit test cases are the new definition of the project. These are the single most valuable part of the work. Given test cases that describe the old application, writing the new app using Flask is relatively easy.

Tuesday, September 7, 2021

Found an ancient cgi script -- part II -- testing

See "We have an ancient Python2 CGI script -- what do we do?" The previous post in this series provides an overview of the process of getting rid of legacy code. 

Here's some code. I know it's painfully long; the point is to provide a super-specific, very concrete example of what to keep and what to discard. (I've omitted the module docstring and the imports.)

try:
    os.mkdir("data")
except OSError:
    pass

path_elements = os.environ["PATH_INFO"].split("/")
if path_elements[0] == "" and path_elements[1] == "resources":
    if os.environ["REQUEST_METHOD"] == "POST":
        type_name = path_elements[2]
        base = os.path.join("data", type_name)
        try:
            os.mkdir(base)
        except OSError:
            pass
        name = str(uuid.uuid4())
        full_name = os.path.join(base, name)
        data = cgi.parse(sys.stdin)
        output_file = open(full_name, 'w')
        output_file.write(repr(data))
        output_file.write('\n')
        output_file.close()

        print "Status: 201 CREATED"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Created New %s</title></head>" % type_name
        print "<body>"
        print "<h1>Created New %s</h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, name)
        print "<p>Content: </p><pre>"
        print data
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 3:
        type_name = path_elements[2]
        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Query %s</title></head>" % (type_name,)
        print "<body><h1>Create new instance of <tt>%s</tt></h1>" % type_name
        print '<form action="/cgi-bin/example.py/resources/%s" method="POST">' % (type_name,)
        print """
          <label for="fname">First name:</label>
          <input type="text" id="fname" name="fname"><br><br>
          <label for="lname">Last name:</label>
          <input type="text" id="lname" name="lname"><br><br>
          <input type="submit" value="Submit">
        """
        print "</form>"
        # cgi.print_environ()
        print "</body>"
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 4:
        type_name = path_elements[2]
        resource_name = path_elements[3]
        full_name = os.path.join("data", type_name, resource_name)
        input_file = open(full_name, 'r')
        content = input_file.read()
        input_file.close()

        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Document %s -- %s</title></head>" % (type_name, resource_name)
        print "<body><h1>Instance of <tt>%s</tt></h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, resource_name)
        print "<p>Content: </p><pre>"
        print content
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    else:
        print "Status: 403 Forbidden"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Forbidden: %s to %s</title></head>"  % (os.environ["REQUEST_METHOD"], path_elements)
        cgi.print_environ()
        print "</html>"
else:
    print "Status: 404 Not Found"
    print "Content-Type: text/html"
    print                               # blank line, end of headers
    print "<!DOCTYPE html>"
    print "<html>"
    print "<head><title>Not Found: %s</title></head>" % (os.environ["PATH_INFO"], )
    print "<h1>Error</h1>"
    print "<b>Resource <tt>%s</tt> not found</b>" % (os.environ["PATH_INFO"], )
    cgi.print_environ()
    print "</html>"

At first glance you might notice (1) there are several resource types located on the URL path, and (2) there are several HTTP methods, also. These features aren't always obvious in a CGI script, and it's one of the reasons why CGI is simply horrible. 

It's not clear from this what -- exactly -- the underlying data model is and what processing is done and what parts are merely CGI and HTML overheads.

This is why refactoring this code is absolutely essential to replacing it.

And.

We can't refactor without test cases.

And (bonus).

We can't have test cases without some vague idea of what this thing purports to do.

Let's tackle this in order. Starting with test cases.

Unit Test Cases

We can't unit test this.

As written, it's a top-level script without so much as a single def or class. This style of programming -- while legitimate Python -- is an epic fail when it comes to testing.

Step 1, then, is to refactor a script file into a module with function(s) or class(es) that can be tested.

def main():
    ... the original script ... 

if __name__ == "__main__":  # pragma: no cover
    main()

For proper testability, there can be at most these two lines of code that are not easily tested. These two (and only these two) are marked with a special comment (# pragma: no cover) so the coverage tool can politely ignore the fact that we won't try to test these two lines.

We can now provide os.environ values that look like a CGI request, and exercise this script with concrete unit test cases.

How many things does it do?

Reading the code is headache-inducing, so a fall-back plan is to count the number of logic paths. Look at the if/elif blocks and count them without thinking too deeply about why the code looks the way it looks.

There appear to be five distinct behaviors. Since there are possibilities of unhandled exceptions, there may be as many as 10 things this will do in production.

This leads to a unit test that looks like the following:

import unittest
import urllib
import example_2
import os
import io
import sys

class MyTestCase(unittest.TestCase):
    def setUp(self):
        self.cwd = os.getcwd()
        try:
            os.mkdir("test_path")
        except OSError:
            pass
        os.chdir("test_path")
        self.output = io.BytesIO()
        sys.stdout = self.output
    def tearDown(self):
        sys.stdout = sys.__stdout__
        sys.stdin = sys.__stdin__
        os.chdir(self.cwd)
    def test_path_1(self):
        """No /resources in path"""
        os.environ["PATH_INFO"] = "/not/valid"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 404 Not Found")
    def test_path_2(self):
        """Path /resources but bad method"""
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 403 Forbidden")
    def test_path_3(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "GET"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 200 OK")
        self.assertIn("<form ", out)
    def test_path_5(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "POST"
        os.environ["CONTENT_TYPE"] = "application/x-www-form-urlencoded"
        content = urllib.urlencode({"field1": "value1", "field2": "value2"})
        form_data = io.BytesIO(content)
        os.environ["CONTENT_LENGTH"] = str(len(content))
        sys.stdin = form_data
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 201 CREATED")
        self.assertIn("'field2': ['value2']", out)
        self.assertIn("'field1': ['value1']", out)


if __name__ == '__main__':
    unittest.main()

Does this have 100% code coverage? I'll leave it to the reader to copy-and-paste, add the coverage run command and look at the output. What else is required?
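
Assuming the tests live in a file named test_example_2.py, the commands look something like this:

coverage run -m unittest test_example_2
coverage report -m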

Integration Test Case

We can (barely) do an integration test on this. It's tricky because we don't want to run Apache httpd (or some other server.) We want to run a small Python script to be sure this works.

This means we need to (1) start a server as a separate process, and (2) use urllib to send requests to that separate process. This isn't too difficult. Right now, it's not obviously required. The test cases above run the entire script from end to end, providing what we think are appropriate mock values. Emphasis on "what we think." To be sure, we'll need to actually fire up a separate process. 

As with the unit tests, we need to enumerate all of the expected behaviors. 

Unlike the unit tests, there are (generally) fewer edge cases.

It looks like this.

import unittest
import subprocess
import time
import urllib2

class TestExample_2(unittest.TestCase):
    def setUp(self):
        self.proc = subprocess.Popen(
            ["python2.7", "mock_httpd.py"],
            cwd="previous"
        )
        time.sleep(0.25)
    def tearDown(self):
        self.proc.kill()
        time.sleep(0.1)
    def test(self):
        req = urllib2.Request("http://localhost:8000/cgi-bin/example.py/resources/example")
        result = urllib2.urlopen(req)
        self.assertEqual(result.getcode(), 200)
        self.assertEqual(set(result.info().keys()), set(['date', 'status', 'content-type', 'server']))
        content = result.read()
        self.assertEqual(content.splitlines()[0], "<!DOCTYPE html>")
        self.assertIn("<form ", content)

if __name__ == '__main__':
    unittest.main()

This will start a separate server process and then make a request to it. After the request, it kills the subprocess. 

We've only covered one of the behaviors. A bunch more test cases are required. They're all going to be reasonably similar to the test() method.

Note the mock_httpd.py script. It's a tiny thing that invokes CGIs.

import CGIHTTPServer
import BaseHTTPServer

server_class = BaseHTTPServer.HTTPServer
handler_class = CGIHTTPServer.CGIHTTPRequestHandler

server_address = ('', 8000)
httpd = server_class(server_address, handler_class)
httpd.serve_forever()

This will run any script file in the cgi-bin directory, acting as a kind of mock for Apache httpd or other CGI servers.

Tests Pass, Now What?

We need to formalize our knowledge with some diagrams. This is a Context diagram in PlantUML. It draws a picture that we can use to discuss what this app does and who actually uses it.

@startuml
actor user
usecase post
usecase query
usecase retrieve
user --> post
user --> query
user --> retrieve

usecase 404_not_found
usecase 403_not_permitted
user --> 404_not_found
user --> 403_not_permitted

retrieve <|-- 404_not_found
@enduml

We can also update the Container diagram. There's an "as-is" version and a "to-be" version.

Here's the as-is diagram of any CGI.

@startuml
interface HTTP

node "web server" {
    component httpd as  "Apache httpd"
    interface cgi
    component app
    component python
    python --> app
    folder data
    app --> data
}

HTTP --> httpd
httpd -> cgi
cgi -> python
@enduml

Here's a to-be diagram of a typical (small) Flask application. 

@startuml
interface HTTP

node "web server" {
    component httpd as  "nginx"
    component uwsgi
    interface wsgi
    component python
    component app
    component model
    component flask
    component jinja
    folder data
    folder static
    httpd --> static
    python --> wsgi
    wsgi --> app
    app --> flask
    app --> jinja
    app -> model
    model --> data
}

HTTP --> httpd
httpd -> uwsgi
uwsgi -> python
@enduml

These diagrams can help to clarify how the CGI will be restructured. A complex CGI might have a database or external web services involved. These should be correctly depicted.

The previous post on this subject said we can now refactor this code. The unit tests are required before making any real changes. (Yes, we made one change to promote testability by repackaging a script to be a function.)

We're aiming to start disentangling the HTML and CGI overheads from the application and narrowing our focus onto the useful things it does.


Tuesday, August 31, 2021

We have an ancient Python2 CGI script -- what do we do?

This was a shocking email: these people have a Python 2 CGI script. They needed advice on Python 2 to 3 migration.

Here's my advice on a Python 2 CGI script: Throw It Away.

A great deal of the CGI processing is part of the standard library's wsgiref module, as well as tools like jinja and flask. This means that the ancient Python 2 CGI script has to be disentangled into two parts.

  • All the stuff that deals with CGI and HTML. This isn't valuable and must be deleted.
  • Whatever additional, useful, interesting processing it does for the various user communities. 

The second part -- the useful work -- needs to be preserved. The rest is junk.

With web services there are often at least three communities: the "interactive users", "analysts", and the administrators who keep it running. The names vary a lot with the problem domain. The interactive users may further decompose into anonymous visitors, people with privileges to make changes, and administrators to manage the privileges. There may be multiple flavors of analytical work based on the web transactions that are logged. A lot can go on, and each of these communities has a feature set they require.

The idea here is to look at the project as a rewrite where some of the legacy code may be preserved. It's better to proceed as though this is new development with the legacy code providing examples and test cases. If we look at this as new, we'll start with some diagrams to provide a definition of done.

Step One

Understand the user communities. Create a C4 Context Diagram to show who the users are and what they expect. Ideally, it's small, with "users" and "administrators." It may turn out to be big, with complex privilege rules to segregate users.

It's hard to get this right. Everyone wants the code "converted". But no one really knows all the things the code does. There's a lot of pressure to ignore this step.

This step creates the definition of done. Without this, there's no way to do anything with the CGI code and make sure that the original features still work.

Step Two

Create a C4 Container Diagram showing the Apache HTTPD (or whatever server you're using) that fires the CGI. Document all the other ancillary things that are going on. Ideally, there's nothing. Ideally, this is a minor, stand-alone server that no one noticed until today. Label this picture "As Is." It will change, but you need a checklist of what's running right now. 

(This should be very quick to produce. If it's not, go back to step one and make sure you really understand the context.)

Step Three

Create a C4 Component Diagram, and label it "As Is". This has all the parts of your code base. Be sure you locate all the things in the local site-packages directory that were added onto Python. Ideally, there isn't much, but -- of course -- there could be dozens of add-on libraries.

You will have several lists. One list has all the things in site-packages. If the PYTHONPATH environment variable is used, another list has all the things in the directories named in this environment variable. Plus all the things named in import statements.

These lists should overlap. Of course someone can install a package that's not used, so the site-packages list should be a superset of the import list.

This is a checklist of things that must be read (and possibly converted) to build the new features.

Step Four?

You'll need two suites of fully automated tests. 

  • Unit tests for the Python code. This must have 100% code coverage and will not be easy.
  • Integration tests for the CGI. You will be using Python's built-in http.server module instead of Apache HTTPD (or whatever the server was) for this testing. You will NOT integrate with the original web server because that interface is no longer supported and is a security nightmare.

Let's break this into two steps.

Step Four

You need automated unit tests. You need to reach 100% code coverage for the unit tests. This is going to be difficult for two reasons. First, the legacy code may not be easy to read or test. Second, Python 2 testing tools are no longer well supported. Many of them still work, but if you encounter problems, the tool will never be fixed.

If you can find a Python 2 version of coverage, and a Python 2 version of pytest, I suggest using this combination to write a test suite, and make sure you have 100% code coverage. 

This is a lot of work, and there's no way around it. Without automated testing, there's no way to prove that you're done and the software can be trusted in production.

You will find bugs. Don't fix them now. Log them by marking the test case with the proper answer, noting that it differs from the answer you're currently getting.

Step Five

Python has a built-in CGI server you can use. See https://docs.python.org/3/library/http.server.html#http.server.CGIHTTPRequestHandler for a handler that will provide core CGI features from a Python script allowing you to test without the overhead of Apache httpd or some other server.

You need an integration test suite for each user story in the context you created in Step One. No exceptions. Each User. Each Story. A test to show that it works.

You'll likely want to use the CGIHTTPRequestHandler class in the http.server module to create a test server. You'll then create a pytest fixture that starts the web server before a test and then kills the process after the test. It's very important to use subprocess.Popen() to start and stop the target server to be sure the CGI interface works correctly.
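
Here's a sketch of such a fixture, assuming a mock_httpd.py server like the one shown in the previous post:

import subprocess
import sys
import time
import urllib.request

import pytest

@pytest.fixture
def cgi_server():
    # start the CGI-capable server in a separate process
    proc = subprocess.Popen([sys.executable, "mock_httpd.py"])
    time.sleep(0.25)  # give the server a moment to bind the port
    yield proc
    proc.kill()

def test_get_form(cgi_server):
    response = urllib.request.urlopen(
        "http://localhost:8000/cgi-bin/example.py/resources/example")
    assert response.getcode() == 200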

It is common to find bugs. Don't fix them now. Log them by marking the test case with the proper answer, noting that it differs from the answer you're currently getting.

Step Six

Refactor. Now that you have automated tests to prove the legacy CGI script really works, you need to disentangle the Python code into three distinct components.

  1. A Component to parse the request: the methods, cookies, headers, and URL.
  2. A Component that does useful work. This corresponds to the "model" and "control" part of the MVC design pattern. 
  3. A Component that builds the response: the status, headers, and content. 

In many CGI scripts, there is often a hopeless jumble of bad code. Because you have tests in Step Four and Step Five, you can refactor and confirm the tests still pass.

If the code is already nicely structured, this step is easy. Don't plan on it being easy.

One goal is to eventually replace HTML page output creation with jinja. Similarly, another goal is to eventually replace parsing the request with flask. All of the remaining CGI-related features get pushed into a wsgi-compatible plug-in to a web server.

The component that does the useful work will have some underlying data model (resources, files, downloads, computations, something) and some control (post, get, different paths, queries.) We'd like to clean this up, too. For now, it can be one module.

After refactoring, you'll have a new working application. You'll have a new top-level CGI script that uses the built-in wsgiref module to do request and response processing. This is temporary, but is required to pass the integration test suite. 
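
A sketch of that temporary top-level script, using wsgiref's CGI handler to run a WSGI application (the my_app import is hypothetical):

from wsgiref.handlers import CGIHandler

from my_app import app  # hypothetical WSGI application

CGIHandler().run(app)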

You may want to create an intermediate Component diagram to describe the new structure of the code.

Step Seven

Write an OpenAPI specification for the revised application. See https://swagger.io/specification/ for more information. Add the path processing so openapi.json (or openapi.yaml) will produce the specification. This means updating unit and integration tests to add this feature. 
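
A sketch of the path processing in Flask; the SPECIFICATION shown here is a placeholder, not a complete document:

from flask import Flask, jsonify

app = Flask(__name__)

SPECIFICATION = {
    "openapi": "3.0.3",
    "info": {"title": "Resources", "version": "1.0.0"},
    "paths": {},  # to be filled in from the legacy behaviors
}

@app.route("/openapi.json")
def openapi_json():
    return jsonify(SPECIFICATION)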

While this is new development, it is absolutely essential for building any kind of web service. It will implement the Context diagram, and most of the Container diagram. It will describe significant portions of the Component diagram, also. It is not optional. It's very likely this was not part of the legacy application.

Some of the document structures described in the OpenAPI specification will be based on the data model and control components factored out of the legacy code. It's essential to get these details right in the OpenAPI specification and the unit tests. 

This may expose problems in the CGI's legacy behavior. Don't fix it now. Instead document the features that don't fit with modern API's. Don't be afraid to use # TODO comments to show what should be fixed.

Step Eight

Use the 2to3 tool to convert ONLY the model and control components. Do not convert request parsing and response processing components; they will be discarded. This may involve additional redesign and rewrites depending on how bad the old code was.

Convert the unit tests for ONLY the model and control components.

Get the unit tests for the model and control to work in Python 3. This is the foundation for the new web site. Update the C4 container, component, and code diagrams. Since there's no request handling or HTML processing, don't worry about code coverage for the project as a whole. Only get the model and control to have 100% coverage.

Do not start writing view functions or HTML templates until the underlying model and control module works. This is the foundation of the application. It is not tied to HTTP, but must exist and be tested independently.

Step Nine

Using Flask as a framework and the OpenAPI specification for the web application, build the view functions to exercise all the features of the application. Build Jinja templates for the HTML output. Use proper cookie management from Flask, discarding any legacy cookie management from the CGI. Use proper header parsing rules in Flask, discarding any legacy header processing.

Rewrite the remaining unit tests manually. These unit tests will now use the Flask test client. The goal is to get back to 100% code coverage.
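
The revised unit tests look something like this sketch; the application module name, my_app, is an assumption:

import pytest
from my_app import app  # hypothetical application module

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as client:
        yield client

def test_get_form(client):
    response = client.get("/resources/example")
    assert response.status_code == 200
    assert b"<form " in response.data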

Update the C4 container, component, and code diagrams.

Step Ten

There are untold numbers of ways to deploy a Flask application. Pick something simple and secure. Do some test deployments to be sure you understand how this works. As one example, you can continue to use Apache httpd. As another example, some people prefer Gunicorn; others prefer to use NGINX. There's lots of advice in the Flask project on ways to deploy Flask applications.

Do not reuse the Apache httpd and CGI interface. This was terrible. 

Step Eleven

Create a pyproject.toml file that includes a tox section so that you have a fully-automated integration capability. You can automate the CI/CD pipeline. Once the new app is in production, you can archive the old code and never use it again for anything. Ever. 
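
One way to spell this is tox's embedded INI support in pyproject.toml; the environment list and commands here are placeholders:

[tool.tox]
legacy_tox_ini = """
[tox]
envlist = py310

[testenv]
deps =
    pytest
    flask
commands = pytest tests
"""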

Step Twelve

Fix the bugs you found in Steps Four, Five, and Seven. You will be creating a new release with new, improved features.

tl;dr

This is a lot of work. There's no real alternative. CGI scripts need a lot of rework.

Tuesday, August 24, 2021

Spreadsheets, COBOL, and Schema-Driven File Processing

I need to rewrite Stingray Reader. This project handles a certain amount of file processing using a schema to assure the Logical Layout is understood.  It handles several common Physical Formats:

  • CSV files where the format is extended by the various dialect options.
  • COBOL files in ASCII or EBCDIC.

The project's code can be applied to text files where a regular expression can yield a row-level dictionary object. Web server log files, for example, are in first normal form, but have irregular punctuation that CSV can't handle. 

It can also be applied to NDJSON files (see http://ndjson.org or https://jsonlines.org) without too much work. This also means it can be applied to YAML files. I suspect it can also be applied to TOML files as a distinct physical format.

The complication in the Stingray Reader is that COBOL files aren't really in first normal form. They can have repeating groups of fields that CSV files don't (generally) have. And the initial data model in the project wasn't really up to handling this cleanly. The repeating group logic was patched in.

Further complicating this particular project was the history of its evolution. It started as a way to grub through hellishly complex CSV files. You know, the files where there are no headings, or the headings are 8 lines long, or the files where there are a lot of lines before the proper headings for the data. It handled all of those not-first-normal-form issues that arise in CSV world.

I didn't (initially) understand JSON Schema (https://json-schema.org) and did not leverage it properly as an intermediate representation for CSV as well as COBOL layouts. It arose as a kind of after-thought. There are a lot of todo's related to applying JSON Schema to the problem. 

Recently, I learned about Lowrance USR files. See https://github.com/slott56/navtools in general and https://github.com/slott56/navtools/blob/master/navtools/lowrance_usr.py for details. 

It turns out that the USR file could be described, reasonably well, with a Stingray schema. More to the point, it should be describable by a Stingray schema, and the application to extract waypoints or routes should look a lot like a CSV reader.

Consequences

There are a bunch of things I need to do.

First, and foremost, I need to unwind some of the COBOL field extraction logic. It's a right awful mess because of the way I hacked in OCCURS DEPENDING ON. The USR files also have numerous instances of arrays with a boundary defined by other content of the file. This is a JSON Schema Extension (not a weird COBOL special case) and I need to use proper JSON schema extensions and attribute cross-references.

Of course, the OCCURS DEPENDING ON clauses can nest, leading to quite complex navigation through a dynamically-sized collection of bytes. This is not done terribly well in the current version, and involves leaving little state reminders around to "simplify" some of the coding.

The field extractions for COBOL apply to binary files and should be able to leverage the Python struct module to decode individual fields. We should be able to also extract data from USR files. The schema can be in pure JSON or it can be in Python as an internal data structure. This is a new feature and (in principle) can be applied to a variety of binary files that are in (approximately) first normal form. 
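
For example, a count-prefixed array -- the binary analog of OCCURS DEPENDING ON -- might be decoded with a sketch like this:

import struct

def decode_array(buffer: bytes) -> list[float]:
    """A little-endian uint32 count, followed by that many float64 values."""
    count, = struct.unpack_from("<I", buffer, 0)
    return list(struct.unpack_from(f"<{count}d", buffer, 4))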

(It may also be sensible to extend the struct module to handle some EBCDIC conversions: int, float, packed-decimal, numeric string, and alphanumeric string.)

Once we can handle COBOL and USR file occurs-depending-on with some JSON Schema extensions, we can then work on ways to convert source material (including JSON Schema) to the internal representation of a schema.

  1. CSV headers -> JSON Schema has an API that has worked in the past. The trivial case of first-line-is-degenerate-schema and schema-in-a-separate-file are pleasant. The more complex cases of skip-a-bunch-of-prefix-lines is a bit more complex, but isn't much of a rewrite. This recovers the original feature of handling CSV files in all their various incarnations and dialects with more formally defined schema. It means that CSV with type conversions can be handled.
  2. Parse COBOL DDE  -> JSON Schema. The COBOL parser is a bit of a hacky mess. A better lexical scanner would simplify things slightly. Because the field extraction logic will be rebuilt, we'll also have the original feature of being able to directly decode Z/OS EBCDIC files in Python.

This feels ambitious because the original design was so weak.

Tuesday, August 17, 2021

I Have Code That Didn't Work. What Now?

I don't get many of these "I have code that doesn't work" requests. But I do see them once in a great while.

It might be something like the following two-part explanation with a following question.

Tuesday, August 10, 2021

Why Python Is Weird For C++ Developers -- Some Thoughts

See 9 Reasons Why Python Is Weird For C++ Developers.

I'm often bothered by inter-language comparisons. Mostly because programming languages -- except in the most abstract way -- aren't really very comparable. At the Turing Machine level the finite state automata are comparable, but that reductionist view (intentionally) eliminates all the expressive power from a given language.

Let's look at the reasons in some detail. A few of them actually are interesting.

  1. Whitespace. I'm dismissive of this as an interesting difference. When I read code in EVERY other programming language, I'm immediately aware that programmers can indent. Indeed, I've seen C and C++ code where {}'s were omitted, but the code was indented properly, making it devilishly hard to debug. My experience is that folks get the indentation right BEFORE they get the {}'s right.
  2. Syntax. In this article, it's the lack of {}'s. Again, I'm dismissive because I've actually helped folks learning C++ who had the indentation right and the {}'s wrong. This is only "weird" if you're absolutely and completely convinced that {}'s are somehow a divine requirement that transcends all human attempts at interpretation. With Unicode, we're in a position to separate set membership from block-of-code and start using multiple variants on {}'s.  I'd vote for if a > b 【m = a】else 【m = b】 using 【】for code blocks.
  3. Class Variables. This points out an inherent ambiguity of C++. Most of the time, most things are not "static". They're "automatic" that is, associated with the instance. The auto keyword, however, is rarely used, and is mostly assumed. Python (outside dataclasses) is more consistent. All things inside the class statement are "static": part of the class. In the case of dataclasses, this simple rule is broken, which can be confusing. But. This wasn't mentioned.
  4. Pointer and Reference Transparency. This is simple confusion. All Python is handled by reference all the time. C++ is an absolute mess of "primitive" types that don't use references and objects that do use references. Java is just as bad. And I want to emphasize bad. Python is perfectly consistent, and -- I would suggest -- the opposite of weird. But. The article is describing things from a C++ perspective, as if C++ were somehow not weird. I suggest this isn't a great approach.
  5. Private Class Members. This is summarized as "better encapsulation and control" without a concrete example. It's hard to provide a concrete example because the Pythonic approach works so well. The only use case for "private" that I've been able to understand is when you're concealing the entire implementation from all scrutiny. That is, you have a proprietary implementation with an encrypted JAR file and you want to avoid revealing it to protect some intellectual property. Since Python is source, this can't happen, and we say "We're all adults here." Flag it with a leading _ and we'll recognize it as part of an implementation detail that might change. 
  6. Self vs. this. Not sure what this is but the phrase "only major programming language" is something that relies on Java and C++ being near the top of the TIOBE index. I suspect we can find a lot of languages that use neither "self" nor "this". I'm not sure exactly how this is weird, but, I get that it's different.
  7. Multiple Return Values. This seems like an intentional refusal to understand how tuples and tuple unpacking work. Again, this seems to make C++ the yardstick when C++ is clearly kind of weird here. (See the sketch after this list.)
  8. No Strong Data Types. This seems like another refusal to understand Python. In this case, it feels like a refusal to understand that objects are strongly typed in Python and variables are transient labels attached to objects. The mypy tool will try to associate a type with a variable and will warn you about a = "string" followed by a = 42. Perhaps I'm not understanding, but the portrayal of C++ rules as "not weird" seems like it's being taken too far. 
  9. No Constants. This isn't completely true. Some folks use enums to provide enumerated numeric constant values in the rare cases where this might matter. Using global variables as constants actually works out fine in practice. Most tools will look for ALL_CAPS names on the left of an = sign; and if this occurs more than once will raise a warning. If you have really stupid fellow programmers who can't understand how some variables shouldn't be reused, you can easily write a script to walk the AST looking for references to global variables and warn your colleagues that there are rules and they're not following them. This is part and parcel of the "We're all adults here" approach. If folks can't figure out how constants work, you need to collaborate more fully with other developers to help them understand this.
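
Here are two tiny sketches of the tuple-unpacking and typing points:

def bounds(values):
    # the return is an implied tuple; callers unpack it
    return min(values), max(values)

low, high = bounds([3, 1, 4, 1, 5])

a = "string"
a = 42  # mypy warns here: the variable was first bound to a str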

I'm unhappy with lifting up C++ quirks as if they're somehow really important. I don't think C++ is a terribly helpful language. The need for explicit memory management, for example, is a terrible problem. The explicit distinction between primitives and objects is also terrible.

While compare-and-contrast with Python might be helpful for C++ expatriates, I think this article has it exactly backwards. I think the following list could be more useful.

  • Python frees you from counting {}'s. Just indent. It's easier.
  • Python has simple rules for class/instance variables (except in the case of dataclasses and named tuples.) Also: if it starts with self. it's an instance variable.
  • Python is all references without the horrifying complexity of primitive types.
  • We're all adults here. Don't stress yourself out over privacy or constants. Document your code, instead. Write a unit test case or two. Use mypy. Use black.
  • Tuple unpacking and the fact that tuples are often implied works out very nicely to create very clean code.
  • Data types are part of the object. There's no magical "cast" capability to process a block of bytes as if they're some other type. 

These are advantages of Python. And disadvantages of C++. I think it's better to talk about what Python has than what Python lacks when measured against a terribly complex language like C++.


Tuesday, August 3, 2021

Writing Interactive Compute-Intensive Programs for Web Browsers

Fascinating. The reference to the classic Mac OS with non-preemptive multi-tasking is quite cool. The concept fits nicely with Python's async/await coroutines that need to collaborate with a periodic OS request to permit interaction with streams of events from another source (i.e., a foreground window.) 



Tuesday, July 27, 2021

SOLID Coding in Python

SOLID Coding in Python by Mattia Cinelli.


This was fun to read. It has some nice examples. 

I submit that the order of presentation (S, O, L, I, D) is misleading. The acronym is fun, but awkward.

My LinkedIn Learning course covers these in (what I think is) a more useful order.

  1. Interface Segregation. I think this is the place to start: make your interfaces as small as possible.
  2. Liskov Substitution. Where necessary, leverage inheritance.
  3. Open/Closed. This is a good quality check to be sure you've followed the first two principles well.
  4. Dependency Inversion. This is often about test design and future expansion. In Python, where everything really happens at run time, we often fail to parameterize a type properly. We often figure that out at test time, and need to revisit the Open/Closed principle to get things right.
  5. Single Responsibility is more of a summary of the previous principles than a distinct, new principle. I think it comes last and should be treated as a collection of good ideas, not a single idea.

I think time spent on the first three -- Interface Segregation, Liskov Substitution, and the Open/Closed principle -- pay off tremendously. The ILODS acronym, though, isn't as cool as SOLID.

The "Single Responsibility" suffers from an ambiguous context. At one level of abstraction, all classes have a single responsibility. As we dive into details, we uncover multiple responsibilities. The further we descend into implementation details the more responsibilities we uncover. I prefer to consider this a poetic summary, not the first step in reviewing a design.

Tuesday, July 20, 2021

How can people find inspiration at work? #CreateMeaning

What do I know about "inspiration" at work? I'm not sure I know much, but I think I may have some advice that could be useful.

I was in the high-tech write-software-every-day workplace from about 1976 or so. (The first two years were part-time while in college.) I use the past tense "was" because I'm old enough (and lucky enough) to be able to retire from daily work. I've switched from writing code to writing books about writing code.

For the math-impaired, my software career spanned 45 years.

Early in this career, the question of finding inspiration at work wasn't asked in the same stark way people discuss it nowadays. When I was younger, the idea of maintaining a work/life balance wasn't something we were asked or encouraged to consider. We did the best we could and tried to avoid getting replaced by someone who could do it better.

Which -- with the advantages of hindsight -- was a terrible way to live and work. Simply awful. I was lucky enough to see hundreds of projects in my working career. I worked with scores of different organizations. There was a spectrum of bad behavior.

I did learn this: Fear is not Inspiring. I learned a few other things, but let's start with fear.

The Fear Factor

I want to dwell a bit on the fear factor in the workplace. I'm firmly convinced that some manager types suffer with a nagging background of essential fear for their own jobs. And they can project this fear onto folks around them.

Try these shoes on for a moment. The technology has moved on and you haven't kept up. You're trying to manage people, but you have a nagging suspicion your core managerial skills are weak enough that you could be replaced. Motivated by fear, you encourage "casual overtime" and "working weekends" and "meeting the committed schedule". Perhaps you feel it necessary to go so far as to demand these things.

Fear of getting fired creates an uninspiring place to work. It will be an incubator for burnout. 

Further, I suspect it can lead to worse situations than people quitting. I think the "work a little harder" folks plant seeds for various kinds of workplace abuses. 

I think there are a variety of fears. The fear of getting fired is at the bottom of Maslow's hierarchy of needs: we could get fired, and be unable to get another job. We're in the Physiological and Safety realm of the needs pyramid.

A fear of a project not working out means we'll tumble in the eyes of more senior management. This is somewhere in the higher level of Social Belonging and Esteem needs. The problem is that projects have a variety of metrics, and simply making the schedule is an easy metric and can seem to lead to immediate esteem.

What about higher level cognitive needs like Self-actualization and Transcendence? I strongly suspect fears related to these needs can color someone's workplace. I think these often show up as "Am I really going to be doing this for the Rest of My Life?" questions. This becomes an undercurrent of negativity stemming from fear of being trapped in unfulfilling work. 

We might see these fears in several places. We each harbor our own private fears. In any organization with a hierarchy, we'll have to deal with fears that trickle down to us from supervisors. In non-hierarchical organizations, we'll have to deal with fears of our peers and colleagues. We're surrounded by fears, and I think this can sap our inspiration.

What can we do to find inspiration in a work environment?

My Experience

I've worked on hundreds of projects. That means hundreds of jobs that came to an end. And when the project ended, I was no longer needed. 

In effect, I was fired hundreds of times.

This isn't a helpful thing. If Maslow's base physiological needs are met, then having a project end isn't too horrible. I was a contract programmer in the olden days when we were salaried, and the company would carry us from assignment to assignment. Being let go by a customer can be harsh, but getting paid in spite of being let go softens the blow.

I emphatically do not recommend this way of working as a source of inspiration. Some people like the constant changing gears and changing directions. Other people might find it terrifying: each project is a whole new group of people in a new organization. Potentially very unsettling.

I don't think the "get tougher" or "grow thicker skin" advice is good. I don't think it worked out for me. I think this kind of transience left me feeling isolated. I think it led me to carry around a sense of superiority. So. Let's set aside any dumb-sounding advice based on a literal review of my experience. 

How Did I Cope?

Finding ways to cope, I think, is important, but it is also potentially misleading. The idea of coping with new projects, new organizations, travel, and getting fired all the time isn't inspiring. It's merely coping with an endless stream of loss and separation. 

Underpinning the idea of coping is a more foundational question. Where did I find the inspiration to keep on doing this contract programming thing for so many years? And the other question is how well my search for inspiration might apply to folks who aren't commuting computer programmers?

I think there's a first step that many people can take. It's this:

We can disentangle our self-worth from the work-place imposed sense of worth.

This may be overly glib. But. I think the things rewarded in the workplace aren't a good reflection of who we are and what we're capable of. While it's important to be confident in one's self, our confidence can be undermined by a toxic workplace. Having confidence can let us take our skills and abilities in a variety of directions. We might, for example, decide to find another workplace; one where our value is recognized. Or, we might decide to change our circumstances in the workplace we're currently inhabiting. In both cases, we're asserting our value. We're making a further claim about our value: we may not match the workplace's expectations of us. The workplace can change, or we can find another workplace.

We might see a mismatch in lousy performance reviews. These can stem from many causes. Perhaps we're not suited for a job and need to find something else. Or, perhaps the person reviewing our work doesn't see what we could (or should) be doing. (They have their own fears, and they may not be willing to try to make the changes we'd like them to make.)

Looking back, I may have been doing this all along, without being clear or intentional about it. Perhaps I excelled at places that valued me, and failed to meet expectations at places that treated me poorly. Perhaps my job shifting was (in an indirect way) a search for a workplace that valued me, my unique experiences, and my distinctive voice.

I was not intentional about it. I stumbled from job to job, knowing the sales folks would find me a next assignment when the current assignment had run its course. I think a vague sense of self-worth is what led me to locate inspiration in spite of a lot of change and disruption.

Finding Inspiration

When we think of inspiration, we think of a spiritual drive to do the work. This doesn't often square with working for pay to cover rent and expenses. 

A good manager, however, can create a cohesive team from a group of people. A group of peers can welcome a new colleague. This creates belonging and esteem: the middle levels of Maslow's hierarchy of needs. We may find that a team or a team's goal may be inspiring. This means that our own self-worth is recognized and valued by our co-workers. This can be a marvelous experience.

What about the bad manager or unhelpful group of colleagues? In these cases, we're forced to make the best of an awkward situation. I think we can do this:

We can search for inspiration at the margins of our work life. 

Can we find some side-bar aspect of the work that leads to some helpful insights? Perhaps there is a chronic problem we can take notes on and -- eventually -- fix. Perhaps someone is less helpful than others, and we can try to understand what would make them less toxic. Perhaps cleaning the break-room fridge is better than complaining about month-old food. (Yuck. But. If things are better, it may be worth it.)

For years, I had an aspiration to write about software development. To further this dream, I started taking more and more careful notes of projects I was on. In the era before the World Wide Web, publication was difficult, but not impossible. I wrote small articles for technical magazines; this effort was something that inspired me to work with customers who were inept and had horrible, horrible problems. I liked the awful customers because it provided me good examples of things that should not be done.

At the end of a horrible project, I'd have a good anecdote for what not to do.

I acknowledge that my two ideas of self-worth and inspiration aren't a dramatic, life-changing epiphany. I'm pretty sure the scales won't fall from anyone's eyes as they think about looking at sidebar topics as a source of inspiration.

Looking at the margins, edges, and corners of a job can help to reveal the whole job. The whole team. The whole goal. Finding this broader view might inspire us to look for a better team with better goals. In other cases, it might help us find the missing skills in the team we're on. In other cases, a better perspective might help us steer our supervisor toward doing something that's better than what they're doing right now.

There are very fine lines between toxic, poorly organized, poorly managed, confusing workplaces, and workplaces that are still trying to find a workable organization. Most places have a combination of good and bad, inept and well-done, confusing and sensible features. Indeed, these may all be different axes, and an organization is really a multi-dimensional object with different kinds of overlaps and gaps.

I believe the foundation for inspiration is a clear sense of self-worth. I think we create meaning in our workplace by knowing what we can contribute, what we want to contribute, and what the organization needs. Our unique contribution and what the organization needs may not overlap at all, or the organization may have always been searching for someone like us. Either way, our awareness of our skills, our experience, and our authentic voice is what lets us find inspiration.

Tuesday, July 13, 2021

What Books Should I Read? In What Order?

A fascinating question.

... I'm baffled by the amount of books you've published over the course of time. Currently Reddit suggests that I use Building Skills in Python under Beginner's section, but it looks quite outdated. So back and forth, I found your Building Skills in OO on GitHub Page and was quite happy with the read on the first 100 pages.

I searched for more info on the books you've published and wanted to know if you could sort them in ascending order of difficulty for me as I intend to purchase them slowly.

My main concern to learn Python is just to cross technical interviews and building applications that help with my workflow (they are in bash with around 200 functions, so I'm hoping to migrate them to something which is more robust). 

Currently the focus I intend to develop is on:
1) Strong Foundations of the Python Language.
2) Strong Foundations on the Basic Libraries for Data Structures and Algorithms (For example, bisect gives me insort(), calendar gives me isleap(), iter_tools gives me permutation(), etc).
3) Strong Foundations on the Design Patterns.

So could you please help me out and suggest your books?

This is challenging for a few reasons.

First, the "Building Skills" books have been reduced to only the Building Skills in OO Design. It can be found on GitHub: https://github.com/slott56/building-skills-oo-design-book.

That book is not really targeted to beginners, though. It presumes some core OO skills, and provides a (very) long series of exercises to build on those skills.

Second, I never really conceived of a beginner-to-expert sequence of books. From your letter, I see that I need to look at filling in some gaps in my sequence of books. I'll alert my editors at Packt, and we can consider this in the future.

Specific Needs

Let's look at your needs.

1. Foundations in the Python language.

This might be something you can learn from my Python Essentials. This isn't focused on complete n00bs. All of my books expect some programming background. Since you're an Android engineer and write code in C++ and Java, this may be helpful. This title is getting old, however, and needs a second edition.

For someone with core programming skills, I suspect that Mastering OO Python will be suitable. My Python 3 OO Programming (4th ed.) is similarly aimed at folks who can program and can learn a new language quickly.

A book like Martelli's Python in a Nutshell may provide a better foundation in the way the language works than any of mine. Also, Lutz's Learning Python is extremely popular.

2. Foundations in the Standard Library.

This is tricky. I touch on some of these topics in Functional Python Programming (2nd ed.). I also touch on some of them in the Modern Python Cookbook (2nd ed.).

I don't, however, cover very much of the library. I touch on a few really important modules. The library is vast. A book like Hellmann's The Python Standard Library by Example might be more suitable than one of mine.

3. Design Patterns.

This is central to Python 3 OO Programming (4th ed.). Dusty Phillips and I cover a number of popular design patterns in detail.

There are -- of course -- a lot of very, very good books on Python. I'm honored you reached out to me.

Other Random Advice

Because Python is a relatively simple language (with a vast library), I have always suspected that language foundations don't really require a ton of explanation. Many languages (e.g., C++) are filled with odd details and weird features that are really unpleasantly complex. Many Java programmers get used to the distinction between the primitive int type and the Integer class type. While the Java and C++ approach can seem simple (after a while), it really isn't simple at all.

The standard library is vast, and it takes time to get used to how much is there. I would suggest having a browser tab open to https://docs.python.org/3/library/ at all times.

Design patterns, similarly, require some care. There are complex details around implementing the Singleton pattern in C++ and Java. Python class definitions and Python module definitions are Singletons, and using a class definition as a Singleton object is often far simpler than the commonly-used techniques for C++ and Java.
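For example, here's a minimal sketch of the class-as-Singleton idea. The names are invented for illustration:

class Config:
    # The class object itself is the Singleton; no instances are ever created.
    verbose = False
    retries = 3

def noisy_operation() -> None:
    if Config.verbose:
        print("working...")

Config.verbose = True   # every user of Config sees this change
noisy_operation()       # prints "working..."

No locking, no private constructors, no getInstance() boilerplate.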

Tuesday, July 6, 2021

A Python Roadmap

An interesting tweet. The roadmap has three sections. I'm not sure it's actually complete, or even grouped correctly, but it is a very good list of topics.

Foundations

I want to start by quibbling about variables being first. I'm not sold on this.

I think that operators, expressions, and the built-in immutable types are foundational. int, float, str, and tuple are hugely important as core concepts in computing and Python.

I also think that "loops" is a sketchy notion, and I kind of wish we wouldn't describe for and while statements as "loops". I think we should call them iterations. They implement two kinds of logical quantifiers: "for all" and "there exists." I think we should talk about the final result of a for statement: all of the values in a range are processed. Similarly, a for-if-break construct establishes a "there exists" that defines the first value in a range for which a condition is met. And yes, range objects will be central.
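Here's a small sketch of the two quantifiers, using invented values:

# "For all": every value in the range is processed.
total = 0
for n in range(1, 11):
    total += n

# "There exists": the first value for which a condition is met.
first_multiple = None
for n in range(1, 11):
    if n % 7 == 0:
        first_multiple = n
        break   # stop at the first match

print(total, first_multiple)   # 55 7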

I think that a huge amount of programming can be covered with these topics. I'm not sure "basic" is the right term; foundations might be a better idea. 

The use of variables to manage state is part of this. But. Variables, assignment, and state change are a bit more advanced and maybe shouldn't be first.

I also think function definitions are foundational. Mathematics has a long history of defining functions based on other functions. It's a way of providing a mental short-hand for complex concepts. I don't need to know all the details of how to compute a square root to make use of square root as a concept.

The wide variety of assignment statements, including assignments that decompose collections, isn't mentioned in the original post. This may be an important omission, causing me to quibble with "complete."
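For the record, here's a quick sample of that variety, with invented names:

point = (25.7154147, -80.22695124)
lat, lon = point              # unpacking decomposes a collection
first, *rest = [1, 2, 3, 4]   # starred assignment: first = 1, rest = [2, 3, 4]
a = b = 0                     # chained assignment
a += 1                        # augmented assignment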

I agree that files and elements of File IO are part of this foundation. If we limit ourselves to reading and writing files, then they're essentially immutable structures. I think we can safely avoid update-in-place files because this is an application topic more than a language topic. Python offers the minimal level of support via seek and tell, but little more. And most modern applications rely on a database for updatable files.

Data Structures

Moving from basic to intermediate. I prefer the term "data structures" which are built on the language foundations. I think that the mutable built-in data structures come next in the roadmap. My preference is to omit terms like Object-Oriented or Functional, and focus on list, dict, and set, and how the iteration works. This means comprehensions and generators are part of this essential data structure section.

No, comprehensions aren't and shouldn't be called "advanced." They're very much a core concept. Thinking about statements to implement a map/filter/reduce over a collection is the essence of a great deal of programming. We don't always learn it that way, but it needs to be presented in that framework even to beginners. A pile of for and if statements and a bunch of variables is a programmer's first step toward a simpler comprehension. In both cases, they're doing a mapping and it needs to be described as mapping one collection to another collection.
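To make that concrete, here's a tiny invented example of the two forms side by side:

values = [2, 7, 1, 8, 2, 8]

# The pile of for and if statements...
squares_of_evens = []
for v in values:
    if v % 2 == 0:
        squares_of_evens.append(v * v)

# ...restated as a comprehension: a filter and a mapping, directly visible.
also_squares = [v * v for v in values if v % 2 == 0]
assert squares_of_evens == also_squares   # both are [4, 64, 4, 64]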

This is where the standard library collections module is introduced. Yes, it's part of the library. I think it's too central to be ignored. I think dataclasses belong here, too.
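A small sketch of why I think it's too central to ignore (the word list is invented):

from collections import Counter, defaultdict

words = ["red", "blue", "red", "green", "blue", "red"]
print(Counter(words).most_common(1))   # [('red', 3)]

by_initial = defaultdict(list)
for w in words:
    by_initial[w[0]].append(w)
print(dict(by_initial))   # {'r': ['red', 'red', 'red'], 'b': ['blue', 'blue'], 'g': ['green']}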

Talking about the mutable data structures means revisiting the for statement and using it on a variety of iterables. The way Python's concepts apply to a variety of data types is an important feature of the language. (In the olden days, they used to talk about "orthogonality" of data and processing; we don't need to dwell on it, but I think it helps to acknowledge it.)

Functional Programming

It appears to me that the functional programming topics can come next. The idea of functional composition via higher-order functions and decorators builds on the existing foundation. This is where map() and filter() belong. Because of the way sorted(), max(), and min() work on collections with a key= function, these are part of the functional programming roadmap. The inconsistency between map() and functions like max() is an important thing to note.
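The inconsistency is easy to see in a small, invented example:

words = ["ox", "cat", "aardvark"]

# map() takes the function first, then the iterable...
lengths = list(map(len, words))    # [2, 3, 8]

# ...while max(), min(), and sorted() take the iterable first,
# with the function supplied as the key= argument.
longest = max(words, key=len)      # 'aardvark'
by_size = sorted(words, key=len)   # ['ox', 'cat', 'aardvark']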

I also think itertools belongs here. We can make the case that it's in the standard library, but then, so is io. I think itertools and functools are as central to practical Python as the math module and collections.

I think typing.NamedTuple and dataclasses belong here, also. A frozen dataclass is immutable, and can be helpful when creating list comprehensions to perform a mapping from one collection to another collection.
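A minimal sketch of what I mean, with invented names:

from dataclasses import dataclass

@dataclass(frozen=True)
class Waypoint:
    name: str
    lat: float
    lon: float

raw = [("start", 25.72, -80.23), ("end", 25.74, -80.17)]
# Frozen instances can't be modified, so a comprehension mapping
# tuples to Waypoints involves no hidden state changes.
waypoints = [Waypoint(n, la, lo) for n, la, lo in raw]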

Object-Oriented Programming

I think OO programming and related concepts build on the previous material. Class definitions and state management aren't simple, even though they're essential parts of Python.

To an extent, OO programming can be decomposed into two layers. While I hate to overuse "foundation", there seem to be two parts:

OO Foundations -- inheritance, composition, and different kinds of delegation. This tends to expose a number of common design patterns like Strategy, Decorator, and Facade.

OO Features -- this includes metaprogramming, decorators, ABC's, mixins, and the like. These topics are all designed to avoid copy-and-paste in sophisticated edge cases that cross class boundaries.

Concurrency

I'm not sure why concurrency and parallelism are separate topics in the original list. I've had folks try to split this hair a number of ways. The idea is to find a place where async lives that's "concurrency lite" or something.

The concepts here become blurry because threads and processes are OS features, not language features. The async/await language features, however, are clearly part of Python. It becomes particularly awful when working on something practical where asyncio doesn't provide the feature you need. Specifically, blocking file system I/O isn't part of asyncio and requires an explicit appeal to the underlying thread pool for the blocking operation. 
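Here's a minimal sketch of that appeal to the thread pool. It assumes Python 3.9 or later for asyncio.to_thread(); the file name is invented:

import asyncio

def read_config(path: str) -> str:
    # Ordinary blocking file I/O; asyncio has no non-blocking version of this.
    with open(path) as source:
        return source.read()

async def main() -> None:
    # Run the blocking call in the default thread pool, without blocking the event loop.
    text = await asyncio.to_thread(read_config, "settings.toml")  # invented file name
    print(len(text))

asyncio.run(main())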

To an extent, async/await needs to be on the roadmap. It's tricky, though, to cover this without also digressing into threads as a way to deal with blocking operations.

Test, Integration, and Deployment

This is where tools show up. This is where pip, unittest, pytest, tox/nox, coverage, etc. live. Are these part of the language? Or are they part of the broader ecosystem?

I submit they're explicitly not part of the language. The roadmap ends just before this topic. The idea is that we should have a Python roadmap that uses the language and the standard library.

Once we've talked about the language (and some of the library) we can move on to pip and packaging. I don't think pip is an "intermediate" topic. I find that premature introduction of pip is a sign of trying to create useful, interesting examples. Examples that don't use pip wind up being kind of boring. Everyone wants to play with pygame and pillow and other kinds of projects, but those aren't foundational to the language. They're interesting and appealing and -- frankly -- a lot of fun.

tl;dr

I'm not a fan of the roadmap. I like some of it. I don't like some of it.

I am a fan of presenting the idea for discussion.

Tuesday, June 29, 2021

Letter to Mom -- What Is This "Computer Programming" Thing?

Happy birthday, mom. Glad to see you're still doing well, avoiding the complications of COVID-19.

You asked what it was I did for a living. Emphasis on the past tense, now that we're both fully retired old people. 

I have to confess that it's not easy describing high-tech work. There's a lot of jargon. Your varied range of careers included many things, one of which was being a school librarian. The world has had libraries and librarians for millennia. The job title is pretty well understood. The world hasn't had electronic computers for very long, making the job of programming them a relative novelty.

Aids to computation include slide rules and other mechanical devices. The idea of a mechanical computer dates from the 1830's. You can read about the first computer programmer, Ada Lovelace, here: https://www.computerhistory.org/babbage/adalovelace/. Proper electronic digital computers didn't arise until the 40's, when you were a child. ENIAC, for example, dates from 1945.

While a lot has changed since the ENIAC, there are a few universal truths. I'm going to beat one of those truths like a dead horse because it's both essential and obscured by layers of technology.

This first and most fundamental truth is that a computer -- even something as sophisticated as a laptop with a dozen open browser tabs, zoom, and two different solitaire games -- is really a small device that is patiently waiting for you to type or click; the software works out some response and this is displayed on the screen or burbles out the speakers (or both.)

We can say that a general-purpose computer is "applied" to a specific problem. We shorthand this into creating "Application Software;" the software that applies the computer's hardware to a problem. And we shorthand this into "Apps" or "Applications" that do useful things on a general-purpose device.

The distinction between software (things you download and change) and hardware (the box on your desk) has become pretty common-place. The details of the software are what we need to put under the magnifying glass to look at closely.

To make your computer more useful, clever engineers have worked out a way to interleave activities from a variety of applications, all of which are using your computer concurrently. There's a set of rules to determine which application is in the "foreground"; this is the application software that has access to keyboard, mouse, display, and speakers. When you click on another window, you bring another application to the foreground. Access to the hardware switches and the display updates. It's very slick. They provide a number of visual cues to show you which application's "window" is in the foreground; all the others have different cues to show you they're in the background.

What's important about this foreground/background concept is that each application is -- from one point of view -- free to behave as if it is in total control of the entire computer. In reality, an application emphatically does not have unfettered control over the computing resources; there are a large number of gates and fences forcing applications into an orderly, and disciplined sharing and cooperation.

You taught at a nursery school. You know how important an orderly set of rules is. Applications are no different than unruly three- and four-year olds: they try to grab snacks out of order. They forget how pants work when they try to use the toilet. They need lessons in how to put their coats on to go outside in the winter.

These rules -- the set of policies and procedures that constrain applications -- are collectively called the "Operating System." (Don't ask why; the computer folks borrow terms from other disciplines and imbue them with new meanings. There's rarely a sensible etymology, just conventional usage.) The idea of a "system" of components is essential. There are a lot of layers of engineering in the OS.

The presence of an operating system lets multiple apps cooperate. But, it doesn't change the fundamental truth that originated with Babbage and Lovelace and continued on through Turing and Von Neumann and others and was handed down to me.

The general-purpose computer is applied ("programmed") to a problem; it's set up to respond to inputs by displaying outputs.

So that was boring. What did you do?

Good point. That was boring. But necessary, I think, to bracket the nuanced difference between "computer" as a collection and an individual application. The computer-as-collection includes a lot of software: an OS plus applications. This is distinct from each individual application that's part of the collection. It's all software, but the context shifts from everything the computer is doing to one specific solitaire game.

Above, I mentioned that the OS has layers. In a way, it's like a quilt: there's a backing, batting in the middle, and a complex quilt top made from pieces. Most important is the quilting that holds the layers together.

In a way, it's also like a library. There's the foundational problem of storing and loaning books. But there's a secondary problem of finding the damn things, leading to Dewey or LC codes for topics so we can keep related books together. And there's a third-layer problem of having an accurate index or catalog of all the books. Using small cards (the card catalog) gives the library flexibility to make sure the catalog matches the stacks. And there are related problems of loaning them out with some reasonable promise to return them.

I might even be able to work out an analogy with the Apple Orchard or the Arboretum or the Summer Camp. But, I think you get my drift here, that there are foundational elements that we can't really change, and we build on those foundations to make the whole slightly easier for people to use.

I get it, you built application software. What did you do?

What's important about the concept of layers is how pervasive the layering idea is in all of computing.

Because of the potential complexity of a solution to a problem, we take the "layering" idea one step further than simply decreeing there should be layers.

What we found, starting in the 70's, was that the operating system tended to conceal many details of the underlying hardware. A modern programming language also divorced us from details of the hardware. Admiral Grace Murray Hopper's idea was to have an application that would transform a program written in some neutral language into the language of whatever hardware we had on hand. She pioneered the COBOL programming language; the language was utterly unlike any specific piece of hardware, and required a "compiler" application to translate COBOL statements into a form that the OS could run as an application.

We liked this idea: the underlying hardware became a kind of hazy abstraction. We knew it was there, but between our languages, libraries of pre-written software, our compilers, and the OS, we didn't really see the underlying hardware. This lets us decompose a complex problem into a number of smaller problems; giving us a lot of leverage.

The core idea of "abstraction" leads to the idea of layers of abstraction. Within our application software we can also use this idea of layers to decompose our solution to a problem. An application layer can be quilted to a library layer that we bought or downloaded. The library is -- independently -- quilted to an OS layer. And the whole stack of layers is carefully stitched down to the underlying silicon chip. Maybe it was a Motorola chip, or an Intel chip, or an AMD chip. We didn't much know or care.

Well. We cared a little. Some of the AMD chips were faster than some of the Intel chips. So we would prefer to have our OS and our language focused on those chips because things were faster. Until Intel jumped ahead of AMD. The concept was to remain divorced from gritty details of how the little fleck of silicon with its millions of transistors actually worked. 

Recap

Application software configures the general-purpose computer to a specific task. Applications coexist via an operating system and reusable libraries.

Software (application, operating system, libraries) is created in layers that provide abstractions to hide the details of the underlying layers.

My job?

Design the layers. Get other programmers to understand the design for the layers. Help them to create statements ("code") using the language of choice. (I'm a big fan of Python, but I've used many, many other languages.)

Note that I didn't (generally) design the visible quilt top in any detail. My job was to help the visual designers and the user experience (UX) designers create a top that delighted people using the software. I made sure that the top and the layers underneath it all fit together reasonably well for a sensible budget. Cutting and stitching all the blocks was a specialized skill that I tried to avoid.

I did more than design, however. When I say design of the structure, you can imagine an architect or civil engineer looking over drawings of girders and beams and making sure the floor would hold the weight of all those books in the new wing of the library.

While many software designers and architects do pore over drawings, I -- personally -- didn't like to leave it at the drawing stage. This was probably a career-limiting choice, but I liked to get my hands dirty actually digging holes and standing up cinderblocks in the foundation. The idea of swinging a hammer to build components told me -- directly -- how good (or bad) my design was.

There's a fork in the career path for programmers. Some software architects work best with Keynote presentations to developers and executives. They build understanding and consensus. They're trusted with larger projects and larger budgets. If things didn't work out, they could deflect blame to the folks writing the software. This distinction between design and realization can be used to avoid culpability. It worried me.

Other architects (me, specifically) work best with code. I still needed to build understanding and consensus. But I also built software so I could be *sure* things worked. I liked to provide concrete, tangible, "do it like this" code.

To higher-level executives -- people with budget authority -- I was only a low-level programmer. 

For decades, this meant a project would wind down after completion, and I would leave the customer's location, and move on to a new project. That's why I traveled a LOT.

A few clients would come to realize that I did offer significant value by being able to design the layers and abstractions while also helping folks actually build the software. This recognition was a rarity, which is why I call it a career-limiting choice. It happened a few times. There's a particularly memorable offer from a client in the 90's that -- in retrospect -- I should have taken. But, generally, I moved from work site to work site, designing, and building the application software for very, very large computers.

So, you went to meetings a lot?

Precisely.

At first, I needed to talk about the problem. What they want software to do. Why do they think new, custom-built, unique software will solve the problem they have? This means meeting with people to understand the problem in the first place. "What can't you do?" "Why can't you do it?" There's a lot of "Why?" questions that need to be asked to locate the obstacle that's easiest to remove. (Or the lowest-hanging fruit we can pick.)

Then, we need to talk about the solution. How will we solve the problem with computers and software? In some cases, they have departments that aren't talking. Or they have legal obstacles. Or they have a half-wit vice president in charge of being the owner's brother. Eventually, we wind up at "aha. They have software that acts as a kind of 'custodian' for their cloud-based resources, but the language of the rules for that custodian are opaque." 

(Seriously. A real problem. Very, very removed from reality: governance of rented "cloud" resources. Enterprise policies for use of cloud resources. Concrete rules for cleaning up the computers rented from a cloud vendor. Mathematical foundations for those rules. Very. Abstract. https://github.com/cloud-custodian/cel-python)

Once we've got the preferred solution, we need to decompose it into things we can download, and things we have to build. Ideally, we can download most (or all) of it and move on. Realistically, the problem domain is unique, or something about the overall context and organization is unique, which leads to customized software reflecting the unique situation.

Before too long, we have meetings to review some pictures: some contexts, some containers for application software, some components (or I've called them "layers" above). This will lead to people writing some code. (The 4 C's: Context, Container, Component, Code.)

(Side-bar. The "container" is a generalization of the idea of a computer. The OS lets multiple applications cooperate; what if we have multiple OS's cooperating? This idea of layers of abstraction is so compelling, we can apply it in a variety of places. This lets us talk about abstract containers instead of concrete computers.)

We'll have daily meetings while we're building the code that populates the components that gets installed into the containers that fills out the context. These last 10 minutes. What we've done. What we're doing.

We'll have meetings every two weeks to look at components and containers and be sure they work. People will demo what they've done. It will be fun. We'll have donuts.

We'll have impromptu meetings to talk about how to write tests and do quality assurance on our code and components. The testing and quality checking became my obsession during the last five years of my career. Answering the question "Did you test everything?" 

We'll have meetings to talk about managing the containers to be sure they're working. And how to integrate and deploy the components into the containers. 

In and among the meetings, I wrote code. For the last ten years, it was always in Python. Before that it was in other languages.

So, that's what I did for a living. I went to meetings. I wrote code.

Tuesday, June 15, 2021

Architectural Boundaries: Which Package/Module/Class Owns That Responsibility?

The SOLID design principles beat the design boundary issue to death. Here are the principles in my preferred order. (See https://www.linkedin.com/learning/learning-s-o-l-i-d-programming-principles)

  1. Interface Segregation -- minimize the boundaries. Do this first.
  2. Liskov Substitution -- keep the boundaries consistent. Do this for hierarchies.
  3. Open/Closed -- keep the boundaries stable and allow subclasses. 
  4. Dependency [Inversion] Injection -- keep the implementation separate from the design.
  5. Single Responsibility -- This is essentially a summary of the above four principles.

The point here is that these principles are pleasantly poetic, but there are those edgy cases where an interface can go either way.

Specifically, here's an Edgy Case that can go either way.

We're reading GPX (GPS Exchange) data. See https://www.topografix.com/GPX/1/1/

Associated with this is what's known as the Lowrance USR file format. A lot of devices include the same (or similar) underlying software, and can exchange waypoint and route information in USR format.

We have this as part of the underlying model.

  • The underlying Angle as an abstraction. This has two subclasses:
    • Latitude. An angle with "N" and "S" for its sign, conventionally shown as a two-digit number of degrees: 25°42.925′N
    • Longitude. An angle with "E" and "W" for its sign, conventionally shown as a three-digit number of degrees: 080°13.617′W
  • A Point (or LatLon) is a two-tuple, tuple[Lat, Lon].

A waypoint includes a name, a description, a time-of-last-update (TOLU), and a display symbol. It may also include a GUID to track name changes and assure uniqueness in spite of repeated names.

So far, so good. Nothing too edgy there. "Where's the problem?" you ask.

The problem is representation.

In GPX files, latitude and longitude are float values in degrees. You'll see this: <wpt lon="-80.22695124" lat="25.7154147">...</wpt>.

To do any useful computation, they need to be radians. Or a geocode that supports proximity comparisons, like OLC.

And. If you work with a CSV export from a tool like OpenCPN, then you get strings. These can be any combination of degrees and minutes, or degrees, minutes, and seconds. And, depending on the software, there may be either ° or º for the degrees. Can't tell them apart? One is U+00B0, the DEGREE SIGN. The other is U+00BA, the MASCULINE ORDINAL INDICATOR. Plus, of course, everyone uses apostrophe (') and quote (") where they should have used prime (′) and double prime (″). These are easy regular expression problems to solve.
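For instance, here's one hedged sketch of such a regular expression; the pattern and the function are my own invention, not from the app in question:

import re

DM_PAT = re.compile(
    r"(?P<deg>\d+)\s*[°º]\s*"            # degrees, with either lookalike sign
    r"(?P<min>\d+(?:\.\d+)?)\s*[′']\s*"  # minutes, prime or apostrophe
    r"(?P<hem>[NSEW])"                   # hemisphere letter
)

def parse_dm(text: str) -> float:
    match = DM_PAT.match(text)
    if match is None:
        raise ValueError(f"not a degrees-minutes string: {text!r}")
    value = int(match["deg"]) + float(match["min"]) / 60
    return -value if match["hem"] in "SW" else value

print(parse_dm("25°42.925′N"))    # 25.715416666666667
print(parse_dm("080°13.617′W"))   # -80.22695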

This leads to a class like the following:

from typing import Optional

class Angle(float):
    @classmethod
    def fromdegrees(cls, deg: float, hemisphere: Optional[str] = None) -> "Angle": ...

    @classmethod
    def fromstring(cls, value: str) -> "Angle": ...

This Angle class converts numbers or strings into useful values: radians internally, formatted in degrees externally. (And yes, this gets a warning from Python 3.9 that we can't usefully extend float like this.)

The problem is USR files. 

In USR files, they use millimeter mercator numbers for latitude and longitude. These are distances from the equator or the prime meridian. Because they're in millimeters, an integer will do nicely. A little computation is done to extract degrees (or radians) from these values.

import math

SEMIMINOR_B = 6_356_752.3142  # semi-minor axis of the WGS84 ellipsoid

# mm_lon and mm_lat are the millimeter mercator values read from the USR file.
lon = round(math.degrees(mm_lon / SEMIMINOR_B), 8)
lat = round(
    math.degrees(2 * math.atan(math.exp(mm_lat / SEMIMINOR_B)) - math.pi / 2), 8
)
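For completeness, the inverse follows by algebra from the two formulas above; this is my derivation, not anything from the USR documentation:

# Degrees back to millimeter mercator, assuming the same ellipsoid constant.
mm_lon = round(math.radians(lon) * SEMIMINOR_B)
mm_lat = round(math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * SEMIMINOR_B)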

These aren't too bad. But.

Here's the question.

Where does this belong? Is it part of the underlying Angle class? Is it separate?

Where does Millimeter Mercator representation belong?

This raises a secondary question: Where does ANY representation belong?

Do we separate the essential object (an angle in radians, a float) from all representation questions? If so, how do we properly bind value and representation at run time? 

Is our app full of complex mixins to bind the float with representation choices? Something like class Latitude(float, DMS, MM, etc.): pass. This seems annoyingly complex: we have to make sure names don't collide when defining all these aspects separately.

I think the representation for latitudes and longitudes *is* the essential problem here. The math (e.g., computing the loxodromic distance between points) is trivially separated from all of these representation concerns.

If we buy into the centrality of representation issues, then, we're down to the following argument.

Resolution: millimeter mercator belongs in the Angle class.

Affirmative: it's yet another representation of an angle's value. 

Negative: it's not used outside USR files and belongs in the USR file parser module.

Affirmative Rebuttal: None of the other representations in Angle are tied specifically to a file format.

Negative Rebuttal: The other formats (float, string) are intermixed in CSV files and text displays, making them "widely used." GPX's consistent use of float is a pleasant exception that relies on those widely-used encodings.

Okay. We seem to have conflicting goals here. Some representation is a generic thing that crosses file formats and some representation is localized to a specific file format and not reused.

The SOLID design principles don't help choose between these designs. Instead, they provide post-hoc justification for the design we chose.

We can exploit the SOLID principles in a variety of ways. Some examples:

  • We could claim that LatitudeMM is a subclass of Latitude with the MM conversions mixed in. Open/Closed. Liskov Substitution. 
  • We could claim that Latitude has several load/dump strategies available, including Load from MM. Open/Closed. Dependency is Injected at run-time.

Sigh.

Prior Art

Methods like __str__() and __repr__() are generally considered part of the essential class. That means the most common string representations need to be provided. The parsing of a string, similarly, is the constructor for an instance of the float class.

So. Some representations are part of the class. Clearly, however, not all representations are part of the class. Representation codecs like pickle, struct, or ctypes are kept separate.

I'm going to make the case that there's a very, very fine line between unique and non-unique-but-not-widely-used aspects of a class of objects. And, in this specific case, the millimeter mercator should be kept separate.

I'm going to rely on other representations like PlusCode (also called OLC) as yet another obscure representation and insist these aren't essential to the class. Indeed, I'm going to suggest that proximity-friendly geocoding is clearly separate because it's a hack to replace complex distance computations with substring comparisons.