Tuesday, June 28, 2022

Massive Rework of Data Structures

As noted in My Shifting Understanding and A Terrible Design Mistake, I had a design that focused on serialization instead of proper modeling of the objects in question.

Specifically, I didn't start with a suitable abstract syntax tree (AST) structure. I started with an algorithmic view of "weaving" and "tangling" to transform a WEB of definitions into documentation and code. The weaving and tangling are two of the three distinct serializations of a common AST. 

The third serialization is the common source format that underpins the WEB of definitions. Here's an example that contains a number of definitions and a tangled output file.

Fast Exponentiation
===================

A classic divide-and-conquer algorithm.

@d fast exp @{
def fast_exp(n: int, p: int) -> int:
    match p:
        case 0: 
            return 1
        case _ if p % 2 == 0:
            t = fast_exp(n, p // 2)
            return t * t
        case _ if p % 1 == 0:
            return n * fast_exp(n, p - 1)
@| fast_exp
@}

With a test case.

@d test case @{
>>> fast_exp(2, 30)
1073741824
@}

@o example.py @{
@< fast exp @>

__test__ = {
    "test 1": '''
@< test case @>
    '''
}
@| __test__
@}

Use ``python -m doctest`` to test.

Macros
------

@m

Names
-----

@u

This example uses RST as the markup language for the woven document. A tool can turn this simplified document into complete RST with appropriate wrappers around the code blocks. The tool can also weave the example.py file from the source document.

The author can focus on exposition, explaining the algorithm. The reader gets the key points without the clutter of programming language overheads and complications.

The compiler gets a tangled source.

The key point is to have a tool that's (mostly) agnostic with respect to programming language and markup language. Being fully agnostic isn't possible, of course. The @d name @{code@} constructs are transformed into markup blocks of some sophistication. The @<name@> becomes a hyperlink, with suitable markup. Similarly, the cross reference-generating commands, @m and @u, generate a fair amount of markup content. 

I now have Jinja templates to do this in RST. I'll also have to provide LaTeX and HTML. Further, I need to provide generic LaTeX along with LaTeX I can use with PacktPub's LaTeX publishing pipeline. But let's not look too far down the road. First things first.

TL;DR

Here's today's progress measurement.

==================== 67 failed, 13 passed, 1 error in 1.53s ====================

This comforts me a great deal. Some elements of the original structure still work. There are two kinds of failures: new test fixtures that require TestCase.setUp() methods, and tests for features that are no longer part of the design.

In order to get the refactoring to a place where it would even run, I had to incorporate some legacy methods that -- it appears -- will eventually become dead code. It's not totally dead, yet, because I'm still mid-way through the refactoring. 

But. I'm no longer beating back and forth trying to see if I've got a better design. I'm now on the downwind broad reach of finding and fixing the 67 test cases that are broken. 

Tuesday, June 21, 2022

My Shifting Understanding and A Terrible Design Mistake

I've been fascinated by Literate Programming forever. 

I have two utterly divergent takes on this.

See https://github.com/slott56/PyLit-3 for one.

See https://github.com/slott56/py-web-tool for another.

And yet, I've still done a really bad design job. Before we get to the design, a little bit of back story.

Back Story

Why two separate literate programming projects? Because it's not clear what's best. It's a field without too many boundaries and a lot of questions about the value produced.

PyLit I found, forked, and upgraded to Python 3. I didn't design it. It's far more clever than something I'd design.

Py-Web-Tool is something I wrote based on using a whole bunch of tools that follow along behind the original WEB tools. Nothing to do with web servers or web.py.

The Problem Domain

The design problem is, in retrospect, pretty obvious. I set it out here as a cautionary tale.

I'm looking at the markup languages for doing literate programming. The idea is to have named blocks of code in your document, presented in an order that makes sense to your reader. A tool will "weave" a document from your source. It will also "tangle" source code by rearranging the code snippets from presentation order into compiler-friendly order.

This means you can present your core algorithm first, even though it's buried in the middle of some module in the middle of your package. 

The presentation order is *not* tied to the order needed by your language's toolchain.

For languages like C this is huge freedom. For Python, it's not such a gigantic win.

The source material is a "web" of code and information about the code. A web file may look like this:

Important insight.

@d core feature you need to know about first @{
    def somecode() -> None:
        pass
@}

And see how this fits into a larger context?

@d something more expansive @{
def this() -> None:
    pass
    
def that() -> None:
    pass
    
@<core feature you need to know about first@>
@}

See how that works?

This is easy to write and (relatively) easy to read. The @<core feature you need to know about first@> becomes a hyperlink in the published documentation. So you can flip between the sections. It's physically expanded inline to tangle the code, but you don't often need to look at the tangled code.

The Design Question

The essential Literate Programming tool is a compiler with two outputs:

  • The "woven" document with markup and such
  • The "tangled" code files which are code, largely untouched, but reordered.

We've got four related problems.

  1. Parsing the input
  2. An AST we can process
  3. Emitting tangled output from the AST
  4. Emitting woven output form the AST

Or, we can look at it as three classic problems: deserialization, AST representation, and serialization. Additionally, we have two distinct serialization alternatives.

What did I do?

I tackled serialization first. Came up with a cool bunch of classes and methods to serialize the two kinds of documents.

Then I wrote the deserialization (or parsing) of the source WEB file. This is pretty easy, since the markup is designed to be as trivial as possible. 

The representation is little more than glue between the two.

What a mistake.

A Wrong Answer

Focusing on serialization was an epic mistake.

I want to try using Jinja2 for the markup templates instead of string.Template

However. 

My AST was such a bad hack job it was essentially impossible to use it. It was a quagmire of inconsistent ad-hoc methods to solve a specific serialization issue.

As I start down the Jinja road, I found a need to be able to build an AST without the overhead of parsing.

Which caused me to realize that the AST was -- while structurally sensible -- far from the simple ideal.

What's the ideal?

The Right Answer

This ideal AST is something that lets me build test fixtures like this:

example = Web(
   chunks=[
       TextChunk("\n"),
       NamedCodeChunk(name="core feature you need to know about first", lines=["def someconme() -> None: ...", "pass"])),
       TextChunk("\nAnd see how this fits into a larger context?\n"),
       NamedCodeChunk(name="something more expansive", lines=[etc. etc.])
   ]
)

Here's my test for usability: I can build the AST "manually" without a parser. 

The parser can build one, also, but I can build it as a sensible, readable, first-class Python object.

This has pointed me to a better design for the overall constructs of the WEB source document. Bonus. It's helping me define Jinja templates that can render this as a sensible woven document.

Tangling does not need Jinja. It's simpler. And -- by convention -- the tangled code does not have anything injected into it. The woven code is in a markup language (Markdown, RST, HTML, LaTeX, ASCII DOC, whatever) and some markup is required to create hyperlinks and code sections. Jinja is super helpful here. 

TL;DR

The essence of the problem is rarely serialization or deserialization.  It's the internal representation.


Tuesday, June 14, 2022

A LaTeX Thing I Did -- And A ToDo:

When writing about code in LaTeX, the essential strategy is to use an environment to format the code so it stands out from surrounding text. There are a few of these environments available as LaTeX add-on packages. The three popular ones are:

These are nice for making code readable and distinct from the surrounding text.

A common way to talk about the code is to use inline verbatim \verb|code| sections. I prefer inline \lstinline|code|, but, my editor prefers \verb. (I have trouble getting all the moving parts of minted installed properly, so I use listings.)

Also. And more important. 

There's the \lstinputlisting[language=Python, firstline=2, lastline=12]{some_module.py} command. This lets an author incorporate examples from working, tested modules. Minted doesn't seem to have this, but it might work with an \input command. Don't know. Haven't tried.

Let's talk about workflow.

Workflow

The idea behind these tools is you have code and after that, you write about the code. I call this code first.

Doing this means you can include code snippets from a file.

Which is okay, but, there's another point of view: you have a document that contains the code. This is closer to the Literate Programming POV. I call this document first. I've got all the code in the document you're reading, I've just broken it up and spread it around in an order to serve my purpose as a writer, not serve the limitations of a parser or compiler.

There is a development environment -- WEB -- to create code that can be run through the Weave and Tangle tools to create working code and usable documentation. This is appealing in many ways. 

For now, I'm settling for the following workflow:

  1. Write the document with code samples. Use \lstlisting environment with explicit unique labels for each snippet. The idea is to focus on the documentation with explanations.
  2. Write a Jinja template that references the code samples. This is a lot of {{extract['lst:listing_1']}} kind of references. There's a bit more that can go in here, we'll return to the templates in a moment.
  3. Run a tool to extract all the \lstlisting environments to a dictionary with the label as the key and the block of text as the value. This serializes nicely as a JSON (or TOML or YAML) file. It can even be pickled, but I prefer to be able to look at the file to see what's in it.
  4. The tool to populate the template is a kind of trivial thing to build a Jinja environment, load up the template, fill in the code samples, and write the result.
  5. I can then use tox (and doctest and pytest and mypy) to test the resulting module to be sure it works.

This tangles code from a source document. There's no weave step, since the source is already designed for publication. This does require me to make changes to the LaTeX document I'm writing and run a make test command to extract, tangle, and test. This is not a huge burden. Indeed, it's easy to implement in PyCharm, because the latest release of PyCharm understands Makefiles and tox. Since each chapter is a distinct environment, I can use tox -e ch01 to limit the testing to only the chapter I'm working on.

I like this because it lets me focus on explanation, not implementation details. It helps me make sure that all the code in the book is fully tested. 

The Templates

The template files for an example module have these three kinds of code blocks:

  1. Ordinary Listings. These fall into two subclasses.
    1. Complete function or class definitions.
    2. Lines of code taken out of context.
  2. REPL Examples. 

These have three different testing requirements. We'll start with the "complete function or class definitions."  For these, the template might look like the following

{{extract['lst:listing_1']}}

def test_listing_1() -> None:
    assert listing_1(42)
    assert not listing_1(None)

This has both the reference to the code in the text of the book and a test case for the code.

For lines of code out of context, we have to be more careful. We might have this.

def some_example(arg: int) -> bool:
    {{extract['lst:listing_2']}}

def test_listing_2() -> None:
    assert listing_2(42)
    assert not listing_2(None)

This is similar to a complete definition, but it has a fiddly indentation that needs to be properly managed, also. Jinja's generally good about not inserting spaces. The template, however, is full of what could appear to be syntax errors, so the code editor could have a conniption with all those {} blocks of code. They happen to be valid Python set literals, so, they're tolerated. PyCharm's type checking hates them.

The REPL examples, look like this.

REPL_listing_3 = """
{{extract['lst:listing_3']}}
"""

I collect these into a __test__ variable to make them easy for doctest to find. The extra fussiness of  a __test__ variable isn't needed, but it provides a handy audit for me to make sure everything has a home.

The following line of code is in most (not all) templates.

__test__ = {
    name: value
    for name, value in globals().items() 
    if name.startswith("REPL")
}

This will locate all of the global variables with names starting with REPL and put them in the __test__ mapping. The REPL names then become the test case names, making any test failures easier to spot.

My Goal

I do have some Literate Programming tools that I might be able to leverage to make myself a Weaver that produces useful LaTeX my publisher can work with. I should do this because it would be slightly simpler. The problem is my Web/Weave/Tangle tooling has a bunch of dumb assumptions about the weave and tangle outputs; a problem I really need to fix.

See py-web-tool.

The idea here is to mimic other WEB-based tooling. These are the two primary applications:

  • Weave. This makes documentation in a fairly transparent way from the source. There are a bunch of substitutions required to fill in HTML or LaTeX or Markdown or RST around the generic source. Right now, this is pretty inept and almost impossible to configure.
  • Tangle. This makes code from the source. The point here is the final source file is not necessarily built in any obvious order. It's a tangle of things from the documentation, put into the order required by parser or compiler or build system or whatever.

The weaving requires a better way to provide the various templates that fill in missing bits. Markdown, for example, works well with fenced blocks. RST uses a code directive that leads to an extra level of indentation that needs to be carefully excised. Futher, most markup languages have a mountain of cruft that goes around the content. This is unpleasantly complex, and very much subject to odd little changes that don't track against the content, but are part of the evolution of the markup language.

My going-in assumption on tangling was the document contained all the code. All of it. Without question or exception. For C/C++ this means all the fiddly little pre-processor directives that add no semantic clarity yet must be in the code file. This means the preprocessor nonsense had to be relegated to an appendix of "yet more code that just has to be there."

After writing a tangler to pull code from a book into a variety of contexts, I'm thinking I need to have a tangler that works with a template engine. I think there would be the following two use cases:

  • No-Template Case. The WEB source is complete. This works well for a lot of languages that don't have the kind of cruft that C/C++ has. It generally means a WEB source document will contain definition(s) for the final code file(s) as a bunch of references to the previously-explained bits. For C/C++, this final presentation can include the fiddly bits of preprocessor cruft.
  • Template Case. A template is used to with the source to create the tangled output. This is what I have now for pulling book content into a context where it is testable. For the most part, the template files are quite small because the book includes test cases in the form of REPL blocks. This presents a bit of a problem because it breaks the "all in one place" principle of a WEB project. I have a WEB source file with the visible content plus one or more templates with invisible content.

What I like about this is an attempt to reduce some of the cruftiness of the various tools. 

I think my py-web-tool might be expanded to handle my expanded understanding of literate programming. 

I have a book to finish, first, though. Then I can look at improving my workflow. (And yes, this is backwards from a properly Agile approach.)