Tuesday, February 25, 2020

Stingray Reader Pervasively Bad Decision

I made some bad decisions when I wrote this a few years ago: https://github.com/slott56/Stingray-Reader. Really bad. And. Recently, I've burdened myself with conflicting goals. Ugh.

I need to upgrade to Python 3.8, and add type hints. This exposed some badness.

See https://slott-softwarearchitect.blogspot.com/2020/01/stingray-reader-rewrite.html for some status.

The very first version(s) of this were expeditious solutions to some separate-but-related problems. Spreadsheet processing was an important thing for me. Fixed-format file versions of spreadsheets showed up once in a while mixed with XLS and CSV files. Separately, COBOL code analysis was a thing I'd been involved in going back to the turn of the century.

The two overlap. A lot.

The first working versions of apps to process COBOL data in Python relied on a somewhat-stateful representation of the COBOL DDE (Data Definition Element.) The structure had to be visited more than once to figure out size, offset, and dimensionality. We'll talk about this some more.

A slightly more clever algorithm would leverage the essential parsing as a kind of tree walk, pushing details down into children and summarizing up into the parent when the level number changed. It didn't seem necessary at the time.
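A minimal sketch of that tree-walk idea, assuming each DDE has already been reduced to a (level, name, size, occurs) tuple. The Node class and both function names are hypothetical, not part of Stingray Reader, and the sketch ignores REDEFINES and OCCURS DEPENDING ON entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    level: int
    name: str
    size: int = 0      # bytes for an elementary item; groups are summarized later
    occurs: int = 1    # fixed number of occurrences
    children: List["Node"] = field(default_factory=list)


def build_tree(items: List[Tuple[int, str, int, int]]) -> Node:
    """Use the level numbers to decide where each item hangs in the tree."""
    root = Node(0, "ROOT")
    stack = [root]
    for level, name, size, occurs in items:
        node = Node(level, name, size, occurs)
        # A level number that is not deeper than the top of the stack
        # closes the open items at that depth.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def total_size(node: Node) -> int:
    """Summarize child sizes up into the parent; return size times occurrences."""
    if node.children:
        node.size = sum(total_size(child) for child in node.children)
    return node.size * node.occurs


items = [
    (1, "REC", 0, 1),
    (5, "KEY", 4, 1),
    (5, "GROUP", 0, 3),
    (10, "A", 2, 1),
    (10, "B", 6, 1),
    (5, "TAIL", 1, 1),
]
record_size = total_size(build_tree(items))
```

Here GROUP is 8 bytes repeated 3 times, so the record is 4 + 24 + 1 bytes long.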

Today

I've been working for almost three weeks on trying to disentangle the original DDEs from the newer schema. I've been trying to invert the relationships so a DDE exists independently of a schema attribute. This means some copy-and-paste of data between the DDE source and the more desirable and general schema definition.

It turns out that some design decisions can be pervasively bad. Really bad-foundation-wrecks-the-whole-house kind of bad.

At this point, I think I've teased apart the root cause problem. (Of course, you never know until you have things fixed.)

For the most part, this is a hierarchical schema. It's modeled nicely by JSONSchema or XSD. However. There are two additional, huge problems to solve.

REDEFINES. The first huge problem is a COBOL definition can redefine another field. I'm not sure about the directionality of the reference. I know many languages require things be presented in dependency order: a base definition is provided lexically first and all redefinitions are subsequent to it. Rather than depend on order of presentation, it seems a little easier to make a "reference resolution" pass. This plugs in useful references from items to the things they redefine, irrespective of any lexical ordering of the definitions.
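A "reference resolution" pass can be sketched like this. The Item class and the resolve_redefines() name are hypothetical, and the sketch assumes field names are unique within a record, which real COBOL does not guarantee.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    name: str
    redefines_name: Optional[str] = None   # name given in the REDEFINES clause, if any
    redefines: Optional["Item"] = None     # filled in by the resolution pass


def resolve_redefines(items: List[Item]) -> None:
    """Second pass: replace names with object references, in any lexical order."""
    by_name: Dict[str, Item] = {item.name: item for item in items}
    for item in items:
        if item.redefines_name is not None:
            item.redefines = by_name[item.redefines_name]


# The redefinition is deliberately listed before its base definition.
items = [Item("AS-TEXT", redefines_name="AS-NUMBER"), Item("AS-NUMBER")]
resolve_redefines(items)
```

Because the lookup table is built before any reference is plugged in, the order of the definitions doesn't matter.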

This means data can only be processed strictly lazily. A given block of bytes may have multiple, conflicting interpretations. It is, in a way, a free union of types. In some cases, it's a discriminated union, but the discriminating value is not a formal part of the specification. It's part of the legacy COBOL code.

OCCURS DEPENDING ON. The second huge problem is the number of elements in an array can depend on another field in the current record. In the common happy-path cases, occurrences are fixed. Having fixed occurrences means sizes and offsets can be computed as soon as the REDEFINES are sorted out.

Having occurrences depending on data means sizes and offsets cannot be computed until some data is present. The most general case, then, means setting sizes and offsets uniquely for each row of data.
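Here's a rough sketch of per-row offsets when the number of occurrences depends on data. The layout tuples and the row_offsets() function are hypothetical. It assumes a flat layout, a counter stored as ASCII digits, and a counter that appears before the array that depends on it; real files may use EBCDIC or packed decimal.

```python
from typing import Dict, List, Optional, Tuple

# (name, element size, fixed occurs, name of the DEPENDING ON field or None)
Layout = List[Tuple[str, int, int, Optional[str]]]


def row_offsets(layout: Layout, row: bytes) -> Dict[str, Tuple[int, int]]:
    """Return {name: (offset, total length)} for one specific row of bytes."""
    offsets: Dict[str, Tuple[int, int]] = {}
    position = 0
    for name, size, occurs, depends_on in layout:
        if depends_on is not None:
            # The occurrence count has to be read from this row's data.
            start, length = offsets[depends_on]
            occurs = int(row[start:start + length].decode("ascii"))
        offsets[name] = (position, size * occurs)
        position += size * occurs
    return offsets


layout: Layout = [
    ("COUNT", 2, 1, None),
    ("ITEM", 3, 1, "COUNT"),
    ("TAIL", 1, 1, None),
]
two_items = row_offsets(layout, b"02AAABBBZ")
one_item = row_offsets(layout, b"01AAAZ")
```

The offset of TAIL is 8 in the first row and 5 in the second, which is why nothing after the array can be located from the schema alone.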

Current Release

The current release (4.5) handles the ODO, size, and offset computation via a stateful DDE object.

Yes. You read that right. There are stateful values in the DDE. The values are adjusted on a row-by-row basis.

Tomorrow

There's got to be a better way.

Part of the problem has been conflicting goals.

  • Minimal tweaks required to introduce type hints.
  • Minimal tweaks to break the way a generic schema depended on the DDE implementation. This had to be inverted to make the DDE and generic schema independent.
The minimal tweaks idea is really bad. Really bad. 

The intent was to absolutely prevent breaking the demo programs. I may still be able to achieve this, but... There needs to be a clean line between the exposed work-book like functionality, and some behind the scenes COBOL DDE processing.

I now think it's essential to gut two things:
  1. Building a schema from the DDE. This is a (relatively) simple transformation from the COBOL-friendly source model to a generic, internal model that's compatible with JSONSchema or XSD. The simple attributes useful for workbooks require some additional details for dimensionality introduced by COBOL.
  2. Navigating to the input file bytes and creating Workbook Cell objects in a way that fits with the rest of the Workbook abstraction.
The happy path for Cell processing is more-or-less by attribute name: row.get('attribute').  This changes in the presence of COBOL OCCURS clause items. We have to add an index. row.get('ARRAY-ITEM', index=2) is the Python version of COBOL's ARRAY-ITEM(3).

The COBOL variable names *could* be mapped to Python names, and we *could* overload __getitem__() so that row.array_item[3] could be valid Python to fetch a value.

But nope. COBOL has 1-based indexing, and I'm not going to hide that. COBOL has a global current instance of the row, and I'm not going to work with globals. 
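A sketch of what get() with an index might look like, assuming a layout that maps each name to (offset, size, occurs). The Row class here is a hypothetical stand-in that returns raw bytes; it is not the Workbook Cell API.

```python
from typing import Dict, Optional, Tuple


class Row:
    """Hypothetical row wrapper; layout maps name -> (offset, size, occurs)."""

    def __init__(self, data: bytes, layout: Dict[str, Tuple[int, int, int]]) -> None:
        self.data = data
        self.layout = layout

    def get(self, name: str, index: Optional[int] = None) -> bytes:
        offset, size, occurs = self.layout[name]
        if index is not None:
            # Zero-based here, so index=2 is the third occurrence.
            if not 0 <= index < occurs:
                raise IndexError(f"{name} index {index} out of range")
            offset += index * size
        return self.data[offset:offset + size]


layout = {"KEY": (0, 2, 1), "ARRAY-ITEM": (2, 3, 3)}
row = Row(b"K1AAABBBCCC", layout)
key = row.get("KEY")
third = row.get("ARRAY-ITEM", index=2)   # COBOL's ARRAY-ITEM(3)
```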

So. Where do I stand?

I'm about to start gutting. Some of the DDE size-and-offset (for a static occurrences)

Tuesday, February 11, 2020

Interesting Data Restructuring Problem

This seemed like an interesting problem. I hope this isn't someone's take-home homework or an interview question. It seemed organic enough when I found out about it.

Given a document like this...

doc = {
    "key": "the key",
    "tag1": ["list", "of", "values"],
    "tag2": ["another", "list", "here"],
    "tag3": ["lorem", "ipsum", "dolor"],
}


We want a document like this...

doc = {
    "key": "the key",
    "values": [
        {"tag1": "list", "tag2": "another", "tag3": "lorem"},
        {"tag1": "of", "tag2": "list", "tag3": "ipsum"},
        {"tag1": "values", "tag2": "here", "tag3": "dolor"},
    ]
}


In effect, rotating the structure from Dict[str, List[Any]] to List[Dict[str, Any]].
Bonus, we need to limit the rotation to those keys with a value of List[Any], ignoring keys with atomic values (int, str, etc.).

Step 1. Key Partitioning

We need to distinguish the keys to be rotated from the other keys in the dict.
We start with Dict[str, Union[List[Any], Any]]. We need to distinguish the two subtypes in the union.

from itertools import filterfalse
list_of_values = lambda x: isinstance(doc[x], list)
lov_keys = list(filter(list_of_values, doc.keys()))
non_lov_keys = list(filterfalse(list_of_values, doc.keys()))

This gets two disjoint subsets of keys: those which have a list and all the others. The others, presumably, are strings or integers or something irrelevant.

List lengths

There's no requirement for the lists to be the same lengths. We have three choices here:
  • insist on uniformity,
  • truncate the long ones,
  • pad the short ones.

We'll opt for uniformity in this example. Truncating is what zip() normally does. Padding is what itertools.zip_longest() does.

lengths = (len(doc[k]) for k in lov_keys)
sample = next(lengths)
assert all(l == sample for l in lengths), "Inconsistent lengths"

Some folks don't like using assert for this. This can be a more elaborate if-raise ValueError() if that's necessary.
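The if-raise alternative might look like this. The function name check_uniform_lengths() is made up for the illustration.

```python
from typing import Any, Dict, List


def check_uniform_lengths(doc: Dict[str, Any], keys: List[str]) -> None:
    """Raise ValueError unless every list named by keys has the same length."""
    lengths = {k: len(doc[k]) for k in keys}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Inconsistent lengths: {lengths}")


doc = {
    "key": "the key",
    "tag1": ["list", "of", "values"],
    "tag2": ["another", "list", "here"],
}
check_uniform_lengths(doc, ["tag1", "tag2"])   # passes silently
```

Unlike assert, this check survives running Python with the -O option, and the message shows which key has the odd length.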

Use zip() to merge data values

We have several List[Any] instances in the document. The intermediate goal is a List[Tuple[Any, ...]] structure where the items from each tuple are chosen from the source lists. This gets us a sequence of tuples that have parallel selections of items from each of the source lists.

The zip(list, list) function produces pairs from each of the two lists. In our case, we have n lists in the original document. A zip(*lists) will produce a sequence of items selected from each list.

Here's what it looks like:

list(zip(*(doc[k] for k in lov_keys)))

We can also use zip(key-list, value-list) to make a list of key-value pairs from a tuple of the keys and a tuple of values. zip(Tuple[Any, ...], Tuple[Any, ...]) gives us a List[Tuple[Any, Any]] structure. These objects can be turned into dictionaries with the dict() function.

It looks like this:

list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))

Assemble the parts

The final document, then, is built from untouched keys and touched keys.

d1 = {
    k: doc[k] for k in non_lov_keys
}
d2 = {
    "values": list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))
}
d1.update(d2)

It might be slightly easier to "somehow" build this as a single dictionary, but the two subsets of keys make it seem more sensible to build the resulting document in two parts.
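For comparison, here's one way the single-dictionary version could be written, using ** unpacking inside a dict display. This is only a sketch of the alternative, not a recommendation.

```python
doc = {
    "key": "the key",
    "tag1": ["list", "of", "values"],
    "tag2": ["another", "list", "here"],
    "tag3": ["lorem", "ipsum", "dolor"],
}

lov_keys = [k for k, v in doc.items() if isinstance(v, list)]

rotated = {
    # Untouched keys are copied across as-is.
    **{k: v for k, v in doc.items() if not isinstance(v, list)},
    # Rotated keys: one dict per position in the parallel lists.
    "values": [
        dict(zip(lov_keys, row))
        for row in zip(*(doc[k] for k in lov_keys))
    ],
}
```

It's shorter, but the two kinds of keys are now tangled into one expression.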

The code I was asked to comment on was quite complex. It built a large number of intermediate structures rather than building a List[Dict] using a list comprehension.

What's important about this problem is the complexity of the list comprehension. In particular, the keys are used twice in the comprehension. One use extracts the source lists from the original document. The second use attaches the key to each value from the original list.

It almost seems like the Python 3.8 "Walrus" operator might be a handy way to shrink this code down from about 14 lines. I'm not sure it's helpful to make this any shorter. Indeed, I'm not 100% sure this compact form is really optimal. The fact that I had to expand things as part of an explanation suggests that separate lines of code are as important as separate subsections of this blog post.

Tuesday, February 4, 2020

Dictionary clear() as a code smell

Using the clear() method of a dict isn't *wrong*. But. The reasons have to be investigated. I got a question about this code not working "properly." ("Properly"? Seems too vague to be useful.)

Here's a summary of the example.

final_list = []
temp_dict = {}
for obj in some_source:
    cool_function(obj, temp_dict)
    final_list.append(temp_dict)
    temp_dict.clear()  # Ready for reuse, right?


This can't work.

(Bonus points if you suspect that list.append() is a smell, too. There may be a list comprehension solution that's tidier than this.)

It's not always easy to get to a succinct statement of what doesn't work "properly," or what's confusing about the Python list structure. Getting useful information can be hard. Why?
  • Some programmers are "Assumptions First" kind of people, and their complaint is often "doesn't match my assumption" not "doesn't actually work."
  • Some people live in "All Details Matter" world. Rather than create the smallest example of code that's confusing, they send the *entire* project. The problem is buried in a log, wrapped with "Why is the list of dictionaries not being properly updated?" In an email that provides background details. For a Trello story that links to background details. Details. None of which point to the problem. 

"Properly?" What does that even mean?

Confronted with hundreds of lines of impenetrable code, I asked for a definition of "properly" and got these exact seven words: "Properly is defined as correctly or satisfactorily."

So... 

They have no idea what's wrong, can't summarize the code that's broken, and it's my fault because I'm the Python guru.

Why Won't My Code Work?

The short answer is "Because You're Making an Assumption."

Of course, anyone who puts their assumptions first is as blind to their assumptions as we are to the air that surrounds us. Assumptions are just there. All around them. They breathe their assumptions in and out without seeing them.

The long answer is Python uses references.

If you apply the id() function to the items in the resulting final_list, you'll see that it's reference after reference of one object, temp_dict.  Not copies of individually populated dictionaries, but multiple references to the same dictionary. The same dictionary which was cleared and reloaded over and over again.
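A small, self-contained demonstration. The loop body is a stand-in for the real cool_function() call.

```python
final_list = []
temp_dict = {}
for n in range(3):
    temp_dict["n"] = n            # stand-in for cool_function(obj, temp_dict)
    final_list.append(temp_dict)  # appends a reference, not a copy
    temp_dict.clear()             # empties the one and only dictionary

distinct_ids = {id(d) for d in final_list}
same_object = all(d is temp_dict for d in final_list)
```

There are three items in final_list, but only one id value among them, and every item is the (now empty) temp_dict.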

The very first log, crammed with useless details, had output from print() functions. It showed multiple copies of the same dict. 

Because they assumed Python is making copies, there was no explanation for why the list of dictionaries was broken. Clearly, it couldn't be in their code. They assume their code is correct. The only choice has to be an undocumented mystery in Python. And I'm the Python guru, so it's my problem.

The presence of duplicates in the output meant "something" to them. They could point it out as somehow wrong. But the idea that their assumptions might be wrong? That was a nope.

They wanted it to be the list object, final_list, which didn't append dictionaries the way they assumed it would. They needed it to be a Python internals problem. They needed it to be a bad documentation problem. (Seriously. These convos have spun out of control in the past.)

tl;dr

Using the clear() method of a dict may indicate the developer is hoping Python shares copies, not references. Either add an explicit copy() (or copy.deepcopy()) or fix things to create new, fresh dictionaries each time. Objects are cheap. Why reuse them?
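Both repairs, sketched with a made-up cool_function() and a made-up some_source, since the originals weren't part of the question:

```python
def cool_function(obj, target):
    """Hypothetical stand-in: loads something about obj into target."""
    target["value"] = obj


some_source = [1, 2, 3]

# Option 1: a new, fresh dictionary each time through the loop.
final_list = []
for obj in some_source:
    fresh = {}
    cool_function(obj, fresh)
    final_list.append(fresh)

# Option 2: keep the reuse, but append an explicit copy.
copied_list = []
temp_dict = {}
for obj in some_source:
    cool_function(obj, temp_dict)
    copied_list.append(temp_dict.copy())
    temp_dict.clear()
```

A shallow copy() is enough here because the values are simple; nested mutable values would need copy.deepcopy().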

(Indeed, an interesting side-bar question I did not ask is "In what god-forsaken programming language does this 'clear-and-reuse' of a data structure even make sense? FORTRAN?")

The list comprehension solution to this problem will have to wait. Stay tuned. I want to disentangle the algorithmic design problem from the "why aren't my assumptions correct?" problem.

Tuesday, January 28, 2020

Stingray Reader Rewrite

See https://slott-softwarearchitect.blogspot.com/2020/01/stingrayreader-upgrade.html

This drifted into some serious rethinking of bad design decisions. (If someone else did this, I'd call it weak, and suggest improvements. It was me. It was bad. I'm a bad programmer and I feel bad about it.)

As an example, there's this sketchy construct:

some_data = {name: source[name] for name in the_names}
the_object = SomeClass(**some_data)

The some_data dictionary could be called Dict[str, Any], but that's unhelpful for letting mypy check the consistency of data structures. This is what was required:

  FullAttr = TypedDict("FullAttr",
      {
          "name": str,
          "offset": int,
          "size": int,
          "type": str,
          "create": Cell,
      },
      total=False
  )

This dictionary changes -- profoundly -- the relationship between classes. The FullAttr type gives us an intermediary representation. The SomeClass hierarchy has a flexible collection of attributes. We can use this to uncouple some parsing operations from object factory operations, using this minimal subset of definitions as a kind of bridge between modules, both of which can be fully type-checked, but still permit Python's duck-type flexibility.
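Here's a cut-down sketch of the bridge idea. The parse_attribute() and make_attribute() functions and this Attribute class are hypothetical, and the "create" entry is left out because Cell is a Stingray Reader class. The parsing side and the factory side share nothing except the FullAttr type.

```python
from typing import Any, Dict, TypedDict


class FullAttr(TypedDict, total=False):
    name: str
    offset: int
    size: int


def parse_attribute(source: Dict[str, Any]) -> FullAttr:
    """Parsing side: picks out the known keys; knows nothing about Attribute."""
    return FullAttr(
        name=str(source["name"]),
        offset=int(source["offset"]),
        size=int(source["size"]),
    )


class Attribute:
    def __init__(self, name: str, offset: int = 0, size: int = 0, **kwargs: Any) -> None:
        self.name = name
        self.offset = offset
        self.size = size
        self.other = kwargs   # extra attributes carried for the application


def make_attribute(attr: FullAttr) -> Attribute:
    """Factory side: sees only the FullAttr bridge, not the parser."""
    return Attribute(**attr)


attribute = make_attribute(
    parse_attribute({"name": "X", "offset": "4", "size": "2", "junk": 1})
)
```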

It Got Worse

Adding type hints to Stingray Reader required navigating some shoal water created by a poor set of dependency decisions.

The original, vague, concept was to have a Schema and Attribute definition that could be shared by all the various readers. A schema contains a number of attributes. Ideally, an attribute can be defined by a sub-schema. This is how JSONSchema and XSD work.

But.

The Stingray Reader reads Workbooks with an extension to read COBOL. There are a bunch of extensions required.
  • The schema is loaded by a COBOL parser. 
  • The physical file formats require the possibility of EBCDIC -> Unicode conversion. 
  • Unlike ordinary workbooks, the record layouts have to be built lazily. An ordinary workbook row is complete. Some physical formats elide empty cells, but they're easy to replace with an explicit empty cell. COBOL has a REDEFINES clause that means we can't even attempt to parse the bytes for a row until they're required by the app. There's no way -- from the data definition alone -- to discern which of the redefines options will have valid data. There's more, but you get the idea: COBOL is kind of complex.
Versions 1 to 4 had a dumb-as-a-bag-of-hammers problem.

The Schema and Attribute definitions were extended to depend on COBOL implementation details.

It works nicely because of duck typing and late binding of types.

Python's type hinting exposes the grotesque consequences of this dependency.

We tried several ways of reordering a bunch of definitions to remove forward type references. It took almost an hour to realize the forward references could not be removed trivially because of a circularity. Two Attribute subclasses depended on COBOL features. And the COBOL features had weakref references back to their Attributes.

Crushing everything into a single, large module worked to ease the complications of circularity. But the essential interdependence needs to be expunged.

What has to happen next is to invert the relationship between Attributes and COBOL details. This means two changes:

  1. Extending the Attribute class hierarchy to contain just enough information to cover the COBOL complications. 
  2. Changing the function that builds an Attribute definition from the COBOL source so it copies details into the Attribute. The COBOL detail needs to be little more than the description of the property.

This isn't easy. But. 187 test cases and a TOX setup makes it a reasonable effort.

And.

I can finally look seriously at converting between JSON Schema and COBOL. 

Tuesday, January 21, 2020

StingrayReader Upgrade

See https://github.com/slott56/Stingray-Reader

It's time to add type hints.

And.

Learn some interesting lessons.

Here's the interesting problem:

some_data = {name: source[name] for name in the_names}
the_object = SomeClass(**some_data)

While valid, this concerns mypy.

The point here is to have a flexible source of data, source. Perhaps this is a spreadsheet row, or a complex JSON/YAML-formatted document with optional or irrelevant fields. The short list of relevant names is in the_names.  Ideally, this list of names matches the keyword args of SomeClass.

This gives mypy fits because there's no way to match the dictionary with the object's parameters.

We have two paths forward.
  1. Eliminate the intermediate dictionary. Use SomeClass(x=source['x'], y=source['y'], ... etc.)
  2. Consider using a TypedDict for the intermediate dictionary. But. Then the dictionary's types must be kept in sync with the SomeClass definition, which may be a little crazy.
Item 2 isn't as crazy as it sounds, though. The SomeClass definition has a **kwargs option, allowing extra attributes to be set. This is, perhaps also crazy. But, the framework needs to drag around extra attributes for the application's benefit.

A possibility is to do away with **kwargs, and replace it with other: Dict[Any, Any]. This cuts down on the expressivity of the framework. Now we support SomeClass.app_name. This change would mean we'd have SomeClass.other['app_name']. While possibly better for mypy, I don't think it's ideal for users.

I can also rework SomeClass to use __getattribute__() to look into self.other for extra attribute names.
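A sketch of that rework, with a made-up SomeClass signature. It uses __getattr__(), which Python calls only after normal attribute lookup fails, rather than __getattribute__(), which intercepts every lookup and is easier to get wrong.

```python
from typing import Any, Dict, Optional


class SomeClass:
    def __init__(self, name: str, other: Optional[Dict[str, Any]] = None) -> None:
        self.name = name
        self.other: Dict[str, Any] = dict(other or {})

    def __getattr__(self, attr: str) -> Any:
        # Only reached when normal lookup fails, so self.name never comes here.
        # Going through __dict__ avoids recursion if other isn't set yet.
        try:
            return self.__dict__["other"][attr]
        except KeyError:
            raise AttributeError(attr) from None


obj = SomeClass("x", other={"app_name": "demo"})
app_name = obj.app_name   # found in obj.other
```

This keeps the obj.app_name spelling for users, while the declared attributes stay visible to mypy.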

I'm very happy to have the rigorous static check. The rethinking is helpful.

("Wait," you say. "You didn't provide the recommended path forward."  Correct.  I'll update.)

Tuesday, January 14, 2020

The Wrong Abstraction Problem

For the last week I've been working with some legacy code that reveals a kind of problem I hadn't really seen before.

I'm calling it the Wrong Abstraction.

I want to contrast this with the Leaky Abstraction, where implementation details are revealed and raise havoc.

The Wrong Abstraction problem seems to arise when a specification is too technical. A detailed, code-like tangle of if-then-else becomes its own problem. I'm guessing someone worked to detail all the technical considerations. The chosen format as code-like text was not a great idea. The cyclomatic complexity of the specification is through the roof. And the code reflects this failure to actually capture anyone's underlying intent.

Cue the gif from the office. https://gph.is/1m89uqR Someone with "people skills" tried to recast the business intent into technical if-then-else.

Details

The context doesn't matter very much, but it can help people visualize the problem.

We're talking about validation rules. A document arrives, perhaps it's source code, or perhaps it's a shopping cart, or perhaps it's a schema definition. The document is validated according to some fairly sophisticated rules.
  • There's the obvious syntax check: is it valid JSON or Python or whatever the language is.
  • There are isolated validity checks. Individual elements (statements, items in the cart, subschemas) have to be valid.
  • There are aggregate validity checks. Groups of items -- the cart overall -- must satisfy some additional criteria. In our case, nine additional rules.
Some of the rules are complex. I think the original intent was drafted by a committee. It's visible, and involves large piles of money and potential lawsuits. Serious rules.

There are at least two separate implementations, mostly in JavaScript. (I'm not here to curse out JavaScript. The language has a lot of wat -- https://github.com/denysdovhan/wtfjs -- but that's not the point.)

So, you ask, where's the Wrongness?

It's a vast gap between intent and implementation.

Mind the Gap

The source documents decompose the validation into 9 steps. There's an explicit "all or nothing" disclaimer. That's nice.

The code looks more-or-less like this:

valid = True
for item in cart:
    for r in (Rule1, Rule2, Rule3, Rule4, ..., Rule9):
        if applies(r, item):
            valid = valid and r(item)

It turns out, though, we don't really apply all 9 rules like this. This is The Gap.

We actually have three types of items in the cart (or code or schema or whatever.) One type of item has a default, a hidden feature of rule 1. It breaks down like this.
  • Rule 1 applies to an item of Type A. If the Type A item is omitted, the default value will pass the Rule 1 check. 
  • Rule 2 applies to all the items of Type B. Only.
  • Rules 3 to 8 apply to the items of Type C. Only. And they work in pairs, 3-4, 5-6, 7-8.
  • Rule 9 applies to a subset of items of Type C. The C9 subset.
Code with a nested "for all items" and "for all rules" is -- well -- wrong. It's flat-out lying about the validation rules and the objects (and collections) being validated. It's lying to a level that seems unconscionable to me. But. Maybe there's a reason.

The validation is really something more like this.

valid = (
    Rule1(filter(lambda item: item.is_a, cart))
    and Rule2(filter(lambda item: item.is_b, cart))
    and all(
        r(x)
        for r in (Rule3, Rule4, Rule5, ..., Rule8)
        for x in filter(lambda item: item.is_c, cart)
    )
    and Rule9(filter(lambda item: item.is_c9, cart))
)

This reflects the actual structure of item types and rule types without wrapping them in a wrong abstraction.

(It's actually *more* complex than this, but, this is enough to expose the core issue.)
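To make the shape concrete, here's a runnable toy version. The Item class, the ok flag, and every rule body are invented stand-ins; only the grouping of rules by item type comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Item:
    kind: str           # "A", "B", or "C"
    ok: bool = True     # stand-in for whatever a real rule would inspect
    c9: bool = False    # member of the C9 subset


def of_kind(cart: Sequence[Item], kind: str) -> List[Item]:
    return [item for item in cart if item.kind == kind]


def rule1(items_a: List[Item]) -> bool:
    # all() of an empty list is True: an omitted Type A item passes by default.
    return all(item.ok for item in items_a)


def rule2(items_b: List[Item]) -> bool:
    return all(item.ok for item in items_b)


def per_item_rule(item: Item) -> bool:
    return item.ok


# Stand-ins for Rules 3 to 8, each applied to a single Type C item.
C_RULES: List[Callable[[Item], bool]] = [per_item_rule] * 6


def rule9(items_c9: List[Item]) -> bool:
    return all(item.ok for item in items_c9)


def validate(cart: Sequence[Item]) -> bool:
    c_items = of_kind(cart, "C")
    return (
        rule1(of_kind(cart, "A"))
        and rule2(of_kind(cart, "B"))
        and all(r(x) for r in C_RULES for x in c_items)
        and rule9([x for x in c_items if x.c9])
    )


good = validate([Item("B"), Item("C"), Item("C", c9=True)])
bad = validate([Item("A"), Item("C", ok=False)])
```

There's no applies() test anywhere. Each rule only ever sees the items it's about.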

Why The Gap?

There are a number of causes. In part, the gap seems to reflect a disconnect between intention and implementation. Indeed, this seems to be an example of Conway's Law.
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."
I think the for item in cart: for rule in (Rule1, ..., Rule9):  structure reflects some intermediate design work between the original intent and the developer who implemented the code.

The extra layer of design work was a failed attempt to "simplify" things for the developer. I can imagine the conversation.

Designer: "It's simple. There are 9 rules. Each rule applies to each item."
Developer: "Rule one only seems to apply to Type A. So maybe it's not simple."
Designer: "It's simple. Don't make it complex. Write an 'applicability' test. Evaluate the rule if it applies to the item."
Developer: "So it's not trivially all rules against all items? Could we associate subsets of rules with the separate item types?"
Designer: "No. You're making it complex; It's simply evaluating all 9 rules against each item. If the rule applies to the item type. Other than that, it's simple."
Developer: "Instead of the 'applicability test,' could we group the rules?"
Designer: "No. You're making it complex."

I also think the gap reflects an inability (or a lack of permission) to hack incrementally.

Incremental Development

One of Python's strong suits is the ability to run code at the >>> prompt. Confronted with a complex data structure and complex rules, some of us will try different designs on for size as quickly as we can. We hack out the essence of the code and see if it would make sense in a tutorial explanation.

I've darted down any number of dead-ends trying to get a sensible abstraction that I can understand and explain. The idea is to write a bit of code, mess around, and then decide to backtrack or push forward. (For a lot of people, rubber ducking or pair programming helps with this.)

When you're only a few lines of code into the problem, it's easy and fun to delete it all and start again. Or. It *should* be easy and fun. Some folks worry about deleting bad code and starting over.

I think the overall context didn't facilitate hacking around. The documentation talks about creating mock documents (or carts or collections) of items for testing purposes. I don't think anyone tried that. I'm not sure they knew the feature was available. I think they put the validation code into the framework, ran it in the development environment, looked at the debugging logs, changed the code, deployed, and ran things again until it worked. A long, painful slog, where backtracking would be considered a horrible set-back.

The complex "applies()" test has a surprising bunch of if statements that don't seem to reflect the actual properties of the three types of items. It seems to reflect an evolving series of guesses about attributes that were present or absent.

When I was younger, writing COBOL, PL/I, Fortran and the like, that's how we worked. Run it. Look at logs. Run it again later in the day. The long, slow development cycle meant that as soon as something looked like it was working, we called the project 90% complete.

This led inexorably to the ninety-ninety rule.
"The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time."
Even if the abstraction is wrong. We've taken 90% of the time to get something that works. There's no fixing it, now. We have to ship something, so we spend the next 90% of the time working around the wrongness and filling in gaps that shouldn't have existed.

A horrid development environment tends to prohibit refactoring. You can't simply run the test suite with refactored code because the test suite is neither fast nor fully automated. In this case, I don't think it runs in a handy form on the desktop, but requires a dedicated server. Without a Docker container for each developer, I think the project gets paralyzed and stuck with icky code and me doing a very expensive rewrite.

tl;dr

An utterly wrong abstraction seems to have three root causes:

  • Too many designers
  • No ability to delete the garbage abstraction and start over with something better
  • No simple unit test environment to support refactoring

Tuesday, January 7, 2020

Patreon Book Idea


See "Additional, Related Content". It's one of the posts here: https://www.patreon.com/slott

I think there's space for a Building Skills in Functional Python title next to the Building Skills in OO Design