Tuesday, January 14, 2020

The Wrong Abstraction Problem

For the last week I've been working with some legacy code that reveals a kind of problem I hadn't really seen before.

I'm calling it the Wrong Abstraction.

I want to contrast this with the Leaky Abstraction, where implementation details are revealed and raise havoc.

The Wrong Abstraction problem seems to arise when a specification is too technical. A detailed, code-like tangle of if-then-else becomes its own problem. I'm guessing someone worked to detail all the technical considerations. The chosen format as code-like text was not a great idea. The cyclomatic complexity of the specification is through the roof. And the code reflects this failure to actually capture anyone's underlying intent.

Cue the gif from the office. https://gph.is/1m89uqR Someone with "people skills" tried to recast the business intent into technical if-then-else.

Details

The context doesn't matter very much, but it can help people visualize the problem.

We're talking about validation rules. A document arrives, perhaps it's source code, or perhaps it's a shopping cart, or perhaps it's a schema definition. The document is validated according to some fairly sophisticated rules.
  • There's the obvious syntax check: is it valid JSON or Python or whatever the language is.
  • There are isolated validity checks. Individual elements (statements, items in the cart, subschemas) have to be valid.
  • There are aggregate validity checks. Groups of items -- the cart overall -- must satisfy some additional criteria. In our case, nine additional rules.
Some of the rules are complex. I think they original intent was drafted by a committee. It's visible, and involves large piles of money and potential lawsuits. Serious rules.

There are at least two separate implementations, mostly in JavaScript. (I'm not here to curse out JavaScript. The language has a lot of wat -- https://github.com/denysdovhan/wtfjs -- but that's not the point.)

So, you ask, where's the Wrongness?

It's a vast gap between intent and implementation.

Mind the Gap

The source documents decompose the validation into 9 steps. There's an explicit "all or nothing" disclaimer. That's nice.

The code looks more-or-less like this:

valid = True
for item in cart:
    for r in (Rule1, Rule2, Rule3, Rule4, ..., Rule9):
        if applies(r, item):
            valid = valid and r(item)

It turns out, though, we don't really apply all 9 rules like this. This is The Gap.

We actually have three types of items in the cart (or code or schema or whatever.) One type item has a default, a hidden feature of rule 1. It breaks down like this.
  • Rule 1 applies to an item of Type A. If the Type A item is omitted, the default value will pass the Rule 1 check. 
  • Rule 2 applies to all the items of Type B. Only.
  • Rules 3 to 8 apply to the items of Type C. Only. And they work in pairs, 3-4, 5-6, 7-8.
  • Rule 9 applies to a subset of items of Type C. The C9 subset.
Code with a nested "for all items" and "for all rules" is -- well -- wrong. It's flat-out lying about the validation rules and the objects (and collections) being validated. It's lying to a level that seems unconscionable to me. But. Maybe there's a reason.

The validation is really something more like this.

valid = Rule1(filter(lambda item: item.is_a, cart))
    and Rule2(filter(lambda item: item.is_b, cart))
    and all(
        r(x) 
        for r in (Rule3, Rule4, Rule5, ..., Rule8) 
        for x in filter(lambda item: item.is_c, cart)
    )
    and Rule9(filter(lambda item: item.is_c9, cart))

This reflects the actual structure of item types and rule types without wrapping them in a wrong abstraction.

(It's actually *more* complex than this, but, this is enough to expose the core issue.)

Why The Gap?

There are a number of causes. In part, the gap seems to reflect a disconnect between intention and implementation. Indeed, this seems to be an example of Conway's Law.
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."
I think the for item in cart: for rule in (Rule1, ..., Rule9):  structure reflects some intermediate design work between the original intent and the developer who implemented the code.

The extra layer of design work was a failed attempt to "simplify" things for the developer. I can imagine the conversation.

Designer: "It's simple. There are 12 rules. Each rule applies to each item."
Developer: "Rule one only seems to apply to Type A. So maybe it's not simple."
Designer: "It's simple. Don't make it complex. Write an 'applicability' test. Evaluate the rule if it applies to the item."
Developer: "So it's not trivially all rules against all items? Could we associate subsets of rules with the separate item types?"
Designer: "No. You're making it complex; It's simply evaluating all 12 rules against each item. If the rule applies to the item type. Other than that, it's simple."
Developer: "Instead of the 'applicability test,' could we group the rules?"
Designer: "No. You're making it complex."

I also think the gap also reflects an inability (or a lack of permission) to hack incrementally.

Incremental Development

One of Python's strong suits is the ability to run code at the >>> prompt. Confronted with a complex data structure and complex rules, some of us will try different designs on for size as quickly as we can. We hack out the essence of the code and see if it would make sense in a tutorial explanation.

I've darted down any number of dead-ends trying to get a sensible abstraction that I can understand and explain. The idea is to write a bit of code, mess around, and then decide to backtrack or push forward. (For a lot of people, rubber ducking or pair programming helps with this.)

When you're only a few lines of code into the problem, it's easy and fun to delete it all and start again. Or. It *should* be easy and fun. Some folks worry about deleting bad code and starting over.

I think the overall context didn't facilitate hacking around. The documentation talks about creating mock documents (or carts or collections) of items for testing purposes. I don't think anyone tried that. I'm not sure they knew the feature was available. I think they put the validation code into the framework, ran it in the development environment, looked at the debugging logs, changed the code, deployed, and ran things again until it worked. A long, painful slog, where backtracking would be considered a horrible set-back.

The complex "applies()" test has a surprising bunch of if statements that don't seem to reflect the actual properties of the three types of items. It seems to reflect an evolving series of guesses about attributes that were present or absent.

When I was younger, writing COBOL, PL/I, Fortran and the like, that's how we worked. Run it. Look at logs. Run it again later in the day. The long, slow development cycle meant that as soon as something looked like it was working, we called the project 90% complete.

This lead inexorably to the ninety-ninety rule.
"The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time.” 
Even if the abstraction is wrong. We've take 90% of the time to get something that works. There's no fixing it, now. We have to ship something, so we spend the next 90% of the time working around the wrongness and filling in gaps that shouldn't have existed.

A horrid development environment tends to prohibit refactoring. You can't simply run the test suite with refactored code because the test suite is neither fast nor fully automated. In this case, I don't think it runs in a handy form on the desktop, but requires a dedicated server. Without a Docker container for each developer, I think the project gets paralyzed and stuck with icky code and me doing a very expensive rewrite.

tl;dr

An utterly wrong abstraction seems have two root causes:

  • Too many designers
  • No ability to delete the garbage abstraction and start over with something better
  • No simple unit test environment to support refactoring

Tuesday, January 7, 2020

Patreon Book Idea


See "Additional, Related Content". It's one of the posts here: https://www.patreon.com/slott

I think there's space for a Building Skills in Functional Python title next to the Building Skills in OO Design

Tuesday, December 31, 2019

Christmas Ornament

See https://github.com/slott56/cpx-xmas-ornament

You'll need a Circuit Playground Express https://www.adafruit.com/product/3333

Install the code. Enjoy the noise and blinky lights.

The MML translation isn't as complete as you might like. The upper/lower case for the various commands isn't handled quite as cleanly as it could be. AFAIK, case shouldn't matter, but I omitted any lower() functions, making the  MML parser case sensitive. It only mattered for one of the four songs, and it was easier to edit the song.

The processing leaves a great deal of "clickiness" in the start_tone() processing. I think I know how to address it.

There are barely 96 or so different tones available in MML compositions. It might be possible to generate the wave shapes in advance to have a smoother music experience.

One could image having an off-line translator to transform the MML text into a sequence of bytes with note number and duration. This would slightly compress the song, but would speed up processing by eliminating the overhead of parsing.

Additionally, having 96 wave tables could speed up tone production. The tiny bit of time to recompute the sine wave at a given frequency would be eliminated. But. Memory is limited.

Tuesday, December 24, 2019

Walrusing Around

This is -- well -- it is what it is. I don't have to like it.

>>> t_s = (8063599, 0)
>>> fields = [(t_s := divmod(t_s[0], b))[1] for b in (60, 60, 24, 7)]
>>> list(reversed(fields + [t_s[0]]))
[13, 2, 7, 53, 19]


It works and shows how the assignment operator works.

The point here is to convert a timestamp into ISO week, day, hour, minute, second. 13th week, 2nd day, 7h, 53m, 19s.

The divmod() function returns a two-tuple, which the assignment operator can't decompose. Instead, we decompose it by wrapping the whole thing in ()[1].

Works.

Do Not Recommend.

Tuesday, December 17, 2019

Plannng a Linked-in Learning Course (and using the := walrus operator)

I've recorded two courses for LinkedIn Learning https://www.linkedin.com/learning/me
Let me emphasize that their production values take a lot of work. While I think I'm a pretty good live presenter, a few days in the recording booth with a producer, reveals all my weaknesses. so. um. you know?

I'm starting down the road to at least one more, maybe another one or two after that. 

Which leads to code. Of course. And the code uses the assignment expression ("walrus") operator.

Here's what's going on. I've got a directory full of CSV files with the slide-by-slide scripts. Each file has a bunch of tabs, and the relevant tables have a fixed heading that the production folks use. 

target_headings = ['Part', 'Voice', 'Visual Description', 'Storyboard / Description']

The "Voice" column in these tables is the script. Each row is a slide or other visual. The overall management of the resources with all of these spreadsheets doesn't seem ideal to me. However, it's the way skilled professionals prefer to manage these multi-media assets.

The question is: "which sections are too long?"

Generally, we speak at a consistent rate. During rehearsals, I can use my stopwatch to get timing for a particular script. This gives me a seconds/word or words/second rate metric. Given an average rate, and a script, I can predict a likely duration given the text of the script.

The data is in spreadsheets -- generally the root cause of many complications. There's no word-count in Numbers. So. Time to apply Python. (I'm sure someone has a bunch of Excel macros that can do word-counts. Good for you. I don't own a copy of Excel.)

Here's how this shakes out. There are three parts to the analysis, modeling, and applying the model. The first is a functional flattener to turn all of the files and tabs and tables into a single stream of useful rows.

The Data Gathering

The essential data gathering has to flatten the relatively complex file/sheet/table structure into something we can extract features from. A sequence of the final text of the scripts is what we want. Each script can be a mapping from the slide label to the voice content. It's this content -- the script text -- where we'll find the interesting features.

Here's how this starts.

from pathlib import Path
from fractions import Fraction
import csv
import re
from typing import Tuple, Dict, Iterator, List

sheet_table_pattern = re.compile(f"^(\w+): (.+)$")
target_headings = ['Part', 'Voice', 'Visual Description', 'Storyboard / Description']


def script_iter(source: Path) -> Iterator[Tuple[str, Dict[str, str]]]:
    for script_path in sorted(source.glob("*.csv")):
        # print(script_path)
        with script_path.open() as script_file:
            reader = csv.reader(script_file)
            row_iter = iter(reader)
            for row in row_iter:
                if len(row) == 1 and (match := sheet_table_pattern.match(row[0])):
                    if match and match.group(2).startswith('Table '):
                        headings = next(row_iter)
                        if headings == target_headings:
                            section = match.group(1)
                            text = {}
                            # print(f"Analyzing {section}")
                            for sub_row in row_iter:
                                if len(sub_row) == 0:
                                    break
                                dict_sub_row = dict(zip(headings, sub_row))
                                text[dict_sub_row['Part']] = dict_sub_row['Voice']
                            yield section, text


The outermost for statement locates all .csv files. All the rows within a file will belong to a number of sheets and tables within each sheet. The separator is a line with a Sheet: Table string, described by the sheet_table_pattern. The second for statement picks all the rows from a given sheet, looking for the separators.

There are a bunch of irrelevant tables. Hence the tall stack of if-statements. The useful parts of the script all have names that start with 'Table '. Weird, but true. The match.group(2).startswith('Table ') check feels like some casual ad-hoc test and should probably be made more visible and configurable.

Once we've found a table with the right headings, we can iterate over the following rows until we get to a blank line at end-of-table. We accumulate a dictionary, named text, which has the 'Part' and 'Voice' column values as a handy Dict[str, str] mapping.

Note that we're sharing an iterator, the row_iter variable, among two for statements. This is a very handy trick when doing this kind of partitioning. The outermost use of the iterator is rejecting irrelevant rows. The inner use of the iterator is assembling composite objects from a subset of rows, effectively partitioning the raw data.

This *can* be decomposed into separate functions. Further refactoring is left as an exercise for the reader.

The Benchmark Data

The result of benchmarking is a Fraction object with my unique reading pace. And yes, a Fraction makes more sense than a float value. We're working in int space, and introducing float seems wrong.

Here's the benchmarking to create a model.
def rate() -> Fraction:
    Benchmarks = [
        {'time': 3*60 + 29, 'words': 568},  # 01_01
        {'time': 5*60 + 32, 'words': 732},  # 01_04
        {'time': 5*60 + 54, 'words': 985},  # 02_04
        {'time': 4*60 + 58, 'words': 663},  # 02_05
        {'time': 8*60 + 48, 'words': 1192},  # 03_02 (draft)
    ]
    time_bm = sum(b['time'] for b in Benchmarks)
    words_bm = sum(b['words'] for b in Benchmarks)
    time_per_word = Fraction(time_bm/words_bm)
    return time_per_word

For some sample sections, I read through the material in my best NPR professional broadcasting voice. The sums of words and times give us a time-per-word Fraction object. The resulting value is near 31 seconds for 75 words.

I really like using Fraction instead of float for this kind of thing. The data doesn't support even one decimal place of supposed accuracy.

Note that I didn't factor in any slide count. I assumed this is a linear model from words to time. If I was a real scientist I might have tried a bunch of models.

Applying the Model

The model is linear. It's a scaling factor applied to a specific feature, the number of words. Here's one version of the code. I'm not sure I like it.

def main() -> None:
    time_per_word = rate()
    source = Path.cwd()
    print(f"script, slides, words, time")
    for script, body in script_iter(source):
        word_count = sum(len(text.split()) for text in body.values())
        slide_count = sum(1 for text in body.values() if len(text) > 0)
        m, s = divmod(int(word_count*time_per_word), 60)
        print(f"{script}, {slide_count}, {word_count}, {m}:{s:02d}")

There are three mappings going on here. This makes it a little tricky to create a simple function to map from raw data to something the model can use, then applying the model.

The 'word_count' is a mapping from raw data to one feature. The 'slide_count' is another mapping from raw data to a secondary feature. The 'm' and 's' values represent another mapping from the word_count to the estimated time.

We can hack this around to find another use for the assignment operator.  But the following seems insane:

divmod(int(word_count:=sum(len(text.split()) for text in body.values())*time_per_word), 60)

Let's not consider this assignment expression example as particularly helpful. The above turns two simple statements into a mess.

Alternative Implementation

The relationships among the mappings can be built a pure functional programming, but seems flirt with needless complexity. We can have a pair of functions to map the body.values() to some named tuple with feature values. We can use a third function to apply the model.

Something like this is an alternative that's slightly more functional.

class Features(NamedTuple):
    body: Dict[str, str]
    @property
    def word_count(self) -> int:
        return sum(len(text.split()) for text in self.body.values())
    @property
    def slide_count(self) -> int:
        return sum(1 for text in self.body.values() if len(text) > 0)
    def duration(self, time_per_word: Fraction) -> int:
        return int(self.word_count*time_per_word)

def main_2() -> None:
    time_per_word = rate()
    source = Path.cwd()
    print(f"script, slides, words, time")
    for script, body in script_iter(source):
        details = Features(body)
        m, s = divmod(details.duration(time_per_word), 60)
        print(f"{script}, {details.slide_count}, {details.word_count}, {m}:{s:02d}")

I'm not sure this is dramatically "better". It isolates some aspects of feature collection and model application. It also harbors a secret inefficiency. The two feature values should be cached to avoid recomputing them.

I'll leave the refactoring for the interested reader.

The durations over > 5:00 (300 seconds) need some rework. That's the actual useful output: the list of scripts with excessive time becomes the queue of content that needs rework.

Wednesday, December 4, 2019

Creating Palindromes -- if possible -- from a string of letters.

This can be an interesting exercise. I think it is something that can help people learn to code well. I found this in the LinkedIn Python community:  https://www.linkedin.com/groups/25827/

The Palindrome Problem:
Make a function that makes a palindrome out of the letters in a string and
returns -1 if this is not possible.
Convert a list of strings with the function.
Some test cases:

>>> palify('eedd')
'edde' (or 'deed')
>>> palify('wgerar')
>>> palify('uiuiqii')
'uiiqiiu' or several similar variants.



Let's not get too carried away. I like *some* of this problem.

I don't like the idea of Union[str, int] as a return type from this function. Yes, it's valid Python, but it seems like a code smell. Since the intent is to build lists, a None would be more sensible than a number; we'd have Optional[str] which seems better overall.

The solution that was posted was interesting. It did way too much work, but it was acceptable-looking Python. (It started with a big block comment with "#" on each line instead of a docstring, so... there were minor style problems, but otherwise, it was not bad.)

Here's what popped into my head, to act as a concrete response to the request for comments.

"""
Make a function that makes a palindrome out of the letters in a string and
returns -1 if this is not possible.
Convert a list of strings with the function.
Some test cases:

>>> palify('eedd')
'edde'
>>> palify('wgerar')
>>> palify('uiuiqii')
'uiiqiiu'
"""
from typing import Optional, Set


def palify(source: str) -> Optional[str]:
    """Core palindromic conversion."""
    singletons: Set[str] = set()
    pairs = list()
    for c in source:
        if c in singletons:
            pairs.append(c)
            singletons.remove(c)
        else:
            singletons.add(c)

    if pairs and len(singletons) <= 1:
        # presuming a single letter can't be palindromic.
        return ''.join(pairs+list(singletons)+pairs[::-1])
    return None

if __name__ == "__main__":
    s =  ['eedd', 'wgerar', 'uiuiqii']
    p = list(map(palify, s))
    print(f"from {s=}, we get {p=}")


The core problem statement is interesting. And the ancillary requirement is almost as interesting as the problem.

The simple-seeming "Make a palindrome out of the letters of the string" has two parts. First, there's the question of "can it even become a palindrome"? Which implies validating the source data against some set of rules. After that, we have to emit one of the many possible palindromes from the source material.

The original post had a complicated survey of the data. This was followed by an elegant way of creating a palindrome from the survey data. Since we're looking for a bunch of pairs and a singleton, I elided the more complex survey and opted to collect pairs and singletons into two separate collections.

When we've consumed the input, we will have partitioned the characters into their two pools and we can decide if the pools have the right sizes to proceed. The emission of the palindrome is a lazy assembly of the resulting data, first as a list, and then transformed to a single string.

The ancillary requirement is interesting in its own right. When a bundle of letters can't form a palindrome, that seems like a ValueError exception to me. Doing bulk transformations in the presence of ValueErrors seems wrong-ish. I already griefed about the -1 response above: it seems very bad. A None is less bad than -1. An Exception, however, seems like a more right thing to do.

Code Review Response

I think my response to the original code should be follow-up questions on why a defaultdict(int) was used to survey the data in the first place. A Counter() is a better idea, and requires less code.

The survey involved trying to locate singletons -- a laudable goal. There may have been a better approach to looking for the presence of a singleton letter in the Counter values.

More fundamentally, there are few states for each letter. There are two stark algorithmic choices: a structure keyed by letter or collections of letters. I've shown the collections, and hinted at the collection. The student response used a collection.

I think this problem serves as a good discussion for algorithmic alternatives. The core problem of detecting the possibility of palindromicity for a bunch of letters is cool. There are two choices. The handling of the exceptional case (-1, None or ValueError) is another bundle of choices.

Tuesday, December 3, 2019

Functional programming design pattern: Nested Iterators == Flattening

Here's a functional programming design pattern I uncovered. This may not be news to you, but it was a surprise to me. It cropped up when looking at something that needs parallelization to reduced the elapsed run time.

Consider this data collection process.

for h in some_high_level_collection(arg1):
    for l in h.some_low_level_collection(arg2):
        if some_filter(l):
            logger.info("Processing %s %s", h, l)
            some_function(h, l)

This is pretty common in devops world. You might be looking at all repositories in all github organizations. You might be looking at all keys in all AWS S3 buckets under a specific account. You might be looking at all tables owned by all schemas in a database.

It's helpful -- for the moment -- to stay away from taller tree structures like the file system. Traversing the file system involves recursion, and the pattern is slightly different there. We'll get to it, but what made this clear to me was a "simpler" walk through a two-layer hierarchy. 

The nested for-statements aren't really ideal. We can't apply any itertools techniques here. We can't trivially change this to a multiprocessing.map()

In fact, the more we look at this, the worse it is.

Here's something that's a little easier to work with:

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            if some_filter(l):
                logger.info("Processing %s %s", h, l)
                yield h, l

itertools.starmap(some_function, h_l_iter(arg1, arg2))

The data gathering has expanded to a few more lines of code. It gained a lot of flexibility. Once we have something that can be used with starmap, it can also be used with other itertools functions to do additional processing steps without breaking the loops into horrible pieces.

I think the pattern here is a kind of "Flattened Map" transformation. The initial design, with nested loops wrapping a process wasn't a good plan. A better plan is to think of the nested loops as a way to flatten the two tiers of the hierarchy into a single iterator. Then a mapping can be applied to process each item from that flat iterator.

Extracting the Filter

We can now tease apart the nested loops to expose the filter. In the version above, the body of the h_l_iter() function binds log-writing with the yield. If we take those two apart, we gain the flexibility of being able to change the filter (or the logging) without an awfully complex rewrite.

T = TypeVar('T')
def logging_iter(source: Iterable[T]) -> Iterator[T]:
    for item in source:
        logger.info("Processing %s", item)
        yield item

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            yield h, l

raw_data = h_l_iter(arg1, arg2)
filtered_subset = logging_iter(filter(some_filter, raw_data))
itertools.starmap(some_function, filtered_subset)

Yes, this is still longer, but all of the details are now exposed in a way that lets me change filters without further breakage.

Now, I can introduce various forms of multiprocessing to improve concurrency.

This transformed a hard-wired set of nest loops, if, and function evaluation into a "Flattener" that can be combined with off-the shelf filtering and mapping functions.

I've snuck in a kind of "tee" operation that writes an iterable sequence to a log. This can be injected at any point in the processing.

Logging the entire "item" value isn't really a great idea. Another mapping is required to create sensible log messages from each item. I've left that out to keep this exposition more focused.

I'm sure others have seen this pattern, but it was eye-opening to me.

Full Flattening

The h_l_iter() function is actually a generator expression. A function isn't needed.

h_l_iter = (
    (h, l) 
    for h in some_high_level_collection(arg1) 
        for l in h.some_low_level_collection(arg2)
)

This simplification doesn't add much value, but it seems to be general truth. In Python, it's a small change in syntax and therefore, an easy optimization to make.

What About The File System?

When we're working with some a more deeply-nested structure, like the File System, we'll make a small change. We'll replace the h_l_iter() function with a recursive_walk() function.

def recursive_walk(path: Path) -> Iterator[Path]:
    for item in path.glob():
        if item.is_file():
            yield item
        elif item.is_dir():
            yield from recursive_walk(item)

This function has, effectively the same signature as h_l_iter(). It walks a complex structure yielding a flat sequence of items. The other functions used for filtering, logging, and processing don't change, allowing us to build new features from various combinations of these functions.

tl;dr

The too-long version of this is:

Replace for item in iter: process(item) with map(process, iter).

This pattern works for simple, flat items, nested structures, and even recursively-defined trees. It introduces flexibility with no real cost.

The other pattern in play is:

Any for item in iter: for sub-item in item:  processing is "flattening" a hierarchy into a sequence. Replace it with (sub-item for item in iter for sub-item in item).

These felt like blinding revelations to me.