Tuesday, October 20, 2020

Type Hints to extend built-in structures

Working on revisions to a book. Fun stuff. See https://www.packtpub.com/product/python-3-object-oriented-programming/9781849511261 I may have the privilege of working with Dusty.

I've been using mypy for all my 2nd edition changes, but not in --strict mode. 

I've decided to ramp things up, and switch to strict type checking for all of the examples and case studies.

This lead me to stumble over

class MyThing(dict):
    def some_extra_method(self):
        pass

I absolutely could not get this to work for hours and hours.

I more-or-less gave up on it, until I started a similar example for a later chapter.

class ListWithFeatures(list):
    def feature(self):
        pass

This is almost the same, but, somehow, I understood it better.  As written, it is rejected by mypy. What I meant was this.

class ListWithFeatures(List[MyThing]):
    @overload
    def __init__(self) -> None: ...
    @overload
    def __init__(self, source: Iterable[MyThing]) -> None: ...
    def __init__(self, source: Optional[Iterable[MyThing]]) -> None:
        if source:
	    super().__init__(source)
        else:
            super().__init__()
    def feature(self) -> float:
        return sum(thing.some_extra_method())/len(self)

I don't know why, but this was easier for me to visualize the problem.. It clarified my understanding profoundly.

We don't simply extend list or dict.  We should extend them because list is an alias for List[Any], and when being strict, we need to avoid Any. Aha.

Tuesday, October 13, 2020

Sources of Confusion: a "mea culpa" [Updated]

 I am not patient. I have been dismissed as one who does not suffer fools gladly.

This is a bad attitude, and I absolutely suffer from it. No denials here. I'm aware it limits my ability to help the deeply confused.

My personal failing is not being patient enough to discover the root cause of their confusion.

It's not that I don't care -- I would truly like to help them. It's that I can't keep my mouth shut while they explain their ridiculous chain of invalid assumptions as if things they invented have some tangible reality.

I'm old, and I'm not planning on becoming more empathetic. Instead, I've become less curious about wrong ideas. I find silence is helpful because I don't yell at them as much as I could.

Recently someone tried to tell me that a Python tuple wasn't **really** immutable. 

Yes. 

That's what they said.

A tuple of lists has lists that can be modified. (They did not have the courtesy to provide examples, I've had to add examples based I what I assume they're talking about.)

>>> t = ([], [], [])
>>> t[0].append(42)
>>> t
([42]. [], [])

"See," they said. "Mutable." 

Implicit follow-up: Python is a web of lies and there's no reason for it to be better than a spreadsheet.

I did not ask how the immutability of a tuple was magically transferred to the lists contained within the tuple. 

I did not ask about their infections disease theory of protocol transformation of types. Somehow, when associated with a tuple, the list became tuple-like, and lost a bunch of methods. 

I did not ask if they thought there was some some method-shedding feature where an immutable structure forces other data structures to shed methods.

I did not ask what was supposed to happen to a dictionary, where there's no built-in frozen dictionary.

I did not ask what would happen with a "custom" class (one created in the app, not a built-in collection.)

I did not ask what fantasy land they come from where a tuple of mutable objects would lead to immutability of the contained objects.

I did not ask if it worked the other way, too: was a list of tuples also supposed to freeze up? 

I did not ask if it transferred more than one level deep into the lists inside the tuple.

I should have.

It was an epic failing on my part to not dig into the bizzaro world where the question could arise.

BTW. They had the same complaint about frozen data classes. (Again. They did not have the courtesy to provide code. I'm guessing this is what they meant. My failure for not asking.)

>>> from typing import List
>>> from dataclasses import dataclass
>>> @dataclass(frozen=True)
... class Kidding_Right:
...     one_thing: List[int]
...     another_thing: List[int]
... 
>>> kr = Kidding_Right([], [])
>>> kr.one_thing.append(42)
>>> kr
Kidding_Right(one_thing=[42], another_thing=[])

Triumphant Sneer: "See, frozen is little more than a suggestion. The lists within the data class are *not* frozen." 

Yes. They appeared to be claiming frozen was supposed to apply transitively to the objects within the dataclass.  Appeared. My mistake was failing to ask what they hell they were talking about.

I really couldn't bear to know what caused someone to think this was in any way "confusing" or required "clarification." I didn't want to hear about transitivity and how the data class properties were supposed to infect the underlying objects. 

Their notion was essentially wrong, and wickedly so. I could have asked, but I'm not sure I could have waited patiently through their answer.

Update.

"little more than a suggestion".

Ah.

This is an strange confusion.

A dynamic language (like Python) resolves everything at run-time. It turns out that there are ways to override __getattr__() and __setattr__() to break the frozen setting. Indeed, you can reach into the internal __dict__ object and do real damage to the object.

I guess the consequences of a dynamic language can be confusing if you aren't expecting a dynamic language to actually be dynamic.

Tuesday, October 6, 2020

The Python Podcast __init__

Check out https://www.pythonpodcast.com/steven-lott-learn-to-code-episode-283/. This was a fun conversation on Python and learning.

We didn't talk about my books in detail. Instead, we talked about learning and what it takes to get closer to mastery.

It's a thing I worry about. I suspect other writers worry about it, also. Will the reader take the next steps? Or will they simply think they've got it because the read about it?

Wednesday, September 9, 2020

Open Source Support Ideas

"... [I] am thinking of building an in house conda forge, or buying a solution, or paying someone to set something up."

The build v. Buy decision. This is always hard. Really hard.  

We used to ask "What's your business? Is it building software or making widgets?" 

And (for some) the business is making widgets.

This is short-sighted. 

But. A lot of folks in senior positions were given this as a model back in the olden days. So, you need to address the "how much non-widget stuff are we going to take on?" question.

The "Our Business is Widgets" is short-sighted because it fails to recognize where the money is made. It's the ancillary things *around* the widgets. Things only software can do. Customer satisfaction. Supply-chain management. 

So. Business development == Software development. They're inextricably bound.

With that background, lets' look at what you want to do.

Open Source software is not actually "free" in any sense. Someone has to support it. If you embrace open source, then, you have to support it in-house. Somehow. And that in-house work isn't small.  

The in-house open-source support comes in degrees, starting with a distant "throw money at a maintainer" kind of action. You know. Support NumFocus and Anaconda and hope it trickles down appropriately (it  sometimes does) to the real maintainers.  

The next step is to build the tooling (and expertise) in-house. Conda forge (or maybe JFrog or something else) and have someone on staff who can grow to really understand how it fits together. They may not be up to external contributions, but they can do the installs, make sure things are running, handle updates, manage certificates, rotate keys, all the things that lead to smooth experience for users.  

The top step is to hire one of the principles and let them do their open source thing but give them office space and a salary.  

I'm big on the middle step. Do it in-house. It's *not* your core business (in a very narrow, legal and finance sense) but it *is* the backbone fo the information-centric value-add where the real money is made.  

Folks in management (usually accouting) get frustrated with this approach. It seems like it should take a month or two and you're up and running. (The GAAP requires we plan like this. Make up a random date. Make up a random budget.)

But. Then. 13 weeks into the 8-week project, you still don't have a reliable, high-performance server.  Accounting gets grumpy because the plan you have them months ago turns out to have been riddled with invalid assumptions and half-truths. (They get revenge by cancelling the project at the worst moment to be sure it's a huge loss in everyone's eyes.)

I think the mistake is failing to enumerate the lessons learned. A lot will be learned. A real lot. And some of it is small, but it still takes all day to figure it out. Other things are big and take failed roll-outs and screwed up backup-restore activities. It's essential to make a strong parallel between open source and open learning.

You don't know everything. (Indeed, you can't, much to the consternation of the accountants.) But. You are learning at a steady rate. The money is creating significant value. 

And after 26 weeks, when things *really* seem to be working, there needs to be a very splashy list of "things we can do now that we couldn't do before."  A demo of starting a new project. `conda create demo python=3.8.6 --file demo_env.yml` and watch it run, baby. A little dask. Maybe analyze some taxicab data.

Tuesday, September 1, 2020

A Comprehensive Introduction to Python

Python 101, by Michael Driscoll. 545 pages, available from leanpub.com in a variety of formats. Available soon in hardcover.

The modern Python programming language is a large topic. A book on a programming language has to be seen as a collection of several large topics.

At its core, a book on a programming language has to cover the syntax of the language. What’s for more important is covering the underlying semantics of the various constructs. Software captures knowledge, and it’s essential for a book on a programming language to make it clear how the language expresses knowledge.

For a programming expert, a fifteen page technical report can be enough to get started with a new language. When I was first learning to program, that’s all there was. For the vast majority of people who come in contact with programming, there’s a lot more information required.

This leads to a number of interesting tradeoffs when writing about a programming language. How much of a book should be devoted to installing the language tools? How much should it cover the other tools required to create software? I think Python 101 makes good choices.

In the modern era of open-source software, the volume and sophistication of the available tools can be daunting. An author must consider how many words to invest in text editors, debuggers, performance measurement, testing, and documentation. These are all important parts of producing software, they’re often tied closely with a language, but these additional tools aren’t really the language itself.

A language like Python offers a rich collection of built-in data types. A book’s essential job is to cover the data structures (and algorithms) that are first-class parts of the Python language. A focus on data puts the various syntactic elements (like statements) into perspective. The break statement, for example, can’t really be discussed in isolation. It’s part of the conversation about for statements and conditional processing in if statements. Because Python 101 follows this data-first approach, I think it can help build comprehensive Python skills.

The coverage of built-in data structures in a modern language needs to include file objects. While Python reads strings and bytes, the standard library provides ways to read HTML, CSV, JSON, and XML documents. Additional packages provide access to Excel spreadsheet files. While, technically, not part of the language, these are essential parts of the problem domain a programming language like Python is designed to address. Because these are part of the book, a reader will be empowered to solve practical problems.

There was a time when a programming “paradigm” was part of a book’s theme. Functional programming, procedural programming, and object-oriented programming approaches spawned their own libraries. Some languages have a strong bias. Other languages, like Python, lack a strong bias. A developer can work with functions, using material from the first seventeen chapters of Python 101 and be happy and successful. Moving into class definitions can be helpful for simplifying certain kinds of programs, but it’s not required, and a good book on Python should treat classes as a sensible alternative to functions for handling more complex object state and bundle operations with the state.

Moving beyond the language itself, a book can only pick a few topics that can be called “advanced.” This book looks at some of the language internals, exposed via introspection. It touches on some of the standard library modules for managing subprocesses and threads. It covers tools like debuggers and profilers. It expands to cover development environments like the Jupyter Notebook, also. I’d prefer to reduce coverage of threading and switch to Jupyter Lab from Jupyter Notebook. These are small changes at the edges of large pool of important details.

I’m still waffling over one choice of advanced topics. Does unit testing count as an advanced topic? For software professionals, a testing framework is as important as the language itself. For amateur hackers, however, a testing framework may be a more advanced topic. The location of a chapter on unit testing is a telling indication of who the book’s audience is. 

The Python ecosystem includes the standard library and the vast collection of packages and applications available through the Python Package Index. These components can all be added to a Python environment. This means any book on the language must also cover parts of the standard library, as well as covering how to install  new packages from the larger ecosystem. Python 101 doesn’t disappoint. There are solid chapters in PIP and Virtual Environment management. I can quibble over their  place in Part II. The presence of chapters on tools is important; Python is more than a language; Python 101 makes it clear Python is a collection of tools for building on the work of others to solve problems collaboratively.

I’m not easily convinced that Part IV has the same focus on helping the new programmer as the earlier three parts. I think packaging and distribution considerations take the reader too far outside problem-solving with a programming language and tools. I’m not sure the audience who sees testing as an advanced topic is ready to distribute their code. I think there’s room for a Python 102 book to cover these more professionally-oriented topics.

The volume of material covered by this comprehensive book on Python seems to require something more elaborate than a simple, linear sequence of chapters. The sequence of chapters have jumps that seem a little awkward. For example, from an introduction run-time introduction introspection, we move to the PIP and virtual environment tools, then move back to ways to make best use of Python’s annotations and type hints. Calling this flow awkward is — admittedly — a highly nuanced consideration. I suspect few people will read this book sequentially; when each chapter is used more-or-less independently, the sequence of chapters becomes a minor side-bar consideration. Each chapter has generous examples and there are screen shots where necessary. 

The scope of this book covers the language and the features through Python 3.8 in a complete and intelligible way. The depth is appropriate for a beginning audience and the examples are focused on simple, concrete, easy-to-understand code. The presence of review questions in each chapter is a delight, making it easy to leverage the book for instructor-guided training. I can imagine covering a few chapters each week and quizzing students with the review questions. Some of the questions are nicely advanced and can lead to further exploration of the language.

If you’re new to Python, this should be part of your Python reading list. If you’ve just started and need more examples and help in using some of the common tools, this book will be very helpful. If you’re teaching or helping guide people deeper into Python, this may be a helpful resource. 

Driscoll’s colorful nature photos are a bonus. My Kindle is limited to black and white, and the pictures would have been disappointing. I’m glad I got the PDF version.

Tuesday, August 25, 2020

Another shiny new MacBook pro

See https://slott-softwarearchitect.blogspot.com/2014/03/shiny-new-macbook-pro.html

At the time (2014), the 8Gb machine was way more than adequate for all my needs as a writer.

Enter bloat.

Mac OS Catalina has essentially filled this machine to the breaking point. Six short years is the lifespan. Things (generally) work, but it now crashes frequently. Sometimes, streaming TV won't play properly. I've tried a large number of remedies (reboot WiFi, reboot computer, reset Bluetooth) and it glitches too offten to be comfortable.

(Rumors suggest the crashes seem to be associated with going to sleep. The machine crashes when it's idle. I come back to it and find it has restarted, and needs to restart my apps. It's not horrible. But it's an indication of a deeper problem. And it's time.)

It works. But. I've spent too many years waiting for slow computers and slow networks. An hour a day (cumulative) for 300 days a year for 40 years means I've spent 1.3 years of my life waiting for a computer to do something.

I’m reluctantly replacing my kind-of-working "Late 2013" vintage machine with a new 13” MacBook Pro. At least 16Gb RAM. At least a terabyte of storage. Hopefully, things will not be "glitchy" and I won't have constant crashes.

I’ve gotten used to having an 27" Thunderbolt Display, and a USB Querkywriter keyboard, and two USB disks doing backups. That's a lot of stuff plugged in all the time. Also. I really need a slot for SD cards (the boat uses micro SD cards, as does the old GoPro camera.). So. A fancy USB-C hub will be essential, I think.

The question is 2 ports (power and hub) or 4 ports (power, hub, and two other things)?  I suspect I can live with 2 ports.  4 ports ships immediately.

I have several use cases:
Writing books actually requires some computing power. But. Not *too* much power. The general reader doesn’t always have a huge computer. If my examples require more computing power than my readers have access to, that’s a problem. The advantage of having a smallish computer is I’m not overstepping what’s available to my readers. This is a handy way to take a tax deduction to pay for this extravagance.

Writing fiction requires a small machine. Scrivener works on an iPad Pro. I’m good with almost anything. Even an iPhone can be used for writing and editing fiction. It’s hard, of course, with a tiny screen. But not impossible. And. I'm trying to learn the craft, so tools aren't as important as understanding character arc.

Creating MicroPython-based devices is a bit confusing right now. A lot of the development environments depend on a reliable USB connectivity to the Arduino or Circuit Playground Express board. I worry about the (potential) complexity of introducing a USB hub into the mix.  I suspect I only need to replace some of my USB cables; the Arduino boards all seem to use a bulky USB type B. The CPX use USB type Micro B. (I thinks one can be replaced with a USB C to USB B “printer cable”, the other is a USB microB to C adapter. Or, maybe a USB C to USB A adapter can be used with my vast collection of legacy cables. Don't know.)

Boating involves connecting external devices like the GPS antenna to the laptop and tracking position or planning routes. This is a Bluetooth thing, generally. 

It does require considerable power for the laptop; the 60W power brick becomes a constraint. The boat have an inverter and can handle the load gracefully. A computer is a dedicated 5A draw, though; twice what the fridge pulls (and the fridge runs infrequently.) We have 225Ah available. The computer could be as bad as 120Ah if it was left on for 24 hours during an overnight passage.

The good news is that the use cases are more-or-less exclusive. The boating use case is rare. We have more thrifty navigation systems permanently installed on the boat. Many folks are using CPX and Arduino’s with MacBook Pro’s, so I shouldn’t worry too much, just buy new cables.

The best part?

Since I use Time Machine, the new machine recovers from the Time Machine backups. It has to be left to run overnight, but. Boom. Done.

(On the to-do list -- encrypt the backup volumes. Ugh. But. Necessary.)

Tuesday, August 11, 2020

Modern Python Cookbook 2e -- Out with the old

 Most of the things that got cut were (to me) obviously obsolete. For example, replacing collections.namedtuple with typing.NamedTuple seemed like a clear example of obsolete. A reviewer really thought I should skip all NamedTuple and use frozen data classes. 

More important are some things that I learned about in my formative years. I think they're important because they'll little nuggets of cool algorithm. But. Pragmatically? They're too hard to explain and don't really capture interesting features of Python.

Back in '01. Yes. The turn of the millennium. 

(Pull up a chair. This is a long yarn.)

Back in '01, I was starting to look at ways to perfect my Python and literate programming skills.

(And yes, I was using Python on '01.)

I had a project that I'd learned about in the 80's. That's in the previous millennium. A thousand years ago. Computers were large, expensive, and rare.

And. Random Number Generators (RNG's) were a bit of a struggle. In the 80's, more sensitive statistical methods were uncovering biases in the RNG's of the day. Back in the 70's, Knuth's The Art of Computer Programming, Volume 2, Seminumerical Algorithms had covered this topic pretty well. But. Not quite well enough for language libraries or OS's to offer really solid RNG's.

(The popular Mersenne Twister algorithm dates from '97.)

One of my co-workers at the time showed me a technical report that I have no real bibliographic information for. I read it, captivated, because it described -- in detail -- Knuth's statistical tests for random number generators. 

This lead me to Knuth Volume 2. 

This lead me to implement *all* of this in Pascal (in the '80's.)

This lead me to implement *all* of this in Python (in the '00's.)

There were 10 tests. Each is a tidy little algorithm with a tidy little implementation that can run on a big collection of data to ascertain how random it is. 

  1. Frequency Test - develops frequency distribution of individual samples.
  2. Serial Test - develops frequency distribution of pairs of samples.
  3. Gap Test - develops frequency distribution of the length of gaps between groups samples in a given range.
  4. Poker Test - develops frequency distribution for 5-card "hands" of samples over a small (16-value) domain.
  5. Coupon Collector's Test - develops frequency distribution for lengths of subsets that contain a complete set of values from a small (8-value) domain.
  6. Permutation Test - develops frequency distribution for the permutations of ordering of 4-sample selections.
  7. Runs Up Test - develops frequency distribution for lengths of "runs up" where each value is larger than the previous value; one variation covers the case where runs are statistically dependent.
  8. Runs Up Test with independent runs and a relatively large domain.
  9. Runs Up Test with a "small domain", that has a slightly different expected distribution.
  10. Maximum of T - develops frequency distribution for the largest value in a group of T values.
  11. Serial Correlation - computes the correlation coefficient between adjacent pairs of values.

What's important here is that we're gaging the degree of randomness of a collection of samples. All of these are core data science. Finding a truly random random number generator is the same as looking at a variable and seeing that it's too random to have any predictive value. This is the Type I Error problem.

Doing this with RNG's means starting with a specific seed. Which means we need to run this for a large number of seed values and compare the results. Lots of computer cycles can be burned up examining random number generators.

Lots.

The frequency test, for example. We bin the numbers and compare the frequencies. They aren't the same; they're within a few standard deviations of each other. That means you don't use 5 bins. You use 128 bins so you can compare the bin sizes to the expected bin size and compute a deviation. The deviation for expected needs to pass a chi-squared test.

Back in the day, chi-squared values were looked up in the back of a handy statistics book. 

That seems weak. Can we compute the exact chi-squared values?  

(Spoiler alert, Yes.)

Computing expected chi-squared values means computing Sterling numbers, Bernoulli numbers, and evaluating the partial gamma function. Knuth gives details on Sterling numbers. I have no reference material on Bernoulli numbers. 

The Log Gamma function is ACM collected algorithms (CALGO) number 291 and 309. The incomplete gamma function is CALGO 435 and 654.

Fascinating stuff. 

To me.

Of this, only one thing ever saw the light of day.

The Coupon Collector's test. Given a long sequence composed of selections from a small pool of distinct values ("coupons"), how many samples from the overall sequence do you have to examine to collect one of each distinct coupon value? This yields a kind of Poisson distribution of the number of samples seen before getting a full set of coupons.

If there's eight kinds of coupons, the smallest number of samples we have to examine is eight. Lucky break. One of each and done. But. Pragmatically, we'll see a distribution that varies from a low of 8 to a high of -- well -- infinity. We'll see a peak at like 15 to 18 samples before collecting all eight coupons and a long, long tail. We can cut the tail at 40 samples and have a statistically useful distribution to discern of the source samples were randomly ordered.

Why did this -- of all things -- see the light of day?

It involves set manipulations.

def collect_coupons(samples: Iterable[int]) -> Iterator[int]:
    while True:
        coupons = set()
        count = 0
        for u in samples: 
            coupons |= u
            count += 1
            if len(set(coupons)) == 8:
                break
        yield count

I've used a number of variations on the above theme to use set manipulation to accumulate data.  There are a lot of ways to restate this using itertools, also. It can be viewed as a clever "reduce" algorithm.

But.

It's so hard to explain. And. It's not really used much by data scientists to reject type I errors because few things fit the coupon model very well.

But. 

It's a cool set processing example.

So.

It's safely out of the book. 

Thursday, July 30, 2020

Modern Python Cookbook Journey

For the author, a book is a journey.  

Writing something new, the author describes a path the reader can follow to get from -- well -- anywhere the reader might be to the author's suggested destination. Not everyone makes the whole trip. And not everyone arrives at the hoped-for destination.

Second editions? The idea is to update the directions to reflect the new terrain.  

I'm a sailor. Here's a view of the boat.


What's important to me is the way the authorities produce revised nautical charts on a stable, regular cadence. There's no "final" chart, there's only the "current" chart. Kept up-to-date by the patient hard work of armies of cartographers. 

Is updating a book like updating the nautical charts? I don't think so. Charts have a variety of update cadences.  For sailors in the US, we start here: https://nauticalcharts.noaa.gov/charts/chart-updates.html. The changes can be frequent. See https://distribution.charts.noaa.gov/weekly_updates/ for the weekly chart updates. This is supplemented by the Notices to Mariners, here, too: https://msi.nga.mil/NTM. So, I think charts are much, much more complex than books.

Sailors have to integrate a lot of data.  This is no different from software developers having to keep abreast of language, library, and platform changes.

The author's journey is different from the reader's journey. A technical book isn't a memoir. 

The author may have crashed into all kinds of rocks and shoals. The author's panic, fear, and despair are not things the reader needs to know about. The reader needs to know the course to set, the waypoints, and hazards. The estimated distances and the places to anchor that provide shelter.

For me, creating a revision is possibly as difficult as the initial writing. I don't know how other authors approach subsequent editions, but the addition of type hints meant every example had to be re-examined.  And this meant discovering problems in code that I *thought* was exemplary. 

While many code examples can simply have type hints pasted in, some Python programming practices have type hints that can't be trivially introduced to the code. Instead some thinking is required.

Generics

Python code is always generic with respect to type. Expressions like a + b will work for a surprisingly wide variety of object classes. Of course, we expect any of the numbers to work. But lists, tuples, and strings all respond to the "+" operator. This is implemented by a sophisticated check of a's __add__() and b's __radd__() methods.

When we write hints, it's often intended to narrow the domain of potential types. Here's some starting code.

def fact(a):
   if a == 0:
       return 1
   return a*fact(a-1)

The implied type hint is Any. This means, any class of objects that defines __eq__(), __mul__() and __sub__() will work. There are a fair number of these classes.

When we write type hints, we narrow the domain. In this case, it should be integers. Like this:

def fact(a: int) -> int:
    if a == 0:
        return 1
    return a*fact(a-1)

This tells mypy (or other, similar analytic tools) to confirm that every place the fact() function is used, the arguments will be integers. Also, the result will be an integer.

What's important is there's no run-time consequence to this. Python runs the same whether we evaluate fact(2) or fact(3.0).  The integer-based computation clearly matches the intent stated in the code. The floating-point computation is clearly at odds with the stated intent.

And this brings us to the author's journey.

Shoal Water

Sometimes we have code that works. And will always work. But. The type hints are hard to express.

The most common examples?

Decorators.

Decorators can be utterly and amazingly generic. And this can make it very, very difficult to express the domain of types involved.

def make_a_log(some_function: Callable) -> Callable:
    @wraps(some_function)
    def concrete_function(*args, **kwargs):
        print(some_function, args, kwargs)
        result = some_function((*args, **kwargs)
        print(result)
    return concrete_function

This is legal, but very shady Python. The use of the Callable type hint is almost intentionally misleading. It could be anything. Indeed, because of the way Python works, it can truly be any kind of function or method. Even a lambda object can be decorated with this. 

The internal concrete_function doesn't have any type hints. This forces mypy to assume Any, and that will lead to a possibly valid application of this decorator when -- perhaps -- it wasn't really appropriate.

In the long run, this kind of misleading hinting is a bad policy.

In the short run, this code will pass every unit test you can throw at it.

What does the author do?
  1. Avoid the topic? Get something published and move on? It is simpler and quicker to ignore decorators when talking about type hints. Dropping the section from the outline would have been easy.
  2. Dig deeply into how we can create Protocols to express a narrower domain of candidates for this decorator? This is work. And it's new work, since the previous edition never touched on the subject. But. Is it part of this cookbook? Or do these deeper examples belong in a separate book?
  3. Find a better example? 
Spoiler Alert: It's all three.

I start by wishing I hadn't broached the topic in the first edition. Maybe I should pretend it wasn't there and leave it out of the second edition.

Then I dig deeply into the topic, overwriting the topic until I'm no longer sure I can write about it. There's enough, and there's too much. A journey requires incremental exposition, and the side-trip into Protocols may not be the appropriate path for any but a very few readers.

After this, I may decide to throw the example out and look for something better.  What's important is having an idea of what is appropriate for the reader's journey, and what is clutter.

The final result can be better because it can be:
  • Focused on something useful.
  • Any edge cases can be corrected to work with the latest language, library, and mypy release.
  • Where necessary, replaced by an alternative example that's clearer and simpler.
Unfortunately (for me) I examine everything. Every word. Every example.

Packt seems to be tolerant of my slow pace of delivery. For me, it simply takes a long time to rewrite -- essentially -- everything. I think the result is worth all the work.

Tuesday, July 28, 2020

Modern Python Cookbook 2nd ed -- Advance Copies -- DM me

This is your "why wait" invitation.

Advanced copies will be available.  

IF.

And this is a big "if".

You have to write a blurb. 

I'll be putting you in contact with Packt marketing folks who will get you your advanced copy so you can write blurbs and reviews and -- well -- actually use the content.

It's all updated to Python 3.8. Type hints almost everywhere. F-strings and the walrus operator. Bunches of devops and data science examples. Plus a few personal examples involving sailboat navigation and management.

See me at LinkedIn https://www.linkedin.com/in/steven-lott-029835/ and I'll hook you up with Packt marketing folks.

See https://www.amazon.com/Modern-Python-Cookbook-Updated-programmer/dp/180020745X for the official Amazon Book Link. This is for ordinary "no obligation to write a review" orders.

DM me directly slott56 at gmail to be put into the marketing spreadsheet.

Tuesday, June 30, 2020

Over-Solving or Solving Problems You Don't Have

Sometimes we call them "Belt and Braces" solutions. As a former suspenders person who switched to belts, the idea of wearing both is a little like over-engineering. In the unlikely event of catastrophic failure of one system, your pants can still remain properly hoist. There's a weird, but defensible reason for that. Most over-engineering lacks a coherent reason. 

Sometimes we call them "Bells and Whistles." The solution has both bells and whistles for signaling. This is usually used in a derogatory sense of useless noisemakers, there for show only. Again, there's a really low-value and dumb, but defensible reason for this. 

While colorful, none of this is helpful for describing over-engineered software. Over-engineered software is often over-engineered for incoherent and indefensible reasons.

Over-engineering generally means trying to solve a problem that no user actually has. This leads to throwing around irrelevant features.

Concrete Example

I lived on a boat. I spent a fair amount of time fretting over navigation. 

There are two big questions: 
  1. How far apart are two points, really. 
  2. What's the real bearing from one point to another.
These are -- in some cases -- easy to answer.

If you have a printed, paper chart at the right scale, you can use dividers to compute a distance. It's actually a very easy task. Similarly, you can read the bearing off the chart directly. There's a trick to comparing a course to a nearby compass rose, but it's easy to learn and very accurate.

Of course, we don't want to painstakingly copy our notes from a paper chart to a spreadsheet to add them up to get total distance. And then fold in speed to get time and fuel consumption. These summary computations are a pain.

What you want is to do all of this with a computer.
  1. Plot the points using a piece of software like OpenCPN (https://opencpn.org).
  2. Extract the GPX file.
  3. Compute distances, bearings, and durations to create a route.
"So?" you ask.

So. When I did this, I researched the math and got a grip on the haversine formula for doing the spherical geometry computation of distances between points on a sphere.

It's not too bad. The formula are big-ish. But manageable. See http://www.edwilliams.org/avform.htm#Dist for the great circle distance formula.


For airplanes and powered freighters crossing oceans, this is perfect.

For a small sailboat going from Annapolis, Maryland, to the Bahamas, this level of complexity is craziness. While accurate, it doesn't really solve the problem I have. 

I don't actually need that much accuracy. 

I need this much accuracy.


And no more. This is the essential hypotenuse distance using an R-factor to convert the difference between latitudes and the distance between longitudes into pretty-close distances. For nautical miles, R is 60×180÷π. 

This is simpler and it solves the problem I actually have. Up to about 232 miles, the answer is within 1 mile of correct. The error grows quickly. Double the distance and the error seems to jump to 8 miles. A 464 mile sailing journey (at 6 knots) takes 3 days. Wind, weather, tides and currents will introduce more error than the simplifying assumptions.

What's important is this can be put into a spreadsheet without pain. I don't need to write sophisticated Python apps to apply haversine to sequences of way-points. I can do a simpler hypotenuse computation on waypoints converted to radians.

Is there a lesson learned?

I think there is.

There's the haversine a super-general solution. It handles great-circle routes elegantly. 

But it doesn't solve my actual problem. And that makes it over-engineering.

My problem is what we call rhumb-line sailing. Over short-enough distances the world may as well be flat. Over slightly longer distances, errors in the ship's compass and speedometer make a hyper-accurate great circle route moot. 

(Even with several fancy GPS-based navigation computers, a prudent mariner has paper backups. The list of waypoints, estimated times and directions are essential when the boat's GPS reciever fails.)

I don't really need the sophistication (and the potential for bugs) with haversine. It doesn't solve a problem I actually have.

Tuesday, June 2, 2020

Overcoming Incuriosity -- Sailing Over The Horizon

I'm in regular contact with a few folks who seem remarkably incurious.

Seem.

Perhaps they're curious about something other than software. I don't know.

But I do know they're remarkably incurious about software. And are trying to write Python applications.

I know some people don't sail out of sight of their home port. I've sailed over a few horizons. It's not courage. It's curiosity. And patience. And preparation.

I find this frustrating. I refuse to write their code for them.

But any advice I give them devolves to "Do you have an example?" With the implicit "Which I can copy and paste?"

Even the few who claim they don't want examples, suffer from a paralyzing level of incuriosity. They can't seem to make search work because they never read beyond the first few results on their first attempt. A lot of people seem to be able to make search work; and the incurious folks seem uniquely paralyzed by search.

And it's an attribute I don't understand.

Specific example.

They read through the multiprocessing module until they got to examples with apply_async() and appear to have stopped reading.  They've asked for code reviews on two separate module. Both based on apply_async().

One module was so hopelessly broken it was difficult to make the case that it could never be made to work. There's a way the results of apply_async() have to be consumed, and the code not only did not reflect this, it seemed like they had decided specifically never to consider an alternative. (Spoiler alert, it requires an explicit wait().)

The results were sometimes consumed -- by luck -- and the rest of the time, the app was quirky. It wasn't quirky. It was deplorably wrong. And "reread the apply_async()" advice fell on deaf ears. They couldn't have failed to read the page in the standard library documentation, no, it had to be Python or Windows or me or something.

The other module was a trivial map() application. But. Since apply_async() has an incumbency, there was an amazingly elaborate implementation that amounted to rebuilding apply() or map() with globals and callbacks. This was wrapped by queue processing of Byzantine complexity. The whole mess appeared to stem from an unwillingness to read the documentation past the first example.

What to do?

My current suggestion is to exhaustively enumerate each of the methods for putting work into the processing pool. Write an example of each and every one.

In effect: "Learn the methods by building throw-away code."

I anticipate a series of objections. "Why write throw-away code?" and this one: "That's not realistic, what do you do?"

What do I do?

I write throw-away code.

But that's no substitute for a lack of curiosity.

Tuesday, May 26, 2020

Modern Python Cookbook 2nd ed -- big milestone

Whew.

Chapter rewrites finished.

Technical reviews in process.

Things are going pretty well. Look for Packt to publish this in the next few months. Details will be posted.

Now. For LinkedIn Learning course recordings.

Tuesday, April 21, 2020

Why Python is not the programming language of the future -- a response

See https://towardsdatascience.com/why-python-is-not-the-programming-language-of-the-future-30ddc5339b66.

This is an interesting article with some important points. And. It has some points that I disagree with.

  • Speed. This is a narrow perspective. numpy and pandas are fast, dask is fast. A great many Python ecosystem packages are fast. This complaint seems to be unsupported by evidence.
  • Dynamic Scoping Rules. This actually isn't the problem. The problem is something about not being able to change containing scopes. First, I'm not sure changing nesting scopes is of any value at all. Second, the complaint ignores the global and nonlocal statements. The vague "leads to a lot of confusion" seems unsupported by any evidence. 
  • Lambdas. The distinction between expressions and statements isn't really a distinction in Python in general, only in  the bodies of lambdas. I'm not sure what the real problem is, since a lambda with statements seems like a syntactic nightmare better solved with an ordinary, named function.
  • Whitespace. Sigh. I've worked with many people who get the whitespace right but the {}'s wrong in C++. The code looks great but doesn't work. Python gets it right. The code looks great and works.
  • Mobile App Platform. See https://beeware.org
  • Runtime Errors. "coding error manifests itself at runtime" seems to be the problem. I'm not sure what this means, because lots of programming languages have run-time problems. Here's the quote: "This leads to poor performance, time consumption, and the need for a lot of tests. Like, a lot of tests." Performance? See above. Use numpy. Or Cuda. Time consumption? Not sure what this means. A lot of tests? Yes. Software requires tests. I'm not sure that a compiled language like Rust, Go, or Julia require fewer tests. Indeed, I think the testing is essentially equivalent.
I'm interested in ways Python could be better. 

Tuesday, April 14, 2020

The COBOL-to-SomeBetterLang Translator

Here's a popular idea.
... a COBOL-to-X translator, where X is a more-modern programming language ...
This is a noble aspiration.

In principle -- down deep -- all programming can be reduced to an idealized Turing Machine.

This means that we *should* be able to locate all the state changes in a given spaghetti-bowl of COBOL. Given the abstract state transitions, we can emit a version of that machine in any language.

Emphasis on the *should*.

There are road-blocks.

The first two are rarities. But. When confronted with these, we'll have significant problems.

  • The ALTER statement means the code can be changed at run-time. There are constraints, but still... When the code is not static, the possible domain of state changes moves outside working storage and into the procedure division itself.
  • A data structure with a RENAMES clause. This adds a layer of alternative naming, making the data states quite a bit more complex.
The next one is a huge complication: the GOTO statement. This makes state transitions extremely difficult to analyze. It's possible to untangle any GOTO of arbitrary complexity into properly tested IF and WHILE statements. 

However. The tangle of GOTO's may have been actually meaningful. It may have carried some suggestion of a business owner's intent. A COBOL elimination algorithm may turn tangled code into opaque code. (It's also possible that it clarifies age-old bad programming.)

The ordinary REDEFINES clause. This was heavily used as a storage optimization for the tiny, slow file systems we had back in the olden days. It's a union of distinct types. And. It's a "free" union. We do not know how to distinguish the various types that are being redefined. It's intimately tied to processing logic in the procedure division.

Just to make it even more horrifying...

File layouts evolve over time. It's entirely possible for a *working*, *valid*, *in-production* file to have content that does not match any working program's DDE. The data has flags or indicators or something that lets the app glide past the bad data. But the data is bad. It used to be good. Then something changed, and now it's almost uninterpretable. But the apps work because there are enough paths through the logic to make the row "work" without it matching any file layout in any obvious way.

I'm not sure an automated translation from COBOL is of any value. 

I think it's far better to start with file layouts, review the code, and then write new code from scratch in a modern language. This manual rewrite leads directly to small programs that -- in a modern language -- are little more than class definitions. In some cases, each legacy COBOL app would like becomes a Python module.

Given snapshots of legacy files, the Python can be tested to be sure it does the same things. The processing is not nuanced, or tricky, or even particularly opaque.

The biggest problem is the knowledge captured in COBOL code tends to be disorganized. The real work is disentangling it. A language that supports ruthless refactoring will be helpful.

Tuesday, April 7, 2020

Why Isn't COBOL Dead? Or Why Didn't It Evolve?

Here's part of the question:
Why didn't COBOL evolve more successfully?
FORTRAN, OTOH, has survived precisely because it--and more importantly, related tools, esp compilers--has evolved to solve/overcome many (certainly not all!) of the sorts of pain-points you describe, while retaining the significant performance edge that (IMHO, ICBW) prevents challengers (e.g., Python) from dislodging it for tasks like (e.g.) running dynamical models (esp weather forecasting).
In short, why is FORTRAN still OK? Why is COBOL not still OK?

Actually, I'd venture to say the stories of these languages are essentially identical. They're both used because they have significant legacy implementations.

There's a distinction, that I think might be relevant to the "revulsion factor."

Folks don't find Fortran quite so revolting because it's sequestered into libraries where we don't really have to look at it. It's often wrapped into SciPy. The GCC compiler system handles it and we're happy.

COBOL, however, isn't sequestered into libraries with tidy Python wrappers and Conda installers. COBOL is the engine of enterprise applications.

Also. COBOL is used by organizations that suffer from high amounts of technical inertia, which makes the language a kind of bellwether for the rest of the organization. The organization changes slowly (or not at all) and the language changes at an even more tectonic pace.

This is a consequence of very large organizations with regulatory advantages. Governments, for example, regulate themselves into permanence. Other highly-regulated industries like banks and insurance companies can move slowly and tolerate the stickiness of COBOL.

Also.

For a FORTRAN library function that does something useful, it's not utterly mysterious. There's often a crisp mathematical definition, and a way to test the implementation. There are no quirks.

For a COBOL program that does something required by law, there can still be absolutely opaque mysteries and combinations of features without acceptable unit test cases. This isn't for lack of trying. It's the nature of "application" vs. "subroutine."

The special case and exceptions have to live somewhere. They live in the application.

For FORTRAN, the exceptions are in the Python wrapper using numpy using FORTRAN.

For COBOL, the exceptions are in the COBOL  Somewhere.

The COBOL Problem


It's a tweet, so I know there's no room for depth here.

As it is, it's absolutely correct. Allow me to add to it.

First. Replacing COBOL with something shiny and new is more-or-less impossible. Replacing COBOL is a two-step job.

1. Replace the COBOL with something that's nearly identical but written in a new language. Python. Java. Scala. Whatevs. Language doesn't matter. What matters is the hugeness of this leap.

2. Once the COBOL is gone and the mainframe powered off, then you can rebuild things yet again to create RESTful API's and put many shiny things around it.

Second. Replacing COBOL is essential. Software is a form of knowledge capture. If the language (and tools) have become opaque, then the job of knowledge capture has failed. Languages drift. The audience is in a constant state of flux. New translations are required.

Let's talk about the "Nearly Identical But In A New Language."

Nearly Identical

COBOL code has two large issues in general
  • Data. The file layouts are very hard to work with. I know a lot about this. 
  • Processing. The code has crap implementations of common data structures. I know. I wrote some. There's more, we'll get to it.
We have -- for the most part -- two kinds of COBOL code in common use.
  • Batch processing. Once upon a time, we called it "Programming in the Large." The Z/OS Job Control Language (JCL) was a kind of shell script or AWS Step Function state transition map among applications. This isn't easy to deal with because the overall data flow is not a simple Directed Acyclic Graph (DAG.) It has cycles and state changes.
  • Interactive (once called "on-line") processing. We called it OLTP: On-Line Transaction Processing. There are two common frameworks, CICS and IMS, and both are complicated.
Okay. Big Breath. What do we *DO*?

Here's the free consulting part.

You have to run the new and old side-by-side until you're sick of the errors and poor performance of the old machine.

You have to migrate incrementally, one app at a time.

It's hellishly expensive to positively determine what the COBOL really did. You can't easily do a "clean-room" conversion by writing intermediate specifications. You must read the COBOL and rewrite it into Python (or Java or Scala or whatever.)

You cannot unit test your way to success here, because you never really knew what the COBOL does/did. All you can do is extract example records and use those to build Gherkin-language acceptance tests using a template like this. GIVEN a source document WHEN the app runs THEN the output document matches the example. 

In effect, you're going to do TDD on the COBOL, replacing COBOL with Python essentially 1-for-1 until you have a test suite that passes.

Don't do this alphabetically, BTW. 

The processing graph for COBOL will include three essential design patterns for programs. "Edit" programs validate and possibly merge input files. "Update" programs will apply changes to master files or databases. "Report" programs will produce useful reports and feeds for reporting systems that involve yet more data derivation and merging.

  1. Find the updates. Convert them first. They will involve the most knowledge capture, A/K/A "Business Logic."  There will be a lot of special cases and exceptions. You will find latent bugs that have always been there.
  2. Convert the programs that produce files for the updates, working forward in the graph.
  3. The "reporting" is generally a proper DAG, and should be easier to deal with than the updates and edits. You never know, but the reporting apps are filled with redundancy. Tons of reporting programs are minor variations on each other, often built as copy-pasta from some original text and then patched haphazardly. Most of them can be replaced with a tool to emit CSV files as an interim step.
Each converted application requires two new steps injected into the COBOL batch jobs.
  • Before an update runs, the files are pushed to some place where they can be downloaded.
  • The app runs as it always had. For now.
  • After the update, the results are pushed, also.
This changes merely slow things down with file transfers. It provides fodder for parallel testing.

Then. 

Two changes are made so the job now looks like this.
  • Before an update runs, the files are pushed to some place where they can be downloaded. (No change here.)
  • Kill time polling the file location, waiting for the file to be created externally. (The old app is still around. We could run it if we wanted to.) 
  • After the update, download the results from the external location.
This file-copy-and-parallel-run dance can, of course, be optimized if you take whole streams of edit-update processing and convert them as a whole.

Yes, But, The COBOL Is Complicated

No. It's not.

It's a lot of code working around language limitations. There aren't many design patterns, and they're easy to find.
  1. Read, Validate, Write. The validation is quirky, but generally pretty easy to understand. In the long run, the whole thing is a JSONSchema document. But for now, there may be some data cleansing or transformation steps buried in here.
  2. Merged Reading. Execute the Transaction. Write. The transaction execution updates are super important. These are the state changes in object classes. They're often entangled among bad representations of data. 
  3. Cached Data. A common performance tweak is to read reference data ("Lookups") into an array. This was often hellishly complex because... well... COBOL. It was a Python dict, for the love of God, there's nothing to it. Now. Then. Well. It was tricky.
  4. Accumulators. Running totals and counts were essential for audit purposes. The updates could be hidden anywhere. Anywhere. Not part of the overall purpose, but necessary anyway.
  5. Parameter Processing. This can be quirky. Some applications had a standard dataset with parameters like the as-of-date for the processing. Some applications prompted an operator. Some had other quirky ways of handling the parameters.
The bulk of the code isn't very complex. It's quirky. But not complicated.

The absolute worst applications were summary reports with a hierarchy. We called these "control break" reports. I don't know why. Each level of the hierarchy had its own accumulators. The data had to be properly sorted. It was complicated. 

Do Not Convert these. Find any data cleansing or transformation and simply pour the data into a CSV file and let the users put it into a spreadsheet.

Right now. We have to keep the lights on. COBOL apps have to be kept operational to manage unemployment benefits through the pandemic.

But once we're out of this. We need to get rid of the COBOL.

And we need to recognize that all code expires and we need to plan for expiration. 

Tuesday, March 17, 2020

70% of Modern Python Cookbook 2e...

At this point, we're closing in on 9/13 (70%) of the way through the 2nd edition rewrite.

Important changes.
  1. Type Hints
  2. Type Hints
  3. Type Hints
First. Every single class, method, or function has to be changed to add hints. Every. Single. One. This is kind of huge. The book is based on over 13,000 lines of example code in 157 files. A big bunch of rewrites.

Second. Some things were either wrong or at least sketchy. These rewrites are important consequences of using type hints in the first place. If you can't make mypy see things your way, then perhaps your way needs rework.

Third. dataclasses, frozen dataclasses, and NamedTuples have some nuanced overlapping use cases. Frequently, they differ only by small type hint changes.

I hate to provide useless non-advice like "try them and see which works for you." However, there's only so much room to try and beat out a detailed list of consequences of each alternative. Not every decision has a clear, prescriptive, "do this and you'll be happy." Further, I doubt any reader needs detailed explanations of *potential* performance consequences of mutable vs. immutable objects.

Also. I'm very happy cutting back on the overwrought, detailed explanations. This is (a) not the only book on Python, and (b) not my only book. When I started the first drafts 20 years ago, I wrote as though this was my magnum opus, a lifetime achievement.  A Very Bad Idea (VBI™).

This is a resource for people who want more depth. At work, I spend time coaching people who call themselves advanced beginners. The time spent with them has helped me understand my audience a lot better, and stuck to useful exposition of the language features.


Tuesday, February 25, 2020

Stingray Reader Pervasively Bad Decision

I made some bad decisions when I wrote this a few years ago: https://github.com/slott56/Stingray-Reader. Really bad. And. Recently, I've burdened myself with conflicting goals. Ugh.

I need to upgrade to Python 3.8, and add type hints. This exposed somes badness.

See https://slott-softwarearchitect.blogspot.com/2020/01/stingray-reader-rewrite.html for some status.

The very first version(s) of this were expeditious solutions to some separate-but-related problems. Spreadsheet processing was an important thing for me f. Fixed-format file versions of spreadsheets showed up once in a while mixed with XLS and CSV files. Separately, COBOL code analysis was a thing I'd been involved in going back to the turn of the century.

The two overlap. A lot.

The first working versions of apps to process COBOL data in Python relied on a somewhat-stateful representation of the COBOL DDE (Data Definition Element.) The structure had to be visited more than once to figure out size, offset, and dimensionality. We'll talk about this some more.

A slightly more clever algorithm would leverage the essential parsing as a kind of tree walk, pushing details down into children and summarizing up into the parent when the level number changed. It didn't seem necessary at the time.

Today

I've been working for almost three weeks on trying to disentangle the original DDE's from the newer schema. I've been trying to invert the relationships so a DDE exists independently of a schema attribute. This means some copy-and-paste of data between the DDE source and the more desirable and general schema definition.

It turns out that some design decisions can be pervasively bad. Really bad-foundation-wrecks-the-whole-house kind of bad.

At this point, I think I've teased apart the root cause problem. (Of course, you never know until you have things fixed.)

For the most part, this is a hierarchical schema. It's modeled nicely by JSONSchema or XSD. However. There are two additional, huge problems to solve.

REDEFINES. The first huge problem is a COBOL definition can redefine another field. I'm not sure about the directionality of the reference. I know many languages require things be presented in dependency order: a base definition is provided  lexically first and all redefinitions are subsequent to it. Rather than depend on order of presentation, it seems a little easier to make a "reference resolution" pass. This plugs in useful references from items to the things they redefine, irrespective of any lexical ordering of the definitions.

This means we data can only be processed strictly lazily. A given block of bytes may have multiple, conflicting interpretations. It is, in a way, a free union of types. In some cases, it's a discriminated union, but the discriminating value is not a formal part of the specification. It's part of the legacy COBOL code.

OCCURS DEPENDING ON. The second huge problem is the number of elements in an array can depend on another field in the current record. In the common happy-path cases, occurrences are fixed. Having fixed occurrences means sizes and offsets can be computed as soon as the REDEFINES are sorted out.

Having occurrences depending on data means sizes and offsets cannot be computed until some data is present. The most general case, then, means settings sizes and offsets uniquely for each row of data.

Current Release

The current release (4.5) handles the ODO, size, and offset computation via a stateful DDE object.

Yes. You read that right. There are stateful values in the DDE. The values are adjusted on a row-by-row basis.

Tomorrow

There's got to be a better way.

Part of the problem has been conflicting goals.

  • Minimal tweaks required to introduce type hints.
  • Minimal tweaks to break the way a generic schema depended on the DDE implementation. This had to be inverted to make the DDE and generic schema independent.
The minimal tweaks idea is really bad. Really bad. 

The intent was to absolutely prevent breaking the demo programs. I may still be able to achieve this, but... There needs to be a clean line between the exposed work-book like functionality, and some behind the scenes COBOL DDE processing.

I now think it's essential to gut two things:
  1. Building a schema from the DDE. This is a (relatively) simple transformation from the COBOL-friendly source model to a generic, internal model that's compatible with JSONSchema or XSD. The simple attributes useful for workbooks require some additional details for dimensionality introduced by COBOL.
  2. Navigating to the input file bytes and creating Workbook Cell objects in a way that fits with the rest of the Workbook abstraction.
The happy path for Cell processing is more-or-less by attribute name: row.get('attribute').  This changes in the presence of COBOL OCCURS clause items. We have to add an index. row.get('ARRAY-ITEM', index=2) is the Python version of COBOL's ARRAY-ITEM(3).

The COBOL variable names *could* be mapped to Python names, and we *could* overload __getitem__() so that row.array_item[3] could be valid Python to fetch a value.

But nope. COBOL has 1-based indexing, and I'm not going to hide that. COBOL has a global current instance of the row, and I'm not going to work with globals. 

So. Where do I stand?

I'm about to start gutting. Some of the DDE size-and-offset (for a static occurrences)

Tuesday, February 11, 2020

Interesting Data Restructuring Problem

This seemed like an interesting problem. I hope this isn't someone's take-home homework or an interview question. It seemed organic enough when I found out about it.

Given a document like this...

doc = {
    "key": "the key",
    "tag1": ["list", "of", "values"],
    "tag2": ["another", "list", "here"],
    "tag3": ["lorem", "ipsum", "dolor"],
}


We want a document like this...

doc = {
    "key": "the key",
    "values": [
        {"tag1": "list", "tag2": "another", "tag3": "lorem"},
        {"tag1": "of", "tag2": "list", "tag3": "ipsum"},
        {"tag1": "values", "tag2": "here", "tag3": "dolor"},
    ]
}


In effect, rotating the structure from Dict[str, List[Any]] to List[Dict[str, Any]].
Bonus, we need to limiting the rotation to those keys with a value of List[Any], ignoring keys with atomic values (int, str, etc.).

Step 1. Key Partitioning

We need to distinguish the keys to be rotated from the other keys in the dict.
We start with Dict[str, Union[List[Any], Any]]. We need to distinguish the two subtypes in the union.

from itertools import filterfalse
list_of_values = lambda x: isinstance(doc[x], list)
lov_keys = list(filter(list_of_values, doc.keys()))
non_lov_keys = list(filterfalse(list_of_values, doc.keys()))

This gets two disjoint subsets of keys: those which have a list and all the others. The others, presumably, are strings or integers or something irrelevant.

List lengths

There's no requirement for the lists to be the same lengths. We have three choices here:
  • insist on uniformity,
  • truncate the long ones,
  • pad the short ones.

We'll opt for uniformity in this example. Truncating is what zip() normally does. Padding is what itertools.zip_longest() does.

lengths = (len(doc[k]) for k in lov_keys)
sample = next(lengths)
assert all(l == sample for l in lengths), "Inconsistent lengths"

Some folks don't like using assert for this. This can be a more elaborate if-raise ValueError() if that's necessary.

Use zip() to merge data values

We have several List[Any] instances in the document. The intermediate goal is a List[Tuple[Any, ...]] structure where the items from each tuple are chosen from the source lists. This gets us a sequence of tuples that have parallel selections of items from each of the source lists.

The zip(list, list) function produces pairs from each of the two lists. In our case, we have n lists in the original document. A zip(*lists) will produce a sequence of items selected from each list.

Here's what it looks like:

list(zip(*(doc[k] for k in lov_keys)))

We can also use zip(key-list, value-list) to make a list of key-value pairs from a tuple of the keys and a tuple of values. zip(Tuple[Any, ...], Typle[Any, ...]]) gives us a List[Tuple[Any, Any]] structure. These objects can be turned into dictionaries with the dict() function.

It looks like this:

list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))

Assemble the parts

The final document, then, is built from untouched keys and touched keys.

d1 = {
    k: doc[k] for k in non_lov_keys
}
d2 = {
    "values": list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))
}
d1.update(d2)

It might be slightly easier to "somehow" build this as s single dictionary, but the two subsets of keys make it seem more sensible to build the resulting document in two parts.

The code I was asked to comment on was quite complex. It built a large number of intermediate structures rather than building a List[Dict] using a list comprehension.

What's important about this problem is the complexity of the list comprehension. In particular, the keys are used twice in the comprehension. One use extracts the source lists from the original document. The second use attaches the key to each value from the original list.

It almost seems like the Python 3.8 "Walrus" operator might be a handy way to shrink this code down from about 14 lines. I'm not sure it's helpful to make this any shorter. Indeed, I'm not 100% sure this compact form is really optimal. The fact that I had to expand things as part of an explanation suggests that separate lines of code are as important as separate subsections of this blog post.

Tuesday, February 4, 2020

Dictionary clear() as a code smell

Using the clear() method of a dict isn't *wrong*. But. The reasons have to be investigated. I got a question about this code not working "properly." ("Properly"? Seems too vague to be useful.)

Here's a summary of the example.

final_list = []
temp_dict = {}
for obj in some_source:
    cool_function(obj, temp_dict)
    final_list.append(temp_dict)
    temp_dict.clear()  # Ready for reuse, right?


This can't work.

(Bonus points if you suspect that list.append() is a smell, too. There may be a list comprehension solution that's tidier than this.)

It's not always easy to get to a succinct statement of what doesn't work "properly," or what's confusing about the Python list structure. Getting useful information can be hard. Why?
  • Some programmers are "Assumptions First" kind of people, and their complaint is often "doesn't match my assumption" not "doesn't actually work."
  • Some people live in "All Details Matter" world. Rather than create the smallest example of code that's confusing, they send the *entire* project. The problem is buried in a log, wrapped with "Why is the list of dictionaries not being properly updated?" In an email that provides background details. For a Trello story that links to background details. Details. None of which point to the problem. 

"Properly?" What does that even mean?

 Confronted with hundreds of lines of impenetrable code, I asked for a definition of "properly" and got these exact seven words: "Properly is defined as correctly or satisfactorily."

So... 

They have no idea what's wrong, can't summarize the code that's broken, and it's my fault because I'm the Python guru.

Why Won't My Code Work?

The short answer is "Because You're Making an Assumption."

Of course, anyone who puts their assumptions first is as blind to their assumptions as we are to the air that surrounds us. Assumptions are just there. All around them. They breathe their assumptions in and out without seeing them.

The long answer is Python uses references.

If you apply the id() function to the items in the resulting final_list, you'll see that it's reference after reference of one object, temp_dict.  Not copies of individually populated dictionaries, but multiple references to the same dictionary. The same dictionary which was cleared and reloaded over and over again.

The very first log, crammed with useless details, had output from print() functions. It showed multiple copies of the same dict. 

Because they assumed Python is making copies, there was no explanation for why the list of dictionaries was broken. Clearly, it couldn't be in their code. They assume their code is correct. The only choice has to be an undocumented mystery in Python. And I'm the Python guru, so it's my problem.

The presence of duplicates in the output meant "something" to them. They could point it out as somehow wrong. But the idea that their assumptions might be wrong? That was a nope.

They wanted it to be the list object, final_list, which didn't append dictionaries the way they assumed it would. They needed it to be a Python internals problem. They needed it to be a bad documentation problem. (Seriously. These convos have spun out of control in the past.)

tl;dr

Using the clear() method of a dict may indicate the developer is hoping Python shares copies, not references. Either add an explicit copy() (or deepcopy.copy()) or fix things to create new, fresh dictionaries each time. Objects are cheap. Why reuse them?

(Indeed, an interesting side-bar question I did not ask is "In what god-forsaken programming language does this 'clear-and-reuse' a data structure even make sense? FORTRAN?)

The list comprehension solution to this problem will have to wait. Stay tuned. I want to disentangle the algorithmic design problem from the "why aren't my assumptions correct?" problem..

Tuesday, January 28, 2020

Stingray Reader Rewrite

See https://slott-softwarearchitect.blogspot.com/2020/01/stingrayreader-upgrade.html

This drifted into some serious rethinking of bad design decisions. (If someone else did this, I'd call it weak, and suggest improvements. It was me. It was bad. I'm a bad programmer and I feel bad about it.)

An an example, there's this sketchy construct:

some_data = {name: source[name] for name in the_names}
the_object = SomeClass(**some_data)

The some_data dictionary could be called Dict[str, Any], but that's unhelpful for letting mypy check the consistency of data structures. This is what was required:

  FullAttr = TypedDict("FullAttr",
      {
          "name": str,
          "offset": int,
          "size": int,
          "type": str,
          "create": Cell,
      },
      total=False
  )

This dictionary changes -- profoundly -- the relationship between classes. The FullAttr type gives us an intermediary representation. The SomeClass hierarchy has a flexible collection of attributes. We can use this to uncouple some parsing operations from object factory operations, using this minimal subset of definitions as a kind of bridge between modules, both of which can be fully type-checked, but still permit Python's duck-type flexibility.

It Got Worse

Adding type hints to Stingray Reader required navigating some shoal water created by a poor set of dependency decisions.

The original, vague, concept was to have a Schema and Attribute definition that could be shared by all the various readers. A schema contains a number of attributes. Ideally, an attribute can be defined by a sub-schema. This is how JSONSchema and XSD work.

But.

The Stingray Reader reads Workbooks with an extension to read COBOL. There are a bunch of extensions required.
  • The schema is loaded by a COBOL parser. 
  • The physical file formats require the possibility of EBCDIC -> Unicode conversion. 
  • Unlike ordinary workbooks, the record layouts have to be built lazily. An ordinary workbook row is complete. Some physical formats elide empty cells, but they're easy to replace with an explicit empty cell. COBOL, has a REDEFINES clause that means we can't even attempt to parse the bytes for a row until they're required by the app. There's no way -- from the data definition alone -- to discern which of the redefines options will have valid data. There's more, but you get the idea: COBOL is kind of complex.
Versions 1 to 4 had a dumb-as-a-bag-of-hammers problem.

The Schema and Attribute definitions where extended to depend on COBOL implementation details.

It works nicely because of duck typing and late binding of types.

Python's type hinting exposes the grotesque consequences of this dependency.

We tried several ways of reordering a bunch of definitions to remove forward type references. It took almost an hour to realize the circularity could not be removed trivially because of a circularity. Two Attribute subclasses depended on COBOL features. And the COBOL features had weakref references back to their Attributes.

Crushing everything into a single, large module, worked to ease the complications or circularity. But the essential interdependence needs to be expunged.

What has to happen next is to invert the relationship between Attributes and COBOL details. This means two changes:

  1. Extending the Attribute class hierarchy to contain just enough information to cover the COBOL complications. 
  2. Changing the function that builds an Attribute definition from the COBOL source so it copies details into the Attribute. The COBOL detail needs to be little more than the description of the property.

This isn't easy. But. 187 test cases and a TOX setup makes it a reasonable effort.

And.

I can finally look seriously at converting between JSON Schema and COBOL. 

Tuesday, January 21, 2020

StingrayReader Upgrade

See https://github.com/slott56/Stingray-Reader

It's time to add type hints.

And.

Learn some interesting lessons.

Here's the interesting problem:

some_data = {name: source[name] for name in the_names}
the_object = SomeClass(**some_data)

While valid, this concerns mypy.

The point here is to have a flexible source of data, source. Perhaps this is a spreadsheet row, or a complex JSON/YAML-formatted document with optional or irrelevant fields. The short list of relevant names is in the_names.  Ideally, this list of names matches the keyword args of SomeClass.

This gives mypy fits because there's no way to match the dictionary with the object's parameters.

We have two paths forward.
  1. Eliminate the intermediate dictionary. Use SomeClass(x=source['x'], y=source['y'], ... etc.)
  2. Consider using a TypedDict for the intermediate dictionary. But. Then the dictionary's types must be kept in sync with the SomeClass definition, which may be a little crazy.
Item 2 isn't as crazy as it sounds, though. The SomeClass definition has a **kwargs option, allowing extra attributes to be set. This is, perhaps also crazy. But, the framework needs to drag around extra attributes for the application's benefit.

A possibility is to do away with **kwargs, and replace it with other: Dict[Any, Any]. This cuts down on the expressivity of the framework. Now we support SomeClass.app_name. This change would mean we'd have SomeClass.other['app_name']. While possibly better for mypy, I don't think it's ideal for users.

I can also rework SomeClass to use __getattribute__() to look into self.other for extra attribute names.

I'm very happy to have the rigorous static check. The rethinking is helpful.

("Wait," you say. "You didn't provide the recommended path forward."  Correct.  I'll update.)