Tuesday, September 15, 2020

One of the Modern Python Cookbook Recipes

See https://opendatascience.com/removing-items-from-a-set-remove-pop-and-difference/

This puts the focus on a few important set operations.

Wednesday, September 9, 2020

Open Source Support Ideas

"... [I] am thinking of building an in house conda forge, or buying a solution, or paying someone to set something up."

The build v. buy decision. This is always hard. Really hard.

We used to ask "What's your business? Is it building software or making widgets?" 

And (for some) the business is making widgets.

This is short-sighted. 

But. A lot of folks in senior positions were given this as a model back in the olden days. So, you need to address the "how much non-widget stuff are we going to take on?" question.

The "Our Business is Widgets" view is short-sighted because it fails to recognize where the money is made. It's the ancillary things *around* the widgets. Things only software can do. Customer satisfaction. Supply-chain management.

So. Business development == Software development. They're inextricably bound.

With that background, let's look at what you want to do.

Open Source software is not actually "free" in any sense. Someone has to support it. If you embrace open source, then, you have to support it in-house. Somehow. And that in-house work isn't small.  

The in-house open-source support comes in degrees, starting with a distant "throw money at a maintainer" kind of action. You know. Support NumFocus and Anaconda and hope it trickles down appropriately (it  sometimes does) to the real maintainers.  

The next step is to build the tooling (and expertise) in-house. Set up Conda forge (or maybe JFrog or something else) and have someone on staff who can grow to really understand how it fits together. They may not be up to external contributions, but they can do the installs, make sure things are running, handle updates, manage certificates, rotate keys, all the things that lead to a smooth experience for users.

The top step is to hire one of the principals and let them do their open source thing but give them office space and a salary.

I'm big on the middle step. Do it in-house. It's *not* your core business (in a very narrow, legal and finance sense) but it *is* the backbone of the information-centric value-add where the real money is made.

Folks in management (usually accounting) get frustrated with this approach. It seems like it should take a month or two and you're up and running. (The GAAP requires we plan like this. Make up a random date. Make up a random budget.)

But. Then. 13 weeks into the 8-week project, you still don't have a reliable, high-performance server.  Accounting gets grumpy because the plan you gave them months ago turns out to have been riddled with invalid assumptions and half-truths. (They get revenge by cancelling the project at the worst moment to be sure it's a huge loss in everyone's eyes.)

I think the mistake is failing to enumerate the lessons learned. A lot will be learned. A real lot. And some of it is small, but it still takes all day to figure it out. Other things are big and take failed roll-outs and screwed up backup-restore activities. It's essential to make a strong parallel between open source and open learning.

You don't know everything. (Indeed, you can't, much to the consternation of the accountants.) But. You are learning at a steady rate. The money is creating significant value. 

And after 26 weeks, when things *really* seem to be working, there needs to be a very splashy list of "things we can do now that we couldn't do before."  A demo of starting a new project. `conda env create --name demo --file demo_env.yml` (with python=3.8.6 pinned in the file) and watch it run, baby. A little dask. Maybe analyze some taxicab data.

Tuesday, September 1, 2020

A Comprehensive Introduction to Python

Python 101, by Michael Driscoll. 545 pages, available from leanpub.com in a variety of formats. Available soon in hardcover.

The modern Python programming language is a large topic. A book on a programming language has to be seen as a collection of several large topics.

At its core, a book on a programming language has to cover the syntax of the language. What’s far more important is covering the underlying semantics of the various constructs. Software captures knowledge, and it’s essential for a book on a programming language to make it clear how the language expresses knowledge.

For a programming expert, a fifteen-page technical report can be enough to get started with a new language. When I was first learning to program, that’s all there was. For the vast majority of people who come in contact with programming, there’s a lot more information required.

This leads to a number of interesting tradeoffs when writing about a programming language. How much of a book should be devoted to installing the language tools? How much should it cover the other tools required to create software? I think Python 101 makes good choices.

In the modern era of open-source software, the volume and sophistication of the available tools can be daunting. An author must consider how many words to invest in text editors, debuggers, performance measurement, testing, and documentation. These are all important parts of producing software, and they’re often tied closely with a language, but these additional tools aren’t really the language itself.

A language like Python offers a rich collection of built-in data types. A book’s essential job is to cover the data structures (and algorithms) that are first-class parts of the Python language. A focus on data puts the various syntactic elements (like statements) into perspective. The break statement, for example, can’t really be discussed in isolation. It’s part of the conversation about for statements and conditional processing in if statements. Because Python 101 follows this data-first approach, I think it can help build comprehensive Python skills.

The coverage of built-in data structures in a modern language needs to include file objects. While Python reads strings and bytes, the standard library provides ways to read HTML, CSV, JSON, and XML documents. Additional packages provide access to Excel spreadsheet files. While, technically, not part of the language, these are essential parts of the problem domain a programming language like Python is designed to address. Because these are part of the book, a reader will be empowered to solve practical problems.

There was a time when a programming “paradigm” was part of a book’s theme. Functional programming, procedural programming, and object-oriented programming approaches spawned their own libraries. Some languages have a strong bias. Other languages, like Python, lack a strong bias. A developer can work with functions, using material from the first seventeen chapters of Python 101, and be happy and successful. Moving into class definitions can be helpful for simplifying certain kinds of programs, but it’s not required, and a good book on Python should treat classes as a sensible alternative to functions for handling more complex object state and bundling operations with the state.

Moving beyond the language itself, a book can only pick a few topics that can be called “advanced.” This book looks at some of the language internals, exposed via introspection. It touches on some of the standard library modules for managing subprocesses and threads. It covers tools like debuggers and profilers. It also expands to cover development environments like the Jupyter Notebook. I’d prefer to reduce coverage of threading and switch to Jupyter Lab from Jupyter Notebook. These are small changes at the edges of a large pool of important details.

I’m still waffling over one choice of advanced topics. Does unit testing count as an advanced topic? For software professionals, a testing framework is as important as the language itself. For amateur hackers, however, a testing framework may be a more advanced topic. The location of a chapter on unit testing is a telling indication of who the book’s audience is. 

The Python ecosystem includes the standard library and the vast collection of packages and applications available through the Python Package Index. These components can all be added to a Python environment. This means any book on the language must also cover parts of the standard library, as well as covering how to install new packages from the larger ecosystem. Python 101 doesn’t disappoint. There are solid chapters on PIP and Virtual Environment management. I can quibble over their place in Part II. The presence of chapters on tools is important; Python is more than a language; Python 101 makes it clear Python is a collection of tools for building on the work of others to solve problems collaboratively.

I’m not easily convinced that Part IV has the same focus on helping the new programmer as the earlier three parts. I think packaging and distribution considerations take the reader too far outside problem-solving with a programming language and tools. I’m not sure the audience who sees testing as an advanced topic is ready to distribute their code. I think there’s room for a Python 102 book to cover these more professionally-oriented topics.

The volume of material covered by this comprehensive book on Python seems to require something more elaborate than a simple, linear sequence of chapters. The sequence of chapters has jumps that seem a little awkward. For example, from an introduction to run-time introspection, we move to the PIP and virtual environment tools, then move back to ways to make best use of Python’s annotations and type hints. Calling this flow awkward is, admittedly, a highly nuanced consideration. I suspect few people will read this book sequentially; when each chapter is used more-or-less independently, the sequence of chapters becomes a minor side-bar consideration. Each chapter has generous examples and there are screen shots where necessary.

The scope of this book covers the language and the features through Python 3.8 in a complete and intelligible way. The depth is appropriate for a beginning audience and the examples are focused on simple, concrete, easy-to-understand code. The presence of review questions in each chapter is a delight, making it easy to leverage the book for instructor-guided training. I can imagine covering a few chapters each week and quizzing students with the review questions. Some of the questions are nicely advanced and can lead to further exploration of the language.

If you’re new to Python, this should be part of your Python reading list. If you’ve just started and need more examples and help in using some of the common tools, this book will be very helpful. If you’re teaching or helping guide people deeper into Python, this may be a helpful resource. 

Driscoll’s colorful nature photos are a bonus. My Kindle is limited to black and white, and the pictures would have been disappointing. I’m glad I got the PDF version.

Tuesday, August 25, 2020

Another shiny new MacBook Pro

See https://slott-softwarearchitect.blogspot.com/2014/03/shiny-new-macbook-pro.html

At the time (2014), the 8GB machine was way more than adequate for all my needs as a writer.

Enter bloat.

macOS Catalina has essentially filled this machine to the breaking point. Six short years is the lifespan. Things (generally) work, but it now crashes frequently. Sometimes, streaming TV won't play properly. I've tried a large number of remedies (reboot WiFi, reboot computer, reset Bluetooth) and it glitches too often to be comfortable.

(Rumors suggest the crashes seem to be associated with going to sleep. The machine crashes when it's idle. I come back to it and find it has restarted, and needs to restart my apps. It's not horrible. But it's an indication of a deeper problem. And it's time.)

It works. But. I've spent too many years waiting for slow computers and slow networks. An hour a day (cumulative) for 300 days a year for 40 years is 12,000 hours, which means I've spent roughly 1.4 years of my life waiting for a computer to do something.

I’m reluctantly replacing my kind-of-working "Late 2013" vintage machine with a new 13” MacBook Pro. At least 16GB RAM. At least a terabyte of storage. Hopefully, things will not be "glitchy" and I won't have constant crashes.

I’ve gotten used to having a 27" Thunderbolt Display, and a USB Qwerkywriter keyboard, and two USB disks doing backups. That's a lot of stuff plugged in all the time. Also. I really need a slot for SD cards (the boat uses micro SD cards, as does the old GoPro camera). So. A fancy USB-C hub will be essential, I think.

The question is 2 ports (power and hub) or 4 ports (power, hub, and two other things)?  I suspect I can live with 2 ports.  4 ports ships immediately.

I have several use cases:
Writing books actually requires some computing power. But. Not *too* much power. The general reader doesn’t always have a huge computer. If my examples require more computing power than my readers have access to, that’s a problem. The advantage of having a smallish computer is I’m not overstepping what’s available to my readers. This is a handy way to take a tax deduction to pay for this extravagance.

Writing fiction requires a small machine. Scrivener works on an iPad Pro. I’m good with almost anything. Even an iPhone can be used for writing and editing fiction. It’s hard, of course, with a tiny screen. But not impossible. And. I'm trying to learn the craft, so tools aren't as important as understanding character arc.

Creating MicroPython-based devices is a bit confusing right now. A lot of the development environments depend on reliable USB connectivity to the Arduino or Circuit Playground Express board. I worry about the (potential) complexity of introducing a USB hub into the mix.  I suspect I only need to replace some of my USB cables; the Arduino boards all seem to use a bulky USB type B. The CPX boards use USB type Micro B. (I think one can be replaced with a USB C to USB B “printer cable”, the other is a USB microB to C adapter. Or, maybe a USB C to USB A adapter can be used with my vast collection of legacy cables. Don't know.)

Boating involves connecting external devices like the GPS antenna to the laptop and tracking position or planning routes. This is a Bluetooth thing, generally. 

It does require considerable power for the laptop; the 60W power brick becomes a constraint. The boat has an inverter and can handle the load gracefully. A computer is a dedicated 5A draw, though; twice what the fridge pulls (and the fridge runs infrequently.) We have 225Ah available. The computer could be as bad as 120Ah if it was left on for 24 hours during an overnight passage.

The good news is that the use cases are more-or-less exclusive. The boating use case is rare. We have more thrifty navigation systems permanently installed on the boat. Many folks are using CPX boards and Arduinos with MacBook Pros, so I shouldn’t worry too much, just buy new cables.
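
The arithmetic behind those numbers can be sketched as follows. The 12 V house-bank voltage is an assumption (it isn't stated above), the variable names are made up for illustration, and inverter losses are ignored.

```python
# Rough power budget for running the laptop on the boat.
BRICK_WATTS = 60.0         # from the text: the 60W power brick
BANK_VOLTS = 12.0          # ASSUMPTION: a nominal 12 V house battery bank
BANK_CAPACITY_AH = 225.0   # from the text: 225Ah available

draw_amps = BRICK_WATTS / BANK_VOLTS                  # steady draw in amps
overnight_ah = draw_amps * 24                         # amp-hours over 24 hours
fraction_of_bank = overnight_ah / BANK_CAPACITY_AH    # share of the bank used

print(f"{draw_amps:.0f} A draw, {overnight_ah:.0f} Ah in 24 h, "
      f"{fraction_of_bank:.0%} of the bank")
```

Under those assumptions, a full day of laptop use would consume a bit more than half of the available capacity, which is why it only makes sense as an occasional thing.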

The best part?

Since I use Time Machine, the new machine recovers from the Time Machine backups. It has to be left to run overnight, but. Boom. Done.

(On the to-do list -- encrypt the backup volumes. Ugh. But. Necessary.)

Tuesday, August 11, 2020

Modern Python Cookbook 2e -- Out with the old

Most of the things that got cut were (to me) obviously obsolete. For example, replacing collections.namedtuple with typing.NamedTuple seemed like a clear case of obsolescence. A reviewer really thought I should skip all NamedTuple and use frozen data classes.

More important are some things that I learned about in my formative years. I think they're important because they're little nuggets of cool algorithm. But. Pragmatically? They're too hard to explain and don't really capture interesting features of Python.

Back in '01. Yes. The turn of the millennium. 

(Pull up a chair. This is a long yarn.)

Back in '01, I was starting to look at ways to perfect my Python and literate programming skills.

(And yes, I was using Python in '01.)

I had a project that I'd learned about in the 80's. That's in the previous millennium. A thousand years ago. Computers were large, expensive, and rare.

And. Random Number Generators (RNG's) were a bit of a struggle. In the 80's, more sensitive statistical methods were uncovering biases in the RNG's of the day. Back in the 70's, Knuth's The Art of Computer Programming, Volume 2, Seminumerical Algorithms had covered this topic pretty well. But. Not quite well enough for language libraries or OS's to offer really solid RNG's.

(The popular Mersenne Twister algorithm dates from '97.)

One of my co-workers at the time showed me a technical report that I have no real bibliographic information for. I read it, captivated, because it described -- in detail -- Knuth's statistical tests for random number generators. 

This led me to Knuth Volume 2.

This led me to implement *all* of this in Pascal (in the '80's.)

This led me to implement *all* of this in Python (in the '00's.)

There were 11 tests, counting the variations. Each is a tidy little algorithm with a tidy little implementation that can run on a big collection of data to ascertain how random it is.

  1. Frequency Test - develops frequency distribution of individual samples.
  2. Serial Test - develops frequency distribution of pairs of samples.
  3. Gap Test - develops frequency distribution of the length of gaps between samples that fall in a given range.
  4. Poker Test - develops frequency distribution for 5-card "hands" of samples over a small (16-value) domain.
  5. Coupon Collector's Test - develops frequency distribution for lengths of subsets that contain a complete set of values from a small (8-value) domain.
  6. Permutation Test - develops frequency distribution for the permutations of ordering of 4-sample selections.
  7. Runs Up Test - develops frequency distribution for lengths of "runs up" where each value is larger than the previous value; one variation covers the case where runs are statistically dependent.
  8. Runs Up Test with independent runs and a relatively large domain.
  9. Runs Up Test with a "small domain", that has a slightly different expected distribution.
  10. Maximum of T - develops frequency distribution for the largest value in a group of T values.
  11. Serial Correlation - computes the correlation coefficient between adjacent pairs of values.

What's important here is that we're gauging the degree of randomness of a collection of samples. All of these are core data science. Finding a truly random random number generator is the same as looking at a variable and seeing that it's too random to have any predictive value. This is the Type I Error problem.

Doing this with RNG's means starting with a specific seed. Which means we need to run this for a large number of seed values and compare the results. Lots of computer cycles can be burned up examining random number generators.

Lots.

The frequency test, for example. We bin the numbers and compare the frequencies. They aren't the same; they're within a few standard deviations of each other. That means you don't use 5 bins. You use 128 bins so you can compare the bin sizes to the expected bin size and compute a deviation. The deviation from expected needs to pass a chi-squared test.
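
A minimal sketch of that binning and deviation computation might look like this. It is not the original implementation; the function name is hypothetical, the samples are assumed to be floats in the range [0, 1), and the step of comparing the statistic against a chi-squared threshold is left out.

```python
import random
from collections import Counter
from typing import Iterable


def frequency_chi_squared(samples: Iterable[float], bins: int = 128) -> float:
    """Chi-squared statistic for samples in [0, 1) spread over equal-width bins."""
    # Map each sample to a bin number in 0 .. bins-1 and count the hits.
    counts = Counter(int(u * bins) for u in samples)
    n = sum(counts.values())
    expected = n / bins  # a uniform source puts the same count in every bin
    # Sum of squared deviations from expected, scaled by expected.
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(bins))


# One seed only; a real evaluation would repeat this over many seeds.
rng = random.Random(42)
statistic = frequency_chi_squared(rng.random() for _ in range(12_800))
print(statistic)
```

With 128 bins there are 127 degrees of freedom, so the statistic for a good generator should usually land somewhere near 127.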

Back in the day, chi-squared values were looked up in the back of a handy statistics book. 

That seems weak. Can we compute the exact chi-squared values?  

(Spoiler alert, Yes.)

Computing expected chi-squared values means computing Stirling numbers, Bernoulli numbers, and evaluating the incomplete gamma function. Knuth gives details on Stirling numbers. I have no reference material on Bernoulli numbers.

The Log Gamma function is ACM collected algorithms (CALGO) number 291 and 309. The incomplete gamma function is CALGO 435 and 654.

Fascinating stuff. 

To me.

Of this, only one thing ever saw the light of day.

The Coupon Collector's test. Given a long sequence composed of selections from a small pool of distinct values ("coupons"), how many samples from the overall sequence do you have to examine to collect one of each distinct coupon value? This yields a kind of Poisson distribution of the number of samples seen before getting a full set of coupons.

If there are eight kinds of coupons, the smallest number of samples we have to examine is eight. Lucky break. One of each and done. But. Pragmatically, we'll see a distribution that varies from a low of 8 to a high of -- well -- infinity. We'll see a peak at like 15 to 18 samples before collecting all eight coupons and a long, long tail. We can cut the tail at 40 samples and have a statistically useful distribution to discern if the source samples were randomly ordered.

Why did this -- of all things -- see the light of day?

It involves set manipulations.

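from typing import Iterable, Iterator, Set

def collect_coupons(samples: Iterable[int]) -> Iterator[int]:
    source = iter(samples)  # one shared iterator, so each pass resumes where the last stopped
    while True:
        coupons: Set[int] = set()
        count = 0
        for u in source:
            coupons.add(u)  # a set keeps only the distinct coupon values
            count += 1
            if len(coupons) == 8:
                break
        else:
            return  # the samples ran out before a full set was collected
        yield count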

I've used a number of variations on the above theme to use set manipulation to accumulate data.  There are a lot of ways to restate this using itertools, also. It can be viewed as a clever "reduce" algorithm.
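
To see the kind of distribution described above, the counts can be tallied with collections.Counter. This is a self-contained sketch, not the book's code: the domain_size parameter, the seed, and the sample size are all arbitrary choices made for illustration.

```python
import random
from collections import Counter
from typing import Iterable, Iterator, Set


def collect_coupons(samples: Iterable[int], domain_size: int = 8) -> Iterator[int]:
    """Yield how many samples it took to see each complete set of coupons."""
    source = iter(samples)  # shared iterator: each pass resumes after the last one
    while True:
        coupons: Set[int] = set()
        count = 0
        for u in source:
            coupons.add(u)
            count += 1
            if len(coupons) == domain_size:
                break
        else:
            return  # ran out of samples part-way through a set
        yield count


rng = random.Random(2020)  # arbitrary seed
samples = (rng.randrange(8) for _ in range(10_000))
distribution = Counter(collect_coupons(samples))

# Print a crude histogram, cutting the long tail at 40 samples.
for length in sorted(distribution):
    if length <= 40:
        print(f"{length:3d} {'*' * distribution[length]}")
```

The tally is the frequency distribution that would then be compared with the expected distribution using a chi-squared test.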

But.

It's so hard to explain. And. It's not really used much by data scientists to reject type I errors because few things fit the coupon model very well.

But. 

It's a cool set processing example.

So.

It's safely out of the book. 

Thursday, July 30, 2020

Modern Python Cookbook Journey

For the author, a book is a journey.  

Writing something new, the author describes a path the reader can follow to get from -- well -- anywhere the reader might be to the author's suggested destination. Not everyone makes the whole trip. And not everyone arrives at the hoped-for destination.

Second editions? The idea is to update the directions to reflect the new terrain.  

I'm a sailor. Here's a view of the boat.


What's important to me is the way the authorities produce revised nautical charts on a stable, regular cadence. There's no "final" chart, there's only the "current" chart. Kept up-to-date by the patient hard work of armies of cartographers. 

Is updating a book like updating the nautical charts? I don't think so. Charts have a variety of update cadences.  For sailors in the US, we start here: https://nauticalcharts.noaa.gov/charts/chart-updates.html. The changes can be frequent. See https://distribution.charts.noaa.gov/weekly_updates/ for the weekly chart updates. This is supplemented by the Notices to Mariners, here, too: https://msi.nga.mil/NTM. So, I think charts are much, much more complex than books.

Sailors have to integrate a lot of data.  This is no different from software developers having to keep abreast of language, library, and platform changes.

The author's journey is different from the reader's journey. A technical book isn't a memoir. 

The author may have crashed into all kinds of rocks and shoals. The author's panic, fear, and despair are not things the reader needs to know about. The reader needs to know the course to set, the waypoints, and hazards. The estimated distances and the places to anchor that provide shelter.

For me, creating a revision is possibly as difficult as the initial writing. I don't know how other authors approach subsequent editions, but the addition of type hints meant every example had to be re-examined.  And this meant discovering problems in code that I *thought* was exemplary. 

While many code examples can simply have type hints pasted in, some Python programming practices have type hints that can't be trivially introduced to the code. Instead some thinking is required.

Generics

Python code is always generic with respect to type. Expressions like a + b will work for a surprisingly wide variety of object classes. Of course, we expect any of the numbers to work. But lists, tuples, and strings all respond to the "+" operator. This is implemented by a sophisticated check of a's __add__() and b's __radd__() methods.
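
Here's a small sketch of that dispatch. The Meters class is a made-up toy that exists only to show which method gets called; it isn't from the book.

```python
class Meters:
    """Hypothetical toy class, used only to show how "+" is dispatched."""

    def __init__(self, value: float) -> None:
        self.value = value

    def __add__(self, other):
        # Used for Meters + something.
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented

    def __radd__(self, other):
        # Used for something + Meters, once the left operand's __add__ has
        # returned NotImplemented.
        if isinstance(other, (int, float)):
            return Meters(other + self.value)
        return NotImplemented


left = Meters(1.0) + Meters(2.0)   # Meters.__add__ handles this
right = 2 + Meters(3.0)            # int.__add__ gives up, so Meters.__radd__ runs
print(left.value, right.value)
print([1] + [2], (1,) + (2,), "a" + "b")   # the built-in sequences respond to "+" too
```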

When we write hints, it's often intended to narrow the domain of potential types. Here's some starting code.

def fact(a):
    if a == 0:
        return 1
    return a*fact(a-1)

The implied type hint is Any. This means, any class of objects that defines __eq__(), __mul__() and __sub__() will work. There are a fair number of these classes.

When we write type hints, we narrow the domain. In this case, it should be integers. Like this:

def fact(a: int) -> int:
    if a == 0:
        return 1
    return a*fact(a-1)

This tells mypy (or other, similar analytic tools) to confirm that every place the fact() function is used, the arguments will be integers. Also, the result will be an integer.

What's important is there's no run-time consequence to this. Python runs the same whether we evaluate fact(2) or fact(3.0).  The integer-based computation clearly matches the intent stated in the code. The floating-point computation is clearly at odds with the stated intent.
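
A quick way to see this for yourself is to run both calls. The variable names here are only for illustration.

```python
def fact(a: int) -> int:
    if a == 0:
        return 1
    return a * fact(a - 1)


int_result = fact(2)       # matches the hint
float_result = fact(3.0)   # contradicts the hint; a checker like mypy should
                           # report it, but Python evaluates it anyway

print(int_result, type(int_result).__name__)
print(float_result, type(float_result).__name__)
```

The hints are read by the checker, not enforced by the interpreter, so the second call quietly produces a float.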

And this brings us to the author's journey.

Shoal Water

Sometimes we have code that works. And will always work. But. The type hints are hard to express.

The most common examples?

Decorators.

Decorators can be utterly and amazingly generic. And this can make it very, very difficult to express the domain of types involved.

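from functools import wraps
from typing import Callable

def make_a_log(some_function: Callable) -> Callable:
    @wraps(some_function)
    def concrete_function(*args, **kwargs):
        print(some_function, args, kwargs)
        result = some_function(*args, **kwargs)
        print(result)
        return result  # hand the wrapped function's result back to the caller
    return concrete_function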

This is legal, but very shady Python. The use of the Callable type hint is almost intentionally misleading. It could be anything. Indeed, because of the way Python works, it can truly be any kind of function or method. Even a lambda object can be decorated with this. 

The internal concrete_function doesn't have any type hints. This forces mypy to assume Any, and that will lead to a possibly valid application of this decorator when -- perhaps -- it wasn't really appropriate.

In the long run, this kind of misleading hinting is a bad policy.

In the short run, this code will pass every unit test you can throw at it.

What does the author do?
  1. Avoid the topic? Get something published and move on? It is simpler and quicker to ignore decorators when talking about type hints. Dropping the section from the outline would have been easy.
  2. Dig deeply into how we can create Protocols to express a narrower domain of candidates for this decorator? This is work. And it's new work, since the previous edition never touched on the subject. But. Is it part of this cookbook? Or do these deeper examples belong in a separate book?
  3. Find a better example? 
Spoiler Alert: It's all three.
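
For the second option, a Protocol-based hint might look something like the following sketch. The IntFunction protocol is a hypothetical name invented here, and narrowing the decorator to one-argument integer functions is an assumption made to keep the example small; it is not necessarily what ended up in the book.

```python
from functools import wraps
from typing import Protocol, cast


class IntFunction(Protocol):
    """Hypothetical protocol: any callable that takes one int and returns an int."""

    def __call__(self, a: int) -> int:
        ...


def make_a_log(some_function: IntFunction) -> IntFunction:
    @wraps(some_function)
    def concrete_function(a: int) -> int:
        print(some_function, a)
        result = some_function(a)
        print(result)
        return result

    # The cast states the intended return type explicitly for the type checker.
    return cast(IntFunction, concrete_function)


@make_a_log
def fact(a: int) -> int:
    if a == 0:
        return 1
    return a * fact(a - 1)


fact(3)  # logs each recursive call along the way
```

The trade-off is visible right away: the hints are now honest, but the decorator no longer applies to the wide-open collection of functions the Callable version accepted.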

I start by wishing I hadn't broached the topic in the first edition. Maybe I should pretend it wasn't there and leave it out of the second edition.

Then I dig deeply into the topic, overwriting the topic until I'm no longer sure I can write about it. There's enough, and there's too much. A journey requires incremental exposition, and the side-trip into Protocols may not be the appropriate path for any but a very few readers.

After this, I may decide to throw the example out and look for something better.  What's important is having an idea of what is appropriate for the reader's journey, and what is clutter.

The final result can be better because each example can be:
  • Focused on something useful.
  • Corrected at the edge cases to work with the latest language, library, and mypy release.
  • Where necessary, replaced by an alternative example that's clearer and simpler.
Unfortunately (for me) I examine everything. Every word. Every example.

Packt seems to be tolerant of my slow pace of delivery. For me, it simply takes a long time to rewrite -- essentially -- everything. I think the result is worth all the work.

Tuesday, July 28, 2020

Modern Python Cookbook 2nd ed -- Advance Copies -- DM me

This is your "why wait" invitation.

Advance copies will be available.

IF.

And this is a big "if".

You have to write a blurb. 

I'll be putting you in contact with Packt marketing folks who will get you your advance copy so you can write blurbs and reviews and -- well -- actually use the content.

It's all updated to Python 3.8. Type hints almost everywhere. F-strings and the walrus operator. Bunches of devops and data science examples. Plus a few personal examples involving sailboat navigation and management.

See me at LinkedIn https://www.linkedin.com/in/steven-lott-029835/ and I'll hook you up with Packt marketing folks.

See https://www.amazon.com/Modern-Python-Cookbook-Updated-programmer/dp/180020745X for the official Amazon Book Link. This is for ordinary "no obligation to write a review" orders.

DM me directly slott56 at gmail to be put into the marketing spreadsheet.