Tuesday, January 16, 2018

I've decided on Windows -- Please help justify my choice

After many words, the email chain I received netted out to this:
  1. I can't teach myself data science on my crappy old Windows machine.
  2. I've decided to get a new Windows machine. Here are the specs.
My response was a mixture of incredulity and bafflement.

It appears that two things happened while I wasn't paying attention.
  1. Apple ceased to exist.
  2. The cloud ceased to exist.
I'm aware that many people think the Apple alternative is a non-starter. They are sure that after 40+ years selling computes, Apple is doomed, and we'll only have Windows on the desktop. Seriously.

Some people have farcical explanations for why Apple Cannot be Taken Seriously.

In this specific instance, there was a large investment in Python and Java that somehow couldn't be rebuild in Mac OS X. Details were explicitly not provided. Which is a way of saying there were no tangible "requirements" for this upgrade. Just specification numbers.

Import note. None of this involved "data" or "science." That was the baffling part. No objective measurement of anything. No list of software titles. No projects. No dataset sizes. Nothing.

The anti-cloud argument was even stranger than the anti-Mac argument.

Somehow, a super-large AWS server -- let's say it was a x1.16xlarge -- being used an hour a day (365 day*1 hr/day*$1.82/hr = $664) was deemed *more* expensive that a 64Gb 6-core home-based machine that would sit idle 23 hours each day.

The best part of $664/yr being *more* expensive?

Expert Judgement.

No "data". No "science". No measurement. No supporting details.

I wish I'd kept the email describing how someone who knew something said something about pricing.  It was marvelous Highest Paid Person's Opinion nonsense.

AFAIK, they were using 8,766 hours per year to compare AWS computing vs. at-home computing. This meant that an m5.4xlarge should be considered as costing $1,939 each year. Presumably because they'd never shut it off.

It included terms like "half-way decent performance."

There's a depth of wrongness to this that's hard to characterize beyond no "data" and no "science".

Tuesday, January 9, 2018

Code Rewrites and How They Create Value -- Stop Fighting Against It

TL;DR -- To remove doubts and questions, rewrite it.

Many, many people are confronted with the request to maintain someone else's code.

Either it's open source, and you have to make formal PR's visible to the world.

Or it's "enterprise" in-house software, and you have to make PR's visible to a work team.


It's "work group" in-house software, and you have source that may not be under proper source-code control installed on a server where you're taking over someone else's carefully built structure of porcelain components.

In the first case -- public code -- a rewrite is a challenge. People depend on it having a well-known interface. A small change here could alienate large swaths of the user community. On the other hand. A small change here could make the project *more* useful to *more* people. This is challenging. My advice here is limited.

When we look at enterprise software, however, a rewrite won't have quite the same "blast radius."

When we look at work-group software, there is no blast radius.

Benefits of a Rewrite

Why rewrite? Three reasons.

  1. You can understand it. (This is HUGE.)
  2. You can make it objectively better (i.e., higher PyLint scores, better documentation coverage.)
  3. You can add or expand the test cases. 

The first and foremost reason -- understanding -- can't be understated.


It's a HUGE fight every time. The standard argument is "If It Ain't Broke, Don't Fix It."

This is, of course, based on misunderstanding the level of "broke." A delicately-balanced tower of porcelain components that worked once last month is -- in effect -- already pre-broken if any change will disturb the structure and ruin everything.

There are lots of examples.

  • The app only works with Pandas 0.12.0 and will not work with 0.13.x or the 0.22.0 you have in your default Conda environment.
  • The app only works when you provide --someoption=False, and no one can figure out why.
  • Some test cases have @skip because they don't work. But should.
  • The setup.py doesn't work and you can only run it using PYTHONPATH.
  • The default logging initialization is "somehow" not right and requires a manual override in the app.

I know. I've created all of these problems.

On one hand, we have management: "It ain't broke."

On the other hand, we have everyone else: "It's a fragile nightmare of pre-broken components that cannot be touched."

The Script

Here's how it plays out. In the Real World.

Management: "It ain't broke. Don't fix it."
You: "I can't make it work."
Management: "It ran last month."
You: "I made one small change and it doesn't run this month."
Management: "Back out the change."
You: "The results are then useless."

After much Grrr and Gnashing...

Management: "Identify the problems and we'll prioritize."
You: "Here are a dozen things."
Management: "These are too vague. Be more specific."
You: "Here are a score of things."
Management: "You're in the weeds. Bring it up a level to where business people can understand it."

After more Grrr and Gnashing...

Management: "What's the smallest change we can get away with?"
You: "The one I made that broke everything."
Management: "Let's have lots of peer review and design walkthroughs."
You: "Cool, then you'll see how broken it is."
Management: "Okay. Let's not. Instead, make the smallest change you can."
You: "I made one small change and it doesn't run this month."
Management: "Back out the change."
You: "The results are then useless."

Can we break the cycle of uselessness?


We may be struggling with management folks who are set against fixing what's obviously broken. They're living a rich fantasy life that you can't really change.


The Grrr and Gnashing part of the dialog represents time in which useful stuff can be done. Specifically. Rewrites.

Rewriting Strategy

There are three important parts of rewriting.
  • Understand what it's doing and why.
  • Describe it with test cases.
  • Make it objectively clear (i.e., high PyLint scores, complete documentation, etc.)
The effort often involves multiple passes. I like to describe it as Test-Driven Reverse Engineering (TDRE).

  1. Create (or expand) the test cases.
  2. Rewrite the code.
  3. Repeat until it's better.
It's essential to do these in order. Without test cases, rewrites are only more breakage. With test cases, rewrites are guaranteed to produce the same results as the previous mess of horrible code.

Sometimes the test cases are really a kind of "system test" where the whole application is run against some known inputs to produce some expected outputs. This is better than nothing. It supports building fine-grained unit test cases that conform to the system test case.

Other times, the test cases may be proper unit tests and the rewrites can be at a finer level of granularity. In this case, test coverage may have to be expanded to include the fragile bits. In some cases, the rewrites may be necessary to make the code testable in the first place.

Adding test cases is objectively valuable work.

Even the dumbest of "It ain't broke" managers can recognize this value. The rewrites are a beneficial consequence of adding test cases. You may be able to achieve a goal of fixing something without ever being seen as "fixing" it. All you did was improve test case coverage and improve the "design for testability."

Costs and Benefits

Consider the cost of struggling vs. the cost of rewriting.

It's the same 80 hours of effort.

In one case, you struggled with something management insisted wasn't broken. Eventually, you found ways to make it work.

In the other case, you rewrite something management insisted wasn't broken. In the end, you actually understood it and created objective improvements in the code.

Which is better?

Tuesday, December 19, 2017

The TLA Problem -- Over-Engineering Three-Letter Acronyms

Here's something we can gleefully over-engineer. Because anything worth doing is worth over-engineering until it morphs into a different kind of problem.

I'm unclear on the backstory, so try not to ask "why are we doing this?" I think it has something to do with code camp and there are teenagers involved.

What we want to do is generate a pool of 200 TLA's. Seems simple, right?

The first pass may not be obvious, but this works well.

import string
import random
from typing import Iterator

DOMAIN = string.ascii_lowercase
def tlagen() -> Iterator[str]:
    while True:
        yield "".join(random.choice(DOMAIN) for _ in range(3))

tla_iter = tlagen()
for i in range(200):

This makes 200 TLA's. It's configurable, if that's important. We could change tlagen() to take an argument in case we wanted four-letter acronyms or something else.

However. This will generate words like "cum" and "ass". We need a forbidden word list and want to filter out some words. Also, there's no uniqueness guarantee.

The Forbidden Word Filter

Here's a simple forbidden word filter. Configure FORBIDDEN with a set of words you'd like to exclude. Maybe exclude "sex" and "die", too. Depends on what age the teenagers are.

def acceptable(word: str) -> bool:
    return word not in FORBIDDEN

Here's another handy thing. Rather than repeat the "loop until" logic, we can encapsulate it into a function.

from typing import TypeVar, Iterator
T_ = TypeVar("T_")
def until(limit: int, source: Iterator[T_]) -> Iterator[T_]:
    source_iter = iter(source)
    for _ in range(limit):
        item = next(source_iter)
        yield item

Okay. That's workable. This lets us use the following:

list(until(200, filter(acceptable, tlagen())))

Read this from inside to outside. First, generate a sequence of TLA's. Apply a filter to that generator. Apply the "until" counter to that filtered sequence of TLA's. Create a list with 200 items. Nice.

The TypeVar gives us a flexible binding. The input is an iterator over things and the output will be an iterator over the same kinds of things. This formalizes a common understanding of how iterators work. The mypy tool can confirm that the code meets the claims in the type hint.

There are two sketchy parts about this. First the remote possibility of duplicates. Nothing precludes duplicates. And we're doing a bunch of string hash computations. To avoid duplicates, we need a growing cache of already-provided words. Or perhaps we need to build a set until it's the right size. Rather than compute hashes of strings, can we work with the numeric representation directly?

Numeric TLA's

There are only a few TLA's.
$$26^{3} = 17576 $$
Of these, perhaps four are forbidden. We can easily convert a number back to a word and work with a finite domain of integers. Here's a function to turn an integer into a TLA.

def intword(number: int) -> str:
    def digit_iter(number: int) -> Iterator[int]:
        for i in range(3):
            number, digit = divmod(number, 26)
            yield digit
    return "".join(DOMAIN[d] for d in digit_iter(number))

This will iterate over the three digits using a simple divide-and-remainder process to extract the digits from an integer. The digits are turned into letters and we can build the TLA from the number.

What are the numeric identities of the forbidden words?

def polynomial(base: int, coefficients: Sequence[int]) -> int:
    return sum(c*base**i for i, c in enumerate(coefficients))

def charnum(char: str) -> int:
    return DOMAIN.index(char)
def wordint(word: str) -> int:
    return polynomial(26, map(charnum, word))

We convert a word to a number by mapping individual characters to numbers, then computing a polynomial in base 26. And yes, the implementation of polynomial() is inefficient because it uses the ** operator instead of folding in a multiply-and-add operation among the terms of the polynomial.

Here's another way to handle the creation of TLA's.

FORBIDDEN_I = set(map(wordint, FORBIDDEN))
subset = list(set(range(0, 26**3)) - FORBIDDEN_I)
return list(map(intword, subset[:200]))

This is cool. We create a set of numeric codes for all TLA's, then remove the few numbers from the set of TLA's. What's left is the entire domain of permissible TLA's. All of them. Shuffle and pick the first 200.

It guarantees no duplicates. This has a lot of advantages because it's simple code.

This, however, takes a surprisingly long time: almost 17 milliseconds on my laptop.

Numeric Filtering

Let's combine the numeric approach with the original ideal of generating as few items as we can get away with, but also checking for duplicates.

First, we need to generate the TLA numbers instead of strings. Here's a sequence of random numbers that is confined to the TLA domain.

def tlaigen() -> Iterator[int]:
    while True:
        yield random.randrange(26**3)

Now, we need to pass unique items, and reject duplicate items. This requires a cache that grows. We can use a simple set. Although, a bit-mask with 17,576 bits might be more useful.

def unique(source: Iterator[T_]) -> Iterator[T_]:
    cache = set()
    for item in source:
        if item in cache:
        yield item

This uses an ever-growing cache to locate unique items. This will tend to slow slightly based on memory management for the set. My vague understanding is the implementation will double the size when hash collisions start occurring, leading a kind of log2 slowdown factor as the set grows.

The final generator looks like this:

list(until(200, unique(filter(lambda w: w not in FORBIDDEN_I, tlaigen()))))

Reading from inner to outer, we have a generator which will produce numbers in the TLA range. The few forbidden numbers are excluded. The cache is checked for uniqueness. Finally, the generator stops after yielding 200 items.


Of course, we're using timeit to determine the overall impact of all of this engineering. We're only doing 1,000 iterations, not the default of 1,000,000 iterations.

The original version: 0.94 seconds.

The improved number-based version: 0.38 seconds.

So there.  Want to generate values from a limited domain of strings? Encode things as numbers and work with the numeric representation. Much faster.

Saturday, December 16, 2017

Ordered Keys in Dictionaries

Raymond Hettinger (@raymondh)
#python news: 😀 @gvanrossum just pronounced that dicts are now guaranteed to retain insertion order. This is the end of a long journey.

Tuesday, December 12, 2017

The Business of Book Promotions

It is hard (for me) to promote my books. It seems like empty vanity. I realize that it's not -- promotion is essential -- but it's difficult.

Packt just send a raft of detailed information for authors. Some things they suggest I do.

  • ✅Referrals. Want a "free" Python book? The "free" is in scare quotes because you'll have to write a review. Otherwise, the book is yours. DM me @s_lott and I'll put you on the list.
  • ✅Amazon Author Profile: https://www.amazon.com/Steven-Lott/e/B00HNRSLEK Check.
  • Packt Blog Posts. Interesting. I'll have to look into that. I think they can follow the RSS feed from here. If they can, check, this is done.
  • ➡️Other on-line PR. Cool. I give them the blogs I follow. Nice. I can do that.
  • ✅Social Media. I (sort-of) link these blog posts to my Twitter feed through https://dlvrit.com. But I'm half-hearted about it. Packt suggests in including @PacktAuthors so that there's a proper Twitter tie-in. That seems like there's no work to that. Also, there's a Packt Experts Author Community on Facebook.
  • ✅Blog. Got it. You're reading this. And, also, here: https://medium.com/@s_lott.
  • ➡️Author Newsletter. This seems like next-level blogging with more serious editorial planning. Interesting.
  • 🆒Packt Promotions. I like this: they do the marketing. I just have to cooperate by sharing the information.
  • 🆒Virtual Book Release Party. Wait. This could be cool. Functional Python Programming 2e is coming next summer. Hmmm. This could be fun. Banners and raffles for freebies or discounts.
  • 🚻Author Street Team. Mutual Support. Advanced Copies. I would need to keep it organized, right? That's (potentially) a chunk of work. I'll need to contemplate that
  • ✅Conferences. This is fun. For one of my PyCon trips, I got a promo code from Packt for free content during the conference. That was handy to give out. I went to Vistaprint and had a box of cards printed with contact info and the promotion code. 
  • ➡️Packt Live on-line conferences. I've done a few webinars. They're difficult. Writing is easier because it unfolds more slowly. I'll have to look into doing a few more of these.
  • ✅LinkedIn Profile. https://www.linkedin.com/in/steven-lott-029835/ 

I've hit a few. A few more to do.

This is a pretty comprehensive list. It's good to see this kind of author support.

Also. They've done some serious re-engineering on their Author toolchain. MS-Word documents seem to be a thing of the past.

Tuesday, December 5, 2017

Functional Requirements and Use Cases -- even for "simple" things

In the mailbag I found this nonsense, doomed to inevitable failure:
"As I get more serious about this data science stuff, it has become obvious that a windows machine is not the way to go. ...
Q1: What other things should I think about and consider while shopping for a new computer?
Q2: Are there issues w/ running VmWare and Windows 7 w/in VmWare on Ubuntu?
I've omitted many, many words (400 or so.)

Here are all of the functional requirements I could discern:
  • I would like to have 1 machine. I don't want a desktop and a laptop
  • Install VmWare 
  • Install Windows 7 using VmWare
This was all of the functional requirements. The other 400 words involved specifications. Nothing that approaches a use case other than singular, VMWare, and Windows under VMWare. The form factor of laptop, which seems to go without saying, might be a user story, but that's pushing it.

The "a windows machine is not the way to go" and "Install Windows 7" indicate a fairly serious problem. It is not the way to go and it's required. Both. This is doomed to inevitable failure. 

This is not the way to make a decision.

Q1. What other things should I think about? 

Just about every other thing. Start with use cases and functional requirements. Skip over specifications. (In general, never start with specifications because that's where you end: a list of useless numbers that don't bracket what you actually want to do.)

Use Cases Matter. Specifications Don't Matter.

Write down all the Mbs and Tbs you want. Without a use case, they're irrelevant noisy details. Throw the numbers away until you have a list of verbs. Things you will DO. 

With so few actual functional requirements, almost *any* computer (possibly including a Raspberry Pi 3) would pass the suite of acceptance test cases.

✅ One Machine.
✅ VMWare.
✅ Windows.

After a lot more back-and-forth, I discerned one (or maybe two) additional functional requirement(s).
  • I have leo w/ java to gen html.
I know what Leo is in this context. I'm guessing the "java to gen html" is JRst. The lack of clarity is, of course, part of the problem here.

This requirement surfaced in the context of explaining to me why Windows was so important. Really. Windows was required to run two open-source apps. And. "a windows machine is not the way to go." Doomed. To. Inevitable. Failure.

Here's the only relevant functional requirement(s): run Leo and Java. And even then, there's a huge hole in this. Leo is Python-based. Docutils RST2HTML is Python-based. Why not simply use Leo and Python?  What does Java have to do with anything?

Buy this: a Pi-top: https://www.sparkfun.com/products/13896

Q2. Are there issues w/ running...? 

Yes. Always. For everything you can possibly enumerate there are "issues". 

There. Are. Always. Issues.

Use Cases Matter.

Since you don't have any functional requirements or use cases, it's impossible to filter the issues and see if any of the known issues impact what you think you're going to do.

From what I was told, a Pi-top covers everything that's required. It's hard to be sure, of course, when the functional requirements are so vague. But there's no evidence that the Pi-top can't work to fill all of the stated functional requirements.

What To Do Next

It seems obvious, but the next step is to create a test plan. Actually, that was the first step. Since it wasn't done first, now it's the next step.

Write down the things you want to do. Make a list. Ideally a long list of things you will DO. Active voice. Verbs. Actions. Tasks. Activities. It's hard to emphasize this enough.

Then, when considering a computer, see if it can actually do those things. Test it against the requirements to see if it does what it's supposed to do. Among all the machines that pass the tests, you can then sort by price. (Or availability, or reputation, or cool stickers, whatever non-functional requirements seem relevant.)

The questions of Tb and Mb and processor clock speed mean nothing. Nothing. Find the cheapest (smallest) machine that does what you want. Don't find the machine with xMb and yTb of whatever.

There there's this, "As I get more serious about this data science stuff" which seems little more than context. But it's really important. Indeed, it's essential.

If you're going to do machine learning, you don't really want to buy the necessary computer. You want to rent it for the hour or so each day you actually need it. It will be idle 23/24 hours each day (96% of the time.) Why buy that much horsepower which you are never going to use.

If you're going to login to a server you purchased from a cloud computing vendor. Amazon AWS. Microsoft Azure, etc., then, you can probably get by with a tablet that runs SSH and a browser. A tablet with a cool keyboard and a little display rack can be very nice. https://panic.com/prompt/ and https://www.termius.com seem to be all that's required.

Without Use Cases, however, it's impossible to select a computer. Don't spend money without test cases.