Tuesday, December 19, 2017

The TLA Problem -- Over-Engineering Three-Letter Acronyms

Here's something we can gleefully over-engineer. Because anything worth doing is worth over-engineering until it morphs into a different kind of problem.

I'm unclear on the backstory, so try not to ask "why are we doing this?" I think it has something to do with code camp and there are teenagers involved.

What we want to do is generate a pool of 200 TLA's. Seems simple, right?

The first pass may not be obvious, but this works well.

import string
import random
from typing import Iterator

DOMAIN = string.ascii_lowercase
def tlagen() -> Iterator[str]:
    while True:
        yield "".join(random.choice(DOMAIN) for _ in range(3))

tla_iter = tlagen()
for i in range(200):
    print(next(tla_iter))

This makes 200 TLA's. It's configurable, if that's important. We could change tlagen() to take an argument in case we wanted four-letter acronyms or something else.

However. This will generate words like "cum" and "ass". We need a forbidden word list and want to filter out some words. Also, there's no uniqueness guarantee.

The Forbidden Word Filter

Here's a simple forbidden word filter. Configure FORBIDDEN with a set of words you'd like to exclude. Maybe exclude "sex" and "die", too. Depends on what age the teenagers are.

def acceptable(word: str) -> bool:
    return word not in FORBIDDEN

Here's another handy thing. Rather than repeat the "loop until" logic, we can encapsulate it into a function.

from typing import TypeVar, Iterator
T_ = TypeVar("T_")
def until(limit: int, source: Iterator[T_]) -> Iterator[T_]:
    source_iter = iter(source)
    for _ in range(limit):
        item = next(source_iter)
        yield item

Okay. That's workable. This lets us use the following:

list(until(200, filter(acceptable, tlagen())))

Read this from inside to outside. First, generate a sequence of TLA's. Apply a filter to that generator. Apply the "until" counter to that filtered sequence of TLA's. Create a list with 200 items. Nice.

The TypeVar gives us a flexible binding. The input is an iterator over things and the output will be an iterator over the same kinds of things. This formalizes a common understanding of how iterators work. The mypy tool can confirm that the code meets the claims in the type hint.

There are two sketchy parts about this. First the remote possibility of duplicates. Nothing precludes duplicates. And we're doing a bunch of string hash computations. To avoid duplicates, we need a growing cache of already-provided words. Or perhaps we need to build a set until it's the right size. Rather than compute hashes of strings, can we work with the numeric representation directly?

Numeric TLA's

There are only a few TLA's.
$$26^{3} = 17576 $$
Of these, perhaps four are forbidden. We can easily convert a number back to a word and work with a finite domain of integers. Here's a function to turn an integer into a TLA.

def intword(number: int) -> str:
    def digit_iter(number: int) -> Iterator[int]:
        for i in range(3):
            number, digit = divmod(number, 26)
            yield digit
    return "".join(DOMAIN[d] for d in digit_iter(number))

This will iterate over the three digits using a simple divide-and-remainder process to extract the digits from an integer. The digits are turned into letters and we can build the TLA from the number.

What are the numeric identities of the forbidden words?

def polynomial(base: int, coefficients: Sequence[int]) -> int:
    return sum(c*base**i for i, c in enumerate(coefficients))

def charnum(char: str) -> int:
    return DOMAIN.index(char)
    
def wordint(word: str) -> int:
    return polynomial(26, map(charnum, word))

We convert a word to a number by mapping individual characters to numbers, then computing a polynomial in base 26. And yes, the implementation of polynomial() is inefficient because it uses the ** operator instead of folding in a multiply-and-add operation among the terms of the polynomial.

Here's another way to handle the creation of TLA's.

FORBIDDEN_I = set(map(wordint, FORBIDDEN))
subset = list(set(range(0, 26**3)) - FORBIDDEN_I)
random.shuffle(subset)
return list(map(intword, subset[:200]))

This is cool. We create a set of numeric codes for all TLA's, then remove the few numbers from the set of TLA's. What's left is the entire domain of permissible TLA's. All of them. Shuffle and pick the first 200.

It guarantees no duplicates. This has a lot of advantages because it's simple code.

This, however, takes a surprisingly long time: almost 17 milliseconds on my laptop.

Numeric Filtering

Let's combine the numeric approach with the original ideal of generating as few items as we can get away with, but also checking for duplicates.

First, we need to generate the TLA numbers instead of strings. Here's a sequence of random numbers that is confined to the TLA domain.

def tlaigen() -> Iterator[int]:
    while True:
        yield random.randrange(26**3)

Now, we need to pass unique items, and reject duplicate items. This requires a cache that grows. We can use a simple set. Although, a bit-mask with 17,576 bits might be more useful.

def unique(source: Iterator[T_]) -> Iterator[T_]:
    cache = set()
    for item in source:
        if item in cache:
            continue
        cache.add(item)
        yield item

This uses an ever-growing cache to locate unique items. This will tend to slow slightly based on memory management for the set. My vague understanding is the implementation will double the size when hash collisions start occurring, leading a kind of log2 slowdown factor as the set grows.

The final generator looks like this:

list(until(200, unique(filter(lambda w: w not in FORBIDDEN_I, tlaigen()))))

Reading from inner to outer, we have a generator which will produce numbers in the TLA range. The few forbidden numbers are excluded. The cache is checked for uniqueness. Finally, the generator stops after yielding 200 items.

tl;dr

Of course, we're using timeit to determine the overall impact of all of this engineering. We're only doing 1,000 iterations, not the default of 1,000,000 iterations.

The original version: 0.94 seconds.

The improved number-based version: 0.38 seconds.

So there.  Want to generate values from a limited domain of strings? Encode things as numbers and work with the numeric representation. Much faster.


Saturday, December 16, 2017

Ordered Keys in Dictionaries

Raymond Hettinger (@raymondh)
#python news: 😀 @gvanrossum just pronounced that dicts are now guaranteed to retain insertion order. This is the end of a long journey.





Tuesday, December 12, 2017

The Business of Book Promotions

It is hard (for me) to promote my books. It seems like empty vanity. I realize that it's not -- promotion is essential -- but it's difficult.

Packt just send a raft of detailed information for authors. Some things they suggest I do.

  • ✅Referrals. Want a "free" Python book? The "free" is in scare quotes because you'll have to write a review. Otherwise, the book is yours. DM me @s_lott and I'll put you on the list.
  • ✅Amazon Author Profile: https://www.amazon.com/Steven-Lott/e/B00HNRSLEK Check.
  • Packt Blog Posts. Interesting. I'll have to look into that. I think they can follow the RSS feed from here. If they can, check, this is done.
  • ➡️Other on-line PR. Cool. I give them the blogs I follow. Nice. I can do that.
  • ✅Social Media. I (sort-of) link these blog posts to my Twitter feed through https://dlvrit.com. But I'm half-hearted about it. Packt suggests in including @PacktAuthors so that there's a proper Twitter tie-in. That seems like there's no work to that. Also, there's a Packt Experts Author Community on Facebook.
  • ✅Blog. Got it. You're reading this. And, also, here: https://medium.com/@s_lott.
  • ➡️Author Newsletter. This seems like next-level blogging with more serious editorial planning. Interesting.
  • 🆒Packt Promotions. I like this: they do the marketing. I just have to cooperate by sharing the information.
  • 🆒Virtual Book Release Party. Wait. This could be cool. Functional Python Programming 2e is coming next summer. Hmmm. This could be fun. Banners and raffles for freebies or discounts.
  • 🚻Author Street Team. Mutual Support. Advanced Copies. I would need to keep it organized, right? That's (potentially) a chunk of work. I'll need to contemplate that
  • ✅Conferences. This is fun. For one of my PyCon trips, I got a promo code from Packt for free content during the conference. That was handy to give out. I went to Vistaprint and had a box of cards printed with contact info and the promotion code. 
  • ➡️Packt Live on-line conferences. I've done a few webinars. They're difficult. Writing is easier because it unfolds more slowly. I'll have to look into doing a few more of these.
  • ✅LinkedIn Profile. https://www.linkedin.com/in/steven-lott-029835/ 

I've hit a few. A few more to do.

This is a pretty comprehensive list. It's good to see this kind of author support.

Also. They've done some serious re-engineering on their Author toolchain. MS-Word documents seem to be a thing of the past.

Tuesday, December 5, 2017

Functional Requirements and Use Cases -- even for "simple" things

In the mailbag I found this nonsense, doomed to inevitable failure:
"As I get more serious about this data science stuff, it has become obvious that a windows machine is not the way to go. ...
Q1: What other things should I think about and consider while shopping for a new computer?
Q2: Are there issues w/ running VmWare and Windows 7 w/in VmWare on Ubuntu?
"
I've omitted many, many words (400 or so.)

Here are all of the functional requirements I could discern:
  • I would like to have 1 machine. I don't want a desktop and a laptop
  • Install VmWare 
  • Install Windows 7 using VmWare
This was all of the functional requirements. The other 400 words involved specifications. Nothing that approaches a use case other than singular, VMWare, and Windows under VMWare. The form factor of laptop, which seems to go without saying, might be a user story, but that's pushing it.

The "a windows machine is not the way to go" and "Install Windows 7" indicate a fairly serious problem. It is not the way to go and it's required. Both. This is doomed to inevitable failure. 

This is not the way to make a decision.

Q1. What other things should I think about? 

Just about every other thing. Start with use cases and functional requirements. Skip over specifications. (In general, never start with specifications because that's where you end: a list of useless numbers that don't bracket what you actually want to do.)

Use Cases Matter. Specifications Don't Matter.

Write down all the Mbs and Tbs you want. Without a use case, they're irrelevant noisy details. Throw the numbers away until you have a list of verbs. Things you will DO. 

With so few actual functional requirements, almost *any* computer (possibly including a Raspberry Pi 3) would pass the suite of acceptance test cases.

✅ One Machine.
✅ VMWare.
✅ Windows.

After a lot more back-and-forth, I discerned one (or maybe two) additional functional requirement(s).
  • I have leo w/ java to gen html.
I know what Leo is in this context. I'm guessing the "java to gen html" is JRst. The lack of clarity is, of course, part of the problem here.

This requirement surfaced in the context of explaining to me why Windows was so important. Really. Windows was required to run two open-source apps. And. "a windows machine is not the way to go." Doomed. To. Inevitable. Failure.

Here's the only relevant functional requirement(s): run Leo and Java. And even then, there's a huge hole in this. Leo is Python-based. Docutils RST2HTML is Python-based. Why not simply use Leo and Python?  What does Java have to do with anything?

Buy this: a Pi-top: https://www.sparkfun.com/products/13896

Q2. Are there issues w/ running...? 

Yes. Always. For everything you can possibly enumerate there are "issues". 

There. Are. Always. Issues.

Use Cases Matter.

Since you don't have any functional requirements or use cases, it's impossible to filter the issues and see if any of the known issues impact what you think you're going to do.

From what I was told, a Pi-top covers everything that's required. It's hard to be sure, of course, when the functional requirements are so vague. But there's no evidence that the Pi-top can't work to fill all of the stated functional requirements.

What To Do Next

It seems obvious, but the next step is to create a test plan. Actually, that was the first step. Since it wasn't done first, now it's the next step.

Write down the things you want to do. Make a list. Ideally a long list of things you will DO. Active voice. Verbs. Actions. Tasks. Activities. It's hard to emphasize this enough.

Then, when considering a computer, see if it can actually do those things. Test it against the requirements to see if it does what it's supposed to do. Among all the machines that pass the tests, you can then sort by price. (Or availability, or reputation, or cool stickers, whatever non-functional requirements seem relevant.)

The questions of Tb and Mb and processor clock speed mean nothing. Nothing. Find the cheapest (smallest) machine that does what you want. Don't find the machine with xMb and yTb of whatever.

There there's this, "As I get more serious about this data science stuff" which seems little more than context. But it's really important. Indeed, it's essential.

If you're going to do machine learning, you don't really want to buy the necessary computer. You want to rent it for the hour or so each day you actually need it. It will be idle 23/24 hours each day (96% of the time.) Why buy that much horsepower which you are never going to use.

If you're going to login to a server you purchased from a cloud computing vendor. Amazon AWS. Microsoft Azure, etc., then, you can probably get by with a tablet that runs SSH and a browser. A tablet with a cool keyboard and a little display rack can be very nice. https://panic.com/prompt/ and https://www.termius.com seem to be all that's required.



Without Use Cases, however, it's impossible to select a computer. Don't spend money without test cases.