Tuesday, February 6, 2018

Spec to Gherkin to Code: A Relay Play based on Swagger and OAS

Recent publications elsewhere: Spec to Gherkin to Code

Tuesday, January 30, 2018

The SQL-based relational database isn't perfection? Whoa if true

Yes, there are people for whom document databases (and the file system) are confusing and weird.

I was sent this: Relational Algebra Is the Root of SQL Problems which is really brilliant and provides some helpful concrete examples of stuff SQL is really bad at.

The accompanying email was filled with nonsense about how important and world-changing SQL was.

I can't disagree. Back when disk was very expensive and very small, the SQL-based join strategies where essential for micro-managing every bit of data. Literally. Every Bit.

And then we would denormalize the structure for performance reasons. Because we always knew the SQL was terrible at a fairly large number of things.

Those days are behind us. We can now chose to use a document database, and make our lives simpler. Storage is relatively inexpensive, and the labor to normalize and denormalize data doesn't create significant value. The need to write stored procedures to turn a single conceptual operation into a bunch of inserts and updates was a symptom that this wasn't the best approach.

I've had many "But what about..." conversations regarding document databases.

"What about ad-hoc queries in SQL?"

- Do you really do these without writing a Python script or creating a Pandas dataframe? I doubt it. But. If you really think you'll do this, most document stores either support a modified SQL or Javascript. And yes, you hate Javascript, duly noted. I hate SQL, so we're even there.

"What about joins?"

- It's a space-saving technique. We don't need the overheads to save the space. The "update anomalies" still require careful design, and may lead to some decomposition of data into multiple documents. But the ruthless normalization shouldn't be seen as a requirement.

"What about the schema?"

- It's brittle and schema migration creates a lot of low-value labor. We can use Python JSONSchema to validate documents. See NoSQL Database doesn't Mean No Schema.

Transactional v. Analytical

It requires some care to understand the distinction between "transactional" and "analytical" uses for data. While folks try to leverage this distinction, it's a spectrum not a distinction.

A lot of data collection is a simple sequence of event documents. These have no sensible state change, so they're not really transactional. They are often created by concurrent processes where locking prevents corruption, so transactions *seem* helpful. Except, of course, the file system writes can be trivially sharded by process ID and then unified later. And all document databases serialize document writes from multiple client processes, so there's no value to writing a relational database.

Some data operations are properly stateful. By normalizing our tables, moving from consistent state to consistent state is made complex. Which requires a defined transaction as a work-around. And don't get me started on replication and two-phase commit as yet another layer of complexity on top of transactions.

A document database allows us to skip over 1NF. We can think of a document as being a row in a table where the data types are complex data structures involving mappings, sequences, strings, numbers, booleans, and nulls. (See JSON Schema.) A lot of multi-step SQL transactions are operations on several children of a common parent. If the parent was persisted as a single document, there wouldn't be multiple operations, an atomic MongoDB update operation can make complex rewrites to a complex document.

We can contrive a design where state changes must be coordinated and the data cannot be colocated in a single document. It's not difficult to stipulate enough requirements to make single documents difficult. The presence of these contrived requirement, however, doesn't suddenly invalidate document datastores for transactional data. In the SQL world, the idea of long-running and reversible long-running transactions has always been a horrible problem. Allowing stacked "undo" for the user means either creating a chain of Memento objects that can recover previous state, or having numerous flags and indicators on each record, allowing the state to be reversed. Some design problems are really hard. And the SQL model seems to make them harder.

The core ACID concepts of always consistent is -- in practice -- nonsense. As soon as we have to consider "isolation levels" and "read consistency" it becomes clear that there is no consistent state unless all transactions and queries are serialized via exclusive "whole database" locking. Competent DBA's know that long-running analytic queries performed concurrently with transactional updates can't use locking, and must tolerate inconsistencies in the database.

It's common practice to do data extracts so that analytic queries aren't working against the (inconsistent) transactional data. In this case, the frequency of extracts is the timing of "eventual consistency" promised by the BASE concept.

Bottom Line: Relational ACID rules are almost always broken in practice by read consistency rules and extracts to analytic databases. Analytical data is always based on eventual consistency expectations. The batch extracts means "eventually" is measured in hours. A document data store can often create consistency in milliseconds. (MongoDB primary failure, voting, and secondary promotion to primary relies on a 10-second heartbeat, so it takes time to discover and repair.)

Also

A second email detailed their amazement (Amazing! Wow! Unbelievable! You Must Inform The World Of This!) that analytic processing of data is actually faster and simpler using the file system. The very idea of HDFS was so amazing that they were amazed.

Somehow, the idea of the raw filesystem as being really, really fast was the source of much amazement.

I'm glad they're making an effort to catch up. I'm glad they're seeing the relational model as a bad choice that has a limited number of use cases. Mostly, relational databases are useful for an organization can't write API's to handle the integrity issues.

To SQL or NoSQL? That's the database question | Ars Technica

Tuesday, January 23, 2018

PyCon 2018 Program Committee


I was "volunteered" by a colleague to help the program committee for PyCon 2018. I rarely think of myself as qualified for this kind of thing. Yes. I have six books on Python (with a seventh on the way) but the PSF folks are brilliant and dedicated and hard-working, and I'm just a slob.

Yes, I do get to help the community by reviewing almost 700 individual proposals. Some good. Some really good. Some which we *must* hear. 

The collateral benefit? 

Side reading.

My browser history is filled with things I hadn't known existed. 

Next time, I need to get started *before* the deadline so I can have a little more interaction with the authors. There were a few outlines where we could only discuss the possibility of making a change if the proposal was accepted. 

In particular, there seemed to be a *lot* of Machine Learning-Bayesian-Deep Learning-Recommender-Data Science pitches that had abbreviated outlines. They tend to all look alike to someone who's not an expert. Five bullet points: the author's background, the problem domain, ML (or modeling or whatever), a Jupyter notebook showing the results, and a conclusion.  Providing some distinct angle to the pitch (other than the problem domain) might help me understand them more fully. It seemed best to defer to the consensus on these.

I've been learning to live with my personal bias against meta-talks about building community. A presentation on community building at a community event seems redundant to me. But that doesn't mean they're not thorough, articulate talks that will be useful to others. Since I have a seat at the table, I'm biased. The Python tie-in feels weak, but our code of conduct (Open, Considerate, Respectful) means PyCon really is the place for more of this. Most importantly, they're objectively solid talks. (And -- as a member of the the over-represented old male nerd class, I do need to listen more.)

It's been enlightening. And the conference will rock.

Tuesday, January 16, 2018

I've decided on Windows -- Please help justify my choice

After many words, the email chain I received netted out to this:
  1. I can't teach myself data science on my crappy old Windows machine.
  2. I've decided to get a new Windows machine. Here are the specs.
My response was a mixture of incredulity and bafflement.

It appears that two things happened while I wasn't paying attention.
  1. Apple ceased to exist.
  2. The cloud ceased to exist.
I'm aware that many people think the Apple alternative is a non-starter. They are sure that after 40+ years selling computes, Apple is doomed, and we'll only have Windows on the desktop. Seriously.

Some people have farcical explanations for why Apple Cannot be Taken Seriously.

In this specific instance, there was a large investment in Python and Java that somehow couldn't be rebuild in Mac OS X. Details were explicitly not provided. Which is a way of saying there were no tangible "requirements" for this upgrade. Just specification numbers.

Import note. None of this involved "data" or "science." That was the baffling part. No objective measurement of anything. No list of software titles. No projects. No dataset sizes. Nothing.

The anti-cloud argument was even stranger than the anti-Mac argument.

Somehow, a super-large AWS server -- let's say it was a x1.16xlarge -- being used an hour a day (365 day*1 hr/day*$1.82/hr = $664) was deemed *more* expensive that a 64Gb 6-core home-based machine that would sit idle 23 hours each day.

The best part of $664/yr being *more* expensive?

Expert Judgement.

No "data". No "science". No measurement. No supporting details.

I wish I'd kept the email describing how someone who knew something said something about pricing.  It was marvelous Highest Paid Person's Opinion nonsense.

AFAIK, they were using 8,766 hours per year to compare AWS computing vs. at-home computing. This meant that an m5.4xlarge should be considered as costing $1,939 each year. Presumably because they'd never shut it off.

It included terms like "half-way decent performance."

There's a depth of wrongness to this that's hard to characterize beyond no "data" and no "science".

Tuesday, January 9, 2018

Code Rewrites and How They Create Value -- Stop Fighting Against It

TL;DR -- To remove doubts and questions, rewrite it.

Many, many people are confronted with the request to maintain someone else's code.

Either it's open source, and you have to make formal PR's visible to the world.

Or it's "enterprise" in-house software, and you have to make PR's visible to a work team.

Or.

It's "work group" in-house software, and you have source that may not be under proper source-code control installed on a server where you're taking over someone else's carefully built structure of porcelain components.

In the first case -- public code -- a rewrite is a challenge. People depend on it having a well-known interface. A small change here could alienate large swaths of the user community. On the other hand. A small change here could make the project *more* useful to *more* people. This is challenging. My advice here is limited.

When we look at enterprise software, however, a rewrite won't have quite the same "blast radius."

When we look at work-group software, there is no blast radius.

Benefits of a Rewrite

Why rewrite? Three reasons.

  1. You can understand it. (This is HUGE.)
  2. You can make it objectively better (i.e., higher PyLint scores, better documentation coverage.)
  3. You can add or expand the test cases. 

The first and foremost reason -- understanding -- can't be understated.

And.

It's a HUGE fight every time. The standard argument is "If It Ain't Broke, Don't Fix It."

This is, of course, based on misunderstanding the level of "broke." A delicately-balanced tower of porcelain components that worked once last month is -- in effect -- already pre-broken if any change will disturb the structure and ruin everything.

There are lots of examples.

  • The app only works with Pandas 0.12.0 and will not work with 0.13.x or the 0.22.0 you have in your default Conda environment.
  • The app only works when you provide --someoption=False, and no one can figure out why.
  • Some test cases have @skip because they don't work. But should.
  • The setup.py doesn't work and you can only run it using PYTHONPATH.
  • The default logging initialization is "somehow" not right and requires a manual override in the app.

I know. I've created all of these problems.

On one hand, we have management: "It ain't broke."

On the other hand, we have everyone else: "It's a fragile nightmare of pre-broken components that cannot be touched."

The Script

Here's how it plays out. In the Real World.

Management: "It ain't broke. Don't fix it."
You: "I can't make it work."
Management: "It ran last month."
You: "I made one small change and it doesn't run this month."
Management: "Back out the change."
You: "The results are then useless."

After much Grrr and Gnashing...

Management: "Identify the problems and we'll prioritize."
You: "Here are a dozen things."
Management: "These are too vague. Be more specific."
You: "Here are a score of things."
Management: "You're in the weeds. Bring it up a level to where business people can understand it."

After more Grrr and Gnashing...

Management: "What's the smallest change we can get away with?"
You: "The one I made that broke everything."
Management: "Let's have lots of peer review and design walkthroughs."
You: "Cool, then you'll see how broken it is."
Management: "Okay. Let's not. Instead, make the smallest change you can."
You: "I made one small change and it doesn't run this month."
Management: "Back out the change."
You: "The results are then useless."

Can we break the cycle of uselessness?

Depends.

We may be struggling with management folks who are set against fixing what's obviously broken. They're living a rich fantasy life that you can't really change.

However.

The Grrr and Gnashing part of the dialog represents time in which useful stuff can be done. Specifically. Rewrites.

Rewriting Strategy

There are three important parts of rewriting.
  • Understand what it's doing and why.
  • Describe it with test cases.
  • Make it objectively clear (i.e., high PyLint scores, complete documentation, etc.)
The effort often involves multiple passes. I like to describe it as Test-Driven Reverse Engineering (TDRE).

  1. Create (or expand) the test cases.
  2. Rewrite the code.
  3. Repeat until it's better.
It's essential to do these in order. Without test cases, rewrites are only more breakage. With test cases, rewrites are guaranteed to produce the same results as the previous mess of horrible code.

Sometimes the test cases are really a kind of "system test" where the whole application is run against some known inputs to produce some expected outputs. This is better than nothing. It supports building fine-grained unit test cases that conform to the system test case.

Other times, the test cases may be proper unit tests and the rewrites can be at a finer level of granularity. In this case, test coverage may have to be expanded to include the fragile bits. In some cases, the rewrites may be necessary to make the code testable in the first place.

Adding test cases is objectively valuable work.

Even the dumbest of "It ain't broke" managers can recognize this value. The rewrites are a beneficial consequence of adding test cases. You may be able to achieve a goal of fixing something without ever being seen as "fixing" it. All you did was improve test case coverage and improve the "design for testability."

Costs and Benefits

Consider the cost of struggling vs. the cost of rewriting.

It's the same 80 hours of effort.

In one case, you struggled with something management insisted wasn't broken. Eventually, you found ways to make it work.

In the other case, you rewrite something management insisted wasn't broken. In the end, you actually understood it and created objective improvements in the code.

Which is better?

Tuesday, December 19, 2017

The TLA Problem -- Over-Engineering Three-Letter Acronyms

Here's something we can gleefully over-engineer. Because anything worth doing is worth over-engineering until it morphs into a different kind of problem.

I'm unclear on the backstory, so try not to ask "why are we doing this?" I think it has something to do with code camp and there are teenagers involved.

What we want to do is generate a pool of 200 TLA's. Seems simple, right?

The first pass may not be obvious, but this works well.

import string
import random
from typing import Iterator

DOMAIN = string.ascii_lowercase
def tlagen() -> Iterator[str]:
    while True:
        yield "".join(random.choice(DOMAIN) for _ in range(3))

tla_iter = tlagen()
for i in range(200):
    print(next(tla_iter))

This makes 200 TLA's. It's configurable, if that's important. We could change tlagen() to take an argument in case we wanted four-letter acronyms or something else.

However. This will generate words like "cum" and "ass". We need a forbidden word list and want to filter out some words. Also, there's no uniqueness guarantee.

The Forbidden Word Filter

Here's a simple forbidden word filter. Configure FORBIDDEN with a set of words you'd like to exclude. Maybe exclude "sex" and "die", too. Depends on what age the teenagers are.

def acceptable(word: str) -> bool:
    return word not in FORBIDDEN

Here's another handy thing. Rather than repeat the "loop until" logic, we can encapsulate it into a function.

from typing import TypeVar, Iterator
T_ = TypeVar("T_")
def until(limit: int, source: Iterator[T_]) -> Iterator[T_]:
    source_iter = iter(source)
    for _ in range(limit):
        item = next(source_iter)
        yield item

Okay. That's workable. This lets us use the following:

list(until(200, filter(acceptable, tlagen())))

Read this from inside to outside. First, generate a sequence of TLA's. Apply a filter to that generator. Apply the "until" counter to that filtered sequence of TLA's. Create a list with 200 items. Nice.

The TypeVar gives us a flexible binding. The input is an iterator over things and the output will be an iterator over the same kinds of things. This formalizes a common understanding of how iterators work. The mypy tool can confirm that the code meets the claims in the type hint.

There are two sketchy parts about this. First the remote possibility of duplicates. Nothing precludes duplicates. And we're doing a bunch of string hash computations. To avoid duplicates, we need a growing cache of already-provided words. Or perhaps we need to build a set until it's the right size. Rather than compute hashes of strings, can we work with the numeric representation directly?

Numeric TLA's

There are only a few TLA's.
$$26^{3} = 17576 $$
Of these, perhaps four are forbidden. We can easily convert a number back to a word and work with a finite domain of integers. Here's a function to turn an integer into a TLA.

def intword(number: int) -> str:
    def digit_iter(number: int) -> Iterator[int]:
        for i in range(3):
            number, digit = divmod(number, 26)
            yield digit
    return "".join(DOMAIN[d] for d in digit_iter(number))

This will iterate over the three digits using a simple divide-and-remainder process to extract the digits from an integer. The digits are turned into letters and we can build the TLA from the number.

What are the numeric identities of the forbidden words?

def polynomial(base: int, coefficients: Sequence[int]) -> int:
    return sum(c*base**i for i, c in enumerate(coefficients))

def charnum(char: str) -> int:
    return DOMAIN.index(char)
    
def wordint(word: str) -> int:
    return polynomial(26, map(charnum, word))

We convert a word to a number by mapping individual characters to numbers, then computing a polynomial in base 26. And yes, the implementation of polynomial() is inefficient because it uses the ** operator instead of folding in a multiply-and-add operation among the terms of the polynomial.

Here's another way to handle the creation of TLA's.

FORBIDDEN_I = set(map(wordint, FORBIDDEN))
subset = list(set(range(0, 26**3)) - FORBIDDEN_I)
random.shuffle(subset)
return list(map(intword, subset[:200]))

This is cool. We create a set of numeric codes for all TLA's, then remove the few numbers from the set of TLA's. What's left is the entire domain of permissible TLA's. All of them. Shuffle and pick the first 200.

It guarantees no duplicates. This has a lot of advantages because it's simple code.

This, however, takes a surprisingly long time: almost 17 milliseconds on my laptop.

Numeric Filtering

Let's combine the numeric approach with the original ideal of generating as few items as we can get away with, but also checking for duplicates.

First, we need to generate the TLA numbers instead of strings. Here's a sequence of random numbers that is confined to the TLA domain.

def tlaigen() -> Iterator[int]:
    while True:
        yield random.randrange(26**3)

Now, we need to pass unique items, and reject duplicate items. This requires a cache that grows. We can use a simple set. Although, a bit-mask with 17,576 bits might be more useful.

def unique(source: Iterator[T_]) -> Iterator[T_]:
    cache = set()
    for item in source:
        if item in cache:
            continue
        cache.add(item)
        yield item

This uses an ever-growing cache to locate unique items. This will tend to slow slightly based on memory management for the set. My vague understanding is the implementation will double the size when hash collisions start occurring, leading a kind of log2 slowdown factor as the set grows.

The final generator looks like this:

list(until(200, unique(filter(lambda w: w not in FORBIDDEN_I, tlaigen()))))

Reading from inner to outer, we have a generator which will produce numbers in the TLA range. The few forbidden numbers are excluded. The cache is checked for uniqueness. Finally, the generator stops after yielding 200 items.

tl;dr

Of course, we're using timeit to determine the overall impact of all of this engineering. We're only doing 1,000 iterations, not the default of 1,000,000 iterations.

The original version: 0.94 seconds.

The improved number-based version: 0.38 seconds.

So there.  Want to generate values from a limited domain of strings? Encode things as numbers and work with the numeric representation. Much faster.


Saturday, December 16, 2017

Ordered Keys in Dictionaries

Raymond Hettinger (@raymondh)
#python news: 😀 @gvanrossum just pronounced that dicts are now guaranteed to retain insertion order. This is the end of a long journey.