Tuesday, November 13, 2018

Using Python instead of bash

See Bashing the Bash — Replacing Shell Scripts with Python for some concrete examples of stuff you can do in Python or the shell.

And yes, it's a good, workable idea. 

1. It's unit testable.
2. It's easier to read.
3. It may be faster. Not that you'd notice unless you've really made a terrible mistake and written some gigantic application as a shell script.
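
As a small illustration of the first point, here's a sketch of what a `grep -c ERROR some.log` one-liner might turn into. The function names and the "ERROR" marker are made up for this example; they're assumptions, not something taken from the article.

```python
from pathlib import Path
from typing import Iterable, Iterator


def error_lines(lines: Iterable[str], marker: str = "ERROR") -> Iterator[str]:
    """Yield the lines that contain the marker, roughly like `grep ERROR`."""
    return (line.rstrip("\n") for line in lines if marker in line)


def count_errors(path: Path, marker: str = "ERROR") -> int:
    """Count matching lines in a file, roughly like `grep -c ERROR path`."""
    with path.open() as source:
        return sum(1 for _ in error_lines(source, marker))
```

The filtering lives in a function that accepts any iterable of strings, so a unit test can hand it a plain list and never touch the filesystem.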

While you're at it, check out the overall blog: https://medium.com/capital-one-tech. There's a lot going on.

Tuesday, November 6, 2018

PyData 2018 Washington, DC

See https://pydata.org/dc2018/

You do need to get your tickets ASAP. The schedule is fabulous.

Hotel rooms are still available, so don't waste any time getting connected.

Tuesday, October 30, 2018

The SourceForge vs. GitHub Conundrum


Or "When is it time to move?"

I've got https://sourceforge.net/projects/stingrayreader/ which has been on SourceForge since forever. 

Really since about 2014. Not that long. But. Maybe long enough?

The velocity of change is relatively slow.

However. 

(And this is a big however.) SourceForge seems kind of complicated when compared with GitHub. 

It's not a completely fair comparison. SourceForge has a *lot* of features. I don't use very many of those features. 

The troubling issues are these.

1. Documentation. SourceForge -- while it has a Git interface -- doesn't handle my documentation very well. Instead of a docs directory, I do a separate upload of the HTML. It's inelegant. SourceForge may handle this more smoothly nowadays. Or maybe I should switch to readthedocs? 

2. The Literate Programming Workflow. There's an extra step (or two) in LP workflows: the PyLit3 synchronization to create the working Python from the RST source. This is followed by the ubiquitous steps: creation of a release, creation of a distribution, and the upload to PyPI. I don't have an elegant handle on this because my velocity of change is so low. SourceForge imposed a "make your own ZIP file" mentality that could be replaced by a nicer "use PyPI" approach.

3. Clunky Design Issue. I've uncovered a clunky, stateful design problem in the StingrayReader. I really really really need to fix it. And while fixing it, why not move to GitHub?

4. Compatibility Testing. The StingrayReader seems to work with Python 3.5 and up. I don't have a formal Tox suite. I think it works with a number of versions of XLRD. And it *should* be amenable to other tools for Excel processing. Not sure. And (until I start using tox) can't tell. 

5. Type Hints. See #3. The stateful design problem can be finessed into a much more elegant use of NamedTuples. And then mypy can be used.

6. Unit Tests. Currently, the testing is all unittest.TestCase. I really want to convert to pytest and simplify all of it.

7. Lack of a proper workflow in the first place. See #2. It's more-or-less sitting in the master branch of a git repo that's part of SourceForge. That's kind of shabby. 

8. Version Numbering Vagueness. When I was building my own Zip archives from the code manually (because that's the way SourceForge worked), I wasn't super careful about semantic versioning, and I've been releasing patch-number versions for a while. Which is wrong. A few of those versions included new features. Minor, but features. 

But. One tiny new feature. So. It will be release 4.5.

See also https://sourceforge.net/p/stingrayreader/blog/2018/10/moving-to-github/ for status.

Tuesday, October 16, 2018

The Edge of the Envelope

I don't -- generally -- think of myself as an edge-of-the-envelope developer. I'm a tried-and-proven kind of engineer. I want stuff that's been around for years, with a long history of changes.

Except.

Today.

Currently, I'm revising Mastering Object-Oriented Python. Second Edition.

That means upgrading everything to Python 3.7 with full type hints throughout almost all of the 18 chapters. (SQLAlchemy presents some problems, so we're not going deep there.)

The chapter on foundational WSGI applications is *totally* broken. I can't get anything to work with mypy. (The unit tests run, but mypy complains. Loudly.) Of course, I tried every wrong thing for three solid days. Then I pulled the stub file from typeshed and realized how dumb I was.

Okay. I finally got the correct type hints. Yay!

But.

Something in mypy is balking at the start_response() function calls. Too many arguments.

Read the issues. Hm. Stack Overflow. Hm.

Just to be sure, I updated to the new 0.630 release, which came out in September, 2018.

Problem solved. So. I've arrived at the edge of the envelope. I now require the absolutely latest and greatest mypy release. By the time I'm done with the rewrites, this release will be ancient history. But today, it was wonderful to get past the examples.
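
For anyone fighting the same battle, here's a minimal sketch of a type-hinted WSGI application. The aliases `Environ` and `StartResponse` are deliberately loose assumptions for this example; the typeshed stubs describe these shapes more precisely, and this is not the code from the book. The `Callable[..., Any]` is there because start_response() takes an optional third exc_info argument.

```python
from typing import Any, Callable, Dict, Iterable

# Loose, hypothetical aliases; typeshed's definitions are stricter.
Environ = Dict[str, Any]
StartResponse = Callable[..., Any]


def hello_app(environ: Environ, start_response: StartResponse) -> Iterable[bytes]:
    """A tiny WSGI application that echoes the request path."""
    body = f"Hello from {environ.get('PATH_INFO', '/')}\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    # The status line and headers go out through the server's callable.
    start_response("200 OK", headers)
    return [body]
```

Because the application is just a function, a unit test can call it with a dictionary and a fake start_response() that records what it was given.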

Tuesday, September 18, 2018

Data Modeling Nightmare -- XML, HTML, and Markdown

Here's a particularly tangled and difficult problem. It arises because I have another blog. Specifically this: Team Red Cruising. And it's an unholy mess.

There are two important features of the Team Red Cruising blog.
  1. It's managed with off-line editor(s) so I can write posts from the boat and then upload them when I get connectivity. Welcome to being a technomad -- I don't always have a web-based blog editor available.
  2. It was actually created with two different off-line editors over a period of years: iWeb and Sandvox. iWeb is long dead. Sandvox hasn't seen many updates recently, and I think I'd like to move on to something newer and "better". 
(In this case, "better" means iOS-friendly, e.g., Blogo or BlogPad Pro. Also, Blogo's support site seems to be a right mess. Not a good look. They're working on it.)

The blog isn't the unholy mess. We'll get to the mess below. First, however, some background on the overall strategy. I want to move my content. What's involved? There are several things in play: the hosting, the target, and the source. So. Essentially. Everything.

Changing the Hosting Platform

Both of my legacy tools would export and upload the changes to my hosting service directly, avoiding the overheads of having any complex hosting software. The site was static and served simply from the filesystem via Apache httpd. Publishing was an SFTP transfer to the server. Nothing more. The "platform" was almost nothing.

(I could switch to using an Amazon S3 bucket and a DNS entry and it would work nicely.)

Both of these offline editing tools have a tiny bias toward working with hosting services like WordPress. Blogo claims it can also work with Medium and Blogger.

This means running WordPress on top of my default SFTP/Apache configuration. I use A2 Hosting, so this is really easy to do.

So. The hosting is more-or-less settled. I'll do very little. (Dealing with breaking links is a separate hand-wringing exercise.)

In order to move from iWeb and Sandvox to another tool, and start using WordPress, I have two strategies for converting the content.
  1. Ignore my legacy content. Leave it where it is, more-or-less uneditable. The tool(s) are gone, all that's left is the static HTML output from the tool. 
  2. Gather the legacy content and migrate it to WordPress and then pick an offline tool that works with WordPress. 
I've already done strategy #1, when I converted from iWeb to Sandvox. I left the old iWeb stuff out there, and moved to a new URL path with new content. While a clever menu structure can make it look like it's all one multi-year blog, the pages themselves are vastly different in the way they look. There's no comprehensive search. And, of course, I can't easily maintain the old iWeb stuff.

Having done #1, I'm now sure that's a bad idea.

An advantage of moving to WordPress is the ability to have all of the content in one, uniform database. WordPress has export functionality, so the next tool is a distinct possibility.

Note that Sandvox seems to have a distinct problem trying to import iWeb's published content. They have a cool HTML scraper, but iWeb relies on JavaScript, and the scraper doesn't do well.

Getting to WordPress

What we're looking at is a fairly complex data structure. While I'd like to look at this from a vast and reserved distance (i.e., in the abstract) I have a very concrete problem. So, we're forced to consider this from the WordPress POV.

We have a WordPress "Site" with a long series of posts and some pages.


The essence here is that the content can -- to an extent -- be converted to Markdown. The titles and dates are easy to preserve. The body? Not so much.

We can, as an alternative to Markdown, use some kind of skinny HTML that WordPress supports. I think WP can handle a structure free of class names, and using most of the available HTML tags.

Most of the blog content is relatively flat. The block structure is generally limited to images, block quotes, paragraphs, ordered and unordered lists. The inline tags in use seem to be a, img, strong, em, and a few span tags for font changes.

The complexity, then, is building a useful content model from the source. There are a few AST's for Markdown. commonmark.py might have a useful AST.  It's not complex, so it may be simpler to define my own.

It's hard to understand the inline blocks in mistletoe. The python-markdown project uses ElementTree objects to build the AST. I'm not a fan of this because I'm not parsing Markdown.

Starting From -- Well, it's Complicated

There are -- as noted above -- two sources:
  • Sandvox.
  • iWeb.
The Sandvox desktop "database" structure is opaque. The media is easy to find. The content is some kind of binary-encoded data with headers that tell me a little about the Xcode environment, but nothing else.

To read this, I have to scrape the HTML using Beautiful Soup. It involves processing like this:

    content = soup.html.body.find("div", id="main-content")
    article = content.find(class_="article-content").find(class_="RichTextElement").div

Find a nested <div> with a target ID. Inside that <div> is where the article can be found.

This seems to work out pretty well. Almost everything I want to preserve can be -- sort of -- mushed into Markdown.

The iWeb desktop "database" is XML. The published HTML depends on JavaScript and is hard to work with. The XML is -- of course -- densely wordy and convoluted as can be. But the words and markup are there.  I can use ElementTree to walk down through XML to locate the right tags.

There's a lot of code like this

    main_layer = child_root.find('ns0:site-page/ns0:drawables/ns0:main-layer', ns)

This example digs into site pages, and nested drawables, and main layers of content.  Eventually, we wind up looking at <p>, <span>, <attachment-ref>, and <link> tags in the XML to build the relevant content.
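
Here's a self-contained sketch of that kind of namespace-qualified search. The namespace URI, the root tag, and the sample document are all hypothetical stand-ins; the real iWeb XML uses its own namespace and is far wordier.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real iWeb namespace will differ.
ns = {"ns0": "http://example.com/hypothetical-iweb"}

SAMPLE = """
<ns0:page-root xmlns:ns0="http://example.com/hypothetical-iweb">
  <ns0:site-page>
    <ns0:drawables>
      <ns0:main-layer>
        <ns0:p>First paragraph.</ns0:p>
        <ns0:p>Second paragraph.</ns0:p>
      </ns0:main-layer>
    </ns0:drawables>
  </ns0:site-page>
</ns0:page-root>
"""

child_root = ET.fromstring(SAMPLE.strip())

# The prefix in the path is resolved through the ns mapping.
main_layer = child_root.find("ns0:site-page/ns0:drawables/ns0:main-layer", ns)

# find() returns None when the path doesn't match, so guard against that.
paragraphs = (
    [] if main_layer is None
    else [p.text for p in main_layer.findall("ns0:p", ns)]
)
```

The prefix used in the search path only has to match the key in the ns mapping; it doesn't have to be the prefix the document itself uses.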

The nuance is styles. They're not part of the inline markup. They're stored separately, and included by reference. Each of the four tags that seem to be in use has a style attribute that references styles defined within the posting. Once these references are resolved, I think Markdown can be generated.
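
Resolving a style reference could look something like the following sketch. The `Style` class, the style names, and the idea that only bold and italic matter are all assumptions made for illustration; the real style definitions carry much more.

```python
from typing import Dict, NamedTuple


class Style(NamedTuple):
    """A drastically simplified, hypothetical style definition."""
    bold: bool = False
    italic: bool = False


def to_markdown(text: str, style_ref: str, styles: Dict[str, Style]) -> str:
    """Look up the referenced style and wrap the text in Markdown markers."""
    # An unknown reference falls back to plain, unstyled text.
    style = styles.get(style_ref, Style())
    if style.bold:
        text = f"**{text}**"
    if style.italic:
        text = f"*{text}*"
    return text


# Hypothetical styles, as if collected from one posting.
styles = {
    "style-1": Style(bold=True),
    "style-2": Style(italic=True),
    "style-3": Style(bold=True, italic=True),
}
```

With this, `to_markdown("words", "style-1", styles)` produces `**words**`, and a reference to a style that isn't defined leaves the text untouched.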

The Unholy Mess

The hateful part of this is the disconnect between HTML (and XML) and Markdown. The source data permits indefinite nesting of tags. Semantically meaningless <p><p>words</p></p> are legal. The "flattening" from HTML/XML to Markdown is worrisome: what if I trash an entry by missing something important?

Ideally, it's this:



Pragmatically, HTML/XML can be more complex. This diagram assumes we won't have paragraphs inside list items. HTML permits it. It's redundant in Markdown.

Worse, of course, are the inline tags. HTML has a kabillion of them. The software I've been using seems to limit me to <img>, <strong>, <em>, and <a>. HTML/XML allows nesting. Markdown doesn't.

Ideally, I can reframe the inline tags to create a flat sequence of styled-text objects within any of the tags.
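
One possible shape for that flattening is sketched below. It assumes well-formed markup that ElementTree can parse, and it treats only strong, em, and a as inline styles; the `StyledText` class and the `INLINE` set are names invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import FrozenSet, Iterator, NamedTuple

# Assumption: these are the only inline tags that carry styling.
INLINE = {"strong", "em", "a"}


class StyledText(NamedTuple):
    text: str
    styles: FrozenSet[str]


def flatten(
    element: ET.Element, active: FrozenSet[str] = frozenset()
) -> Iterator[StyledText]:
    """Walk nested inline tags, yielding a flat sequence of styled runs."""
    styles = active | {element.tag} if element.tag in INLINE else active
    if element.text:
        yield StyledText(element.text, styles)
    for child in element:
        yield from flatten(child, styles)
        # Text after a child's closing tag belongs to this element.
        if child.tail:
            yield StyledText(child.tail, styles)


para = ET.fromstring("<p>plain <strong>bold <em>both</em> bold</strong> end</p>")
runs = list(flatten(para))
```

Each run carries the full set of styles in effect, so the nesting is gone and a Markdown writer only has to look at one run at a time.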

Right now. Headaches.

Working on the code. It's not a general solution to anyone else's problem. But. I'm hoping -- as I beat the problem into submission -- to find a way to make some useful tutorial materials on mapping between complex, and different, data structures.

Tuesday, September 11, 2018

Code Review

I can't actually share all the code. So this feels incomplete. But I can share what I said about the code. Then you can look at your code and decide if you've got similar problems to fix.

My responses were these. I'll expand on them below.
  1. This appears to be a single cell in a Jupyter notebook? Why isn’t it a script?
  2. The code doesn’t look like any effort was made to follow any conventions. Use black. Or pylint. Make the code look conventional. 
  3. There don’t appear to be any docstring comments. That’s really a very bad practice. 
  4. The design appears untestable. That’s a very bad practice. 
  5. If this is an example of “production” code, I would suggest it needs a lot of rework.
Let's review these in a little more detail.

Number 1 was based on the file name being something_p36.ipynb.txt. The Jupyter notebookiness of the name is a problem. The _p36 is extra creepy, and indicates either a severe problem understanding how bash "shebang" comments work, or a blatant refusal to simply use Python3. It's hard to say what's going on, and I didn't even try to ask because... well... too many other things weren't clear.

Don't make up complex, weird naming rules. Use something.py. Simple. Flat. Pythonic.

Number 2 was based on things like this: def PrintParameters(pca): I hate to get super-pure PEP-8, but this kind of thing is simply hard to read. There were a LOT of other troubling aspects to the code. Once this is corrected, some of the other problems will go away, and we could move forward to more substantial issues.

Follow existing code styles. Find Python code. The standard library has a LOT of examples already part of your installation. Read it. Enjoy it. Mimic it.

Use pylint. Always.

Number 3 and Number 4 are consequences of the bulk of the code being a flat script with few class or function definitions. Actually, there was one of each. One class. One function. 240 or so lines of code. There was no separate __name__ == "__main__" section, so I was generally unhappy with the overall design.

Also. There's code like this

if True:

Yes.  That's a real line of code. Sigh.

Here's an ancillary problem. If you need to write something like this, you're doing it wrong.

##########################
 -- init Stuff
##########################

The code that follows one of these "big billboard comment" sections *must* be part of a function or class. It can't be left floating around with a billboard for demarcation. It should be refactored into a function (or method of a class), documented, and tested.

Did I mention tested?

It's untestable as written. Sigh.
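
For contrast, here's a sketch of the overall shape I'd rather see. The function and the sample data are hypothetical stand-ins (I can't share the real code), but the structure is the point: a documented function, a main(), and the __name__ == "__main__" switch.

```python
"""Summarize parameter values. A hypothetical stand-in for the reviewed script."""
from statistics import mean
from typing import Dict, List


def summarize_parameters(values: List[float]) -> Dict[str, float]:
    """Return the minimum, maximum, and mean of the given values."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def main() -> None:
    """Print a summary of some sample data."""
    summary = summarize_parameters([1.0, 2.0, 3.0])
    for name, value in summary.items():
        print(f"{name}: {value}")


if __name__ == "__main__":
    main()
```

The billboard comment has become a docstring, and a test can import the module and call summarize_parameters() without running the whole script.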

Number 5 may be a misunderstanding on my part. The email had this: "They have produced production code that mathematically optimizes stuff for [redacted]. So, they are heads up type of people."

I'm guessing this is relevant because the team has some "production" code in Python and consider themselves knowledgeable. Otherwise, this is noise, and I should have ignored it.

I'm hopeful they'll use black, make the code minimally readable, and we can move on to substantial issues regarding design for testability and overall possible correctness issues.

It wasn't the worst code I've seen. But. It shows a lot of room for growth and improvement.

Tuesday, September 4, 2018

Handy Flask Configuration -- Bookmark the original article

Pycoders Weekly (@pycoders)
Configure Python 3, Flask and Gunicorn on Ubuntu 18.04 LTS – bit.ly/2vRZYQR

We worked through this about a year ago, without the help of this post. Having the article would have saved us some time and effort. You should bookmark it.

We liked this tech stack because it was simple and effective.

The team I'm on now is using NGINX and uWSGI as well as Python3 and Flask. It's also effective and it's also pretty simple. It has a few more moving parts, but works reliably.

Tuesday, August 28, 2018

Cool success story of Cython

Real Python (@realpython)
A multi-core Python HTTP server (much) faster than Go (spoiler: Cython)

nexedi.com/NXD-Blog.Multi…

https://www.nexedi.com/NXD-Blog.Multicore.Python.HTTP.Server

This is handy. It makes perfect sense that Python -- with a little help -- can be compiled down to super-fast code. Hopefully, the Cython world will continue to evolve toward using native Python type hints.

When Cython uses fully-native type hints, it becomes a super-convenient and transparent performance booster.

Without fully-native type hints, it becomes a place where bugs are injected as part of trying to improve performance.

Tuesday, August 21, 2018

Python Dependency Management

Freezing Python’s Dependency Hell in 2018

Excellent advice.

Except for the "Don't use Anaconda." Yes. It's a big download. Odds are good you'll need most of it. So. Just do it now.

The (miniconda + environment.yml) as an entry point is really good. The "rely on people to actually know and consistently use their best practices" doesn't seem like a problem, it seems like a consequence of an evolving software ecosystem.

Tuesday, August 14, 2018

Why is Python so slow?

This is brilliant. 

Why is Python so slow? by Anthony Shaw

It covers three aspects of the implementation in a respectable level of detail. Helpful information. Bookmark it to help stop pointless bickering with people who don't understand the value of getting something to run right now vs. getting something that will eventually run and be fast.

Tuesday, July 24, 2018

Mastering Object-Oriented Python -- 2nd Edition

It's time to revise Mastering Object-Oriented Python. While the previous edition is solidly focused on Python3, it lacks some important features:
  • F-Strings
  • Type Hints
  • typing.NamedTuple
  • Data Classes
So. There's some stuff to add. I don't think there's too much to take away. I plan to make some things a little more tidy. I will remove all references to Python2 and all references to how things used to be and why they're better now.

It will be several months before this is available. Stand by for updates.

The earliest drafts of this book date back to 2002. Seriously. I've been over this material a lot in the past 1.5 decades.

The nascent form of this book took me years (maybe 10 years) to accumulate. It covered everything: data structures, statements, built-in functions, classes, and a bunch of libraries. It was beyond merely ambitious and off into some void of "cover all the things." 

I was motivated by my undergrad CS text books on the foundations of computer science. The idea of putting the language features into a parallel structure with boolean algebra, set theory, and number theory was too cool for words. And -- lacking the necessary formal background -- it was something I wasn't able to present very well.

While I wanted to cover all of Computer Science, acquisition editors pointed out how crazy that idea was. A focus on the object-oriented features of Python was sufficient to sell a distinctive book. And they were absolutely right.

As I rework the outline for the 2nd edition, there are some other topics that crop up. These are not going to wind up in the book, but they're an implicit feature of the topics being covered.

CS Foundations and Python

One of the best of the introductory books (which came out after I graduated) was Structured Concurrent Programming With Operating Systems Applications. They presented a nested collection of sub-languages: SP/k. The organization of the nested subsets can be helpful for exposing programming incrementally. There are issues, and we'll look at them in detail below. Here's the collection of subsets from the original book (and related articles.)

  • SP/1 expressions and output. The print() function.
  • SP/2 variables, assignment, and the input() function.
  • SP/3 selection and repetition. The Python if and while constructs are the logical minimum, but the for statement makes more sense because it's so widely used.
  • SP/4 character strings. 
  • SP/5 arrays. Python lists, really.
  • SP/6 procedures. Python function definition.
  • SP/7 formatted input-output. f-strings for output, and regular expressions for parsing.
  • SP/8 records and files.
There are a lot of gaps between this list of subsets and modern programming languages. SP/k was explicitly based on a subset of PL/I, saving the complexity of implementing special compilers. It also reflects the mid-70's state of the art.

What didn't age well is the implicit understanding that numbers are the only built-in data types. Strings are so magical they're isolated into two separate subsets: SP/4 and SP/7. Arrays are called out, but sets and dictionaries didn't exist in PL/I and aren't part of this nested sequence.

Also. And even more fundamental.

There's a bias toward "procedural" programming. The SP/k subsets expose the statements of the language. There are few data structures, and it seems the data structures require some statements before they're useful.

This leads to my restructuring of this. It doesn't apply to the Mastering OO Python book. It's something I use for Python bootcamp training.

  • py/1 expressions and output: int, float, numeric built-in functions, and the print() function.
  • py/2 variables, assignment, and the input() function.
  • py/3 strings, formatting, and various built-in string parsing methods.
  • py/4 tuples and multiple assignment. (Since tuples are immutable, they're more like strings than they are like lists.) And yes, this is kind of short.
  • py/5 if statements and try/except statements. These are the two fundamental "selection" statements. The raise statement is deferred until the functions section.
  • py/6 sets and the for statement.
  • py/7 lists.
  • py/8 dictionaries.
  • py/9 functions (avoiding higher-order functions, decorators, and generator functions.)
  • py/10 contexts, with, and file I/O.
  • py/11 classes and objects.
  • py/12 modules and packages.
The point here is to expose the data structures as the central theme of Python. Statements follow as needed to work with the data structures. 

Note that some topics -- like break, continue, and while -- are advanced parts of working with data structures.

The standard library? Not included. Perhaps should be. But. It's technically separate from the language and all of this can be done without any imports. We would then cover a bunch of standard library modules. The order includes math, random, re, collections, typing, and pathlib.

Tuesday, July 17, 2018

Patient Crawling and Possible Phishing

Once every few months I get an email like this. What is it? Phishing?

I've finally looked into it, and learned two important lessons.

Here's the body of the email.
Hello there,
Your page http://www.itmaybeahack.com/homepage/iblog/C364310209/E20080407095503.html has some good references to cyber security so I wanted to get in touch with you. I've recently written an article The 6 Types Of Cyber Attacks To Protect Against In 2018 and was wondering if you thought my article could be a good addition to your page.
You can read my article right here: https://pagely.com/blog/cyber-attacks-in-2018/
I would like to hear your opinion on this article. Also, if you find it useful, please consider linking to it from your page I mentioned earlier. If you prefer you may republish the article. Let me know what you think.
Thank you very much,
Really?

The page they cited has three (3) external links. One is to actual cyber security content. Another now gets redirected to generic advertising, and the third (like the original blog post) is a decade old.

What does this mean?

Clearly, it means some bot found my page. One of the links was to something they're trying to SEO boost. (How do I know it's SEO? I don't. The email address is similar to an SEO boosting company, so it seems like that's what's going on here.)

I've been haphazard about responding to these because I'm a fundamentally charitable person.

Or I'm a total pushover to certain kinds of social engineering. You choose.

You see the appeal to my vanity in the email? They read my ancient content! Swoon!

The email looks personal. There's a name. Spelled consistently. With no digits in it. Someone read my content and reached out to me! I'm in love! Ah! Sweet Mystery of Life at last, I've found you!

The email makes me think -- somehow -- it's not a bot and there's a person involved. A person trying to make a buck selling content and advertising. I should help them, right? Amplify their signal and all?

What a chump I am! I should simply ignore these.

In the past, I have responded with a "Nope. That content is too old to do anything with. I should delete it but I'm too lazy." Once a bot found a link on live content, and I dutifully updated it. I now know any response is a mistake.

I checked out the page.ly site. It's a nice summary of cyber attacks. It seems to be a not-too-dangerous link to not-bad content. Except for the Unicode errors throughout the document. Like someone copied and pasted the original bytes -- intended for CP-1252 -- to a site explicitly using UTF-8.
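
This kind of damage is easy to reproduce. The sketch below uses made-up sample text (not anything from the article in question) and shows both directions of the mismatch, since I can only guess which one actually happened there.

```python
# Sample text with curly quotes, written with escapes to keep the source ASCII.
original = "it\u2019s \u201cfine\u201d"

# CP-1252 bytes read as UTF-8: each curly quote is a single invalid byte,
# which decodes to U+FFFD, the replacement character.
misread = original.encode("cp1252").decode("utf-8", errors="replace")

# The reverse mistake: UTF-8 bytes read as CP-1252. The three bytes of the
# apostrophe (E2 80 99) turn into three separate junk characters.
mojibake = "it\u2019s".encode("utf-8").decode("cp1252")

print(repr(misread))
print(repr(mojibake))
```

Either way, the fix is to know the encoding of the bytes before decoding them; guessing after the fact is what produces pages like that one.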

That's not all.

The name on the email, and the author of the article don't match.  The email says "my article" but the article has a different author.

Red Flag.

After (finally) spending five minutes on this, I learned two things.

  • First: this is nonsense. It's some kind of phishing attack. Or some kind of SEO-boosting bot that doesn't check dates very well.
  • Second: I'm an easy mark when people appeal to my vanity. I need to stop responding, no matter how effusive the (inferred) praise I think I'm hearing.

Tuesday, July 10, 2018

10 common security gotchas in Python and how to avoid them

First, read this: 10 common security gotchas in Python and how to avoid them by Anthony Shaw

Of these, most are important, but not specific to Python at all. Only items 3, 4, 7, and 8 are pretty specific to Python. They talk about the assert statement, some timing vulnerabilities, and the bad idea of transmitting pickle files.
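
A rough sketch of the usual defenses for those Python-specific items follows. The function names and the message format are hypothetical; the point is the pattern, not a vetted security library.

```python
import hmac
import json
from typing import Any, Dict


def tokens_match(supplied: str, expected: str) -> bool:
    """Compare secrets in constant time instead of with ==."""
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))


def require_admin(user: Dict[str, Any]) -> None:
    """Use an explicit check; assert statements vanish under `python -O`."""
    if not user.get("is_admin", False):
        raise PermissionError("admin access required")


def load_message(payload: bytes) -> Dict[str, Any]:
    """Exchange JSON rather than pickle; unpickling can run arbitrary code."""
    return json.loads(payload.decode("utf-8"))
```

None of this is exotic. It's the kind of thing a periodic code review should be looking for.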

Item 5 is also specific to Python, but I quibble about its relevance. It is at the very edge of "security." The PYTHONPATH environment variable is most definitely not "...one of the biggest security holes in Python." If the path is a security hole, then any code is a security hole. If we view code as a security hole, then the only truly secure system has no software.

(As someone who lived on a sailboat, I happen to subscribe to the position that the only truly secure system has no software. Use line, shackles, and well-known knots if you want to stake your life on it. Use fancy electronics with software to make it simple and fun.)

Bad programming is the biggest security hole. Failure to prevent SQL injection. Failure to use CSRF tokens. Failure to properly handle credentials. These are security holes of epic proportions.

The PYTHONPATH cannot be changed through any kind of request handling. Even colossally dumb software that blindly uploads XML or JPEG files without vetting them won't change the PYTHONPATH.  You'd have to write code that changed sys.path. Or you'd have to write code that reset the os.environ and then started applications in the new environment. This is seriously bad code, and has nothing to do with Python.

Otherwise, the only way to change PYTHONPATH requires an Evil Super Genius who has your compromised credentials. Once your credentials are compromised anything is possible, including setting the PATH environment variable, or deleting all the accounts, or rm -rf /. None of which is specific to Python.

Item 9 -- patching the system Python -- may be important. All OS's should have patches applied early and often. However. We strongly discourage our developers from using the system Python for anything. We always build environments. We always install our own Python 3 with our own packages. We generally ignore the system Python to the extent possible.

Item 7, though, is a huge deal. We use OAS (formerly known as swagger.) The old swagger.json end-point was -- clearly -- json. The new OAS 3, however, suggests the specifications be provided at  openapi.yaml. This week we're rolling out a cluster of microservices using our shiny new OAS 3 specifications. And we're using default yaml.load() instead of yaml.safe_load() as part of the contract hand-shake among the services. All internally-facing handshakes, but still unsafe with respect to a man-in-the-middle hacking our specifications.

While I can quibble about two of the ten items, the other eight are rock solid, and should be part of periodic in-house code reviews.

And number 7 is killer. 

Friday, June 22, 2018

Type Hinting Edge Case

Warning. I'm new to this. Yes, my book Functional Python Programming -- 2nd ed -- is full of type hints. But my examples are all (intentionally) relatively simple. There are edge cases that I do not pretend to understand.

Here's a fun one. Start here

This is a cool question.

Here's an essential clarification on what this structure is.


This is tricky and I think there are two reasons why it's hard.
1. We want to specify some details internal to instances of the np.array class.
2. We want to provide a size constraint, something that I don't think typing can do.

The size constraint may be handled by using Tuple, but it doesn't really fit in a general way. This three-tuple is Tuple[float, float, float]. You can see how that rapidly gets hideous for higher-dimension objects. You'd want Tuple[float*3], right?

The internal constraint, similarly, is challenging. However. An np.array() -- for the most part -- is a Sequence with extra features.

I have a suggestion.

1. A stubs/numpy.py file with this. I think this characterizes the array structure.

from typing import TypeVar, Sequence

_Base = TypeVar("_Base")

def array(*args: Sequence[_Base]) -> Sequence[_Base]: ...


2. Here's the target function.

import numpy as np
from typing import Sequence

Vector3 = Sequence[float]

def vec3(x: float, y: float, z: float) -> Vector3:
    return np.array((x, y, z))


This seems to capture part of the type definition. It doesn't capture the 3-ness of the vector.
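
If the 3-ness matters more than the numpy features, one alternative is to give up the array and use a plain tuple, which mypy can check for length. This is a sketch of that trade-off, not a solution to the original question; the `dot()` function is only here to show the type in use.

```python
from typing import Tuple

# A fixed-size alias: exactly three floats.
Vector3 = Tuple[float, float, float]


def vec3(x: float, y: float, z: float) -> Vector3:
    """Build a three-element vector as a plain tuple."""
    return (x, y, z)


def dot(a: Vector3, b: Vector3) -> float:
    """Dot product of two three-element vectors."""
    return sum(p * q for p, q in zip(a, b))
```

With this definition, mypy should reject something like `vec3(1.0, 2.0)` or a function that returns a two-tuple where a Vector3 is promised. The cost is losing everything np.array brings along.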

Tuesday, June 12, 2018

Coping with a Spreadsheet Database

A common way to save persistent, important data is a spreadsheet. It provides a handy, potentially normalized store that's readily accessible with minimal tooling. It has a UI usable by people with a spectrum of skills.

Sadly.

There's a core conflict:
  • The advantages of spreadsheets-as-database are numerous. 
  • The disadvantage is the lack of any strict, formal control over the schema.
At the very best, the steward of the data has some discipline and they include column headers and assure they're used throughout the rows of data.

It goes downhill rapidly from that ideal.

Let's look at some scenarios. And. How to cope. And. Python to the Rescue.

Outliers, Special Cases, Anomalies, and other Irregularities

The whole point of a "normalized" view of the data is to identify a pattern, assign the lofty title of "Schema" to the pattern, and assure all of the data fits the schema. In rare cases, all of the data fits a simple schema. These cases are so rare they only exist in examples of SQL code in tutorials.

A far more common case is to have several subtypes which are so similar that optional attributes (or "nullable columns" in SQL parlance) allow one schema description to encompass all of the cases. If you're a JSON Schema person, this is the "OneOf" or "AnyOf" type definition.

Some folks will try to argue that optional attributes don't always mean that there are several subtypes. They'll ramble on for a while and eventually land on "state change" as a reason for optional attributes. The distinct states are distinct subtypes. Read up on the State design pattern for OO programming. Optional attributes is the definition of subtype.

The hoped-for simple case is a superclass extended by subclasses used to add new attributes. In this case, they're all polymorphic with respect to the superclass. In a spreadsheet page, the column names reflect the union of all of the various attributes. There are two minor variants in the way people use this:

  • An attribute value is a discriminator among the subtypes. We like this in SQL processing because it's fast. It also allows for some validation of the discriminator value and the pattern of attributes present vs. attributes omitted. Of course, the pattern of empty cells may disagree with the discriminator value provided.
  • The pattern of attributes provided versus omitted is used to identify the subtype. This is a more reliable way to detect subtypes. There can, of course, be problems here with values provided accidentally, or omitted accidentally.
The less desirable case is disjoint classes with few common attributes. Worse, the common attributes are not part of the problem domain, but are things that feel databasey, like made-up surrogate keys. There's an "ID" in column A or some other such implementation detail. Some of the rows use column A and columns B to G. The other rows use column A and columns H to L. The only common attributes are the surrogate keys, perhaps mixed with foreign key references to rows in other spreadsheet tables or pages.

This is a collection of disjoint types, slapped together for no good reason. SQL folks like to call it "multiple master-detail relationships". The master record has children of multiple types. In some cases, the only thing the children have in common is the foreign key relationship with the parent. If you want a concrete example, think of customer contact information: multiple email addresses, multiple phone numbers. The two contacts have nothing in common except belonging to one customer. 

These don't belong in a single spreadsheet table. But. There they are. Our code must disentangle the subtypes.
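Here's a minimal sketch of the second variant, where the pattern of filled-in columns identifies the subtype. The column names, the sample rows, and the classify() helper are all hypothetical, invented for illustration; a real sheet will need its own signatures.

```python
from typing import Dict, FrozenSet, Optional

Row = Dict[str, Optional[str]]

# Hypothetical signatures: the set of non-empty columns that defines each subtype.
SIGNATURES: Dict[FrozenSet[str], str] = {
    frozenset({"ID", "Email"}): "EmailContact",
    frozenset({"ID", "Phone", "Extension"}): "PhoneContact",
}

def classify(row: Row) -> str:
    """Return the subtype name implied by the non-empty cells, or 'Unknown'."""
    present = frozenset(
        name for name, value in row.items() if value not in (None, "")
    )
    return SIGNATURES.get(present, "Unknown")

rows = [
    {"ID": "1", "Email": "a@example.com", "Phone": "", "Extension": ""},
    {"ID": "2", "Email": "", "Phone": "555-1212", "Extension": "42"},
    {"ID": "3", "Email": "b@example.com", "Phone": "555-1213", "Extension": ""},
]
kinds = [classify(row) for row in rows]
```

The third row mixes attributes from both subtypes, so it lands in "Unknown" and can be routed to a person for a decision.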

Arrays

A lot of spreadsheet data is a two-dimensional grid. Budgets, for example, might have categories down the page and months across the page. 

This is handy for visualization. But. It's not the right way to process the data at all. 

This extends, of course, to higher orders. Each tab of a spreadsheet may be a dimension of visualization. There may be groups of tabs with a complex naming convention to include multiple dimensions into tab names. Rows may have multiple-part names, or use bullets and indentation to show a hierarchy.

All of these techniques are ways to provide a number of dimensions around a fact that's crammed into a cell. The budget amount is the fact. The category and the month information are the two dimensions of that cell. In many cases, Star-Schema techniques are helpful for understanding the underlying data, separate from the visualization as a spreadsheet.

Our code must disentangle the dimensions of the meaningful facts. 
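A small sketch of pulling facts out of a grid, assuming the simplest layout: months across the first row, categories down the first column. The BudgetFact class and the unpivot() function are hypothetical names for this example.

```python
from typing import Iterator, List, NamedTuple

class BudgetFact(NamedTuple):
    category: str
    month: str
    amount: float

def unpivot(grid: List[List[str]]) -> Iterator[BudgetFact]:
    """Yield one fact per cell, with both dimensions made explicit."""
    header, *body = grid
    months = header[1:]  # skip the corner cell above the category column
    for category, *amounts in body:
        for month, amount in zip(months, amounts):
            yield BudgetFact(category, month, float(amount))

grid = [
    ["Category", "Jan", "Feb"],
    ["Rent", "1200", "1200"],
    ["Food", "310.50", "295"],
]
facts = list(unpivot(grid))
```

Each cell becomes a row with explicit category and month attributes, which is the star-schema view of the same budget.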

Normalization

There are tiers of normalization. The normalization described above is part of First Normal Form (1NF): all rows are the same and all data items are atomic. Pragmatically, it's rare that all spreadsheet rows are the same, because it's common to bundle multiple subtypes into a single table.
Sidebar Rant. Yes, the presence of nullable columns in a SQL table *is* a normalization error. There, I said it. Error. We can always partition the rows of table into a number of separate tables; in each of those tables, all columns are required. We can rebuild the original table (with optional fields) via a union of the various decompositions (none of which have optional fields). The SQL folks prefer nullable columns and 1NF violations over unions and 1NF absolutism. I'm a fan of 1NF absolutism to understand each and every nullable attribute because casual abuse of nulls is a common design error.
The other part of 1NF is each value is atomic: there's no internal structure to the value. In manually-prepared spreadsheet data, this is difficult to insist on.  Stuff gets combined into a single cell because -- well -- it seemed helpful to the people entering it. They put all the lines of an address into a single cell because they like to see it that way.

Third Normal Form (3NF) forbids derived data (and transitive dependencies). In a spreadsheet, we might have a row-level computation. It helps the person confirm the data is correct. It's not "essential". It breaks the 3NF rule because the computed attribute depends on other field values; a change to one attribute will also change the derived attribute.

When we first encounter spreadsheet data, this isn't always obvious. In some cases, the derived data is computed "off-line" -- i.e., manually -- and entered into the spreadsheet. Really. People pull up a calculator app (or whip out their phone), compute a value, and type it in. In other cases, they look something up manually and enter it.

These kinds of data entry weirdnesses require code to normalize the manually-prepared data. We'll have to decompose non-atomic fields. And we'll have to handle derived data gracefully. (Reject it? Fix it? Warn them about it? Handle it as an exception?)
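As a sketch of both chores, here's one way to decompose a multi-line address cell and to check a typed-in total against the values it was derived from. The column names (quantity, price, total) and both function names are assumptions for illustration. This version only warns; rejecting or fixing is a policy choice.

```python
from decimal import Decimal
from typing import Dict, List

def split_address(cell: str) -> List[str]:
    """Decompose a non-atomic address cell into its non-blank lines."""
    return [line.strip() for line in cell.splitlines() if line.strip()]

def check_total(row: Dict[str, str]) -> List[str]:
    """Return warnings when the entered total disagrees with quantity * price."""
    expected = Decimal(row["quantity"]) * Decimal(row["price"])
    entered = Decimal(row["total"])
    if entered != expected:
        return [f"total {entered} != {expected}"]
    return []

good = check_total({"quantity": "3", "price": "2.50", "total": "7.50"})
bad = check_total({"quantity": "3", "price": "2.50", "total": "7.05"})
lines = split_address("123 Main St\n  Suite 4\n\nMcLean, VA 22102")
```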

Relationships

Let's talk about Second Normal Form (2NF). We really want to have a row in a table represent a single thing. The SQL folks require all of the attributes to be dependent on the row's key. In spreadsheet world, we may have a jumble of attributes with a jumble of dependencies. We may have multiple relationships in a single row.  Look at the Second Normal Form page on Wikipedia for examples of multiple relationships mashed together into a single row.

When a spreadsheet has 2NF problems, there will be situations where some collection of attributes is repeated -- verbatim -- in multiple places. The most common example in US-based data is City-State-ZIP Code. These three *always* form a consistent triple of data, and should be repeated as part of an address. In SQL terms, City and State have a functional dependency on the ZIP Code. In an Object-Oriented database, we might have a separate City-State-Zip class definition. In a document datastore, we might combine these items into a sub-document.

In any 2NF problem area, we're forced to write code which normalizes this internal relationship.

And. When we do that we'll find the kinds of problems we find with derived data: The ZIP code 22102 might be McLean or Tysons Corner. One of them is "right" and the other is "wrong". Or perhaps there needs to be an exception to handle this. Or perhaps a correction applied to coerce the wrong values to be right.
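A sketch of finding those conflicts, assuming each row is a dictionary with zip, city, and state keys. The key names and the zip_conflicts() function are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def zip_conflicts(rows: Iterable[Dict[str, str]]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each ZIP code to its (city, state) pairs; keep only ZIPs with more than one pair."""
    seen: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for row in rows:
        seen[row["zip"]].add((row["city"], row["state"]))
    return {code: pairs for code, pairs in seen.items() if len(pairs) > 1}

rows = [
    {"zip": "22102", "city": "McLean", "state": "VA"},
    {"zip": "22102", "city": "Tysons Corner", "state": "VA"},
    {"zip": "20001", "city": "Washington", "state": "DC"},
]
conflicts = zip_conflicts(rows)
```

The result is a to-do list of ZIP codes that need a decision: pick a winner, add an exception, or apply a correction.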

The "Association" Table

There's a SQL design pattern called an association table. This is used to handle a many-to-many relationship between two entities. Consider Boats and Owners. A boat will have multiple owners. An owner may have multiple boats. In SQL world, this requires a special table with two foreign keys. In the degenerate case, there are no other attributes. In the boat-owner relationship case, however, there's often a range of dates that specifies when an owner was associated with a boat. The range of dates applies to the relationship itself, not to boat nor to owner.

In a spreadsheet there are numerous ways to represent this. Numerous. A list of boat rows after each owner.  A list of owner rows after each boat. A number of owner columns for each boat.  A block of text with a list of owner names in a single cell. Creative people will create many creative solutions to this data representation problem.

Note that the association table is a SQL hack. It's an implementation detail, not an essential feature of the problem domain. In Python, for example, we'll need to use weakref objects to handle this cleanly. 

When Owner O1 refers to Vessel V1 it's easy to have a list of vessel references under the owner. When the Owner O1 object is no longer needed, it can be removed from memory. This decrements the reference count for Vessel V1 to zero, and it will be removed from memory, too. 

When we have mutual references, we have a problem, solved by weakrefs.

If Owner O1 refers to Vessel V1 and we also have Vessel V1 referring to Owner O1, we have mutual references. O1 has a list that includes V1.  V1 also has a list that includes O1. This means there are two strong references to O1: some variable, owner, and Vessel V1 also refers to O1. When the variable owner is no longer needed, then the reference count to O1 is decremented from two to one. And the object can't be deleted yet. 

If V1 has a weak reference to O1, then the strong reference count -- based on the variable owner -- is only one. The weak reference from V1 doesn't count for memory management purposes. O1 can be removed from memory, references to V1 will be decremented, and it, too, can be removed.

Our code will have to parse and populate the relationships. And we'll need to use weakref to be sure we can cleanly remove objects.
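Here's a minimal sketch of the mutual-reference case with a weak back-reference. The Owner and Vessel classes are stripped-down stand-ins for illustration, not the real model. The immediate cleanup shown relies on CPython's reference counting.

```python
import gc
import weakref
from typing import List

class Vessel:
    def __init__(self, name: str) -> None:
        self.name = name
        self.owners: List[weakref.ref] = []  # weak back-references to owners

class Owner:
    def __init__(self, name: str) -> None:
        self.name = name
        self.vessels: List[Vessel] = []  # strong references to vessels

    def add_vessel(self, vessel: Vessel) -> None:
        self.vessels.append(vessel)
        vessel.owners.append(weakref.ref(self))

owner = Owner("O1")
owner.add_vessel(Vessel("V1"))
back_reference = owner.vessels[0].owners[0]()  # calling a weakref yields the object
same_object = back_reference is owner
del back_reference  # drop the extra strong reference we just made

vessel_probe = weakref.ref(owner.vessels[0])  # observe V1 without keeping it alive

del owner      # the last strong reference to O1 goes away
gc.collect()   # not required under CPython's reference counting; harmless
vessel_gone = vessel_probe() is None
```

Because V1 only holds a weak reference to O1, deleting the owner variable lets both objects go.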

Coping Strategies

As noted above, we have to cope with manually-prepared spreadsheet data. It looks like this:
  1. Figure out what the likely data structure is. This isn't simple. We'll look at Pythonic techniques below. When starting, it helps to draw UML class diagrams (or ER diagrams) over and over again to try and depict the data. I'm a fan of using https://yuml.me to draw the pictures because they have a super-handy text notation for the relationships and attributes.
  2. Leverage the Extract-Transform-Load design pattern.

    • The "extract" reads the source spreadsheet data. A first version will be trivial use of xlrd or csv module. Or any of the modules listed here: http://www.python-excel.org
    • The "transform" should be implemented as a function to transform source to the target model. Pragmatically, this single function will leverage a number of other functions to validate, cleanse, convert, and normalize the data.
    • The "load" may not be anything more than creating instances of the underlying model classes. In some cases, the instances of the model classes may wind up in an in-memory dictionary. In other cases, the "load" might be a simple use of pickle or shelve to persist the useful data.

  3. Separate Model, ETL, and "Real Work" from each other. The model should evolve very slowly. It's the essential problem we're solving. The ETL may vary with each major revision to the spreadsheet database. Users add columns, they change meanings, their understanding evolves. The final work is based on the model -- and only the model -- ignoring the vagaries of ETL.
  4. Plan for change. Each manually-prepared spreadsheet is a unique snowflake, precious and distinct. This leads to an important lesson based on the Open/Closed Principle: Code Must Be Closed To Modification and Open To Extension. Each version of the source data means adding new functions or classes to cope with each bizarre new spreadsheet issue. When the source data changes, don't modify any old code; Always Be Adding. This means planning for multiple versions of functions: validate_1(), validate_2(), validate_3().  It's essential to be able to process *all* old versions of the data and get meaningful, useful results for regression testing.
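The "Always Be Adding" idea can be sketched like this. The validation rules, the "Owner" column, and the VALIDATORS registry are hypothetical; the point is that version 2 is added next to version 1, and version 1 is never edited.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]

def validate_1(row: Row) -> bool:
    """Version 1 sheets only required a hull number."""
    return bool(row.get("Hull No."))

def validate_2(row: Row) -> bool:
    """Version 2 sheets added a required owner; the version 1 rule is reused, not edited."""
    return validate_1(row) and bool(row.get("Owner"))

# Each spreadsheet version maps to the function that understands it.
VALIDATORS: Dict[int, Callable[[Row], bool]] = {1: validate_1, 2: validate_2}

def extract_valid(rows: List[Row], version: int) -> List[Row]:
    validate = VALIDATORS[version]
    return [row for row in rows if validate(row)]

rows = [{"Hull No.": "17", "Owner": ""}, {"Hull No.": "18", "Owner": "Pat"}]
old_style = extract_valid(rows, version=1)
new_style = extract_valid(rows, version=2)
```

Old files still run through validate_1(), so they remain usable for regression testing.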

Python To The Rescue

Data modeling must be done slowly and reluctantly. Don't overfit the model to the first spreadsheet.

Here's the place to start

from types import SimpleNamespace

class Model(SimpleNamespace):
    pass

This is *enough* modeling to get started. Don't over-engineer the model. We can then do things like this.

class Owner(Model):
    pass

This defines the class Owner as a subclass of some abstract Model class. The SimpleNamespace allows us to have any attributes we think we need.

owner = Owner(vessel=some_id, name=row['name'])

We can leverage the SimpleNamespace to build useful objects with minimal code. This can be replaced with a typing.NamedTuple or a @dataclass class definition when the definition is more mature.
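When the definition does mature, the replacement might look something like this sketch. The field names echo the examples in this post, but the specific types and defaults are guesses for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Owner:
    name: str
    vessel: Optional[str] = None
    # A mutable default needs a factory so instances don't share one list.
    owner_vessel: List[str] = field(default_factory=list)

owner = Owner(name="Pat", vessel="V1")
```

The construction syntax is the same keyword-argument style the SimpleNamespace version used, so client code barely changes.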

The "extract" code needs to gather row-like objects. Ideally, this is a generator function. Because normalization and dereferencing may require multiple passes through the data, a list can be slightly easier to deal with. We'll come back to normalization and dereferencing below.

For some background in the classes used here, see https://sourceforge.net/projects/stingrayreader/. (Yes, this is old; I'm thinking of moving it to GitHub and updating it to Python 3.7.)

def load_live_rows(workbook, sheet_name):
    sheet1 = sheet.EmbeddedSchemaSheet(workbook, sheet_name, schema.loader.HeadingRowSchemaLoader)
    dict_rows = sheet1.schema.rows_as_dict_iter(sheet1)
    clean_data = filter(lambda row:not row['Hull No.'].is_empty(), dict_rows)
    initial_data = take_until(lambda row:row['Hull No.'].to_str() == 'Definitely WB Owners:', clean_data)
    return list(initial_data)

Step-by-step.
  1. We're working with a sheet that has the schema embedded in it. That means using the heading rows as column information. The HeadingRowSchemaLoader will be grabbing the first few rows from the EmbeddedSchemaSheet. Sometimes we need more complex loaders to read multiple rows. If the schema is separate from the sheet, then the loader doesn't interact with the source of data. 
  2. Each row is modeled as a simple dictionary in this example code.
  3. A filter locates rows that have hull numbers. Other rows are quietly discarded.
  4. The take_until() function reads rows until the matching row is found, then stops. This chops off the bottom of the spreadsheet where manual notes were kept.
The resulting list of rows can be validated, cleansed, and normalized to create the useful instances of the various Model subclasses.
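The take_until() function isn't shown above. Here's one plausible definition, offered as a sketch: it assumes the matching row itself should be dropped along with everything after it.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def take_until(stop: Callable[[T], bool], source: Iterable[T]) -> Iterator[T]:
    """Yield items from source until stop(item) is true; the matching item is not yielded."""
    for item in source:
        if stop(item):
            return
        yield item

kept = list(take_until(lambda text: text == "Notes:", ["H1", "H2", "Notes:", "junk"]))
```

It's a generator, so it composes with filter() in the same lazy style as the rest of the extract.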

Here's the "transform" portion.

def make_owner_1(row: Dict[str, Cell]) -> Owner:
    return Owner(
        last_name=null_strip(row["Owner's Last Name"].to_str()),
        first_name=null_strip(row["Owner's First Name"].to_str()),
        display_name=null_strip(row["Display Name"].to_str()),
        website=null_strip(row["Website"].to_str()),
        owner_vessel=[],
    )

We've built an instance of the Owner subclass of Model by extracting a number of attributes from the row. There are other columns not extracted; they are part of various normalizations and dereferencing.

The owner_vessel attribute is a parent-child relationship that can't be trivially populated from the row. The SQL folks would include a foreign key in each child that refers to the parent. The vessel page of the spreadsheet has this information, and it's used to populate the owner's details. This is one of the dereferencing activities that needs to be done as part of "loading".

The to_str() method is a feature of the Stingray Reader's cell definitions. Conversion methods like this are not typical of idiomatic Python code. If we were only creating built-in str, float, or int, the bunch of conversion methods would be A Bad Idea. To be useful, we also need to create Decimal objects, and that leads us to embracing a grid of conversion methods for each cell source to desired resulting objects. We could use Decimal(str(cell)), but it seems cleaner to use cell.to_decimal().

Multiple Passes

We often touch the source more than once.
  1. There's a "validate and load" pass to get rows that are sensible to process. A generator might make sense here. 
  2. There may be a "cleanse and convert" pass to reformat the source data, perhaps parsing complex cells into components or combining multiple source rows into a single entity description. This, too, might involve a generator to restructure the spreadsheet rows into something sensible.
  3. There will be multiple "normalization" passes. Any 2NF relationships need to be extracted to create model objects. Any restructuring of complex dimensions should be handled via restructuring source data from grid to rows, or from multiple sheets to a single, long, sequence of rows with the various dimensions as explicit attributes of each row.
  4. There may be multiple "load" passes to build final objects from the source rows. This will often lead to including the built objects as part of the source data.
  5. There will be some final "dereferencing" passes where foreign key relationships are turned into proper references among the objects. These should be weakref references to permit proper garbage collection.
At this point, the application will have tidy collections of Python objects that can be used for the real work.
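The final dereferencing pass might look like the following sketch. The owner_id foreign key, the hull attribute, and the dereference() function are hypothetical; the classes are bare stand-ins for the model.

```python
import weakref
from typing import Dict, List, Optional

class Owner:
    def __init__(self, owner_id: str, name: str) -> None:
        self.owner_id, self.name = owner_id, name
        self.owner_vessel: List["Vessel"] = []  # filled in by dereferencing

class Vessel:
    def __init__(self, hull: str, owner_id: str) -> None:
        self.hull, self.owner_id = hull, owner_id
        self.owner: Optional[weakref.ref] = None  # filled in by dereferencing

def dereference(owners: List[Owner], vessels: List[Vessel]) -> None:
    """Turn each vessel's foreign key into object references in both directions."""
    by_id: Dict[str, Owner] = {owner.owner_id: owner for owner in owners}
    for vessel in vessels:
        owner = by_id[vessel.owner_id]
        owner.owner_vessel.append(vessel)  # strong: owner to vessel
        vessel.owner = weakref.ref(owner)  # weak: vessel back to owner

owners = [Owner("O1", "Pat")]
vessels = [Vessel("H1", "O1"), Vessel("H2", "O1")]
dereference(owners, vessels)
```

A missing owner_id raises KeyError here, which is one way to surface a broken foreign key during loading.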

What's essential is finding a balance between end-user visualization of the data in a spreadsheet and schema validation in Python. It's often helpful to be flexible when trying to automate processing of complex, irregular, manually gathered data.

Letting candidate users work with spreadsheets lowers the barrier to automation.

Coping with irregularity gets the process started.

As the work matures, some schema controls will tend to evolve. People tend to recognize the cost and complexity of irregular data. They will try to identify the patterns and impose some order on those patterns. As they uncover patterns in the data, the "schema" will evolve. This is a good thing, and Python lets this proceed at a human pace.

We can -- easily -- create flexible tools that let people understand and organize their data.

Tuesday, May 15, 2018

PyCon 2018 Highlights

And yes, this is truncated because I left early, and missed some important things. I'm going to have to catch up on YouTube: https://www.youtube.com/channel/UCsX05-2sVSH7Nx3zuk3NYuQ/videos




Of course, you'll also need to see the keynotes.


And there's a HUGE number of talks I didn't get to. 

Tuesday, May 1, 2018

Misunderstanding OO Programming

Read this. Goodbye, Object Oriented Programming

I like this because parts of it are wrong, and parts are based on peculiarities of specific languages which aren’t problems in other languages.

The “wrong” things are on a spectrum. At one end are things almost right. The other end is hoped-for things which — frankly — were never true.

The most important piece of nonsense is class-level reuse across projects. Class-level reuse in a new project was not a thing in OO programming. The monkey-banana-jungle “problem” only exists in a strange world where someone made up the idea of single classes being reused in isolation. The rest of us knew the scope of reuse was within a project or a narrow family of projects aimed at a single problem domain.

"Utility" classes that could be reused and generic data structures were always available as frameworks and libraries. Things built to solve a specific problem were going to be tailored to the problem. Most OO designers knew this and knew that making something generic would be hard. Making something reusable and installable by others was even harder. (Especially in compiled languages where you wanted to hide intellectual property by keeping the source secret.)

The "OO promised me reuse and lied" is a misstatement. Please rephrase this as "I imagined there could be class-level reuse and discovered it was hard."

Multiple inheritance does work in a number of languages, so I’ll skip the complaints centered on single inheritance.

I don't fully understand the complaint about encapsulation. There are lots of books on separating interface from implementation to more fully isolate implementation details. If references need to be treated more opaquely, there are lots of techniques for this. It’s not broken. Indeed, it’s really well understood. ("But I don't want to introduce wrapper classes to insulate the references." Sigh. That's how it's done.)

I think the "references leak details about encapsulation" requires rephrasing as "I imagined some kind of perfectly isolated programming where references were not usable in spite of me making them usable." Or perhaps "I wish references had special treatment to make them not work as references except in a limited context which I get to imagine."

The polymorphism complaint appears to be “okay, this actually works.”  I guess. Or. “There are other ways to do this in other languages.” I'm sure it's an important point, but I can't quite discern what OO principle is allegedly broken here.

tl;dr

No one was lied to. If someone was "burned" by some OO hype, I’d like to see the actual quote of the actual hype. The “I was told there would be X” requires some substantiation.

And. Stop griping about encapsulation. When the source is available (as it is in many languages) there's no enforcement other than public shaming.

Also. Use Python. Most of the original post seems to be complaints about C++ weirdness.

Tuesday, April 24, 2018

Functional Python Programming 2e -- Type Hints!

You might want to look into this: Functional Python Programming - Second Edition.

Let's talk about the type hints, shall we?

Most of the examples have had type hints added. This means running everything through mypy. And it also means running everything through doctest, as well.

More important than the technical steps, there's a change in viewpoint that comes with type hints.

If you follow a variety of Pythonistas on Twitter, you can see some debates on the merits of type-hinting. Some key points:

  • It's hard.
  • It's so hard, only do it if you absolutely need it.
  • It's too verbose
  • It's hard, but it can help.
  • It's really helpful.
  • It represents a "gap" in the language and without run-time type checking, the whole thing is worthless.
The last point is a weird view. I work in a shop that's heavily Pythonic. But. You still hear nonsense. Python is a very popular language and its popularity is growing. The popularity of Python isn't like the popularity of a movie where you're not planning on making a living off of it. (I know someone who makes their living off the popularity of movies.) The popularity of Python is like the popularity of automobiles or air travel or electricity.

I hear the "a real language would have prevented that with type-checking." And I respond, "Then why do you unit test?" And they don't really have much of an answer. Python has the same workflow as statically type-checked languages, so the "prevention" thing seems to be nonsense.

Moving on.

"It's hard." Anything new is hard. The complaint is vague, so it's *hard* to respond. (Heh.)

Anything like "only do it if you absolutely need it" bothers me because it seems like a passive-aggressive barricade around things. Also. It's vague.

Verbosity

Verbosity in type hints is a real problem. When creating complex objects from built-in types, we often forget to give names to the intermediate object classes.

Consider Dict[Tuple[Tuple[int, int], Tuple[int, int]], float]

It's long. It describes a structure like this {((12, 13), (14, 15)): 2.8284271247461903, ...} 

Writing something like the following d_map() function without hints is easy. Adding hints seems hard.

def d_map(points):
    return {(p1, p2): hypot(p1[0]-p2[0], p1[1]-p2[1]) for p1, p2 in points}

The declaration became L.. O... N... G... because we ignored the intermediate types.

def d_map(points: List[Tuple[Tuple[int, int], Tuple[int, int]]]) -> Dict[Tuple[Tuple[int, int], Tuple[int, int]], float]:
    return {(p1, p2): hypot(p1[0]-p2[0], p1[1]-p2[1]) for p1, p2 in points}

These hints, however, don't really describe what's happening. The hints elide important details. The hints don't reflect the underlying semantics of the data structure.

One of Python's strengths is the rich collection of first-class data structures with built-in syntax. We can abbreviate some complex concepts into succinct, expressive code.

However.

We shouldn't lose sight of what the succinct code represents. And in this case, it represents some rather complex concepts.
<rant>
Let me sit in my lawn chair and shake my fist in helpless fury at you kids. When I was your age, we spent half a semester of undergraduate work trying to get linked lists, and simple hash mapping to work. Months of work. Later on, as a professional -- years of actual experience -- it took forever to build a binary tree-based collections.Counter definition to gather simple numbers from a flat file. Nowadays, you just slap a Counter down into your code like it's a nothing. It's not a nothing. It's serious, sophisticated software engineering. It's more than Dict[Any, int]. </rant>

What can we do?

When in doubt, Expose the Intermediate Types.

from math import hypot
from typing import Dict, Iterable, Tuple

Point = Tuple[int, int]
Leg = Tuple[Point, Point]
Distances = Dict[Leg, float]

def d_map(points: Iterable[Leg]) -> Distances:
    return {(p1, p2): hypot(p1[0]-p2[0], p1[1]-p2[1]) for p1, p2 in points}

This exposes the details. In some cases, it causes us to rethink using a two-tuple to represent a point. The p1[0] syntax starts to chafe a little. Perhaps this should have been

class Point(NamedTuple):
    x: int
    y: int

That leads to tiny (almost-but-not-quite trivial) simplifications. Instead of building simple tuples for each point, we can now build named Point tuples and use p1.x and p1.y to make the code more civilized.
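Put together, the more civilized version might look like this sketch. It reuses the Leg and Distances names from above; the sample points come from the structure shown earlier.

```python
from math import hypot
from typing import Dict, Iterable, NamedTuple, Tuple

class Point(NamedTuple):
    x: int
    y: int

Leg = Tuple[Point, Point]
Distances = Dict[Leg, float]

def d_map(points: Iterable[Leg]) -> Distances:
    """Map each leg to the distance between its two end points."""
    return {(p1, p2): hypot(p1.x - p2.x, p1.y - p2.y) for p1, p2 in points}

leg = (Point(12, 13), Point(14, 15))
distances = d_map([leg])
```

A Point is still a tuple, so existing keys like ((12, 13), (14, 15)) continue to match.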

One consequence of this is actually avoiding (), [], and {} to build tuples and lists. Yes. This is heresy. I seriously recommend using tuple(), list(), dict(), and set() because we can replace them with equivalent types. And yes, I text my mother with the same fingers that wrote that.

"But," you object, "It's objectively LONGER! You didn't save me anything! You're a fraud!"

My first response is, "Correct." It is objectively longer. And "Correct," I didn't really "save" you anything; I'm not sure what you're saving. Lines of code do have a cost, but I think clarity has value. And finally, "Correct," I've often been wrong, and I may be wrong here, too.

I like this because the type definitions are reusable. I think this can add clarity throughout the application.

When this kind of declaration is part of a reusable module, the goodness spreads like smiles and hugs throughout the application. Before long, other functions have been tweaked and everyone is sending each other little teddy-bear hug gifts with rainbow cupcakes.

(Please don't exchange mylar balloons. They're evil. Also, see this.)

tl;dr

When your type hints seem ungainly and large, consider Exposing the Intermediate Types. Break down a big structural type hint into the constituent pieces.

If you had to create a class definition for EVERY variation on list, dict, set, and tuple, what would your new class be named?

If you had to describe the underlying meaning of a class -- separate from its structure -- what name would you give it?

Picking names is one of the two hardest problems in computing. It isn't easy. (The other hardest problem? Cache invalidation and off-by-one errors.)

Friday, April 6, 2018

Should I use x.__len__() or len(x)?

In the context of providing type hints, someone had a function like this.

def f(x: Sized) -> Whatever: ...

And, since sized objects have a __len__() method it seemed sensible to use x.__len__(). It was a good question about the use of special methods.

My advice is to avoid using the special methods in general. Use them only when defining classes that need to behave like Python objects.

(I'll make an exception for using x.__dict__, to avoid having to introduce an explicit dictionary object when there's one built-in to most objects.)

Use len(x) and be happy.  The function wrapper around a special method is a common Python feature; it occurs in many places; use it.
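A tiny sketch of the division of labor: the class defines __len__(), and everyone else calls len(). The Fleet class and describe() function are invented for this example.

```python
from typing import List, Sized

class Fleet:
    def __init__(self, *hulls: str) -> None:
        self.hulls: List[str] = list(hulls)

    def __len__(self) -> int:
        # Defined here so Fleet behaves like a built-in sized collection.
        return len(self.hulls)

def describe(x: Sized) -> str:
    # Client code uses the function wrapper, not x.__len__().
    return f"{len(x)} item(s)"

fleet_text = describe(Fleet("H1", "H2"))
list_text = describe([1, 2, 3])
```

The same describe() works for Fleet and for built-in lists, which is the point of the Sized hint.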

Tuesday, April 3, 2018

RESTful Web Services Design

This -- REST is the new SOAP -- has so many demolished strawman arguments that it feels like looking at a van Gogh painting of people harvesting wheat.

I won't dive into listing all the strawmen. Most of my responses are approximately "How is that an actual problem?" or "Yes, it was new to you, so?" or "Yes, people disagreed with each other over an implementation choice."

Some of the observations about "proper REST" vs. "bah, that's not really RESTful" point out the differences between expedient REST-like design and really good REST design. Some of these considerations can be helpful.

The one point worthy of deeper thought is the nature of verb-heavy highly-stateful RPC design and RESTful noun-heavy design. The question here is the definition of state and the nature of state change. Some people appear to be enthralled with many nuanced state changes. I've been doing too much data warehouse and functional design where the data is essentially stateless and CRUD rules are refined down to CRD with a rare U under limited circumstances.

And, yes, that means using relatively "stateless" OO design where an object is wrapped inside a new object that includes derived data or a composition of stateless objects. The following example leverages duck typing to create immutable objects where the class reflects the state of the object.

class Thing:
    def __init__(self, a, b):
        self.a, self.b = a, b
    def set_c(self, c):
        return DerivedThing(self, c)

class DerivedThing:
    def __init__(self, thing: Thing, c):
        self.thing, self.c = thing, c
    @property
    def a(self):
        return self.thing.a
    @property
    def b(self):
        return self.thing.b
    @property
    def value(self):
        return self.a * self.c + self.b

And, yes, I'm not building things which are absolutely stateless because Python has stateful lists and mappings, and web services rely on stateful persistence. And, yes, I reject functional purism because I'm stupid. Can we move on, now?

Something that seemed essential to me (but appears to be confusing from reading complaints about REST) is understanding the notion of "state." One view of state is an aggregation of details. The final state of an object is a reduction over the changes -- akin to a sum(), max(), or min(), or perhaps something more involved like last(). The paucity of REST verbs is not a problem when you understand current state as the end product of applying a journal of previous state change mementos. Each "change", then, isn't a complex Update (REST Put or Patch) where there aren't enough verbs to describe each nuanced change. It's a Create (REST Post) of the next change memento. The RESTful service can eagerly apply the change to compute the current state. Or it can lazily apply the changes to compute the current state.
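The "state is a reduction over a journal" idea can be sketched in a few lines. The memento format (small dictionaries of changed attributes) and the function names are assumptions for illustration; this version behaves like last() for each attribute.

```python
from functools import reduce
from typing import Dict, Iterable

State = Dict[str, int]

def apply_change(state: State, memento: State) -> State:
    """Return a new state; later mementos override earlier values."""
    return {**state, **memento}

def current_state(journal: Iterable[State]) -> State:
    """Lazily compute the current state from every posted change."""
    return reduce(apply_change, journal, {})

# Each entry is what one POST created: first the thing, then two values of c.
journal = [{"a": 2, "b": 3}, {"c": 5}, {"c": 7}]
state = current_state(journal)
```

An eager service would call apply_change() once per POST and store the result; a lazy one keeps the journal and reduces on each GET.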

Some of the blog post cited above sounds like "it was new and I didn't like it." Therefore, read the article, locate the strawmen, and know there will always be someone who will complain. Some of the complaints will have merit, some will be whining about the novelty.

In a RESTful context, I'm a fan of this kind of pattern.

/things:
    post:
        summary: Creates a new thing with a and b
        responses:
            201:
                description: thing was created
/things/{id}/c:
    post:
        summary: Sets a value of c for an existing thing, previous value is discarded.
        responses:
            201:
                description: c property of thing {id} was set
       
For more useful advice, start here, for example: RESTful API Designing guidelines — The best practices. Articles like this are useful, too: 10 Best Practices for Better RESTful API.

Tuesday, March 27, 2018

Functional Python Programming 2e -- Now With Type Hints

Functional Python Programming, 2nd ed.

It has been fun to clean up some rambling and reset the math to be sure it's actually right.

And.

Type Hints.

Almost every example has had type hints added.

(And I raised the pylint scores by rearranging some spacing and what-not.)

Bonus. We will be moving the publication date up from June to possibly April. We're still doing technical reviews and what-not, so things aren't *done*.

What was hardest?

Generics. Specifically, decorators can have quite complex type hints. Indeed, type hinting raises important questions about trying to write super-generic functions that can handle too wide a spectrum of types.

def some_function(arg):
    if isinstance(arg, dict):
        do_something(arg)
    elif isinstance(arg, list):
        do_something({i: v for i, v in enumerate(arg)})
    else: 
        do_something(dict(arg=arg))


This kind of thing turns out to be ill-advised. It's probably a bad design. More importantly, it's difficult to annotate, making it difficult to discern if it behaves correctly.

In this case, the argument is Union[Dict, Sequence, Any]. I've got a few examples of Union types, but they're rare because I'm not a fan in the first place. And the few places I used them, the complexity of getting past mypy type checks showed that they add risk and cost without a dramatic reduction in complexity.

In this specific case, the some_function() function is merely a type-converting wrapper around the do_something() function. It's probably better to refactor the type conversion responsibility into the clients of some_function().
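Here's a rough sketch of what that refactoring might look like. The body of `do_something()` is a stand-in (it only counts items), and the `from_sequence()` and `from_scalar()` helper names are hypothetical. The point is only that `do_something()` gets one simple annotation and each client does its own conversion.

```python
from typing import Any, Dict, Sequence


def do_something(mapping: Dict[Any, Any]) -> int:
    """Hypothetical stand-in for the real work: it only counts the items."""
    return len(mapping)


def from_sequence(items: Sequence[Any]) -> Dict[int, Any]:
    """Explicit conversion a client applies when it holds a sequence."""
    return dict(enumerate(items))


def from_scalar(value: Any) -> Dict[str, Any]:
    """Explicit conversion a client applies when it holds a single value."""
    return {"arg": value}


# Each client knows what it has, and converts before calling.
count_from_dict = do_something({"x": 1, "y": 2})
count_from_list = do_something(from_sequence(["p", "q", "r"]))
count_from_scalar = do_something(from_scalar(42))
```

No Union is needed anywhere, and each conversion can be unit tested by itself.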

The arguments about "encapsulation" or "the client shouldn't know that detail" are generally kind of silly. We're all adults here, we generally have to know what's going on with respect to the conversions in order to use the function correctly and write unit tests.

Tuesday, March 20, 2018

HATEOAS is useless? Or not used enough?

See Why HATEOAS is useless and what that means for REST.

The article provides a background leading up to these observations:
  • There are very few good tools to create a REST API using this style
  • There are no clients widely used to consume these types of APIs
The "useless" in the title is more like "not used enough."

There's a multi-part conclusion that may be more helpful if it's fleshed out further. For now, however, it appears that the big problems center around:
  • You still need to write OpenAPI Specifications (OAS, f/k/a Swagger). I don't think this is bad. The blog post makes it sound like a problem. I think it's essential.
  • You need to put versioning somewhere. The path is less than ideal. I'm big on the Accept header containing application-specific MIME types. For example, application/vnd.com.your-name-here.app.json+v1. This doesn't strike me as a problem, either.
  • The whole approach is "closer to RPC than some REST lovers like to admit." I think this point revolves around the way JSON-RPC or SOAP involves some overheads above basic HTTP that are unhelpful. I don't think the "closer to RPC" follows logically from the lack of tooling for HATEOAS, but it certainly could be true that a badly-done API might involve too many of the wrong kinds of overheads.
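On the versioning point above, here's a small sketch of how a server might pull a version number out of the Accept header. The regular expression assumes the naming convention of the example MIME type (a vendor tree ending in `.json+vN`), and `requested_version()` is a hypothetical helper, not part of any framework. It takes the first match in header order and ignores q-value preferences.

```python
import re
from typing import Optional

# Assumed naming convention, following the example MIME type in the text:
# application/vnd.<reverse-domain>.<app>.json+v<N>
VERSION_PATTERN = re.compile(r"application/vnd\.[\w.-]+\.json\+v(\d+)")


def requested_version(accept_header: str) -> Optional[int]:
    """Return the first API version named in an Accept header, or None."""
    for media_range in accept_header.split(","):
        # Drop parameters such as ";q=0.9" before matching.
        media_type = media_range.split(";")[0].strip()
        match = VERSION_PATTERN.fullmatch(media_type)
        if match:
            return int(match.group(1))
    return None


v1 = requested_version("application/vnd.com.your-name-here.app.json+v1")
v2 = requested_version("text/html, application/vnd.com.example.app.json+v2;q=0.9")
unversioned = requested_version("application/json")
```

A client that sends a plain application/json gets `None` back, which leaves the server free to pick a default version or refuse the request.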
I think there's a hidden strawman here. The "automatic discovery" idea. I don't think this idea makes a lick of sense. Some people think it's implied (or required) by REST, and any failure to provide for fully-automated semantically rich discovery of an API is some kind of failure.

I don't think full semantic discovery is possible or even desirable. 
  • It's not possible because of the problem of assigning names and meanings to resources and verbs in an end-point. The necessary details can only be exposed with a semantically complete ontology and complex SPARQL queries into the ontology to find resources and end-points. 
  • It's not desirable because we replace a human-focused OAS with a complete ontology that has to be rigorously defined, and tested to be sure that all kinds of automated discovery algorithms can understand the provided details. And none of this addresses the actual application, it's all rich, detailed meta-description of the application.
I don't see why we're trying to replace people. API discovery is actually kind of hard. The resources, their relationships, and the verbs for getting or updating those resources involve an essentially difficult knowledge capture and dissemination problem.

Friday, March 9, 2018

Python Interviews

The #Python Interviews book is out. Mike Driscoll interviewed a bunch of Python experts. And me, too. Get 30% off the Amazon paperback version of the book using the code 30PYTHON: https://goo.gl/5A3uhq

Here's a flavor of how this went:

Driscoll: So how did you end up becoming an author of Python books?
Lott: Most roles in my career more or less just happened to me, but becoming a writer was a conscious decision.
In this case, I had decided that there could be value in teaching the Python language and the associated software engineering skills. I started to collect notes for a book in 2002. By 2010, I had tried self-publishing several books on Python.
When Stack Overflow started, I was an early participant. There were many interesting Python questions. The questions showed gaps where more information was needed about Python specifically and software engineering in general. Over a few years, I answered thousands of questions about Python and somehow built up a large reputation.

Monday, March 5, 2018

Python Interviews -- Coming Soon from Packt

See https://www.packtpub.com/web-development/python-interviews

I'm honored.

I'll be studying what the other folks have to say in here. Being in the Python community means respecting others' views. And that means understanding them.

This looks like fun because it isn't *deeply* technical, it's about people and technology.

Tuesday, January 30, 2018

The SQL-based relational database isn't perfection? Whoa if true

Yes, there are people for whom document databases (and the file system) are confusing and weird.

I was sent this: Relational Algebra Is the Root of SQL Problems which is really brilliant and provides some helpful concrete examples of stuff SQL is really bad at.

The accompanying email was filled with nonsense about how important and world-changing SQL was.

I can't disagree. Back when disk was very expensive and very small, the SQL-based join strategies were essential for micro-managing every bit of data. Literally. Every Bit.

And then we would denormalize the structure for performance reasons. Because we always knew that SQL was terrible at a fairly large number of things.

Those days are behind us. We can now choose to use a document database, and make our lives simpler. Storage is relatively inexpensive, and the labor to normalize and denormalize data doesn't create significant value. The need to write stored procedures to turn a single conceptual operation into a bunch of inserts and updates was a symptom that this wasn't the best approach.

I've had many "But what about..." conversations regarding document databases.

"What about ad-hoc queries in SQL?"

- Do you really do these without writing a Python script or creating a Pandas dataframe? I doubt it. But. If you really think you'll do this, most document stores support either a modified SQL or JavaScript. And yes, you hate JavaScript, duly noted. I hate SQL, so we're even there.

"What about joins?"

- It's a space-saving technique. We don't need the overheads to save the space. The "update anomalies" still require careful design, and may lead to some decomposition of data into multiple documents. But the ruthless normalization shouldn't be seen as a requirement.

"What about the schema?"

- It's brittle and schema migration creates a lot of low-value labor. We can use Python JSONSchema to validate documents. See NoSQL Database doesn't Mean No Schema.
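To show the flavor of schema validation without a database-enforced schema, here's a hand-rolled sketch that checks a tiny JSON-Schema-like subset (only `type`, `required`, and `properties`). It is not the jsonschema package and covers a small fraction of what that package does. The `schema_errors()` function and the `thing_schema` example are hypothetical.

```python
from typing import Any, Dict, List

# Map JSON Schema type names to Python types (a deliberately small subset).
JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "boolean": bool,
}


def schema_errors(document: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return a list of problems; an empty list means the document conforms."""
    expected = schema.get("type")
    if expected is not None:
        python_type = JSON_TYPES[expected]
        # bool is a subclass of int in Python, so exclude it from "number".
        is_bool_as_number = expected == "number" and isinstance(document, bool)
        if not isinstance(document, python_type) or is_bool_as_number:
            return [f"{path}: expected {expected}"]
    errors: List[str] = []
    if isinstance(document, dict):
        for name in schema.get("required", []):
            if name not in document:
                errors.append(f"{path}: missing required property {name!r}")
        for name, subschema in schema.get("properties", {}).items():
            if name in document:
                errors.extend(schema_errors(document[name], subschema, f"{path}.{name}"))
    return errors


thing_schema = {
    "type": "object",
    "required": ["a", "b"],
    "properties": {
        "a": {"type": "number"},
        "b": {"type": "string"},
        "c": {"type": "array"},
    },
}

good = schema_errors({"a": 1, "b": "two", "c": [3]}, thing_schema)
bad = schema_errors({"a": "one"}, thing_schema)
```

The schema is ordinary data, so changing it is a code change and a test run, not a table migration.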

Transactional v. Analytical

It requires some care to understand the distinction between "transactional" and "analytical" uses for data. While folks try to leverage this distinction, it's a spectrum, not a sharp distinction.

A lot of data collection is a simple sequence of event documents. These have no sensible state change, so they're not really transactional. They are often created by concurrent processes where locking prevents corruption, so transactions *seem* helpful. Except, of course, the file system writes can be trivially sharded by process ID and then unified later. And all document databases serialize document writes from multiple client processes, so there's no value to writing to a relational database.
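The "shard by process ID, unify later" idea can be sketched with nothing but the standard library. The file naming pattern (`events-<pid>.jsonl`) and the `log_event()` and `all_events()` function names are assumptions made up for this example.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator


def log_event(directory: Path, event: Dict[str, Any]) -> None:
    """Append one event as a JSON line to this process's own file."""
    shard = directory / f"events-{os.getpid()}.jsonl"
    with shard.open("a") as target:
        target.write(json.dumps(event) + "\n")


def all_events(directory: Path) -> Iterator[Dict[str, Any]]:
    """Unify the per-process shards later, at analysis time."""
    for shard in sorted(directory.glob("events-*.jsonl")):
        with shard.open() as source:
            for line in source:
                yield json.loads(line)


with tempfile.TemporaryDirectory() as name:
    directory = Path(name)
    log_event(directory, {"kind": "click", "x": 10})
    log_event(directory, {"kind": "click", "x": 20})
    # Simulate a shard left behind by another (hypothetical) process.
    (directory / "events-0.jsonl").write_text('{"kind": "view", "x": 0}\n')
    events = list(all_events(directory))
    kinds = sorted(e["kind"] for e in events)
```

Since each process appends only to its own file, the writers never contend with each other, and no lock is involved.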

Some data operations are properly stateful. By normalizing our tables, we make moving from consistent state to consistent state complex, which requires a defined transaction as a work-around. And don't get me started on replication and two-phase commit as yet another layer of complexity on top of transactions.

A document database allows us to skip over 1NF. We can think of a document as being a row in a table where the data types are complex data structures involving mappings, sequences, strings, numbers, booleans, and nulls. (See JSON Schema.) A lot of multi-step SQL transactions are operations on several children of a common parent. If the parent were persisted as a single document, there wouldn't be multiple operations; an atomic MongoDB update operation can make complex rewrites to a complex document.

We can contrive a design where state changes must be coordinated and the data cannot be colocated in a single document. It's not difficult to stipulate enough requirements to make single documents difficult. The presence of these contrived requirements, however, doesn't suddenly invalidate document datastores for transactional data. In the SQL world, the idea of long-running and reversible long-running transactions has always been a horrible problem. Allowing stacked "undo" for the user means either creating a chain of Memento objects that can recover previous state, or having numerous flags and indicators on each record, allowing the state to be reversed. Some design problems are really hard. And the SQL model seems to make them harder.

The core ACID concept of always being consistent is -- in practice -- nonsense. As soon as we have to consider "isolation levels" and "read consistency" it becomes clear that there is no consistent state unless all transactions and queries are serialized via exclusive "whole database" locking. Competent DBAs know that long-running analytic queries performed concurrently with transactional updates can't use locking, and must tolerate inconsistencies in the database.

It's common practice to do data extracts so that analytic queries aren't working against the (inconsistent) transactional data. In this case, the frequency of extracts is the timing of "eventual consistency" promised by the BASE concept.

Bottom Line: Relational ACID rules are almost always broken in practice by read consistency rules and extracts to analytic databases. Analytical data is always based on eventual consistency expectations. The batch extracts mean "eventually" is measured in hours. A document data store can often create consistency in milliseconds. (MongoDB primary failure, voting, and secondary promotion to primary rely on a 10-second heartbeat timeout by default, so it takes time to discover and repair.)

Also

A second email detailed their amazement (Amazing! Wow! Unbelievable! You Must Inform The World Of This!) that analytic processing of data is actually faster and simpler using the file system. The very idea of HDFS was so amazing that they were amazed.

Somehow, the idea of the raw filesystem as being really, really fast was the source of much amazement.

I'm glad they're making an effort to catch up. I'm glad they're seeing the relational model as a bad choice that has a limited number of use cases. Mostly, relational databases are useful for an organization that can't write APIs to handle the integrity issues.

To SQL or NoSQL? That's the database question | Ars Technica

Tuesday, January 23, 2018

PyCon 2018 Program Committee


I was "volunteered" by a colleague to help the program committee for PyCon 2018. I rarely think of myself as qualified for this kind of thing. Yes. I have six books on Python (with a seventh on the way) but the PSF folks are brilliant and dedicated and hard-working, and I'm just a slob.

Yes, I do get to help the community by reviewing almost 700 individual proposals. Some good. Some really good. Some which we *must* hear. 

The collateral benefit? 

Side reading.

My browser history is filled with things I hadn't known existed. 

Next time, I need to get started *before* the deadline so I can have a little more interaction with the authors. There were a few outlines where we could only discuss the possibility of making a change if the proposal was accepted. 

In particular, there seemed to be a *lot* of Machine Learning-Bayesian-Deep Learning-Recommender-Data Science pitches that had abbreviated outlines. They tend to all look alike to someone who's not an expert. Five bullet points: the author's background, the problem domain, ML (or modeling or whatever), a Jupyter notebook showing the results, and a conclusion.  Providing some distinct angle to the pitch (other than the problem domain) might help me understand them more fully. It seemed best to defer to the consensus on these.

I've been learning to live with my personal bias against meta-talks about building community. A presentation on community building at a community event seems redundant to me. But that doesn't mean they're not thorough, articulate talks that will be useful to others. Since I have a seat at the table, I'm biased. The Python tie-in feels weak, but our code of conduct (Open, Considerate, Respectful) means PyCon really is the place for more of this. Most importantly, they're objectively solid talks. (And -- as a member of the over-represented old male nerd class -- I do need to listen more.)

It's been enlightening. And the conference will rock.