Tuesday, June 21, 2016

Why Python? (Sad Follow-up)

In "Why Python?" I linked to a deep and sophisticated analysis of programming languages. Anyway, I thought it was a deep and sophisticated analysis.

I got a reply that shows how wrong I was. Here's the quote:
The point is that the Python ecosystem has a lot to offer. We could argue about the language design choices. However, why bother? Why not just take advantage of what the ecosystem has to offer.
Ah. Discussing the language is just "arguing". I guess the points are all debatable and my comparison of Python to any benchmark is just the seed for an argument. A religious war, perhaps. I guess this wasn't compelling. It was a "why bother?"

Why bother pointing out the strong points of the language?

The email emphasized the "ecosystem" with a cool but short example of how scipy.spatial.KDTree works.
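
I can reconstruct the gist of it. This is my sketch, not the email's actual code:

    # Nearest-neighbor lookup with scipy.spatial.KDTree.
    import numpy as np
    from scipy.spatial import KDTree

    points = np.random.rand(1000, 2)    # 1,000 random points in the unit square
    tree = KDTree(points)

    # The three points nearest the center, without a brute-force scan.
    distances, indices = tree.query([0.5, 0.5], k=3)
    print(points[indices])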

It appears that -- for some people -- "Python code actually works" is a useful response to "why python?" 

I would have thought that "Python code actually works" was a precondition to even discussing the value proposition behind Python.

But -- clearly -- I was wrong.  The mere fact of a working example is a Very Important Thing™.

What does this mean?
  1. There are people who use software that doesn't actually work. When they see software that works, it's important. Very important.
  2. When software actually works, these people find this simple fact to be a compelling and substantial argument for placing a high value on the software.
  3. Other considerations like clarity and simplicity aren't relevant. If these poor souls are suffering with software that doesn't actually work, then clear-but-broken is no better than obscure-but-broken. The rest of Wirth's long discussion is just debating points.
The email included "consider amending the why python? blog w/ the other big pro: ecosystem". I'm not sure I actually understand the request. When code that works is a "big pro", this comes from a world I can't pretend to understand.

Also. The example code used xrange(). Which is a Python 2 smell. Those days are past.

Tuesday, June 14, 2016

Continuous Data Migration

See http://slott-softwarearchitect.blogspot.com/2013/07/database-conversion-or-schema-migration.html

People talk about CI/CD (Continuous Integration/Continuous Deployment).

They also need to talk about CM (Continuous Migration).

"Wait, what?" you ask.

When we roll out a new version of the software (CD) there are three common situations.
  1. The new software uses the existing data model with no changes. This is a "minor version change": from v3.2 to v3.3.
  2. The new software requires a tweak to the schema, but it's backward compatible. This, too, is a minor version change. In a SQL context, we might have used an ALTER TABLE to add a nullable column. If there are no SELECT * statements in the code, this change is essentially transparent to legacy code.
  3. The new software involves a new schema that's not backwards compatible. This is a major version change. From v3.2 to v4.0. This is difficult. Really difficult.
Clearly, the first two can be done with the data in place. New software is installed, the servers are restarted, and away we go. In a big environment, there may be a rolling deployment. There may be a canary release that will get converted first, then others will be brought online.

A change of the Second Kind does involve the one-time database transformation script. This may lead to some down-time. Or it may lead to a feature toggle so that the new software can work with the old database until the script is run.

In a NoSQL context, a change of the Second Kind doesn't require the one-time script. The new documents have new fields that old documents don't have. NoSQL apps -- in general -- must be able to cope with data model variations.
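
In Python terms, "coping" often amounts to supplying a default. A minimal sketch; the field names are hypothetical:

    # Hypothetical example: old documents lack the "discount" field
    # added by a minor version change. The application supplies a
    # default instead of demanding a one-time conversion script.
    def unit_price(document):
        base = document["price"]
        discount = document.get("discount", 0.0)   # absent in old documents
        return base * (1 - discount)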

A change of the Third Kind is trouble.

Big trouble.

We have two schemas: the v3 schema and the v4 schema. We have two sets of software: the v3.2 release and the v4.0 release. We'd like to have just one valid set of data. How do we deal with this?

How can we do schema migration badly?

We can't easily have a single software release that includes one set of data in both schemas. It's technically possible. Anything that doesn't involve time travel, anti-gravity, or perpetual motion is technically possible. But it rapidly becomes so complex that we have to set this uber-version idea aside.

We have to do more deployment work to have both v3.2 and v4.0 installed in parallel. v3.2 will use data in the v3 schema; v4.0 will use data in the v4 schema.

How do we migrate the data from the old schema to the new schema?

This can be tricky. There are proven bad ideas out there. Really epically bad ideas.

One Very Bad Idea (VBI™) is the one-time-only data migration. Back in the olden days, we couldn't afford enough storage to have two copies of the database. Seriously. When a company owned exactly one computer (before PC's -- a Very Long Time Ago) the conversion had to be done by making special backups and restoring the backups into the new schema.

This VBI is still with us today. Lots of places want to do one-time-only data migrations because it's the traditional approach. If they can't do a one-time conversion (over a long weekend), they complain. Loudly.

BTW. This never worked well. The one-time-only conversion software was never tested carefully, and therefore rarely worked the one time it was needed. Also, data profiling was never done, so edge and corner cases were found during conversion. These often called the new software's features into question, leading to larger and larger problems.

Continuous Migration

The ideas behind continuous migration are these.
  1. We're always going to be migrating the data. Always.
  2. Storage is cheaper than labor. When in doubt, buy more storage.
  3. Data outside the database (in CSV files or YAML documents) is smaller than data inside the database. Don't be afraid to export.
  4. Data outside the database is inaccessible. Be cautious of the implied down-time during exports and imports.
  5. ABP. Always Be Profiling. If you don't have a data profiler in place right now, that's the first thing to build. There are schema definition tools and schema checking tools. Look at JSON-Schema.org. Write schema definitions and use a data profiler to examine all rows and check all rules. All. Seriously. All. In a SQL DB, actually check the foreign keys to be sure the referenced row exists; you'll be surprised. (There's a profiler sketch after this list.)
  6. We're moving forward. We're not milling around; we're not supporting the old version except for the purposes of a parallel test or a fall-back in the event the next version doesn't work. There's no long-term coexistence strategy. Preserve the data; upgrade the software.
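
Here's the profiler sketch promised above, using the jsonschema package. The schema and the field names are hypothetical; the point is checking every row against an explicit schema:

    # Minimal data profiler: validate every row against a schema.
    from jsonschema import Draft7Validator

    ROW_SCHEMA = {
        "type": "object",
        "required": ["id", "created", "status"],
        "properties": {
            "id": {"type": "integer"},
            "created": {"type": "string"},
            "status": {"enum": ["new", "active", "closed"]},
        },
    }

    validator = Draft7Validator(ROW_SCHEMA)

    def profile(rows):
        """Check all rows. All. Report every violation."""
        for n, row in enumerate(rows):
            for error in validator.iter_errors(row):
                print("row {}: {}".format(n, error.message))
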
Here's the central data migration requirement:

Be able to migrate to the new schema as many times as needed.

I'll repeat that. As. Many. Times. As. Needed.

Migration is not a one-time thing. You do it all the time.
  1. Migrating (and possibly sanitizing or subsetting) production data into the development environment.
  2. Migrating production data for QA testing.
  3. Migrating production data for integration testing.
  4. Migrating production data for performance testing.
  5. Migrating production data for the production upgrade.
These are all the same activity. 

I'll repeat that. The. Same. Activity. Sometimes with mappings. Sometimes with filters. 

Since you'll do many, many migrations, your data migration programming is as important as your application programming. Perhaps more important than the application code because it's what preserves the data, and the data is the only thing of value. Applications come and go. Data is forever.

Having real data available permits seamless, silent, and automatic parallel testing. We can easily do a parallel test with v3.2 and v4.0 release candidates by simply running the migration (or migration with subset filter) to gather some data for the parallel test. If the release candidate has problems, we can fix v4.0 to create the next release candidate, re-migrate the data, and try the parallel test again.

At some point the v4.0 release is final, and we need to migrate all of the data. This (usually) involves some feature toggles to put v3.2 into a special "end-of-life" mode where the keys for records which change are logged separately. After turning off v3.2 and turning on v4.0, a second phase of migration will process these end-of-life rows through the migration mill.

Software and Schema Design Consequences

This has an important consequence.

Your software must be explicitly bound to a specific schema by major version number.

Explicitly bound. In a SQL context, you can use the "schema" construct and include the version number in the schema name: "myapp_v3" vs. "myapp_v4". This becomes a ubiquitous qualifier on all table names. SELECT col FROM myapp_v4.some_table AS st.

Yes. Do this Everywhere. Do it Now. 

If you're using mybatis or SQLAlchemy to get the SQL out of your application, then this kind of thing is a trivial change. If you have SQL in your application code, well, you have two problems to solve. First, get the SQL out of your application. Then make the schema version explicit.
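
With SQLAlchemy, for instance, the explicit binding can be a single schema= argument, and the version change becomes a one-line edit. A sketch (SQLAlchemy 1.4+ style; the table and column names are hypothetical):

    # Bind every table to a versioned schema name.
    from sqlalchemy import Column, Integer, MetaData, String, Table, select

    SCHEMA_VERSION = "myapp_v4"
    metadata = MetaData(schema=SCHEMA_VERSION)

    some_table = Table(
        "some_table", metadata,
        Column("id", Integer, primary_key=True),
        Column("col", String),
    )

    # Renders roughly: SELECT col FROM myapp_v4.some_table
    query = select(some_table.c.col)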

In a NoSQL context, you can include the schema version as part of a collection name. "collection_v3" or "collection_v4".

This should be present everywhere.
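
With pymongo, the versioned name is a one-line convention. A sketch; the database and collection names are hypothetical:

    # Versioned collection names in pymongo.
    from pymongo import MongoClient

    SCHEMA_VERSION = 4
    db = MongoClient().myapp
    collection = db["collection_v{}".format(SCHEMA_VERSION)]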

Then, you'll need data validation apps and data migration apps. The validation apps will use your favorite schema definition and schema validation framework. Start running this as soon as you think you might need to make a major version change.

Finally, you'll need the data migration tool set. This will involve filter rules and sanitizing rules. These are not sophisticated "rules engine" kind of things with unbounded complexity. They're usually if statements and simple computations. But they come and go pretty freely, so design the software in a way that makes the filter and sanitizing code obvious.
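
One way to keep them obvious is a pipeline of small, disposable functions. A sketch; the specific filter, sanitizing, and mapping rules here are hypothetical:

    # The moving parts of a migration tool.

    def recent_only(rows):
        """Filter rule: subset production data for a parallel test."""
        return (row for row in rows if row["created"] >= "2016-01-01")

    def mask_ssn(row):
        """Sanitizing rule: sensitive values don't leave production."""
        if "ssn" in row:
            row["ssn"] = "***-**-****"
        return row

    def v3_to_v4(row):
        """Mapping rule: the v3 "state" field becomes the v4 "status"."""
        row["status"] = row.pop("state", "new")
        return row

    def migrate(source_rows, save):
        """Run every row through the mill. As many times as needed."""
        for row in recent_only(source_rows):
            save(v3_to_v4(mask_ssn(row)))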

Now you can -- trivially -- migrate data between schema versions inside the same database. You can have v3.2 and v4.0 running side-by-side. You can migrate the data early and often. You can profile and validate the data. You have a formal schema for the data validation. 

Tuesday, June 7, 2016

Why rewrite a shell script in Python?

Here's the actual quote:

Why would you need to rewrite a working script in python ? Was there any business direction towards this ?

This was an unexpected response. And unwelcome. I guess I called their baby ugly.

The short answer is that the shell script language is perhaps one of the worst programming languages ever invented. Okay. I suppose it's better than Whitespace. Okay, it's better than many others. See https://en.wikipedia.org/wiki/Esoteric_programming_language

The longer answer is this:
  • There are (at least) three ksh scripts involved, two of which are over 1,000 lines long. It's not perfectly clear precisely what's involved. It's ksh. Code could come from a variety of places through very obscure paths; e.g., the source command and its synonym, ".".
  • There are no comments other than #!/usr/bin/ksh and a few places where code is commented out.  Without explanation.
  • There is no other documentation. The author had sent an email describing the github repo. The repo lacked a README. It took two tries to get them to understand that any email describing a repo should have been the README in the repo. There is barely even a command-line synopsis. (Eventually, I found it in the parser for command-line options.)
  • No tests of any kind.
The last point is the one that I find shocking. And I find it shocking on a regular basis.

Folks are able and willing to write 1,000's of lines of shell script without a single unit test, integration test, system test, performance test, anything test. How do they know if this works? Why am I supposed to trust it?

More importantly, how can I meaningfully wrap this into a RESTful API if I'm not even sure what the command-line interface really is? It's the shell. It could use environment variables that are otherwise undocumented. They would be discovered when they cause a crash at run time. Crashes that become an HTTP 500 status code and a traceback error message in the web log.

The "business direction" sounds like an attempt to trump the technical discussion with a business consideration like "cost" or "benefit". It should be pretty self-evident that 1,000's of lines of shell script code is technical debt.

The minimally viable replacement will probably be a similarly-sized Python script that mindlessly mirrors the original shell script. It's sometimes quite hard to tell what purpose a shell function really serves. The endless use (and re-use) of global variables can make state change hard to follow. Also, the use of temporary files which are parsed (and reparsed) as a way to set state is a serious problem.

What's important is that the various OS services used by the shell script are mockable. Which means that each of the 100 or so individual functions within the script can be tested in isolation. Once that's out of the way, refactoring becomes plausible.

Let's savor those words for a moment: Tested. In. Isolation.
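
A sketch of what that looks like. The function is a hypothetical stand-in for one of the script's functions; the real ones wrap whatever OS services the ksh code touched:

    # One shell pipeline, rewritten as a function with a mockable boundary.
    import subprocess
    from unittest import mock

    def free_kb(path="/var/data"):
        """Replaces a df pipeline; returns free KB on a filesystem."""
        result = subprocess.run(
            ["df", "--output=avail", path],
            stdout=subprocess.PIPE, universal_newlines=True, check=True,
        )
        return int(result.stdout.splitlines()[-1])

    def test_free_kb():
        fake = mock.Mock(stdout="Avail\n4096\n")
        with mock.patch("subprocess.run", return_value=fake):
            assert free_kb() == 4096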


The better replacement is (I think) about 250 lines of Python -- perhaps less -- that perform the real 8-step process that we're automating. Getting rid of ksh language cruft is challenging, but essential.

Tuesday, May 10, 2016

Why Python? What's it good for? How is it special?

First. The question is moot. It's a programming language. It's good for programming.

When I push back, folks try to produce languages which exist only in certain pigeonholes.

"You know. PHP is for web and JavaScript runs in the browser. What's Python for?"

The PHP and JavaScript examples aren't helpful. That doesn't narrow the domain of problems for which Python is appropriate. It only shows that some languages have narrow domains.

"You know. Objective-C and Swift are for iOS. What's the predominant place Python is used?"

Python also runs on iOS. I don't know if it has suitable bindings for building apps. If it does, that doesn't change my answer. It's good for programming.

"Java is used mainly for web apps, right? What about Python?"

Okay. At this point, the question has slipped from moot to ignorant.

Can we just set that aside? Can we move on?

If you want some useful insight, start here: Niklaus Wirth's essay "On the Design of Programming Languages" (1974).

Yes, it's an essay from 1974.  Parts of it are a little old-fashioned, but a lot of it is still rock-solid. For example, the idea of strongly-typed pointers is considered more-or-less standard now. It was debatable then. And Wirth's opinion continues to drive language design.

Page 28 has the key points: features of a programming language. Enumerated by the inventor of Pascal, Modula, Oberon, and other languages too numerous to recall.

Some of the list is a little dated. "...different character sets...," for example, has been superseded by Unicode.

Also, the list is focused on compiled languages. Python is a dynamic language. It's interpreted. Yes, there's a compiler, but that's mostly an optimization of the source code. If you replace "compiler" with "run-time", the list stands up as a description of good languages.

I like this list because it helps characterize why Python works out so well. And why many other languages are also pretty good. It points up the reason why quirky languages like JavaScript (or even Ruby) are suspicious. Some of the points about efficiency are important topics for further discussion.

I often have to remind folks who work with Big Data that most of our processing is I/O bound. Python waits for the database somewhat more efficiently than Java. Why does Python wait more efficiently? Because it uses less memory. Sometimes this is a win.

Let's not ask silly questions about a general-purpose language. Instead, let's benchmark solutions, and compare tangible performance numbers using real code.

Tuesday, May 3, 2016

The Lynda.com Experience

One word: "wow"

More words: "Helping shy people get up and do what needs to be done."

Yes, that's Garrison Keillor's tag line for one of the "sponsors" of "A Prairie Home Companion": the Powdermilk Biscuits company. (Heavens, they're tasty and expeditious.)

The folks at Lynda are truly great at shepherding folks through the process of preparing and recording their material.

Recording is hard. The point is to say each thing perfectly. But the things have to fit into a larger narrative of a section that fits into the larger sequence of chapters that makes up the course.

Giving essentially the same content in a presentation at a conference is almost unrelated. Talking at a conference has a live audience. It's one-time-only, and you can ad-lib.

Doing this takes patience. And skilled editing both at a content level and at a technical level. Lynda has it all.

The thing that made me the most comfortable was having my presentation material ready. Each section is a 5-minute lightning talk. I had all of my slides ready. I'd been through them enough times to be sure that I could handle the 5-minute format. And when there were editorial changes, they tended to be relatively minor.

I may try it again. It's a lot of work. Certainly more work than writing a chapter in a book. A chapter can go deep. A presentation has to stick to the high points: this means that the supporting depth must be there, but you're not going to wallow around in it. Essentially, you're making the "elevator pitch" for each one of your points.

The recording and live action studio space were fun. I've never been recorded or taped like that before. They eased me into it, coached me through it, and made sure all of the content was there in a way that could be edited into a high quality final product.

Tuesday, April 19, 2016

A NoSQL Conversation

This cropped up recently. It's part of a "replace Mongo with Relational DB" conversation.

I'm going to elide the conversation down to five key points. The three post-hoc nonsensical ideas, and the two real points.

What's (to me) very telling is that someone else published the five reasons in this order. As if they larded three on the front. Or included the two at the end out of guilt because they were avoiding the real issues.

Relational Queries are Desired. "the only way to find [the documents] would be to write a query that literally trolls through the entire database in order to find the most recent values". 

I beg to differ. "Only way" is a strong statement. Mongo has indexes. To suggest that they don't exist or don't work is misleading. The details of the use case involved searching by date. It's possible to contrive a database that does bad searches by date; the implication being that Mongo couldn't do date matching or something. 
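
For the record, a sketch of the index and query in question (pymongo; the collection and field names are hypothetical):

    # With an index, "most recent value" is an index walk, not a troll
    # through the entire database.
    import pymongo

    readings = pymongo.MongoClient().myapp.readings
    readings.create_index([("timestamp", pymongo.DESCENDING)])

    latest = readings.find_one(sort=[("timestamp", pymongo.DESCENDING)])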

To Enforce Constraints and Schema. "It is still possible for the application layer to ensure the constraint, but that relies on every single point in the application code enforcing it – a single error can lead to inconsistent data". 

This runs perilously close to the "what if some bonehead bypasses the API and hacks into the database directly?" question. Which is isomorphic to "what if all corporate governance disappeared tomorrow?" and "what if an evil genius hacks all our database drivers?"

Lack of Document-Oriented Access Patterns. "If there are more complex access patterns (like reading certain fields from many records, or frequently updating single fields within a record) then a document-oriented database is not a good fit".

That's nonsense. Mongo has field-level updates. There was one example of a long-running transaction that appeared to be mis-designed. I suggested that an improved design might be less complex and expensive than rewriting the API's and moving the data.
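
A field-level update is a one-liner. A sketch; the collection and field names are hypothetical:

    # Update a single field in place; no read-modify-write of the
    # whole document.
    import pymongo

    orders = pymongo.MongoClient().myapp.orders
    orders.update_one({"_id": 42}, {"$set": {"status": "complete"}})
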
Desire to Utilize [Relational DB]

More Support Available for [Relational DB]

Clearly, these last two are the real reasons. Everything above looks like post-hoc justification for the real issue. 
We're not sure we like Mongo. 
My point in the conversation was not to talk them out of making a switch. The last two reasons included the kind of compelling rationalization that can't be disputed.  The best I could do was to challenge the errors in the first three reasons so that everyone could be honest about the change. It's not technical. It's organizational.

Tuesday, April 5, 2016

The GUI Problem

I write Microservices. And not-so-micro Services. API's.

I got this email recently.

"Goal: get you to consider adding Gooey to your Python tool set"

"What it's for: Turn a console-based Python program into one that sports a platform-native GUI.

"Why it's great: Presenting people, especially rank-and-file users, with a command-line application is among the fastest ways to reduce its use. Few beyond the hardcore like figuring out what options to pass, or in what order. Gooey takes arguments expected by the argparse library and presents them to users as a GUI form, with all options labeled and presented with appropriate controls (such as a drop-down for a multi-option argument, and so on). Very little additional coding -- a single include and a single decorator -- is needed to make it work, assuming you're already using argparse."

The examples and the GitHub documentation make it look delightful.
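
The pattern, as the Gooey documentation presents it, is roughly this (a sketch; the specific arguments are hypothetical):

    # One import, one decorator, and the argparse definitions become
    # a GUI form.
    import argparse
    from gooey import Gooey

    @Gooey
    def main():
        parser = argparse.ArgumentParser(description="Process a data file.")
        parser.add_argument("infile", help="file to process")
        parser.add_argument("--verbose", action="store_true")
        args = parser.parse_args()
        print(args.infile, args.verbose)

    if __name__ == "__main__":
        main()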

However.  It's utterly useless for me.  Interesting but useless.

From my perspective, API's and microservices are vastly more important than desktop GUI's.

I'll repeat that in order to start a food-fight:

API's and microservices are more important than desktop GUI's

I almost forgot the important qualifiers: When working with Big Data. Or When working with DevOps Automation.

I realize that some people like to cling to the desktop GUI as a Very Important Thing™. Which could be why they send me emails touting the advantages of some kind of GUI tool or framework. The Desktop GUI is important, but, from my perspective, it's a niche.

Actually two niches.

Niche 1. The word processor and spreadsheet and a few other generic tools for putting text into a computer. While desktop versions are better than server-side emacs and vi, they fill a similar purpose. An IDE is (from this perspective) little more than a glorified text editor. In places that use Jenkins and Hudson and uDeploy and all of those server-based tools, the desktop IDE is a place to stage code for Jenkins jobs to do the "real" build.

Niche 2. All the other tools that turn a small-ish computer into a dedicated workstation for specific kinds of media production. Video. Audio. Image. Typesetting. These are not "generic" applications like word processors or spreadsheets; they're very specific and narrowly-focused applications. They rely on effectively transforming the general-purpose computer into a very special-purpose computer.

Super-fancy desktop-based tools for analytics or Big Data processing are not actually too useful. Anyone trying to use a desktop as an enterprise system of record is asking for trouble. I work with folks trying to process terabyte datasets on their laptops and wondering why it takes so long. My company has servers. We pay for MongoDB and Hadoop. We have API's to access big databases with big piles of data. I'm automating the toolsets as fast as I can so they can work with giant datasets.

Gooey looks like fun. But not for me.