Tuesday, July 26, 2016

Another Python to the Rescue Story -- Creating a DSL from Python Class Definitions

https://medium.com/capital-one-developers/automating-nosql-database-builds-a-python-to-the-rescue-story-that-never-gets-old-1d9adbcf6792#.8xp69yxqj

Tuesday, July 12, 2016

Getting Rid of the Gang-of-Four Design Patterns is Nonsense

Someone found Yet Another Post (YAP™) insisting that the Gang of Four (GoF™) patterns were on their last legs. The email was misleading, because this is not precisely what the article said. The bottom line was that design patterns in general are merely a response to gaps in the underlying programming language. A position that's nonsense at its very foundation.

The lexicon of design patterns varies from language to language. GoF patterns aren't "going away." They're part of the Java/C++ world. They don't apply quite the same way to Python or functional languages.

There's a more serious issue, though: Language Mapping. First some background.

Design Patterns

Design Patterns will always exist. They're an artifact of how we process the world. We tend to classify individual objects so that we don't have to deal with each object as a separate wonder of nature.

It's Just Another Brick In The Wall.

We don't have to examine each rectangular solid of ceramic and understand the wonderfulness of it. We can group and summarize. Classify. Brick is a design pattern. So is masonry. So is wall. They're all patterns. It's how we think.

Design Patterns and Language Gaps

There's a claim that moving toward functional languages will kill design patterns. This presumes (partly) that non-OO languages magically don't have design patterns. This is (see above) kind of insane. Languages have design patterns. We recognize these patterns all the time.

A functional language has a common technique (or pattern) for visiting nodes in a hierarchy. We don't dwell on the wonderfulness of the code as if we'd never seen it before. Instead, we classify it based on the design pattern, and leverage this higher-level understanding to figure out why we're walking a hierarchy.
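
For example, a minimal sketch in Python; the Node structure is hypothetical, and any recursive data type works the same way:

    from collections import namedtuple

    Node = namedtuple('Node', ['value', 'children'])

    def visit(node, function):
        # Depth-first walk: apply function to every node's value.
        yield function(node.value)
        for child in node.children:
            yield from visit(child, function)

    tree = Node(1, [Node(2, []), Node(3, [Node(4, [])])])
    print(list(visit(tree, lambda v: v * 10)))  # [10, 20, 30, 40]

We see the shape, classify it as "tree walk," and move on to the interesting question of why the tree is being walked.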

Sounding the death knell for design patterns also presumes (partly) that functional languages are magically more complete than OO languages. In this newer, better language, we don't need patterns because there are no gaps. This is pretty much nutso, too. The Patterns Fill Language Gaps school of thought ignores the fact that there are many ways to implement these "gaps". We can use GoF design patterns, or we can use other software designs that don't fit the GoF design patterns. Both work.

The patterns aren't filling a "gap." They're providing guidance on how to implement something. That's all. Nothing more. Guidance.

"But wait," you say, "since I needed to write code, that's evidence that there's a gap."

"What?" I ask, incredulous. "Are you claiming that any code is evidence of a language gap? Does that mean all application software is just a language gap?"

"Let's not be silly," you say. "I can split a hair and create a tiny distinction between software I shouldn't have to write and software I should have to write."

I remain incredulous.

Design Patterns as Damage

The idea that somehow the GoF design patterns are a problem is also goofy. The GoF design patterns are pretty slick. They solve a fairly broad suite of problems in an elegant and consistent manner.

They're just good design.

Yes, they can be complex. Sorry about that. Software can be complex if you want really excellent flexibility and extensibility.

AND.

Bonus.

Software can be complex when you have to work around the problems of "compiler" and "locked libraries" and "no source." That is, the GoF patterns apply in full force for C++ and Java where you're trying to protect your intellectual property by disclosing only headers and obfuscated implementation details. Indeed, there are few alternatives to the GoF patterns if you're going to distribute a framework that has no visible source and needs to leave extension points for users.

If you don't have Locked-NoSource-Compiled code as a backdrop, the GoF patterns can be simplified a little. But some of the patterns are essential. And remain essential. There are some really great ideas there.

In Python world, we rely on a modified subset of the GoF patterns. They work extremely well.

When writing functional-style Python using immutable data structures (to the extent possible), we use a different set of design patterns. Not so many GoF patterns when we're trying to avoid stateful objects. But some patterns (like the Abstract Factory) are really very helpful even in a largely functional context. It morphs from an abstract factory class to a factory function, and it loses the "abstract" concept that's part of C++ and Java, but the core Factory design pattern remains.
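
A minimal sketch of that morphing, with hypothetical shape classes standing in for the real products:

    class Circle:
        def __init__(self, size):
            self.size = size

    class Square:
        def __init__(self, size):
            self.size = size

    def shape_factory(name, size):
        # The factory function replaces the AbstractFactory class hierarchy.
        # Classes are first-class objects; a simple mapping does the work.
        classes = {'circle': Circle, 'square': Square}
        return classes[name](size)

    shape = shape_factory('circle', 3)

No abstract base class, no subclass per product family. The core Factory idea, isolating the decision of which class to instantiate, survives intact.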

The Serious Issue

The serious issue that is surfaced by the email is Language Mapping. We cannot (and must not) try to map languages to each other. What is true for Java design is emphatically not true for Python design. And it doesn't apply to assembly languages, FORTRAN, FORTH, or COBOL.

Languages are different.

There. I said it.

If there were an underlying "universal deep structure" behind all programming languages, the surface features would be merely syntax, and we'd have automated translation among languages. The universal deep structure (the underlying Turing Machine that does computations) appears to be too abstract to map well among programming languages. Hence the lack of translators.

When switching among languages, it's important to leave all baggage behind.

When moving from Java < 8 to Java >= 8 (i.e., non-functional Java to more functional Java), we can't trivially map all design patterns among the language features. It's a new language with new features that happens to be compatible with the old language.

Attempting to trivially map concepts between non-functional (or strictly-OO Java) and more functional Java leads to dumb conclusions. Like the GoF patterns are dying. Or the GoF patterns represent damage or something else equally goofy.

The language changes lead to design pattern changes.

Language change doesn't deserve a gleeful/anguished blog post celebrating/lamenting the differences. It's a consequence of learning a new language, or new features of an existing language.

Please avoid mapping languages to each other.

Tuesday, June 21, 2016

Why Python? (Sad Follow-up)

In "Why Python?" I linked to a deep and sophisticated analysis of programming languages. Anyway, I thought it was a deep and sophisticated analysis.

I got a reply that shows how wrong I was. Here's the quote:
The point is that the Python ecosystem has a lot to offer. We could argue about the language design choices. However, why bother? Why not just take advantage of what the ecosystem has to offer.
Ah. Discussing the language is just "arguing". I guess the points are all debatable and my comparison of Python to any benchmark is just the seed for an argument. A religious war, perhaps. I guess this wasn't compelling. It was a "why bother?"

Why bother pointing out the strong points of the language?

The email emphasized the "ecosystem" with a cool, but short, example of how scipy.spatial.KDTree works.
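
The original example isn't reproduced here, but it was presumably something along these lines:

    import numpy as np
    from scipy.spatial import KDTree

    points = np.random.rand(100, 2)           # 100 random points in the unit square
    tree = KDTree(points)
    distance, index = tree.query([0.5, 0.5])  # nearest neighbor to the center
    print(points[index], distance)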

It appears that -- for some people -- "Python code actually works" is a useful response to "why python?" 

I would have thought that "Python code actually works" was a precondition to even discussing the value proposition behind Python.

But -- clearly -- I was wrong.  The mere fact of a working example is a Very Important Thing™.

What does this mean?
  1. There are people who use software that doesn't actually work. When they see software that works, it's important. Very important.
  2. When software actually works, these people find this simple fact to be a compelling and substantial argument for placing a high value on the software.
  3. Other considerations like clarity and simplicity aren't relevant. If these poor souls are suffering with software that doesn't actually work, then broken and obscure is no worse than broken and clear. The other parts of the long discussion from Wirth are just arguing points.
The email included "consider amending the why python? blog w/ the other big pro: ecosystem". I'm not sure I actually understand the request. When code that works is a "big pro", this comes from a world I can't pretend to understand.

Also. The example code used xrange(). Which is a Python 2 smell. Those days are past.

Tuesday, June 14, 2016

Continuous Data Migration

See http://slott-softwarearchitect.blogspot.com/2013/07/database-conversion-or-schema-migration.html

People talk about CI/CD (Continuous Integration/Continuous Deployment).

They also need to talk about CM (Continuous Migration).

"Wait, what?" you ask.

When we roll out a new version of the software (CD) there are three common situations.
  1. The new software uses the existing data model with no changes. This is a "minor version change": from v3.2 to v3.3.
  2. The new software requires a tweak to the schema, but it's backward compatible. This, too, is a minor version change. In a SQL context, we might have used an ALTER TABLE to add a nullable column. If there are no SELECT * statements in the code, this change is essentially transparent to legacy code.
  3. The new software involves a new schema that's not backwards compatible. This is a major version change. From v3.2 to v4.0. This is difficult. Really difficult.
Clearly, the first two can be done with the data in place. New software is installed, the servers are restarted, and away we go. In a big environment, there may be a rolling deployment. There may be a canary release that will get converted first, then others will be brought online.

A change of the Second Kind does involve the one-time database transformation script. This may lead to some down-time. Or it may lead to a feature toggle so that the new software can work with the old database until the script is run.
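
A sketch of such a one-time script, using SQLite and hypothetical table and column names:

    import sqlite3

    connection = sqlite3.connect('myapp.db')
    # Add a nullable column; legacy code that avoids SELECT * never notices.
    connection.execute('ALTER TABLE invoice ADD COLUMN discount REAL')
    connection.commit()
    connection.close()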

In a NoSQL context, a change of the Second Kind doesn't require the one-time script. The new documents have new fields that old documents don't have. NoSQL apps -- in general -- must be able to cope with data model variations.
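
Coping usually amounts to defaulting the missing fields. A sketch, with a hypothetical "discount" field added by the new schema:

    def unit_price(document):
        # Old documents lack "discount"; default it rather than crash.
        discount = document.get('discount', 0)
        return document['price'] * (1 - discount)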

A change of the Third Kind is trouble.

Big trouble.

We have two schemas: the v3 schema and the v4 schema. We have two sets of software: the v3.2 release and the v4.0 release. We'd like to have just one valid set of data. How do we deal with this?

How can we do schema migration badly?

We can't easily have a single software release that includes one set of data in both schemas. It's technically possible. Anything that doesn't involve time travel, anti-gravity, or perpetual motion is technically possible. But it rapidly becomes so complex that we have to set this uber-version idea aside.

We have to do more deployment work to have both v3.2 and v4.0 installed in parallel. v3.2 will use data in the v3 schema, v4.0 will use data in the new schema.

How do we migrate the data from the old schema to the new schema?

This can be tricky. There are proven bad ideas out there. Really epically bad ideas.

One Very Bad Idea (VBI™) is the one-time-only data migration. Back in the olden days, we couldn't afford enough storage to have two copies of the database. Seriously. When a company owned exactly one computer (before PC's -- a Very Long Time Ago) the conversion had to be done by making special backups and restoring the backups into the new schema.

This VBI is still with us today. Lots of places want to do one-time-only data migrations because it's the traditional approach. If they can't do a one-time conversion (over a long weekend), they complain. Loudly.

BTW. This never worked well. The one-time-only conversion software was never tested carefully, and therefore rarely worked the one time it was needed. Also, data profiling was never done, so edge and corner cases were found during conversion. These often called the new software's features into question, leading to larger and larger problems.

Continuous Migration

The ideas behind continuous migration are these.
  1. We're always going to be migrating the data. Always.
  2. Storage is cheaper than labor. When in doubt, buy more storage.
  3. Data outside the database (in CSV files or YAML documents) is smaller than data inside the database. Don't be afraid to export.
  4. Data outside the database is inaccessible. Be cautious of the implied down-time during exports and imports.
  5. ABP. Always Be Profiling. If you don't have a data profiler in place right now, that's the first thing to build. There are schema definition tools and schema checking tools. Look at JSON-Schema.org. Write schema definitions and use a data profiler to examine all rows and check all rules. All. Seriously. All. In a SQL DB, actually check the foreign keys to be sure the referenced row exists; you'll be surprised. (A sketch appears after this list.)
  6. We're moving forward. We're not milling around; we're not supporting the old version except for the purposes of a parallel test or a fall-back in the event the next version doesn't work. There's no long-term coexistence strategy. Preserve the data; upgrade the software.
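
A profiling sketch using the jsonschema package, one of several schema-checking tools. The schema and the row source are hypothetical:

    from jsonschema import Draft4Validator

    schema = {
        'type': 'object',
        'properties': {
            'id': {'type': 'integer'},
            'email': {'type': 'string'},
        },
        'required': ['id', 'email'],
    }
    validator = Draft4Validator(schema)

    def profile(rows):
        # Check all rows. All. Report every violation instead of stopping.
        for number, row in enumerate(rows):
            for error in validator.iter_errors(row):
                print(number, error.message)
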
Here's the central data migration requirement:

Be able to migrate to the new schema as many times as needed.

I'll repeat that. As. Many. Times. As. Needed.

Migration is not a one-time thing. You do it all the time.
  1. Migrating (and possibly sanitizing or subsetting) production data into the development environment.
  2. Migrating production data for QA testing.
  3. Migrating production data for integration testing.
  4. Migrating production data for performance testing.
  5. Migrating production data for the production upgrade.
These are all the same activity. 

I'll repeat that. The. Same. Activity. Sometimes with mappings. Sometimes with filters. 

Since you'll do many, many migrations, your data migration programming is as important as your application programming. Perhaps more important than the application code because it's what preserves the data, and the data is the only thing of value. Applications come and go. Data is forever.

Having real data available permits seamless, silent, and automatic parallel testing. We can easily do a parallel test with v3.2 and v4.0 release candidates by simply running the migration (or migration with subset filter) to gather some data for the parallel test. If the release candidate has problems, we can fix v4.0 to create the next release candidate, re-migrate the data, and try the parallel test again.

At some point the v4.0 release is final, and we need to migrate all of the data. This (usually) involves some feature toggles to put v3.2 into a special "end-of-life" mode where the keys for records which change are logged separately. After turning off v3.2 and turning on v4.0, a second phase of migration will process these end-of-life rows through the migration mill.
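
A sketch of that two-phase flow; all of the names are hypothetical, and migrate_record() is the v3-to-v4 transformation (not shown):

    def bulk_migrate(source, target):
        # Phase one: migrate everything while v3.2 is still live.
        for key, record in source.all_records():
            target.store(key, migrate_record(record))

    def delta_migrate(source, target, changed_keys):
        # Phase two: re-migrate only the keys v3.2 logged in end-of-life mode.
        for key in changed_keys:
            target.store(key, migrate_record(source.fetch(key)))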

Software and Schema Design Consequences

This has an important consequence.

Your software must be explicitly bound to a specific schema by major version number.

Explicitly bound. In a SQL context, you can use the "schema" construct and include the version number in the schema name: "myapp_v3" vs. "myapp_v4". This becomes a ubiquitous qualifier on all table names. SELECT col FROM myapp_v4.some_table AS st.

Yes. Do this Everywhere. Do it Now. 

If you're using mybatis or SQLAlchemy to get the SQL out of your application, then this kind of thing is a trivial change. If you have SQL in your application code, well, you have two problems to solve. First, get the SQL out of your application. Then make the schema version explicit.
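
A sketch of what "trivial" looks like with SQLAlchemy (the 1.x-era API; the connection string and table are hypothetical):

    from sqlalchemy import MetaData, Table, Column, Integer, String
    from sqlalchemy import create_engine, select

    metadata = MetaData(schema='myapp_v4')  # the major version lives in the schema name
    some_table = Table('some_table', metadata,
                       Column('id', Integer, primary_key=True),
                       Column('col', String))

    engine = create_engine('postgresql://localhost/mydb')
    with engine.connect() as connection:
        for row in connection.execute(select([some_table.c.col])):
            print(row.col)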

In a NoSQL context, you can include the schema version as part of a collection name. "collection_v3" or "collection_v4".

This should be present everywhere.
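
The same idea with pymongo; the database and collection names are hypothetical:

    from pymongo import MongoClient

    SCHEMA_VERSION = 4
    client = MongoClient()
    db = client.myapp
    # The major version is explicit in every collection name.
    orders = db['orders_v{0}'.format(SCHEMA_VERSION)]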

Then, you'll need data validation apps and data migration apps. The validation apps will use your favorite schema definition and schema validation framework. Start running this as soon as you think you might need to make a major version change.

Finally, you'll need the data migration tool set. This will involve filter rules and sanitizing rules. These are not sophisticated "rules engine" kind of things with unbounded complexity. They're usually if statements and simple computations. But they come and go pretty freely, so design the software in a way that makes the filter and sanitizing code obvious.
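
A sketch of keeping those rules obvious; the rules themselves are hypothetical:

    def keep(record):
        # Filter rule: plain if-statement logic, not a rules engine.
        return record.get('status') != 'purged'

    def sanitize(record):
        # Sanitizing rule: mask fields that must not leave production.
        record = dict(record)
        record['email'] = 'user@example.com'
        return record

    def migration_pipeline(source_rows):
        return (sanitize(row) for row in source_rows if keep(row))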

Now you can -- trivially -- migrate data between schema versions inside the same database. You can have v3.2 and v4.0 running side-by-side. You can migrate the data early and often. You can profile and validate the data. You have a formal schema for the data validation. 

Tuesday, June 7, 2016

Why rewrite a shell script in Python?

Here's the actual quote:

Why would you need to rewrite a working script in python ? Was there any business direction towards this ?

This was an unexpected response. And unwelcome. I guess I called their baby ugly.

The short answer is that the shell script language is perhaps one of the worst programming languages ever invented. Okay, I suppose it's better than Whitespace. Okay, it's better than many others. See https://en.wikipedia.org/wiki/Esoteric_programming_language

The longer answer is this:
  • There are (at least) three ksh scripts involved, two of which are over 1,000 lines long. It's not perfectly clear precisely what's involved. It's ksh. Code could come from a variety of places through very obscure paths; e.g., the source command and its synonym, ".".
  • There are no comments other than #!/usr/bin/ksh and a few places where code is commented out.  Without explanation.
  • There is no other documentation. The author had sent an email describing the GitHub repo. The repo lacked a README. It took two tries to get them to understand that any email describing a repo should have been the README in the repo. There is barely even a command-line synopsis. (Eventually, I found it in the parser for command-line options.)
  • No tests of any kind.
The last point is the one that I find shocking. And I find it shocking on a regular basis.

Folks are able and willing to write 1,000's of lines of shell script without a single unit test, integration test, system test, performance test, anything test. How do they know if this works? Why am I supposed to trust it?

More importantly, how can I meaningfully wrap this into a RESTful API if I'm not even sure what the command-line interface really is? It's the shell. It could use environment variables that are otherwise undocumented. They would be discovered when they cause a crash at run time. Crashes that become an HTTP 500 status code and a traceback error message in the web log.

The "business direction" sounds like an attempt to trump the technical discussion with a business consideration like "cost" or "benefit". It should be pretty self-evident that 1,000's of lines of shell script code is technical debt.

The minimally viable replacement will probably be a similarly-sized Python script that mindlessly mirrors the original shell script. It's sometimes quite hard to tell what purpose a shell function really serves. The endless use (and re-use) of global variables can make state change hard to follow. Also, the use of temporary files which are parsed (and re-parsed) as a way to set state is a serious problem.

What's important is that the various OS services used by the shell script are mockable. Which means that each of the 100 or so individual functions within the script can be tested in isolation. Once that's out of the way, refactoring becomes plausible.

Let's savor those words for a moment: Tested. In. Isolation.

Ahhh.
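
A sketch of what that isolation looks like with unittest.mock; run_backup() is a hypothetical stand-in for one of the script's hundred-odd functions:

    import subprocess
    import unittest
    from unittest import mock

    def run_backup(path):
        # A function that wraps an OS service, as the shell script did.
        return subprocess.check_output(['tar', 'czf', 'backup.tgz', path])

    class TestRunBackup(unittest.TestCase):
        def test_invokes_tar(self):
            # The OS service is mocked; the function is tested in isolation.
            with mock.patch('subprocess.check_output') as check_output:
                run_backup('/data')
                check_output.assert_called_once_with(
                    ['tar', 'czf', 'backup.tgz', '/data'])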

The better replacement is (I think) about 250 lines of Python -- perhaps less -- that perform the real 8-step process that we're automating.  Getting rid of bash language cruft is challenging, but essential.

Tuesday, May 10, 2016

Why Python? What's it good for? How is it special?

First. The question is moot. It's a programming language. It's good for programming.

When I push back, folks try to cite languages which exist only in certain pigeonholes.

"You know. PHP is for web and JavaScript runs in the browser. What's Python for?"

The PHP and JavaScript examples aren't helpful. That doesn't narrow the domain of problems for which Python is appropriate. It only shows that some languages have narrow domains.

"You know. Objective-C and Swift are for iOS. What's the predominant place Python is used?"

Python also runs on iOS. I don't know if it has suitable bindings for building apps. If it does, that doesn't change my answer. It's good for programming.

"Java is used mainly for web apps, right? What about Python?"

Okay. At this point, the question has slipped from moot to ignorant.

Can we just set that aside? Can we move on?

If you want some useful insight, start here:

http://web.eecs.umich.edu/~bchandra/courses/papers/Wirth_Design.pdf

Yes, it's an essay from 1974.  Parts of it are a little old-fashioned, but a lot of it is still rock-solid. For example, the idea of strongly-typed pointers is considered more-or-less standard now. It was debatable then. And Wirth's opinion continues to drive language design.

Page 28 has the key points: features of a programming language. Enumerated by the inventor of Pascal, Modula, Oberon, and other languages too numerous to recall.

Some of the list is a little dated. "...different character sets...," for example, has been superseded by Unicode.

Also, the list is focused on compiled languages. Python is a dynamic language. It's interpreted. Yes, there's a compiler, but that's mostly an optimization of the source code. If you replace "compiler" with "run-time", the list stands up as a description of good languages.

I like this list because it helps characterize why Python works out so well. And why many other languages are also pretty good. It points up the reason why quirky languages like JavaScript (or even Ruby) are suspicious. Some of the points about efficiency are important topics for further discussion.

I often have to remind folks who work with Big Data that most of our processing is I/O bound. Python waits for the database somewhat more efficiently than Java. Why does Python wait more efficiently? Because it uses less memory. Sometimes this is a win.

Let's not ask silly questions about a general-purpose language. Instead, let's benchmark solutions, and compare tangible performance numbers using real code.

Tuesday, May 3, 2016

The Lynda.com Experience

One word: "wow"

More words: "Helping shy people get up and do what needs to be done."

Yes, that's Garrison Keillor's tag line for one of the "sponsors" of "A Prairie Home Companion": the Powdermilk Biscuits company. (Heavens, they're tasty and expeditious.)

The folks at Lynda are truly great at shepherding folks through the process of preparing and recording their material.

Recording is hard. The point is to say each thing perfectly. But the things have to fit into a larger narrative of a section that fits into the larger sequence of chapters that makes up the course.

Giving essentially the same content in a presentation at a conference is almost unrelated. Talking at a conference has a live audience. It's one-time-only, and you can ad-lib.

Doing this takes patience. And skilled editing both at a content level and at a technical level. Lynda has it all.

The thing that made me the most comfortable was having my presentation material ready. Each section is a 5-minute lightning talk. I had all of my slides ready. I'd been through them enough times to be sure that I could handle the 5-minute format. And when there were editorial changes, they tended to be relatively minor.

I may try it again. It's a lot of work. Certainly more work than writing a chapter in a book. A chapter can go deep. A presentation has to stick to the high points: this means that the supporting depth must be there, but you're not going to wallow around in it. Essentially, you're making the "elevator pitch" for each one of your points.

The recording and live action studio space were fun. I've never been recorded or taped like that before. They eased me into it, coached me through it, and made sure all of the content was there in a way that could be edited into a high quality final product.