Tuesday, December 20, 2016

The Royal Road

Warning: Long Boring Anecdote: Conclusions will be drawn from a single example.

First, this quote:
"All your suggestions were great if I had wanted to do a systematic study and truly understand. The goal is to understand and learn as little as possible to be able to undertake the code challenge for [a specific job opportunity]."
The subject of systematic study is Python. The focus of learn as little as possible is Pandas.

The goal is more-or-less impossible. Focus on a specific code challenge will devolve into other aspects of the language. Or. It will lead nowhere.

Also. I'm not sure what a "systematic study" is. From the omitted back-story, I'm seems clear to me that the advice to "read a tutorial" is restated here as "systematic study." And is unacceptable. I guess because of time constraints.

It gets better.

There's this:
"For now, I am just going to follow the pattern and get stuff to work."
This is the ideal way to be defeated by technology. The "pattern" is defined to a large degree by the programming language. The two are inextricably linked. Trying to identify a pattern in programming that's magically not part of the implementation language seems deranged.

Learning Pandas isn't simple. There's no royal road.

The real crux are several questions that are difficult to reproduce, so I'm forced to summarize.

  • What does name.name().name().name() mean? How can you call multiple methods "simultaneously"?
  • What is the range of values for some parameter? 

And the capper.

  • I understand object.method().method().  I cannot understand object.name -- how can a method have no ()'s?
I'm sorry, but, the advice still stands. These are not questions that can be answered in a vacuum. This is serious -- and foundational -- object-oriented programming. Each of these small things was a total show-stopper, leading to four emails merely to clarify the question. Then more when the answer was rejected as not consistent with something, or astonishing, or 

For the name.name().name(), the answer "Google Fluent Interface" was too complex. Code examples were requested to show how it was possible for an object to return another object that had methods. 

Advice to use Python's >>> prompt and the dir() function were apparently part of "systematic study" and not part of "learn as little as possible."

For the range of values of a parameter, the answer "read the source" was a non-starter. Reading the source was flat-out rejected as not making any sense. I finally had to actually provide the link to the source repo for Pandas before it became clear what "open source" even meant. This was found to be nothing short of astonishing. The side-bar conversation on "how is this even possible?" was confusing to me because -- sadly -- I assumed people knew that the words "open" and "source" together meant that the source was open for inspection. My assumption was wrong. At least one person did not know this. That means there are others.

Finally. For the "capper" question. The exchange really did include this: "The dot notation I thought was Object.Method.Method."

A great deal of the back-and-forth amounted to "I reject anything other than fluent methods because that's the only thing I've decided to understand." Words like "property" and "attribute" were ignored as noise, AFAIK.

I say "amounted to" because a lot of the back-and-forth was restating the question. Other parts were exchanging links to the Pandas documentation in an effort to follow the "learn as little as possible" strategy. Any link to a tutorial would be "systematic study". Any link to Pandas, however, was acceptable. But (of course) confusing because the Pandas documentation assumes a modicum of language knowledge.

Here's what appears to be the problem: it's impossible to learn a complex tool like Pandas without starting with a basic understanding of Python. 

I don't think I've ever seen it suggested that one can leap into a package without knowing the language. I'm not sure how one can develop the idea of learning as little as possible in the first place. But, there it is. 

There's at least one person who thinks they can learn as little as possible and still get Pandas code to work. That likely means there are others.

It appears there's a market for books like 

Learn as little Python as possible to be able to use Pandas

and

Learn the least Java necessary to make Spark work

and

The Royal Road to Data Science

I'm not sure I'm capable of writing books like these, but for someone who does, it might be a really lucrative line of books.

Tuesday, December 13, 2016

Amazing how Windows is “special.”

Here's the quote:
...it is amazing how Windows is “special.” Back when ..., special things had to be done for Windows. Python continues the tradition w/ an entire section in its doco titled “3. Using Python on Windows”
I wasn't sure what to make of this.

It appeared that they want Windows to be the norm and Linux/Mac OS/POSIX to be treated as an exception.

I find that baffling. POSIX is a standard. Mac OS X is POSIX-compliant. Linux distros are generally POSIX-compliant. There's a nice list: https://en.wikipedia.org/wiki/POSIX

Windows is not POSIX-compliant. It seems to me that non-standard == special shouldn't be "amazing." It should be "tiresome" or "annoying."

Windows tools are uniquely awful. Or, perhaps, "special". Windows is so "special" that the IDE concept appears to have evolved as a solution to the awfulness. Using a Windows IDE (like Visual Studio) insulated one from the vagaries of Windows. It appears this is particularly important when trying to create a binary that will work in multiple incarnations of Windows.

In Linux (and POSIX-compliant OS's in general) the OS is the IDE. Start here: https://sanctum.geek.nz/arabesque/unix-as-ide-introduction/.  This seems so much simpler and more rational. Perhaps I'm just biased because I've used so many OS's that aren't Windows.

Worth considering: http://www.psychocats.net/ubuntu/virtualbox

When asked about IDE's for Python, I tell people that I've used a number of text editors to write Python code:
  • vi
  • BBEdit
  • Atom
  • Komodo Edit
  • Notepad++
  • PyCharm
  • IDLE
  • Jupyter Notebook
They all work nicely. It's difficult to recommend one because they all have distinct features. I always wind up with a lot of command-line interaction. The "run-a-command-from-the-IDE" has complex dialog boxes and sometimes confusing limitations. It's easier to simply write a script than discern the nuances of the IDE configuration rules.

These are (mostly) platform-independent. They can minimize a few of the Windows "features." They don't eliminate all of the Windows issues.

In all cases -- except using IDLE -- I also have a Python `>>>` prompt open in a terminal window.  

I strongly encourage everyone to work this way. The terminal window interaction can be copied and pasted into doctest strings. You've written a unit test without really trying. It's extremely productive. It gets away from IDE wrappers. It does expose some Windows-isms, but as long as you can limit the number of times you find Windows "amazing," that's not a problem.

Tuesday, November 29, 2016

A Reason for Avoiding Programming

From someone in the process of becoming a data scientist. They had a question on regular expressions, which made almost no sense. It appears that the core concepts of ETL -- Extracting source data, Transforming it into a useful form and the Loading into some persistent storage for long-term analysis -- had not been embraced. It appears the design pattern was unknown. All I could gather from the sketchy email chain was that something involving regular expressions had become difficult.

I wrote this in response: Handling Irregular File Formats.

Here's part of the follow-up.

"I have been focusing on the math associated w/ math optimization. I have been using spreadsheets to perform the computations."

Really.

Spreadsheets.

The ETL pipeline question/rant/complaint was part of loading a spreadsheet?

That seems somehow wrong. There are real tools available that really do real data science work. The word "optimization" hints that scipy.optimize might be a more useful exercise than hacking around with spreadsheets.

Perhaps some advice from a real data scientist might help: http://www.becomingadatascientist.com

Tuesday, November 22, 2016

The Modern Python Cookbook

See https://www.packtpub.com/application-development/modern-python-cookbook

This is a large (!) collection of recipes, focused on Python 3, exclusively.

It's much easier to write about the version of Python I actually use each day, and leave the old, quirky, slow Python 2 behind. This book doesn't have any "this will be different in Python 2" warnings. Those days seem to have passed. Finally.

The clock is counting down. https://pythonclock.org


Wednesday, November 16, 2016

Tuesday, November 1, 2016

Handling Irregular File Formats

This is a common issue. We have a file which was printed for human consumption. Consequently, it has many different kinds of lines.

These are the two kinds of lines of interest:

900296268 4/9/16 Mobility, Data Mining and Privacy Expired

900295204 4/1/16 Pro .NET Best Practices
Expired

The first is a single physical line.  It has four data elements. The second is two physical lines. The first has three data elements.

There are a number of other noise lines in the file which must be filtered out.

The first "solution" pitched to me could be summarized with this:

Move "Expired" on a line by itself to the previous line

That was part of the email subject line. The body of the email was some whining about regular expressions. Which I mostly ignored. Multiline regular expressions are their own kind of challenge.

We (should) all know this: https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/

Let's do this without regular expressions. There are two things we need to know. One is buffering, and the other is the best way to split each line. It turns out that there are spaces as well as tabs, and can can, by splitting on tabs, make a lot of progress.

Instead of the good approach, I'll pick the other approach that doesn't involve splitting on tabs.

Here's the simulated file, with data lightly redacted.

sample_text = '''
"Your eBooks"

Show 200



Page: 1



Order # Date Title Formats Status Download
-------
xxx315605 9/30/16 R for Cloud Computing Available



xxx304790 6/21/16 Java XML and JSON Available
xxx304790 6/21/16 Accelerated DOM Scripting with Ajax, APIs, and Libraries Available

xxx291633 2/28/16 Practical Google Analytics and Google Tag Manager for Developers
Expired
'''

It's not perfectly obvious (because of line wrapping) but there are three examples of the "all-complete-in-one-line" records. There's one example of the "two-lines" record.

Rather than mess with the file, we'll build a file-like object with our sample data.

import io
file_like_object = io.StringIO(sample_text)

I like this because it lets me write proper unit test cases.

The file has four kinds of lines:

  • Complete Records
  • Record Headers (without Available/Expired)
  • Record Trailers (only Available/Expired)
  • Noise

We'll create some decision rules for the two obvious kinds of file lines: complete records and trailers. We can deduce the headers based on a simple adjacency rule: they precede a trailer. The fourth kind of lines are those which are possible headers but are not immediately prior to a trailer.

def complete(words):
    return len(words) > 3 and words[-1] in ('Available', 'Expired')

def trailer(words):
    return len(words) == 1 and words[0] in ('Available', 'Expired')    

We can spot these two kinds of lines easily. The other kinds require a Buffered Generator.

def emit_clean(source):
    header = None
    for line in (line.strip() for line in source):
        words = [w.strip() for w in line.split()]
        if len(words) == 0: continue
        if complete(words):
            yield(line)
            header = None
        elif trailer(words) and header:
            yield(header + '\t\t' + line)
            header = None
        else:
            # Possible header
            # print('??', line)
            header = line

The Buffered Generator is a way to implement a "look ahead one item" (LA1) algorithm. We do this by buffering rows. When we get to the next row we can use the buffered row and the current row to implement the look-ahead logic.

The actual implementation uses a look-behind buffer, header.

The (line.strip() for line in source) generator expression strips away leading and trailing spaces. This gets rid of the newline characters at the end of each input line.

The default behavior of split() is to split on whitespace. In this case, it will create a number of words for complete records or header records, and a single word for a trailer record. If we had split on tab characters, some of this logic would be simplified.

That's left as an exercise for the reader.

If the len(words) is zero, the line is blank.

If the line matches the complete() function, we can yield it as one of the iterable results of the generator function. We also clear out the look-behind buffer, header.

If the line is a trailer and we have a buffered look-behind line, this is the two-physical-line case. We can assemble a complete record and emit it.

Otherwise, we don't know what the line is. It's a possible header line, so we'll save it for later examination.

This algorithm involves no regular expressions.

With Regular Expressions


An alternative would use three regular expressions to match the three kinds of lines.

import re
all_one_pat = re.compile("(.*)\t(.*)\t(.*)\t\t((?:Available)|(?:Expired))")
header_pat = re.compile("(.*)\t(.*)\t(.*)")
trailer_pat = re.compile("((?:Available)|(?:Expired))")

This has the advantage that we can then use the groups() method of each successful match to emit useful data instead of text which needs subsequent parsing. This leads to a slightly more robust process.

def emit_clean2(source):
    header = None
    for line in (line.strip() for line in source):
        if len(line) == 0: continue
        all_one_match = all_one_pat.match(line)
        header_match = header_pat.match(line)
        trailer_match = trailer_pat.match(line)
        if all_one_match:
            yield(all_one_match.groups())
            header = None
        elif header_match and not header:
            header = header_match.groups()
        elif trailer_match and header:
            yield header + trailer_match.groups()
            header = None
        else:
            pass # noise

The essential processing involves seeing which of the regular expressions match the line at hand. If it's all-in-one, this is good. We can yield the groups of meaningful data. If it's a header, we can save the groups. If it's a trailer, we can combine header and trailer groups and yield the composite.

This has the advantage of explicitly rejecting noise lines instead of treating each noise line as a possible header.

Handling Irregular File Formats

This is a common issue. We have a file which was printed for human consumption. Consequently, it has many different kinds of lines.

These are the two kinds of lines of interest:

900296268 4/9/16 Mobility, Data Mining and Privacy Expired

900295204 4/1/16 Pro .NET Best Practices
Expired

The first is a single physical line.  It has four data elements. The second is two physical lines. The first has three data elements.

There are a number of other noise lines in the file which must be filtered out.

The first "solution" pitched to me could be summarized with this:

Move "Expired" on a line by itself to the previous line

That was part of the email subject line. The body of the email was some whining about regular expressions. Which I mostly ignored. Multiline regular expressions are their own kind of challenge.

We (should) all know this: https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/

Let's do this without regular expressions. There are two things we need to know. One is buffering, and the other is the best way to split each line. It turns out that there are spaces as well as tabs, and can can, by splitting on tabs, make a lot of progress.

Instead of the good approach, I'll pick the other approach that doesn't involve splitting on tabs.

Here's the simulated file, with data lightly redacted.

sample_text = '''
"Your eBooks"

Show 200



Page: 1



Order # Date Title Formats Status Download
-------
xxx315605 9/30/16 R for Cloud Computing Available



xxx304790 6/21/16 Java XML and JSON Available
xxx304790 6/21/16 Accelerated DOM Scripting with Ajax, APIs, and Libraries Available

xxx291633 2/28/16 Practical Google Analytics and Google Tag Manager for Developers
Expired
'''

It's not perfectly obvious (because of line wrapping) but there are three examples of the "all-complete-in-one-line" records. There's one example of the "two-lines" record.

Rather than mess with the file, we'll build a file-like object with our sample data.

import io
file_like_object = io.StringIO(sample_text)

I like this because it lets me write proper unit test cases.

The file has four kinds of lines:

  • Complete Records
  • Record Headers (without Available/Expired)
  • Record Trailers (only Available/Expired)
  • Noise

We'll create some decision rules for the two obvious kinds of file lines: complete records and trailers. We can deduce the headers based on a simple adjacency rule: they precede a trailer. The fourth kind of lines are those which are possible headers but are not immediately prior to a trailer.

def complete(words):
    return len(words) > 3 and words[-1] in ('Available', 'Expired')

def trailer(words):
    return len(words) == 1 and words[0] in ('Available', 'Expired')    

We can spot these two kinds of lines easily. The other kinds require a Buffered Generator.

def emit_clean(source):
    header = None
    for line in (line.strip() for line in source):
        words = [w.strip() for w in line.split()]
        if len(words) == 0: continue
        if complete(words):
            yield(line)
            header = None
        elif trailer(words) and header:
            yield(header + '\t\t' + line)
            header = None
        else:
            # Possible header
            # print('??', line)
            header = line

The Buffered Generator is a way to implement a "look ahead one item" (LA1) algorithm. We do this by buffering rows. When we get to the next row we can use the buffered row and the current row to implement the look-ahead logic.

The actual implementation uses a look-behind buffer, header.

The (line.strip() for line in source) generator expression strips away leading and trailing spaces. This gets rid of the newline characters at the end of each input line.

The default behavior of split() is to split on whitespace. In this case, it will create a number of words for complete records or header records, and a single word for a trailer record. If we had split on tab characters, some of this logic would be simplified.

That's left as an exercise for the reader.

If the len(words) is zero, the line is blank.

If the line matches the complete() function, we can yield it as one of the iterable results of the generator function. We also clear out the look-behind buffer, header.

If the line is a trailer and we have a buffered look-behind line, this is the two-physical-line case. We can assemble a complete record and emit it.

Otherwise, we don't know what the line is. It's a possible header line, so we'll save it for later examination.

This algorithm involves no regular expressions.

With Regular Expressions


An alternative would use three regular expressions to match the three kinds of lines.

import re
all_one_pat = re.compile("(.*)\t(.*)\t(.*)\t\t((?:Available)|(?:Expired))")
header_pat = re.compile("(.*)\t(.*)\t(.*)")
trailer_pat = re.compile("((?:Available)|(?:Expired))")

This has the advantage that we can then use the groups() method of each successful match to emit useful data instead of text which needs subsequent parsing. This leads to a slightly more robust process.

def emit_clean2(source):
    header = None
    for line in (line.strip() for line in source):
        if len(line) == 0: continue
        all_one_match = all_one_pat.match(line)
        header_match = header_pat.match(line)
        trailer_match = trailer_pat.match(line)
        if all_one_match:
            yield(all_one_match.groups())
            header = None
        elif header_match and not header:
            header = header_match.groups()
        elif trailer_match and header:
            yield header + trailer_match.groups()
            header = None
        else:
            pass # noise

The essential processing involves seeing which of the regular expressions match the line at hand. If it's all-in-one, this is good. We can yield the groups of meaningful data. If it's a header, we can save the groups. If it's a trailer, we can combine header and trailer groups and yield the composite.

This has the advantage of explicitly rejecting noise lines instead of treating each noise line as a possible header.

Tuesday, October 25, 2016

Speakers advice

First. Read this: http://webapplog.com/10-conf-donts/

Some additional thoughts on the don't list.

  1. Avoid reading to your audience unless you are a poet, journalist, judge or politician. Poets and journalists are paid to write well and read there words well. Judges and politicians are paid to be ultra precise. 
  2. Avoid Type and Talk unless it is a Code Dojo presentation where the typing is essential. I've seen too many bad type-and-talk where the lack of organization made it nearly impossible to figure out what was going on.
  3. Avoid GIFs and clever graphics. 
  4. Avoid insulting people. Don't alienate your audience. If you can't be completely 100% inclusive of every single human being in the room, don't speak in public.
  5. Avoid sitting if you are able to stand. If you must sit, please try to sit where folks can see you. This can't always work. Someone able to stand who chooses to sit is doing themselves a disservice. A singer or vocal coach will tell you that your standing posture helps you breathe properly and project properly. If you are able to stand, please stand.
  6. Avoid nervous behaviors. Avoid drawing attention to yourself, and draw attention to your material  Fear (or nervousness) is hard to avoid. It's important to focus on the audience and their curiosity about your talk. They were intrigued by the title. They want to hear you..
  7. Avoid apologies. Apologize if you offend someone, of course. But don't "pre-apologize" for some irrelevant aspect of your presentation. Your audience came for the content, not for apologies.
  8. Avoid too much sales pitch. I've sat through too many product demos that had a half-hour sales pitch that left only a half-hour for the actual useful information. This has happened even when I told vendors -- explicitly -- not to provide any sales information during the product demo.
  9. Avoid too much personal background. A complete recitation of your CV isn't interesting and brushes up against an Argument from Authority fallacy.
  10. Avoid dressing badly.
A list of things to do.
  1. Speak with passion about your topic. Your slides are your road-map through the agenda. A few key points and reminders are all you should have.
  2. Speak to the people listening. Canned code examples are good, if they emphasize your point. Copy and paste into an IDE if you are demonstrating the IDE.
  3. Focus on the material, not other irrelevant cleverness.
  4. Focus on the audience as people interested in your topic.
  5. To project your voice -- and your presence -- you need to be visible. Stand if you can. Try to be as visible as possible.
  6. Focus on your audience's need to hear your material. It's not about you, it's about your content.  
  7. Focus on the good, useful, informative information you're providing.
  8. Present outstanding content first. Sales are merely a hoped-for consequence of a good presentation. 
  9. Your content should stand on it's own. You only need a brief summary of your qualifications. 
  10. Project your presence. Dress so that you can be seen without being distracting.

Tuesday, October 18, 2016

Optimizing complex generator expressions [Updated]

See this: https://twitter.com/jakevdp/status/786920174595158018

The core expression is similar to this

y = (f(x) for x in L if f(x) is not None)

There are a lot of variations on the filter. The point is that the function appears twice in the above expression.

We have a number of alternatives.

  • y = filter(None, f(x) for x in L)
  • y = filter(None, map(f, L))
  • y = (x for x in map(f, L) if x)
  • y = (x for x in (f(y) for y in L) if x is not None)
  • y = (val for x in L for val in (f(x),) if val is not None)


My preference is two steps, even though I don't really have a good reason for this.

y1 = (f(x) for x in L)
y2 = (f for f in y1 if f)


The thread leads to this path: https://twitter.com/TomAugspurger/status/786922167522828289  and the idea of "Let Bindings." We could extend the language slightly to bind a variable within the confines of the generator expression.

Like this:

y = (f(x) as val for x in L if val is not None)

The as clause binds the f(x) to val so that it can be used in the if clause.

Summary: Interesting.

Tuesday, October 4, 2016

Alternatives to PowerPoint (or Keynote) for Presentations

This: https://opensource.com/business/16/9/alternatives-powerpoint

Missing from the list? The S5-based slide-show tools that are part of docutils.

The only issue with S5 is that you need to carefully review each and every page to be sure you material fits. There's no autosizing of the fonts, or other trickery to pack too much trash onto a slide.

TIL that Lync/Skype doesn't politely handle Keynote on a Mac. You must mirror displays because Lync can only share one of your displays, and it's not the one Keynote chooses to display on.

I often use Keynote because it's expedient. I sometimes use S5 to show off an entirely open-source toolchain.





Tuesday, September 27, 2016

Database Schema Migration

Some thoughts: http://workingwithdevs.com/delivering-databases-migrations-vs-state/

This covers a lot of ground on the Declarative vs. Procedural question. It explains a lot of the considerations that lead to choosing a procedural schema evolution vs. a declarative schema with an implied change sequence to migrate to each new declared state.

The article calls the declarative "state-based" and procedural approach "migration-based".

My 2¢ are focused on this point: 
When using a state-based solution you will most often be using a diff tool like those provided by Redgate or Visual Studio to examine the differences and generate an upgrade script. While this is a very efficient solution for most changes, with table renames and a few other types of table refactoring they can do bad things, ...
This point about table refactoring is, for me, the show-stopper. Relational theory tells me that I can map any schema to any other schema using selection, projection, and join. I can denormalize data and I can normalize again via group-by clauses. I can reduce the original schema to a sequence of object-attribute-value triples, and restructure this into any desired new schema. 

Given enough time, a change tracking tool should be able to find a minimal-cost transformation from schema to schema. This might involve a complex search over a large state space, and it certainly involves creating costs for each alternative query plan. 

Pragmatically, I'm not sold on this being a good idea. And I'm rarely sure I even want to get involved in a fully automated solution. While a tool might be able to detect and automate a variety of simple changes, I think that developers must always vet those change scripts.

In particular, the search space is emphatically not limited to select, project, and join. There are also database unload-reload, index create and drop. There are even more complex operations like creating intermediate results which aren't part of the final database structure. With proper indices, these might actually be beneficial.

In some cases, the continuous operation requirements are such that we might have two copies of a database: one being used and the other being transformed. A logger tracks transactions in the older copy and a synchronizer replicates those transactions in the new copy. After the data is moved, the customer access is moved via a feature toggle from the old database to the new database.

Semantic Drift

Also important is the issue of semantic drift. When we're making structural changes where the "before" column names match the "after" column names, then there's little chance for semantic drift. There's still some possibility, though. We can (and sometimes do) repurpose columns, preserving the original name. In some cases, we might change a database constraint without renaming the column.

In the larger case, of course, it doesn't require "‘hot-fix’ changes to QA or even production databases" to create profound semantic changes. All it takes is an app developer deciding that a column should be repurposed. There's may be no structural change on the schema overall. 

A non-structural change in some past release could have implications for structural change in a future release. Imagine three columns in three tables with the same names. Two started out life as simple foreign keys to the third. But one became optional, and now the semantics don't match but the names do. Automated tools are unlikely to discern the intent here. 

Conclusion?

It's all procedural migration. I'm not declarative ("state") tools can be trusted beyond discerning the changes and suggesting a possible migration.

Wednesday, September 21, 2016

Bad Trends and Sloppy Velocity

Read this: https://www.linkedin.com/pulse/story-points-evil-brad-black-in-the-market-

There are good quotes from Ron Jeffries on the worthlessness of story points. (I've heard this from other Agile Consultants, also.) Story Points are a political hack to make management stop measuring the future.

The future is very hard to measure. The difficulty in making predictions is one of the things which distinguishes the future from the past. There's entropy, and laws of thermodynamics, and random quantum events that make it hard to determine exactly what might happen next.

If Schrödinger's cat is alive, we'll deliver this feature in this sprint. If the cat is dead, the feature will be delayed. The unit test result is entangled with the photon that may (or may not) have killed the cat. If you check the unit tests, then the future is determined. If you don't check the unit test, the future can be in any state.

When project management becomes Velocity Dashboard Centered (VDC™), that's a hint that something may be wrong.

My suspicion on the cause of VDC?

Product Owners ("Management") may have given up on Agility, and want to make commitments to a a schedule they made up on a whiteboard at an off-site meeting. Serious Commitments. The commitment has taken on a life of its own, and deliverable features aren't really being prioritized. The cadence or tempo has overtaken the actual deliverable.

It feels like the planning has slipped away from story-by-story details.

What can cause VDC?

I think it might be too many layers of management. The PO's boss (or boss's boss) has started dictating some kind of delivery schedule that is uncoupled from reality. The various bosses up in the stratosphere are making writing checks their teams can't cash.

What can we do?

I don't know. It's hard to provide schooling up the food chain to the boss of the boss of the product owner. It's hard to explain to the scrum master that the story points don't much matter, since the stories exist independent of any numbering scheme.

The link above says that there's some value in ordering the stories; assigning random-ish point numbers somehow helps order them.

I reject the last part of this. The stories can be ordered without any numbers. The Agile manifesto is pretty clear on this point: talk about it. The points don't enhance the conversation. Push the story cards around on the board until you have something meaningful. Assigning numbers is silliness.

Actually. I think its harmful.

Rolling numbers up to "senior" management isn't facilitating a conversation. It's summarizing things with empty numerosity. ("Numerosity"? Yes. Empty numerosity: applying numeric methods inappropriately. For example, averaging the day of the week on which it rains, for example, to discover something about Wednesday.)

The Best Part (TBP™)

Irreproducibility.

On Project X, we had a velocity of 50 Story Points per Sprint. On Project U, the "same" team -- someone quit and two new people were hired -- had a velocity of 100 Story Points per Sprint. Wow! Right?

Except. Of course, the numbers were inflated because the existing folks figured the new folks would take longer to get things done. But the new folks didn't take longer. And now the team is stuck calling a 3-point story a 5-point story because the new guys are calibrated to that new range of random numbers.

So what's comparable between the "same" team on two projects? It's not actually the same people. That's out.

We can try to pretend that the projects have the "same" technology (Java 1.6 v. Java 1.8) or the same CI/CD pipeline (Ant v. Maven, Hudson v. Jenkins) or even the same overall enterprise. But practically, there's nothing repeatable about any of it. It's just empty numerosity.

Sorry. I'm not sold, yet, on the value of Story Points.

Tuesday, September 20, 2016

What was I thinking?

Check out this idiocy: https://github.com/slott56/py-false

What is the point? Seriously. What. The. Actual. Heck?

I think of it this way.

  • Languages are a cool thing. Especially programming languages where there's an absolute test -- the Turing machine -- for completeness. 
  • The Forth-like stack language is a cool thing. I've always liked Forth because if it's elegant simplicity. 
  • The use of a first-class lambda construct to implement if and while is particularly elegant.
  • Small languages are fun because they can be understood completely. There are no tricky edge cases in the semantics.
I have ½ of a working GW-Basic implementation in Python, too. It runs. It runs some programs like HamCalc sort of okay-ish. I use it to validate assumptions about the legacy code in https://github.com/slott56/HamCalc-2.1.  Some day, I may make a sincere effort to get it working.

Even languages like the one that supports the classic Adventure game are part of this small language fascination. See adventure.pdf for a detailed analysis of this game; this includes the little language you use to interact with the game.

Tuesday, September 13, 2016

On One Aspect of Design Patterns -- Flexibility

Something I forget to think about is the degree of detail or granularity of design patterns.  I have my own viewpoint and I often assume that others share it.

Here's a quote from an email describing the PLoP (Pattern Languages of Programs) patterns as quite distinct from the Gang of Four (Design Patterns: Elements of Reusable Object-Oriented Software) patterns.
In the main, the PLoP patterns are less granular than the persnickety GoF
"Design Patterns." (Classic GoF, in part, static type binding work-arounds. And
you need to talk about a "facade" pattern? Really? Although see Fowler's at it
again, coining a ™ term - "fluent API" - for Some Not Egregiously Stupid
Practice, to feed to the credulous who have never reflected on what they are doing.)
Cutting through the editorializing, the author is describing two families.

  • GoF patterns that are essentially ways to cope with static type checking in Java and C++.
  • PLoP patterns which are a little more generic and more widely applicable.

More...
"Plug-in Pattern" is a nice example. Enumerates the stuff you kinda know, with
qualities / attributes of its proposal, plus application samples / outcomes of
applying the pattern. The claims to relevance throughout are reminiscent of the
investigation behind Parnas' "Criteria for Decomposing Systems into Modules."
My habit is to assume this is pretty widely known. I assume everyone has wrestled with design patterns large and small and found that some of the GoF apply to Python, but the implementation details will differ. Dramatically. 

Look at the Singleton design pattern, for example. The concept is profound. There are times when we want stateful, global, Singleton instances. The Java or C++ technique of a small factory method which returns the one-and-only instance (or creates the one-and-only instance in the rare edge case) is extremely strange in Python. We can implement it. But why?

Module objects in Python are stateful singletons. Rather than invent a Singleton class, we can -- trivially -- just use a module. And we're done. Problem solved. No Code Written.

The email served as a reminder that sometimes people aren't quite so flexible in their understanding of design patterns. I need to cut them some slack and guide them to seeing that there's wiggle room there. The email reminds me that some people feel compelled to either follow the GoF prescription or discard the GoF entirely. The reminder about PLoP and other pattern languages is a helpful reminder to be more flexible.

The point here is that patterns are a concept. Not a law.

Tuesday, August 30, 2016

Obscure Standards, Packaged Products, Latent Bugs

Read this: http://jeffq.com/blog/the-ethernet-pause-frame/

Fascinating.

A world of interconnected devices in which we place a kind of implicit trust. There's little visibility for ordinary consumers. It takes a skilled specialist to determine that there are flaws in a product.

It's not that the system is "flaky."

It's that this combination of components each with unexpected edge-case behavior is actually broken.

Tuesday, August 23, 2016

On Generator Functions, Yield and Return

Here's the question, lightly edited to remove the garbage. (Sometimes I'm charitable and call it "rambling". Today, I'm not feeling charitable about the garbage writing style filled with strange assumptions instead of questions.)

someone asked if you could have both a yield and a return in the same ... function/iterator. There was debate and the senior people said, let's actually write code. They wrote code and proved that couldn't have both a yield and a return in the same ... function/iterator. .... 
The meeting moved on w/out anyone asking the why question. Why doesn't it make sense to have both a yield and a return. ...

The impact of the yield statement can be confusing. Writing code to mess around with it was somehow unhelpful. And the shocking "proved that couldn't have both a yield and a return in the same ... function" is a serious problem.

(Or a seriously incorrect summary of the conversation; a very real possibility considering the garbage-encrusted email. Or a sign that Python 3 isn't widely-enough used and the emil omitted this essential fact. And yes, I'm being overly sensitive to the garbage. But there's a better way to come to grips with reality and it involves asking questions and parsing details instead of repeating assumptions and writing garbage.)

An example


>>> def silly(n, stop=None):
 for i in range(n):
  if i == stop: return
  yield i

  
>>> list(silly(5))
[0, 1, 2, 3, 4]
>>> list(silly(5, stop=3))
[0, 1, 2]

This works in both Python 3.5.1 and 2.7.10.

Some discussion

A definition with no yield is a conventional function: the parameters from some domain are mapped to a return value in some range. Each mapping is a single evaluation of the function with concrete argument values.

A definition with a yield statement becomes an iterable generator of (potentially) multiple values. The return statement changes its behavior slightly. It no longer defines the one (and only) return value. In a generator function (one that has a yield) the return statement can be thought of as if it raised the StopIteration exception as a way to exit from the generator.

As can be seen in the example above, both statements are in one function. They both work to provide expected semantics.

The code which gets an error is this:

>>> def silly(n, stop=3):
...     for i in range(n):
...         if i == step: return "boom!"
...         yield i


The "why?" question is should -- perhaps -- be obvious at this point.  The return raises an exception; it doesn't provide a value.

The topic, however, remains troubling. The phrase "have both a yield and a return" is bothersome because it fails to recognize that the yield statement has a special role. The yield statement transforms the semantics of the function to make it into a different object with similar syntax.

It's not a matter of having them "both". It's matter of having a return in a generator. This is an entirely separate and trivial-to-answer question.

A Long Useless Rant

The email seems to contain an implicit assumption. It's the notion that programming language semantics are subtle and slippery things. And even "senior people" can't get it right. Because all programming languages (other then the email sender's personal favorite) are inherently confusing. The confusion cannot be avoided.

There are times when programming language semantics are confusing.  For example, the ++ operator in C is confusing. Nothing can be done about that. The original definition was tied to the PDP-11 machine instructions. Since then... Well.... Aspects of the generated code are formally undefined.  Many languages have one or more places where the semantics are "undefined" or only defined by example.

This is not one of those times.

Here's the real problem I have with the garbage aspect of the email.

If you bring personal baggage to the conversation -- i.e., assumptions based on a comparison between some other language and Python -- confusion will erupt all over the place. Languages are different. Concepts don't map from language to language very well. Yes, there are simple abstract principles which have different concrete realizations in different languages. But among the various concrete realizations, there may not be a simple mapping.

It's essential to discard all knowledge of all previous favorite programming languages when learning a new language.

I'll repeat that for the author of the email.

Don't Go To The Well With A Full Bucket.

You won't get anything.

In this specific case, the notion of "function" in Python is expanded to include two superficially similar things. The syntax is nearly identical. But the behaviors are remarkably different. It's essential to grasp the idea that the two things are different, and can't be casually lumped together as "function/iterator".

The crux of the email appears to be a failure to get the Python language rules in a profound way. 

Tuesday, August 16, 2016

Twelve Important Design Patterns

Read this: http://12factor.net/

Then. After reading it. Read it again to be sure you've got it. It's dense with best practices.

Now that you've read it, make yourself a Quality Engineering checklist.

I. Codebase: One codebase tracked in revision control, many deploys
II. Dependencies: Explicitly declare and isolate dependencies
III. Config: Store config in the environment
IV. Backing services: Treat backing services as attached resources
V. Build, release, run: Strictly separate build and run stages
VI. Processes: Execute the app as one or more stateless processes
VII. Port binding: Export services via port binding
VIII. Concurrency: Scale out via the process model
IX. Disposability: Maximize robustness with fast startup and graceful shutdown
X. Dev/prod parity: Keep development, staging, and production as similar as possible
XI. Logs: Treat logs as event streams
XII. Admin processes: Run admin/management tasks as one-off processes

If your app doesn't follow all of these patterns, you've got technical debt to work off. Start by posting the debt remediation stories in Jira (or whatever you're using.)

I've got config issues left, right, and center. Numerous assumptions include the URL's for RESTful services on which my RESTful services rely: this is not good.

Some of these things, however, are a done deed in the Python/Flask world with no real thinking required.

  • Build, release, run - done
  • Processes - done
  • Port binding - done
  • Disposability - done

Other things require some care. And the config is something that I've really got to fix.

Tuesday, August 9, 2016

That Feeling When... You're reading your own documentation because it's useful and (mostly) correct

I'm looking at code (as a man does) and I can't remember if there's a class that does X. There's a lot of code. I wrote almost all of it. And -- maybe it's the gin -- but I just can't recall if there's an X. It seems like there should be.

Scan. Scan. Scroll. Scroll.

Read. Read.

Wait!

I have a pretty good gh-pages branch for this. Sphinx-based. Mostly up-to-date. Let's look there.

Ahhh. So much nicer than scrolling through code. Indexes work.

This whole "documentation" thing is pretty cool. Now I'm actually happy that other people guilted me into doing it.

Tuesday, August 2, 2016

Lamenting the Death of Object-Oriented Programming. (Sigh) Again?

See Goodbye, Object Oriented Programming.

I don't want to say that the entire article is bunk. It's not. It raises a few good points. Points which I thought were pretty well known.

What's aggravating is that this lamentation is overly broad.  It treats all languages as if they're Java or C++. That's not true, and as a consequence, the article is less useful than it could be.

Banana Monkey Jungle Problem. Only true if you are sadly mistaken about the unit of reuse. The class as unit of reuse -- across projects -- is false, has been false, and will always be false. The idea of class inheritance for reuse makes perfect sense. Sharing individual classes between projects has never (as far as I know) been a promise of OO programming. Maybe I read the wrong books and missed that promise.

The Triangle Problem. Isn't actually a problem. Python has a defined method resolution order.

The Fragile Base Class Problem. This points out the well known issue with having concrete classes depend on other concrete classes. The SOLID design principles suggest concrete classes should depend on abstractions. Abstractions do not suffer (as much) from the fragile base class problem.

The Hierarchy Problem. I guess the idea that the real world is multi-dimensional can be confusing. If everything has to be force-fit into single inheritance, this would create the hierarchy problem. If we allow multiple inheritance, this problem evaporates.

The Reference Problem. Even C++ has "smart" pointer packages. Java has garbage collection. Python does reference counting. This is only a problem if you go out of your way to deal with pointers in a primitive way.

The part on Polymorphism didn't make any sense. There didn't seem to be a tidy problem. Just a confusingly vague statement that "Interfaces will give you [polymorphism?]. And without all of the baggage of OO". I don't get how interfaces are necessary without the baggage of OO. So, I can't really try to refute this.

In the long run, I guess this was a way to introduce some of the benefits of a functional approach. I'm not sure that this kind of criticism of object-oriented programming is very helpful. It doesn't apply to all OO languages, so it's misleading at best. (At worst, it's simply wrong.)

I think these problems are interesting and can be used to show the benefits of functional programming. But without the actual functional programming examples, this isn't very useful.

Tuesday, July 12, 2016

Getting Rid of the Gang-of-Four Design Patterns is Nonsense

Someone found Yet Another Post (YAP™) insisting that the Gang of Four (GOF™) patterns were on their last legs. The email was misleading, because this is not precisely what the article said. The bottom-line was that Design Patterns in general are merely a response to gaps in the underlying programming language. A position that's nonsense at its very foundation.

The lexicon of design patterns varies from language to language. GoF patterns aren't "going away." They're part of the Java/C++ world. They don't apply quite the same way to Python or functional languages.

There's a more serious issue, though: Language Mapping. First some background.

Design Patterns

Design Patterns will always exist. They're an artifact of how we process the world. We tend to classify individual objects so that we don't have to deal with each object as a separate wonder of nature.

It's Just Another Brick In The Wall.

We don't have to examine each rectangular solid of ceramic and understand the wonderfulness of it. We can group and summarize. Classify. Brick is a design pattern. So is masonry. So is wall. They're all patterns. It's how we think.

Design Patterns and Language Gaps

There's a claim that moving toward functional languages will kill design patterns. This presumes (partly) that non-OO languages magically don't have design patterns. This is (see above) kind of insane. Languages have design patterns. We recognize these patterns all the time.

A functional language has a common technique (or pattern) for visiting nodes in a hierarchy. We don't dwell on the wonderfulness of the code as if we'd never seen it before. Instead, we classify it based on the design pattern, and leverage this higher-level understanding to figure out why we're walking a hierarchy.

Sounding the death knell for design patterns also presumes (partly) that functional languages are magically more complete that OO languages. In this newer better language, we don't need patterns because there are no gaps. This is pretty much nutso, too. The Patterns Fill Language Gaps school of thought ignores the fact that there are many ways to implement these "gaps". We can use GoF design patterns, or we can use other software designs that don't fit the GoF design patterns. Both work.

The patterns aren't filling a "gap." They're providing guidance on how to implement something. That's all. Nothing more. Guidance.

"But wait," you say, "since I needed to write code, that's evidence that there's a gap."

"What?" I ask, incredulous. "Are you claiming that any code is evidence of a language gap? Does that mean all application software is just a language gap?"

"Let's not be silly," you say. "I can split a hair and create a tiny distinction between software I shouldn't have to write and software I should have to write."

I remain incredulous.

Design Patterns as Damage

The idea that somehow the GoF design patterns are a problem is also goofy. The GoF design patterns are pretty slick. They solve a fairly broad suite of problems in an elegant and consistent manner.

They're just good design.

Yes, they can be complex. Sorry about that. Software can be complex if you want really excellent flexibility and extensibility.

AND.

Bonus.

Software can be complex when you have to work around the problems of "compiler" and "locked libraries" and "no source." That is, the GoF patterns apply in full force for C++ and Java where you're trying to protect your intellectual property by disclosing only headers and obfuscated implementation details. Indeed, there are few alternatives to the GoF patterns if you're going to distribute a framework that has no visible source and needs to leave extension points for users.

If you don't have Locked-NoSource-Compiled code as a backdrop, the GoF patterns can be simplified a little. But some of the patterns are essential. And remain essential. There are some really great ideas there.

In Python world, we rely on a modified subset of the GoF patterns. They work extremely well.

When writing functional-style Python using immutable data structures (to the extent possible), we use a different set of design patterns. Not so many GoF patterns when we're trying to avoid stateful objects. But some patterns (like the Abstract Factory) are really very helpful even in a largely functional context. It morphs from an abstract factory class to a factory function, and it loses the "abstract" concept that's part of C++ and Java, but the core Factory design pattern remains.

The Serious Issue

The serious issue that is surfaced by the email is Language Mapping. We cannot (and must not) try to map languages to each other. What is true for Java design is emphatically not true for Python design. And it doesn't apply to assembly languages, FORTRAN, FORTH, or COBOL.

Languages are different.

There. I said it.

If there was an underlying "universal deep structure" behind all programming languages, the surface features would be merely syntax, and we'd have automated translation among languages. The universal deep structure (the underlying Turing Machine that does computations) appears to be too abstract to map well among programming languages. Hence the lack of translators.

When switching among languages, it's important to leave all baggage behind.

When moving from Java < 8 to Java >= 8<8 java="" to=""> (i.e., non-functional Java to more functional Java) we can't trivially map all design patterns among the language features. It's a new language with new features that happens to be compatible with the old language.

Attempting to trivially map concepts between non-functional (or strictly-OO Java) and more functional Java leads to dumb conclusions. Like the GoF patterns are dying. Or the GoF patterns represent damage or something else equally goofy.

The language changes lead to design pattern changes.

Language change doesn't deserve an gleeful/anguished blog post celebrating/lamenting the differences. It's a consequence of learning a new language, or new features of an existing language.

Please avoid mapping languages to each other.

Tuesday, June 21, 2016

Why Python? (Sad Follow-up)

In "Why Python?" I linked to a deep and sophisticated analysis of programming languages. Anyway, I thought it was a deep and sophisticated analysis.

I got a reply that shows how wrong I was. Here's the quote:
The point is that the Python ecosystem has a lot to offer. We could argue about the language design choices. However, why bother? Why not just take advantage of what the ecosystem has to offer.
Ah. Discussing the language is just "arguing". I guess the points are all debatable and my comparison of Python to any benchmark is just the seed for an argument. A religious war, perhaps. I guess this wasn't compelling. It was a "why bother?"

Why bother pointing out the strong points of the language?

The email emphasized the "ecosystem" with a cool, but short example of how scipiy.spatial.KDTree works. 

It appears that -- for some people -- "Python code actually works" is a useful response to "why python?" 

I would have thought that "Python code actually works" was a precondition to even discussing the value proposition behind Python.

But -- clearly -- I was wrong.  The mere fact of a working example is a Very Important Thing™.

What does this mean?
  1. There are people who use software that doesn't actually work. When they see software that works, it's important. Very important.
  2. When software actually works, these people find this simple fact to be a compelling and substantial argument for placing a high value on the software.
  3. Other considerations like clarity and simplicity aren't relevant. If these poor souls are suffering software that doesn't actually work, then broken and obscure is still broken. Other parts of the long discussion from Wirth are just arguing points.
The email included "consider amending the why python? blog w/ the other big pro: ecosystem" I'm not sure I actually understand the request. When code that works is a "big pro", this comes from a world I can't pretend to understand.

Also. The example code used xrange(). Which is a Python 2 smell. Those days are passed.

Tuesday, June 14, 2016

Continuous Data Migration

See http://slott-softwarearchitect.blogspot.com/2013/07/database-conversion-or-schema-migration.html

People talk about CI/CD (Continuous Integration/Continuous Deployment).

They also need to talk about CM (Continuous Migration).

"Wait, what?" you ask.

When we roll out a new version of the software (CD) there are three common situations.
  1. The new software uses the existing data model with no changes. This is a "minor version change": from v3.2 to v3.3.
  2. The new software requires a tweak to the schema, but it's backward compatible. This, too, is a minor version change. In a SQL context, we might have used an ALTER TABLE to add a nullable column. If there are no SELECT * statements in the code, this change is essentially transparent to legacy code.
  3. The new software involves a new schema that's not backwards compatible. This is a major version change. From v3.2 to v4.0. This is difficult. Really difficult.
Clearly, the first two can be done with the data in place. New software is installed, the servers are restarted, and away we go. In a big environment, there may be a rolling deployment. There may be a canary release that will get converted first, then others will be brought online.

A change of the Second Kind does involve the one-time database transformation script. This may lead to some down-time. Or it may lead to a feature toggle so that the new software can work with the old database until the script is run.

In a NoSQL context, a change of the Second Kind doesn't require the one-time script. The new documents have new fields that old documents don't have. NoSQL apps -- in general -- must be able to cope with data model variations.

A change of the Third Kind is trouble.

Big trouble.

We have two schema: the v3 schema and the v4 schema. We have two sets of software: the v3.2 release and the v4.0 release. We'd like to have just one valid set of data. How do we deal with this?

How can we do schema migration badly?

We can't easily have a single software release that includes one set of data in both schema. It's technically possible Anything that doesn't involve time travel, anti-gravity or perpetual motion is technically possible. But it rapidly becomes so complex that we have to set this uber-version idea aside.

We have to do more deployment work to have both v3.2 and v4.0 installed in parallel. v3.2 will use data in the v3 schema, v4.0 will use data in the new schema.

How do we migrate the data from the old schema to the new schema?

This can be tricky. There are proven bad ideas out there. Really epically bad ideas.

One Very Bad Idea (VBI™) is the one-time-only data migration. Back in the olden days, we couldn't afford enough storage to have two copies of the database. Seriously. When a company owned exactly one computer (before PC's -- a Very Long Time Ago) the conversion had to be done by making special backups and restoring the backups into the new schema.

This VBI is still with us today.  Lots of places want to do one-time-only data migrations because it's the traditional approach. If they can't done a one-time conversion (over a long weekend) they complain. Loudly.

BTW. This never worked well. The one-time-only conversion software was never tested carefully, and therefore rarely worked the one time it was needed. Also, data profiling was never done, so edge and corner cases were found during conversion. These often called the new software's features into question, leading to larger and larger problems.

Continuous Migration

The ideas behind continuous migration are these.
  1. We're always going to be migrating the data. Always.
  2. Storage is cheaper than labor. When in doubt, buy more storage.
  3. Data outside the database (in CSV files or YAML documents) is smaller than data inside the database. Don't be afraid to export.
  4. Data outside the database is inaccessible. Be cautious of the implied down-time during exports and imports.
  5. ABP. Always Be Profiling. If you don't have a data profiler in place right now, that's the first thing to build. There are schema definition tools and schema checking tools. Look at JSON-Schema.org. Write schema definitions and use a data profiler to examine all rows and check all rules. All. Seriously. All. In a SQL DB, actually check the foreign keys to be sure the referenced row exists; you'll be surprised.
  6. We're moving forward. We're not milling around; we're not supporting the old version except for the purposes of a parallel test or a fall-back in the event the next version doesn't work. There's no long-term coexistence strategy. Preserve the data; upgrade the software.
Here's the central data migration requirement:

Be able to migrate to the new schema as many times as needed.

I'll repeat that. As. Many. Times. As. Needed.

Migration is not a one-time thing. You do it all the time.
  1. Migrating (and possibly sanitizing or subsetting) production data into the development environment.
  2. Migrating production data for QA testing.
  3. Migrating production data for integration testing.
  4. Migrating production data for performance testing.
  5. Migration production data for the production upgrade.
These are all the same activity. 

I'll repeat that. The. Same. Activity. Sometimes with mappings. Sometimes with filters. 

Since you'll do many, many migrations, your data migration programming is as important as your application programming. Perhaps more important than the application code because it's what preserves the data, and the data is the only thing of value. Applications come and go. Data is forever.

Having real data available permits seamless, silent, and automatic parallel testing. We can easily do a parallel test with v3.2 and v4.0 release candidates by simply running the migration (or migration with subset filter) to gather some data for the parallel test. If the release candidate has problems, we can fix v4.0 to create the next release candidate, re-migrate the data, and try the parallel test again.

At some point the v4.0 release is final, and we need to migrate all of the data. This (usually) involves some feature toggles to put v3.2 into a special "end-of-life" mode where the keys for records which change are logged separately. After turning off v3.2 and turning on v4.0, a second phase of migration will process these end-of-life rows through the migration mill.

Software and Schema Design Consequences

This has an important consequence.

Your software must be explicitly bound to a specific schema by major version number.

Explicitly bound. In a SQL context, you can use the "schema" construct an include the version number in the schema name. "myapp_v3" vs. "myapp_v4". This becomes a ubiquitous qualifier on all table names. SELECT col FROM myapp_v4.some_table AS st.

Yes. Do this Everywhere. Do it Now. 

If you're using mybatis or SQLAlchemy to get the SQL out of your application, then this kind of thing is a trivial change. If you have SQL in your application code, well, you have two problems to solve. First, get the SQL out of your application. Then make the schema version explicit.

In a NoSQL context, you can include the schema version as part of a collection name. "collection_v3" or "collection_v4".

This should be present everywhere.

Then, you'll need data validation apps and data migration apps. The validation apps will use your favorite schema definition and schema validation framework. Start running this as soon as you think you might need to make a major version change.

Finally, you'll need the data migration tool set. This will involve filter rules and sanitizing rules. These are not sophisticated "rules engine" kind of things with unbounded complexity. They're usually if statements and simple computations. But they come and go pretty freely, so design the software in a way that makes the filter and sanitizing code obvious.

Now you can -- trivially -- migrate data between schema versions inside the same database. You can have v3.2 and v4.0 running side-by-side. You can migrate the data early and often. You can profile and validate the data. You have a formal schema for the data validation. 

Tuesday, June 7, 2016

Why rewrite a shell script in Python?

Here's the actual quote:

Why would you need to rewrite a working script in python ? Was there any business direction towards this ?

This was an unexpected response. And unwelcome. I guess I called their baby ugly.

The short answer is that the shell script language is perhaps one of the worst programming languages ever invented. Okay. I suppose it's better than whitespace.  Okay it's better than many others. See https://en.wikipedia.org/wiki/Esoteric_programming_language

The longer answer is this:
  • There are (at least) three ksh scripts involved, two of which are over 1,000 lines long. It's not perfectly clear precisely what's involved. It's ksh. Code could come from a variety of places through very obscure paths; e.g. the source command and it's synonym, ..
  • There are no comments other than #!/usr/bin/ksh and a few places where code is commented out.  Without explanation.
  • There is no other documentation. The author had sent a email describing the github repo. The repo lacked a README. It took two tries to get them to understand that any email describing a repo should have been the README in the repo. There is barely even a command-line synopsis. (Eventually, I found it in the parser for command-line options.)
  • No tests of any kind.
The last point is the one that I find shocking. And I find it shocking on a regular basis.

Folks are able and willing to write 1,000's of lines of shell script without a single unit test, integration test, system test, performance test, anything test. How do they know if this works? Why am I supposed to trust it?

More importantly, how can I meaningfully wrap this into a RESTful API if I'm not even sure what the command-line interface really is? It's the shell. It could use environment variables that are otherwise undocumented. They would be discovered when they cause a crash at run time. Crashes that become an HTTP 500 status code and a traceback error message in the web log.

The "business direction" sounds like an attempt to trump the technical discussion with a business consideration like "cost" or "benefit". It should be pretty self-evident that 1,000's of lines of shell script code is technical debt.

The minimally viable replacement will probably be a similarly-sized of Python script that mindlessly mirrors the original shell script. It's sometimes quite hard to tell what purpose a shell function really serves. The endless use (and re-use) of global variables can make state change hard to follow. Also, the use of temporary files which are parsed (and reparsed) as a way to set state is a serious problem.

What's important is that the various OS services used by the shell script are mockable. Which means that each of the 100 or so individual functions within the script can be tested in isolation. Once that's out of the way, refactoring becomes plausible.

Let's savor those words for a moment: Tested. In. Isolation.

Ahhh.

The better replacement is (I think) about 250 lines of Python -- perhaps less -- that perform the real 8-step process that we're automating.  Getting rid of bash language cruft is challenging, but essential.