Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Tuesday, August 28, 2012

Password Encryption -- Short Answer: Don't.

First, read this.    Why passwords have never been weaker—and crackers have never been stronger.

There are numerous important lessons in this article.

One of the small lessons is that changing your password every sixty or ninety days is farcical.  The rainbow table algorithms can crack a badly-done password in minutes.  Every 60 days, the cracker has to spend a few minutes breaking your new password.  Why bother changing it?  It only annoys the haxorz; they'll be using your account within a few minutes.  However.  That practice is now so ingrained that it's difficult to dislodge from the heads of security consultants.

The big lesson, however, is profound.

Work Experience

Recently, I got a request from a developer on how to encrypt a password.  We have a Python back-end and the developer was asking which crypto package to download and how to install it.

"Crypto?" I asked.  "Why do we need crypto?"

"To encrypt passwords," they replied.

I spat coffee on my monitor.  I felt like hitting Caps Lock in the chat window so I could respond like this: "NEVER ENCRYPT A PASSWORD, YOU DOLT."

I didn't, but I felt like it.

Much Confusion

The conversation took hours.  Chat can be slow that way.  Also, I can be slow because I need to understand what's going on before I reply.  I'm a slow thinker.  But the developer also needed to try stuff and provide concrete code examples, which takes time.

At the time, I knew that passwords must be hashed with salt.  I hadn't read the Ars Technica article cited above, so I didn't know why computationally intensive hash algorithms are best for this.

We had to discuss hash algorithms.

We had to discuss algorithms for generating unique salt.

We had to discuss random number generators and how to use an entropy source for a seed.

We had to discuss http://www.ietf.org/rfc/rfc2617.txt in some depth, since the algorithms in section 3.2.2. show some best practices in creating hash summaries of usernames, passwords, and realms.

All of this was, of course, side topics before we got to the heart of the matter.

What's Been Going On

After several hours, my "why" questions started revealing things.  The specific user story, for example, was slow to surface.

Why?

Partly because I didn't demand it early enough.

But also, many technology folks will conceive of a "solution" and pursue that technical concept no matter how difficult or bizarre.  In some cases, the concept doesn't really solve the problem.

I call this the "Rat Holes of Lost Time" phenomena: we chase some concept through numerous little rat-holes before we realize there's a lot of activity but no tangible progress.  There's a perceptual narrowing that occurs when we focus on the technology.  Often, we're not actually solving the problem.
IT people leap past the problem into the solution as naturally as they breathe. It's a hard habit to break.
It turned out that they were creating some additional RESTful web services.  They knew that the RESTful requests needed proper authentication.  But, they were vague on the details of how to secure the new RESTful services.

So they were chasing down their concept: encrypt a password and provide this encrypted password with each request.  They were half right, here.  A secure "token" is required.  But an encrypted password is a terrible token.

Use The Framework, Luke

What's most disturbing about this is the developer's blind spot.

For some reason, the existence of other web services didn't enter into this developer's head.  Why didn't they read the code for the services created on earlier sprints?

We're using Django.  We already have a RESTful web services framework with a complete (and high quality) security implementation.

Nothing more is required.  Use the RESTful authentication already part of Django.

In most cases, HTTPS is used to encrypt at the socket layer.  This means that Basic Authentication is all that's required.  This is a huge simplification, since all the RESTful frameworks already offer this.

The Django Rest Framework has a nice authentication module.

When using Piston, it's easy to work with their Authentication handler.

It's possible to make RESTful requests with Digest Authentication, if SSL is not being used.  For example, Akoha handles this.  It's easy to extend a framework to add Digest in addition to Basic authentication.

For other customers, I created an authentication handler between Piston and ForgeRock OpenAM so that OpenAM tokens were used with each RESTful request.  (This requires some care to create a solution that is testable.)

Bottom Lines

Don't encrypt passwords.  Ever.

Don't write your own hash and salt algorithm.  Use a framework that offers this to you.

Read the Ars Technica article before doing anything password-related.

Tuesday, August 21, 2012

Performance "Tuning": running in 1/100th the time

For the 757 Python Meetup group, someone proposed looking at some Python code they had which was slow.  The code implemented a variation on the Elo chess rating system.  It applied the ratings to other sports, and used the points scored as well as basic win/lose/tie to work out a ranking for sports teams.  Very clever stuff.

But.

It was described as horribly slow. Since it was only 400 lines of code, it was a great subject for review in a Python meetup.  I would be able to show some tweaks and performance tips.

Hypothetical Step 1:  Profile

My first thought was to run the profiler and see what popped up as a possible root cause of slowness.

Generally, there are three kinds of performance problems.

  • Wrong data structure.  Or wrong algorithm.  (They're simply opposite sides of the same coin.)  This generally leads to dramatic, earth-shattering improvements.
  • Piss-Poor Programming Practices.  This kind of fine-tuning yields minuscule improvements.  In some cases, no measurable change at all.
  • Bugs.  Every time I've been asked to improve "working" code, I find that it has bugs.  Every time.  These can exacerbate performance problems.  Interestingly, most of these are readily apparent during an initial survey:  i.e., while simply reading the code to see how it works.  Trying to create unit tests for the purposes of refactoring often reveals additional bugs.
I didn't know what I was going to see in this 400-line program.

Hypothetically, profiling will help show what kind of problems we have.

Prior to profiling, of course, we need to run the darn thing.  Thankfully, they provided some typical input files.  For example, 1979 High School Football season results.  159 lines of patiently gathered teams and scores.

It Didn't Run

When we tried to run it with the profiler, we found that we had a bug.  Tracking down the bug, revealed the essential performance problem, also.

The core problem was a failure to make use of Python data structures.  

This manifested itself in a number of really bad design errors.  Plus, it presented as the actual, show-stopping, serious bug.

In order to work out team rankings, the program kept a list of teams.

I'll repeat that for folks who are skimming.
In order to work out team rankings, the program kept a list of teams.
Not a dict mapping from team name to team details.  But a simple list.  Locating a team in the list meant iterating through the list, looking for the matching team.  Really.

for t in team_list:
    if t['name'] == target_name:
        process( t )

This kind of thing was repeated in more than one place.

And one of those repeats had the bug in it.

What we have here is "wrong data structure".  Replacing a list with a dict will have earth-shattering impact on performance.

The Bug

The bug, BTW, was a lookup loop which had the additional requirement of adding missing teams.  It tried to use the for-else structure.  This was the intended code (not the actual code).

for t in team_list:
    if t['name'] == target_name:
        return
else:
    t['name']= init_new_team(target_name)
    
This is a first-class bit of Python.  An else clause on a for statement is a real feature of the language.  However, it's obscure enough that it's easy to get wrong.

However, it's also unique to Python, and the kind of thing that can lead to confusion.  I discourage it's use.  

Test-Driven Reverse Engineering

We're doing TDRE on a little piece of recreational programming.  This means that we need to fabricate unit tests for code that is purported to work.  Some folks like to call this "exploratory testing" (a phrase which is contradictory, and consequently meaningless.)  We're "exploring".  We're not "testing".

Once the core bug is fixed, we can run sample data through the application to get "big picture" results.  We can extract bits from the sample data and exercise various functions in isolation to determine what they actually do now, bugs included.

Since this is so simple--and we're going to review it during a 2-hour meet up--and there's nothing billable going on--we can get by with a few really simple unit tests.  We'll run the darn thing once to create some expected output.  We save the output and use that single output log as the expected results from each refactoring step.

More complex applications require more unit tests.  For a billable TDRE effort last year, I had to create 122 unit tests to successfully refactor a program of over 6,000 lines of code.  It took several weeks.

Real Step 1: Fix The Data Structure

Before profiling (but after running to capture some output) we have to fix the essential data structure.  Repeated scanning of a list is never going to perform well.
The whole point of the study of "Data Structures" is to prevent (or optimize) search.
In this case, we can prevent linear search of a list by using a dictionary.  That, clearly, is job one.

It was a pervasive rewrite.   Essentially, everything in this little program included a goofy loop to lookup a team in the list.  The Elo algorithm itself, which is already O(n2), is dragged down by using the linear search for the opposing team four more times, making it O(n3).

Cyclomatic Complexity

One of the big "issues" is the use of if statements throughout the scoring algorithm.  An if statement creates Cyclomatic Complexity and can lead to performance problems.  Generally, if statements should be avoided.

This algorithm applies some normalization factors to reconcile scoring with win/loss numbers in different sports.  Basketball, specifically, involves generally high scores.  Since there are 2-point and 3-point scoring opportunities, a factor is used to normalize the points into "goals".  Football, similarly, has numerous scoring opportunities with values of 1, 2, 3 and 6 points; the scores here are also normalized.

This normalization was done with an if statement that was evaluated deep inside the Elo algorithm.  Repeatedly. Evaluated.

The two functions that handled the normalizations, plus the normalization factors, are ideal candidates for OO design.  There's a clear hierarchy of classes here.  A superclass handles most sports, and two polymorphic subclasses handle football and basketball normalization.

The if statement is now "pushed up" to the very beginning of the program where an instance of the sports normalization object is created.  This object's methods are then used by the Elo algorithm to normalize scores.

Icing on the Cake

Once we've fixed the bug and replaced a list with a dict, everything else is merely icing.
Some other OO changes.
  1. The "Team" information should not be a flat, anonymous dictionary.  It should be a proper class definition with proper attributes.  There aren't many methods, so it's easy to create. 
  2. The "Game" information is read by csv.DictReader.  However, it should not remain a simple, anonymous dict.  As with a Team, a simple class can be created to handle Game.
  3. The overall structure of the application needs to be broken into two sections.  The command-line interface parses options, opens files, and generally gets everything set up.  The actual ranking algorithm should be a function that is given an open file-like object plus the Sport object for normalization.  This allows the ranking algorithm to be reused in other contexts than the command-line (i.e. a web service).
A more subtle OO design point is the question of "mutability".  A Team in this application is little more than a name.   There are also numerous "stateful" values that are part of the Elo algorithm.   A Game, similarly, is an immutable pair of teams and scores.  However, it has some mutable values that are part of the Elo algorithm.  

Really, we have immutable Team and GameHistory objects, plus a few values that are used as part of the Elo calculation.  I'm a big fan of disentangling these mutable and immutable objects from each other.  

I suspect that the Elo algorithm doesn't really need to update the "state" of an object.  I suspect that it actually creates (and discards) a number of immutable candidate ranking objects.  The iteration that leads to convergence might be a matter of object creation rather than update.  I didn't make this change, since it required real work and we were out of time.

Bottom Line

The more times I do TDRE to improve performance, the more I realize that it's all about bugs and data structures.

This recreational application took 45-60 seconds to process one year's record of games for a given league.  It now takes less than 0.2 seconds to do the same volume of work.  Two test cases involving a complete run of 159 records runs in 0.411 seconds.  That's 1/100th the time simply from switching data structures.

The idea of "tweaking" a working program to improve performance is generally misleading.  It might happen, but the impact is minuscule at best.  

Here's the checklist for getting 100:1 improvements.
  • Remove searches.  
  • Remove deeply-nested if statements.  
Generally, reduce Cyclomatic Complexity.

Tuesday, August 7, 2012

How Expensive is a "Waterfall" Project Plan?

It's impossible to step into the same river twice; other waters are flowing toward the sea.  It's impossible to do "head-to-head" project comparisons.  You can't have the same people doing the same thing with only one constraint changed.  You can try to resort to having similar people doing the exact same thing.

It's impossible to provide trivial quantification of the costs of a waterfall approach.  However, there's a lot of places where "up front" engineering adds cost and risk.  A waterfall plan adds friction, zero-value administrative costs and (sometimes) extra work.

For software development that doesn't involve "toy" projects, but apply to the magical "real world" projects, it's difficult to justify building the same "real world" thing twice.  It's hard to create a detailed and purely economic analysis of this question.  Therefore, we have to resort to hand-waving and speculation.

[I put "real world" in quotes, because the "real world" is a kind of dog-whistle phrase that only project managers understand.  It appears to mean "projects that fit my preconceived notions" or "projects that can't be done any other way because of political considerations" or "projects constrained by a customer to have a farcical structure".  Another variant on this dog-whistle phrase is "the realities of running a business."  There's something in their real world that they (a) can't define and (b) prevents any change.]

A few months back, I saw a http://prezi.com presentation that was intended to drive architecture, cost and schedule conversations.  The authors were happy.  I was left relatively mystified.

There are three relevant issues that don't fit well into a prezi presentation.
  1. Open Technology questions.  If it's new software development, there are unanswered software API, performance, quality, fitness-for-purpose questions.  If it's a clone or a clean-room reverse engineering, then there may be few questions.  Otherwise, there must be unknowns.
  2. Open User Story questions.  If it's new software development, then the user stories are not complete, consistent and optimized.  There must be something missing or extra.
  3. Open User Experience questions.  I'm not a UX person.  Tinkering with an Arduino has taught me that even in the most prosaic, simplistic project there is a UX element.  
It's easy to trivialize this as "details" or "supporting information" omitted from the presentation.  Actually, it's considerably more interesting that simply elaborating details.  Indeed, the missing stuff is often more important that the elements of the presentation than were provided.

The Standard Gaps

While a presentation can show some elements of UX, it's not interactive.  It cannot provide any useful depth on the UX.  Failure to do UX exploration (with real interaction) is crazy.  Making assumptions about UX is more likely to create cost and risk.

User Stories can be kicked around in a presentation.  However.  Once the highest-priority user stories are built, and lessons learned about the users, the problem and the emerging solution, then the other user stories can (and should) change based on those lessons learned.  Assuming a list of user stories and then committing to a specific schedule enshrines the costs in a bad plan.  And we all know that the plan (and the dates) are sacred once written down.

Software does not exist in a vacuum.

I'll repeat that, since it's a hidden assumption behind the waterfall.


Software does not exist in a vacuum.

You have a Heisenberg problem when you start building software and releasing it incrementally.  The presence of the new software, even in a first-release, beta-test, barely-working mode changes the ecosystem in which the software exists.  Other applications can (and will) be modified to adjust to the new software.  Requirements can (and should) change.

A waterfall project plan exists to prevent and stifle change.  A long list of tasks and dates exists to assure that people perform those tasks on those dates.  This is done irrespective of value created and lessons learned.

Technology changes.  Our understanding of the technology changes.  One wrong assumption about the technology invalidates the deliverables for a sprint.  In a project with multiple people, multiple assumptions and multiple sprints, this effect is cumulative.  Every assumption by every person for every technology is subject to (considerable) error.

Every technology assumption must be looked at as a needless cost that's being built into the project.

Recently, I've been working on a project with folks that don't know Django very well.  Their assumptions are -- sometimes -- alarming.  Part way through a sprint, I got a raft of technical questions on encrypting passwords.  It's hard to state it strongly enough: never encrypt a password.  What's more useful is this: always hash passwords.  The original password must be unrecoverable.  There are lots of clever hashing techniques.  Some of which are already part of Django.  A significant portion of the sprint (appeared) to be based on reinventing a feature already part of Django.

Do the math: a few wrong assumptions made during project planning are canonized forever as requirements in the project.  With a waterfall project, they're not easy to change.  Project managers are punished for changing the project.  You can't increase the project deadline; that's death.  You can't decrease it, either: you get blamed for sand-bagging the estimates.

Arduino Technology Lessons Learned

After buying several 74HC595 shift registers from Sparkfun, I realized that my assumptions about the interfaces were utterly wrong.  I needed to solder a mix of right-angle header and straight-through headers onto the breakout boards.  I tried twice to order the right mix headers.  It seems so simple.  But assumptions about the technology are often wrong.

This is yet more anecdotal evidence that all assumptions must be treated as simple lies.  Some managers like to treat assumptions as  "possibly true" statements; i.e., these are statements that are unlikely to be false.  This is wrong.  Assumptions always have a very high probability of being false, since they're not based on factual knowledge.

Some managers like to treat assumptions as "boundary conditions":  i.e.,  if the assumption is true, then the project will go according to plan.  Since all of the assumptions will be prove to be incorrect, this seems like simple psychosis.  

[Interestingly, the assumptions-are-boundary folks like to play the "real world" card: "In the real world, we need to make assumptions and go forward."  Since all assumptions will be shown to be incorrect, why make them?  Wouldn't it be more rational to say "we need to explore carefully by addressing high-righ unknowns first"?  Wouldn't it be better to both gather facts and build an early release of the software at the same time?]

Arduino User Story Assumptions

After building a prototype that addressed two of the user stories, it slowly became clear that the third user story didn't actually exist.

Sure, the words were there on paper.  Therefore, there's a user story.

But. 

There was nothing to actually do.  

The whole "As a [role], I need [feature] so that I can [benefit]" that was written on day one was based on a failure to understand precisely how much information was trivially available.  The benefit statement was available with a small software change and no separate user story.  And no separate hardware to support that user story.

Exploration and tinkering reduced the scope of the work.  In the real world.

[In the "real world" where waterfall is considered important, exploration is described as unbounded spending of limited resources.  In the real real world, money must be spent; it can either be long hand-wringing meetings or it can be prototype development.]

The user story (written before a prototype existed) was based on a failure to fully understand the UX.  The only way to fully understand the UX is to build it.  Education costs money.

Arduino UX Learnings

Perhaps the most important issue here is UX.

Once upon a time, UX was expensive, difficult and complex.  So difficult that special-purpose prototyping tools were created to make it possible to create a preliminary UX that could be used to confirm UX and user stories.

This UX prototyping effort was real money spent as part of "requirements" or "design"; it created documentation that flowed over the waterfall for subsequent development.

This notion is obsolete.  And has been obsolete for several years.

UX is now so easy to build that it makes much more sense to build two (or three) competing UX's and compare them to see which is actually better.

Indeed, it makes a lot of sense to build one UX and release it; make a real effort at solving the user's problems.  Then build a second UX for A/B testing purposes to see if there's room for improvement.

I'll repeat that for folks who really like the waterfall.

It's now cheaper to actually build two than it is to write detailed requirements for one.

[In the "real world", this is deprecated as "wasting time playing with the UX".  As if written requirements based on nothing more than a whiteboard are more "real" than hands-on experience with the UX.]

You can prove this to yourself by actually observing actual UX developers knocking out pages and screens.  Long "requirements gathering" meetings with prospective users amount to a waste of time and money.  Long "brainstorming" sessions, similarly, are wasteful.  Short, ongoing conversations, a build, a release, and a follow-up review has more value, and costs less.  

Do the math.  Several users and business analysts in several multiple-hour meetings costs how much?

A pair of developers banging out fully-functioning, working UX for a use case costs how much?

A slavish adherence to waterfall development creates "real world" costs.

Thursday, July 12, 2012

Innovation, Arduino and "Tinkering"

Many of my customers (mostly super-large IT shops) wouldn't recognize innovative behavior.  Large organizations tend to punish defectors (folks that don't conform), and innovation is non-conforming.

I've just had two object lessons in innovation.  The state of the art has left many in-house IT development processes in the dust.  The cost and complexity of innovation has fallen, but organizations continue to lumber along pretending that innovation is risky or complex.

You can find endless advice on how to foster a culture of innovation.  Often, this advice includes a suggestion that innovative projects should somehow "fail fast".  I'm deeply suspicious of "fail fast" advice.  I think it misleads IT management into thinking there's a super-cheap way to innovate.  It's misleading because "fail fast" leaves too many questions unanswered.

  • How soon do you know something's about to be a failure?  
  • What's the deadline that applies so that failure can happen quickly?  
  • What's the leading indicator of failure?

If you are gifted enough to predict the future -- and can predict failure -- why not apply that gift to predicting success?  Give up on the silliness of unstructured "innovation" and simply implement what will not fail.

At MADExpo, I saw an eye-opening presentation on the Arduino.  I followed that with viewing Massimo Banzi's TED Talk on the subject of Arduino, imagination and open source.

There are two central parts of the Arduino philosophy.

  • Tinkering.
  • Interaction Design.

Background

Without delving too deeply, I'm trying to build a device that will measure the position of a hydraulic piston.  It's the hydraulic steering on a boat, and a measurement of the piston position provides the rudder position, something that's handy for adjusting sail trim to reduce the strain on an autopilot.

Clearly, such a device needs to be calibrated with the extreme port and starboard range of motion.  Barring unusual circumstances, the amidships position is simply the center between the two limits.

Part 1.  Buy an Arduino, a Sharp GP2Y0A02YK0F IR distance measurer (for 10-80 cm), plus miscellaneous things like breadboard, jumpers, LED's, test leads, etc.  A trip to Radio Shack covers most of the bases.  The rest comes from SparkfunRobot Shop, Digi-Key and Mouser.

Part 2.  Learn the language (a subset of C.)  Learn core algorithms (de-bouncing buttons and the IR sensor).

Tinkering

At this point, we've tinkered.  Heavily.

What's important for IT managers is that tinkering doesn't have a project plan.  It doesn't have a simple schedule and clear milestones.  It's a learning process.  It's knowledge acquisition.

The current replacement for tinkering is training.  Rather than learn by attempting (and failing), IT managers hire experts to pass on knowledge.  This is, generally, limiting and specifically stifles innovation.

Years ago, I worked on embedded systems: hardware software hybrids.  We burned ROMs and programmed in assembler.  Back in those days, this kind of tinkering was difficult, and consequently frowned upon.  It was difficult to specify, locate, source, and assemble the components.  There was a lot of reading complex product data sheets to try and determine what to buy and how few were needed.

What had once been a very serious (and very difficult) electrical engineering exercise (IR sensor, button, LED, power supply, etc., etc.) was a few days of tinkering with commodity parts.  The price was low enough and availability ubiquitous enough that frying a few LED's is of no real consequence.  Even frying an Arduino or two isn't much of a concern.

Interaction Design

The next step is to work out the user interface.  For the normal operating mode, the input comes from the hydraulic piston and the output is some LED's to show the displacement left or right of center.  Pretty simple.

However.

There's the issue of calibration.  Clearly, the left and right limits (as well as center position) need to be calibrated into the device.

Just as clearly, this means that the device needs buttons and LED's to switch from normal mode to calibration mode.  And it needs some careful interaction design.  There are several operating modes (uncalibrated, calibrating, normal) with several submodes for calibrating (setting left, setting right, setting center.)

Once upon a time, we wrote long, wordy documents.  We drew complex UML state charts.  We drew all kinds of pictures to try and capture the important features of the interaction.

Enter Arduino

The point of Arduino is not to spend too much time up front over-specifying something that's probably a bad idea.   The point is to experiment quickly with different user interface and interaction experiences to see what works and what doesn't work.

The same is true of many modern development environments.  Web development, for example, can be done by using sophisticated frameworks, writing little backend code and messing with the jQuery, CSS and HTML5 aspects of the interaction.

The scales fell from my eyes when I started to document the various operating modes.

Arduino doesn't have a great unit testing environment.  It's for tinkering, after all.  It's also for building small, focused things.  Not large, rambling, hyper-complex things.  You can achieve complexity through the interaction of small, easy-to-test things.  But don't start out with complexity.

After writing a few paragraphs, I realized that the piston movements could easily be self-calibrating.  Simply track the maximum and minimum distances ever seen.  That's it.  Nothing more.  In the case of a boat, it means swinging the wheel from stop to stop to define the operating range.  That's it.

A button (to clear the accumulated history) is still useful.  But much simpler since it's a one-time-only reset button.  Nothing more.

Moving from idea to working prototype took less time than writing this blog post.

Next steps are to tinker with various display alternatives.  How many LED's?  What colors?  LCD Text Display?  There are numerous choices.

Rather than wasting times on UML, specifications, whiteboard and diagrams, it's a simpler matter to write the user stories and tinker with display hardware.

Tuesday, July 10, 2012

Cool stuff I saw at MADExpo

A random list of cool things.

  1. HTML5.  Start now.  It's supported (more or less) by all browsers if you add appropriate shims.  Start with sites like http://www.html5rocks.com/en/ and continue reading.  It's largely arrived and there's no compelling reason to delay.
  2. JavaScript.  I'm not a fan of the language.  However.  It's clearly here to stay. It's part of many important technologies (like CouchDB).  HTML5 shims (or shivs) are necessary.  Flex (and other browser plug-in languages) are effectively dead.  JavaScript is all that's left.
  3. Redis.  Wow! Is that cool?  Rather that fart around trying to get outstanding performance out of a clunky old RDBMS, use a simple, high performance data store.
  4. MongoDB.  Now that I've seen it, I have a vague notion of places where Mongo is better and places where Couch might be better.  In 80% of the application space, both are fine.  But there's a 20% where we might be able to split a hair and leverage the slightly different feature sets.
I don't build mobile apps.  But if I did, I think I'd start with Appcelerator Titanium before switching to a "native" development environment.  The idea that several standard features are readily accessible with a convenient development environment is pleasant.

I finally laid eyes on an Arduino.  Let me simply say that it looks like dangerous, dangerous fun.  I think I'll be hanging around at Radio Shack a lot.

Tuesday, July 3, 2012

Book Deal Fell Apart (sigh)

After spending a couple of years (really) working with a publisher, the deal has gone south.

The problem was—likely—all mine.  The book wasn't really what the publisher wanted.  Perhaps it was close and they thought they could edit it into shape.  And perhaps I wasn't responsive enough to criticism.

I like to think that I was responsive, since I made almost all of the suggested changes.

But the deal fell apart, so my subjective impression isn't too relevant.

What To Do?

I have two essential choices here.
  1. Finish up.  Simply press on with the heavily-edited version.  The path of least resistance.
  2. Roll back the content.  This involves going back to a less heavily edited version and merging in the relevant changes and corrections from the writing process.
Option 2 is on the table because my editor was unhappy with "digressive" material.  The editor had a vision of some kind of narrative arc that exposed Python in one smooth story line.  My attempted presentation was to expose language features and provide supporting details.

Perhaps I'm too deeply invested in the details of computer science.   Or.  Perhaps I'm just a lousy writer.  But I felt that the digressions were of some value because they could fill in some of the gaps I observed while coaching and teaching programmers over the last few decades.

In addition to the editorial challenge, there's technical challenge.  Do I step back from LaTeX?
  • Use LaTeX.  This means that I would have to create the non-PDF version with a LaTeX to HTML translator.  Read Converting LaTeX to HTML in the Modern Age.  See Pandoc.  Or, it means that I don't offer an HTML version (a disservice, I think.)  Also, I need to unwind the publisher's LaTeX style libraries and revert to a plain LaTeX.  Since LaTeX is semantically poor, I need to rework a lot of RST markup.
  • Revert to RST.  While RST tools can make both LaTeX and HTML from a common source, it does mean that some of the fancier LaTeX formatting has to go.  Specifically, I really got to know the algorithm and algorithmic packages.  I hate to give those up.  But maybe I can work something out with Sphinx 1.1.3's various features.
The problem is that three of the four combinations of paths have advantages.
  • Finish up using LaTeX is easiest.  Remove the publisher's document style.  Use HeVeA or Pandoc to make HTML.  Move on.  
  • Finish up but revert to RST.  This means conversion from the publisher's LaTeX to RST followed by some editing to fix the missing RST semantic markup and then some debugging to get the LaTeX and HTML to look good.
  • Rollback the content using LaTeX.  This would be challenging because I would have to merge manually edited publisher-specific LaTeX from the heavily-edited version with the Sphinx-generated LaTeX from the less-heavily edited source.  Things like the publishers' style tags would complicate the merge. 
  • Rollback the content and revert to RST. This means using Pandoc to convert the heavily-edited LaTeX to RST. Then merging the relevant edits into the RST original text.   This actually seems pretty clean, since the heavily-edited RST (converted from LaTeX) would be short.
Perhaps, if I was a better writer, I wouldn't have these problems.

It appears than the solution is MacFarlane's Pandoc.  This can reverse LaTeX to RST, allowing easy side-by-side merging of texts from various sources.  Or.  It can convert LaTeX to HTML, allowing easy work with the heavily edited LaTeX version.

Thursday, June 28, 2012

How to Write Crummy Requirements

Here's an object lesson in bad requirements writing.

"Good" is defined as a nice simple and intuitive GUI interface. I would be able to just pick symbol from a pallette and put it somewhere and the software would automatically adjust the spacing.
Some problems.
  1. Noise words.  Phrases like "'Good' is defined as" don't provide any meaning.  The word "just" and "automatically" are approximately useless.  Here is the two-step test for noise words.  1.  Remove the word and see if the meaning changed.  2.  Put the opposite and see if the meaning changed.  If you can't find a simple opposite, it's noise of some kind.  Often it's an empty tautology, but sometimes it's platitudinous buzzwords.
  2. Untestable requirements.  "nice, simple and intuitive" are unqualified and possible untestable.  If it's untestable, then everything meets the criteria (or nothing does.)  Elegant.  State of the art.  Again, apply the reverse test:  try "horrid, complex and counter-intuitive" and see if you can find that component.  No?  Then it's untestable and has no place.
  3. Silliness.  "GUI".  It's 2012.  What non-GUI interfaces are left?  Oh right.  The GNU/Linux command line.  Apply the reverse test: try "non-GUI" and see if you can even locate a product.  Can't find the opposite?  Don't waste time writing it down.
What's left?  
pick symbol from a palette ... the software would ... adjust the spacing.
That's it.  That's the requirement.  35 words that mean "Drag-n-Drop equation editing".

I have other issues with requirements this poorly done.  One of my standard complaints is that no one has actually talked to actual users about their actual use cases.  In this case, I happen to know that the user did provide input.

Which brings up another important suggestion.
  • Don't listen to the users.
By that I mean "Don't passively listen to the users and blindly write down all the words they use.  They're often uninformed."  It's important to understand what they're talking about.  The best way to do this is to actually do their job briefly.  It's also important to provide demos, samples, mock-ups, prototypes or concrete examples.  It's 2012.  These things are inexpensive nowadays. 

In the olden days we used to carefully write down all the users words because it would take months to locate a module, negotiate a contract, take delivery, install, customize, integrate, configure and debug.  With that kind of overhead, all we could do was write down the words and hope we had a mutual understanding of the use case.  [That's a big reason for Agile methods, BTW:  writing down all the user's words and hoping just doesn't work.]

In 2012, you should be able to download, install and play with candidate modules in less time than it takes to write down all the user's words.  Often much less time.  In some cases, you can install something that works before you can get the users to schedule a meeting.

And that leads to another important suggestion.
  • Don't fantasize.
Some "Drag-n-Drop" requirements are simple fantasies that ignore the underlying (and complex) semantic issues.  In this specific example, equations aren't random piles of mathematical symbols.  They're fairly complex and have an important semantic structure.  Dragging a ∑ or a √ from a palette will be unsatisfying because the symbol's semantics are essential to how it's placed in the final typeset equation.

I've worked recently with some folks that are starting to look at Hypervideo.  This is often unpleasantly difficult to write requirements around because it seems like simple graphic tools would be all that's required.  A lesson learned from Hypertext editors (even good ones like XXE) is that "WYSIWYG" doesn't apply to semantically rich markup.  There are nesting and association relationships that are no fun to attempt to show visually.  At some point, you just want to edit the XML and be done with it.

Math typesetting is has deep semantics.  LaTeX captures that semantic richness.  

It's often best to use something like LaTeXiT rather than waste time struggling with finding a Drag-n-Drop tool that has appropriate visual cues for the semantics.  The textual rules for LaTeX are simple and—most importantly—fit the mathematical meanings nicely.  It was invented by mathematicians for mathematicians.