Bio and Publications

Thursday, June 27, 2013

How to Estimate a Project

A recent question
"what we might expect, in terms of 1) Time to completion, 2) Cost to implement according to the mockup, 3) Monthly server/maintenance costs and 4) approximate team size required"
This question was followed by this acknowledgement:
"it is hard to make the above estimations, given the lack of clarity on the architecture which will be employed, and given the nature of software development itself."
This is an understatement. The statement is generally well written, but the word "hard" is weak. It's not hard. It's essentially impossible.

I.  The Conundrum 

Let's say you decide that the budget for "everything" is two point five kajillion dollars. Clearly, you don't want to just fork that money over to a roomful of developers and wait a year for something to happen.

An Agile approach is a sensible alternative. Instead of building everything, you build a first release that does something. Ideally, something that creates the most value for potential users.

What's that first release? Presently, you don't have a formal, testable specification. More work needs to be done to define a first release. Formal and testable are pretty high barriers.

More important than details is this: there's a nasty circularity issue. Until you build something, you don't know where the technical roadblocks are. Once you build something... well... you've built something. And you're going to be building something potentially releasable merely to get to the point of being able to write a budget.

There's no way to know the budget without having started to build something.

Once you start to build something, you've done useful work in advance of having a budget.

One (false) claim floating around the software development world is that we can somehow do more research to resolve the unknowns before we start actually building software. This is simply false. As we do "research" we're doing high-level design: we're building software. You may resolve a few unknowns, but there will be more.

The only way to know all the details about platform and the application is to build the application using the platform. We don't know anything until we're done building something that resolves some unknowns.

Interestingly, the definition of "done" cannot possibly exist. We'll return to the farcical nature of "done", below.

Worse, there's no way to know the budget without knowing the people who will be doing the building.

II. The Productivity Issue

Until there's a relatively stable team, it's essentially impossible to know how quickly programmers can build anything. And even then, there can be unexpected, unforeseeable problems with the team.

Let me tediously pound this point home with an even more detailed analysis. The point is to make it very clear that the future is impossible to predict.

Let's say the first release is essentially a clone of an existing open source project.

This is simple. What's the budget to clone the existing open source project?

Choice 1. Find an unpaid intern. (This may no longer be legal, but it's still popular.) Have them clone the repository, rebrand it, and you have something running. How quickly can they do this? You don't know until you meet the intern and watch them work. After they've cloned the existing open source project you then know how long it will take them. Until you've seen them work, you know nothing about the time they'll require.

Choice 2. Find a kid still in school who knows the technology. Pay them sub-minimum wage to clone the package and rebrand it. How much will you pay? You don't know until you meet them and watch them work. After they've cloned the existing open source project, you'll know how much it costs.

Choice 3. Make up a schedule based on what little is known. Put the "clone the existing open source project"  out to bid on http://freelancer.com and hope that others make bids that fit with your expectations. This is fixed price so the budget is—in principle—known. In order to be sure you get something high quality and usable, you'll need to write a lot of test cases and very detailed specifications. Sadly, that pre-work is of imponderable complexity. When you get bids that are too big, you learn that your specifications weren't good enough; and you need to fix your specifications to narrow the scope of work. Now you're doing much of the work (spec writing and test case writing) in order to get a proposal that includes your budget. Note the circularity where you're doing some of the work to figure out the budget for the work you're doing.

Choice 4. Offer someone a share in the company to clone the existing open source project. Now you don't have a budget at all. You merely have a schedule. When will they be done? You're back to Choice 1, the unpaid intern, except now with better incentives to be quick. But you don't know how long they'll take until you've seen them do it once.

Choice 5. Offer someone an hourly rate plus a share in the company to clone the existing open source project.  Now you're back to having a budget, and perhaps it has an upper bound. You can pay up to some amount, after that the share in the company is their incentive to get something done.

I beat this point to death because there actually is no answer.  No matter what strategy you choose, you still can't predict developer productivity. It varies by a factor of at least 10 to 1. Some studies show it varying by 100 to 1.

The idea of forecasting development costs is a shameful lie created by accountants. Really. GAAP requires controls and budgets before spending money, and we're supposed to compare plan and actual. This is all farcical in the software world. Software development is like R&D: it's structured learning and encoding the learning into software.

III. The Done Issue

One of the Great Lies is that software has a defined "done" state. This is only true for reductionist classroom exercises. Real software grows, often without bounds.

"Wait," you say, "I have a vision of what I want, that defines a boundary."

Today, that defines a boundary.

In six weeks, after two releases, some support calls, and requests for new features, your original vision is out the window, and you're off chasing the things your real users really are asking for.

Only in-house IT managers for large (dumb) companies stick to the original plan in spite of all the lessons learned along the way.

Then you get partnership offers. And you see new platforms and tools, and you get more user requests. The browser landscape changes. Tablets become faster. Other changes that are impossible to imagine will happen.

The vision will not be stable.

It won't even be finite.  A good business model grows and adapts and expands.

IV. Strategy 1: Estimate

What can you do?

Clearly, you want some kind of budget for creating some kind of software.

Clearly, there's no way to provide a good answer.

You can, however, find a farcical answer.

Step 1: find a developer who's willing to make a sincere commitment to a cost and schedule.

Step 2: trust the sincerity of their commitment, even though it is absolutely going to be wrong. The Great Lie is that we might only be wrong by a factor of 2. In reality we can often be wrong by a factor of 10: the $100,000 job turned out to cost over a million. (See above, 10:1 productivity is just one of the unknowns.) The million dollar job was ill-advised and cancelled after the second release, but the users were happy, so it was successful in many ways. But it was cancelled.

A sincere estimate is just a random number. However, many managers find that the sincerity gives them comfort.

Since productivity is unknowable and "done" is unknowable, a detailed estimate and plan means you must now spend a lot of time writing "change orders" and reallocating the budget every time you learn something new.

I'll repeat that.

When you have an estimate, all you do with it is reallocate the estimated budget as you learn more about the customers, the development team and the product. All you do is reallocate; the idea that there's a "plan" which is compared with "actual" is farcical because the plan changes constantly. So constantly as to be meaningless.

[Accountants will claim that this analysis is wrong because the future is somehow knowable. I can only stare dumbfounded at them. The future is knowable? Really? They'll say that a plan is a commitment and comparing actual to plan somehow makes sense. They'll give all kinds of weird analogies that don't apply to software. Software development is not a "production" task like brick laying or making pins from wire. If the future were knowable, the project ROI would be a fixed 150% or 300% or, well, anything. Oh. Right. Some things are unknowable. Like the future. Ahem.]

V. Strategy 2: Agile

The very best you can do—indeed, the only rational thing you can do—is to locate talent who are willing to work for an indefinite period of time.

A person or people you trust.

You establish a release cycle. Two or three weeks are typical sprint cycle times. Two weeks works well for very new development. Three weeks is better for more established teams.

You identify the first three or so releases by writing those high-priority, high-value user stories as carefully as you can. Testable, finite user stories. Clear boundaries on acceptable vs. unacceptable behavior. Too few user stories make it difficult to foresee the future. Too many user stories can be needless preliminary work since they're going to change anyway.

You do Scrum development with a two-week cycle.  http://www.ambysoft.com/essays/agileLifecycle.html

"Useless," you say, "because there's no overall budget!"

Correct. There's no overall budget. You don't (and shouldn't) have a legally-binding definition of "done". Done means "business death." You have a vision for the first release. From that you'll make enough money to get to the second release. Which gets you to the third release. You're not done until you're out of ideas and no one wants your product anymore.

Done should always be defined as "planned release [X] is the last release." After that, it's time to donate the intellectual property to the public domain and move on to something profitable.

"Then logically," you say, "There can be a budget for the first release."

Except, as noted above, you don't know how productive the team is. So there's no useful budget for even the first release. Ideally, 1 sprint = 1 release. But. Until you know the team, and the user stories, and the platform, and the application, you can't assume that.

Which gets us to this:

The budget is only enough to get you through the next two-week sprint. A three-person team for two weeks is 240 hours. $50/hr. $12,000 per sprint. Perhaps with a larger team, it may be $20,000.

Each sprint must produce something releasable or everyone is fired. It's that crisp. The company is out of business—as currently organized—when the team can't create something releasable. Either the user stories aren't testable or the sprint planning is too ambitious. Or someone lacks the skills they were thought to have during the interview process. Or something is wrong with the team chemistry.

Sometimes, a sprint's work product is not deployed for marketing purposes. It's saved up into the next sprint so that the monthly release is far cooler than the bi-weekly release.

I'm aware that this is an unsatisfying answer. It's nice to hope that software development is a finite, linear process with just minor bumps in the road. Sadly, it's not. It's a completely out-of-control process that hurtles down the wave fronts making progress in a reasonably desirable direction in spite of currents, winds and weather. It's (by definition) a learning process. As knowledge is accumulated, it's encoded in the form of software. Once all the knowledge is available, the software happens to be done, also.

Also: http://slott-softwarearchitect.blogspot.com/2011/11/justification-of-project-staffing.html.
And this: http://slott-softwarearchitect.blogspot.com/2010/03/great-lies-design-vs-construction.html.
This, too: http://slott-softwarearchitect.blogspot.com/2009/11/on-risk-and-estimating-and-agile.html.
Okay, fine: http://slott-softwarearchitect.blogspot.com/2009/10/breaking-into-agile.html.

Tuesday, June 25, 2013

How to Make Technology Choices

I get emails looking for help with technology choice. Essentially: "I've got this game-changing software idea; what technology should I use?" These questions have disturbing expectations. There's a Gordian Knot of dependencies that's sometimes baffling.

Sometimes the questions are about choosing a "tech stack" or an "architecture". Sometimes it's the "framework" or the "platform".

All the questions, however, are very similar. They amount to either this

"What's the one, perfect and final technology choice we need to make?"

or this

"We're considering [X, Y and Z] can you validate this choice?"

Notice that the emphasis is on making One Perfect Final Decision.

An incidental part of this question is the context, which varies widely:
  • There might be a pretty good software idea.
  • Sometimes there's a list of user stories or use cases. Other times, there's a blatant refusal to consider human users, and a bizarre focus on technologies.
  • Less often, there's some sense of the business model; i.e., who will pay for this. Simply saying "advertisers" is a hint that there's no business model. Lack of a business model is a hint that technology choices are premature.
I'm not asked to handle questions on business models; I'm not a venture capitalist; I'm just a tech consultant. But I expect that a business model is in place. Technology choices support a business; not the other way around. If there's no income, then there's no point in making technology choices, is there?

Unreasonable Expectations

What's disturbing are the expectations. We'll start with one expectation that is disturbing and then look at another.

The expectation of finality is the most disturbing: the expectation that someone can make One Perfect Final Decision.

No technology choice is ever final. Today's greatest ever state-of-the-art, kick-ass-and-take-names SDK may evaporate in a cloud of lawsuits tomorrow. Today's tech giant may collapse. Today's platform of choice may be tomorrow's weird anachronism.

Worse, a super-popular framework or platform may—after deeper examination—be totally brain-dead regarding some specific API or standard. Details matter, and details emerge slowly. A vendor (or open source community) may claim that it's (for example) RESTful, but you won't know until you try it.

Principle 0. Software Development is Knowledge Capture. You do not already know everything about the business, the technology, or the problem being solved. If you already know everything, it means you learned everything based on already having working software.

Principle 1. Change happens. A fixed technology stack is a mistake. A fixed set of interface specifications is less of a mistake than a fixed set of technology choices. Software development involves learning, and while the learning is going on, the marketplace is changing. Note that learning is a two-way street, also. You learn about the users, the users learn about your technology. The problem you're trying to solve can morph as the users learn.

Principle 2. Change happens quickly.  As you learn about the marketplace, the problem, the technology and the business model, you'll be changing your software. Agility matters more than perfection. The most adaptable solution wins.

This next rule is harsh. But it's important.

Principle 3. If you have nothing to demonstrate, you have nothing. A good idea without a demo is difficult, almost impossible to work with. Without a demo, it's all just hand-waving. You must encode your knowledge in working software before you can make a technology choice.

Yes. It's circular. Sorry. You can't make a software technology choice until you have demo software that shows the problem areas. You can't create the demo without making a (potentially inappropriate) technology choice.

Demo To Product

When I ask about the existence of any demo software, I get into trouble because some folks don't want to even start building a demo until they have the One Perfect Final Decision firmly in hand.

This leads to a second unreasonable expectation.

The expectation of continuous evolution from demo to product is also disturbing: the expectation that even one line of code from the initial demo will become part of the final product.

Getting from idea to product will involve many changes. The user stories, the technology choices, the business model, every aspect is a candidate for a disruptive change. Success comes from making these changes. The first developer to abandon a bad idea is the furthest ahead. The most adaptable solution wins.

Cutting the Gordian Knot: Making Choices

Making a final, perfect technology choice for building the initial demo is not even helpful.

So don't.

Cut the Gordian Knot by building something. Build early. Build often. 

What's essential is to build something which (a) works, (b) has automated tests, and (c) can be evolved as the user stories evolve and improve. As you learn, you'll encode your evolving knowledge into evolving software. This is what software development really is: learning and encoding.

The initial demo may have to be discarded because better technology is located. Usually, however, the initial demo must be discarded based on experience in the marketplace, experience with the users, or experience solving the user's problems. It's more often these "other" non-technology lessons learned that trash the initial demo.

It's impossible to make a "future proof" technology choice. The future technology alternatives are difficult to know in advance. We distinguish between future and past by the lack of certainty in the future. As experience is gained, the initial round of user stories will get rewritten or possibly even discarded. A technology choice based on obsolete user stories is a liability, not an asset.

Some folks beg for something that will be "scalable" or "responsive" or "efficient" without having any actual scaling or performance problem that needs to be solved.

Using appropriate data structures and algorithms leads to inherently high-performance software. Beyond this vague platitude nothing much can be said.

Until.

Until there's a demo that has a specific scalability issue or performance bottleneck. Once a problem has been uncovered, then there's something to solve, and technology choices begin to matter. Most of the time, this will be a data structure or algorithm choice. Less often, this will be a larger architectural choice regarding parallelism or persistence.

Hand Wringing

"But what if," the professional hand-wringer asks, "What if my user stories are perfect, my demo is perfect, but I've made some sub-optimal technology choice and I'm forced to rework everything for purely technical reasons that—in hindsight—I could have foreseen?"

The answers are (A) Are you an absolute genius of flawless user story creation? (B) Is your code so bad that the rewrite is more than just a refactoring? (C) When did you plan to fix your code so it could be refactored? (D) Did you really think you were never going to be forced to make a core technology change?

"But what if," the hand-wringer asks, "What if I can't afford to write the whole thing twice."

The answers are (A) Is your business plan so fragile that a rewrite invalidates everything? (B) What do you think "user support" entails? (C) What will you do when users ask for new features?

If this is about "time-to-market" and you have to rush to be early or first or something, then technology choice doesn't matter, does it? Time to market matters. So build something that works and get it to the market first.

"But what if," the hand-wringer asks, "I choose a lousy platform initially?"

The answers are (A) Nothing is really wrong, it's just somewhat more costly or somewhat more complex. (B) So do others. (C) They rewrite, also.

"But what if I don't have skills in the best technology choice? What if I master a lousy technology to build the demo and release 1 and now I have to learn a whole new technology for release 2?"

The answers are (A) Did you really think that any technology would last forever? (B) Why can't you learn something new?

Basic Rules

The essential rules are these.

Build Early. Build Often.

The first step in making technology choices, then, is to pick a technology that you can actually make work, and build a demo.

Once you have a demo, recruit some potential or actual users.

Learn your lessons from these users: solve their problems; be sure your software is testable; troubleshoot your software as it is applied by real users to their real problems.

Plan to rebuild your demo to satisfy your users' demands. You will be learning from your users.

In order to maximize the learning, you're going to need to log carefully. The default logging in something like Apache is useless; log scraping is useless. You'll need detailed, carefully planned, application-specific logging to capture enough information that you really know what's going on.
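Here's a minimal sketch of what that kind of application-specific logging might look like in Python; the logger name, event names and fields are invented for illustration, not taken from any particular application.

import json
import logging

# Write structured, parseable business events -- not just web-server hits.
logging.basicConfig(filename="app_events.log", level=logging.INFO,
                    format="%(asctime)s %(name)s %(message)s")
event_log = logging.getLogger("myapp.events")  # hypothetical application logger

def log_event(user, action, **details):
    """Record one business event with enough detail to reconstruct it later."""
    event_log.info(json.dumps({"user": user, "action": action, "details": details}))

log_event("user-42", "search", query="waterproof jacket", results=17)
log_event("user-42", "add_to_cart", sku="JKT-001", quantity=1)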

Once you have working software with real users, you're going to switch into support mode. You'll be using your application-specific logging to figure out what they're doing. 
[War Story. For testability purposes, I added a special logger for a particularly gnarly and visible calculation of actuarial risk. The logger dumped everything in a giant JSON document. To simplify debugging, I wrote a little app that loaded the JSON document and produced a ReStructured Text document so that I could read it and understand it. When requested, I could trivially pump the RST through docutils to create PDF's and send them to customer actuaries who questioned a result. This PDF-of-the-details became a user story for a link that would show supporting details to an actuarial user.]
Once you have working software, and a base of users, you can consider more refined technology choices. Now the question of PHP vs. Python vs. Java might become material.

[Hint. The right answer was RESTful web services with Python and mod_wsgi all along. Now you know.]

When the product is evolving from release 1 to release 2, you may have to reconsider your choice of database, web server, protocols, API's, etc. It turns out you're always going to be making technology choices. There will never be a final decision. Until no one wants your software.

If you are really, really lucky, you may get big enough to have scalability issues. Having a scalability issue is something we all dream about. Until you actually have a specific scalability issue, don't try to "pre-solve" a potential problem you don't yet have. If your software is even moderately well designed, adding architectural layers to increase parallelism is not as painful as supporting obscure edge cases in user stories.

When you're still circulating your ideas prior to writing a demo, all technology choices are equally good. And equally bad. It's more important to get started than it is to make some impossibly Perfect Final Decision. Hence the advice to build early and build often.

Thursday, June 20, 2013

Automated Code Modernization: Don't Pave the Cowpaths

After talking about some experience with legacy modernization (or migration), I received information from Blue Phoenix about their approach to modernization.

Before talking about modernization, it's important to think about the following issue from two points of view.

Modernization can amount to nothing more than Paving the Cowpaths.

From a user viewpoint, "paving the cowpaths" means that the legacy usability issues have now been modernized without being fixed. The issues remain. A dumb business process is now implemented in a modern programming language. It's still a dumb business process. The modernization was strictly technical with no user-focused "value-add".

From a technical viewpoint, "paving the cowpaths" means that bad legacy design, bad legacy implementation and legacy platform quirks have now been modernized. A poorly-designed application in a legacy language has been modernized into a poorly-designed application in yet another language. Because of language differences, it may go from poorly-designed to really-poorly-designed.

The real underlying issue is how to avoid low-value modernization. How to avoid merely converting bad design and bad UX from one language to another.

Consider that it's possible to actually reduce the value of a legacy application through poorly-planned modernization. Converting quirks and bad design from one language to another will not magically make a legacy application "better". Converting quirky code to Java will merely canonize the quirks, obscuring the essential business value that was also encoded in the quirky legacy code.

Focus on Value

The fundamental modernization question is "Where's the Value?" Or, more specifically, "What part of this legacy is worth preserving?"

In some cases, it's not even completely clear what the legacy software really is. Old COBOL mainframe systems may contain hundreds (or thousands) of application programs, each of which does some very small thing.

While "Focus on Value" is essential, it's not clear how one achieves this. Here's a process I've used.

Step 1. Create a code and data inventory. 

This is essential for determining what parts of the legacy system have value. Blue Phoenix has "Legacy Indexing" for determining the current state of the application portfolio. Bravo. This is important.

I've done this analysis with Python. It's not difficult. Many organizations can provide a ZIP file with all of the legacy source and all of the legacy JCL (Z/OS shell scripts). A few days of scanning can produce inventory summaries showing programs, files, inputs and outputs.

A suite of tools would probably be simpler than writing a JCL parser in Python.
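Still, here's a hedged sketch of the kind of scan I mean; the regular expressions are deliberately naive placeholders, and a real inventory of JCL and COBOL source needs considerably more care.

import re
import zipfile
from collections import defaultdict

def inventory(zip_path):
    """Map each member of the legacy source ZIP to the programs and datasets it mentions."""
    program_refs = defaultdict(set)   # member name -> programs executed
    dataset_refs = defaultdict(set)   # member name -> files/datasets referenced
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            text = archive.read(name).decode("latin-1", errors="replace")
            program_refs[name].update(re.findall(r"EXEC\s+PGM=(\w+)", text))
            dataset_refs[name].update(re.findall(r"DSN=([\w.]+)", text))
    return program_refs, dataset_refs

# Usage: programs, datasets = inventory("legacy_source.zip")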

A large commercial operation will have all kinds of source checked into the repository. Some will be inexplicable. Some will have never been used. In some cases, there will be executable code that was not actually built from the source in the master source repository.

A recreational project (like HamCalc) reveals the same patterns of confusion as large multi-million dollar efforts. There are mystery programs which are probably never used; the code is available, but they don't appear in shell scripts or interactive menus. There are programs which have clear bugs and (apparently) never worked. There are programs with quirks; programs that work because of an undocumented "feature" of the language or platform.

Step 2. Capture the Data.

In most cases, the data is central: the legacy files or databases need to be preserved. The application code is often secondary. In most cases, the application code is almost worthless, and only the data matters. The application programs serve only as a definition of how to interpret and decode the data.

Blue Phoenix has Transition Bridge Services. Bravo. You'll be moving data from legacy to new (and the reverse, also.) We'll return to this "Build Bridges" below.

Regarding the data vs. application programming distinction, I need to repeat my observation: Legacy Code Is Largely Worthless. Some folks are married to legacy application code. The legacy code does stuff to the legacy files. It must be important, right?

"That's simple logic, you idiot," they say to me. "It's only logical that we need to preserve all the code to process all the data."

That's actually false. It's not simple logic. It's just wishful thinking.

When you actually read legacy code, you find that a significant fraction (something like 30%) is trivial recapitulation of SQL's "set" operations: SQL DML statements have an implied loop that operates on a set of data. Large amounts of legacy code merely recapitulates the implied loop. This is trivially true of legacy SQL applications with embedded SQL; explicit FETCH loops are very wordy. There's no sense in preserving this overhead if it can be avoided.
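To make the "implied loop" point concrete, here is a small sqlite3 sketch (the table and the 5% adjustment are invented): the explicit row-at-a-time loop accomplishes nothing that the single set-oriented UPDATE below it doesn't already do.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
db.executemany("INSERT INTO account (balance) VALUES (?)",
               [(100.0,), (250.0,), (75.0,)])

# Legacy style: an explicit FETCH loop, one row at a time.
rows = db.execute("SELECT id, balance FROM account").fetchall()
for row_id, balance in rows:
    db.execute("UPDATE account SET balance=? WHERE id=?",
               (round(balance * 1.05, 2), row_id))

# Set-oriented style: the loop is implied by the DML statement itself.
# (Either block alone applies the adjustment; both are shown only for contrast.)
db.execute("UPDATE account SET balance = ROUND(balance * 1.05, 2)")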

Programs which work with flat files always have long stretches of code that models SQL loops or Map-Reduce loops. There's no value in the loop management parts of these programs.

Another significant fraction is "utility" code that is not application-specific in any way. It's an application program that merely does a "CREATE TABLE XYZ(...) AS SELECT ....": a single line of SQL. There's no sense in preserving this through an "automated" tool, since it doesn't really do anything of value.

Also. The legacy code has usability issues. It doesn't precisely fit the business use cases. (Indeed, it probably hasn't fit the business use cases for decades.) Some parts of the legacy code base are more liability than asset and should be discarded in order to simplify, streamline or improve operations.

What's left?

The high value processing.

Step 3. Extract the Business Rules.

Once we've disposed of overheads, utility code, quirks, bad design, and wrong use cases, what's left are the real brass tacks. A few lines of code here and there will decode a one-character flag or indicator and determine the processing. This code is of value.

Note that this code will be disappointingly small compared to the total inventory. It will often be widely scattered. Bad copy-and-paste programming will lead to exact copies as well as near-miss copies. It may be opaque.

IF FLAG-2 IS "B" THEN MOVE "R" TO FLAG-BC.

Seriously. What does this mean? This may turn out to be the secret behind paying bonus commissions to highly-valued sales associates. If this isn't preserved, the good folks will all quit en masse.

This is the "Business Rules" layer of a modern application design. These are the nuggets of high-value coding that we need to preserve.

These are things that must be redesigned when moving from the old database (or flat files) to the new database. These one character flag fields should not simply be preserved as a single character. They need to be understood.

The business rules should never be subject to automated translation. These bits of business-specific processing must always be reviewed by the users (or business owners) to be absolutely sure that each rule is (a) relevant and (b) covered by a complete suite of unit test cases.

The unique processing rules need to have modern, formal documentation. Minimally, the documentation must be in the form of unit test cases; English as a backup can be helpful.

Step 4. Build Bridges.

A modernization project is not a once-and-done operation.

I've been told that the IT department goal is to pick a long weekend, preferably a federal Monday holiday weekend (Labor Day is always popular), and do a massive one-time-only conversion on that weekend.

This is a terrible plan. It is doomed to failure.

A better plan is a phased coexistence. If a vendor (like Blue Phoenix) offers bridge services, then it's smarter and less risky to convert back and forth between legacy and new over and over again.

The policy is to convert early and convert often.

A good plan is the following.
  1. Modernize some set of features in the legacy quagmire of code. This should be a simple rewrite from scratch using the legacy code as a specification and the legacy files (or database) as an interface.
  2. Run in parallel to be sure the modern version works. Do frequent data conversions from old to new as part of this parallel test.
  3. At some point, simply stop converting from old to new and start using the new because it passes all the tests. Often, the new will have additional features or remove old bugs, so the users will be clamoring for it.
For particularly large and gnarly systems, all features cannot be modernized at once. There will be features that have not yet been modernized. This means that some portion of new data will be converted back to the legacy for processing.

The feature sets are prioritized by value. What's most important to the users? As each feature set is modernized, the remaining bits become less and less valuable. At some point, you get to the situation where you have a portfolio of unconverted code but no missing features. Since there are no more desirable legacy features to convert, the remaining code is -- by definition -- worthless.

The unconverted code is a net cost savings.

Automated Translation

Note that there is very little emphasis on automated translation of legacy code. The important work is uncovering the data and the processing rules that make the data usable. The important tools are inventory tools and data bridging tools.

Language survey tools will be helpful. Tools to look for file operations. Tools to look for places where a particular field of a record is used.

Automated translation will tend to pave all the cowpaths: good, bad and indifferent. Once the good features are located, a manual rewrite is just as efficient as automated translation.

Automated translation cannot capture meaning, identify use cases or write unit test cases. Thoughtful manual analysis of meaning, usability and unit tests is how the value of legacy code and data is preserved.

Tuesday, June 18, 2013

The Small Class Large Class "Question"

This isn't really a question. Writing a few "large" omnibus classes is simply bad design.

There are several variations on the theme of principles of OO programming. None of them include "a few large omnibus classes with nebulous responsibilities."

Here's one set of principles: Class Responsibility Collaboration. Here's one summary of responsibility definition: "Ask yourselves what each class knows and what each class does".  Here's another: "A responsibility is anything that a class knows or does." from Class Responsibility Collaborator (CRC) Models.

This idea of responsibility defined as "Knows or Does" certainly seems to value focus over sprawling vagueness.

Here's another set of principles from Object-Oriented Design; these echo the SOLID Principles without the clever acronym.

Getting down to S: a single reason to change means that the class must be narrowly-focused. When there are a few large classes, then each large class has to be touched for more than one reason. By more than one developer.

Also, getting to O: open to extension, closed to modification requires extremely narrow focus. When this is done well, new features are added via adding subclasses and (possibly) changing an initialization to switch which Factory subclass is used.
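As a small, invented Python illustration of that point: a new feature arrives as a new subclass, and the only "modification" is the line that selects which class the factory uses.

class ShippingRate(object):
    """Strategy: subclasses extend behavior; existing classes stay closed to modification."""
    def cost(self, weight_kg):
        raise NotImplementedError

class GroundRate(ShippingRate):
    def cost(self, weight_kg):
        return 5.00 + 1.25 * weight_kg

class ExpressRate(ShippingRate):          # the "new feature" is just a new subclass
    def cost(self, weight_kg):
        return 12.00 + 2.50 * weight_kg

RATE_CLASSES = {"ground": GroundRate, "express": ExpressRate}

def make_rate(name="ground"):
    # The single point of change: which subclass the initialization selects.
    return RATE_CLASSES[name]()

print(make_rate("express").cost(2.0))     # 17.0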

But Why?

Why do people reject "lots of small classes"?

Reason 1. It's hard to trivially inspect a complex solution. I've had an argument similar to the one Beefarino alludes to.  In my case, it was a manager who simply didn't schedule the time to review the design in any depth.

Reason 2. Folks unfamiliar with common design patterns often see them as "over-engineered". Indeed, I've had programmers (real live Java programmers, paid to write Java code) who claimed that the java.util data structures (specifically Map, TreeMap and HashMap) were needless, since they could write all of that using only primitive arrays. And they did, painstakingly write shabby code that had endless loops and lookups and indexing garbage instead of simply using a Map.
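In Python terms, the contrast looks something like this (the data is invented): the "primitive arrays" version works, but it painstakingly re-implements what a mapping already does.

codes = ["US", "CA", "MX"]
names = ["United States", "Canada", "Mexico"]

# "Primitive arrays" style: an explicit search loop for every lookup.
def lookup_name(code):
    for i in range(len(codes)):
        if codes[i] == code:
            return names[i]
    return None

# Mapping style: the data structure already does the work.
name_by_code = {"US": "United States", "CA": "Canada", "MX": "Mexico"}

assert lookup_name("CA") == name_by_code["CA"] == "Canada"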

Reason 3. Some folks with a strong background in simple procedural programming reject class definitions in a vague, general way. Many good programmers work out ways to do encapsulation in languages like C, Fortran or COBOL via naming conventions or other extra-linguistic tricks.

They deeply understand procedural code and try to map their ideas of functions (or subroutines) and their notions of "encapsulation via naming conventions" onto OO design.

At one customer site, I knew there would be friction because the project manager was very interested in "code conventions" and "naming conventions". This was a little upsetting at the time. But I grew to realize that some folks haven't actually seen any open source code. They don't understand that there are established international, recognized conventions for most programming languages, and examples are available on the World Wide Web. Just download a popular package and read the source.

The "naming conventions" was particularly telling. The idea that Java packages (or Python packages and modules) provide distinct namespaces was not something that this manager understood. The idea that a class defines a scope was not really making much sense to them.

Also Suspicious

Another suspicious design feature is the "utility" package. It's rare (not impossible, but rare) for a class to be truly cross-package in scope and have no proper home. The "java.util" package, for example, is a strange amalgamation of the collection data structures, national and cultural class definitions (calendars, currency, timezones, etc.), handy pattern abstractions, plus a few algorithms (priority queue, random).

Yes, these have "utility" in that they're useful. They apply broadly to many programming problems. But so does java.lang and java.io. The use of a vague and overly inclusive term like "util" is an abdication of design responsibility to focus on what's really being offered.

These things do not belong together in a sprawling unfocused package.

Nor does disparate functionality belong in a sprawling, unfocused class.

Education

The answer is a lot of education. It requires time and patience.

One of the best methods for education is code walkthroughs. This permits reviews of design patterns, and how the SOLID principles are followed (or not followed) by code under development.

Thursday, June 13, 2013

HamCalc and Quirks

Careful study of the HamCalc code shows a number of quirks. Some are funny; some are just examples of the need for unit test frameworks.

The Wikispaces for the modernization project is here: http://hamcalc.wikispaces.com/home

For example, the following line of code, in GW-Basic, will (usually) set Y to zero.

Y = O

Yes. That's the variable "O", not the number 0.

Why does this work? Why can we use "O" instead of 0?

Most programmers avoid using the variable "O", since it's hard to read. GW-Basic provides default values of 0 for almost all variables. So, "Y=O" works as well as "Y=0" most of the time. The only time it doesn't work is if the program happens to have "O" used as a variable.

This is one of the examples where people start shouting that a compiled language is so obviously superior that the rest of us must be brain-damaged to use a dynamic language like Python.

This isn't a very compelling argument for the overhead of a compiler. It's a more compelling argument for avoiding languages with default values. Python, for example, would raise a NameError exception if the variable "O" had never been assigned a value.
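For example, the same one-character typo fails immediately in Python instead of quietly computing with zero:

>>> Y = O
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'O' is not defined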

This isn't common (so far, I've only found one example) but it's amusing.

Another amusing quirk is the occasional tangle of GOTO/GOSUB logic that defies analysis. There are several examples of GOSUB/RETURN logic that is completely short-circuited by a GOTO that skips the RETURN. This should (eventually) lead to some kind of stack overflow. But GW-Basic doesn't really handle recursion well, so it would probably just be ignored.

One of my favorites is this.


    730 FOR N=A TO T STEP B
    750 IF T/N=INT(T/N)THEN X=X+1:PN(X)=N:T=T/N:GOTO 730
    760 A=3:B=2
    770 NEXT N

What does the GOTO on line 750 mean? Since GW-Basic doesn't use a stack of any kind, it doesn't create recursion or stack overflow. It appears to "restart" the loop with a new value of T. I think.
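Here's one hedged Python reading of the fragment (it assumes A starts at 2 and B starts at 1, which the excerpt doesn't show): each successful division records a prime factor, reduces T, and the GOTO re-executes the FOR with the reduced limit.

def factor_like_gwbasic(t, a=2, b=1):
    # A conjectural translation of lines 730-770; initial A=2, B=1 are assumed.
    # In GW-Basic the STEP is fixed when the FOR executes, so the change to B
    # on line 760 only takes effect after the GOTO restarts the loop.
    factors = []                    # plays the role of the PN() array
    restart = True
    while restart:                  # each pass models one execution of FOR (line 730)
        restart = False
        n, step, limit = a, b, t
        while n <= limit:
            if t % n == 0:          # T/N = INT(T/N)
                factors.append(n)   # X=X+1 : PN(X)=N
                t = t // n          # T=T/N
                restart = True      # GOTO 730: re-run the FOR with the new T
                break
            a, b = 3, 2             # line 760: after a miss, try only odd candidates
            n += step               # NEXT N
    return factors

# factor_like_gwbasic(360) returns [2, 2, 2, 3, 3, 5]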


Tuesday, June 11, 2013

Python Roadmap Amplifications and Clarifications

Some additional points on using Python 2.7 in a way that bridges the gap to Python 3.2. The steps are small and simple. You can start taking them now.

Recently I suggested that one should always include from __future__ import division, print_function in every module. Always. Every Module.

I also suggested using input=raw_input in those few scripts where input might be expected. This  isn't the best idea, but it forces you to depend on the semantics of the Python 3 input() function.

I failed to mention that you must stop using the % operator for string formatting. This operator is expected to eventually be removed from Python 3. Start using "".format() string formatting right now. Always. Every Module.
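The mechanical translation is straightforward. For example (the values are invented):

>>> name, count = "spam", 3
>>> "%s ordered %d times" % (name, count)          # old-style % formatting
'spam ordered 3 times'
>>> "{0} ordered {1} times".format(name, count)    # works in 2.7 and 3.x
'spam ordered 3 times'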

A follow-up question was "What the heck is from __future__?"

The Python __future__ module contains proposed language changes.

It defines a number of named features. Of those, two are highly relevant to easing the switch to 3.2.

Division.

The division feature changes the semantics of division. The "/" operator becomes "exact" division instead of "depends on the arguments" division.

In Python 2.7, do

>>> 22/7
3
>>> 22/7.0
3.142857142857143

To see the "depends on the arguments" (or classical) mode.

Then try

>>> from __future__ import division
>>> 22/7
3.142857142857143

This is the exact division operation that's used in Python 3.

For integer division, the "//" operator is used.

>>> 22//7
3

Start now. Use them like this.

Print Function.

The print_function feature actually changes the Python compiler to reject the print statement as a syntax error. This allows you to use the (less quirky) print() function.

In Python 3, the print statement has been removed.  It's easiest to simply get out of the habit of using the print statement by switching to the print() function as soon as possible.

This means that examples from older books will have to be translated.

print "hello world"  

becomes

print("hello world")

Not too significant a change, really.

In some later chapters, they may introduce print >> somefile, data, data.

The "chevron print". This syntax was a HUGE blunder, and is one of the reasons for eliminating the print statement and replacing it with the print() function. The print function equivalent is print( data, data, file=somefile ). Much more regular; much less quirky.

Thursday, June 6, 2013

Obstinate Idiocy [Updated]

Once in a great while, you see someone engaging in Obstinate Idiocy.

Here's my recent example.

They're solving some kind of differential equation. Not sure why. Symptom 1 of Obstinate Idiocy is No Rational Justification. The explanation is often "that's not relevant, what's relevant is this other thing I want to focus on" or something equivalent to "never mind about that."

The equation is this:

\[y + 3 \ln \lvert y \rvert = \frac{1}{3}(x-1)^3-\frac{11}{3}\]

Pretty gnarly.

Apparently, they were so flummoxed by this that they immediately turned to Excel.  Really.  Excel.

Symptom 2 of Obstinate Idiocy is Random Tool Choice. Or perhaps Ineffective Tool Choice. A kind of weird, unthinking choice of tools.

Of course, Excel struggles with this sort of thing, since it appears too gnarly. I was told that there's an Excel Solver, but there was some problem with using it. It didn't scale, or it required some understanding of the shape of the equation or something.

Symptom 3 of Obstinate Idiocy is Seemingly Random Whining. It's random because there's no rational justification for what's going on and the tool was chosen apparently at random.

Ask a question like "why not use another tool?" and you don't get an answer. You get an argument about tool choice or the politics of the situation or "tool choice isn't the point" or some other dismissive non-answer.

Ask a question like "what are you really trying to do?" and you get user stories that make approximately no sense. We had to endure a long discussion on system-assigned surrogate keys as if that was somehow relevant to the graphing the equation shown above. See Symptom #1. There's just no reason for this. It's Very Important, but No One Else Can Understand The Reason Why.

How To Begin?

So, now we're at this weird impasse.

We have the obstinate idiot who won't discuss their tool choice. Somehow I'm supposed to sprinkle around some Faerie Dust and magically make Excel do something better or different than what it normally does. Indeed, I'm having trouble understanding any of the whining about Excel.

Clearly, they've never heard of MatLab or Mathematica or any commercial product that does this nicely. Apparently, they've never even seen the graph tool on Mac OS X which simply draws the graph with no effort of any kind on the part of the user.

Clearly, they've never seen Google and can't make it work.

They asked how a Pythonista would approach a problem this gnarly. I couldn't even properly understand that question, since they hadn't Googled anything and didn't really have a question that could be answered. As a Pythonista, I use Google. I wasn't sure how to approach an answer, since I couldn't really understand what their goal was or what their knowledge gap was.

Since their principal complaint was about Excel, asking a Python-related question didn't make much sense. Were they planning on dropping Excel? If so, why not use MatLab or Mathematica?

See Symptom 2. The tool choice was fixed. Other tools weren't on the table. If so, why ask about Python?

At this point, the place to begin seems to be this link: http://bit.ly/11usbtH And that's not going to be helpful with the Obstinate Idiot. They'll claim they already knew all of that, they just needed some additional or different help.

They specifically said they weren't going to use Python. Which raises the question "Why ask me anything, then?" To which there was no real answer, just sulking about me not being helpful.

Correct. I'm not being helpful. I can't figure out what the problem is. There's a gnarly formula and Excel somehow doesn't work in some optimal way. And database surrogate keys. And departmental politics.

Did You Try This?

The equation simplifies to

\[ x = (3y + 9 \ln \lvert y \rvert + 11)^{\frac{1}{3}} + 1 \]

Which is really easy to graph. \(x=f(y)\) is, of course, not the usual approach of \(y=f(x)\).

Apparently, the Obstinate Idiot had not actually applied algebra to the equation. Nor had they ever conceived of graphing \(x=f(y)\).

Which brings us to Symptom 4 of Obstinate Idiocy: Slow To Ask For Help.

And the variation on Symptom 1 of Obstinate Idiocy: Goal-Free Activity. By this I mean that the thrashing around with Excel and discussing Python was all just a long, drawn-out and utterly irrelevant side-bar from the real purpose, which apparently was to find something out related to a differential equation. It's still unclear what the equation is being used for and why the graph is helpful.

Python Approach

First: Differential Equations are hard. Nothing makes them easy.

Interactive Python, however, can be of some help after you've taken the first steps with pencil and paper.

Here's a console log of something I did to help the Obstinate Idiot.

>>> from __future__ import division  # under Python 2.7, makes 1/3 evaluate to 0.333..., not 0
>>> import math
>>> import pprint
>>> 
>>> def lde_1(y):
...     try:
...         x = (3*y+9*math.log(abs(y))+11)**(1/3)+1
...     except ValueError:
...         x = float("NaN")
...     return x
... 
>>> def eval(y_lo=-15, y_hi=15, y_step=0.5, f_y=lde_1):
...     # Next smaller power of 2: prettier numbers. Less noise.
...     step_2 = 2**math.floor(math.log(y_step, 2))
...     for t in range(int((y_hi-y_lo) // step_2)):
...         y = y_lo + step_2*t
...         x = f_y(y)
...         yield( x, y )
... 
>>> data1= list(eval())
>>> pprint.pprint(data1)

I'll leave out the dump of the data points. However, it's possible to see the asymptote at zero and the ranges where the results switch from real to complex numbers.

We can drill into the region around zero to see some details.

data2 = list(eval(-2, 2, .0625))
pprint.pprint(data2)

These are just numbers.  A picture is worth a thousand numbers.

We have lots of choices for graphic packages in Python. The point here, however, is that evaluating the gnarly equation required two preliminary steps that were far, far more important than choosing a graphic package.

  1. Do some simple algebra.
  2. Write a simple loop.

If output to Excel is somehow important, there's always this.

>>> import csv
>>> with open("data.csv","w") as target:
...    wtr= csv.writer(target)
...    wtr.writerows(data1)

That will produce a CSV that Excel will tolerate and display as an X-Y scatter plot.

A Jupyter notebook with pyplot will knock out a picture directly, allowing visualization.
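For example, a minimal matplotlib sketch (assuming matplotlib is installed, and reusing the data1 list from the console session above) is enough to see the shape of the curve:

from matplotlib import pyplot

x_vals = [x for x, y in data1]   # data1 from the eval() loop above; NaN points simply don't plot
y_vals = [y for x, y in data1]
pyplot.plot(x_vals, y_vals, "o")
pyplot.xlabel("x")
pyplot.ylabel("y")
pyplot.show()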

Tuesday, June 4, 2013

Python Big Picture -- What's the "roadmap"? [Revised]

Here's an interesting idea: http://www.xmind.net/m/WvfC/

This is associated with the following question: "I've had a hard time finding the Big Picture re: Python, and it makes it difficult ... to proceed and prioritize my efforts without one."

An interesting question: what is the overview or strategy for mastering Python?

In this case, the focus is on "Big Data", but I've found that to be merely tangential. The application area has a small influence, and then only around the fringes of the language and libraries.

I'm going to disagree with several particulars on the mind map. I'll present an alternative, with a point-by-point commentary on the mind map. (And I'll eschew the graphics; I don't find them helpful.)

Foundation

The language itself is (duh) the foundation. I find it important to emphasize this because the Python universe is replete with a seemingly endless supply of packages and libraries that help solve nearly every problem a programmer might encounter.

This profusion of packages is -- in a way -- its own problem.

It's obligatory to run the following interaction in Python. (Any Python 2.7 or 3.2 will work; older Pythons prior to 2.7 need to be upgraded.)

>>> import antigravity

Yes.

Everything you can imagine is an add-on package. Everything.

But.

That's not the starting point for learning how to solve problems with Python. That's merely one waypoint along the course. And it's not the most important waypoint.

Attractive Nuisance

We have to set the external libraries aside as an "attractive nuisance." They're a distraction, in fact. Let's focus on the stuff that comes with the installation kit: language and library.

When looking at the Language, we actually see two things: Data and Processing. The "Data" is the built-in data structures: bool, int, float, complex, exception, context, string, tuple, list, map, set, lambda, function, class, module and package. The "Processing" is the imperative programming features: the 21 (or so) statements that comprise the language.

Both facets are essential, but they're also (approximately) orthogonal to each other.

For years, I was convinced that the way to learn Python was to come to grips with most of the imperative statements and then apply these statements to the various data structures. The tidy orthogonality between many of the statements and some of the data structures makes this appealing. I wrote two Python tutorials based on this idea.

My approach was to echo the ancient Structured Concurrent Programming with Operating System Applications. They define a nested series of subsets of a hypothetical PL/I-like (or Pascal-like) programming language. While the details don't apply well to Python, the approach does make a lot of sense. Start with constants, expressions and output (i.e., print) as the minimal language. Then add state change via variables, assignment and input. Then add if/elif/else. Fold in for and while, and continue to add features in this careful progression: functions, exceptions, contexts, generators, etc.

I'm becoming less and less sure that the imperative, procedural statements should define the roadmap through the language.

It's true that computing is defined by number theory. The original Turing Machine theorem equates all of number theory to an imperative, procedural notion of computers and programming. While unconditionally true, it's not necessarily the most helpful strategy. We could, for example, start programming by covering Boolean Algebra and Set Theory first. But it would be a long dull slog before we got to anything that appeared "useful."

Data Is Central

I'm starting to see that the data structures are more helpful than imperative statements. This leads to a different approach to studying this language. Experienced programmers may feel that a list of fundamental language topics isn't too helpful.

However. I've noted that many experienced programmers tend to skip over the unique-to-Python features. This leads them to write clunky and awkward Python code because they missed something that would lead to simplicity and clarity.
  1. int. Natural numbers are boring but necessary. The first explorations of Python can easily be simple expressions, output, variables and input using integers.
  2. bool. Comparisons and logic allow introduction of the if, elif and else statements in a graceful way. 
  3. str. Strings can be a gentle introduction to objects which are collections. Strings have methods, unlike integers and booleans. Strings introduce a number of conversion functions (int, float, str, hex, oct, etc.) This allows introduction of the for statement based on this simple collection.
  4. float and complex. Floating point numbers are an important side-bar. They're not central. The notion of "approximation" can't be stressed enough, and pathological examples of noise bits at the end of floats are absolutely central. The math library is perhaps part of this. Also the decimal and fractions modules.
  5. Exception. For programmers who have a background in languages like C (without exceptions), the exception seems complex and mysterious. However, for Python they are absolutely central. And easy to play with by getting simple ValueErrors. This introduces the try/except statements, also. While it's a little advanced, the class MyException( Exception ): pass is not a bad thing at this point. Yes, it's a bit of a "magical incantation." But so is len(string).
  6. tuple and list. This is an extension to some of the discussion of string. It's also a time to introduce mutability and show some of the consequences of a mutable list. This introduces iterability, also.
  7. dict and defaultdict. This introduces more loop constructs including list comprehensions and various kinds of generator expressions.
  8. set and frozenset. This allows a review of mutability and the ways list and tuple differ.
  9. function and lambda. The def and return statements, plus global. Additionally, the sort method of a list as well as the sorted() built-in function can be looked at in some depth.
  10. file, open and context. This includes the with statement. This is a two-part or three-part exploration. It has to include some of numerous library packages for dealing with the file system. Plus data representation in CSV and JSON files. The way that a file is iterable is essential.
  11. Iterators, generators and the itertools package. This includes techniques for implementing map-reduce algorithms using iterators and generators.
  12. namedtuple. This is a small thing, but it can help to crystallize attribute access and some of the features that are part of a class.
  13. class. This must include a multi-step excursion into special method names.
  14. module and package.  Note that these are different things. Java only offers "package". A Python module is a very, very important concept. The module (not the class) is the practical unit of reuse. Python is emphatically not written in the style of Java with one class per file. 
Class Definitions

The essential goal behind the first 14 topics is to get to the point where all the language features can be used to create workable class definitions.

  1. Common object-oriented design patterns. Most of the "Gang-of-Four" suite of patterns is relevant to Python. A few changes to the textbook examples are required to remove the C++ and Java biases. Patterns like State, Strategy and Factory are central to good OO design. The Python version of Singleton has to be treated carefully; the Python Borg pattern is rarely useful; on the other hand, the concept of a module global variable is important and underpins some of the standard library.
  2. Above and beyond the common design patterns, Python has a number of unique design patterns. These are largely exemplified by the special method names. Attribute Access (properties and descriptors). This allows creation of simple collections.
  3. Callable objects allows a review of functions and lambdas, also. The Abstract Base Class definitions must be emphasized for this to work out well in the long run.
  4. Sequence Types expands simple collections to created ordered collections. 
  5. Number Types. This allows a complete understanding of the decimal and fractions modules, also.
  6. Some additional design patterns need to be added, also. Specifically, things like metaclass and classmethod are features of Python that are absent from Java or C++.
Programmers experienced in other languages might object to this depth of coverage of Python OO design techniques and design patterns.

What I find is that programmers who don't really "get" the Python design patterns (especially the ABCs) over-write their programs: they needlessly reinvent methods that are already first-class features of the language, but weren't well understood. Properties and descriptors, for example, allow for a simpler and very clear syntax; it's often better than the endless parade of explicit getter and setter method calls that characterizes Java Beans programming.
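Here's a hedged sketch of the contrast; the class and attribute names are invented for illustration:

class Temperature:
    """Celsius is stored; Fahrenheit is computed on demand via a property."""
    def __init__(self, celsius=0.0):
        self.celsius = celsius              # a plain attribute; no setter needed

    @property
    def fahrenheit(self):
        return self.celsius * 9 / 5 + 32

    @fahrenheit.setter
    def fahrenheit(self, value):
        self.celsius = (value - 32) * 5 / 9

t = Temperature()
t.fahrenheit = 212       # reads like simple attribute access
print(t.celsius)         # 100.0

The client code stays as simple as direct attribute access, while the class remains free to add computed behavior later; that's the flexibility explicit getter/setter pairs try to buy in Java, at a much higher syntactic cost. (Under Python 2.7 the class must inherit from object for properties to work.)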

Additionally, bad habits from other languages need to be unlearned. For example, many Java (and C++) programmers are taught to overuse the private keyword. When they learn Python, they think that private is somehow really important. When they find out about __ (double underscore) name mangling, they go off the deep end, using __ names everywhere. This is all bad.

Encapsulation has little to do with private. In Python, a leading _ (single underscore) is the convention for private. But it's not like Java's (or C++'s) compiler-enforced privacy; it's just a nodding understanding. As the creator of Python says, "we're all adults here." An overused Java private is more of a problem for proper extension of a Java class than Python's casual "nudge-nudge-wink-wink private".
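A small sketch of the convention; the names are invented for illustration:

class Account:
    def __init__(self, balance):
        self._balance = balance     # single underscore: "private" by convention only
        self.__audit = []           # double underscore: name-mangled to _Account__audit

a = Account(100)
print(a._balance)            # works; the underscore is just a polite hint
print(a._Account__audit)     # even the mangled name is reachable; this is not real privacy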

The Standard Library

After looking at class definitions, it's important to look at the standard library, subsection by subsection. There is a lot to the installed library.

For most Python programmers, sections 1 to 6 of the library reference will have been covered by the previous material. Sections 26 onward are also less important.

Sections 7 to 25 of the library reference contain the centrally important modules. A familiarity with this list of topics is essential before tackling "real" projects. This is so important that we'll use this set of topics as the basis for our point-by-point commentary on the mind map linked above.

External Components and Downloads

One of the reasons why Python is a well-designed language is the way the principle of orthogonality is applied.

Most statements and data structures play well together. For example, all the built-in collections are iterable; the for statement works directly with them.
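A quick sketch of what that orthogonality looks like in practice; the same for statement handles a list, a dict, a generator expression, and (assuming the file exists) an open file:

for word in ['spam', 'eggs', 'toast']:
    print(word)
for key in {'a': 1, 'b': 2}:
    print(key)                         # iterating a dict yields its keys
for square in (n * n for n in range(5)):
    print(square)
for line in open('example.txt'):       # example.txt is a hypothetical file
    print(line.rstrip())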

Also, the external libraries are all independent of the language, and the language stands on its own without needing any of them.

Looking at the mind map, there are several interesting topics. And a few mysteries. And some unhelpful labels. Here's a quick commentary on the mind map.
  • Basic Stack. I suppose these can be called "essential" external packages. This seems to be a way to emphasize other packages listed elsewhere on the diagram. I'm not sure why this topic is here.
  • Newer Packages. This is a completely opaque label. Not helpful.
  • Integrated Platforms. This isn't too helpful, either. I suppose one could make a guess based on the list of packages.
  • Visualization. Ah. Now we're getting somewhere. These are some helpful visualization packages. PIL isn't listed, perhaps because it's too primitive.
  • Data Formats. YAML isn't listed. The SQL and NoSQL categories make precious little sense. Those are all about persistence, not data formats. Data format and persistence are separate and unrelated. JSON, for example, is a data format. CouchDB is persistence.
  • Packages. I suppose it's helpful to point out PyPI, but it doesn't make sense in this context. This is metadata and relatively unhelpful.
  • Efficiency. Cython for "efficiency" makes precious little sense. Proper data structures and algorithms are the secret to efficiency. See my post on a 100:1 speedup in Python. When that isn't enough, it's sometimes necessary to drop out of Python and write the important 20% of the code in C++.
  • Parallel. A non-Windows OS handles parallelism gracefully. Process-level parallelism with pipelines is simple and efficient. Thread-level parallelism is often more trouble than it's worth.
  • GPU. This is an example of where a little C++ code can go a long way to improving the 20% of the code that's the actual performance bottleneck.
  • Glue. Interfaces to other applications or packages can be useful if the other package is actually a first-class part of the solution.
  • MapReduce. This is essentially persistence, and goes with the SQL and noSQL databases. It's also a fundamental design pattern that can be exploited trivially in Python (see the sketch after this list).
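As a sketch of how naturally map-reduce falls out of the language, using nothing but a generator expression and a built-in reduction; the data is invented:

raw = ['2', '3', '5', '7', '11']
mapped = (int(value) ** 2 for value in raw)    # "map" phase: lazy transformation
total = sum(mapped)                            # "reduce" phase: consume the generator
print(total)                                   # 4 + 9 + 25 + 49 + 121 = 208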
On this mind-map, there are a few topics that are really important. So important that the topics parallel the Python library.
  • Data Persistence, chapter 11. Databases and files. This includes SQL and noSQL databases as well as pickled data structures. Python comes with SQLite, which allows SQL development without additional downloads. The Postgres and MySQL libraries are often popular because the price is right and the functionality is outstanding.
  • Archive and Compressed Structures, chapter 12. ZIP, BZ2, etc. Compression is sometimes relevant for big data projects.
  • Data Representation and File Formats, chapters 13, 18 and 19. CSV, JSON, YAML, XML, HTML, etc. It's important to note that JSON is more compact than (and almost as expressive as) XML. While XML is popular, it's sometimes overused.
  • OS Features, chapters 15 and 16. These are tools needed to build command-line applications. For Big Data applications, logging and command-line parameter parsing are essential.
  • Multiprocessing. This is its own design discipline. What's important here is that OS process-level design is central. The queue and multiprocessing packages are sufficient for this. There are also external multiprocessing packages, like ZeroMQ. (A small sketch follows this list.)
  • Internet Protocols, chapter 20. This is part of using RESTful web services, which is essential for making noSQL databases (like CouchDB) work. For creating RESTful servers, the WSGI approach is essential.
  • Unit Testing and Documentation, chapter 25. Sphinx is extremely important for creating useful documentation with minimal pain.
  • Visualization. matplotlib and PIL are popular. The built-in turtle package is a bit primitive. However, it's also rather sophisticated, and a great deal can be done with it.
  • Numeric Processing. numpy or scipy.
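A minimal sketch of the process-level design mentioned under Multiprocessing, using only the standard library; the worker function is invented for illustration:

from multiprocessing import Pool

def transform(record):
    """A stand-in for real per-record work."""
    return record * 2

if __name__ == '__main__':
    pool = Pool(4)                              # four worker processes
    results = pool.map(transform, range(10))    # fan out, then collect
    pool.close()
    pool.join()
    print(results)

The operating system's process scheduler does the heavy lifting; no thread-level locking is involved.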
Note that the number of external packages on this list is rather small. Python comes with batteries included. 

Admittedly, it's hard to make general recommendations for external packages. But it's misleading to provide a huge list of external packages when the default suite of packages will solve a large number of problems gracefully.

Which Python Version?

Generally, everything should be done in Python 3.2.

In some cases a crucial package hasn't been upgraded to Python 3.2. In these exceptional cases, Python 2.7 can be used. For example, nltk is still focused on Python 2.7.

But.

Every Python 2.7 program should always begin with

from __future__ import print_function, division

That's every and always. All new development should always be focused on Python 3.2. There is no rational exception to this rule.

If there's any need to use the input() function, then the following line must be included, also.

input = raw_input

This makes Python 2.7's input() behave like the Python 3.2 version: it returns the user's entry as a string instead of evaluating it.
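Putting the rules together, a minimal Python 2.7 module written under this policy might start like this; the body is just an illustration:

from __future__ import print_function, division
input = raw_input        # Python 2.7 only: make input() return a string, as Python 3 does

name = input("Name? ")
print("Hello,", name)
print(7 / 2)             # 3.5 under both 2.7 (because of division) and 3.2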