S.Lott-Software Architect: 2009

Monday, December 28, 2009

The Secret Architect's Cabal

Recently, I had two very weird "meta" questions on the subject of OO design.

They bother me because they imply that some Brother or Sister Architect has let slip the presence of the Secret Technologies that we Architects are hiding from the Hoi Polloi developers.

These are the real questions. Lightly edited to fix spelling and spacing.

"What are the ways to implement a many to many association in an OO model?"
"Besides the relational model, what other persistence mechanisms are available to store a many to many association?"

These are "meta" questions because they're not asking anything specific about a particular data model or set of requirements. I always like unfocused questions because all answers are good answers. Focus allows people to see through the smoke and mirrors of Architecture.

The best part about these questions (and some similar questions that I didn't paste here) is that they are of the form "Is there a secret technique you're not telling us about?"

It's time to come clean. There is a Secret Cabal of Architects. There are things we're not telling you.

Many-to-Many

The many-to-many question shows just how successful the Society of Secrets (as we call ourselves) has been about creating a SQL bias. When folks draw higher-level data model diagrams that imply (but don't show) the required many-to-many association table, the Architects have failed. In other organizations the association table is So Very Important that it is carefully diagrammed in detail. This is a victory for forcing people to think only in implementation details.

In the best cases, the DBA's treat the association table as part of the "dark art" of being a DBA. It's something they have to dwell on and wring their hands over. This leads to developers getting wrapped around the axle because the table isn't a first-class part of the data model, but is absolutely required as part of SQL joins.

It's a kind of intellectual overhead that shows how successful the Secret Architecture Society is.

The presence of a dark secret technique for implementing association leads to smart developers asking about other such intellectual overhead. If there's one secret technique, there must be many, many others.

It is to laugh.

The Secret Techniques for Associations

The problem arises when someone asks about the OO implementation of many-to-many associations. It's really difficult to misdirect developers when the OO implementation is mostly trivial and not very interesting. There's no easy way to add complexity.

In Python there are a bunch of standard collections. The language has a bunch that are built in. Plus, in Python 2.6, the collections module has Abstract Base Classes that clearly identify all of the collections.
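As a minimal sketch of how little there is to it, a many-to-many association can be held as two dictionaries of sets, one per direction. The `Enrollment` class and the student and course names below are hypothetical, chosen only for illustration; this is one of several equally simple ways to do it with the built-in collections.

```python
from collections import defaultdict


class Enrollment:
    """Hypothetical many-to-many association between students and courses."""

    def __init__(self):
        # One mapping per direction, so either side can be navigated directly.
        self.courses_of = defaultdict(set)
        self.students_in = defaultdict(set)

    def add(self, student, course):
        # Record the link on both sides of the association.
        self.courses_of[student].add(course)
        self.students_in[course].add(student)

    def remove(self, student, course):
        # discard() is a no-op if the link was never recorded.
        self.courses_of[student].discard(course)
        self.students_in[course].discard(student)


enrollment = Enrollment()
enrollment.add("alice", "math")
enrollment.add("alice", "physics")
enrollment.add("bob", "math")

print(sorted(enrollment.courses_of["alice"]))
print(sorted(enrollment.students_in["math"]))
```

No association table is needed because each object can simply hold a collection of references to the other side.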

There isn't too much more to say on the subject of many-to-many associations. That makes it really hard to add secret layers and create value as an architect.

The best I can do with questions like this is say "I was sworn to secrecy by the secret Cabal of Architects, so I can't reveal the many-to-many association techniques in a general way. Please get the broomstick of the Wicked Witch of the West if you want more answers."

Persistence

The persistence question, however, was a gift. When someone equates "relational model" with a "persistence mechanism", we have lots of room to maneuver. I know that we're talking about a "relational database" as a "persistence mechanism". However, it's clear they don't know that, and that's my opportunity to sow murkiness and confusion.

Sadly, the OS offers us exactly one "persistence mechanism". Yet, the question implies that the Secret Cabal of Architects knows about some secret "alternative persistence mechanisms" that mere programmers can't be told about.

Every device with a changeable state appears as a file. All databases (relational, hierarchical, object, whatever) are just files tarted up with fancy API's that allow for performance and security. Things like indexing, locking, buffering, access controls, and the like are just "features" layered on top of good-old files. But those features are So Very Important, that they appear to be part of persistence.

Excellent.
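The claim that a database is just a file with a fancy API can be checked in a few lines. This is a small sketch using SQLite from the Python standard library (my choice for the illustration, not something from the original questions); the `demo.db` file name and the `note` table are hypothetical.

```python
import os
import sqlite3
import tempfile

# Create a relational database in a scratch directory (hypothetical file name).
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "demo.db")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO note (body) VALUES (?)", ("just a file",))
conn.commit()
conn.close()

# The "database" is an ordinary file that the OS reads like any other.
with open(db_path, "rb") as f:
    header = f.read(16)

print(os.path.isfile(db_path))
print(header)
```

The first 16 bytes are the SQLite file header. Everything else (indexing, locking, buffering) is layered on top of that one file.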

Logical vs. Physical

What's really helpful here is the confusion folks have with "Logical" vs. "Physical" design layers.

Most DBA's claim (and this is entirely because of ERwin's use of the terms) that physical design is when you add in the database details to a logical design. This is wrong, and it really helps the Architect Secret Society when a vendor misuses common terms like that.

The Physical layer is the file-system implementation. Table spaces and blocks and all that what-not that is the underlying persistence.

The Logical layer is what you see from your API's: tables and views.

The relational database cleanly separates logical from physical. Your applications do not (indeed, can not) see the implementation details. This distinction breaks down in the eyes of DBA's, however, and that lets us insert the idea that a database is somehow more than tarted-up files.

Anyone asking about the "relational model" and "persistence mechanism" has -- somehow -- lost focus on what's happening inside the relational database. This allows us to create Architectural Value by insisting that we add a "Persistence Layer" underneath (or on top of or perhaps even beside) the "Database Layer". This helps confuse the developers by implying that we must "isolate" the database from the persistence mechanism.

Many-to-many and ORM

Sadly, these two questions may turn out to be ORM questions. The problem with ORM layers is that the application objects are trivially made persistent. It's really hard to add complexity when there's an ORM layer.

However, a Good Architect can sometimes find room to maneuver.

A programmer with SQL experience will often think in SQL. They will often try to provide a specific query and ask how that SQL query can be implemented in the ORM layer. This needs to be encouraged. It's important to make programmers think that the SQL queries are First Class Features. The idea that class definitions might map directly to the underlying data must be discouraged.

A good DBA should insist on defining database tables first, and then applying the ORM layer to those tables. Doing things the other way around (defining the classes first) can't be encouraged. Table-first design works out really well for imposing a SQL-centered mind-set on everyone. It means that simple application objects can be split across multiple tables (for "performance reasons") leading to hellish mapping issues and performance problems.

No transaction should make use of SQL set-oriented processing features. Bulk inserts are a special case that should be done with the database-supplied load application. Bulk updates indicate a design problem. Bulk deletes may be necessary, but they're not end-user oriented transactions. Bulk reporting is not transactional and should be done in a data warehouse.

Subverting the ORM layer by "hand-designing" the relational database can create a glorious mess. Given the performance problems, some DBA's will try to add more SQL. Views and Dynamic Result Sets created by Stored Procedures are good ways to make the Architecture really complex. The Covert Coven of Architects likes this.

Sometimes a good developer can subvert things by creating a "hybrid" design where some of the tables have a trivial ORM mapping and work simply. But. A few extra tables are kept aside that don't have clean ORM mappings. These can be used with manually-written SQL. The best part is populating these extra tables via triggers and stored procedures. This assures us that the architecture is so complex that no one can understand it.

The idea of separating the database into Logical and Physical layers hurts the Architectural Cabal. Wrapping the Logical layer with a simple ORM is hurtful, too. But putting application functionality into the database -- that really helps make Architecture appear to be magical.

The Persistence Mechanisms

The bottom line is that the Secret Conference of Architects doesn't have a pat answer on Persistence Mechanisms. We have, however, a short list of misdirections.

API and API Design. This is a rat-hole of lost time. Chasing API design issues will assure that persistence is never really found.
Cloud Computing. This is great. The cloud can be a great mystifier. Adding something like the Python Datastore API can sow confusion until developers start to think about it.
Multi-Core Computing. Even though the OS handles this seamlessly, silently and automatically, it's possible to really dig into multi-core and claim that we need to rethink software architecture from the very foundations to rewrite our core algorithms to exploit multiple cores. Simply using Unix pipelines cannot be mentioned because it strips the mystery away from the problem.
XML. Always good for a few hours of misdirection. XML as a hierarchical data model mapped to a relational database can really slow down the developers. Eventually someone figures it out, and the Architect has nothing left to do.
EJB's. This is digging. It's Java specific and -- sadly -- trumped by simple ORM. But it can sometimes slow the conversation down for a few hours.

Sunday, December 20, 2009

The Data Cartel and "Users"

I work with a CIO who calls the DBA's "The Data Cartel". They control the data. Working with some DBA's always seems to turn into hostage negotiation sessions.

The worst problems seem to arise when we get out of the DBA comfort zone and start to talk about how the data is actually going to be used by actual human beings.

The Users Won't Mind

I had one customer where the DBA demanded we use some Oracle-supplied job -- running in crontab -- for the LDAP to database synchronization. I was writing a J2EE application; we had direct access to database and LDAP server. But to the data cartel, their SQL script had some magical properties that seemed essential to them.

Sadly, a crontab job introduces a mandatory delay into the processing while the user waits for the job to run and finish the processing. This creates either a long transaction or a multi-step transaction where the user gets emails or checks back or something.

The DBA claimed that the delays and the complex workflow were perfectly acceptable to the users. The users wouldn't mind the delay. Further, spawning a background process (which could lead to multiple concurrent jobs) was unacceptable.

This kind of DBA decision-making occurs in a weird vacuum. They just made a claim about the user's needs. The DBA claimed that they wouldn't mind the delay. Since the DBA controls the data, we're forced to agree. So if we don't agree, what? A file "accidentally" gets deleted?

The good news is that the crontab-based script could not be made to work in their environment in time to meet the schedule, so I had to fall back to the simpler solution of reading the LDAP entries directly and providing (1) immediate feedback to the user and (2) a 1-step workflow.

We wasted time because the data cartel insisted (without any factual evidence) that the users wouldn't mind the delays and complexity.

[The same DBA turned all the conversations on security into a nightmare by repeating the catch-phrase "we don't know what we don't know." That was another hostage negotiation situation: they wouldn't agree to anything until we paid for a security audit that illustrated all the shabby security practices. The OWASP list wasn't good enough.]

The Users Shouldn't Learn

Recent conversations occurred in a similarly vacuous environment.

It's not clear what's going on -- the story from the data cartel is often sketchy and missing details. But the gaps in the story indicate how uncomfortable DBA's are with people using their precious data.

It appears that a reporting data model has a number of many-to-many associations. Periodically, a new association arrives on the scene, and the DBA's create a many-to-many association table. (The DBA makes it sound like a daily occurrence.)

Someone -- it's not clear who -- claimed this was silly. The DBA claims the product owner said that incremental requirements causing incremental database changes was silly. I think the DBA is simply too lazy to create the required many-to-many association tables. It's a table with two FK references. A real nightmare of labor. But there were 3 or maybe 4 instances of this. And no end in sight.
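For a sense of scale, here is roughly what one of those association tables amounts to. This is a sketch using SQLite from the Python standard library; the `customer`, `product` and `customer_product` tables and their rows are hypothetical, since the DBA never described the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- The association table: two foreign keys and nothing else.
    CREATE TABLE customer_product (
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        product_id INTEGER NOT NULL REFERENCES product(id),
        PRIMARY KEY (customer_id, product_id)
    );
""")

conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)",
                 [(1, "Ann"), (2, "Raj")])
conn.executemany("INSERT INTO product (id, name) VALUES (?, ?)",
                 [(10, "Widget"), (11, "Gadget")])
conn.executemany("INSERT INTO customer_product VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 10)])

# Navigate the association with a two-step join through the link table.
rows = conn.execute("""
    SELECT c.name, p.name
    FROM customer c
    JOIN customer_product cp ON cp.customer_id = c.id
    JOIN product p ON p.id = cp.product_id
    ORDER BY c.name, p.name
""").fetchall()
print(rows)
```

Each new association is another table of this shape: a handful of lines of DDL.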

It appears that the worst part was that the data model requirements didn't arrive all at once. Instead, these requirements had the temerity to trickle in through incremental evolution of the requirements. This incremental design became a "problem" that needed a "solution".

Two Layers of Hated User Interaction

First, users are a problem because they're always touching the data. Some DBA's do not want to know why users are always touching the data. Users exist on the other side of some bulkhead. What the users are doing on their side is none of our concern as DBA.

Second, users are a problem because they're fickle. Learning -- and the evolution of requirements that is a consequence of learning -- is a problem that we need to solve. Someone should monitor this bulkhead, collect all of the requirements and pass them through the bulkhead just once. No more. What the users are learning on their side is none of our concern as DBA.

What's Missing?

What's missing from the above story? Use Cases.

According to the DBA, the product owner is an endless sequence of demands for data model features. Apparently, adding features incrementally is silly. Further, there's no rhyme or reason behind these requests. To the DBA they appear random.

The DBA wanted some magical OO design feature that would make it possible to avoid all the work involved in adding each new many-to-many association table.

I asked for use cases. After some back and forth, I got something that made no sense.

It turns out that the data model involves "customers", so the DBA started out describing the customer-centric features of the data model. After all, the "actor" in a use case is a person and the database contains information on people. That's as far as the DBA was willing to go: repeat the data model elements that involved people.

If It Weren't For the Users

The DBA could not name a user of the application, or provide a use case for the application. They actually refused to articulate one reason why people put data in or took data out. They sent an angry email saying they could not find a reason why anyone would need these many-to-many association tables.

I responded that if there's no user putting data in or getting data out then there's no system. Nothing to build. Stop asking me for help with your design if no person will ever use it.

To the DBA, this was an exercise in pure data: there was no purpose behind it. Seriously. Why else would they tell me that there were no use cases for the application?

Just Write Down What's Supposed to Happen

So I demanded that the DBA write down some sequence of interactions between an actual real-world end-user and the system that created something of value to the organization. (My idea was to slide past the "use case" buzzword and get past that objection.)

The DBA wrote down a 34-step sequence of steps. 34 steps! While it's a dreadful use case, it's a start: far better than what we had before, which was nothing. We had a grudging acknowledgement that actual people actually used the database for something.

We're moving on to do simplistic noun analysis of the use case to try and determine what's really going on with the many-to-many associations. My approach is to try and step outside of "pure data" and focus on what the users are doing with all those many-to-many associations.

That didn't go well. The data cartel, it appears, doesn't like end-users.

The Final Response

Here's what the DBA said. "The ideal case is to find a person that is actually trying to do something and solve a real end user problem. Unfortunately, I don't have this situation. Instead, my situation is to describe how a system responds to inputs and the desired end state of the system."

Bottom line. No requirements for the data model. No actors. No use case. No reality. Just pure abstract data modeling.

Absent requirements, this approach will turn into endless hypothetical "what if" scenarios. New, fanciful "features" will inevitably spring out of the woodwork randomly when there are no actual requirements grounded in reality. Design exists to solve problems. But the DBA has twice refused to discuss the problem that they're trying to solve by designing additional tables.

Tuesday, December 15, 2009

The Problem with Software Development is...

Folks like to say that there's a "Software Crisis". We can't build software quickly enough, cheaply enough or well enough.

I agree with EWD -- software is really very, very complex. See EWD 316 for more justification of this position.

Is my conclusion that the cost of software stems from complexity? Hardly news.

No, my conclusion is that the high cost of software comes from doing the wrong things to manage the high cost of software.

The Illusion of Control

Nothing gives a better illusion of control than a project plan. I think that software development project management tools -- MS Project specifically -- are the biggest mistake we can make.

As evidence, I look at Agile methods. One key element of Agile methods is to reduce (or eliminate) the project management nonsense that accumulates around software development.

I think that software development projects are generally pretty complex and a big MPP file doesn't reduce the complexity or help anyone's understanding. I think that we should not make an effort to capture the complexity -- that's simply silly.

If you find that you need a really complex document to capture a lot of really complex complexity, you're doing something wrong.

Hands in the Pocket Explanations

I think that user stories are great because they reduce the complexity down to something that we can articulate and remember. This gives us a fighting chance at understanding.

If the use case requires a big, complicated document, we're missing something essential. It should have a pithy, easy-to-remember, easy-to-write-on-a-sticky-note summary. It can have a detailed technical appendix. But it has to have a pithy, easy-to-articulate summary.

If you can't explain the use case with your hands in your pockets, it's too complex.

Architecture

An architecture diagram is helpful. Architecture -- as a foundation -- has to be subject to considerable analysis to be sure it's right. You need to be absolutely confident that it works. And like any piece of mathematical analysis, you need diagrams and formulas, and you need to show your work.

A miraculous pronouncement that some architecture will work is a terrible thing. A few pithy formulas (that we can remember) and some justification are a whole lot better.

The WBS Is The Problem

I find that projects with complicated WBS's have added a layer of complexity and management that isn't helpful. The cost of software is high, so let's add management to try and reduce our costs. On the surface, adding labor to reduce labor doesn't make sense.

Rather than waste time adding work, it would be better to introduce someone who can facilitate decision-making (i.e., a Scrum Master) and keep progress on track.

Incremental releases of partial solutions have more value than weekly status reports.

Meetings with product owners have more value than a carefully-written schedule for doing the poorly-understood process of detailed design.

Justifications

We can justify project management by saying that it somehow makes the software development process more efficient by eliminating "roadblocks" or "inefficiencies".

I used to believe. I no longer buy this.

Let's look at some candidate roadblocks that a project manager might smooth out.

User Involvement. Or rather, the lack of user involvement. I don't see how a PM does anything except nag the users. If the users aren't motivated to help with software development by answering questions or reviewing the results of a sprint, then the software isn't creating any value. Stop work now and find something the users really want.
Technical Resources. Coordinating technical resources (DBA's, sysadmins, independent testers, etc.) doesn't require a complex plan, status meetings or reports. It only requires some phone calls among the relevant folks. Directly.
Decision-Making. The PM isn't the product owner, nor are they a user, nor are they technically proficient enough to understand what's really at stake. Essentially, they only act as a facilitator in a conversation that they don't fully understand. That's fine, as long as they stick to facilitating and don't take on responsibilities that aren't assigned to them.

At this point, I can find a use for a facilitator ("Scrum Master"). But I can't see why we have such an emphasis on IT project management. The Agile folks seem to have it right. Reduce cost and complexity by actually reducing the cost and complexity. Not by adding management.

Wednesday, December 9, 2009

Hypothetical Designs and Numerosity

I love hypothetical questions. Got a bunch recently. I guess it's the season for hypotheticals.

These all seem to come from the "Averted Glance" school of management. The best part about the "I don't want to know the details" management is that we need to substitute metrics for understanding. One could also call this the "Numerosity" school of management. It's one step above numerology.

There is no substitute for hands-on work. Quantity leads directly to Quality. Bottom Line: touch the technology early and often.

Easier

I described the Sphinx production pipeline as "easier" than DocBook.

Someone asked for a definition of "easier". I had to research the definition of "easier" and found the Standard Information Management Process and Logical Effort index (SIMPLE). This index has a number of objective scoring factors for platform, language, toolset, performance, steps, problems encountered, rework and workaround, as well as the price of tea in China.

I find the SIMPLE index to be very useful for answering the random questions that arise when someone does not want to actually besmirch their fingers by touching the technology.

Considering that Sphinx and the DocBook processing components are both largely free, it seemed easier to me to actually rig them up and run them a few times to see how they work. But that relies on the undefined term "easier". To cut the Gordian Knot while keeping the eyes averted, one resorts to numerosity.

Cleaner and More Uniform

I described XML as cleaner and more uniform than LaTeX. (Even though I've switched to LaTeX because it produces better results.)

Someone asked for a definition of Cleaner and More Uniform. I tried using the Flesch-Kincaid Readability Index, but it was limited to the content and didn't work well for the markup. I tried using this calculator, but it barfed. So I invented my own index based on the Halstead complexity metrics.

I find the Halstead complexity to be very useful for answering random questions that arise when someone does not want to actually burden themselves with looking at the technology. I suppose actual examples of XML vs. LaTeX vs. RST would burn holes in the brain, running the risk of killing one of the few working brain cells.

Inheritance vs. Delegation

My favorite, however, is the question on "criteria for when to use / not use inheritance". Asking for criteria is the leading indicator of the Numerosity School of Design. How do I know this?

No Example.

Hypothetical questions never have practical class definitions. They either have no classes at all, or an overly simplified design based on Foo, Bar and Baz. Rather than actually write some code, we'll talk about what might go into writing some code.

The most important part of learning OO design is to actually do a lot of OO design. Code Kata works. Quantity translates directly to Quality.

Don't talk about writing code. Write code. Write lots of code. Write as much code as possible.

I'm revising my Building Skills in OO Design book to make it a better answer to the Inheritance vs. Delegation question. Do the exercises. At the end of the book, you'll know the answers.

Sadly

Sadly, the bulk of IT management does not believe in skill-building. Training is limited to one or two weeks out of 52 (just under 2% of your working life) and that's as often cancelled as granted. Any time spent trying something to see if it will work is aggressively punished ("Quit gold plating that solution and put some garbage into production right now! Learn on your own time. King Cnut Demands It.")

Monday, December 7, 2009

Mutability Analysis

First, there are several tiers of mutability in requirements. These tiers define typical levels of change to the context of the problem, the problem itself and the forces that select a solution to the problem.

1. Natural Laws (i.e., Gravity, Natural Selection). As well as metaphysical "laws" (i.e., reality). These don't change much. Sometimes we encapsulate this information with static final constants so we can use names to identify the constants. PI, E, seconds_per_minute, etc.
2. Legal Context (both statutory law and case law), as well as standards and procedures with the effect of law (i.e. GAAP). Most software products are implicitly constrained, and the constraints are so fundamental as to be immutable. They aren't design constraints, per se, they are constraints on the context space for the basic description of the problem. Like air, these are hard to see, and their effects are usually noted indirectly.
3. Industry. That is to say, industry practices and procedures which are prevalent, and required before we can be called a business in a particular industry. Practices and procedures that cannot be ignored without severe, business-limiting consequences. These are more flexible than laws, but as pervasive and almost as implicit. Some software will call out industry-specific features. Health-care packages, banking packages, etc., are explicitly tailored to an industry context.
4. Company. Constraints imposed by the organization of the company itself. The structure of ownership, subsidiaries, stock-holders, directors, trustees, etc. Often, this is reflected in the accounting, HR and Finance systems. The chart of accounts is the backbone of these constraints. These constraints are often canonized in customized software to create unique value based on the company's organization, or in spite of it.
5. Line of Business. Line of business changes stem from special considerations for subsets of vendors, customers, or products. Sometimes it is a combination of company organization and line of business considerations, making the relationship even more obscure. Often, these are identified as special cases in software. In many cases, the fact that these are special, abnormal cases is lost, and the "normal" case is hard to isolate from all the special cases. Since these things change, they often become opaque mysteries.
6. Operational Bugs and Workarounds. Some procedures or software are actually fixes for problems introduced in other software. These are the most ephemeral of constraints. The root cause is obscure, the need for the fix is hidden, the problem is enigmatic.

Of these, tiers 1 to 3 are modeled in the very nature of the problem, context and solution. They aren't modeled explicitly as constraints on problem X, or business rules that apply to problem X, they are modeled as X itself. These things are so hard to change that they are embodied in packaged applications from third parties that don't create unique business value, but permit engaging in business to begin with.

Tiers 4 to 6, however, might involve software constraints, explicitly packaged to make it clear. Mostly, these are procedural steps required to either expose or conceal special cases. Once in a while these become actual limitations on the domain of allowed data values.

Considerations.

After considering changes to the problem in each of these tiers, we can then consider changes to the solution. The mutation of the implementation can be decomposed into procedural mutation and data model mutation. The Zachman Framework gives us the hint that communication, people and motivation may also change. Often these changes are manifested through procedural or data changes.

Procedural mutation means programming changes. This implies that flexible software is required to respond to business changes, customer/vendor/product changes, and evolving workarounds for other IT bugs. Packaged solutions aren't appropriate ways to implement unique features of these lower tiers: the maintenance costs of changing a packaged solution are astronomical. Internally developed solutions that require extensive development, installation and configuration aren't appropriate either.

As we move to the lower and less constrained tiers, scripted solutions using tools like Python are most appropriate. These support flexible adaptation of business processes.

Data Model.

Data lasts forever; therefore, the data model mutations fall into two deeper categories: structural and non-structural.

When data values are keys (natural, primary, surrogate or foreign) they generally must satisfy integrity constraints (they must exist, or must not exist, or are mandatory or occur 0..m times). These are structural. The data is uninterpretable, incomplete and broken without them. When these change, it is either a profound change to the business or a long-standing bug in the data model. Either way the fix is expensive. These have to be considered carefully and understood fully.

When data values are non-key values, the constraints must be free to evolve. The semantics of non-key data fields are rarely fixed by any formalism. Changes to the semantics are rampant, and sometimes imposed by the users without resorting to software change. In the face of such change, the constraints must be chosen wisely.

"Yes, it says it's the number of days overdue, but it's really the deposit amount in pennies. They're both numbers, after all."

Mutability Analysis, then, seeks to characterize likely changes to requirements (the problem) as well as the data and processing aspects of the solution. With some care, this will direct the selection of solutions.

Focus.

It's important to keep mutability analysis in focus. Some folks are members of the Hand-Wringers School of Design, and consider every mutation scenario as equally likely. This is usually silly and unproductive, since their design work proceeds at a glacial pace while they overconsider the effects of fundamental changes to the company, the industry, the legal context and the very nature of reality itself.

Here's my favorite quote from a member of the HWSoD: "We don't know what we don't know."

This was used to derail a conversation on security in a small web application. Managers who don't know the technology very well are panicked by statements like this. My response was that we actually do know the relevant threat scenarios, just read the OWASP vulnerabilities. Yes, some new threat may show up. No, we don't need to derail work to counter threats that do not yet exist.

Progress.

The trick with mutability analysis is to do the following.

1. Time-box the work. Get something done. Make progress. A working design that is less than absolute perfection is better than no design at all. Hand-wringing over vanishingly unlikely futures is time wasted. Wasted. Create value as quickly as possible.

2. Work up from the bottom. Consider the tiers most likely to change first. Workarounds are likely to change. Features of the line of business might change. Company changes only matter if you've been specifically told the company is buying or for sale. Beyond that, it's irrelevant for the most part. ("But my software will change the industry landscape." No it won't. But if it is really novel, then delivery soon matters more than flexibility. If the landscape changes, you'll have to fundamentally rewrite it anyway.)

3. Name Names. Vague hand-waving mutation scenarios are useless. You must identify specific changes, and who will cause that change. Name the manager, customer, owner, stakeholder, executive, standards committee member, legislator or deity who will impose the change. If you can't name names, you don't really have a change scenario, you have hand-wringing. Stop worrying. Get something to work.

But What If I Do Something Wrong?

What? Is it correct? Is it designed to make optimal use of resources? Can you prove it's correct, or do you have unit tests to demonstrate that it's likely to be correct? Can you prove it's optimal? Move on. Maintainability and Adaptability are nice-to-have, not central.

Getting something to work comes first. When confronted with alternative competing, correct, optimal designs, adaptability and maintainability are a way to choose between them.

Thursday, December 3, 2009

The King Cnut School of Management

See this story of King Cnut ruling the waves.

The King Cnut School of Management is management by fiat. Declaring it so.

PM: "When will this transition to production?"

Me: "After the firewall and VM configuration."

PM: "So, can we say Thursday?"

Me: "You can say that, if you want, but you have no basis for that. The firewall hardware is sitting on the loading dock, and the RHEL VM's won't run Python 2.6 with the current SELinux settings. I have no basis for expecting this to be fixed in a week."

PM: "We can just pick a date, and then revise it."

Me: "Good plan. Pick a random date and complain when it's not met. While you're at it, hold the tide back for a few hours, too."

Monday, November 30, 2009

Python Book -- Thanks for the Bug Reports

I made some fundamental changes to the text processing pipeline. I think I've corrected all of the typographical and production problems. (Plus, I fixed some content errors, too.)

I've republished Building Skills in Python, in both HTML and PDF.

Hopefully, this is considerably better and more usable.

Next step -- revising the OO Design publication pipeline.

Thursday, November 26, 2009

Python Book -- Version 2.6

Completely revised the Building Skills in Python book.

It now covers Python 2.6 and is much, much easier to maintain in ReStructured Text markup, formatted with Sphinx and LaTeX (via TeXLive) than it was in XML.

XML -- while modern and clean and uniform -- isn't as convenient as LaTeX and RST.

Tuesday, November 24, 2009

Standard "Distributed" Database Issues

Here's a quote: "standard issues associated w/ a distributed db". And "There is the push versus pull of data. Say you use push and..." and more stuff after that.

First, by "Distributed Database", the question could mean almost anything. However, they provide the specific example of Oracle's Multi-Master Replication. That narrows the question somewhat.

This appears to mean that -- for them -- Distributed Database means two (or more) applications, two (or more) physical database instances and at least one class of entities which exist in multiple applications and are persisted in multiple databases.

That means multiple applications with responsibility for a single class of objects.

That breaks at least one fundamental design principle. Generally, a class has one responsibility. Now we have two implementations sharing some kind of responsibility for a single class of objects. Disentangling the responsibilities is always hard.

Standard Issues

There's one standard issue with this kind of distributed database. It is horribly complex and never worth it.

Never.

You broke the Single Responsibility Principle. You'll regret that.

The "distributed database" is like a spread sheet.

First, you have a problem that you think you can solve with a distributed database.

Now you have two problems.

Sensible Alternatives

There are two standard solutions to problems that appear to require a distributed database.

A data warehouse. Often, there is no actual state change that is part of a transactional workflow that moves back and forth between the applications. In most cases, the information needs to be merged for reporting and analysis purposes. Occasionally, this merged information is used for transactional processing, but that's easily handled by the dimensional bus feeding back to source applications.

An Enterprise Service Bus (ESB) and a Service-Oriented Architecture (SOA). The rest of the time, one has a "Distributed Transaction". This is better thought of as a Composite Application. A composite application is not part of any of the foundational ("distributed") applications; a composite is fundamentally different and of a higher level.

Stay Out Of That Box

In short, the "standard issues" with attempting a distributed database are often insurmountable. So don't try.

Pick a fundamentally simpler architecture like Composite Applications via an SOA using an ESB.

Yes, simpler. In the long run, a composite application exploits the foundational applications without invoking a magical two-way distributed coherence among multiple data stores. A composite application leverages the foundational applications by creating a higher-level workflow to pass data between the foundational applications as needed by the composite application.
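To make the composite idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the two service classes stand in for foundational applications, each the sole owner of its own data, and `place_order_with_credit_check` stands in for the higher-level workflow. A real ESB would put messaging between these pieces; plain method calls are used here only to show the shape.

```python
class CustomerService:
    """Hypothetical foundational application: sole owner of customer records."""

    def __init__(self):
        self._customers = {1: {"name": "Acme", "credit_limit": 500}}

    def get_customer(self, customer_id):
        # Return a copy so callers cannot mutate the owner's data.
        return dict(self._customers[customer_id])


class OrderService:
    """Hypothetical foundational application: sole owner of orders."""

    def __init__(self):
        self._orders = []

    def place_order(self, customer_id, amount):
        self._orders.append({"customer_id": customer_id, "amount": amount})
        return len(self._orders)  # order number

    def order_count(self):
        return len(self._orders)


def place_order_with_credit_check(customers, orders, customer_id, amount):
    """Composite workflow: passes data between the two services as needed."""
    customer = customers.get_customer(customer_id)
    if amount > customer["credit_limit"]:
        return None  # rejected; neither service's data is touched
    return orders.place_order(customer_id, amount)


customers, orders = CustomerService(), OrderService()
accepted = place_order_with_credit_check(customers, orders, 1, 200)
rejected = place_order_with_credit_check(customers, orders, 1, 900)
```

Neither service knows the other exists, and no class of objects is persisted in two places. The coordination lives entirely in the composite.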

Read any vendor article on any ESB and you'll see numerous examples of "distributed" databases done more simply (and more effectively) by ditching the concept of "distributed".

IBM, Oracle (which now owns Sun's JCAPS), JBoss, WSO2, OpenESB, Glassfish ESB

Thursday, November 19, 2009

On Risk and Estimating and Agile Methods

See The Question of Risk.

Also, see Lean Projects -- Not Deficient Projects.

And Keeping the Customer Satisfied.

These are notes for a long, detailed rant on the value of Agile methods.

One specious argument against an Agile approach is the "risk management" question. In this case, however, it becomes a "how much of a contingency budget should we write into the contract." Which isn't really risk management.

Sunday, November 15, 2009

ORM magic

The ORM layer is magic, right?

The ORM layer "hides" the database, right?

We never have to think about persistence, right? It just magically "happens."

Wrong.

Here are some quotes from a recent email:

"Somehow people are surprised that we would have performance issues. Somehow people are surprised that now that we are putting humpy/dumpy together that we would have to go back and look at how we have partitioned the system."

I'm not sure what all of that means except that it appears that the author thinks mysterious "people" think performance considerations are secondary.

I don't have a lot of technical details, just a weird ranting list of complaints, including the following.

"... the root cause of the performance issue was that each call to the component did a very small amount of work. So, they were having to make 10 calls to 10 different components to gather useful info. Even though each component calls was quick (something like 0.1 second), to populate the gui screen, they had to make 15 of them."

Read the following Stack Overflow questions: Optimizing this Django Code?, and Overhead of a Round-trip to MySql?

ORM Is A "Silver Bullet" -- It Solves All Our Problems

If you think that you can adopt some architectural component and then program without further regard for what that component actually does, stop coding now and find another job. Seriously.

If you think you don't have to consider performance, please save us from having to clean up your mess.

I'm repeatedly shocked at people who claim that some particular ORM (e.g., Hibernate) was unacceptable because of poor performance.

ORM's like Hibernate, iBatis, SQLAlchemy, Django ORM, etc., are not performance problems. They're solutions to specific problems. And like all solution technology, they're very easy to misuse.

Hint 1: ORM == Mapping. Not Magic. Mapping.

The mapping is from low-rent relational row-column (with no usable collections) to object instances. That's all. Just mapping rows to objects. No magic. Object collections and SQL foreign keys are cleverly exchanged using specific techniques that must be understood to be used.

Hint 2: Encapsulation != Ignorance. OO design frees us from "implementation details". This does not mean that it frees us from performance considerations. Performance is not an "implementation detail". The performance considerations of class encapsulation are central to the very idea of encapsulation.

One central reason we have object-oriented design is to separate performance from programming nuts and bolts. We want to be able to pick and choose alternative class definitions based on performance considerations.

ORM's Role.

ORM saves writing mappings from column names to class instances. It saves us from writing SQL. It doesn't remove the need to actually think about what's actually going on.

If an attribute is implemented as a property that actually does a query, we need to pay attention to this. We need to read the API documentation, know what features of a class do queries, and think about how to manage this.

If we don't know, we need to write experiments and spikes to demonstrate what is happening. Reading the SQL logs should be done early in the architecture definition.
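A spike of this kind can be tiny. The sketch below uses the standard-library `sqlite3` module instead of any particular ORM, and the `author` and `book` tables are made up for illustration. The trace callback plays the part of the SQL log: it records every statement the connection runs, so a loop that issues one query per row can be compared with a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO book VALUES (1, 1, 'x'), (2, 2, 'y'), (3, 3, 'z');
""")

sql_log = []
conn.set_trace_callback(sql_log.append)  # every executed statement lands here


def titles_one_query_per_row():
    """What a lazily loaded attribute does behind the scenes: a query per book."""
    books = conn.execute("SELECT title, author_id FROM book ORDER BY id").fetchall()
    result = []
    for title, author_id in books:
        row = conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()
        result.append((title, row[0]))
    return result


def titles_one_join():
    """The same data fetched in a single round trip."""
    return conn.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    ).fetchall()


lazy_result = titles_one_query_per_row()
lazy_count = len(sql_log)
del sql_log[:]
joined_result = titles_one_join()
joined_count = len(sql_log)
print(lazy_count, "statements versus", joined_count)
```

Both functions return the same rows. Only the log shows that one of them costs a round trip per book, which is the kind of thing that has to be found by looking, not by assuming.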

You can't write random code and complain that the performance isn't very good.

If you think you should be able to write code without thinking and understanding what you're doing, you need to find a new job.

Tuesday, November 10, 2009

Another HTML Cleanup

Browsers are required to skip over bad HTML and render something.

Consequently, many web sites have significant HTML errors that don't show up until you try to scrape their content.

Beautiful Soup has a handy hook for doing markup massage prior to parsing. This is a way of fixing site-specific bugs when necessary.

Here's a two-part massage I wrote recently that corrects two common (and show-stopping) HTML issues with quoted attribute values in a tag.


import re

# Fix style="background-image:url("url")"
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image( match ):
    # Swap the inner double quotes for &quot; entities so the attribute stays intact
    return 'background-image:url(&quot;%s&quot;)' % ( match.group(1) )

# Fix src="url name="name""
bad_img = re.compile( r'src="([^ ]+) name="([^"]+)""' )
def fix_bad_img( match ):
    # Close the src value and drop the doubled trailing quote
    return 'src="%s" name="%s"' % ( match.group(1), match.group(2) )

# Sequence of (compiled pattern, replacement function) pairs
fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]

The "fix_style_quotes" sequence is provided to the BeautifulSoup constructor as the markupMassage value.
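To see what the massage does without involving the parser, the pairs can be applied by hand. This is only a sketch: `apply_massage` is a hypothetical helper that runs each pattern's `sub` in order, and it is not Beautiful Soup's own code. The replacement is written with `&quot;` entities, which is an assumption about the intended text, and the broken sample markup is invented.

```python
import re

background_image = re.compile(r'background-image:url\("([^"]+)"\)')
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')


def fix_background_image(match):
    # Inner quotes become entities so the style attribute is not cut short.
    return 'background-image:url(&quot;%s&quot;)' % match.group(1)


def fix_bad_img(match):
    # Close the src value and drop the doubled trailing quote.
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))


fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]


def apply_massage(markup, massage):
    """Hypothetical helper: apply each (pattern, function) pair in order."""
    for pattern, replacement in massage:
        markup = pattern.sub(replacement, markup)
    return markup


broken = '<div style="background-image:url("a.png")"><img src="b.png name="logo""></div>'
cleaned = apply_massage(broken, fix_style_quotes)
print(cleaned)
```

Checking the massage on a captured fragment like this is quicker than re-scraping the site each time a pattern changes.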

Friday, November 6, 2009

BBEdit Configuration

After installing Python 2.6 in Mac OS X, I had problems with BBEdit not finding the right version of Python. It kept running an old 2.5 version.

I finally tracked down the BBEdit documentation, http://pine.barebones.com/manual/BBEdit_9_User_Manual.pdf.

Found this: "BBEdit expects to find Python in /usr/bin, /usr/local/bin, or /sw/bin. If you have installed Python elsewhere, you must create a symbolic link in /usr/local/bin pointing to your copy of Python in order to use pydoc and the Python debugger."

Checked in /usr/bin and found an old Python there. I think Fink did that. Removed it and BBEdit is much happier. As is Komodo Edit.

Wednesday, November 4, 2009

Parsing HTML from Microsoft Products (Like Front Page, etc.)

Ugh. When you try to parse MS-generated HTML, you find some extension syntax that is completely befuddling.

I've tried a few things in the past, none were particularly good.

In reading a file recently, I found that even Beautiful Soup was unable to prettify or parse it.

The document was filled with <!--[if...]>...<![endif]--> constructs that looked vaguely directive or comment-like, but still managed to stump the parser.

The BeautifulSoup parser has a markupMassage parameter that applies a sequence of regexps to the source document to clean up things that are baffling. Some things, however, are too complex for simple regexps. Specifically, these nested comment-like things were totally confusing.

Here's what I did. I wrote a simple generator which emitted the text that was unguarded by these things. The resulting sequence of text blocks could be assembled into a document that BeautifulSoup could parse.


import re

def clean_directives( page ):
    """
    Stupid Microsoft "Directive"-like comments!
    Must remove all <!--[if...]>...<![endif]--> sequences.  Which can be nested.
    Must remove all <![if...]>...<![endif]> sequences.  Which appear to be the nested version.
    """
    if_endif_pat= re.compile( r"(<!-*\[if .*?\]>)|(<!\[endif\]-*>)" )
    context= []   # stack of [if...] markers that are still open
    start= 0      # start of the current unguarded text; None while inside a directive
    for m in if_endif_pat.finditer( page ):
        if "[if" in m.group(0):
            if start is not None:
                # Emit the unguarded text that precedes the outermost [if
                yield page[start:m.start()]
            context.append(m)
            start= None
        elif "[endif" in m.group(0):
            context.pop(-1)
            if len(context) == 0:
                # Resume immediately after the outermost [endif
                start= m.end()
    if start is not None:
        yield page[start:]
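The same idea can be written with a depth counter in place of a stack. What follows is a variant sketch and not the code above: `strip_conditional_blocks` is a hypothetical name, it returns one joined string where the original is a generator, and it quietly ignores a stray `[endif]` instead of raising. The sample input is invented.

```python
import re

# Group 1 matches an opening [if ...] marker, group 2 a closing [endif] marker.
MARKER = re.compile(r"(<!-*\[if .*?\]>)|(<!\[endif\]-*>)")


def strip_conditional_blocks(page):
    """Return page with every (possibly nested) [if]...[endif] block removed."""
    kept = []
    depth = 0   # how many [if ...] markers are currently open
    start = 0   # where the current unguarded text began
    for m in MARKER.finditer(page):
        if m.group(1):
            if depth == 0:
                kept.append(page[start:m.start()])
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                start = m.end()
    if depth == 0:
        kept.append(page[start:])
    return "".join(kept)


sample = "A<!--[if gte mso 9]><xml><![if !vml]>x<![endif]></xml><![endif]-->B"
print(strip_conditional_blocks(sample))
```

The result is ordinary markup that can be handed to the parser as a single document.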

Stored Procedures and Ad Hominem Arguments

The question of "Stored Procedures and Triggers" comes up fairly frequently.

Over the years (since the 90's, when stored procedures were introduced to Oracle) I've learned precisely how awful a mistake this technology is.

I've seen numerous problems that have stored procedures as their root cause. I'll identify just a few. These are not "biases" or "opinions". These are experience.

The "DBA as Bottleneck" problem. In short, the DBA's take projects hostage while the development team waits for stored procedures to be written, corrected, performance tuned or maintained.
The "Data Cartel" problem. The DBA's own parts of the business process. They refuse (or complicate) changes to fundamental business rules for obscure database reasons.
The "Unmaintainability" problem. The stored procedures (and triggers) have reached a level of confusion and complexity that means that it's easier to drop the application and install a new one.
The "Doesn't Break the License" problem. For some reason, the interpreted and source-code nature of stored procedures makes them the first candidate for customization of purchased applications. Worse, the feeling is that doing so doesn't (or won't) impair the support agreements.

When I bring these up, I wind up subject to weird ad hominem attacks.

I've been told (more than once) that I'm not being "balanced" and that "there are pros and cons on both sides". This is bunk. I have plenty of facts. Stored procedures create a mess. I've never seen any good come from stored procedures.

I don't use GOTO's haphazardly. I don't write procedural spaghetti code. No one says that I should be more "balanced."

I don't create random database structures with 1NF, 2NF and 3NF violations in random places. No one says I should be more "balanced".

Indeed, asking me to examine my bias is an ad hominem argument. My fact-based experience with stored procedures is entirely negative.

But when it comes to stored procedures, there's a level of defensiveness that defies my understanding. I assume Oracle, IBM and Microsoft are paying kickbacks to DBA's to support stored procedures and PL/SQL over the more sensible alternatives.

Saturday, October 31, 2009

Open Source in the News

Whitehouse.gov is publicly using open source tools.

See Boing Boing blog entry. Plus Huffington Post blog entry.

Most importantly, read this from O'Reilly.

Many places are using open source in stealth mode. Some even deny it. Ask your CIO what the policy on open source is, then check to see if you're using Apache. Often, this is an oops -- policy says "no", practice says "yes".

For a weird perspective on open source, read Binstock's Integration Watch piece in SD Times: From Open Source to Commercial Quality. "rigor is the quality often missing from OSS projects". "Often" missing? I guess Binstock travels in wider circles and sees more bad open source software than I do. The stuff I work with is very, very high quality. Python, Django, Sphinx, the LaTeX stack, PIL, Docutils -- all seem to be outstandingly good.

I guess "rigor" isn't an obvious tangible feature of the software. Indeed, I'm not sure what "commercial quality" means if open source lacks this, also.

All of the commercial software I've seen has been in-house developed stuff. Because there's so much of it, it must have these elusive "rigor" and "commercial quality" features that Binstock values so highly. Yet, the software is really bad: it barely works and they can't maintain it.

My experience is different from Binstock's. Also, most of the key points in his article are process points, not software quality issues. My experience is that the open source product exceeds commercial quality. Since there's no money to support a help desk or marketing or product ownership, the open source process doesn't offer all the features of a commercial operation.

Wednesday, October 28, 2009

Painful Python Import Lessons

Python's packages and modules are -- generally -- quite elegant.

They're relatively easy to manage. The __init__.py file (to make a module into a package) is very elegant. And stuff can be put into the __init__.py file to create a kind of top-level or header module in a larger package of modules.

To a limit.

It took hours, but I found the edge of the envelope. The hard way.

We have a package with about 10 distinct Django apps. Each Django app is -- itself -- a package. Nothing surprising or difficult here.

At first, just one of those apps used a couple of fancy security-related functions to assure that only certain people could see certain things in the view. It turns out that merely being logged in (and a member of the right group) isn't enough. We have some additional context choices that you must make.

The view functions wind up with a structure that looks like this.

@login_required
def someView( request, object_id, context_from_URL ):
   no_good = check_other_context( context_from_URL )
   if no_good is not None: return no_good
   still_no_good = check_session()
   if still_no_good is not None: return still_no_good
   # you get the idea

At first, just one app had this feature.

Then, it grew. Now several apps need to use check_session and check_other_context.

Where to Put The Common Code?

So, now we have the standard architectural problem of refactoring upwards. We need to move these functions somewhere accessible. It's above the original app, and into the package of apps.

The dumb, obvious choice is the package-level __init__.py file.

Why this is dumb isn't obvious -- at first. This file is implicitly imported. Doesn't seem like a bad thing. With one exception.

The settings.

If the settings file is in a package, and the package-level __init__.py file has any Django stuff in it -- any at all -- that stuff will be imported before your settings have finished being imported. Settings are loaded lazily -- as late as possible. However, in the process of loading settings, there are defaults, and Django may have to use those defaults in order to finish the import of your settings.

This leads to the weird situation that Django is clearly ignoring fundamental things like DATABASE_ENGINE and similar settings. You get the dummy database engine. Yet, a basic from django.conf import settings; print settings.DATABASE_ENGINE shows that you should have your expected database.

Moral Of the Story

Nothing with any Django imports can go into the package-level __init__.py files that may get brought in while importing settings.
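The underlying import-order rule can be shown without Django at all. This sketch builds a throwaway package on disk (the name `demo_site` and the flag `SETTINGS_LOADED_AT_INIT` are made up for illustration) and records, from inside `__init__.py`, whether the settings module had been loaded yet at that moment.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package: demo_site/__init__.py and demo_site/settings.py
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demo_site")
os.mkdir(package_dir)

with open(os.path.join(package_dir, "__init__.py"), "w") as f:
    f.write(
        "import sys\n"
        "# Was the settings module already loaded when this file ran?\n"
        "SETTINGS_LOADED_AT_INIT = 'demo_site.settings' in sys.modules\n"
    )

with open(os.path.join(package_dir, "settings.py"), "w") as f:
    f.write("DATABASE_ENGINE = 'sqlite3'\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

# Asking only for the settings module still runs the package __init__.py first.
settings = importlib.import_module("demo_site.settings")
package = sys.modules["demo_site"]

print(package.SETTINGS_LOADED_AT_INIT)  # False
print(settings.DATABASE_ENGINE)
```

Anything placed in the package-level `__init__.py` runs at that early point, before a single line of the settings file has been executed. Shared helpers that import Django are safer in an ordinary module inside the package, imported explicitly by the apps that need them.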

Monday, October 26, 2009

Process Not Working -- Must Have More Process

After all, programmers are all lazy and stupid.

Got this complaint recently.

"Developers on a fairly routine basis check in code into the wrong branch."

Followed by a common form of the lazy and stupid complaint. "Someone should think about which branch is used for what and when." Clearly "someone" means the programmers and "should think about" means they are stupid.

This was followed by the "more process will fix this process problem" litany of candidate solutions.

"Does CVS / Subversion have a knob which provides the functionality to prevent developers from checking code into a branch?"

"Is there a canonical way to organize branches?" Really, this means something like "What are the lazy, stupid programmers doing wrong?"

Plus there were rhetorical non-questions to emphasize the lazy, stupid root cause. "Why is code merging so hard?" (Stupid.) "If code is properly done and not coupled, merging should be easy?" (Lazy; a better design would prevent this.) "Perhaps the developers don't understand the code and screw up the merge?" (Stupid.) "If the code is not coupled, understanding should be easy?" (Both Lazy and Stupid.)

Root Cause Analysis

The complaint is about process failure. Tools do not cause (or even contribute to) process failure. There are two possible contributions to process failure: the process and the people.

The process could be flawed. There could be no earthly way the programmers can locate the correct branch because (a) it doesn't exist when they need it or (b) no one told them which branch to use.

The people could be flawed. For whatever reason, they refuse to execute the process. Perhaps they know a better way, perhaps they're just being jerks.

Technical means will not solve either root cause problem. It will -- generally -- exacerbate it. If the process is broken, then attempting to create CVS / Subversion "controls" will end in expensive, elaborate failure. Either they can't be made to work, or (eventually) someone will realize that they don't actually solve the problem. On the other hand, if the people are broken, they'll just subvert the controls in more interesting, silly and convoluted ways.

My response -- at the time -- was not "analyze the root causes". When I first got this, I could only stare at it dumbfounded. My answer was "You're right, your developers are lazy and stupid. Good call. Add more process to overcome their laziness and stupidity."

After all, the questioner clearly knows -- for a fact -- that more process helps fix a broken organization. The questioner must be convinced that actually talking to people will never help.

The question was not "what can I do?" The question was "can I control these people through changes to CVS?" There's a clear presumption of "process not working -- must have more process."

The better response from me should have been: "Ask them what the problem is." I'll bet dollars against bent pins that no one tells them which branch to use in time to start work. I'll bet they're left guessing. Also, there's a small chance that these are off-shore developers and communication delays make it difficult to use the correct branch. There may be no work-orders, just informal email "communication" between on-shore and off-shore points-of-contact (and, perhaps, the points-of-contact aren't decision-makers).

Bottom Line. If people can't make CVS work, someone needs to talk to them to find out why. Someone does not need to invent more process to control them.

Thursday, October 22, 2009

Breaking into Agile

I had a recent conversation with some folks who were desperate to "processize" everything. They were asking about Scrum Master certification and what standards organizations define the "official" Scrum method.

Interestingly, I also saw a cool column in Better Software magazine, called "Scrumdamentalism" on the same basic question.

In my conversation, I referred them to the Agile Manifesto. My first point was that process often gets in the way of actual progress. Too much process focus lifts up "activity" in place of "accomplishment".

My second point, however, was that the Agile Manifesto and the Scrum method are responses to a larger problem. Looking for a process isn't an appropriate response to the problem.

The One True Scrum Quest

Claiming that there's one true Scrum method and everything else is "not scrum" is an easy mental habit. The question gets asked on Stack Overflow all the time. The questions are usually one of two kinds.

What's the "official" or "best practice" Scrum method and how do I define a process that rigidly enforces this for my entire team of 20?
We are doing our design/code/test/integration/release in a way that diverges from the "official" form in the Ken Schwaber and Mike Beedle book. Or it diverges from the Eclipse version. Or it diverges from the Control Chaos overview. Or the Mountain Goat version. Or the C2 Wiki version. Or this version. Is it okay to diverge from the "standard"?

Sigh. The point of Agile is that we should value "Individuals and interactions over processes and tools". The quest for "One True Scrum" specifically elevates the process above the people.

In The Real World

The biggest issue is that the Agile Manifesto is really a response to some fundamental truths about software development.

In the management fantasy world, a "project" has a fixed, definite, limited, clearly articulated scope. From this fixed scope, we can then document "all" the requirements (business and technical). This requirements document is (a) testable against the scope, (b) necessary for all further work and (c) sufficient for design, code, test and transition to production. That's not all. And -- in order to make a point later on -- I'll continue to enumerate the fantasies. The fantasy continues that someone can create a "high-level design" or "specification" that is (a) testable against the requirements, (b) necessary for all further work and (c) sufficient to code, test and transition to production. We can then throw this specification over the transom into another room where a "coder" will "cut code" that matches the specification. The code production happens at a fixed, knowable rate with only small random variation based on risk of illness. The testing, similarly, can be meticulously scheduled and will happen precisely as planned. Most "real-world" (management fantasy) projects do not leave any time for rework after testing -- because rework won't happen. If it won't happen, why test? Finally, there will be no technology transfer issues because putting a freshly-written program into production is the same as installing a game from a DVD.

Managers like to preface things with "In The Real World". As in "In The Real World we need to know how long it will take you to write this."

The "in the real world" speech always means "In My Management Fantasy Land." The reason it's always a fantastic speech is because software development involves the unknowable. I'm not talking about some variable in a formula with a value that's currently unknown. I'm talking about predicting the unknowable future.

The Agile Response to Reality

In the Real real world, software development is extraordinarily hard.

Consider this: the computer clock runs in 100-nanosecond increments (1.0E-7). We expect an application to run 24x7 at 99.9% availability: 7x24x3600x0.999 = 6.04E5 seconds per week. That's from 100-nano to over half a million: roughly 12 to 13 orders of magnitude to keep in one's head.

Consider this: storage in a largish application may span almost a terabyte (1.0E12). From bytes to terabytes: about 12 orders of magnitude to keep in one's head.
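The two ranges above are easy to check. A quick sketch of the arithmetic, assuming a one-week window at 99.9% availability for the time dimension and a single byte as the low end of the storage dimension:

```python
import math

clock_tick = 1.0e-7                     # 100 nanoseconds, in seconds
week_of_uptime = 7 * 24 * 3600 * 0.999  # seconds in a week at 99.9% availability

# Orders of magnitude between the smallest and largest quantity in each dimension
time_span = math.log10(week_of_uptime / clock_tick)
storage_span = math.log10(1.0e12 / 1.0)  # one byte up to a terabyte

print("uptime seconds: %.3g" % week_of_uptime)
print("time span: %.1f orders of magnitude" % time_span)
print("storage span: %.1f orders of magnitude" % storage_span)
```

The time span comes out a little under 13 and the storage span at 12, so "about 12 orders of magnitude" is the right ballpark for both dimensions.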

Consider this: a web application written in a powerful framework (Django) requires one to know the following languages and frameworks. Shell script, Apache Config, Python, Django Templates, SQL, HTML, CSS, Javascript, HTTP (the protocol is its own language), plus the terminology of the problem domain. That's 9 distinct languages. We also have the OS, TCP/IP, Apache, mod_wsgi, Django, Python, browser and our application as distinct frameworks. That's 8 distinct framework API's to keep in one's head.

Consider this: the users can't easily articulate their problem. The business analyst is trying to capture enough information to characterize the problem. The users, the analyst, the project manager (and others outside the team) all have recommendations for a solution, polluting the problem description with "solution speak" that only adds confusion.

In the Management Fantasy "Real World", this is all knowable and simple. In the Real Real World, this is rather hard.

Adapting to Reality

Once we've recognized that software development is hard, we have several responses.

Deny. Claim that software developers are either lazy or stupid (or both). Give them pep-talks that begin "in the real world" and hope that they cough up the required estimates because they're motivated by being told that software development "in the real world" isn't all that hard.
Processize(tm). Claim that software development is a process that can be specified to a level where even lazy, stupid programmers can step through the process and create consistent results.
Adapt. Adapting to the complexity of software development requires asking, "what -- if anything -- expedites software development?"

What Do We Need to Succeed?

There are essentially two domains of knowledge required to create software: the problem domain and the solution domain.

Problem Domain. This is the "business rules", the "scope", the "requirements", the "purpose", etc. We have the features and functions. What the software does. The value it creates. The "what", "who", "where", "when" and "why".
Solution Domain. This is the technology that makes it go. The time and space dimensions (all 12 orders of magnitude in each dimension), all the languages and all the frameworks. The "how".

The issue is this:

We don't start out with complete knowledge of problem and solution.

At the start of the project -- when we're asked to predict the future -- we can never know the whole problem, nor can we ever know the whole solution we're about to try and build.

What we need is this:

Put Problem Domain and Solution Domain knowledge into one person's head.

The question then becomes "Whose head?"

We have two choices:

Non-Programmers. We can try to teach the various non-programmers all the solution domain stuff. We can make the project manager, business analyst, end-users, executive sponsor -- everyone -- into programmers so that they have problem domain and solution domain knowledge.
Programmers. We can try to impart the problem domain knowledge to the programmers. If we're seriously going to do this, we need to remove the space between programmer and problem.

That's the core of the Agile Response: Close the gap between Problem Domain and Solution Domain by letting programmers understand the problem.

The Bowl of Bananas Solution(tm)

"But wait", managers like to say, "in the real world, we can't just let you play around until you claim you're done. We have to monitor your activity to make sure that you're making 'progress' toward a 'solution'."

In the Real real world, you can't define the "problem", much less test whether anything is -- or is not -- a solution. I could hand most managers a bowl of bananas and they would not be able to point to any test procedure that would determine if the bowl of bananas solves or fails to solve the user's problems.

Most project scope documents, requirements documents, specifications, designs, etc., require extensive tacit problem domain knowledge to interpret them. Given a bowl of bananas, the best that we can do is say "we still have the problem, so this isn't a solution." Our scope statements and requirements and test procedures all make so many assumptions about the problem and the solution that we can't even figure out how to evaluate an out-of-the-box response -- like a bowl of bananas.

In the Real real world, management in organization A demands that information be kept in one database. Management in organization B has a separate database for reasons mired in historical animosity and territorial scent-marking. Management in yet another organization wants them "unified" or "reconciled" and demands that someone manually put the data into spreadsheets. This morphs into requirements for a new application "system" to unify this data, making the results look like poorly-designed spreadsheets. This morphs into a multi-year project to create a "framework" for data integration that maintains the poorly-designed spreadsheet as part of the "solution".

A quick SQL script to move data from A to B (or B to A) is the bowl-of-bananas solution. It cannot be evaluated (or even considered) because it isn't a framework, system or application as specified in the scope document for the data integration framework.

This is the problem domain knowledge issue. It's so hard to define the problem that we can't trust the executive sponsor, the program office, the project managers, the business analysts or anyone to characterize the problem for the developers.

The problem domain knowledge is so important that we need to allow programmers to interact with users so that both the problem and the solution wind up in the programmer's head.

Wednesday, October 21, 2009

Unit Test Naming [Updated]

Just stumbled across several blog postings on unit test naming.

Essentially the TestCase will name the fixture. That's pretty easy to understand.

The cool part is this: each test method is a two-part clause: condition_"should"_result or "when"_condition_"then"_result.

See https://wiki.openmrs.org/display/docs/Unit+Testing+With+at-should+Annotation

Or possibly "method_state_behavior".

See http://osherove.com/blog/2005/4/3/naming-standards-for-unit-tests.html
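Here's a small sketch of what those naming styles might look like with Python's unittest. The Stack class and every test name here are invented for illustration; they aren't taken from either of the linked posts.

```python
import unittest

class Stack:
    """Hypothetical class under test, invented for this illustration."""
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()
    def is_empty(self):
        return not self._items

class EmptyStack(unittest.TestCase):
    """The TestCase names the fixture: a freshly created, empty stack."""
    def setUp(self):
        self.stack = Stack()

    # condition_"should"_result
    def test_new_stack_should_be_empty(self):
        self.assertTrue(self.stack.is_empty())

    # "when"_condition_"then"_result
    def test_when_item_pushed_then_not_empty(self):
        self.stack.push(42)
        self.assertFalse(self.stack.is_empty())

    # method_state_behavior
    def test_pop_empty_raises_index_error(self):
        with self.assertRaises(IndexError):
            self.stack.pop()
```

When one of these fails, the fixture name plus the method name reads like a sentence describing what broke.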

What a handy way to organize test cases. Only took me four years to figure out how important this kind of thing is.

[Updated to follow moved links.]

Friday, October 16, 2009

Django Capacity Planning -- Reading the Meta Model

I find that some people spend way too much time doing "meta" programming. I prefer to use someone's framework rather than (a) write my own or (b) extend theirs. I prefer to learn their features (and quirks).

Having disclaimed an interest in meta programming, I do have to participate in capacity planning.

Capacity planning, generally, means canvassing applications to track down disk storage requirements.

Back In The Day

Back in the day, when we wrote SQL by hand, we were expected to carefully plan all our table and index use down to the kilobyte. I used to have really sophisticated spreadsheets for estimating -- to the byte -- Oracle storage requirements.

Since then, the price of storage has fallen so far that I no longer have to spend a lot of time carefully modelling the byte-by-byte storage allocation. The price has fallen so fast that some people still spend way more time on this than it deserves.

Django ORM

The Django ORM obscures the physical database design. This is a good thing.

For capacity planning purposes, however, it would be good to know row sizes so that we can multiply by expected number of rows and cough out a planned size.

Here's some meta-data programming to extract Table and Column information for the purposes of size estimation.


import sys
from django.conf import settings
from django.db.models.base import ModelBase

class Table( object ):
   def __init__( self, name, comment="" ):
       self.name= name
       self.comment= comment
       self.columns= {}
   def add( self, column ):
       self.columns[column.name]= column
   def row_size( self ):
       return sum( self.columns[c].size for c in self.columns ) + 1*len(self.columns)

class Column( object ):
   def __init__( self, name, type, size ):
       self.name= name
       self.type= type
       self.size= size

sizes = {
   'integer': 4,
   'bool': 1,
   'datetime': 32,
   'text': 255,
   'smallint unsigned': 2,
   'date': 24,
   'real': 8,
   'integer unsigned': 4,
   'decimal': 40,
}
def get_size( db_type, max_length ):
   if max_length is not None:
       return max_length
   return sizes[db_type]

def get_schema():
   tables = {}
   for app in settings.INSTALLED_APPS:
       print app
       try:
           __import__( app + ".models" )
           mod= sys.modules[app + ".models"]
           if mod.__doc__ is not None:
               print mod.__doc__.splitlines()[:1]
           for name in mod.__dict__:
               obj = mod.__dict__[name]
               if isinstance( obj, ModelBase ):
                   t = Table( obj._meta.db_table, obj.__doc__ )
                   for fld in obj._meta.fields:
                       c = Column( fld.attname, fld.db_type(), get_size(fld.db_type(), fld.max_length) )
                       t.add( c )
                   tables[t.name]= t
       except AttributeError, e:
           print e
   return tables

if __name__ == "__main__":
   tables = get_schema()
   for t in tables:
       print t, tables[t].row_size()

This shows how we can get table and column information without too much pain. This will report an estimated row size for each DB table that's reasonably close.

You'll have to add storage for indexes, also. Further, many databases leave free space within each physical block, making the actual database much larger than the raw data.

Finally, you'll need extra storage for non-database files, logs and backups.
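To turn row sizes into a planned size, multiply by the expected row counts and pad for indexes and free space. This is only a sketch: the table names, row counts, index overhead and fill factor below are made-up numbers, not measurements from any real database.

```python
def planned_bytes(row_size, expected_rows, index_overhead=0.30, fill_factor=0.80):
    """Rough storage estimate for one table.

    index_overhead and fill_factor are assumed values for illustration;
    replace them with figures appropriate to your database.
    """
    raw = row_size * expected_rows
    with_indexes = raw * (1 + index_overhead)
    # Free space left in each physical block inflates the on-disk size.
    return int(with_indexes / fill_factor)

# Hypothetical row sizes (the kind of number the script above reports)
# and hypothetical expected row counts.
row_sizes = {"app_customer": 310, "app_order": 96}
expected_rows = {"app_customer": 50_000, "app_order": 2_000_000}

for name in sorted(row_sizes):
    size = planned_bytes(row_sizes[name], expected_rows[name])
    print(name, round(size / 2**20, 1), "MiB")
```

Non-database files, logs and backups still have to be added on top of whatever this produces.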

Wednesday, October 14, 2009

Unit Testing in C

I haven't written new C code since the turn of the millennium. Since then it's been almost all Java and Python. Along with Java and Python come JUnit and Python's unittest module.

I've grown completely dependent on unit testing.

I'm looking at some C code, and I want a unit testing framework. For pure C, I can find things like CuTest and CUnit. The documentation makes them look kind of shabby. Until I remembered what a simplistic language C is. Considering what they're working with, they're actually very cool.

I found a helpful posting on C++ unit testing tools. It provided some insight into C++. But this application is pure C.

I'm interested in replacing the shell script in CuTest with a Python application that does the same basic job. That's -- perhaps -- a low-value add-on. Perhaps I should look at CUnit and stay away from replacing the CuTest shell script with something a bit easier to maintain.

Monday, October 12, 2009

Sometimes the universe appears multidimensional -- but isn't

Had a knock-down drag-out fight with another architect recently over "status" and "priority".

She claimed that the backlog priority and the status were the same thing. I claimed that you can easily have this.

Priority: 1, Status: Not Started

Priority: 2, Status: In Process

Priority: 3, Status: Completed

See? It's obvious that they're independent dimensions.

She said that it's just as obvious that you're doing something wrong.

Here's her point:

If you have priority 1 items that aren't in process now, then they're really priority 2. Fix them to honestly say priority 2.
If you have priority 2 items that "somehow" jumped ahead of priority 1 items, they were really priority 1. Fix them to say priority 1. And don't hand her that "in the real world, you have managers or customers that invert the priorities". Don't invert the priorities, just change them and be honest about it.
The only items that are done must have been priority 1, passed through an "in-process" state and then got finished. Once they're done, they're not priority 1 any more. They're just done.
Things that hang around in "in-process, not done" have two parts. The part that's done, and some other part that's in the backlog and not priority 1.

She says that priority and status are one thing with the following values.

Done.
Priority 1 = in process right now.
Priority 2 = will be in process next. Not eventually. Next.
Priority 3 through ∞ = eventually, in order by priority.

Any more complex scheme is simply misleading (Priority 1 not being done right now? Is it a resource issue? A priority issue? Why aren't you doing it?)
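One way to picture her model is a single ordered backlog where the status is derived from position, plus a separate pile of done items. This is my own sketch of the idea, not anything she wrote; the item names and the function names are invented.

```python
def status(position):
    """Derive the status from the position in the ordered backlog (1 = top)."""
    if position == 1:
        return "in process right now"
    if position == 2:
        return "in process next"
    return "eventually"

backlog = ["fix login", "add export", "new report", "rename fields"]  # invented items
done = []

def finish_top():
    """The top item is done; everything behind it moves up one place."""
    done.append(backlog.pop(0))

for position, item in enumerate(backlog, start=1):
    print(position, item, status(position))
```

There's no separate status attribute to get out of sync with the priority. Changing priorities means reordering the list, and that's all.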

Tuesday, October 6, 2009

Flattening Nested Lists -- You're Doing It Wrong

On StackOverflow you can read numerous questions on "flattening" nested lists in Python.

They all have a similar form.

"How do I flatten this list [ [ 1, 2, 3 ], [ 4, 5, 6 ], ... , [ 98, 99, 100 ] ]?"

The answers include list comprehensions, itertools, and other clever variants.

~~All~~ Much of which is ~~simply wrong~~ inappropriate.

You're Doing it Wrong

The only way to create a nested list is to append a list to a list.

theList.append( aSubList )

You can trivially replace this with the following

theList.extend( aSubList )

Now, your list is created flat. If it's created flat, you never need to flatten it.
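A tiny side-by-side comparison makes the difference visible. The sample sublists are invented for illustration.

```python
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # sample sublists, invented for illustration

nested = []
for sub in rows:
    nested.append(sub)     # adds the sublist itself as one new item

flat = []
for sub in rows:
    flat.extend(sub)       # adds each element of the sublist

print(nested)   # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(flat)     # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```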

Obscure Edge Cases

Sometimes it may be necessary to have both a flattened and an unflattened list. I'm unclear on when or how this situation arises, but this may be the edge case that makes some of itertools handy.

For the past 3 decades, I've never seen the "both nested and not nested" use case, so I can't fathom why or how this would arise.

Visiting a Tree

Interestingly, a tree visitor has a net effect somewhat like "flattening". However, it does not actually create an intermediate flat structure. It simply walks the structure as it exists. This isn't a proper use case for transforming a nested list structure to a flat structure. Indeed, this is a great example of why nested structures and flat structures are quite separate use cases.
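A minimal sketch of such a visitor, assuming the tree is built from ordinary Python lists with non-list leaves. The function name and sample data are invented for illustration.

```python
def visit(node):
    """Yield each leaf of a nested list, walking the structure as it exists."""
    if isinstance(node, list):
        for child in node:
            yield from visit(child)
    else:
        yield node

tree = [1, [2, [3, 4]], [[5], 6]]   # sample data, invented for illustration
total = sum(visit(tree))             # consumes leaves one at a time; no flat list is built
print(total)                         # 21
```

The nested structure is left untouched; the generator hands back one leaf at a time to whatever is consuming it.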

Monday, October 5, 2009

Code Kata : Analyze A Hard Drive

This isn't computer forensics; it's something much simpler.

A colleague has been struck down with a disease (or won the lottery) and won't be back to work any time soon. Worse, they did not use SVN to do daily check-ins. Their laptop has the latest and greatest. As well as all experiments, spike solutions, and failed dead-ends.

You have their hard drive mounted in an enclosure and available as /Volumes/Fredslaptop or F: if you're a Windows programmer.

There are, of course, thousands of directories. Not all of which are terribly useful.

Step 1 - find the source. Write a small utility to locate all directories which contain "source". Pick a language you commonly work in. For C programmers, you might be looking for .c, .cpp, .h, and .hpp files. For Python programmers, you're looking for .py files. For Java programmers, you're looking for .java, and .class files.

Step 2 - get information. For each directory that appears to have source, we want to know the number of source files, the total number of lines in all those source files, and the most recent modification time for those files. This is a combination of the output from wc and ls -t.

Step 3 - produce a useful report. To keep your team informed, create a .CSV file, which can be loaded into a spreadsheet that summarizes your findings.
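The whole point of a kata is to work it yourself, so treat the following as one possible sketch and not the answer. It assumes we're hunting for Python source only; the function names, column names and suffix list are my own choices.

```python
import csv
import os
import time

SOURCE_SUFFIXES = (".py",)   # assumption: we're hunting for Python source

def source_directories(root):
    """Yield (directory, file_count, line_count, latest_mtime) for each
    directory under root that directly contains at least one source file."""
    for dirpath, dirnames, filenames in os.walk(root):
        paths = [os.path.join(dirpath, f) for f in filenames
                 if f.endswith(SOURCE_SUFFIXES)]
        if not paths:
            continue
        lines = 0
        for path in paths:
            # Read as bytes so odd encodings don't stop the scan.
            with open(path, "rb") as source:
                lines += sum(1 for _ in source)
        latest = max(os.path.getmtime(p) for p in paths)
        yield dirpath, len(paths), lines, latest

def write_report(root, report_path):
    """Write one CSV row per source directory."""
    with open(report_path, "w", newline="") as target:
        writer = csv.writer(target)
        writer.writerow(["directory", "files", "lines", "last_modified"])
        for dirpath, count, lines, latest in source_directories(root):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(latest))
            writer.writerow([dirpath, count, lines, stamp])
```

Something like write_report("/Volumes/Fredslaptop", "fred_source.csv") would produce the spreadsheet-ready file; the report file name is hypothetical.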

Friday, October 2, 2009

Agile Methods and "Total Cost"

Many folks ask about Agile project planning and total cost. As our internal project managers wrestle with this, there are a lot of questions.

Mostly these questions are rejections of incremental delivery ("All or Nothing") or rejections of flexibility ("Total Total Cost"). We'll look at these rejections in detail.

Traditional ("waterfall") project planning creates a master plan, with all skills, all tasks, all effort, and all costs. It was easy to simply add it up to a total cost.

Software development, unlike -- for example -- carpentry, has serious unknowns. Indeed software development has so many unknowns that it's not possible to compare software project management with the construction trades.

A carpenter has a task ("frame up these rooms") that has an absolute boundary with no unknown deliverables. No one says things like "we need to separate the functions of these users from those users." They say "build a wall, surface dry-wall, tape, paint, add molding." The carpenter measures, and knows precisely the materials required.

The carpenter rarely has new technology. The pace of change is slow. A carpenter may switch from hand-held nails to a nail gun. It's still nails. The carpenter may switch from wooden 2x4's to metal supports. It's still vertical members and nails. The carpenter may switch brands of wall-board. It's still wall-board.

The consequence of this is that -- for software projects -- Total Cost Is Hard To Predict.

Hack-Arounds

Total cost is hard to predict, but we try to do it anyway. What we do is add "risk factors" to inflate our estimate. We add risk factors for the scope of delivery. We add risk factors for our ability to deliver.

We can organize these risk factors into several subtle buckets. The COCOMO model breaks scope down into three Product Attributes and four Hardware Attributes. It breaks delivery down into five Personnel Attributes and three Project Attributes.

This is a hack-around because we simply cannot ever know the final scope, nor can we ever know our ability to deliver. We can't know our ability to deliver because the team is constantly changing. We should not cope with this expected constant state of flux by writing an elaborate plan and then reporting our failure to meet that plan. That's stupid.

Worse still, we can't know the scope because it's usually a fabric of lies.

Scope Issue 1: "Required"

Customers claim that X, Y and Z are "required". Often, they have no idea what "required" even means. I spent a fruitless hour with a customer that had a 24×7 requirement. I said, "you haven't purchased hardware that will give you 24×7, so we're submitting this change order to remove it from the requirements."

They said, "It's more of a goal. We don't want to remove it."

I said, "It cannot be achieved. You will not pay us because we will fail. Can we remove it and rewrite it as a 'goal'?"

They said, "No need to remove it: we wouldn't count failure to meet that requirement as a 'failure'."

"Okay," I said, "what's the minimum you'll put up with before suing us for failing?"

They couldn't answer that. They had no "required" up-time and could not determine what was "required". They had a goal, but no minimum that would trigger labeling the project a failure.

Of course, the project failed. But not because of up-time. There were dozens of these kinds of poorly-worded requirements that weren't really required.

Scope Issue 2: "The Game"

I worked with some users who were adept at gaming IT. They knew that IT was utterly incapable of delivering everything in the requirements document. They knew this and planned on it.

Also, the users knew that a simple solution would not "add enough value"; a simple solution would get rejected by the governance committee. They knew this and planned on it also.

The users would write amazing, fabulous, wondrous requirements, knowing that some of them were sacrificial. The extra requirements were there to (1) force IT to commit serious resources to the project and (2) convince governance that the software "added enough value".

IT spent (wasted?) hours planning, architecting, designing, estimating and tracking progress against all of the requirements. Then, when we got to acceptance testing, there were numerous "requirements" that were not required, nor even desired. They were padding.

What To Do?

Okay. Scope and delivery are unknowable. Fine. In spite of this, what do we do to provide a reasonable estimate of development effort?

1. Gather the "requirements" or "desires" or "wishes" or "epics" or "stories" or whatever you've got that provides some scope definition. This is the "analysis" or "elaboration" phase. Define "what", but not "how". Clearly define the business problem to be solved. Avoid solution-speak like "database", "application server", and the like.
2. Decompose. Define a backlog of sprints based on what you know. If necessary, dig into some analysis details to provide more information on the sprints. Jiggle the sprints around to get a consistent size and effort.
3. Prioritize based on your best understanding. Define some rational ordering to the sprints and releases. Provide some effort estimate for the first few releases. This estimate is simply the sum of the sprint costs. The sprints should be all about the same effort and about the same cost. About. Not exactly. Fine tune as necessary.
4. Prioritize again with the users. Note that the sprint costs and the sprints required to release are all in their face. They can adjust the order only. Cost is not negotiable. It's largely fixed.

Rejection 1: All Or Nothing

One weird discussion point is the following: "Until release X, this is all useless. You may as well not do release 1 to X-1, those individual steps are of no value."

This is not true, but it's a way some folks try to reject the idea of incremental releases.

You have two possible responses.

"Okay." In this case, you still create the releases, you just don't deliver them. We watched two members of the customer's management team argue about the all-or-nothing issue. One bone-head kept repeating that it was all-or-nothing. Everyone else claimed that Release 1 and 2 were really helpful; it was release 3 to X-1 that were not so useful.
"Why not?" In this case, you suspect that the priorities are totally wrong and -- for some reason -- the customer is unwilling to put them in the correct order.

Everything can be prioritized. Something will be delivered first. At the very least, you can play this trump card. "We need to do incremental releases to resolve any potential problems with delivery and turn-over."

Rejection 2: Total Total Cost

The most frustrating conversations surround the "total cost" issue.

The trick to this is the prioritization conversation you had with your users and buyers. Step 4, above.

You gave them the Release - Sprint - Cost breakdown.

You walked through it to put the releases and sprints into the correct order.

What you have to do is add another column to the spread-sheet: "Running Cost". The running cost column is the sum of the sprint costs. Each running cost number is a candidate total cost. It's just that simple.
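The running cost column is nothing more than a cumulative sum. A quick sketch, with invented releases, sprints and dollar figures:

```python
from itertools import accumulate

# Hypothetical sprint costs, in priority order; the numbers are invented.
sprints = [
    ("Release 1", "Sprint 1", 20_000),
    ("Release 1", "Sprint 2", 22_000),
    ("Release 2", "Sprint 3", 19_000),
    ("Release 2", "Sprint 4", 21_000),
]

# Each entry is the cost of stopping after that sprint: a candidate total cost.
running = list(accumulate(cost for _, _, cost in sprints))

for (release, sprint, cost), total in zip(sprints, running):
    print(release, sprint, cost, total)
```

With these numbers the running cost column reads 20000, 42000, 61000, 82000. Reordering the sprints changes what has been delivered at each of those totals, which is exactly the conversation to have with the customer.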

It takes several tries to get everyone's head wrapped around the concept.

Customer Control

You know the concept has started to sink in when the customer finally agrees that they can pull the plug on the project after any sprint. They grudgingly admit that perhaps they control the costs.

You know they really get it when they finally say something like this.

"We can stop at any time? Any time? In that case, the priority is all wrong. You need to do X first. If we were -- hypothetically -- going to cancel the project, X would create the most value. Then, after that, you have to do Z, not Y. If we cancel after X and Z, we've solved most of the real problems."

When they start to go through hypothetical project cancelation scenarios with you, then they get the way that they control the total cost.

This tends to avoid the tedium of negotiations where the customer then changes the requirements to meet their budget. Nothing is more awful than a customer who has solicited bids via a Request for Proposal (RFP) process. They liked our bid, but realized that they'd asked for too much, and want to reduce the scope, but don't have priorities or cost-per-release information.

If you do the priorities interactively -- with the customer -- there's no "negotiation". It's just decision-making on their part.
