Thursday, November 19, 2009

On Risk and Estimating and Agile Methods

See The Question of Risk.



These are notes for a long, detailed rant on the value of Agile methods.

One specious argument against an Agile approach is the "risk management" question. In this case, however, it becomes a "how much of a contingency budget should be write into the contract." Which isn't really risk management.

Sunday, November 15, 2009

ORM magic

The ORM layer is magic, right?

The ORM layer "hides" the database, right?

We never have to think about persistence, right? It just magically "happens."

Wrong.

Here's some quotes from a recent email:

"Somehow people are surprised that we would have performance issues. Somehow people are surprised that now that we are putting humpy/dumpy together that we would have to go back and look at how we have partitioned the system."

I'm not sure what all of that means except that it appears that the author thinks mysterious "people" think performance considerations are secondary.

I don't have a lot of technical details, just a weird ranting list of complaints, including the following.

"... the root cause of the performance issue was that each call to the component did a very small amount of work. So, they were having to make 10 calls to 10 different components to gather useful info. Even though each component calls was quick (something like 0.1 second), to populate the gui screen, they had to make 15 of them."

Read the following Stack Overflow questions: Optimizing this Django Code?, and Overhead of a Round-trip to MySql?

ORM Is A "Silver Bullet" -- It Solves All Our Problems

If you think that you can adopt some architectural component and then program without further regard for the what that component actually does, stop coding now and find another job. Seriously.

If you think you don't have to consider performance, please save us from having to clean up your mess.

I'm repeatedly shocked at people who claim that some particular ORM (e.g., Hibernate) was unacceptable because of poor performance.

ORM's like Hibernate, iBatis, SQLAlchemy, Django ORM, etc., are not performance problems. They're solutions to specific problems. And like all solution technology, they're very easy to misuse.

Hint 1: ORM == Mapping. Not Magic. Mapping.

The mapping is from low-rent relational row-column (with no usable collections) to object instances. That's all. Just mapping rows to objects. No magic. Object collections and SQL foreign keys are cleverly exchanged using specific techniques that must be understood to be used.

Hint 2: Encapsulation != Ignorance. OO design frees us from "implementation details". This does not mean that it frees us from performance considerations. Performance is not an "implementation detail". The performance considerations of class encapsulation are central to the very idea of encapsulation.

One central reason we have object-oriented design is to separate performance from programming nuts and bolts. We want to be able to pick and choose alternative class definitions based on performance considerations.

ORM's Role.

ORM saves writing mappings from column names to class instances. It saves us from writing SQL. It doesn't remove the need to actually think about what's actually going on.

If an attribute is implemented as a property that actually does a query, we need to pay attention to this. We need to read the API documentation, know what features of a class do queries, and think about how to manage this.

If we don't know, we need to write experiments and spikes to demonstrate what is happening. Reading the SQL logs should be done early in the architecture definition.

You can't write random code and complain that the performance isn't very good.

If you think you should be able to write code without thinking and understanding what you're doing, you need to find a new job.

Tuesday, November 10, 2009

Another HTML Cleanup

Browsers are required to skip over bad HTML and render something.

Consequently, many web sites have significant HTML errors that don't show up until you try to scrape their content.

Beautiful Soup has a handy hook for doing markup massage prior to parsing. This is a way of fixing site-specific bugs when necessary.

Here's a two-part massage I wrote recently that corrects two common (and show-stopping) HTML issues with quoted attributes values in a tag.

# Fix style="background-image:url("url")"
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image( match ):
return 'background-image:url("e;%s"e;)' % ( match.group(1) )

# Fix src="url name="name""
bad_img = re.compile( r'src="([^ ]+) name="([^"]+)""' )
def fix_bad_img( match ):
return 'src="%s" name="%s"' % ( match.group(1), match.group(2) )

fix_style_quotes = [
(background_image, fix_background_image),
(bad_img, fix_bad_img),
]

The "fix_style_quotes" sequence is provided to the BeautifulSoup contructor as the markupMassage value.


Friday, November 6, 2009

BBEdit Configuration

After installing Python 2.6 in Mac OS X, I had problems with BBEdit not finding the right version of Python. It kept running an old 2.5 version.

I finally tracked down the BBEdit documentation, http://pine.barebones.com/manual/BBEdit_9_User_Manual.pdf.

Found this: "BBEdit expects to find Python in /usr/bin, /usr/local/bin, or /sw/bin. If you have installed Python elsewhere, you must create a symbolic link in /usr/local/bin pointing to your copy of Python in order to use pydoc and the Python debugger."

Checked in /usr/bin and found an old Python there. I think Fink did that. Removed it and BBEdit is much happier. As is Komodo Edit.

Wednesday, November 4, 2009

Parsing HTML from Microsoft Products (Like Front Page, etc.)

Ugh. When you try to parse MS-generated HTML, you find some extension syntax that is completely befuddling.

I've tried a few things in the past, none were particularly good.

In reading a file recently, I found that even Beautiful Soup was unable to prettify or parse it.
The document was filled with <!--[if...]>...<![endif]--> constructs that looked vaguely directive or comment-like, but still managed to stump the parser.

The BeautifulSoup parser has a markupMassage parameter that applies a sequence of regexps to the source document to cleanup things that are baffling. Some things, however, are too complex for simple regexp's. Specifically, these nested comment-like things were totally confusing.

Here's what I did. I wrote a simple generator which emitted the text that was unguarded by these things. The resulting sequence of text blocks could be assembled into a document that BeautifulSoup could parse.

def clean_directives( page ):
"""
Stupid Microsoft "Directive"-like comments!
Must remove all <!--[if...]>...<![endif]--> sequences. Which can be nested.
Must remove all <![if...]>...<![endif]> sequences. Which appear to be the nested version.
"""
if_endif_pat= re.compile( r"(\<!-*\[if .*?\]\>)|(<!\[endif\]-*\>)" )
context= []
start= 0
for m in if_endif_pat.finditer( page ):
if "[if" in m.group(0):
if start is not None:
yield page[start:m.start()]
context.append(m)
start= None
elif "[endif" in m.group(0):
context.pop(-1)
if len(context) == 0:
start= m.end()+1
if start is not None:
yield page[start:]

Stored Procedures and Ad Hominem Arguments

The question of "Stored Procedures and Triggers" comes up fairly frequently.

Over the years (since the 90's, when stored procedures were introduced to Oracle) I've learned precisely how awful a mistake this technology is.

I've seen numerous problems that have stored procedures as their root cause. I'll identify just a few. These are not "biases" or "opinions". These are experience.
  1. The "DBA as Bottleneck" problem. In short, the DBA's take projects hostage while the development team waits for stored procedures to be written, corrected, performance tuned or maintained.
  2. The "Data Cartel" problem. The DBA's own parts of the business process. They refuse (or complicate) changes to fundamental business rules for obscure database reasons.
  3. The "Unmaintainability" problem. The stored procedures (and triggers) have reached a level of confusion and complexity that means that it's easier to drop the application and install a new one.
  4. The "Doesn't Break the License" problem. For some reason, the interpreted and source-code nature of stored procedures makes them the first candidate for customization of purchased applications. Worse, the feeling is that doing so doesn't (or won't) impair the support agreements.
When I bring these up, I wind up subject to weird ad hominem attacks.

I've been told (more than once) that I'm not being "balanced" and that stored procedures have "There are pros and cons on both sides". This is bunk. I have plenty of facts. Stored procedures create a mess. I've never seen any good come from stored procedures.

I don't use GOTO's haphazardly. I don't write procedural spaghetti code. No one says that I should be more "balanced."

I don't create random database structures with 1NF, 2NF and 3NF violations in random places. No one says I should be more "balanced".

Indeed, asking me to examine my bias is an ad hominem argument. My fact-based experience with stored procedures is entirely negative.

But when it comes to stored procedures, there's a level of defensiveness that defies my understanding. I assume Oracle, IBM and Microsoft are paying kickbacks to DBA's to support stored procedures and PL/SQL over the more sensible alternatives.

Saturday, October 31, 2009

Open Source in the News

Whitehouse.gov is publicly using open source tools.

See Boing Boing blog entry. Plus Huffington Post blog entry.

Most importantly, read this from O'Reilly.

Many places are using open source in stealth mode. Some even deny it. Ask your CIO what the policy on open source is, then check to see if you're using Apache. Often, this is an oops -- policy says "no", practice says "yes".

For a weird perspective on open source, read Binstock's Integration Watch piece in SD Times: From Open Source to Commercial Quality. "rigor is the quality often missing from OSS projects". "Often" missing? I guess Binstock travels in wider circles and sees more bad open source software than I do. The stuff I work with is very, very high quality. Python, Django, Sphinx, the LaTeX stack, PIL, Docutils -- all seem to be outstandingly good.

I guess "rigor" isn't an obvious tangible feature of the software. Indeed, I'm not sure what "commercial quality" means if open source lacks this, also.

All of the commercial software I've seen has been in-house developed stuff. Because there's so much of it, it must have these elusive "rigor" and "commercial quality" features that Binstock values so highly. Yet, the software is really bad: it barely works and they can't maintain it.

My experience is different from Binstock's. Also, most of the key points in his article are process points, not software quality issues. My experience is that the open source product exceeds commercial quality. Since there's no money to support a help desk or marketing or product ownership, the open source process doesn't offer all the features of a commercial operation.