Bio and Publications

Monday, November 30, 2009

Python Book -- Thanks for the Bug Reports

I made some fundamental changes to the text processing pipeline. I think I've corrected all of the typographical and production problems. (Plus, I fixed some content errors, too.)

I've republished the Building Skills in Python, both HTML and PDF.

Hopefully, this is considerably better and more usable.

Next step -- revising the OO Design publication pipeline.

Thursday, November 26, 2009

Python Book -- Version 2.6

Completely revised the Building Skills in Python book.

It now covers Python 2.6 and is much, must easier to maintain in ReStructured Text markup, formatted with Sphinx and LaTeX (via TeXLive) than it was in XML.

XML -- while modern and clean and uniform -- isn't as convenient as LaTeX and RST.

Tuesday, November 24, 2009

Standard "Distributed" Database Issues

Here's a quote "standard issues associated w/ a disitributed db". And "There is the push versus pull of data. Say you use push and..." and more stuff after that.

First, by "Distributed Database", the question could mean almost anything. However, they provide the specific example of Oracle's Multi-Master Replication. That narrows the question somewhat.

This appears to mean that -- for them -- Distributed Database means two (or more) applications, two (or more) physical database instances and at least one class of entities which exist in multiple applications and are persisted in multiple databases.

That means multiple applications with responsibility for a single class of objects.

That breaks at least one fundamental design principle. Generally, a class has one responsibility. Now we have two implementations sharing some kind of responsibility for a single class of objects. Disentangling the responsibilities is always hard.

Standard Issues

There's one standard issue with this kind of distributed database. It is horribly complex and never worth it.

Never.

You broke the Single Responsibility Principle. You'll regret that.

The "distributed database" is like a spread sheet.

First, you have a problem that you think you can solve with a distributed database.
Now you have two problems.
Sensible Alternatives

There are two standard solutions to problems that appear to require a distributed database.
A data warehouse. Often, there is no actual state change that is part of a transactional workflow that moves back and forth between the applications. In most cases, the information needs be merged for reporting and analysis purposes. Occasionally, this merged information is used for transactional processing, but that's easily handled by the dimensional bus feeding back to source applications.

An Enterprise Service Bus (ESB) and a Service-Oriented Architecture (SOA). The rest of the time, one has a "Distributed Transaction". This is better thought of as a Composite Applications. A composite application is not part of any of the foundational ("distributed") applications; a composite is fundamentally different and of a higher level
Stay Out Of That Box

In short, the "standard issues" with attempting a distributed database are often insurmountable. So don't try.
Pick a fundamentally simpler architecture like Composite Applications via an SOA using an ESB.

Yes, simpler. In the long run, a composite application exploits the foundational applications without invoking a magical two-way distributed coherence among multiple data stores. A composite application leverages the foundational applications by creating a higher-level workflow to pass data between the foundational applications as needed by the composite application.

Read any vendor article on any ESB and you'll see numerous examples of "distributed" databases done more simply (and more effectively) by ditching the concept of "distributed".

IBM, Oracle (which now owns Sun's JCAPS), JBoss, WSO2, OpenESB, Glassfish ESB


Thursday, November 19, 2009

On Risk and Estimating and Agile Methods

See The Question of Risk.



These are notes for a long, detailed rant on the value of Agile methods.

One specious argument against an Agile approach is the "risk management" question. In this case, however, it becomes a "how much of a contingency budget should be write into the contract." Which isn't really risk management.

Sunday, November 15, 2009

ORM magic

The ORM layer is magic, right?

The ORM layer "hides" the database, right?

We never have to think about persistence, right? It just magically "happens."

Wrong.

Here's some quotes from a recent email:

"Somehow people are surprised that we would have performance issues. Somehow people are surprised that now that we are putting humpy/dumpy together that we would have to go back and look at how we have partitioned the system."

I'm not sure what all of that means except that it appears that the author thinks mysterious "people" think performance considerations are secondary.

I don't have a lot of technical details, just a weird ranting list of complaints, including the following.

"... the root cause of the performance issue was that each call to the component did a very small amount of work. So, they were having to make 10 calls to 10 different components to gather useful info. Even though each component calls was quick (something like 0.1 second), to populate the gui screen, they had to make 15 of them."

Read the following Stack Overflow questions: Optimizing this Django Code?, and Overhead of a Round-trip to MySql?

ORM Is A "Silver Bullet" -- It Solves All Our Problems

If you think that you can adopt some architectural component and then program without further regard for the what that component actually does, stop coding now and find another job. Seriously.

If you think you don't have to consider performance, please save us from having to clean up your mess.

I'm repeatedly shocked at people who claim that some particular ORM (e.g., Hibernate) was unacceptable because of poor performance.

ORM's like Hibernate, iBatis, SQLAlchemy, Django ORM, etc., are not performance problems. They're solutions to specific problems. And like all solution technology, they're very easy to misuse.

Hint 1: ORM == Mapping. Not Magic. Mapping.

The mapping is from low-rent relational row-column (with no usable collections) to object instances. That's all. Just mapping rows to objects. No magic. Object collections and SQL foreign keys are cleverly exchanged using specific techniques that must be understood to be used.

Hint 2: Encapsulation != Ignorance. OO design frees us from "implementation details". This does not mean that it frees us from performance considerations. Performance is not an "implementation detail". The performance considerations of class encapsulation are central to the very idea of encapsulation.

One central reason we have object-oriented design is to separate performance from programming nuts and bolts. We want to be able to pick and choose alternative class definitions based on performance considerations.

ORM's Role.

ORM saves writing mappings from column names to class instances. It saves us from writing SQL. It doesn't remove the need to actually think about what's actually going on.

If an attribute is implemented as a property that actually does a query, we need to pay attention to this. We need to read the API documentation, know what features of a class do queries, and think about how to manage this.

If we don't know, we need to write experiments and spikes to demonstrate what is happening. Reading the SQL logs should be done early in the architecture definition.

You can't write random code and complain that the performance isn't very good.

If you think you should be able to write code without thinking and understanding what you're doing, you need to find a new job.

Tuesday, November 10, 2009

Another HTML Cleanup

Browsers are required to skip over bad HTML and render something.

Consequently, many web sites have significant HTML errors that don't show up until you try to scrape their content.

Beautiful Soup has a handy hook for doing markup massage prior to parsing. This is a way of fixing site-specific bugs when necessary.

Here's a two-part massage I wrote recently that corrects two common (and show-stopping) HTML issues with quoted attributes values in a tag.

# Fix style="background-image:url("url")"
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image( match ):
return 'background-image:url("e;%s"e;)' % ( match.group(1) )

# Fix src="url name="name""
bad_img = re.compile( r'src="([^ ]+) name="([^"]+)""' )
def fix_bad_img( match ):
return 'src="%s" name="%s"' % ( match.group(1), match.group(2) )

fix_style_quotes = [
(background_image, fix_background_image),
(bad_img, fix_bad_img),
]

The "fix_style_quotes" sequence is provided to the BeautifulSoup contructor as the markupMassage value.


Friday, November 6, 2009

BBEdit Configuration

After installing Python 2.6 in Mac OS X, I had problems with BBEdit not finding the right version of Python. It kept running an old 2.5 version.

I finally tracked down the BBEdit documentation, http://pine.barebones.com/manual/BBEdit_9_User_Manual.pdf.

Found this: "BBEdit expects to find Python in /usr/bin, /usr/local/bin, or /sw/bin. If you have installed Python elsewhere, you must create a symbolic link in /usr/local/bin pointing to your copy of Python in order to use pydoc and the Python debugger."

Checked in /usr/bin and found an old Python there. I think Fink did that. Removed it and BBEdit is much happier. As is Komodo Edit.

Wednesday, November 4, 2009

Parsing HTML from Microsoft Products (Like Front Page, etc.)

Ugh. When you try to parse MS-generated HTML, you find some extension syntax that is completely befuddling.

I've tried a few things in the past, none were particularly good.

In reading a file recently, I found that even Beautiful Soup was unable to prettify or parse it.
The document was filled with <!--[if...]>...<![endif]--> constructs that looked vaguely directive or comment-like, but still managed to stump the parser.

The BeautifulSoup parser has a markupMassage parameter that applies a sequence of regexps to the source document to cleanup things that are baffling. Some things, however, are too complex for simple regexp's. Specifically, these nested comment-like things were totally confusing.

Here's what I did. I wrote a simple generator which emitted the text that was unguarded by these things. The resulting sequence of text blocks could be assembled into a document that BeautifulSoup could parse.

def clean_directives( page ):
"""
Stupid Microsoft "Directive"-like comments!
Must remove all <!--[if...]>...<![endif]--> sequences. Which can be nested.
Must remove all <![if...]>...<![endif]> sequences. Which appear to be the nested version.
"""
if_endif_pat= re.compile( r"(\<!-*\[if .*?\]\>)|(<!\[endif\]-*\>)" )
context= []
start= 0
for m in if_endif_pat.finditer( page ):
if "[if" in m.group(0):
if start is not None:
yield page[start:m.start()]
context.append(m)
start= None
elif "[endif" in m.group(0):
context.pop(-1)
if len(context) == 0:
start= m.end()+1
if start is not None:
yield page[start:]

Stored Procedures and Ad Hominem Arguments

The question of "Stored Procedures and Triggers" comes up fairly frequently.

Over the years (since the 90's, when stored procedures were introduced to Oracle) I've learned precisely how awful a mistake this technology is.

I've seen numerous problems that have stored procedures as their root cause. I'll identify just a few. These are not "biases" or "opinions". These are experience.
  1. The "DBA as Bottleneck" problem. In short, the DBA's take projects hostage while the development team waits for stored procedures to be written, corrected, performance tuned or maintained.
  2. The "Data Cartel" problem. The DBA's own parts of the business process. They refuse (or complicate) changes to fundamental business rules for obscure database reasons.
  3. The "Unmaintainability" problem. The stored procedures (and triggers) have reached a level of confusion and complexity that means that it's easier to drop the application and install a new one.
  4. The "Doesn't Break the License" problem. For some reason, the interpreted and source-code nature of stored procedures makes them the first candidate for customization of purchased applications. Worse, the feeling is that doing so doesn't (or won't) impair the support agreements.
When I bring these up, I wind up subject to weird ad hominem attacks.

I've been told (more than once) that I'm not being "balanced" and that stored procedures have "There are pros and cons on both sides". This is bunk. I have plenty of facts. Stored procedures create a mess. I've never seen any good come from stored procedures.

I don't use GOTO's haphazardly. I don't write procedural spaghetti code. No one says that I should be more "balanced."

I don't create random database structures with 1NF, 2NF and 3NF violations in random places. No one says I should be more "balanced".

Indeed, asking me to examine my bias is an ad hominem argument. My fact-based experience with stored procedures is entirely negative.

But when it comes to stored procedures, there's a level of defensiveness that defies my understanding. I assume Oracle, IBM and Microsoft are paying kickbacks to DBA's to support stored procedures and PL/SQL over the more sensible alternatives.