Tuesday, November 10, 2009

Another HTML Cleanup

Browsers are required to skip over bad HTML and render something.

Consequently, many web sites have significant HTML errors that don't show up until you try to scrape their content.

Beautiful Soup has a handy hook for doing markup massage prior to parsing. This is a way of fixing site-specific bugs when necessary.

Here's a two-part massage I wrote recently that corrects two common (and show-stopping) HTML issues with quoted attributes values in a tag.

# Fix style="background-image:url("url")"
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image( match ):
return 'background-image:url("e;%s"e;)' % ( match.group(1) )

# Fix src="url name="name""
bad_img = re.compile( r'src="([^ ]+) name="([^"]+)""' )
def fix_bad_img( match ):
return 'src="%s" name="%s"' % ( match.group(1), match.group(2) )

fix_style_quotes = [
(background_image, fix_background_image),
(bad_img, fix_bad_img),
]

The "fix_style_quotes" sequence is provided to the BeautifulSoup contructor as the markupMassage value.


Friday, November 6, 2009

BBEdit Configuration

After installing Python 2.6 in Mac OS X, I had problems with BBEdit not finding the right version of Python. It kept running an old 2.5 version.

I finally tracked down the BBEdit documentation, http://pine.barebones.com/manual/BBEdit_9_User_Manual.pdf.

Found this: "BBEdit expects to find Python in /usr/bin, /usr/local/bin, or /sw/bin. If you have installed Python elsewhere, you must create a symbolic link in /usr/local/bin pointing to your copy of Python in order to use pydoc and the Python debugger."

Checked in /usr/bin and found an old Python there. I think Fink did that. Removed it and BBEdit is much happier. As is Komodo Edit.

Wednesday, November 4, 2009

Parsing HTML from Microsoft Products (Like Front Page, etc.)

Ugh. When you try to parse MS-generated HTML, you find some extension syntax that is completely befuddling.

I've tried a few things in the past, none were particularly good.

In reading a file recently, I found that even Beautiful Soup was unable to prettify or parse it.
The document was filled with <!--[if...]>...<![endif]--> constructs that looked vaguely directive or comment-like, but still managed to stump the parser.

The BeautifulSoup parser has a markupMassage parameter that applies a sequence of regexps to the source document to cleanup things that are baffling. Some things, however, are too complex for simple regexp's. Specifically, these nested comment-like things were totally confusing.

Here's what I did. I wrote a simple generator which emitted the text that was unguarded by these things. The resulting sequence of text blocks could be assembled into a document that BeautifulSoup could parse.

def clean_directives( page ):
"""
Stupid Microsoft "Directive"-like comments!
Must remove all <!--[if...]>...<![endif]--> sequences. Which can be nested.
Must remove all <![if...]>...<![endif]> sequences. Which appear to be the nested version.
"""
if_endif_pat= re.compile( r"(\<!-*\[if .*?\]\>)|(<!\[endif\]-*\>)" )
context= []
start= 0
for m in if_endif_pat.finditer( page ):
if "[if" in m.group(0):
if start is not None:
yield page[start:m.start()]
context.append(m)
start= None
elif "[endif" in m.group(0):
context.pop(-1)
if len(context) == 0:
start= m.end()+1
if start is not None:
yield page[start:]

Stored Procedures and Ad Hominem Arguments

The question of "Stored Procedures and Triggers" comes up fairly frequently.

Over the years (since the 90's, when stored procedures were introduced to Oracle) I've learned precisely how awful a mistake this technology is.

I've seen numerous problems that have stored procedures as their root cause. I'll identify just a few. These are not "biases" or "opinions". These are experience.
  1. The "DBA as Bottleneck" problem. In short, the DBA's take projects hostage while the development team waits for stored procedures to be written, corrected, performance tuned or maintained.
  2. The "Data Cartel" problem. The DBA's own parts of the business process. They refuse (or complicate) changes to fundamental business rules for obscure database reasons.
  3. The "Unmaintainability" problem. The stored procedures (and triggers) have reached a level of confusion and complexity that means that it's easier to drop the application and install a new one.
  4. The "Doesn't Break the License" problem. For some reason, the interpreted and source-code nature of stored procedures makes them the first candidate for customization of purchased applications. Worse, the feeling is that doing so doesn't (or won't) impair the support agreements.
When I bring these up, I wind up subject to weird ad hominem attacks.

I've been told (more than once) that I'm not being "balanced" and that stored procedures have "There are pros and cons on both sides". This is bunk. I have plenty of facts. Stored procedures create a mess. I've never seen any good come from stored procedures.

I don't use GOTO's haphazardly. I don't write procedural spaghetti code. No one says that I should be more "balanced."

I don't create random database structures with 1NF, 2NF and 3NF violations in random places. No one says I should be more "balanced".

Indeed, asking me to examine my bias is an ad hominem argument. My fact-based experience with stored procedures is entirely negative.

But when it comes to stored procedures, there's a level of defensiveness that defies my understanding. I assume Oracle, IBM and Microsoft are paying kickbacks to DBA's to support stored procedures and PL/SQL over the more sensible alternatives.

Saturday, October 31, 2009

Open Source in the News

Whitehouse.gov is publicly using open source tools.

See Boing Boing blog entry. Plus Huffington Post blog entry.

Most importantly, read this from O'Reilly.

Many places are using open source in stealth mode. Some even deny it. Ask your CIO what the policy on open source is, then check to see if you're using Apache. Often, this is an oops -- policy says "no", practice says "yes".

For a weird perspective on open source, read Binstock's Integration Watch piece in SD Times: From Open Source to Commercial Quality. "rigor is the quality often missing from OSS projects". "Often" missing? I guess Binstock travels in wider circles and sees more bad open source software than I do. The stuff I work with is very, very high quality. Python, Django, Sphinx, the LaTeX stack, PIL, Docutils -- all seem to be outstandingly good.

I guess "rigor" isn't an obvious tangible feature of the software. Indeed, I'm not sure what "commercial quality" means if open source lacks this, also.

All of the commercial software I've seen has been in-house developed stuff. Because there's so much of it, it must have these elusive "rigor" and "commercial quality" features that Binstock values so highly. Yet, the software is really bad: it barely works and they can't maintain it.

My experience is different from Binstock's. Also, most of the key points in his article are process points, not software quality issues. My experience is that the open source product exceeds commercial quality. Since there's no money to support a help desk or marketing or product ownership, the open source process doesn't offer all the features of a commercial operation.

Wednesday, October 28, 2009

Painful Python Import Lessons

Python's packages and modules are -- generally -- quite elegant.

They're relatively easy to manage. The __init__.py file (to make a module into a package) is very elegant. And stuff can be put into the __init__.py file to create a kind of top-level or header module in a larger package of modules.

To a limit.

It took hours, but I found the edge of the envelope. The hard way.

We have a package with about 10 distinct Django apps. Each Django app is -- itself -- a package. Nothing surprising or difficult here.

At first, just one of those apps used a couple of fancy security-related functions to assure that only certain people could see certain things in the view. It turns out that merely being logged in (and a member of the right group) isn't enough. We have some additional context choices that you must make.

The view functions wind up with a structure that looks like this.
@login_required
def someView( request, object_id, context_from_URL ):
no_good = check_other_context( context_from_URL )
if no_good is not None: return no_good
still_no_good = check_session()
if still_no_good is not None: return still_no_good
# you get the idea

At first, just one app had this feature.

Then, it grew. Now several apps need to use check_session and check_other_context.

Where to Put The Common Code?

So, now we have the standard architectural problem of refactoring upwards. We need to move these functions somewhere accessible. It's above the original app, and into the package of apps.

The dumb, obvious choice is the package-level __init__.py file.

Why this is dumb isn't obvious -- at first. This file is implicitly imported. Doesn't seem like a bad thing. With one exception.

The settings.

If the settings file is in a package, and the package-level __init__.py file has any Django stuff in it -- any at all -- that stuff will be imported before your settings have finished being imported. Settings are loaded lazily -- as late as possible. However, in the process of loading settings, there are defaults, and Django may have to use those defaults in order to finish the import of your settings.

This leads to the weird situation that Django is clearly ignoring fundamental things like DATABASE_ENGINE and similar settings. You get the dummy database engine, Yet, a basic from django.conf import settings; print settings.DATABASE_ENGINE shows that you should have your expected database.

Moral Of the Story

Nothing with any Django imports can go into the package-level __init__.py files that may get brought in while importing settings.

Monday, October 26, 2009

Process Not Working -- Must Have More Process

After all, programmers are all lazy and stupid.

Got his complaint recently.

"Developers on a fairly routine basis check in code into the wrong branch."

Followed by a common form of the lazy and stupid complaint. "Someone should think about which branch is used for what and when." Clearly "someone" means the programmers and "should think about" means are stupid.

This was followed by the "more process will fix this process problem" litany of candidate solutions.

"Does CVS / Subversion have a knob which provides the functionality to
prevent developers from checking code into a branch?"

"Is there a canonical way to organize branches?" Really, this means something like what are the lazy, stupid programmers doing wrong?

Plus there where rhetorical non-questions to emphasize the lazy, stupid root cause. "Why is code merging so hard?" (Stupid.) "If code is properly done and not coupled, merging should be easy?" (Lazy; a better design would prevent this.) "Perhaps the developers don't understand the code and screw up the merge?" (Stupid.) "If the code is not coupled, understanding should be easy?" (Both Lazy and Stupid.)

Root Cause Analysis

The complaint is about process failure. Tools do not cause (or even contribute) to process failure. There are two possible contributions to process failure: the process and the people.

The process could be flawed. There could be no earthly way the programmers can locate the correct branch because (a) it doesn't exist when they need it or (b) no one told them which branch to use.

The people could be flawed. For whatever reason, they refuse to execute the process. Perhaps they know a better way, perhaps they're just being jerks.

Technical means will not solve either root cause problem. It will -- generally -- exacerbate it. If the process is broken, then attempting to create CVS / Subversion "controls" will end in expensive, elaborate failure. Either they can't be made to work, or (eventually) someone will realize that they don't actually solve the problem. On the other hand, if the people are broken, they'll just subvert the controls in more interesting, silly and convoluted ways.

My response -- at the time -- was not "analyze the root causes". When I first got this, I could only stare at it dumbfounded. My answer was "You're right, your developers are lazy and stupid. Good call. Add more process to overcome their laziness and stupidity."

After all, the questioner clearly knows -- for a fact -- that more process helps fix a broken organization. The questioner must be convinced that actually talking to people will never help.
The question was not "what can I do?" The question was "can I control these people through changes to CVS?" There's a clear presumption of "process not working -- must have more process."

The better response from me should have been. "Ask them what the problem is." I'll bet dollars against bent pins that no one tells them which branch to use in time to start work. I'll bet they're left guessing. Also, there's a small chance that these are off-shore developers and communication delays make it difficult to use the correct branch. There may be no work-orders, just informal email "communication" between on-shore and off-shore points-of-contact (and, perhaps, the points-of-contact aren't decision-makers.)

Bottom Line. If people can't make CVS work, someone needs to talk to them to find out why. Someone does not need to invent more process to control them.