Friday, July 31, 2009

Object Models and Relational Joins -- Endless Confusion

Check out this list of questions from Stack Overflow: [Django] join.

These are all folks trying to do joins or outer joins even though they have objects fetched through the ORM.

How does this confusion arise? Easy. Folks work with SQL as if the relational world-view is Important and Universal. It isn't. SQL isn't even a programming language, per se.

Here's the important thing for Django developers to know: SQL is a Hack; Leave it Behind.

The bad news is that all those years spent mastering the ins and outs of the SELECT statement doesn't have as much enduring value as I'd hoped it would have. [Yes, I was a DBA in Ingres and Oracle. I know my SQL.]

The good news is that Object Navigation replaces much of the hideousness of SQL. To an extent. Let's look at some cases.

Joins in General

SQL SELECT statements are an algebraic specification of a result set. The database is free to use any algorithm to build the required set.

SQL imposes the Join hack because SQL is a completely consistent set algebra system. A simple SELECT returns a row-column set of data. A join between tables has to construct a fake row-column set so that everything is consistent.

A Join is nothing more than navigation from an object to associated objects. In OO world, this is simply object containment; the navigation is simply the name of a related object. Nothing more.

Master-Detail (1:m) Joins

A master-detail join in SQL works with a foreign key reference on the children.

In Django, this has to be declared in a SQL-friendly way so that the ORM will work.

class Master( models.Model ):

class Detail( models.Model ):
master= models.ForeignKey( Master )

The "Join" query is simply this. The "detail_set" name is deduced by Django from the class that contains the foreign key.

for m in Master.objects.filter():
process m
for d in m.detail_set.all():
process d

"But wait!" the SQL purist cries, "isn't that inefficient?" The answer is "rarely". It's possible that the RDBMS, doing a "merge-join" algorithm to build the entire result set might be quicker than this.

As practical matter, however, the rest of the web transaction -- including the painfully slow download -- will dominate the timeline.

Association (m:m) Joins

An association in SQL requires an intermediate table to carry the combinations of foreign keys.

In Django, this has to be declared in a SQL-friendly way so that the ORM will work.

class This( models.Model ):

class That( models.Model ):
these = models.ManyToManyField( This )

The navigation, however, is simply following the relationships. There's no complicated SQL join required.

for this in This.objects.filter():
for that in this.that_set.all():
process this and that

Here's the other side of the navigation.

for that in That.objects.filter():
for this in that.these:
process this and that

Outer Joins

An Outer Join is a "Join with Null for Missing Relationships". It's navigation with an if-statement or an exception clause.

for that in That.objects.filter():
try:
this = that.this_set.get()
except This.DoesNotExist:
this = None
process this and that

There isn't any "join" in object-oriented programming. The ORM layer removes the need.

Monday, July 27, 2009

Python and UML

Searched for this the other day. Came up empty. Clearly, didn't search hard enough.

Automatically Generated Python Code from an UML diagram?

So far, there are three separate answers listing a total of five separate products.

I really want to annotate the UML with docstrings. If that was supported, it would be cool. Then my Sphinx documentation would include UML images pulled from the code itself.

Programming Language Popularity -- Update

I used to rely on the TIOBE Software Programming Community Index.

Today, I learned about the langpop site. The context was following an SD Times blog posting on Haskell. But I got distracted looking at language rankings and what it means to consulting companies.

[Until corrected, I was] particularly drawn to langpop's Amazon listing. It seems like everyone wants books on C, C++ and C#. All other languages are also-rans. Why? [I guessed] that it's because those languages are so (a) hard to work with and (b) are perceived as "old school" where print resources are more prevalent than on-line resources. [Turns out the rankings are screwed up. Dang.]

Marketplace

The langpop aggregate ranking is C, Java, C++, PHP, JavaScript, C# and Python. The TIOBE Community Index is Java, C, C++, PHP, VB, C#, Python.

Similar enough to confirm that a half-dozen languages dominate with two others fighting for position.

Years ago, one of our senior consultants made the case that we really have three fundamental tech-stacks into which our web services had to fit: VB/.Net, C#/.Net, and Java. Corporate IT doesn't demand a lot of PHP or Python from us.

Perhaps those are opportunities for growth. Or perhaps there isn't enough corporate IT demand for those frameworks.

Friday, July 24, 2009

Privacy and Encryption

See Massachusetts Says Encrypt It All!

This gives a hint as to the future of personal information collection and dissemination.

This is potentially A Bad Thing.

I don't see a problem with using SSL to encrypt "over the wire" data transfers. I don't see a problem with adding layers of encryption to these transfers.

Everything else is going to require something like Apple's File Lock to assure that the file -- no matter where it goes -- is encrypted. This will be a problem.

Non-Standards

A search for Windows File Encryption shows that there are a lot of choices. Hopefully, they will all find a way to adhere to some straightforward standard like AES. If we have to buy/download/install a pile of encryption applications, data sharing will become expensive and complicated. Even if Microsoft does their usual "standard + enhancements" offering, it will make things very expensive.

Imagine buying the "Crypto-Crummo" file system encryption package, deploying it enterprise-wide, finding a problem, and -- horrors -- being unable to unlock your files ever again. It's a bug, not a feature, but you still can't open your files.

How do you prevent that risk? Right. Keep an illegal unencrypted copy of everything.

Here's another scenario. Imagine buying the "Crypto-Locko" file system encryption package. You deploy it enterprise wide. You stop paying your license fees. It stop decrypting. You're corporate data is being held hostage by your encryption vendor.

Here's the third strike. You buy the "Crypto-Uniqueo" file system encryption package. It has a unique protocol, non-standard, proprietary. It gets hacked. Your in violation of the law.

Or, the company making "Crypto-Uniqueo" ceases support. Now how do you get into your files? Or, the company goes out of business? What now?

Unintended Consequences

Without an applicable encryption standard -- and some boundaries on what's really required -- I think these legal initiatives will do more harm than good. To prevent the various risks, companies will do dumb things. Things that are probably dumber than what they've done that lead to leaks of personal information.

Software Performance Improvement

Yesterday, I looked at some marketing material on SPI (Software Process Improvement). It was quite good. The approach was very pragmatic, the deliverables very sound.

The hard part is connecting with customers.

I've only worked with a few customers who were actually interested in process improvement. I've worked with close to a hundred customers who were interested in process enforcement, usually called "compliance".

Laurent Bossavit's Learning Notes has this entry, filed under "Four types of process errors".
Of course, what actually matters - what is worth discussing - is what people actually do. A 10-page process document or a flowchart are nice, but generally irrelevant unless they match very closely with what people actually do in the pursuit of shipping software.
In thinking about SPI, one has to find a way past the Core Hubris™. Bossavit identifies four types of errors. His are all focused on projects that "produce a bad outcome". We'll have to put this on one bucket, the "problem recognized" bucket. We'll rename Bossavit's Type I to Type IV as Type R-I to R-IV because a problem was Recognized.

The Language of Denial

What he misses are the process errors that "produce a questionable outcome". In this case, the outcome can be declared good by the manager that produced it (it went into production) or declared bad by the manager that maintains it (quality is appalling.) These are far, far more insidious and pernicious errors than the four he ties to a bad outcome.

Delivery is all that matters, right? If it goes into production, how "bad" can the outcome be?

The answer is -- sadly -- pretty bad. I'm often asked to work with production code that should have raised red flags, been identified as a bad outcome, and lead to serious questions about process improvement. And yet, there's no question raised at all.

Worse, I'm often asked to follow the process that lead to the horrifying code and the need for rework. What created the mess we're reworking? A flawed process? Why, then, must our proposal swear undying fealty to the broken process? So we can fail yet again?

Even worse, we're sometimes asked to follow a process for which there is no example. "Produce documentation like this," I'm told. Followed by, "but that's not a good example of what I mean." It turns out, there is no example of anyone ever following the written process. But, we're expected (no required) to comply with a process that has a nebulous definition and no examples.

Some Questionable Outcome Errors

I think there are four variations on the theme of process errors. We call this class "Q" errors because there was a questionable result. Not a recognized problem but a shadow of a doubt.
  • Type Q-I error (blame) is when you don't follow the written process, produce a questionable outcome, and blame non-conformance. The point here is that we don't ask why the written process was not followed. Why is the actual process different? Is it a mistake, or is the written process unusable as written?
  • Type Q-II error (fudge) is when you don't follow the written process, produce a questionable outcome, and declare the situation to be exceptional. Either the technology was new or the business problem was not well understood. (Note. All interesting projects have one or both features. If the technology was understood and the business problem was understood, you could download an open source solution.)
  • Type Q-III error (denial) is when you don't follow the written process, produce a questionable outcome and ignore the gaps between written and actual. No proposed changes. Nothing. Just business as usual.
  • Type Q-IV error (insight) is when you don't follow the written process, produce a questionable outcome, and ask two questions. "What was so wrong with the written process that we didn't follow it?" and "What was wrong with what we actually did?" (Note. I've never seen this happen. But that's just me.)

Marketing Past the Hubris

There's a Core Hubris in many software development organizations. It's a presumption that, since they have stuff in production, they know how to deliver more stuff.

Indeed, in many organizations, SPI dies an early death because of the Core Hubris. They already know what they're doing. They don't need any help with that. This is why the blame-fudge-denial errors are so common.

The Core Hubris is also why shoddy code goes into production. There are three paths a project can take.
  1. The High Road. The processes mostly work, are mostly followed, and code is produced that has reasonable quality and gets delivered.
  2. The Low Road. The processes don't work well or aren't followed and code is produced that's questionable. It's put into production anyway, victory is declared and little, if anything is learned.
  3. The Blocked Road. The processes don't work or aren't followed and a bad result is produced. Almost without exception, this means the project is cancelled early. Deeper questions aren't asked because the reasons for cancellation aren't well understood by everyone involved. One day you're working, the next day you're reassigned.
Paths 2 and 3 (the Low Road and the Blocked Road) are both places that need SPI. There are several marketing problems to overcome.

Getting Help

First, will they acknowledge the problem? Probably not. If you've delivered anything -- no matter how bad -- you don't need help. Further, you have two layers of the organization that need to acknowledge the problem. Management needs to recognize that they're wasting money on shoddy quality. Staff needs to recognize that they've got quality problems.

Second, will they ask for help? Probably not. Most of the process errors involve deflections or denials. To seek outside support for something as "simple" as software development is a defeatist attitude. It doesn't matter that software development actually is very hard. What matters is that it shouldn't be so hard, and asking for help is career suicide.

Third, will they follow through on the help? Probably not. Change is disruptive. It means grumpy people complaining about the 8:30 AM Scrum stand-up meeting. It means grumpy project managers having only one or two sprints carefully planned down to the last 6 minutes of activity, and the future sprints are unplanned. It means grumpy business analysts complaining about being forced to focus on just a few use cases and get those right, leaving the "big picture" to fall into a black-hole. It means grumpy DBA's complaining about an evolving data model. It means grumpy programmers complaining that unit test code is not deliverable and is a waste of time.

Management can -- and often does -- act schizophrenically around improvements. They both (a) demand improvement and simultaneously (b) demand that the improvements be subverted in order to deliver "on time".

What to Sell

I think the marketing message for SPI has to be something along the following lines.
Is your software actually perfect? Is maintenance easy? Is adaption and migration a simple administrative step?
  • Are you sure? Do you have evidence? If not, perhaps your processes aren't as perfect as you wish.
Do you scramble to deliver something that works? Is maintenance always more complex than you thought? Have you ever had to reverse engineer a system to replace it?
  • You might want to consider improving your processes.
Have you failed to deliver?
  • You need to reconsider your processes.
Do you have code that's both an asset and a liability? Is it so valuable you need to keep it in production, but it's in such bad shape that maintenance is an expensive nightmare?
  • The root cause is process problems. Address the process issues and you should be able to reduce maintenance costs, or get better quality results for your maintenance spend.
This, I think, is the target audience for SPI services. Most IT people think they're successful. I've seen their worst code, and I disagree.

By worst I mean the following: So valuable you can't throw it away and so broken you can't maintain it yourself. This code is a costly, risky burden on the IT organization but still creates value for the enterprise as a whole. Flawed processes put it into production, and flawed processes prevents effective rework.

The folks that understand that merely delivering may not be enough are the folks that might consider SPI services.

Thursday, July 23, 2009

Java vs. PL/SQL

Quite a while ago, I compared Java and PL/SQL to gauge their relative performance.

Recently (okay, back in mid-June) I got this request.
One thing I would like to compare is Java vs PL/SQL using native
compilation (search Oracle for NCOMP). Would you be willing to repeat
your benchmark tests using NCOMP? NCOMP is pretty straightforward to
set up, I think it is even easier in 11g, if you are using that.

Also, when you test Java vs. PL/SQL, are you using Java stored
procedures in Oracle, or are you using an external VM and connecting
to Oracle? (One annoying limitation to Java Stored Procedures is the
lack of threading ability, among a few other things).
Native Compilation will not make PL/SQL magically faster than Java. The very best it can do is make PL/SQL as fast as Java. The clunky, inelegance of PL/SQL isn't fixed by NCOMP, either.

My test was PL/SQL stored procedures in Oracle. These were compared against Java programs in a separate JVM. I didn't use Java stored procedures because the client didn't ask for this.

The client had legacy C code they wanted reverse engineered and reimplemented. PL/SQL was unsuitable for this task for a number of reasons.
  1. PL/SQL is slower than C or Java. Speed mattered.
  2. PL/SQL is a clunky and inelegant language. Worse than C and far worse than Java. The application would have grown to gargantuan proportions.
  3. The legacy C code was full of constructs that would have to be rethought from their very essence to recast them in PL/SQL. Java, for the most part, is a better fit with legacy C. The reverse-engineering was -- relatively -- easy in moving from C to Java.
There were some additional, minor considerations.
  1. There is some unit testing capability in PL/SQL (UTPLSQL), but it's not as feature-rich as JUnit. Unit testing was essential for proving that the legacy features were ported correctly.
  2. PL/SQL is hard to develop. A nice IDE (like NetBeans or Eclipse) makes it very easy to write Java. The customer was using Toad and wasn't planning to introduce the kind of IDE required to build large, complex applications.
In short, the simple speed test -- PL/SQL vs. Java -- was sufficient to show that PL/SQL is simply too slow for compute-intensive speed-critical applications.

XML Parsing

See Python XML parsing and Parsing simple XML files in python using etree.

Originally, I used SAX -- but built DOM objects with it. I moved from application-specific DOM's to generic DOM's.

Then I switched to the miniDOM parser. It gave me structures I could walk with a pleasant Visitor design.

Last year, I switched to Element Tree. Now I can use Visitor and the XPATH search.

Wednesday, July 22, 2009

Software Overdesign -- An Update

Saw a horrifying design document recently. One that was at the "gouge out my eyes" level of badness. That's one step below "drink until I forget what I saw", but one level above "beat the author with a tire iron."

They were -- I'm guessing here -- trying to develop their own Document Object Model. Distinct from any established DOM. The Wikipedia entry on DOM provides several examples of existing DOM's. Why reinvent?

The application is -- ultimately -- going to be in Python. There are two candidate DOM's that could have been used: the XML DOM and the RST DOM as implemented in Docutils nodes module. Instead, they were reinventing: they appear to have spent a great deal of time writing use cases for "editor". I expect there was a use case for "wheel" and "fire" in there also.

What scared me was the "flatness" of the model. Every buzzword had it's own class. There was no inheritance or reuse anywhere in the diagram. Parts of the model where influenced by the DocBook schemas. The actual DTD could have been turned into the model, but wasn't.

Further, undefinable terms like "sentence" showed up as class definitions. XML's DOM treats all text as -- well -- text. Any language structure is outside the DOM. RST, similarly, treats text as a container "... children are all `Text` or `Inline` subclass nodes."

All I could suggest was "locate common superclasses" and "until you can define 'sentence', omit it". And then run outside and gouge out my eyes.

It's hard to criticize something like that in a truly helpful manner. Fixing the model is merely putting lipstick on a pig.

As far as I can tell, the application is -- somehow -- an editor that imposes a severe set of structural constraints on what the author can do. It's as if RST, docutils and Sphinx don't exist. The real solution isn't "fix your object model" it's "fix your problem statement and learn RST."



Post Script

Check out this reply:
The advantage of having an "outliner data model" and a "document data model" like DocBook XML is that your outliner functionality is not limited by the DocBook XML. The downside is that have to create and support a second model as well as provide a mapping between the two.
In other words, rather than simplify, we'll (1) insist the eye-gougingly bad model is "better", (2) justify the complexity ("not limited by DocBook [DOM]") by and (3) plan to add some more complexity to map an overly complex (and atypical) DOM to a standard DOM.

Not a very parsimonious approach to design.

Saturday, July 11, 2009

Flying Saucer

The old code was 5700 lines of bad VB.

The new code is Velocity, Flying Saucer, iText and 120 lines of glue. The old code will be replaced with perhaps 500 lines of XHTML-producing Velocity templates.

[The Flying Saucer site -- with the main menu on the right -- was confusing at first. It made it look like a half-baked semi-functional idea for an open source project. Boy was I wrong. It totally rocks!]

I am absolutely delighted at the FS world-view.
  • Process the entire CSS specification -- every feature -- especially those related to printing.
  • Don't tolerate malformed XHTML. Gecko tolerates all kinds of HTML problems, making it big and sophisticated. Flying Saucer just doesn't tolerate ill-formed XML, making it simpler, and more able to handle every CSS nuance.
Since the document is very simple (with no side-bars or floating elements), simple CSS works. And the Flying Saucer PDF matches the HTML completely. The match was so good that I did a double-take at the first PDF I made.

The best part is being able to chuck 1000's of lines of VB and replace them with 100's of lines of XHTML. I think that could stand to reduce long-term maintenance costs.

Wednesday, July 1, 2009

Test-Driven Reverse Engineering and Perniciously Bad Code

I've done a fair amount of reverse engineering over the years.

In the early days, you went from code to specification to new code. It took forever and the problems you uncovered -- well -- they often derailed the project.

Recently, I used a TDD-like approach. Each piece of legacy code was turned into some Java code with some associated unit tests. Further, the users were able to cough up a canonical set of acceptance tests. These were turned into unit tests, and it wasn't too difficult to meet in the middle with plenty of testing for each piece of legacy conversion.

Given some subsequent experience, it turns out that user acceptance tests are essential to success in reverse engineering. Without user acceptance tests being provided up front, reverse engineering is a nightmare.

Mystery Code

Today's issue is legacy code that is -- frankly -- incompetently done. As a bonus, the user organization is a little vague on what it's supposed to do. They trust it, but they can't verify it. There are no official test cases.

The only explanation we can get is a demo. And because of the user's workload, we're only getting one of these. Limited to an hour. AFAIK, the only way we can test the conversion is to run it head-to-head with the legacy and take notes as the users complain about the differences.

There will be no easy way to get to create up-front acceptance tests to drive development. We'll have to take careful notes during the demo and transform the demo script into test results we can use.

Worse Still

What's worse is the incompetent coding. How bad can code be? Let me count the ways:
  1. Globals. Anyone who thinks a global is a legal programming construct needs to find a new career. A module that declares all the globals just compounds the horror. Everything is scopeless and could be used anywhere. There's no "interface" to anything, it's just a puddle of grey goo.
    • Using globals means functions have side-effects. They update global variables more-or-less spontaneously.
    • Using globals also means that all kinds of things may have hysteresis. You call it once, it does one thing. You call it again, it does something different.
  2. Random SQL. Anyone who thinks that SQL statements can be dropped in any random place in the application needs to find a new career. MVC is essential for segregating the SQL away from the View. Views functions can't query stuff that should have been part of the model, it means the model is incomplete -- and possibly in the wrong state. It also means that view functions are slow and possibly not strictly idempotent -- every time you refresh, a value in the view could diverge from the value in the "official" model.
  3. Copy-and-Paste coding. How hard can it be to put common code into a function? Apparently, it's nearly impossible. If you're copying and pasting common code, stop now. There's no excuse. It just raises the cost of maintenance and conversion through the roof.
  4. No Change Control. Or rather, the change control is to leave all previous versions of the code in place as comment blocks. For each line of real code, there are two lines of previous versions commented out. I don't care what it was; I want to know what it is. If you can't use SVN, or even VSS, you need to find another career.
There. I feel better. Back to trying to figure out what this application really does.