Tuesday, November 30, 2010

Questions, or, How to Ask For Help

Half the fun on Stack Overflow is the endless use of closed-ended questions. "Can I do this in Python?" is so common, and so hilarious.

The answer is "Yes." You can do it.

Perhaps that's not the question they really meant to ask.

See "Open versus Closed Ended Questions" for a great list of examples.

Closed-Ended Questions have short answers, essentially yes or no. Leading Questions and presuming questions are common variations on this theme. A closed-ended question is sometimes called "dichotomous" because there are only two choices. They can also be called "saturated", possibly because all the possible answers are laid out in the question.

Asking Questions

The most important part about asking questions is to go through a few steps of preparation.
  1. Search. Use Google, use the Stack Overflow search. A huge number of people seem to bang questions into Stack Overflow without taking the time to see if it's been asked (and answered) already.
  2. Define Your Goal. Seriously. Write down your objective. In words. Be sure the goal includes an active-voice verb -- something you want to be able to do. If you want to be able to write code, write down the words "I want to write code for [X]". If you want to be able to tell the difference between two nearly identical things, write down the words "I want to distinguish [Y] from [Z]". When in doubt, use active voice verbs to write down the thing you want to do. Focus on actions you want to take.
  3. Frame Your Question. Rewrite your goal into a sentence by changing the fewest words. 90% of the time, you'll switch "I want to" to "How do I". The rest of the time, you'll have to think for a moment because your goal didn't make sense. If your goal is not an active-voice verb phrase (something you want to do) then you'll have trouble with the rewrite.
In some cases, folks will skip one or more steps. Hilarity Ensues.

Leading/Presuming Questions

Another form of closed-ended question is the veiled complaint. "Why doesn't Python do [X] the way Perl/PHP/Haskell/Java/C# does it?"

Essentially, this is "my favorite other language has a feature Python is missing." The question boils down to, "Why is [Y] not like [Z]?" Often it's qualified by some feature, but the question is the same: "Regarding [X], why is Python not like language [Z]?"

The answer is "Because they're different." The two languages are not the same, that's why there's a difference.

This leads to "probing" questions of no real value. "Why did Python's designers decide to leave out [X]?" and other variants on this theme.

If the answer was "Because they're evil gnomes" what does it matter? If the answer was "because it's inefficient" how does that help? Feature [X] is still missing, and all the "why?" questions won't really help add it back into the language.

It's possible that there's a legitimate question hidden under the invective. It might be "How do I implement [X] in Python? For examples, see Perl/PHP/Haskell/Java/C#." Notice that this question is transformed into an active-voice verb: "implement".

If we look at the three-step question approach above, there's no active-voice verb behind a "why" question. What you "know" isn't easy to provide answers for; knowledge is simply hard to supply. Questions about what you want to do are much, much easier to answer.

Probing/Confirming Questions

One other category is the "questions" that post a pile of details looking for confirmation. There are three common variations.
  • tl;dr. The wealth of detail was overwhelming. I'm a big fan of the "detail beat-down". It seems like some folks don't need to summarize. There appear to be people with massive brains that don't need models, abstractions or summaries, but are perfectly capable of coping with endless details. It would be helpful if these folks could "write down" to those of us with small brains who need summaries.
  • No question at all, or the question is a closed-ended "Do you agree?" An answer of "No." is probably not what they wanted. But what can you do? That's all they asked for.
  • Sometimes the question is "Any comments?" This often stems from having no clear goal. Generally, if you've done a lot of research and you simply want confirmation, there's no question there. If you've got doubts, that means you need to do something to correct the problems.
Here's what is really important with tl;dr questions: What do you want to do?

80% of the time, it's "Fix my big, complex tl;dr proposal to correct problem [X]." [X] could be "security" or "deadlock" or "patent infringement" or "cost overrun" or "testability".

Here's how to adjust this question from something difficult to answer to something good.

You want to know if your tl;dr proposal has problem [X]. You're really looking for confirmation that your tl;dr proposal is free from problem [X]. This is something you want to know -- but knowledge is not a great goal. It's too hard to guess what you don't know; lots of answers can provide almost the right information.

Reframe your goal: drop knowledge and switch to action. What do you want to do? You want to show that your tl;dr proposal is free from problem [X]. So ask that: "How do I show my tl;dr proposal is free from problem [X]?"

Once you write that down, you now have to focus your tl;dr proposal to get the answer to this question. In many cases, you can pare things down to some relevant parts that can be shown to be free from problem [X]. In most cases, you'll uncover the problem on your own. In other cases, you've got a good open-ended question to start a useful conversation that will give you something you can do.

Tuesday, November 23, 2010

Open-Source, moving from "when" to "how"

Interesting item in the November 1 eWeek: "Open-Source Software in the Enterprise".

Here's the key quote: "rather than asking if or when, organizations are increasingly focusing on how".

Interestingly, the article then goes on to talk about licensing and intellectual property management. I suppose those count, but they're fringe issues, only relevant to lawyers.

Here are the two real issues:
  1. Configuration Management
  2. Quality Assurance
Many organizations do things so poorly that open source software is unusable.

Configuration Management

Many organizations have non-existent or very primitive CM. They may have some source code control and some change management. But the configuration of the test and production technology stacks are absolutely mystifying. No one can positively say what versions of what products are in production or in test.

The funniest conversations center on the interconnectedness of open source projects. You don't just take a library and plug it in. It's not like neatly-stacked laundry, all washed and folded and ready to be used. Open Source software is more like a dryer full of a tangled collection of stuff that's tied in knots and suffers from major static cling.

"How do we upgrade [X]"? You don't simply replace a component. You create a new tech stack with the upgraded [X] and all of the stuff that's knotted together with [X].

Changing from Python 2.5 to 2.6 changes any binary-compiled libraries like PIL or MySQL_python, mod_wsgi, etc. These, in turn, may require OS library upgrades.

A tech stack must be a hallowed thing. Someone must actively manage change to be sure the stacks are complete and consistent across the enterprise.

Quality Assurance

Many organizations have very weak QA. They have an organization, but it has no authority, and developers are permitted to run roughshod over QA any time they use the magic words "the users demand it".

The truly funny conversations center on how the organization can be sure that open source software works, or is free of hidden malware. I've been asked how a client can vet an open source package to be sure that it is malware free. As if the client's Windows PCs are pristine works of art and the Apache POI project is just a logic bomb.

The idea that you might do acceptance testing on open source software always seems foreign to everyone involved. You test your in-house software. Why not test the downloaded software? Indeed, why not test commercial software for which you pay fees? Why does QA only seem to apply to in-house software?

Goals vs. Directions

I think one other thing that's endlessly confusing is "Architecture is a Direction not a Goal." I get the feeling that many organizations strive for a crazy level of stability where everything is fixed, unchanging and completely static (except for patches.)

The idea that we have systems on a new tech stack and systems on an old tech stack seems to lead to angry words and stalled projects. However, there's really no sensible alternative.

We have tech stacks [X.1], [X.2] and [X.3] running in production. We have [X.4] in final quality assurance testing. We have [X.5] in development. The legacy servers running version 1 won't be upgraded; they'll be retired. The legacy servers running version 2 may be upgraded, depending on the value of the new features vs. the cost of upgrading. The data in the version 3 servers will be migrated to version 4 servers, and the old servers retired.

It can be complex. The architecture is a direction in which most (but not all) servers are heading. The architecture changes, and some servers catch up to the golden ideal and some servers never catch up. Sometimes the upgrade doesn't create enough value.

These are "how" questions that are more important than studying the various licensing provisions.

Thursday, November 18, 2010

Software Patents

Here's an interesting news item: "Red Hat’s Secret Patent Deal and the Fate of JBoss Developers".

Here's an ancient -- but still relevant -- piece from Tim O'Reilly: "Software and Business Method Patents".

Here's a great article in Slate on the consequences of software patents. "Weapons of Business Destruction: How a tiny little 'patent troll' got BlackBerry in a headlock".

The biggest issue with software patents is always the "non-obvious" issue. Generally, this can be debated, so a prior art review is far more valuable.


To participate, see Peer To Patent. Locate prior art and make patent trolls get real jobs.


Thursday, November 11, 2010

Hadoop and SQL/Relational Hegemony

Here's a nice article on why Facebook, Yahoo and eBay use Hadoop: "Asking Any Question Of All Your Data".

The article has one tiny element of pandering to the SQL hegemonists.

Yes, it sounds like a conspiracy theory, but it seems like there really are folks who will tell you that the relational database is effectively perfect for all data processing and should not be questioned. To bolster their point, they often have to conflate all data processing into one amorphous void. Relational transactions aren't central to all processing, just certain elements of data processing. There, I said it.

Here's the pandering quote: "But this only works if the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner".

What?

Is he saying that relational databases do not impose the same constraint?

Clearly, the RDBMS has the same "catch". The relational database only works if "...the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner."

Pandering? Really?

Here's why it seems like the article is pandering. Because it worked. It totally appealed to the target audience. I saw this piece because a DBA -- a card-carrying member of the SQL Hegemony cabal -- sent me the link, and highlighted two things. The DBA highlighted the "powerful enough" quote.

As if to say, "See, it won't happen any time soon, Hadoop is too resource intensive to displace the RDBMS."

Which appears to assume that the RDBMS isn't resource intensive.

Further, the DBA had to add the following. "The other catch which is not stated is the skill level required of the people doing the work."

As if to say, "It won't happen any time soon, ordinary programmers can't understand it."

Which appears to assume that ordinary programmers totally understand SQL and the relational model. If they did understand SQL and the relational model perfectly, why would we have DBAs? Why would we have performance tuning? Why would we have DBAs adjusting normalization to correct application design problems?

Weaknesses

So the weaknesses of Hadoop are that it (a) demands resources and (b) requires specialized skills. Okay. But isn't that the exact same weakness as the relational database?

Which causes me to ask why an article like this has to pander to the SQL cabal by suggesting that Hadoop requires a big compute engine? Or is this just my own conspiracy theory?

Tuesday, November 9, 2010

Data Mapping and Conversion Tools -- Sigh

Yes, ETL is interesting and important.

But creating a home-brewed data mapping and conversion tool isn't interesting or important. Indeed, it's just an attractive nuisance. Sure, it's fun, but it isn't valuable work. The world doesn't need another ETL tool.

The core problem is talking management (and other developers) into a change of course. How do we stop development of Yet Another ETL Tool (YAETLT)?

First, there are products like Talend, CloverETL and Pentaho open source data integration. Open Source. ETL. Done.

Then, there's this list of Open Source ETL products on the Manageability blog. This list is all Java, but there's nothing wrong with Java. There are a lot of jumping-off points in this list. Most importantly, the world doesn't need another ETL tool.

Here's a piece on Open Source BI, just to drive the point home.

Business Rules

The ETL tools must have rules: either simple field alignment or more complex transformations. The rules can either be interpreted ("engine-based" ETL) or used to build a stand-alone program ("code-generating" ETL).

The engine-based ETL, when written in Java, is creepy. We have a JVM running a Java app. The Java app is an interpreter for a bunch of ETL rules. Two levels of interpreter. Why?

Code-generating ETL, OTOH, is a huge pain in the neck because you have to produce reasonably portable code. In Java, that's hard. Your rules are used to build Java code; the resulting Java code can be compiled and run. And it's often very efficient. [Commercial products often produce portable C (or COBOL) so that they can be very efficient. That's really hard to do well.]

Code-generating, BTW, has an additional complication: bad behavior. Folks often tweak the resulting code, either because the tool wasn't able to generate all the proper nuances, or because the tool-generated code was inefficient in a way so grotesque that it couldn't be fixed by an optimizing compiler. It happens: rules can run afoul of the tool's boilerplate loops.

Old-School Architecture

First, we need to focus on the "TL" part of ETL. Our applications receive files from our customers. We don't do the extract -- they do. This means that each file we receive has a unique and distinctive "feature". We have a clear SoW and examples. That doesn't help. Each file is an experiment in novel data formatting and Semantic Heterogeneity.

A common old-school design pattern for this could be called "The ETL Two-Step". This design breaks the processing into "T" and "L" operations. There are lots of unique, simple, "T" options, one per distinctive file format. The output from "T" is a standardized file. A simple, standardized "L" loads the database from the standardized file.

Indeed, if you follow the ETL Two Step carefully, you don't need to actually write the "L" pass at all. You prepare files which your RDBMS utilities can simply load. So the ETL boils down to "simple" transformation from input file to output file.

Folks working on YAETLT have to focus on just the "T" step. Indeed, they should be writing Yet Another Transformation Tool (YATT) instead of YAETLT.

Enter the Python

If all we're doing is moving data around, what's involved?

import csv
result = {
    'column1': None,
    'column2': None,
    # etc.
}
with open("source", "rb") as source:
    rdr = csv.DictReader(source)
    with open("target", "wb") as target:
        wtr = csv.DictWriter(target, result.keys())
        for row in rdr:
            result['column1'] = row['some_column']
            result['column2'] = some_func(row['some_column'])
            # etc.
            wtr.writerow(result)

That's really about it. There appear to be 6 or 7 lines of overhead. The rest is working code.

But let's not be too dismissive of the overhead. An ETL depends on the file format, summarized in the import statement. With a little care we can produce libraries similar to Python's csv that work with XLS directly, as well as XLSX and other formats. Dealing with COBOL-style fixed layout files can also be boiled down to an importable module. The import isn't overhead; it's a central part of the rules.
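As a sketch of what such an importable module might look like, here's a minimal DictReader-style class for COBOL-style fixed-layout files. The class name, the layout format, and the sample field positions are all invented for illustration; a real layout would come from the file's record description.

```python
import io

class FixedReader:
    """Iterate over fixed-layout records as dicts, like csv.DictReader."""
    def __init__(self, source, layout):
        # layout: list of (name, start, end) column slices
        self.source = source
        self.layout = layout
    def __iter__(self):
        for line in self.source:
            yield {name: line[start:end].strip()
                   for name, start, end in self.layout}

# Usage with an in-memory example file: id in columns 0-4,
# name in 4-14, amount right-justified in 14-20.
layout = [("id", 0, 4), ("name", 4, 14), ("amount", 14, 20)]
data = io.StringIO(
    "{:<4}{:<10}{:>6}\n".format("0001", "Smith", "1050")
    + "{:<4}{:<10}{:>6}\n".format("0002", "Jones", "275")
)
rows = list(FixedReader(data, layout))
```

Because the reader presents the same row-of-dicts interface as csv.DictReader, the transformation loop above doesn't change when the file format does; only the import and the reader construction vary.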

The file open functions could be seen as overhead. Do we really need a full line of code when we could -- more easily -- read from stdin and write to stdout? If we're willing to endure the inefficiency of processing one input file multiple times to create several standardized outputs, then we could eliminate the two with statements. If, however, we have to merge several input files to create a standardized output file, the one-in-one-out model breaks down and we need the with statements and the open functions.
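The one-in-one-out model can be sketched as a filter written against file-like objects, so the same function works with sys.stdin/sys.stdout or with explicitly opened files. The column names and the upper-case transformation here are stand-ins for whatever the real rules require.

```python
import csv
import io

def transform(source, target):
    """A minimal "T" step: read one CSV, write one standardized CSV."""
    rdr = csv.DictReader(source)
    wtr = csv.DictWriter(target, ["column1", "column2"])
    wtr.writeheader()
    for row in rdr:
        wtr.writerow({
            "column1": row["some_column"],
            "column2": row["some_column"].upper(),  # stand-in for some_func
        })

# As a command-line filter this would be: transform(sys.stdin, sys.stdout)
# Demonstrated here with in-memory files.
source = io.StringIO("some_column\nabc\ndef\n")
target = io.StringIO()
transform(source, target)
```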

The for statement could be seen as needless overhead. It goes without saying that we're processing the entire input file. Unless, of course, we're merging several files. Then, perhaps, it's not a simple loop that can be somehow implied.

It's Just Code

The point of Python-based ETL is that the problem "solved" by YATT isn't that interesting. Python is an excellent ETL transformation engine. Rather than write a fancy rule interpreter, just write Python. Done.

We don't need a higher-level data transformation engine written in Java. Emit simple Python code and use the Python engine. (We could try to emit Java code, but it's not as simple and requires a rather complex supporting library. Python's Duck Typing simplifies the supporting library.)

If we don't write a new transformation engine, but use Python, that leaves a tiny space left over for the YATT: producing the ETL rules in Python notation. Rather than waste time writing another engine, the YATT developers could create a GUI that drags and drops column names to write the assignment statements in the body of the loop.

That's right, the easiest part of the Python loop is what we can automate. Indeed, that's about all we can automate. Everything else requires complex coding that can't be built as "drag-and-drop" functionality.

Transformations

There are several standard transformations.
  • Column order or name changes. Trivial assignment statements handle this.
  • Mapping functions. Some simple (no hysteresis, idempotent) function is applied to one or more columns to produce one or more columns. This can be as simple as a data type conversion, or a complex calculation.
  • Filter. Some simple function is used to include or exclude rows.
  • Reduction. Some summary (sum, count, min, max, etc.) is applied to a collection of input rows to create output rows. This is an ideal spot for Python generator functions. But there's rarely a simple drag-n-drop for these kinds of transformations.
  • Split. One file comes in, two go out. This breaks the stdin-to-stdout assumption.
  • Merge. Two go in, one comes out. This breaks the stdin-to-stdout assumption, also. Further, the matching can be of several forms. There's the multi-file merge when several similarly large files are involved. There's the lookup merge when a large file is merged with smaller files. Merging also applies to doing key lookups required to match natural keys to locate database FK's.
  • Normalization (or Distinct Processing). This is a more subtle form of filter because the function isn't idempotent; it depends on the state of a database or output file. We include the first of many identical items; we exclude the subsequent copies. This is also an ideal place for Python generator functions.
Of these, only the first three are candidates for drag-and-drop. And for mapping and filtering, we either need to write code or have a huge library of pre-built mapping and filtering functions.
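The filter and normalization transformations above can be sketched as Python generator functions. The function names, row structure, and key/predicate arguments are invented for illustration; the point is that each rule is a few lines of ordinary Python.

```python
def keep(rows, predicate):
    """Filter: include only rows satisfying the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def distinct(rows, key):
    """Normalization: keep the first of many identical items, drop copies.

    Not idempotent row-by-row -- it depends on the state of what's
    already been seen, which is why a generator fits so well.
    """
    seen = set()
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            yield row

# Usage: generators compose lazily, one row at a time.
rows = [
    {"id": 1, "amount": 50},
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 0},
]
big = list(keep(rows, lambda r: r["amount"] > 0))
unique = list(distinct(rows, key=lambda r: r["id"]))
```

Because generators compose, a pipeline like `distinct(keep(rdr, pred), key)` processes arbitrarily large files without holding them in memory.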

Problems and Solutions

The YATT problem has two parts. Creating the rules and executing the rules.

Writing another engine to execute the rules is a bad idea. Just generate Python code. It's a delightfully simple language for describing data transformation. It already works.

Writing a tool to create rules is a bad idea. Just write the Python code and call it the rule set. Easy to maintain. Easy to test. Clear, complete, precise.

Thursday, November 4, 2010

Pythonic vs. "Clean"

This provokes thought: "Pythonic".

Why does Python have a "Pythonic" style? Why not "clean"?

Is it these lines from Tim Peters' "The Zen of Python" (a/k/a import this)?
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Perhaps having a PEP 8, a BDFL (and FLUFL) means that there's a certain "pressure" to conform?

Or do we have higher standards than other languages? Or less intellectual diversity?

I think that "pythonic" is just a catchy phrase that rolls off the tongue. I think a similar concept exists in all languages, but there isn't a good phrase for it in most other languages. Although Ned Batchelder has some really good suggestions. (Except for C++, which should be "C-Posh-Posh" for really good coding style.)

History

When I was a COBOL programmer, there were two buzz-phrases used. "Clean" and "Structured". Clean was poorly-defined and really just a kind of cultural norm. In those days, each shop had a different opinion of "clean" and the lack of widespread connectivity meant that each shop had a more-or-less unique style. Indeed, as a traveling consultant, I helped refine and adjust those standards because of the wide variety of code I saw in my travels.

"Structured" is pretty much an absolute. Each GOTO-like thing had to be reworked as properly nested IFs or PERFORMs. No real issue there. Except from folks who argued that "Structured" was slower than non-Structured. A load of malarkey, but one I heard remarkably often.

When I was a Fortran (and Ada) programmer, I worked for the military in which there were simply absolute standards for every feature of the source code. Boring. And no catchy buzz-word. Just "Compliant" or "Wrong".

Since it was the early '90s (and we were sequestered) we didn't have much Internet access. Once in a while we'd have internal discussions on style where the details weren't covered by any standard. Not surprisingly, they amounted to "Code Golf" questions. Ada has to be perfectly clear, which can be verbose, and some folks don't like clarity.

When I became a C programmer, I found Thomas Plum's Reliable Data Structures in C. That provided a really good set of standards. The buzzword I used was "Reliable".

The problem with C programming is that "Clean" and "Code Golf" get conflated all the time. Folks write the craziest crap, claim it's "clean" and ignore the resulting obscurity. Sigh. I wish folks would stick with "Reliable" or "Maintainable" rather than "Clean".

While doing Perl programming I noticed that some folks didn't seem to realize the golden rule.
No One Wins At Code Golf
I don't know why. Other than to think that some folks felt that Perl programs weren't "real" programs. They were just "scripts" and could be treated with a casual contempt.

When I learned Java, I noted that an objective was to have a syntax that was familiar. It was a stated goal to have the Java style guidelines completely overlap with C and C++ style guidelines. Fair enough. Doesn't solve the "Code Golf" vs. "Clean" problem. But it doesn't confound it with another syntax, either.

Python

From this history, I think that "Pythonic" exists because we have a BDFL with high standards.

Tuesday, November 2, 2010

"Might Be Misleading" is misleading

My books (Building Skills in Programming, Building Skills in Python and Building Skills in OO Design) develop a steady stream of email. [Also, as a side note, I need to move them to the me.com server, Apple is decommissioning the homepage.mac.com domain.]

The mail falls into several buckets.

Thanks. Always a delight. Keep 'em coming.

Suggestions. These are suggestions for new topics. Recently, I've had a few requests for Python 3 coverage. I'm working with a publisher on this, and hope -- before too long -- to have real news.

Corrections. I get a lot of these. A lot. Keep 'em coming. I didn't pay a copy editor; I tried to do it myself. It's hard and I did a poor job. More valuable than spelling corrections are technical corrections. (I'm happy to report that I don't get as many of these.) Technical corrections are the most serious kind of correction and I try to fix those as quickly as possible.

Source Code Requests. No. I don't supply any source. If I send you the source, what skill did you build? Asking for source? Not a skill that has much value, IMHO. If you want to learn to program, you have to create the source yourself. That is the job. Sorry for making you work, but you have to actually do the work. There's no royal road to programming.

The "Other" Bucket

I get some emails that I file under "other" because they're so funny. They have the following boilerplate.

"Code fragment [X] might be misleading because [Y]."

First, it's a complaint, not a question. That's not very helpful. That's just complaining. Without a suggested improvement, it's the worst kind of bare negativity.

The best part is that — without exception — the person sending the email was not misled. They correctly understood the code examples.

Clearly, the issue isn't that the code is "misleading" in the usual sense of "lying" or "mendacious". If it was actually misleading, then (a) they wouldn't have understood it and (b) there'd be a proper question instead of a complaint.

Since they correctly understood it, what's misleading?

User Interface Reviews

In software development, we used to go through the "misleading" crap-ola in user interface reviews. In non-Agile ("waterfall") development, we have to get every nuance, every aspect, every feature of the UI completely specified before we can move on. Everyone has to hand-wring over every word, every font choice, field order, button placement, blah, blah and blah.

It seems like 80% of the comments are "label [X] might be misleading". The least useful comment, of course, is this sort of comment with no suggestion. The least useful reviewer is the person who (1) provides a negative comment and, when asked for an improvement, (2) calls a meeting of random people to come up with replacement text.

[Hint: If you eventually understood the misleading label, please state your understanding in your own words. Often, hilarity ensues when their stated understanding cycles back to the original label.]

The "label [X] might be misleading" comment is — perhaps — the most expensive nuisance comment ever. Projects wind up spinning through warrens of rat-holes chasing down some verbiage that is acceptable. After all, you can't go over the waterfall until the entire UI is specified, right?

Worse, of course, the best sales people do not interpose themselves into the sales process. They connect prospective customers with products (or services). Really excellent sales people can have trouble making suggestions. Their transparency is what makes them good. It's not sensible demanding suggestions from them.

Underneath a "Might Be Misleading" comment, the person complaining completely understood the label. They were not actually misled at all. If it was misleading, then (a) they wouldn't have understood it and (b) there'd be a proper question instead of a complaint.

Thank goodness for Agile product owners who can discard the bad kind of negativity. The right thing to do is put a UI in front of more than one user and bypass the negativity with a consensus that the UI actually is usable and isn't really misleading.

Might Be Misleading

The "Might be Misleading" comments are often code-speak for "I don't like it because..." And the reason why is often "because I had to think." I know that thinking is bad.

I understand that Krug's famous Don't Make me Think is the benchmark in usability. And I totally agree that some thinking is bad.

There are two levels of thinking.
  • Thinking about the problem.
  • Thinking about the UI and how the UI models the problem.
Krug's advice is clear. Don't make users think about the UI and how the UI models the problem. Users still have to think about the problem itself.

In the case of UI labels which "Might Be Misleading", we have to figure out if it's the problem or the UI that folks are complaining about. In many cases, parts of the problem are actually hard and no amount of UI fixup can ever make the problem easier.

Not Completely Accurate

One of the most common UI label complaints is that the label isn't "completely" accurate. They seem to object to the fact that a UI label can only contain a few words and they have to actually understand the few words. I assume that folks who complain about UI labels also complain about light switches having just "on" and "off" as labels. Those labels aren't "completely" accurate. It should say "power on". Indeed it should say "110V AC power connected". Indeed it should say "110V AC power connected through load". Indeed it should say "110V AC 15 A power connected via circuit labeled #4 through load with ground".

Apparently this is news. Labels are Summaries.

No label can be "completely" accurate. You heard it here first. Now that you've been notified, you can stop complaining about labels which "might be misleading because they're not completely accurate." They can't be "completely" accurate unless the label recapitulates the entire problem domain description and all source code leading to the value.

Apologies

In too many cases of "Might Be Misleading," people are really complaining that they don't like the UI label (or the code example) because the problem itself is hard. I'm sympathetic that the problem domain is hard and requires thinking.

Please, however, don't complain about what "Might Be Misleading". Please try to focus on "Actually Is Misleading."

Before complaining, please clarify your understanding.

Here's the rule. If you eventually understood it, it may be that the problem itself is hard. If the problem is hard, fixing the label isn't going to help, is it?

If the problem is hard, you have to think. Some days are like that. The UI designer and I apologize for making you think. Can we move on now?

If the label (or example) really is wrong, and you can correct it, that's a good thing. Figure out what is actually misleading. Supply the correction. Try to escalate "Might Be Misleading" to "Actually Misled Someone". Specifics matter.

Also, please remember that labels are summaries. At some point, details must be elided. If you have trouble with the concept of "summary", you can do this. (1) Write down all the details that you understand. Omit nothing. (2) Rank the details in order of importance. (3) Delete words to pare the description down to an appropriate length to fit in the UI. When you're done, you have a suggestion.