Saturday, April 30, 2011

Language, Tools, Chickens, Eggs, Java and Python

Too much of programming is intimately tied up with the tools to support the development of the software.

Example 1. I was told -- with absolute and fierce conviction -- that VB may suck as a language, but Visual Studio more than makes up for the obvious problems. For some people, Tools Trump Language. Sadly, I've also had customers with ancient code they could no longer compile or maintain because the tools were out of support.

On Stack Overflow, you can read questions like this: "What IDE to use for Python?". In spite of this question's immense popularity, it gets re-asked all the time. Search for "Python IDE" to see endless duplicates. One of the most common duplicate forms of this question asks (or demands) code completion. As if there are folks who cannot write code without code completion.

Chickens and Eggs

The issue with sophisticated IDEs (like Eclipse, NetBeans, and even Komodo) is that you have to learn the tools before learning the language. Until you know something about the language, the tools, of course, are useless. Worse, Eclipse is for "enterprise" applications and is so fat with bells (and whistles) that it's hard to determine what to use and what it means.

So the tool is a prerequisite for the language. But the language is a prerequisite for the tool.

How to cut the Gordian Knot?

First Principles

Irrespective of the "Visual Studio makes VB not suck" crowd, language comes first -- and last -- and fills all the spaces in between.

Language is everything. Software is merely encoded knowledge. The language of that encoding is how we determine meaning; how we argue about correctness, adaptability, maintainability and security. Tools don't endure -- they come and go -- but the language remains.

The only thing more important than the language is the data itself. But that's another rant.

Proof, of course, is available everywhere except in VB circles. For non-proprietary languages (Java, Python, etc., etc.) there are a large number of competing tools. One language, many tools. Take the hint. Language is important.

Yes, some tools are so flexible, they cover several languages. But there's no universal tool any more than there's a universal language. And the bias is clearly very, very many tools for a given language and only a few languages for a given tool.

How To Start

Language comes first.

For Python, that's easy. Run Python, type code at the >>> prompt, and you're learning. Python comes with IDLE, which is a minimalist IDE. It will get anyone started. Later, they can try other IDEs.

For Java, however, it's not that easy. It isn't impossible to get started. It's just challenging.

Option 1 -- Bare Knuckles. It's possible to edit text and run the javac compiler to learn a great deal of Java without an IDE. It's not a bad idea. It will get complex to manage projects with more than a few files.

Eventually that's what Ant, Maven and SCons are for. But that's not a good place to start. Again, the tools don't make sense until you start writing things big enough that the tools actually help.

Option 2 -- Succession of IDEs. It's probably best to start with a very simple IDE for Java. Something like Komodo Edit, TextMate or BBEdit. There are a lot of choices, but the idea is to find something little more than a text editor with a few tools. I've used these and like their relative simplicity.

The JavaWIDE toolset might be helpful. I haven't used it, but some folks suggest that it simplifies the language learning. Later a "regular" desktop IDE can be used.

Later, one can move to NetBeans or Eclipse.

Classrooms and Autodidacts

In the classroom, it's easy to demonstrate NetBeans and answer questions.

For autodidacts, however, choosing the wrong tool leads to endless confusion. The chicken and egg issue isn't clarified by wasting time trying to install and use a tool that's too sophisticated for a n00b.

N00b autodidacts really need to start with a simple text editor. They need to use `javac` to compile and `java` to run the resulting class. For the first week or two, this will do. Once past the fundamentals, however, IDE selection can start to make sense. A BBEdit/TextMate/Komodo thing should be next. This is good for perhaps a year or more. Then, when doing "real" programming, a heavier-weight tool makes sense.

Tuesday, April 19, 2011

Test-Driven Reverse Engineering (TDRE)

Another case study on TDRE.

Provided: 2,938 lines of Python code which process a handful of large files to create a number of outputs. [Details can't be disclosed.]

Objective: Refactor to distinguish between the overall sequence of transformational steps and the details of each individual step.

Observations

The code is almost purely procedural. There are 11 class definitions. 6 of these wrap built-in types with type conversion and null-handling. 1 is a new exception. 1 is a generic "table" that essentially duplicates features of SQLite. The remaining 3 are actually part of the problem domain.

One reason for reverse engineering is that the code has reached an intellectual limit. It's small, but "dense" with highly-optimized processing steps. The cohesion type is almost all "Temporal". Processing is grouped into successive processing loops; each loop contains a cluster of processing steps. Consequently, it's quite hard to tease apart the algorithm to get a "big picture" of what's going on. It's just a dense stand of trees. No forest.

Another reason for reverse engineering is to support the endless adaptation and modification of the code base. The program is a kind of "spreadsheet on steroids". This isn't a simplistic collection of cells and formulæ that permits simple what-if analysis. This is a more complex set of formulæ that would be challenging (but not impossible) to implement as a spreadsheet. The use case, however, is the spreadsheet use case: think, tweak, create results, repeat.

TDRE Approach

Start with an Initial Survey of the legacy code base and sample files.

Create an Outline or "sketch" of the domain model and main program. This will be a module (or a package) with comments and some preliminary class definitions. Little more.

Pick a processing Step in the legacy code. This often requires creating processing summaries of the legacy code. Most legacy code is procedural, so the processing tends to be sequential in nature.

Instrument the Legacy Code with print statements to gather data. This can be simple. The output can be challenging to interpret.

with open("tdre_results_1", "w") as tdre:
    # some legacy processing
    print("Case:", foo, bar, ", Expect:", baz, file=tdre)

From the output, Build Unit Test Cases. Fill in parts of the processing sequence and domain model. Debug code until the tests pass.
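Here's a minimal sketch of turning that captured output into unit tests. It assumes the lines look like what the print statement above would write, with integer values. The names `CAPTURED`, `parse_case` and `new_step`, and the sample values themselves, are made up for illustration; the real step and data will differ.

```python
import unittest

# Hypothetical captured lines, in the shape the instrumented print() produces.
CAPTURED = """\
Case: 3 4 , Expect: 7
Case: 10 -2 , Expect: 8
"""


def parse_case(line):
    """Split one captured line into (inputs, expected), all as integers."""
    case_part, expect_part = line.split(", Expect:")
    inputs = tuple(int(tok) for tok in case_part.replace("Case:", "").split())
    return inputs, int(expect_part)


def new_step(foo, bar):
    """Hypothetical re-implementation of one legacy processing step."""
    return foo + bar


class TestStepAgainstLegacy(unittest.TestCase):
    def test_captured_cases(self):
        # Each captured line becomes one sub-test against the new code.
        for line in CAPTURED.splitlines():
            inputs, expected = parse_case(line)
            with self.subTest(inputs=inputs):
                self.assertEqual(new_step(*inputs), expected)
```

Saved as a module, this runs with `python -m unittest`. In practice the captured lines would be read from the results file rather than pasted into the test module.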

Initial Survey

The Initial Survey locates several things.
  1. The usable, working modules. It appears that all reverse engineering involves a code base with dead or unused code. Even a small project (3,000 lines) will have a remarkable amount of dead code.
  2. Priorities for the implemented functionality. Not every "main" module is relevant.
  3. Example inputs and outputs.
If the software cannot be run (as is the case with organically developed systems that depend on large, complex corporate databases), then the example inputs and outputs may not actually match the software. If the software can be run, it should be run and the actuals compared against the samples to confirm that the code base supplied really produced the sample outputs.
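One way to do that comparison is a plain text diff. This is only a sketch using the standard library's `difflib`; the function name `diff_outputs` and the file paths in the usage note are assumptions, not part of the case study.

```python
import difflib
from pathlib import Path


def diff_outputs(sample_path, actual_path):
    """Return unified-diff lines; an empty list means the files match."""
    sample = Path(sample_path).read_text().splitlines(keepends=True)
    actual = Path(actual_path).read_text().splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            sample,
            actual,
            fromfile=str(sample_path),
            tofile=str(actual_path),
        )
    )
```

Calling something like `diff_outputs("samples/report.txt", "run/report.txt")` (hypothetical paths) and getting an empty list back is decent evidence that the code supplied really produced the sample. Anything else is the start of a conversation about which version is in production.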

Expect that the provided legacy code is slightly different from the code in production use. In some cases, this cannot be resolved; for example, when the executables are older than the source. In other cases, the code matches and no further work is required to establish the legacy baseline.

The sample outputs point in the direction of an acceptance test case. The sample output cannot be taken literally as the one-and-only acceptance test. While it's desirable for reverse engineering to reproduce the sample output, most reverse engineering will involve enhancements or bug fixes. Expect that errors will be found (or may be known to exist) in the sample output.

Create Outline

The outline is -- initially -- just generic MVP. There must be a domain model, some "presenter" that has the application logic, and some "view" for displaying the outputs.

In our case study, above, the "view" is a collection of (mostly text) output files. The model was undefined in the legacy code, which was all "presenter" application logic.

The goal was to extract the underlying model, break the application "presenter" logic into two layers (forest and trees) and build some views for each of the output files.
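Since the details can't be disclosed, here's a purely hypothetical sketch of what such an outline might look like at the start. Every name in it (`Record`, `parse_rows`, `summarize`, `main_sequence`, `render_totals`) is invented to show the shape: a model, a "trees" layer of individual steps, a "forest" layer that only sequences them, and a view.

```python
"""Outline sketch: domain model, presenter (two layers), and view."""
from dataclasses import dataclass


# --- Model: placeholder for a problem-domain class (hypothetical fields).
@dataclass
class Record:
    key: str
    amount: float


# --- Presenter, "trees" layer: the details of each individual step.
def parse_rows(rows):
    """Step 1: turn raw (key, amount) pairs into domain objects."""
    return [Record(key, float(amount)) for key, amount in rows]


def summarize(records):
    """Step 2: total the amounts per key."""
    totals = {}
    for rec in records:
        totals[rec.key] = totals.get(rec.key, 0.0) + rec.amount
    return totals


# --- Presenter, "forest" layer: only the overall sequence of steps.
def main_sequence(rows):
    records = parse_rows(rows)
    return summarize(records)


# --- View: one function per output file; here it just formats text lines.
def render_totals(totals):
    return [f"{key}\t{total:.2f}" for key, total in sorted(totals.items())]
```

The point of the split is that `main_sequence` reads like a table of contents, while each step can be unit tested on its own.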

Pick a Processing Step

This can be challenging, depending on the legacy code base. There are two paths through a procedural code base.
  • Back to Front. Start with the final results and unit test the final steps based on previous steps that will be defined later.
  • Front to Back. Start with the first recognizable intermediate result based on the input files. Unit test the initial steps.
It's more rewarding to work front-to-back because progress can be shown a little more clearly.

A better architecture can be created by working back-to-front since dependencies are easier to understand.

Unit Test Volume, Edges and Corners

There are two unit test design challenges when doing reverse engineering.
  • Volume. The sample data can be large. 100,000 rows of sample data is too many to test. Finding a "representative" subset is difficult. Generally, arbitrary subsets have to be used to get started. Once the application mostly works, more refined unit tests need to be created.
  • Edge and Corner Cases. While the code may be riddled with if-statements, it can still be difficult to locate sample inputs that exercise the various conditions in the code. It's risky to create data -- we have to assume that the legacy code does unexpected things. In many cases, print statements have to be put into complex if statements to locate any actual data that exercises that logic path.
Once the unit tests are built, this is just Test-Driven Development (TDD).
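Both challenges can be sketched in a few lines. This is an illustration under assumptions: `arbitrary_subset` uses a fixed seed so the subset is repeatable, and `rows_hitting_branches` swaps the print statements for a labelling function that reports which branch a row takes. `legacy_branch` is a made-up stand-in for whatever if-statement is being probed.

```python
import random


def arbitrary_subset(rows, size, seed=42):
    """Pick a repeatable arbitrary subset of rows to start testing with."""
    rows = list(rows)
    if len(rows) <= size:
        return rows
    return random.Random(seed).sample(rows, size)


def rows_hitting_branches(rows, classify):
    """Keep the first row seen for each branch label that classify reports."""
    examples = {}
    for row in rows:
        examples.setdefault(classify(row), row)
    return examples


def legacy_branch(row):
    """Hypothetical stand-in for the legacy if-statement being probed."""
    if row < 0:
        return "negative"
    elif row == 0:
        return "zero"
    return "positive"
```

Running `rows_hitting_branches` over the real sample data yields one actual row per logic path, which avoids inventing data the legacy code might treat in unexpected ways. A branch with no example is either dead code or a gap in the samples.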

Tuesday, April 5, 2011

Performance Discussions and Software Design

Read this first: "There is something I find interesting about online discussions around performance issues..." It's about Stack Overflow, specifically. Apparently, someone didn't get their question answered and decided it was better to gripe than to rewrite the question.

Let's look at their response in pieces.

"people try to gang up". Since there's almost no social networking capability, this is a bit much to attribute to people responding to a poorly-worded question. But, if you've worked all day on a bad solution to a poorly-conceived problem, it can feel like being ganged up on. When reality leaks in, it can feel unpleasant.

Hint 1. There are no gangs. It's possible that the question really is poorly written.

"cookie-cutter, patronizing, zero-information responses". I'm guessing these are comments suggesting the approach is bad and asking for clarification. I run afoul of this often because I feel compelled to post comments asking for clarification. Some folks just don't like to clarify. More than once I've been told that their question was very clear. Since I'm asking for clarification, it seems odd to insist the question is perfect. Worse, of course, is asking for help on Stack Overflow, but refusing to clarify the help required.

Hint 2. Clarify. Please. Don't insist that the question is perfect.

"they assume, without any basis, that the person has not (a) benchmarked the code," When the question has no benchmark data, this isn't an assumption. It's a response to the lack of benchmark data.

Hint 3. Provide the facts. Don't complain when folks ask for facts.
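For a Python question, the facts can be as small as this. It's a sketch using the standard `timeit` module; the two string-building functions are hypothetical candidates I made up, and `benchmark` is just a convenience wrapper, not anything from the original discussion.

```python
import timeit


def concat_with_join(n):
    """Candidate A: build the string with str.join."""
    return "".join(str(i) for i in range(n))


def concat_with_plus(n):
    """Candidate B: build the string with repeated +=."""
    text = ""
    for i in range(n):
        text += str(i)
    return text


def benchmark(func, n=1000, repeat=5, number=100):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    for func in (concat_with_join, concat_with_plus):
        per_call = benchmark(func) * 1e6
        print(f"{func.__name__}: {per_call:.1f} microseconds per call")
```

Pasting the code and the numbers it prints into the question answers the "did you benchmark?" comment before anyone has to ask it.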

"(b) is obviously running an inferior algorithm". Again, this isn't an assumption. It's the response to an incomplete question where the algorithm isn't provided. Also, it's a common response to questions where the algorithm really is inferior.

Hint 4. Consider that -- even after spending days banging your head against the wall -- your question might be poorly-written and require both benchmark data and an algorithm.

"advice about premature optimization... is a well assimilated folklore by now and I dont see how repeating that adds value". Without measurements, profiler results and benchmark data, this is our only possible response. After the profiler results are posted, this advice really is useless. Before profiler results are posted, this advice often turns out to be essential.

Hint 5. Whatever you might know is not well-assimilated folklore on Stack Overflow as a whole. We don't know you, sadly. We don't know how much you know. To avoid useless advice, provide evidence -- in the question -- that the advice has already been followed.
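For Python code, that evidence might be a profiler report. Here's a minimal sketch with the standard `cProfile` and `pstats` modules; `slow_total` is a hypothetical function standing in for the code under suspicion, and `profile_report` is a helper name I made up.

```python
import cProfile
import io
import pstats


def slow_total(n):
    """Hypothetical function under suspicion."""
    return sum(i * i for i in range(n))


def profile_report(func, *args, top=5):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Something like `print(profile_report(slow_total, 100000))` gives a short table that can go straight into the question.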

"better served if the discussion shifted to ... Pointed out possible bottlenecks ahead of time," Wouldn't that be nice? What's a "possible bottleneck"? It's a badly-designed algorithm. So, the responses to performance questions have to be focused on algorithm choice right away. That means details on the code being used, and profiling information.

Hint 6. There is no hint 6. This would simply repeat hints 3 and 5.

"regardless of the fact whether the code construct is actually a bottleneck in the application or not, it is always good to know what the more efficient alternatives are... there is something called intellectual curiosity."

Reducing a question to a hand-waving hypothetical doesn't improve the question. It doesn't rationalize a poor question. The question still needs to be clarified.

Hint 7. If the question raises a lot of comments and useless advice, please rewrite the question.