
Tuesday, November 22, 2022

Testing with PySpark

This isn't about details of pySpark. This is about the philosophy of testing when working with a large, complex framework, like pySpark, pandas, numpy, or whatever.

BLUF

Use data subsets.

Write unit tests for the functions that process the data.

Don't test pyspark itself. Test the code you write.

Some History

I've worked with folks -- data scientists specifically -- without a deep background in software engineering.

When we said their model-building applications needed a test case, they supplied the test case they used to validate the model.

Essentially, their test script ran the entire training set. Built the model. Did extensive statistical testing on the resulting decisions made by the model. The test case asserted that the stats were "good." In fact, they recapitulated the entire model review process that had gone on in the data science community to get the model from "someone's idea" to a "central piece of the business."

The test case ran for hours and required a huge server loaded up with GPUs. It cost a fortune to run. And. It tended to time out the deployment pipeline.

This isn't what we mean by "test." Our mistake.

We had to explain that a unit test demonstrates the code works. That was all. It shouldn't involve the full training set of data and the full training process with all the hyperparameter tuning and hours of compute time. We don't need to revalidate your model. We want to know the code won't crash. We'd like 100% code coverage. But the objective is little more than showing it won't crash when we deploy it.

It was difficult to talk them down from full training sets. They couldn't see the value in testing code in isolation. A phrase like "just enough data to prove the thing could plausibly work with real data" seemed to resonate.

A few folks complained that a numpy array with a few rows didn't really show very much. We had to explain (more than once) that we didn't really want to know all the algorithmic and performance nuances. We mostly wanted to know it wouldn't crash when we applied it to production data. We agreed with them the test case didn't show much. We weren't qualified to revalidate the model; we were only qualified to run their training process for them. If they had done enough work to be sure we *could* run it.

(It was a bank. Software deployments have rules. An AI model-building app is still an app. It still goes through the same CI/CD pipeline as demand deposit account software changes. It's a batch job, really, just a bit more internally sophisticated than the thing that clears checks.)

Some Structure

I lean toward the following tiers of testing:

Unit tests of every class and function. 100% code coverage here. I suggest using the pytest and pytest-cov packages to track testing and make sure every line of code has some test case. For a few particularly tricky things, covering every logic path is better than simply testing lines of code. In some cases, covering every line of code will tend to touch every logic path, and it seems less burdensome.
Use hypothesis for the more sensitive numeric functions. In “data wrangling” applications there may not be too many of these. In the machine learning and model application software, there may be more sophisticated math that benefits from property-based testing with hypothesis.
Write larger integration tests that mimic pyspark processing, using multiple functions or classes to be sure they work together correctly, but without the added complication of actually using pySpark. This means creating mocks for some of the libraries using unittest.mock objects. This is a fair bit of work, but it pays handsome dividends when debugging. For well-understood pyspark APIs, it should be easy to provide mocked results for the app components under test to use. For the less well-understood parts, the time spent building a mock will often provide useful insight into how (and why) it works the way it does. In rare cases, building the mock suggests a better design that's easier to test.
Finally. Write a few overall acceptance tests that use your modules and also start and run a small pyspark instance from the command line. For this, I really like using behave, and writing the acceptance testing cases using the Gherkin language. This enforces a very formal “Given-When-Then” structure on the test scenarios, and allows you to write in English. You can share the Gherkin with users and other stakeholders to be sure they agree on what the application should do.
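Here's a minimal sketch of the first tier, using a data subset. The function clean_amounts(), the field names, and the sample rows are all invented for illustration; they aren't from any real application. The point is only that three rows are enough to exercise every line.

```python
def clean_amounts(rows):
    """Drop rows without an amount and convert the rest to floats.

    Hypothetical data-wrangling function; ``rows`` is any iterable of dicts.
    """
    cleaned = []
    for row in rows:
        if row.get("amount") in (None, ""):
            continue  # skip rows with a missing or empty amount
        cleaned.append({**row, "amount": float(row["amount"])})
    return cleaned


def test_clean_amounts_with_small_subset():
    # Three rows touch every line and both branches of the function.
    subset = [
        {"id": 1, "amount": "10.50"},
        {"id": 2, "amount": ""},
        {"id": 3, "amount": None},
    ]
    assert clean_amounts(subset) == [{"id": 1, "amount": 10.5}]


def test_clean_amounts_with_no_rows():
    assert clean_amounts([]) == []
```

Running pytest with the pytest-cov plugin's --cov option should then report whether any line was missed.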

Why?

Each tier of testing builds up a larger, and more complete picture of the overall application.

More important, we don't emphasize running pySpark and testing it. It already works. It has its own tests. We need to test the stuff we wrote, not the framework.

We need to test our code in isolation.

We need to test integrated code with mocked pySpark.
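A rough sketch of what "mocked pySpark" can look like. The function count_positive_amounts() is hypothetical. The only assumption about pyspark is that a DataFrame offers filter() and count(); the Mock object stands in for it, so no Spark session is started.

```python
from unittest.mock import Mock


def count_positive_amounts(df):
    """Hypothetical application function.

    ``df`` is expected to behave like a pyspark DataFrame,
    which offers ``filter()`` and ``count()``.
    """
    return df.filter("amount > 0").count()


def test_count_positive_amounts():
    # The Mock stands in for the DataFrame; no Spark is involved.
    df = Mock(name="DataFrame")
    df.filter.return_value.count.return_value = 3

    assert count_positive_amounts(df) == 3
    # Confirm our code made the calls we think it makes.
    df.filter.assert_called_once_with("amount > 0")
    df.filter.return_value.count.assert_called_once_with()
```

The mock only records our assumptions about the DataFrame. Whether those assumptions are right is what the small acceptance tests against a real pyspark instance are for.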

Once we're sure our code is likely to work, the next step is confirmation that the important parts do work with pySpark. For life-critical applications, the integration tests will need to touch 100% of the logic paths. For data analytics, extensive integration testing is a lot of cost for relatively little benefit.

Even for data analytics, testing is a lot of work. The alternative is hope and prayer. I suggest starting with small unit tests, and expanding from there.

Tuesday, September 11, 2018

Code Review

I can't actually share all the code. So this feels incomplete. But I can share what I said about the code. Then you can look at your code and decide if you've got similar problems to fix.

My responses were these. I'll expand on them below.

This appears to be a single cell in a Jupyter notebook? Why isn’t it a script?
The code doesn’t look like any effort was made to follow any conventions. Use black. Or pylint. Make the code look conventional.
There don’t appear to be any docstring comments. That’s really a very bad practice.
The design appears untestable. That’s a very bad practice.
If this is an example of “production” code, I would suggest it needs a lot of rework.

Let's review these in a little more detail.

Number 1 was based on the file name being something_p36.ipynb.txt. The Jupyter notebookiness of the name is a problem. The _p36 is extra creepy, and indicates either a severe problem understanding how bash "shebang" comments work, or a blatant refusal to simply use Python3. It's hard to say what's going on, and I didn't even try to ask because... well... too many other things weren't clear.

Don't make up complex, weird naming rules. Use something.py. Simple. Flat. Pythonic.

Number 2 was based on things like this: def PrintParameters(pca): I hate to get super-pure PEP-8, but this kind of thing is simply hard to read. There were a LOT of other troubling aspects to the code. Once this is corrected, some of the other problems will go away, and we could move forward to more substantial issues.

Follow existing code styles. Find Python code. The standard library has a LOT of examples already part of your installation. Read it. Enjoy it. Mimic it.

Use pylint. Always.

Number 3 and Number 4 are consequences of the bulk of the code being a flat script with few class or function definitions. Actually, there was one of each. One class. One function. 240 or so lines of code. There was no separate __name__ == "__main__" section, so I was generally unhappy with the overall design.

Also. There's code like this

if True:

Yes. That's a real line of code. Sigh.

Here's an ancillary problem. If you need to write something like this, you're doing it wrong.

##########################
-- init Stuff
##########################

The code that follows one of these "big billboard comment" sections *must* be part of a function or class. It can't be left floating around with a billboard for demarcation. It should be refactored into a function (or method of a class), documented, and tested.
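Since I can't share the real code, here's a made-up sketch of the kind of refactoring I mean. The names init_parameters() and main(), and the parameters themselves, are invented; the original "init Stuff" section did something else entirely.

```python
def init_parameters(components=2, tolerance=1e-6):
    """Validate and return the run parameters as a dict.

    This replaces module-level code that sat under an "init Stuff"
    billboard comment. The parameter names are hypothetical.
    """
    if components < 1:
        raise ValueError("components must be at least 1")
    return {"components": components, "tolerance": tolerance}


def main():
    """Run the script; kept separate so a test can import this module."""
    params = init_parameters()
    print(params)


if __name__ == "__main__":
    main()
```

Now a test can import the module and call init_parameters() directly. The docstring does the job the billboard was trying to do.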

Did I mention tested?

It's untestable as written. Sigh.

Number 5 may be a misunderstanding on my part. The email had this: "They have produced production code that mathematically optimizes stuff for [redacted]. So, they are heads up type of people."

I'm guessing this is relevant because the team has some "production" code in Python and consider themselves knowledgeable. Otherwise, this is noise, and I should have ignored it.

I'm hopeful they'll use black, make the code minimally readable, and we can move on to substantial issues regarding design for testability and overall possible correctness issues.

It wasn't the worst code I've seen. But. It shows a lot of room for growth and improvement.

Tuesday, August 4, 2015

Mocking and Unit Testing and Test-Driven Development

Mocking is essential to unit testing.

However.

It's also annoyingly difficult to get right.

If we aren't 100% perfectly clear on what we're mocking, we will merely canonize any dumb assumptions into mock objects that don't really work. They work in the sense that they don't crash, but they don't properly test the application objects since they repeat some (bad) assumptions.

When there are doubts, it seems like we have to proceed cautiously. And act like we're breaking some of the test-first test-driven-development rules.

Note. We're not really breaking the rules. Some folks, however, will argue that test-driven development means literally every action you take should be driven by tests. Does this include morning coffee or rotating your monitor into portrait mode? Clearly not. What about technical spikes?

Our position is this.

Set a spike early and often.
Once you have reason to believe that this crazy thing might work, you can formalize the spike with tests. And mock objects.
Now you can write the rest of the app by creating tests and fitting code around those tests.

The important part here is not to create mocks until you really understand what you're doing.

Book Examples

Now comes the tricky part: Writing a book.

Clearly every example must have a unit test of some kind. I use doctest heavily for this. Each example is in a doctest test string.

The code for a chapter might look like this.


test_hello_world = '''
>>> print( 'hello world')
hello world
'''

__test__ = { n:v for n,v in vars().items() 
    if n.startswith('test_') }

if __name__ == '__main__':
    import doctest
    doctest.testmod()

We've used the doctest feature that looks for a dictionary assigned to a variable named __test__. The values from this dictionary are tests that get run as if they were docstrings found inside modules, functions, or classes.

This is delightfully simple. Expostulate. Exemplify. Copy and Paste the example into a script for test purposes and Exhibit in the text.

Until we get to external services. And RESTful API requests, and the like. These are right awkward to mock. Mostly because a mocked unittest is singularly uninformative.

Let's say we're writing about making a RESTful API request to http://www.data.gov. The results of the request are very interesting. The mechanics of making the request are an important example of how REST API's work. And how CKAN-powered web sites work in general.

But if we replace urllib.request with a mock urllib, the unit test amounts to a check that we called urlopen() with the proper parameters. Important for a lot of practical software development, but also uninformative for folks who download the code associated with the book.
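To make the complaint concrete, here's a sketch of such a test. The helper get_package_list(), the example.invalid host, and the canned response are assumptions made up for this illustration; the URL path is only meant to resemble a CKAN-style action request.

```python
import json
import urllib.request
from unittest.mock import MagicMock, patch


def get_package_list(base_url):
    """Hypothetical helper: fetch a CKAN-style package list as a dict."""
    url = base_url + "/api/3/action/package_list"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def test_get_package_list():
    # A canned response replaces the interesting part: the real data.
    fake_response = MagicMock()
    fake_response.__enter__.return_value = fake_response
    fake_response.read.return_value = b'{"result": ["a", "b"]}'

    with patch("urllib.request.urlopen", return_value=fake_response) as mock_urlopen:
        document = get_package_list("http://example.invalid")

    mock_urlopen.assert_called_once_with(
        "http://example.invalid/api/3/action/package_list"
    )
    assert document == {"result": ["a", "b"]}
```

The test passes everywhere, including behind a firewall. It also tells the reader nothing about what the real service returns.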

It appears that I have four options:

Grin and bear it. Not all examples have to be wonderfully detailed.
Stick with the spike version. Don't mock things. The results may vary and some of the tests might fail on the editor's desktop.
Skip the test.
Write multiple versions of the test: a "with real internet" version and a "with corporate firewall proxy blockers in place" version that uses mocks and works everywhere.

So far, I've leveraged the first three heavily. The fourth is awkward. We wind up with code like this:


import unittest
from unittest.mock import MagicMock, patch

class Test_get_whois(unittest.TestCase):
    def test_should_get_subprocess(self):
        subprocess = MagicMock()
        subprocess.check_output.return_value = b'\nwords\n'
        with patch.dict('sys.modules', subprocess=subprocess):
            import subprocess
            from ch_2_ex_4 import get_whois
            result = get_whois('1.2.3.4')
        self.assertEqual(result, ['', 'words'])
        subprocess.check_output.assert_called_with(['whois', '1.2.3.4'])

This is not a lot of code for enterprise software development purposes. It's a bit weak, in fact, since it only tests the Happy Path.

But for a book example, it seems to be heavy on the mock module and light on the subject of interest.
Indeed, I defy anyone to figure out what the expository value of this is, since it has only 2 lines of relevant code wrapped in 8 lines of boilerplate required to mock a module successfully.

I'm not unhappy with the unittest.mock module in any way. It's great for mocking modules; I think the boilerplate is acceptable considering what kind of a wrenching change we're making to the runtime environment for the unit under test.

This fails at explication.

I'm waffling over how to handle some of these more complex test cases. In the past, I've skipped cases, and used the doctest Ellipsis feature to work through variant outputs. I think I'll continue to do that, since the mocking code seems to be less helpful for the readers, and too focused on the purely technical need of proving that all the code is perfectly correct.
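For reference, a minimal sketch of the Ellipsis approach. The function fetch_status() is a made-up stand-in for anything whose output changes from run to run; the ELLIPSIS directive lets "..." in the expected output match the part that varies.

```python
import time


def fetch_status():
    """Hypothetical stand-in for a call whose output varies between runs.

    >>> fetch_status()  # doctest: +ELLIPSIS
    'fetched at ...'
    """
    return "fetched at " + time.ctime()
```

Running python -m doctest -v on a file containing this should show the example passing, whatever the clock says.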

Thursday, September 4, 2014

API Testing: Quick, Dirty, and Automated

When writing RESTful API's, the process of testing can be simple or kind of hideous.

The Postman REST Client is pretty popular for testing an API. There are others, I'm sure, but I'm working with folks who like Postman.

Postman has some automation capabilities. Some.

However. (And this is important.)

It doesn't provide much help in framing up a valid complex JSON message.

When dealing with larger and more complex API's with larger and more complex nested and repeating structures, considerably more help is required to frame up a valid request and do some rational evaluation of the response.

Enter Python, httplib and json. While Python3 is universally better, these libraries haven't changed much since Python2 (although httplib is named http.client in Python3), so either version will work.

The idea is simple.

Create templates for the eventual class definitions in Python. This can make it easy to build the JSON structures. It can save a lot of hoping that the JSON content is right. It can save time in "exploratory" testing when the JSON structures are wrong.
Build complex messages using the template class definitions.
Send the message with httplib. Read the response.
Evaluate the responses with a simple script.
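The send-and-evaluate steps above can be sketched roughly like this. It's written for Python 3, where the connection would be an http.client.HTTPConnection (httplib.HTTPConnection in Python 2). The function names, and the "status" field checked in evaluate(), are assumptions for illustration, not part of any real API.

```python
import json


def post_json(connection, path, document):
    """Send ``document`` as a JSON POST; return (status, decoded body).

    ``connection`` is expected to be an http.client.HTTPConnection,
    or anything with the same request()/getresponse() methods.
    """
    body = json.dumps(document)
    connection.request(
        "POST", path, body=body, headers={"Content-Type": "application/json"}
    )
    response = connection.getresponse()
    payload = json.loads(response.read().decode("utf-8"))
    return response.status, payload


def evaluate(status, payload):
    """Pass-fail check of a response; the expected field is hypothetical."""
    return status == 200 and payload.get("status") == "ok"
```

Because post_json() accepts the connection as a parameter, the same function works against a real server in an exploratory session and against a stand-in object in an automated test.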

Some test scripting is possible in Postman. Some. In Python, you've got a complete programming language. The "some" qualifier evaporates.

When it comes to things like seeding database data, Python (via appropriate database drivers) can seed integration test databases, also.

Further, you can use the Python unittest framework to write elegant automated script libraries and run the entire thing from the command line in a simple, repeatable way.

What's important is that the template class definitions aren't working code. They won't evolve into working code. They're placeholders so that we can work out API concepts quickly and develop relatively complete and accurate pictures of what the RESTful interface will look like.

I had to dig out my copy of https://www.packtpub.com/application-development/mastering-object-oriented-python to work out the metaclass trickery required.

The Model and Meta-Model Classes

The essential ingredient is a model class that we can use to build objects. The objective is not a complete model of anything. The objective is just enough model to build a complex object.
Our use case looks like this.


>>> class P(Model):
...    attr1= String()
...    attr2= Array()
...
>>> class Q(Model):
...    attr3= String()
...
>>> example= P( attr1="this", attr2=[Q(attr3="that")] )

Our goal is to trivially build more complex JSON documents for use in API testing. Clearly, the class definitions are too skinny to have much real meaning. They're handy ways to define a data structure that provides a minimal level of validation and the possibility of providing default values.

Given this goal, we need a model class and descriptor definitions. In addition to the model class, we'll also need a metaclass that will help build the required objects. One feature that we really like is keeping the class-level attributes in order. Something Python doesn't do automatically. But something we can finesse through a metaclass and a class-level sequence number in the descriptors.

Here's the metaclass to clean up the class __dict__. This is the Python2.7 version because that's what we're using.


class Meta(type):
    """Metaclass to set the ``name`` attribute of each Attr instance and provide
    the ``_attr_order`` sequence that defines the original order.
    """
    def __new__( cls, name, bases, dict ):
        attr_list = sorted( (a_name
            for a_name in dict
            if isinstance(dict[a_name], Attr)), key=lambda x:dict[x].seq )
        for a_name in attr_list:
            setattr( dict[a_name], 'name', a_name )
        dict['_attr_order']= attr_list
        return super(Meta, cls).__new__( cls, name, bases, dict )

from collections import OrderedDict

class Model(object):
    """Superclass for all model class definitions;
    includes the metaclass to tweak subclass definitions.
    This also provides a ``to_dict()`` method used for
    JSON encoding of the defined attributes.

    The __init__() method validates each keyword argument to
    assure that they match the defined attributes only.
    """
    __metaclass__= Meta
    def __init__( self, **kw ):
        for name, value in kw.items():
            if name not in self._attr_order:
                raise AttributeError( "{0} unknown".format(name) )
            setattr( self, name, value )
    def to_dict( self ):
        od= OrderedDict()
        for name in self._attr_order:
            od[name]= getattr(self, name)
        return od

The __new__() method assures that we have an additional _attr_order attribute added to each class definition. The __init__() method allows us to build an instance of a class with keyword parameters that have a minimal sanity check imposed on them. The to_dict() method is used to convert the object prior to making a JSON representation.

Here is the superclass definition of an Attribute. We'll extend this with other attribute specializations.


class Attr(object):
    """A superclass for Attributes; supports a minimal
    feature set. Attribute ordering is maintained via
    a class-level counter.

    Attribute names are bound later via a metaclass
    process that provides names for each attribute.

    Attributes can have a default value if they are
    omitted.
    """
    attr_seq= 0
    default= None
    def __init__( self, *args ):
        self.seq= Attr.attr_seq
        Attr.attr_seq += 1
        self.name= None # Will be assigned by metaclass ``Meta``
    def __get__( self, instance, something ):
        return instance.__dict__.get(self.name, self.default)
    def __set__( self, instance, value ):
        instance.__dict__[self.name]= value
    def __delete__( self, *args ):
        pass

We've done the minimum to implement a data descriptor. We've also included a class-level sequence number which assures that descriptors can be put into order inside a class definition.

We can then extend this superclass to provide different kinds of attributes. There are a few types which can help us formulate messages properly.


class String(Attr):
    default= ""

class Array(Attr):
    default= []

class Number(Attr):
    default= None

The final ingredient is a JSON encoder that can handle these class definitions. The idea is that we're not asking for much from our encoder. Just a smooth way to transform these classes into the required dict objects.


import json

class ModelEncoder(json.JSONEncoder):
    """Extend the JSON Encoder to support our Model/Attr
    structure.
    """
    def default( self, obj ):
        if isinstance(obj,Model):
            return obj.to_dict()
        return super(ModelEncoder, self).default(obj)

encoder= ModelEncoder(indent=2)

The Test Cases

Here is an all-important unit test case. This shows how we can define very simple classes and create an object from those class definitions.


>>> class P(Model):
...    attr1= String()
...    attr2= Array()
...
>>> class Q(Model):
...    attr3= String()
...
>>> example= P( attr1="this", attr2=[Q(attr3="that")] )
>>> print( encoder.encode( example ) )
{
  "attr1": "this", 
  "attr2": [
    {
      "attr3": "that"
    }
  ]
}

Given two simple class structures, we can get a JSON message which we can use for unit testing. We can use httplib to send this to the server and examine the results.
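For anyone on Python 3, here is a rough port of the pieces above collected into one runnable sketch. This is my reading of how the 2.7 code translates, not a tested drop-in replacement: the metaclass is named with the metaclass= keyword, a plain dict stands in for OrderedDict, and Attr is defined first because the metaclass refers to it.

```python
import json


class Attr:
    """Data descriptor; a class-level counter records definition order."""
    attr_seq = 0
    default = None

    def __init__(self):
        self.seq = Attr.attr_seq
        Attr.attr_seq += 1
        self.name = None  # assigned by the metaclass ``Meta``

    def __get__(self, instance, owner):
        if instance is None:
            return self  # accessed on the class, not an instance
        return instance.__dict__.get(self.name, self.default)

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


class String(Attr):
    default = ""


class Array(Attr):
    default = []  # shared list, as in the original; treat it as read-only


class Meta(type):
    """Name each Attr and record ``_attr_order`` for the class."""
    def __new__(mcs, name, bases, namespace):
        attr_list = sorted(
            (a_name for a_name, value in namespace.items()
             if isinstance(value, Attr)),
            key=lambda a_name: namespace[a_name].seq,
        )
        for a_name in attr_list:
            namespace[a_name].name = a_name
        namespace["_attr_order"] = attr_list
        return super().__new__(mcs, name, bases, namespace)


class Model(metaclass=Meta):
    def __init__(self, **kw):
        for name, value in kw.items():
            if name not in self._attr_order:
                raise AttributeError("{0} unknown".format(name))
            setattr(self, name, value)

    def to_dict(self):
        # Plain dicts keep insertion order in current Python 3.
        return {name: getattr(self, name) for name in self._attr_order}


class ModelEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Model):
            return obj.to_dict()
        return super().default(obj)


encoder = ModelEncoder(indent=2)


class Q(Model):
    attr3 = String()


class P(Model):
    attr1 = String()
    attr2 = Array()


example = P(attr1="this", attr2=[Q(attr3="that")])
text = encoder.encode(example)
print(text)
```

Since Python 3.6, the class body namespace preserves definition order, so the sequence counter is arguably redundant there. I've kept it to stay close to the original design.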

Thursday, June 12, 2014

TDD, API Design and Refactoring

See this short discussion on a Stingray Reader feature:
https://sourceforge.net/p/stingrayreader/discussion/COBOL/thread/d2132851/?limit=25#2a3a

This turned into an exercise in pure TDD.

<rant>
I'm not a fan of applying TDD in a strict, death-march fashion.

I see the comments on Stack Overflow that indicate that some folks feel strongly that strict TDD is somehow helpful. While "test before code" is laudable and often helpful, there's no royal road to good software.

Design involves a great deal of back and forth between code and test. A great deal.

It's logically impossible to write a test without having thought about the code. In order to write the test first, there must be a notional API against which the test is written. Anyone who requires that the test file must be written before the notional class or module is just playing at petty tyranny.

The notional design -- the rough outline of the class or module -- can be written into a file before any tests. It's okay. It is still test-driven because the considerations of testability drove the design process.

In particular, when starting "from scratch" -- with nothing -- writing tests first is senseless. Some module or package structure must exist for the test modules to import.

</rant>
Having ranted, it still arises that the tests do come before any code under some circumstances.

In this case, the requested functionality was quite difficult to visualize. However, it was possible to cobble together a test case that simplified the problem down to something like this:

01 Some-Record.
     05 Header PIC XXX.
     05 Body PIC X(17).

01 ABC-Segment.
     05 Field-ABC PIC X(17).

01 DEF-Segment.
     05 Field-DEF PIC X(17).

In COBOL, the program would use logic like IF Header EQUALS "ABC" THEN MOVE Body TO ABC-Segment. We need a way to handle something like this in Python so that we can parse the EBCDIC COBOL data.
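To show the shape of the problem in Python terms, here's a bare-bones sketch. This is not the Stingray Reader API; parse_record() and SEGMENT_FIELDS are invented for illustration. It assumes a 20-byte record (3-byte header, 17-byte body) encoded with the cp037 EBCDIC codec from the standard library.

```python
# Map each header value to the field that redefines the body.
SEGMENT_FIELDS = {"ABC": "Field-ABC", "DEF": "Field-DEF"}


def parse_record(record):
    """Decode one 20-byte EBCDIC record into a dict.

    The header picks which segment layout applies to the body,
    mirroring IF Header EQUALS "ABC" THEN MOVE Body TO ABC-Segment.
    """
    header = record[0:3].decode("cp037")
    body = record[3:20].decode("cp037")
    try:
        field_name = SEGMENT_FIELDS[header]
    except KeyError:
        raise ValueError("unknown segment {0!r}".format(header))
    return {"Header": header, field_name: body}
```

For example, a record built from ("ABC" + "hello".ljust(17)).encode("cp037") comes back with a "Field-ABC" key holding the 17-character body.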

This summarized example allowed construction of a test case that made use of an API that might have existed. I was pretty sure I had a test case that showed an approach.

What Actually Happened

Since the application already had 178 unit tests, there was plenty of structure that worked.

The single new unit test relied on a notional API that wasn't really in place. The new test bombed grotesquely.

There are two solutions:

Modify the test.
Fix the notional API so that it works properly.

I started out chasing the second option. I tweaked some things. More tests failed. I tweaked some more things. The new test finally passed, but another test was failing.

Some careful study of the failing test revealed that my approach was wrong. Way wrong.

The notional API was a bad idea.

The tweaks to make it work were a worse idea.

Back to the Lab Bench

At this point, I had made enough changes that the only thing to do was copy the new test and use Git to revert the local changes to unwind the awful mistakes.

Starting again, I had a slightly better grip on the relevant code. I had a failing test. I tried a different approach that wasn't quite so inventive. This meant modifying the test.

I actually went through a few iterations of the test, using the test method as a kind of lab bench.

A more Pythonic approach to the lab bench is to work from the >>> prompt. I think that all of the exemplary projects use the >>> prompt examples in their documentation. This is a way to narrow and clarify the API. As projects get big, they can sprawl. New features can wind up with many imports to pick and choose elements from existing modules.

When it becomes difficult to use the >>> prompt as the lab bench, that's a sign that the API is too complex. Refactoring must happen.

Using the unit test framework as the lab bench was a hint that something had drifted out of tolerance.

However. I did get a test which passed. Yay. Sort of.

The test code was hideous.

TDD and API Design

The point of TDD, however, is that we have a working suite of tests. Refactoring won't break anything.

The point was that the hideous API could be rewritten into something that both

Passed all the tests, and
Was usable at the >>> prompt.

It's difficult to express how valuable the Python >>> prompt is to help clarify API design issues.

The rule is this:

If the API doesn't make sense at the >>> prompt, it's incomprehensible

Sadly, Java doesn't have this kind of boundary. Java programming can spin into quite complex API's, limited only by the laziness of the programmer who avoids refactoring.

Or the malice of the programmer's manager in not allowing time to refactor.

Thursday, February 20, 2014

Third Time's the Charm: the version 3.0 phenomenon

Somewhere, I have a vague recollection of reading advice from someone (Bill Gates?) that it takes three versions to get things right. The context may have been a justification of the wild success of Windows 3.0.

Or, I could be just making it up.

But one thing I have noticed is that there's a definite bias toward looking at software three times.

I worked (briefly) with an agile project management group that suggested that everything will be released three times, called the "Good", "Better", "Best" releases.

The good release passed the unit tests.
The better release included any non-functional (performance, auditability, maintainability, etc.) improvements required.
The best release was the best implementation possible.

Not everything required three releases. Simpler components can merge better and best. Some components simply start out in really, really good shape.

Teaching Moment

What I've also noticed is that the explanation of the component -- writing documentation, presenting to peers in a walkthrough -- leads to profound rethinking.

Many things may appear to be better or best in the sense above. Until we have to explain them. Then they're no longer "best" but merely "better" or perhaps even "good."

A few minutes spent hand-waving through a design often points to things that aren't quite so easy to explain. A walkthrough is very beneficial to the person doing the presentation.

But, not too early.

When I made military software, we had Preliminary Design Reviews that were done before coding began. The idea was to surround the difficult coding work with yet more process steps and yet more deliverable intermediate results.

The intent was noble: if a walkthrough reveals so much, then do the walkthroughs early and often.

However. I'm beginning to think that early isn't ideal.

I think that the design walkthrough should be delayed until after minimally working code exists. Once there's code -- with automated unit tests -- then refactoring to meet non-functional quality factors (like performance) is easier and more likely to be successful.

Also, refactoring to make the software clear, simple, and elegant should probably wait until it works and has a complete suite of automated unit tests.

Thursday, January 23, 2014

Manual Testing -- Bad Idea

The question of testing came up recently. The description of the process sounded like manually "testing" some complex web application.

When trying to work out manual "testing", I find it necessary to use scare quotes. I'm not sure there's a place for "manual testing" of any software.

I know that some folks use Selenium to create automated test scripts for interactive applications. That may be a helpful technology. I prefer automated test scripts over manual testing. Consequently, I'm not too interested in helping out with testing -- other than perhaps coaching developers to write automated test scripts.

http://docs.seleniumhq.org

To continue this rant.

I've seen the suggestion that having a person do some manual "testing" will permit them to notice things that are "broken" but not a formal requirement in a test script. This seems to require some willing misuse of words. A person who's supposed to be noticing stuff isn't testing: they're exploring or demonstrating or thinking. They're not testing. Tests are -- by definition -- pass-fail. This is a very narrow definition: if there's no failure mode, it's not a test, it's something else. Lots of words are available, use any word except "test." Please.

Reading about exploratory "testing" leads to profound questions about the nature of failing an exploratory "test." When did the failure mode become a requirement? During the exploration? Not prior to the actual development?

When an explorer finds a use case that was never previously part of a user story, then it's really an update to a user story. It's a new requirement; a requirement defined by a test which fails. It's a really high-quality requirement. More importantly, exploratory "testing" is clearly design. It's product ownership.

This kind of exploration/thinking/playing/experiencing is valuable stuff. It needs to be done. But it's not testing.

Developers create the test scripts: unit tests of various kinds. Back-end tests. Front-end tests. Lots of testing. All automated. All.

Other experienced people -- e.g., a product owner -- can also play with the released software and create informed, insightful user stories and user story modifications that may lead to revisions to test cases. They're not testing. They're exploring. They're writing new requirements, updating user stories, and putting work into the backlog.

Putting work into the backlog

An exploratory "test" should not be allowed to gum up a release. To do that breaks the essential work cycle of picking a story with fixed boundaries and getting it to work. Or picking a story with nebulous boundaries and grooming it to have fixed boundaries. Once you think you're doing exploratory "testing" on a release that's in progress, then the user stories no longer have a fixed boundary, and the idea of a fixed release cycle is damaged. It becomes impossible to make predictions, since the stories are no longer fixed.

For a startup development effort, the automated test scripts will grow in complexity very quickly. In many cases, the test scripts will grow considerably faster than the product code. This is good.

It's perfectly normal for a product owner to find behaviors that aren't being tested properly by the initial set of automated test scripts. This is good, too. As the product matures, the test scripts expand. The product owner should have increasing difficulty locating features which are untested.

Management Support

What I've found is that some developers object to writing test scripts. One possible reason is that the test scripts don't seem to be as much "fun" as playing with GUI development tools.

I think the more important reason is that developers in larger organizations are not rewarded for software which is complete, but are rewarded for new features no matter what level of quality they achieve. This seems to happen when software development is mismanaged using faulty schedules and a faulty idea of the rate of delivery of working software.

If the schedule -- not working features -- dominates management thinking, then time spent writing tests to show precisely how well a feature works is treated as waste. Managers will ask if a developer is just "gold plating" or "polishing a BB" or some other way of discrediting automated test case development.

If the features dominate the discussion, then test development should be the management focus. A new feature without a sufficiently robust suite of automated tests is just a technology spike, not something which can be released.

Manual "testing" and exploratory "testing" seem to allow managers to claim that they're testing without actually automating the tests. It appears that some managers feel that reproducible, automated tests take longer and cost more than having someone play with the release to see if it appears to work.

But What About...

The most common complaint about automated GUI testing isn't a proper pass-fail test issue at all.

Folks will insist that somehow font choice, color, position or other net effects of CSS properties must be "tested." Generally, they seem to be conflating two related (but different) things.

1. Design. This is the position/color/font issue. These are design features of a GUI page or JavaScript window or HTML document. Design. The design may need to be reviewed by a person. But no testing makes sense here. The design isn't a "pass-fail" item. Someone may say it's ugly, but that's a far cry from not working. CSS design (especially for people like me who don't really understand it) sure feels like hacking out code. That doesn't mean the design gets tested.

2. Implementation. This is the "does every element use the correct CSS class or id?" question. This is automated testing. And it has nothing to do with looking at a page. It has everything to do with an automated test to be sure HTML tags are generated properly. It has nothing to do with the choice of packing algorithm in a widget, and everything to do with elements simply making the correct API calls to assure that they're properly packed.
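
A minimal sketch of the implementation-side check, using only the Python standard library. The `render_save_button()` function and the `btn-primary` class name are hypothetical stand-ins for whatever generates the HTML in a real application. The assertion looks at the generated tags, not at pixels.

```python
import unittest
from html.parser import HTMLParser


def render_save_button():
    """Hypothetical stand-in for the application code that generates HTML."""
    return '<form id="edit"><button class="btn btn-primary" type="submit">Save</button></form>'


class ClassCollector(HTMLParser):
    """Collects the CSS classes seen on each kind of tag."""

    def __init__(self):
        super().__init__()
        self.classes = {}  # tag name -> set of class names

    def handle_starttag(self, tag, attrs):
        names = dict(attrs).get("class") or ""
        self.classes.setdefault(tag, set()).update(names.split())


class TestSaveButton(unittest.TestCase):
    def test_button_uses_design_class(self):
        collector = ClassCollector()
        collector.feed(render_save_button())
        # Objective, pass-fail: the element carries the class the design calls for.
        self.assertIn("btn-primary", collector.classes["button"])
```

Whether `btn-primary` looks good is still a design review question. The test only confirms the element asks for it.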

For people like me who don't fully get CSS, lots of pages need to be reviewed to make sure the CSS "worked". But that's a design review. It's not a part of automated testing.

Here's the rule: Ugly and Not Working have nothing to do with each other. You can automate tests for "works" -- that's objective. You can't automate the test for "ugly" -- that's subjective.

Here's how some developers get confused. A bug report that amounts to "ugly" is fixed by making a change to a GUI element. This is a valid kind of bug-to-change. But how can the change have an automated test? You must have a person confirm that the GUI is no longer ugly. Right?

Wrong.

The confusion stems from conflating design (change to reduce the ugliness) and implementation (some API change to the offending element.) The design change isn't subject to automated testing. Indeed, it passed the unit tests before and after because it worked.

No design can have automated testing. We don't test algorithm design, either. We test algorithm implementations.

Compare it with class design vs. implementation. We don't check every possible aspect of a class to be sure it follows a design. We check some external-facing features. We don't retest the entire library, compiler, OS and toolset, do we? We presume that design is implemented more-or-less properly, and seek to confirm that the edges and corners work.

Compare it with database design vs. implementation. We don't check every bit on the database. We check that -- across the API -- the application uses the database properly.

There's no reason to test every pixel of an implementation if the design was reviewed and approved and the GUI elements use the design properly.

Tuesday, January 14, 2014

Explaining an Application

Some years ago--never mind how long precisely--having little or no money in my purse... I had a great chance to do some Test-Driven Reverse Engineering on a rather complex C program. I extracted test cases. I worked with the users to gather test cases. And I rewrote their legacy app using Test-Driven Development. The legacy C code was more a hint than anything else.

I thought it went well. We uncovered some issues in the test cases. Uncovered a known issue in the legacy program. And added new features. All very nice. A solid success.

Years later, a developer from the organization had to make some more changes.

The client calls.

"No problem," I assure them, "I'm happy to answer any questions. With one provision. Questions have to be about specific code. I can't do 'overview' questions. Email me the code snippet and the question."

I never heard another word. No question of any kind. Not a general question (that I find difficult to answer,) nor a specific question.

Why the provision?

I find it very hard to talk with someone who hasn't actually read the code yet. I have done far too many presentations to people who are sitting around a conference room table, nodding and looking at PowerPoint slides.

I know the initial phone call focused on "an overview." But what counts as an overview? Use cases? Data model? Architectural layers? Test cases? Rather than waste time explaining something irrelevant, I figured if they asked anything -- anything at all -- I could focus on what they really wanted to know.

I know that I have never been able to understand people hand-waving at a picture of code. I have to actually read the code to see what the modules, classes and functions are and how they seem to work. I'm suspicious of graphics and diagrams. I know that I can't read the code while someone is talking. If they insist on talking, I need to read the code in advance.

Perhaps I'm imposing too much on this customer. But. They're going to maintain the code -- that seems to mean they need to understand it. And they need to understand it their own way, without my babbling randomly about the bits that interested me. Maybe the part I found confusing is obvious to them, and doesn't bear repeating.

Perhaps raising the bar to "specific questions about specific code" forced them to read enough. Perhaps after some reading, they realized they didn't need to pay me to explain things. I certainly can't brag that the code explained itself.

Or. Perhaps they realized how the unit tests worked and realized that TestCases provide a roadmap of the API.

Thursday, June 20, 2013

Automated Code Modernization: Don't Pave the Cowpaths

After talking about some experience with legacy modernization (or migration), I received information from Blue Phoenix about their approach to modernization.

Before talking about modernization, it's important to think about the following issue from two points of view.

Modernization can amount to nothing more than Paving the Cowpaths.

From a user viewpoint, "paving the cowpaths" means that the legacy usability issues have now been modernized without being fixed. The issues remain. A dumb business process is now implemented in a modern programming language. It's still a dumb business process. The modernization was strictly technical with no user-focused "value-add".

From a technical viewpoint, "paving the cowpaths" means that bad legacy design, bad legacy implementation and legacy platform quirks have now been modernized. A poorly-designed application in a legacy language has been modernized into a poorly-designed application in yet another language. Because of language differences, it may go from poorly-designed to really-poorly-designed.

The real underlying issue is how to avoid low-value modernization. How to avoid merely converting bad design and bad UX from one language to another.

Consider that it's possible to actually reduce the value of a legacy application through poorly-planned modernization. Converting quirks and bad design from one language to another will not magically make a legacy application "better". Converting quirky code to Java will merely canonize the quirks, obscuring the essential business value that was also encoded in the quirky legacy code.

Focus on Value

The fundamental modernization question is "Where's the Value?" Or, more specifically, "What part of this legacy is worth preserving?"

In some cases, it's not even completely clear what the legacy software really is. Old COBOL mainframe systems may contain hundreds (or thousands) of application programs, each of which does some very small thing.

While "Focus on Value" is essential, it's not clear how one achieves this. Here's a process I've used.

Step 1. Create a code and data inventory.

This is essential for determining what parts of the legacy system have value. Blue Phoenix has "Legacy Indexing" for determining the current state of the application portfolio. Bravo. This is important.

I've done this analysis with Python. It's not difficult. Many organizations can provide a ZIP file with all of the legacy source and all of the legacy JCL (z/OS shell scripts). A few days of scanning can produce inventory summaries showing programs, files, inputs and outputs.
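
Here's a rough sketch of what that kind of scan can look like. It is not the tool I used; it assumes the ZIP holds JCL as plain text and only recognizes the simplest `EXEC PGM=` and `DD DSN=` statements. The function name `inventory` and the regular expressions are illustrative assumptions.

```python
import re
import zipfile
from collections import defaultdict

# Simplified patterns; real JCL has continuation lines, PROCs and symbolic
# parameters that this sketch ignores.
EXEC_PGM = re.compile(r"^//\S*\s+EXEC\s+PGM=([A-Z0-9$#@]+)")
DD_DSN = re.compile(r"^//\S+\s+DD\s+.*?\bDSN=([A-Z0-9$#@.]+)")


def inventory(zip_file):
    """Map each program name to the set of dataset names its job steps reference."""
    uses = defaultdict(set)
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            program = None
            text = archive.read(name).decode("ascii", errors="replace")
            for line in text.splitlines():
                if exec_match := EXEC_PGM.match(line):
                    program = exec_match.group(1)
                    uses[program]  # record the program even if it has no DD statements
                elif program and (dd_match := DD_DSN.match(line)):
                    uses[program].add(dd_match.group(1))
    return dict(uses)
```

Calling `inventory("legacy.zip")` (a hypothetical file name) returns a dictionary that can be summarized into the program-to-file lists described above.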

A suite of tools would probably be simpler than writing a JCL parser in Python.

A large commercial operation will have all kinds of source checked into the repository. Some will be inexplicable. Some will have never been used. In some cases, there will be executable code that was not actually built from the source in the master source repository.

A recreational project (like HamCalc) reveals the same patterns of confusion as large multi-million dollar efforts. There are mystery programs which are probably never used; the code is available, but they don't appear in shell scripts or interactive menus. There are programs which have clear bugs and (apparently) never worked. There are programs with quirks; programs that work because of an undocumented "feature" of the language or platform.

Step 2. Capture the Data.

In most cases, the data is central: the legacy files or databases need to be preserved. The application code is often secondary. In most cases, the application code is almost worthless, and only the data matters. The application programs serve only as a definition of how to interpret and decode the data.

Blue Phoenix has Transition Bridge Services. Bravo. You'll be moving data from legacy to new (and the reverse, also.) We'll return to this "Build Bridges" below.

Regarding the data vs. application programming distinction, I need to repeat my observation: Legacy Code Is Largely Worthless. Some folks are married to legacy application code. The legacy code does stuff to the legacy files. It must be important, right?

"That's simple logic, you idiot," they say to me. "It's only logical that we need to preserve all the code to process all the data."

That's actually false. It's not simple logic. It's just wishful thinking.

When you actually read legacy code, you find that a significant fraction (something like 30%) is trivial recapitulation of SQL's "set" operations: SQL DML statements have an implied loop that operates on a set of data. Large amounts of legacy code merely recapitulate the implied loop. This is trivially true of legacy SQL applications with embedded SQL; explicit FETCH loops are very wordy. There's no sense in preserving this overhead if it can be avoided.
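
To make the "implied loop" concrete, here's a small sketch using SQLite. The `account` table and both function names are made up for the illustration. The first function is the wordy fetch-and-update loop; the second is the single DML statement that does the same thing.

```python
import sqlite3


def make_accounts():
    """Build a small in-memory table; the schema is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account(id INTEGER PRIMARY KEY, balance REAL, status TEXT)")
    conn.executemany(
        "INSERT INTO account(id, balance, status) VALUES (?, ?, ?)",
        [(1, 50.0, "OPEN"), (2, -10.0, "OPEN"), (3, -25.0, "OPEN")],
    )
    return conn


def flag_overdrawn_with_loop(conn):
    """Legacy style: an explicit FETCH loop that visits each row in turn."""
    rows = conn.execute("SELECT id, balance FROM account").fetchall()
    for account_id, balance in rows:
        if balance < 0:
            conn.execute(
                "UPDATE account SET status = 'OVERDRAWN' WHERE id = ?", (account_id,)
            )


def flag_overdrawn_with_set(conn):
    """Set style: the loop is implied by the DML statement."""
    conn.execute("UPDATE account SET status = 'OVERDRAWN' WHERE balance < 0")
```

Both functions leave the table in the same state. Only the `WHERE balance < 0` condition carries any business meaning.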

Programs which work with flat files always have long stretches of code that models SQL loops or Map-Reduce loops. There's no value in the loop management parts of these programs.

Another significant fraction is "utility" code that is not application-specific in any way. It's an application program that merely does a "CREATE TABLE XYZ(...) AS SELECT ....": a single line of SQL. There's no sense in preserving this through an "automated" tool, since it doesn't really do anything of value.

Also. The legacy code has usability issues. It doesn't precisely fit the business use cases. (Indeed, it probably hasn't fit the business use cases for decades.) Some parts of the legacy code base are more liability than asset and should be discarded in order to simplify, streamline or improve operations.

What's left?

The high value processing.

Step 3. Extract the Business Rules.

Once we've disposed of overheads, utility code, quirks, bad design, and wrong use cases, what's left are the real brass tacks. A few lines of code here and there will decode a one-character flag or indicator and determine the processing. This code is of value.

Note that this code will be disappointingly small compared to the total inventory. It will often be widely scattered. Bad copy-and-paste programming will lead to exact copies as well as near-miss copies. It may be opaque.

IF FLAG-2 IS "B" THEN MOVE "R" TO FLAG-BC.

Seriously. What does this mean? This may turn out to be the secret behind paying bonus commissions to highly-valued sales associates. If this isn't preserved, the good folks will all quit en masse.

This is the "Business Rules" layer of a modern application design. These are the nuggets of high-value coding that we need to preserve.

These are things that must be redesigned when moving from the old database (or flat files) to the new database. These one character flag fields should not simply be preserved as a single character. They need to be understood.

The business rules should never be subject to automated translation. These bits of business-specific processing must always be reviewed by the users (or business owners) to be absolutely sure that each one (a) is relevant and (b) has a complete suite of unit test cases.

The unique processing rules need to have modern, formal documentation. Minimally, the documentation must be in the form of unit test cases; English as a backup can be helpful.
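
As a sketch of what "documentation in the form of unit test cases" might look like for the one-line COBOL rule shown earlier: the function name, the parameter name and the bonus-commission reading of the flags are all assumptions for illustration. The real meaning has to come from the business owners.

```python
import unittest


def bonus_commission_code(associate_tier):
    """Business rule recovered from: IF FLAG-2 IS "B" THEN MOVE "R" TO FLAG-BC.

    The names and the meaning given to "B" and "R" are hypothetical.
    """
    if associate_tier == "B":
        return "R"
    return None  # The legacy statement leaves FLAG-BC unchanged otherwise.


class TestBonusCommissionRule(unittest.TestCase):
    def test_tier_b_gets_code_r(self):
        self.assertEqual("R", bonus_commission_code("B"))

    def test_other_tiers_are_untouched(self):
        for tier in ("A", "C", " "):
            with self.subTest(tier=tier):
                self.assertIsNone(bonus_commission_code(tier))
```

The test names are the part the business owner reviews. If "tier B gets code R" is wrong or irrelevant, that's found before any data is converted.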

Step 4. Build Bridges.

A modernization project is not a once-and-done operation.

I've been told that the IT department goal is to pick a long weekend, preferably a federal Monday holiday weekend (Labor Day is always popular), and do a massive one-time-only conversion on that weekend.

This is a terrible plan. It is doomed to failure.

A better plan is a phased coexistence. If a vendor (like Blue Phoenix) offers bridge services, then it's smarter and less risky to convert back and forth between legacy and new over and over again.

The policy is to convert early and convert often.

A good plan is the following.

1. Modernize some set of features in the legacy quagmire of code. This should be a simple rewrite from scratch using the legacy code as a specification and the legacy files (or database) as an interface.
2. Run in parallel to be sure the modern version works. Do frequent data conversions from old to new as part of this parallel test.
3. At some point, simply stop converting from old to new and start using the new because it passes all the tests. Often, the new will have additional features or remove old bugs, so the users will be clamoring for it.

For particularly large and gnarly systems, all features cannot be modernized at once. There will be features that have not yet been modernized. This means that some portion of new data will be converted back to the legacy for processing.

The feature sets are prioritized by value. What's most important to the users? As each feature set is modernized, the remaining bits become less and less valuable. At some point, you get to the situation where you have a portfolio of unconverted code but no missing features. Since there are no more desirable legacy features to convert, the remaining code is -- by definition -- worthless.

The unconverted code is a net cost savings.

Automated Translation

Note that there is very little emphasis on automated translation of legacy code. The important work is uncovering the data and the processing rules that make the data usable. The important tools are inventory tools and data bridging tools.

Language survey tools will be helpful. Tools to look for file operations. Tools to look for places where a particular field of a record is used.

Automated translation will tend to pave all the cowpaths: good, bad and indifferent. Once the good features are located, a manual rewrite is just as efficient as automated translation.

Automated translation cannot capture meaning, identify use cases or write unit test cases. Thoughtful manual analysis of meaning, usability and unit tests is how the value of legacy code and data is preserved.

Thursday, September 13, 2012

RESTful Web Services Testing, Q&A

Some background:

I was vaguely pointed at one call in an API, via a 2-page "tutorial" that uses CURL examples. Told "Test this some more." by the guy who'd been doing some amount (none?) of hand "success path" testing via CURL. This has since morphed into "regression testing things, all 12 calls", "we have a build API as well", and "there's this hot new feature for a vendor conference in a couple weeks ..."

There was more, but you get the idea. There were some more specific "requirements" for the RESTful unit testing environment.

1) Get "smoke test" coverage vs. all the calls

A sequence of CURL requests to exercise a server can be viewed as "testing". It's piss-poor at best. Indeed, it's often misleading because of the complexity of the technology stack.

In addition to the app, you're also testing Apache (or whatever server they're using) plus the framework, plus the firewall, plus caching and any other components of the server's technology stack.

However, it does get you started ASAP.

2) expand / parameterize that

CURL isn't the best choice. You wind up writing shell scripts. It gets ugly before long.

Python is better for this.

Selenium may also work. Oh wait. Selenium has Python bindings.

3) build out to response correctness & error codes

Proper design for testability makes this easy.

However. When you've been tossed a "finished" RESTful web service that you're supposed to be testing, you have to struggle with expected vs. actual.

It's not trivial because the responses may have legitimate variances: date-time stamps, changing security tokens or nonces, sequence numbers that vary.

Essentially, you can't just use the OS DIFF program to compare actual CURL responses with expected CURL responses.

You're going to have to parse the response, pick out appropriate fields to check and write proper unittest assertions around those fields.
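
Something like the following sketch, assuming the service answers with JSON. The response body, the field names and the `stable_fields()` helper are all hypothetical; in a real test the body would come from the service.

```python
import json
import unittest

# Hypothetical response body; in a real test it would come from the service.
SAMPLE_RESPONSE = json.dumps({
    "order_id": 42,
    "status": "SHIPPED",
    "timestamp": "2012-09-13T08:15:00Z",  # varies per request
    "nonce": "f3a9c1",                    # varies per request
})

VOLATILE_FIELDS = {"timestamp", "nonce"}


def stable_fields(body):
    """Parse a JSON response and drop the fields that legitimately vary."""
    document = json.loads(body)
    return {k: v for k, v in document.items() if k not in VOLATILE_FIELDS}


class TestOrderResponse(unittest.TestCase):
    def test_expected_vs_actual(self):
        expected = {"order_id": 42, "status": "SHIPPED"}
        self.assertEqual(expected, stable_fields(SAMPLE_RESPONSE))

    def test_volatile_fields_are_present(self):
        # Varying fields still get a structural check, just not a value check.
        document = json.loads(SAMPLE_RESPONSE)
        self.assertTrue(VOLATILE_FIELDS.issubset(document))
```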

4) layer in at least that much testing for the new, new feature breathlessly happening RIGHT NOW.

Without a proper design for testability, this can be painful.

If you're using a good unit test framework, it shouldn't be impossible. Your framework must be able to start the target RESTful web service for a TestCase, exercise the TestCase, and then shutdown the target RESTful web service when the test has completed.

Now, you're just writing unittest TestCase instances for the new feature breathlessly happening RIGHT NOW. That should be manageable.

...tool things I've found so far... [list elided]

All crap, more or less. They're REST client libraries, not testing tools.

You need a proper unit testing framework with TestCase and TestSuite classes and a TestRunner. The tools you identified aren't testing frameworks; they're lower-level REST clients and client libraries. CURL, by itself, isn't really very good for robust testing unless you embed CURL in some test framework.

For defining interfaces (2), I have found these... [list elided]

APIs in a typical RESTful environment have little or no formal definition, other than English. WSDL is for Java/XML/SOAP. It's not used much for (simpler) REST. For the most part, REST API definitions (i.e., via JSON or whatever) are mostly experimental. Not standardized. Not to be trusted.

The issue is one of parallel maintenance. The idea is that a REST framework can operate without too much additional JSON or XML folderol; just the code should be sufficient.

If there's no WSDL (because it's just REST) then there's no formal definition of anything. Sorry.

I (perhaps foolishly) figured there'd be some standard way to consume the standard format definition of an API, to generate test cases and stubbing at least. Maybe even a standard set of verifications, like error codes. So I went a-googling for 1) a standard / conventional way to spec these APIs, 2) a standard / conventional tool or maybe tools @one per stack, and 3) a standard / conventional way to generate tests or test scaffolding in these tools, consuming the standard / conventional API spec. So far, not so much.

"So far, not so much" is the state of the art. You have correctly understood what's available.

REST -- in spite of its trivial simplicity and strict adherence to HTTP -- is a rather open world. It's also pretty simple. Fancy tools don't help much.

Why not?

Because decent programming languages already do REST; tools don't add significant value. In the case of Python, there are relatively few tools (Selenium is the big deal, and it's for browser testing) because there's no real marketplace niche for them. In general, simple Python using httplib (or Python 3 http.client) can test the living shit out of a RESTful API better than CURL/DIFF with no ugly shell-script coding. Only polite, civilized Python coding.

Tuesday, September 11, 2012

RESTful Web Service Testing

Unit testing RESTful web services is rather complex. Ideally, the services are tested in isolation before being packaged as a service. However, sometimes people will want to test the "finished" or "integrated" web services technology stack because (I suppose) they don't trust their lower-level unit tests.

Or they don't have effective lower-level unit tests.

Before we look at testing a complete RESTful web service, we need to expose some underlying principles.

Principle #1. Unit does not mean "class". Unit means unit: a discrete unit of code. Class, package, module, framework, application. All are legitimate meanings of unit. We want to use stable, easy-to-live with unit testing tools. We don't want to invent something based on shell scripts running CURL and DIFF.

Principle #2. The code under test cannot have any changes made to it for testing. It has to be the real, unmodified production code. This seems self-evident. But. It gets violated by folks who have badly-designed RESTful services.

This principle means that all the settings required for testability must be part of an external configuration. No exceptions. It also means that your service may need to be refactored so that the guts can be run from the command line outside Apache.

When your RESTful Web Service depends on third-party web service(s), there is an additional principle.

Principle #3. You must have formal proxy classes for all RESTful services your app consumes. These proxy classes are going to be really simple, since they must trivially map resource requests to proper HTTP processing. In Python, it is delightfully simple to create a class where each method simply uses httplib (or http.client in Python 3.2) to make a GET, POST, PUT or DELETE request. In Java you can do this, also, it's just not delightfully simple.
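
A sketch of such a proxy class, assuming Python 3 and a JSON-speaking service. The `InventoryProxy` name, the `/items` resource paths and the `connection_factory` parameter are hypothetical. The factory parameter is one way to let a test substitute a fake connection without touching the production code.

```python
import http.client
import json


class InventoryProxy:
    """Hypothetical proxy for a third-party service at /items/<id>.

    connection_factory defaults to http.client.HTTPConnection; a test can pass
    in a fake to avoid touching the network.
    """

    def __init__(self, host, port=80, connection_factory=http.client.HTTPConnection):
        self.host, self.port = host, port
        self.connection_factory = connection_factory

    def _request(self, method, path, document=None):
        body = None if document is None else json.dumps(document)
        headers = {} if body is None else {"Content-Type": "application/json"}
        connection = self.connection_factory(self.host, self.port)
        try:
            connection.request(method, path, body=body, headers=headers)
            response = connection.getresponse()
            return response.status, response.read()
        finally:
            connection.close()

    def get_item(self, item_id):
        return self._request("GET", f"/items/{item_id}")

    def create_item(self, document):
        return self._request("POST", "/items", document)

    def replace_item(self, item_id, document):
        return self._request("PUT", f"/items/{item_id}", document)

    def delete_item(self, item_id):
        return self._request("DELETE", f"/items/{item_id}")
```

Each method is a one-liner that maps a resource request to an HTTP verb and path. Anything more elaborate probably belongs in the application, not the proxy.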

TestCase Overview

Testing a RESTful web service is a matter of starting an instance of the service, running a standard unit testing TestCase, and then shutting that instance down. Generally this will involve setUpModule and tearDownModule (in Python parlance) or a @BeforeClass and @AfterClass (in Java parlance).
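
In Python, the outline might look like this sketch. The `StubService` handler is a hypothetical stand-in for the real service, and it runs in a thread of the test process to keep the example self-contained; a real setup would more likely start the service as a separate process.

```python
import http.client
import http.server
import threading
import unittest

_server = None
_thread = None


class StubService(http.server.BaseHTTPRequestHandler):
    """Stand-in for the service under test; always answers GET with 'pong'."""

    def do_GET(self):
        body = b"pong"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep test output quiet.


def setUpModule():
    """Start the service once, before any TestCase in this module runs."""
    global _server, _thread
    # Port 0 asks the OS for any free port.
    _server = http.server.HTTPServer(("127.0.0.1", 0), StubService)
    _thread = threading.Thread(target=_server.serve_forever, daemon=True)
    _thread.start()


def tearDownModule():
    """Shut the service down after the last TestCase has finished."""
    _server.shutdown()
    _server.server_close()
    _thread.join()


class TestPing(unittest.TestCase):
    def test_get_returns_pong(self):
        connection = http.client.HTTPConnection("127.0.0.1", _server.server_port)
        try:
            connection.request("GET", "/ping")
            response = connection.getresponse()
            self.assertEqual(200, response.status)
            self.assertEqual(b"pong", response.read())
        finally:
            connection.close()
```

Saved as a module and run with `python -m unittest`, the runner calls `setUpModule()` and `tearDownModule()` around the test cases.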

The class-level (or module-level) setup must start the application server being tested. The server will start in some known initial condition. This may involve building and populating a known database, too. This can be fairly complex.

When working with SQL, in-memory databases are essential for this. SQLite (Python) or http://hsqldb.org (Java) can be life-savers because they're fast and flexible.
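
For example, a sketch with SQLite's `:memory:` database. The `orders` table and the `open_orders()` query are hypothetical stand-ins for application code.

```python
import sqlite3
import unittest


def open_orders(connection):
    """Hypothetical application query: ids of orders not yet shipped."""
    rows = connection.execute("SELECT id FROM orders WHERE shipped = 0 ORDER BY id")
    return [row[0] for row in rows]


class TestOpenOrders(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test gives a known starting state.
        self.connection = sqlite3.connect(":memory:")
        self.connection.execute(
            "CREATE TABLE orders(id INTEGER PRIMARY KEY, shipped INTEGER)"
        )
        self.connection.executemany(
            "INSERT INTO orders(id, shipped) VALUES (?, ?)", [(1, 0), (2, 1), (3, 0)]
        )

    def tearDown(self):
        self.connection.close()

    def test_only_unshipped_orders(self):
        self.assertEqual([1, 3], open_orders(self.connection))
```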

What's important is that the client access to the RESTful web service is entirely under control of a unit testing framework.

Mocking The Server

A small, special-purpose server must be built that mocks the full application server without the endless overheads of a full web server.

It can be simpler to mock a server rather than to try to reset the state of a running Apache server. TestCases often execute a sequence of stateful requests assuming a known starting state. Starting a fresh mock server is sometimes an easy way to set this known starting state.

Here's a Python script that will start a server. It writes the PID to a file for the shutdown script.

    import http.server
    import os
    from the_application import some_application_feature

    class AppWrapper(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # Parse the URL; the last path segment is the resource id.
            id = self.path.split("/")[-1]
            # Invoke the real application's method for GET on this URL.
            body = some_application_feature(id)
            # Respond appropriately. This assumes the feature returns a str.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))
        # ... etc. for do_POST, do_PUT and do_DELETE ...

    # Database setup before starting the service.
    # Filesystem setup before starting the service.
    # Other web service proxy processes must be started, too.
    with open("someservice.pid", "w") as pid_file:
        print(os.getpid(), file=pid_file)
    httpd = http.server.HTTPServer(("localhost", 8000), AppWrapper)
    try:
        httpd.serve_forever()
    finally:
        # Cleanup other web services.
        httpd.server_close()

Here's a shutdown script.

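    import os
    import signal

    with open("someservice.pid") as pid_file:
        pid = int(pid_file.read())
    # CTRL_C_EVENT exists only on Windows; SIGINT is the equivalent elsewhere.
    os.kill(pid, getattr(signal, "CTRL_C_EVENT", signal.SIGINT))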

These two scripts will start and stop a mock server that wraps the underlying application.

When you're working in Java, it isn't so delightfully simple as Python. But it should be respectably simple. And you have Jython Java integration so that this Python code can invoke a Java application without too much pain.

Plus, you can always fall further back to a CGI-like unit testing capability where "body= some_application_feature( id )" becomes a subprocess.call(). Yes it's inefficient. We're just testing.

This CGI-like access only works if the application is very well-behaved and can be configured to process one request at a time from a local file or from the command line. This, in turn, may require building a test harness that uses the core application logic in a CGI-like context where STDIN is read and STDOUT is written.
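
A sketch of the CGI-like fallback. The `HARNESS` one-liner is a hypothetical stand-in for a real command-line harness around the application's core logic, and the sketch uses `subprocess.run()` (the newer interface) rather than `subprocess.call()` so it can feed STDIN and capture STDOUT in one step.

```python
import subprocess
import sys
import unittest

# Hypothetical one-request harness: reads an id on STDIN, writes a body on STDOUT.
# A real harness would import the application's core logic instead.
HARNESS = "import sys; item = sys.stdin.read().strip(); print('item:' + item)"


def some_application_feature(item_id):
    """Run one request through the harness in a child process."""
    completed = subprocess.run(
        [sys.executable, "-c", HARNESS],
        input=item_id,
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit status raises CalledProcessError
    )
    return completed.stdout


class TestViaSubprocess(unittest.TestCase):
    def test_single_request(self):
        self.assertEqual("item:42\n", some_application_feature("42"))
```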

Thursday, June 7, 2012

Stingray Schema-Based File Reader

Just updated the Stingray Reader. There was an egregious error (and a missing test case). I fixed the error, but didn't add a test case to cover the problem.

It's simple laziness. TDD is quite clear on how to tackle this kind of thing. Write the missing test case (which will now fail). Then make the code change.

But the code change was so simple.

Tuesday, February 14, 2012

TDRE - Test Driven Reverse Engineering Case Study

Background

Read up on compass variation or declination. For example, this NOAA site provides some useful information.

Mariners use the magnetic variation to compute the difference between True north (i.e., aligned with the grid on the chart) and Magnetic north (i.e., aligned with the compass.)

The essential use case here is "What's the compass variation at a given point?" The information is printed on paper charts, but it's more useful to simply calculate it.

There are two magnetic models: the US Department of Defense World Magnetic Model (WMM) and the International Association of Geomagnetism and Aeronomy (IAGA) International Geomagnetic Reference Field (IGRF).

A packaged solution is geomag7.0. This includes both the WMM and the IGRF models. This is quite complex. However, it does have "sample output", which amounts to unit test cases.

The essential spherical harmonic model is available separately as a small Fortran program, igrf11.f.

Which leads us to reverse engineering this program into Python.

TDRE Approach

The TDRE approach requires having some test cases to drive the reverse engineering process toward some kind of useful results.

The geomag7.0 package includes two "Sample Output" files that have the relevant unit test cases. The file has column headings and 16 test cases. This leads us to the following outline for the unit test application.


    import csv
    import math
    import sys
    import unittest

    class Test_Geomag( unittest.TestCase ):
        def __init__( self, row ):
            super( Test_Geomag, self ).__init__()
            self.row= row
        def runTest( self ):
            row= self.row
            if details: 
                print( "Source: {0:10s} {1} {2:7s} {3:10s} {4:10s} {5:5s} {6:5s}".format( row['Date'], row['Coord-System'], row['Altitude'], row['Latitude'], row['Longitude'], row['D_deg'], row['D_min'] ),
                file=details )
            
            date= self.parse_date( row['Date'] )
            lat= self.parse_lat_lon( row['Latitude'] )
            lon= self.parse_lat_lon( row['Longitude'] )
            alt= self.parse_altitude(row['Altitude'] )
            
            x, y, z, f = igrf11syn( date, lat*math.pi/180, lon*math.pi/180, alt, coord=row['Coord-System'] )
            D = 180.0/math.pi*math.atan2(y, x) # Declination 

            deg, min = deg2dm( D )
            
            if details: 
                print( "Result: {0:10.5f} {1} K{2:<6.1f} {3:<10.3f} {4:<10.3f} {5:5s} {6:5s}".format( date, row['Coord-System'], alt, lat, lon, str(deg)+"d", str(min)+"m" ), 
                    file=details )
                print( file=details )
            
            self.assertEqual( row['D_deg'], "{0}d".format(deg) )
            self.assertEqual( row['D_min'], "{0}m".format(min) )

    def suite():
        s= unittest.TestSuite()
        with open(sample_output,"r") as expected:
            rdr= csv.DictReader( expected, delimiter=' ', skipinitialspace=True )
            for row in rdr:
                case= Test_Geomag( row )
                s.addTest( case )
        return s

    r = unittest.TextTestRunner(sys.stdout)
    result= r.run( suite() )
    sys.exit(not result.wasSuccessful())

The Test_Geomag class does two things. First, it parses the source values to create a usable test case. We've omitted the parsers to reduce clutter. Second, it produces details to help with debugging. This is reverse engineering, and there's lots of debugging. It depends on a global variable, details, which is either set to sys.stderr or None.

This suite() function builds a suite of test cases from the input file.

The unit under test isn't obvious, but there's a call to the igrf11syn() function where the important work gets done. We can start with this.

def igrf11syn( date, nlat, elong, alt=0.0, coord='D' ):
    return None, None, None, None

This lets us run the tests and find that we have work to do.

Reverse Engineering

The IGRF11.F Fortran code contains this IGRF11SYN "subroutine" that does the work we want. The geomag 7.0 package has a function called shval3 which is essentially the same thing.

Both are implementations of the same underlying "13th order spherical harmonic series" or a "truncated series expansion".

The Fortran code contains numerous Fortran "optimizations". These are irritating hackarounds because of actual (and perceived) limitations of Fortran. They fall into two broad classes.

Hand Optimizations. All repeated expressions were manually hoisted out of their context. This is clever but makes the code particularly obscure. It doesn't help when local variables are named ONE, TWO and THREE. Bad as it is, not much needs to be done about this. Python code looks a bit like Fortran code, so very little needs to be done except add `math.` to the various function calls like sqrt, cos and sin.
Sparse Array Chicanery. There are actually two spherical harmonic series. The older 10-order and the new 13-order. Each model has two sets of coefficients: g and h. These form two half-matrices plus a vector. The old models have 55 g values in one matrix, 55 h values in a second matrix, and a set of 10 more g values that form some kind of vector; 120 values. The new models have 91 g, 91 h and 13 g in the extra vector; 195 values. There are 23 sets of these coefficients (for 1900, 1905, ... 2015). The worst case is 23×195=4,485 values. This appears to be too much memory, so the two matrices and vectors are optimized into a single opaque collection of 3,256 numbers and a delightfully complex set of index calculations.

Phase 1. Do the smallest "literal" transformation of Fortran to Python.

This means things like this:

Transforming the subroutine into a Python function with multiple return values.
Reasoning out the overall "steps". There's a bunch of setup followed by the essential series calculation followed by some final calculations.
Locating and populating the global variables.
Reformatting the if statements.
Removing the GOTO's. Either make them separate functions or properly nest the code.
Reformatting the do loop.
Handling the 1-based indexing. In almost all cases, Fortran "arrays" are best handled as Python dictionaries (not lists).
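As a minimal sketch of that last point: a Fortran array can be carried over as a dictionary keyed by the original 1-based subscripts, so the index arithmetic can be copied without adjustment. The array name, loop and values here are hypothetical; they are not taken from IGRF11.F.

```python
# Hypothetical Fortran fragment being translated:
#       DIMENSION GH(5)
#       DO 10 I = 1, 5
#    10 TOTAL = TOTAL + GH(I) * I

# A dict keyed by the 1-based subscripts preserves the Fortran indexing.
values = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up data
gh = {i: v for i, v in enumerate(values, start=1)}

total = 0.0
for i in range(1, 5 + 1):  # DO I = 1, 5 is inclusive at both ends
    total += gh[i] * i
```

A list would force every subscript expression to be rewritten with an offset of one, which is exactly the kind of change that's easy to get wrong in a "literal" first pass.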

Once this is done, there are some remaining special-case discrepancies. Most of these are tacit assumptions about the problem domain that turn out to be untrue. For example, the Geodetic, Geocentric features seemed needless. However, they're not handled trivially, and need to be left in place. Also, conversion of signed values in radians to degrees and minutes isn't trivial.

This leads to passing all 16 unit tests with the single opaque collection of 3,256 numbers and a delightfully complex set of index calculations.

Phase 2. Optimize so that it makes some sense in Python.

This involves unwinding the index calculations to simplify the array. The raw coefficients are available (igrf11coeffs.txt) and they have a sensible structure that separates the two matrices very cleanly. The code uses the combined matrix (called gh) in a very few places. The index calculations aren't obvious at all, but a few calls to print reveal how the matrix is accessed.

Given (1) unit tests that already work and (2) the pattern of access, it's relatively easy to hypothesize a dictionary by year that contains a pair of simple dictionaries, g[n,m] and h[n,m], for the coefficients.
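Here's a rough sketch of what that structure might look like. The file layout assumed here (a header row of years, then one row per coefficient) is an assumption and has not been checked against igrf11coeffs.txt. The numbers are placeholders, not real coefficients, and load_coefficients is a hypothetical name.

```python
# Assumed layout: a header row naming the years, then one row per
# coefficient: kind (g or h), n, m, and one value per year.
# The values are placeholders, not real IGRF coefficients.
SAMPLE = """\
g/h n m 1900.0 1905.0
g 1 0 -1.0 -2.0
g 1 1 -3.0 -4.0
h 1 1 5.0 6.0
"""

def load_coefficients(text):
    """Return {year: {"g": {(n, m): value}, "h": {(n, m): value}}}."""
    lines = text.splitlines()
    years = [float(y) for y in lines[0].split()[3:]]
    model = {year: {"g": {}, "h": {}} for year in years}
    for line in lines[1:]:
        fields = line.split()
        kind, n, m = fields[0], int(fields[1]), int(fields[2])
        for year, value in zip(years, fields[3:]):
            model[year][kind][n, m] = float(value)
    return model

model = load_coefficients(SAMPLE)
g, h = model[1905.0]["g"], model[1905.0]["h"]
```

With this in place, an expression in the series can refer to g[n, m] and h[n, m] directly instead of computing an offset into one combined array.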

Cleanup and Packaging

Once the tests pass, the package -- as a whole -- needs to be made reasonably Pythonic. In this case, it means a number of additional changes. For example, converting the API from degrees to radians, supplying appropriate default values for parameters, providing convenience functions.

Additionally, there are Python ways to populate the coefficients neatly and eliminate global variables. In this case, it seemed sensible to create a Callable class which could load the coefficients during construction.
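A callable class along these lines might look like the following sketch. The class name, its argument and the lookup it performs are all hypothetical. The real function evaluates the series; this stand-in only shows instance state replacing global variables.

```python
class FieldModel(object):
    """Hypothetical callable: coefficients are loaded once, at construction."""

    def __init__(self, coefficients):
        # coefficients: {year: {(n, m): value}}, a stand-in for the real tables.
        self.coefficients = dict(coefficients)

    def __call__(self, year, n, m):
        # A real implementation would evaluate the harmonic series here.
        # This stand-in only looks up one coefficient from instance state.
        return self.coefficients[year][n, m]

igrf = FieldModel({1900.0: {(1, 0): -1.0}})  # placeholder value
value = igrf(1900.0, 1, 0)
```

The instance is used exactly like a function, which keeps the calling convention of the original while making it easy for a test to build a model from a small set of coefficients.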

Note that there's little point in profiling to apply further optimizations. The legacy Fortran code was already meticulously hand optimized.

Tuesday, December 13, 2011

The need for ping

Years ago, when designing an interface to a vendor's web services, I did the following. This isn't a genius move, but it's worth emphasizing how important it is. And what's most important isn't technical.

I built a simple spike solution to access their service.
I morphed this into a "sanity check" to be sure that their service really was working. Mostly, I cleaned up the code so that it was testable and deliverable without embarrassment.
I morphed this into a "diagnostic tool" to bypass the higher-levels of the application and simply access the vendor (and optionally dump the results) to help determine what wasn't working. This involved adding the dump option to the sanity check and renaming the command-line application.
I morphed this into a "credentials check and diagnostic tool". This was -- ahem -- merely taking the hard-wired credentials out of the application. Yes. The first versions had hard-wired credentials.

That brings us to the version in use today. The "vendor ping" application.

The default behavior is a credentials check.

One optional behavior is to dump the interface details.

Another optional behavior is to allow selection among a small number of simple interactions just to be sure things are working.
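The command-line shape of such a tool can be sketched with argparse. Everything named here is hypothetical (the program name, the option names, the interaction names and the default file name); the original application's interface isn't shown in this post.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="vendor_ping")
    parser.add_argument(
        "--credentials-file", default="vendor.ini",
        help="where credentials are read from; nothing is hard-wired")
    parser.add_argument(
        "--dump", action="store_true",
        help="dump the interface details")
    parser.add_argument(
        "--interaction", choices=["status", "echo"], default=None,
        help="run one simple interaction to be sure things are working")
    return parser

def plan(options):
    """Return the steps to perform; the credentials check always runs."""
    steps = ["check_credentials"]
    if options.dump:
        steps.append("dump_interface")
    if options.interaction:
        steps.append("interaction:" + options.interaction)
    return steps

default_steps = plan(build_parser().parse_args([]))
full_steps = plan(build_parser().parse_args(["--dump", "--interaction", "echo"]))
```

Separating the plan from the actual vendor calls keeps the option handling testable without touching the vendor's service.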

Unplanned Work

What's important here isn't that I did all this. What's important is that the deliverables, user stories and project plans didn't include this little nugget of high-value goodness.

It gets run fairly frequently in crunch situations. The actor in the story ("As system admin...") is rarely considered as a first-class user of the application. Yet, the admin is a first-class user, and needs to have proper user stories for confirming that the application is working properly.

Thursday, September 22, 2011

"Strict" Unit Testing -- Everything In Isolation Is Too Much Work

Folks like to claim that unit testing absolutely requires each class be tested in isolation using mocks for all dependencies. This is a noble aspiration, but doesn't work out perfectly well in Python.

First, "unit" is intentionally vague. It could be a class, a function, a module or a package. It's a "unit" of code. Anything could be considered a "unit".

Second--and more important--the extensive mocking isn't fully appropriate for Python programming. Mocks are very helpful in statically-typed languages where you must be very fussy about assuring that all of the interface definitions are carefully matched up properly.

In Python, duck typing allows a mock to be defined quite trivially. A mock library isn't terribly helpful, since it doesn't reduce the code volume or complexity in any meaningful way.

Dependencies without Injection

The larger issue with trying to unit test in Python with mock objects is the impact of change.

We have some class with an interface.

class AppFeature( object ):
    def app_method( self, anotherObject ):
        pass # etc.

class AnotherClass( object ):
    def another_method( self ):
        pass # etc.

We've properly used dependency injection to make AppFeature depend on an instance of AnotherClass. This means that we're supposed to create a mock of AnotherClass to test the AppFeature.

class MockAnotherClass( object ):
    def another_method( self ):
        pass # etc.

In Python, this mock isn't a best practice. It can be helpful. But adding a mock can also be confusing and misleading.

Refactoring Scenario

Consider the situation where we're refactoring and change the interface to AnotherClass. We modify another_method to take an additional argument, for example.

How many mocks do we have? How many need to be changed? What happens when we miss one of the mocks and have the mysterious Isolated Test Failure?

While we can use a naming convention and grep to locate the mocks, this can (and does) get murky when we've got a mock that replaces a complex cluster of objects with a simple Facade for testing purposes. Now, we've got a mock that doesn't trivially replace the mocked class.
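A small sketch of the missed-mock problem, reusing the class names from above. The method bodies and the scale argument are invented for illustration. The real class and the application have been refactored together; the mock was overlooked.

```python
class AnotherClass(object):
    # After refactoring: another_method now takes an additional argument.
    def another_method(self, scale):
        return 2 * scale

class MockAnotherClass(object):
    # Missed during refactoring: still has the old signature.
    def another_method(self):
        return 2

class AppFeature(object):
    # Updated to use the new interface.
    def app_method(self, anotherObject):
        return anotherObject.another_method(3) + 1

feature = AppFeature()

# With the real class, everything works.
real_result = feature.app_method(AnotherClass())

# With the stale mock, the "isolated" test breaks even though
# the application code is correct.
try:
    feature.app_method(MockAnotherClass())
    mock_failure = None
except TypeError as error:
    mock_failure = error
```

The failure says nothing about AppFeature or AnotherClass. It only reports that a test fixture has drifted away from the class it imitates.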

Alternative: Less Strict Mocking

In Python--and other duck typing languages--a less mock-heavy approach seems more productive. The goal of testing every class in isolation surrounded by mocks needs to be relaxed. A more helpful approach is to work up through the layers.

Test the "low-level" classes--those with few or no dependencies--in isolation. This is easy because they're already isolated by design.
The classes which depend on these low-level classes can simply use the low-level classes without shame or embarrassment. The low-level classes work. Higher-level classes can depend on them. It's okay.
In some cases, mocks are required for particularly complex or difficult classes. Nothing is wrong with mocks. But fussy overuse of mocks does create additional work.

The benefits of this are:

The layered architecture is tested the way it's actually used. The low-level classes are tested in isolation as well as being tested in conjunction with the classes that depend on them.
It's easier to refactor. The design changes aren't propagated into mocks.
Layer boundaries can be more strictly enforced. Circularities are exposed in a more useful way through the dependencies and layered testing.
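A minimal sketch of the layered approach, using unittest. The Rate and Invoice classes are hypothetical examples, not from any real application. The low-level class is tested in isolation; the higher-level class is tested with the real low-level class injected.

```python
import unittest

class Rate(object):
    """Low-level class with no dependencies."""
    def __init__(self, percent):
        self.percent = percent
    def apply(self, amount):
        return amount * self.percent / 100.0

class Invoice(object):
    """Higher-level class; a Rate is injected, not mocked."""
    def __init__(self, rate):
        self.rate = rate
    def total(self, amount):
        return amount + self.rate.apply(amount)

class TestRate(unittest.TestCase):
    # The low-level class is isolated by design.
    def test_apply(self):
        self.assertEqual(5.0, Rate(10).apply(50))

class TestInvoice(unittest.TestCase):
    # The higher-level class simply uses the tested low-level class.
    def test_total_uses_real_rate(self):
        self.assertEqual(55.0, Invoice(Rate(10)).total(50))

loader = unittest.defaultTestLoader
suite = unittest.TestSuite()
suite.addTests(loader.loadTestsFromTestCase(TestRate))
suite.addTests(loader.loadTestsFromTestCase(TestInvoice))
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If Rate.apply changes its signature, TestInvoice fails for a real reason: Invoice needs to change too. There's no mock to hunt down.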

We still need to work out proper dependency injection. If we try to mock every dependency, we are forced to confront every dependency in glorious detail. If we don't mock every single dependency, we can slide by without properly isolating our design.

Tuesday, April 19, 2011

Test-Driven Reverse Engineering (TDRE)

Another case study on TDRE.

Provided: 2,938 lines of Python code which process a handful of large files to create a number of outputs. [Details can't be disclosed.]

Objective: Refactor to distinguish between the overall sequence of transformational steps and the details of each individual step.

Observations

The code is almost purely procedural. There are 11 class definitions. 6 of these wrap built-in types with type conversion and null-handling. 1 is a new exception. 1 is a generic "table" that essentially duplicates features of SQLite. The remaining 3 are actually part of the problem domain.

One reason for reverse engineering is that the code has reached an intellectual limit. It's small, but "dense" with highly-optimized processing steps. The cohesion type is almost all "Temporal". Processing is grouped into successive processing loops; each loop contains a cluster of processing steps. Consequently, it's quite hard to tease apart the algorithm to get a "big picture" of what's going on. It's just a dense stand of trees. No forest.

Another reason for reverse engineering is to support the endless adaptation and modification of the code base. The program is a kind of "spreadsheet on steroids". This isn't a simplistic collection of cells and formulæ that permits simple what-if analysis. This is a more complex set of formulæ that would be challenging (but not impossible) to implement as a spreadsheet. The use case, however, is the spreadsheet use case: think, tweak, create results, repeat.

TDRE Approach

Start with an Initial Survey of the legacy code base and sample files.

Create an Outline or "sketch" of the domain model and main program. This will be a module (or a package) with comments and some preliminary class definitions. Little more.

Pick a processing Step in the legacy code. This often requires creating processing summaries of the legacy code. Most legacy code is procedural, so the processing tends to be sequential in nature.

Instrument the Legacy Code with print statements to gather data. This can be simple. The output can be challenging to interpret.

with open("tdre_results_1","w") as tdre:
    # some legacy processing
    print( "Case:", foo, bar, ", Expect:", baz, file=tdre )

From the output, Build Unit Test Cases. Fill in parts of the processing sequence and domain model. Debug code until the tests pass.
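Turning captured lines into test cases might look something like this sketch. It assumes the captured values are integers and that the legacy step is simple addition; both are invented for illustration, as are the names new_step and parse_cases. The line layout matches what the print call above would write.

```python
import unittest

# Lines in the shape written by the instrumented legacy code.
# The values (and the step they describe) are made up.
CAPTURED = """\
Case: 1 2 , Expect: 3
Case: 10 -4 , Expect: 6
"""

def new_step(foo, bar):
    """Hypothetical re-implementation of the legacy processing step."""
    return foo + bar

def parse_cases(text):
    """Yield (foo, bar, expected) from each captured line."""
    for line in text.splitlines():
        case, expect = line.split(", Expect:")
        foo, bar = case.replace("Case:", "").split()
        yield int(foo), int(bar), int(expect)

class TestNewStep(unittest.TestCase):
    def test_captured_cases(self):
        for foo, bar, expected in parse_cases(CAPTURED):
            self.assertEqual(expected, new_step(foo, bar))

cases = list(parse_cases(CAPTURED))
```

In practice the parsing is rarely this clean; the legacy values often need the same type conversions the legacy code applied.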

Initial Survey

The Initial Survey locates several things.

The usable, working modules. It appears that all reverse engineering involves a code base with dead or unused code. Even a small project (3,000 lines) will have a remarkable amount of dead code.
Priorities for the implemented functionality. Not every "main" module is relevant.
Example inputs and outputs.

If the software cannot be run (as is the case with organically developed systems that depend on large, complex corporate databases), then the example inputs and outputs may not actually match the software. If the software can be run, it should be run and the actuals compared against the samples to confirm that the code base supplied really produced the sample outputs.

Expect that the provided legacy code is slightly different from the code in production use. In some cases, this cannot be resolved; for example, when the executables are older than the source. In other cases, the code matches and no further work is required to establish the legacy baseline.

The sample outputs point in the direction of an acceptance test case. The sample output cannot be taken literally as the one-and-only acceptance test. While it's desirable for reverse engineering to reproduce the sample output, most reverse engineering will involve enhancements or bug fixes. Expect that errors will be found (or may be known to exist) in the sample output.

Create Outline

The outline is -- initially -- just generic MVP. There must be a domain model, some "presenter" that has the application logic, and some "view" for displaying the outputs.

In our case study, above, the "view" is a collection of (mostly text) output files. The model was undefined in the legacy code, which was all "presenter" application logic.

The goal was to extract the underlying model, break the application "presenter" logic into two layers (forest and trees) and build some views for each of the output files.

Pick a Processing Step

This can be challenging, depending on the legacy code base. There are two paths through a procedural code base.

Back to Front. Start with the final results and unit test the final steps based on previous steps that will be defined later.
Front to Back. Start with the first recognizable intermediate result based on the input files. Unit test the initial steps.

It's more rewarding to work front-to-back because progress can be shown a little more clearly.

A better architecture can be created by working back-to-front since dependencies are easier to understand.

Unit Test Volume, Edges and Corners

There are two unit test design challenges when doing reverse engineering.

Volume. The sample data can be large. 100,000 rows of sample data is too many to test. Finding a "representative" subset is difficult. Generally, arbitrary subsets have to be used to get started. Once the application mostly works, more refined unit tests need to be created.
Edge and Corner Cases. While the code may be riddled with if-statements, it can still be difficult to locate sample inputs that exercise the various conditions in the code. It's risky to create data -- we have to assume that the legacy code does unexpected things. In many cases, print statements have to be put into complex if statements to locate any actual data that exercises that logic path.
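Getting started with an arbitrary subset can be as simple as taking the first few rows. This is a sketch only; the column names and the in-memory "file" are stand-ins for real sample data, and first_rows is a hypothetical helper.

```python
import csv
import io
from itertools import islice

def first_rows(source, limit):
    """Return at most `limit` rows from a CSV file-like object."""
    reader = csv.DictReader(source)
    return list(islice(reader, limit))

# Stand-in for a large sample file: 100,000 generated rows.
big = io.StringIO(
    "a,b\n" + "".join("%d,%d\n" % (i, i * i) for i in range(100000))
)
subset = first_rows(big, 5)
```

The first rows are rarely representative. They're enough to get a unit test running; better subsets come later, once the logic paths are understood.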

Once the unit tests are built, this is just Test-Driven Development (TDD).

Thursday, February 17, 2011

TDD -- From SME Spreadsheet to TestCase to Code

In "Unit Test Case, Subject Matter Experts and Requirements" I suggested that it's often pretty easy to get a spreadsheet of fully worked-out examples from subject-matter experts. Indeed, if you're following TDD, that spreadsheet of examples is solid gold.

Let's consider something relatively simple. Let's say we're working on some fancy calculations. Our users explain until they're blue in the face. We take careful notes. We think we understand. To confirm, we ask for a simple spreadsheet with inputs and outputs.

We get something like the following. The latitudes and longitudes are inputs. The ranges and bearings are outputs. [The math can be seen at "Calculate distance, bearing and more between Latitude/Longitude points".]

Latitude 1	Longitude 1	Latitude 2	Longitude 2	range	bearing
50 21 50N	004 09 25W	42 21 04N	071 02 27W	2805 nm	260 07 38

Only it has a few more rows with different examples. Equator Crossing. Prime Meridian Crossing. All the usual suspects.

TDD Means Making Test Cases

Step one, then, is to parse the spreadsheet full of examples and create some domain-specific examples. Since it's far, far easier to work with .CSV files, we'll presume that we can save the carefully-crafted spreadsheet as a simple .CSV with the columns shown above.

Step two will be to create working Python code from the domain-specific examples.

The creation of test cases is a matter of building some intermediate representation out of the spreadsheet. This is where plenty of parsing and obscure special-case data handling may be necessary.

from __future__ import division, print_function
import csv
from collections import namedtuple
import re

# Degrees, minutes, seconds and a hemisphere letter, e.g. "50 21 50N".
latlon_pat= re.compile(r"(\d+)\s+(\d+)\s+(\d+)([NSWE])")
def latlon( txt ):
  match= latlon_pat.match( txt )
  d, m, s, h = match.groups()
  return float(d)+float(m)/60+float(s)/3600, h

# Degrees, minutes, seconds with no hemisphere, e.g. "260 07 38".
angle_pat= re.compile(r"(\d+)\s+(\d+)\s+(\d+)")
def angle( txt ):
  match= angle_pat.match( txt )
  d, m, s = match.groups()
  return float(d)+float(m)/60+float(s)/3600

# A distance and its units, e.g. "2805 nm".
# Note: in this module the name range shadows the built-in range.
range_pat= re.compile(r"(\d+)\s*(\D+)")
def range( txt ):
  match= range_pat.match( txt )
  d, units = match.groups()
  return float(d), units

RangeBearing= namedtuple("RangeBearing","lat1,lon1,lat2,lon2,rng,brg")

def test_iter( filename="sample_data.csv" ):
  # Yield one RangeBearing example per spreadsheet row.
  with open(filename,"r") as source:
    rdr= csv.DictReader( source )
    for row in rdr:
      print(row)
      tc= RangeBearing(
        latlon(row['Latitude 1']),  latlon(row['Longitude 1']),
        latlon(row['Latitude 2']),  latlon(row['Longitude 2']),
        range(row['range']),
        angle(row['bearing'])
        )
      yield tc

for tc in test_iter():
  print(tc)

This is long, but, it handles a lot of the formatting vagaries that users are prone to.

From Abstract to TestCase

Once we have a generator to build test cases as abstraction examples, generating code for Java or Python or anything else is just a little template-fu.

   
from string import Template
# Each example becomes one TestCase class in the generated source.
testcase= Template("""
class Test_${name}( unittest.TestCase ):
   def setUp( self ):
       self.p1= LatLon( lat=GlobeAngle(*$lat1), lon=GlobeAngle(*$lon1) )
       self.p2= LatLon( lat=GlobeAngle(*$lat2), lon=GlobeAngle(*$lon2) )
   def test_should_compute( self ):
       d, brg = range_bearing( self.p1, self.p2, R=$units )
       self.assertEqual( $dist, int(d) )
       self.assertEqual( $brg, map(int,map(round,brg.deg)))
""")
for name, tc in enumerate( test_iter() ):
   # The range column holds (distance, units); split it for the template.
   units= tc.rng[1].upper()
   dist= tc.rng[0]
   code= testcase.substitute( name=name, dist=dist, units=units, **tc._asdict()  )
   print(code)

This shows a simple template with values filled in. Often, we have to generate a hair more than this. A few imports and a call to unittest.main() are usually sufficient to transform a spreadsheet into unit tests that we can confidently use for test-driven development.
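As a rough sketch of that "hair more": the generated classes can be wrapped with a header and footer to form a complete module. The template here is a deliberately tiny stand-in, not the range-and-bearing template, and HEADER, FOOTER and build_module are hypothetical names.

```python
from string import Template

HEADER = "import unittest\n"
FOOTER = '\nif __name__ == "__main__":\n    unittest.main()\n'

# A tiny stand-in for the real test-case template.
testcase = Template("""
class Test_${name}(unittest.TestCase):
    def test_should_compute(self):
        self.assertEqual($expected, $actual)
""")

def build_module(cases):
    """Return the source text of a complete, runnable test module."""
    body = "".join(
        testcase.substitute(name=name, expected=expected, actual=actual)
        for name, (expected, actual) in enumerate(cases)
    )
    return HEADER + body + FOOTER

module_text = build_module([(4, "2 + 2"), (6, "2 * 3")])

# Compiling the text is a cheap check that the generated source is
# syntactically valid before it gets written to a file.
compile(module_text, "generated_tests.py", "exec")
```

The real header would also import the application classes under test.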

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.
