Thursday, September 4, 2014

API Testing: Quick, Dirty, and Automated

When writing RESTful API's, the process of testing can be simple or kind of hideous.

The Postman REST Client is pretty popular for testing an API. There are others, I'm sure, but I'm working with folks who like Postman.

Postman 10 has some automation capabilities. Some.

However. (And this is important.)

It doesn't provide much help in framing up a valid complex JSON message.

When dealing with larger and more complex API's with larger and more complex nested and repeating structures, considerably more help is required to frame up a valid request and do some rational evaluation of the response.

Enter Python, httplib and json. While Python3 is universally better, these libraries haven't changed much since Python2, so either version will work.

The idea is simple.
1. Create templates for the eventual class definitions in Python. This can make it easy to build the JSON structures. It can save a lot of hoping that the JSON content is right. It can save time in "exploratory" testing when the JSON structures are wrong.
2. Build complex messages using the template class definitions.
3. Send the message with httplib. Read the response.
4. Evaluate the responses with a simple script.
Some test scripting is possible in Postman. Some. In Python, you've got a complete programming language. The "some" qualifier evaporates.

When it comes to things like seeding database data, Python (via appropriate database drivers) can seed integration test databases, also.

Further, you can use the Python unittest framework to write elegant automated script libraries and run the entire thing from the command line in a simple, repeatable way.

What's important is that the template class definitions aren't working code. They won't evolve into working code. They're placeholders so that we can work out API concepts quickly and develop relatively complete and accurate pictures of what the RESTful interface will look like.

I had to dig out my copy of https://www.packtpub.com/application-development/mastering-object-oriented-python to work out the metaclass trickery required.

The Model and Meta-Model Classes

The essential ingredient is a model class what we can use to build objects. The objective is not a complete model of anything. The objective is just enough model to build a complex object.
Our use case looks like this.


>>> class P(Model):
...    attr1= String()
...    attr2= Array()
...
>>> class Q(Model):
...    attr3= String()
...
>>> example= P( attr1="this", attr2=[Q(attr3="that")] )


Our goal is to trivially build more complex JSON documents for use in API testing.  Clearly, the class definitions are too skinny to have much real meaning. They're handy ways to define a data structure that provides a minimal level of validation and the possibility of providing default values.

Given this goal, we need a model class and descriptor definitions. In addition to the model class, we'll also need a metaclass that will help build the required objects. One feature that we really like is keeping the class-level attributes in order. Something Python doesn't to automatically. But something we can finesse through a metaclass and a class-level sequence number in the descriptors.

Here's the metaclass to cleanup the class __dict__. This is the Python2.7 version because that's what we're using.


class Meta(type):
"""Metaclass to set the name attribute of each Attr instance and provide
the _attr_order sequence that defines the origiunal order.
"""
def __new__( cls, name, bases, dict ):
attr_list = sorted( (a_name
for a_name in dict
if isinstance(dict[a_name], Attr)), key=lambda x:dict[x].seq )
for a_name in attr_list:
setattr( dict[a_name], 'name', a_name )
dict['_attr_order']= attr_list
return super(Meta, cls).__new__( cls, name, bases, dict )

class Model(object):
"""Superclass for all model class definitions;
includes the metaclass to tweak subclass definitions.
This also provides a to_dict() method used for
JSON encoding of the defined attributes.

The __init__() method validates each keyword argument to
assure that they match the defined attributes only.
"""
__metaclass__= Meta
def __init__( self, **kw ):
for name, value in kw.items():
if name not in self._attr_order:
raise AttributeError( "{0} unknown".format(name) )
setattr( self, name, value )
def to_dict( self ):
od= OrderedDict()
for name in self._attr_order:
od[name]= getattr(self, name)
return od


The __new__() method assures that we have an additional _attr_order attribute added to each class definition. The __init__() method allows us to build an instance of a class with keyword parameters that have a minimal sanity check imposed on them. The to_dict() method is used to convert the object prior to making a JSON representation.

Here is the superclass definition of an Attribute. We'll extend this with other attribute specializations.


class Attr(object):
"""A superclass for Attributes; supports a minimal
feature set. Attribute ordering is maintained via
a class-level counter.

Attribute names are bound later via a metaclass
process that provides names for each attribute.

Attributes can have a default value if they are
omitted.
"""
attr_seq= 0
default= None
def __init__( self, *args ):
self.seq= Attr.attr_seq
Attr.attr_seq += 1
self.name= None # Will be assigned by metaclass Meta
def __get__( self, instance, something ):
return instance.__dict__.get(self.name, self.default)
def __set__( self, instance, value ):
instance.__dict__[self.name]= value
def __delete__( self, *args ):
pass


We've done the minimum to implement a data descriptor.  We've also included a class-level sequence number which assures that descriptors can be put into order inside a class definition.

We can then extend this superclass to provide different kinds of attributes. There are a few types which can help us formulate messages properly.


class String(Attr):
default= ""

class Array(Attr):
default= []

class Number(Attr):
default= None


The final ingredient is a JSON encoder that can handle these class definitions.  The idea is that we're not asking for much from our encoder. Just a smooth way to transform these classes into the required dict objects.


class ModelEncoder(json.JSONEncoder):
"""Extend the JSON Encoder to support our Model/Attr
structure.
"""
def default( self, obj ):
if isinstance(obj,Model):
return obj.to_dict()
return super(NamespaceEncoder,self).default(obj)

encoder= ModelEncoder(indent=2)



The Test Cases

Here is an all-important unit test case. This shows how we can define very simple classes and create an object from those class definitions.


>>> class P(Model):
...    attr1= String()
...    attr2= Array()
...
>>> class Q(Model):
...    attr3= String()
...
>>> example= P( attr1="this", attr2=[Q(attr3="that")] )
>>> print( encoder.encode( example ) )
{
"attr1": "this",
"attr2": [
{
"attr3": "that"
}
]
}


Given two simple class structures, we can get a JSON message which we can use for unit testing. We can use httplib to send this to the server and examine the results.

Thursday, August 21, 2014

Permutations, Combinations and Frustrations

The issue of permutations and combinations is sometimes funny.

Not funny weird.  But funny "haha."

I received an email with 100's of words and 10 attachments. (10. Really.) The subject was how best to enumerate 6! permutations of something or other. With a goal of comparing some optimization algorithm with a brute force solution. (I don't know why. I didn't ask.)

Apparently, the programmer was not aware that permutation creation is a pretty standard algorithm with a standard solution. Most "real" programming languages have libraries which already solve this in a tidy, efficient, and well-documented way.

For example

https://docs.python.org/2/library/itertools.html#itertools.permutations

I suspect that this is true for every language in common use.

In Python, this doesn't even really involve programming. It's a first-class expression you enter at the Python >>> prompt.

>>> import itertools
>>> list(itertools.permutations("ABC"))

[('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), ('C', 'B', 'A')]

What's really important about this question was the obstinate inability of the programmer to realize that their problem had a tidy, well understood solution. And has had a good solution for decades. Instead they did a lot of programming and sent 100's of words and 10 attachments (10. Really.)

The best I could do was provide this link:

Steven Skiena, The Algorithm Design Manual

It appears that too few programmers are aware of how much already exists. They plunge ahead creating a godawful mess when a few minutes of reading would have provided a very nice answer.

Eventually, they sent me this:

http://en.wikipedia.org/wiki/Heap's_algorithm

As a grudging acknowledgement that they had wasted hours failing to reinvent the wheel.

Saturday, August 9, 2014

Some Basic Statistics

I've always been fascinated by the essential statistical algorithms. While there are numerous statistical libraries, the simple measures of central tendency (mean, media, mode, standard deviation) have some interesting features.

Well.  Interesting to me.

First, some basics.


def s0( samples ):
return len(samples) # sum(x**0 for x in samples)

def s1( samples ):
return sum(samples) # sum(x**1 for x in samples)

def s2( samples ):
return sum( x**2 for x in samples )


Why define these three nearly useless functions? It's the cool factor of how they're so elegantly related.

Once we have these, though, the definitions of mean and standard deviation become simple and kind of cool.

def mean( samples ):
return s1(samples)/s0(samples)

def stdev( samples ):
N= s0(samples)
return math.sqrt((s2(samples)/N)-(s1(samples)/N)**2)


It's not much, but it seems quite elegant. Ideally, these functions could work from iterables instead of sequence objects, but that's impractical in Python. We must work with a materialized sequence even if we replace len(X) with sum(1 for _ in X).

The next stage of coolness is the following version of Pearson correlation. It involves a little helper function to normalize samples.

def z( x, μ_x, σ_x ):
return (x-μ_x)/σ_x


Yes, we're using Python 3 and Unicode variable names.

Here's the correlation function.

def corr( sample1, sample2 ):
μ_1, σ_1 = mean(sample1), stdev(sample1)
μ_2, σ_2 = mean(sample2), stdev(sample2)
z_1 = (z(x, μ_1, σ_1) for x in sample1)
z_2 = (z(x, μ_2, σ_2) for x in sample2)
r = sum( zx1*zx2 for zx1, zx2 in zip(z_1, z_2) )/len(sample1)
return r


I was looking for something else when I stumbled on this "sum of products of normalized samples" version of correlation. How cool is that? The more text-book versions of this involve lots of sigmas and are pretty bulky-looking. This, on the other hand, is really tidy.

Finally, here's least-squares linear regression.

def linest( x_list, y_list ):
r_xy= corr( x_list, y_list )
μ_x, σ_x= mean(x_list), stdev(x_list)
μ_y, σ_y= mean(y_list), stdev(y_list)
beta= r_xy * σ_y/σ_x
alpha= μ_y - beta*μ_x
return alpha, beta



This, too, was buried at the end of the Wikipedia article. But it was such an elegant formulation for least squares based on correlation. And it leads to a tidy piece of programming. Very tidy.

I haven't taken the time to actually measure the performance of these functions and compare them with more commonly used versions.

But I like the way the Python fits well with the underlying math.

Not shown: The doctest tests for these functions. You can locate sample data and insert your own doctests. It's not difficult.

Thursday, July 24, 2014

Building Probabilistic Graphical Models with Python

A deep dive into probability and scipy: https://www.packtpub.com/building-probabilistic-graphical-models-with-python/book

I have to admit up front that this book is out of my league.

The Python is sensible to me. The subject matter -- graph models, learning and inference -- is above my pay grade.

Let me summarize before diving into details.

Asking someone else if a book is useful is really not going to reveal much. Their background is not my background. They found it helpful/confusing/incomplete/boring isn't really going to indicate anything about how I'll find it.

Asking someone else for a vague, unmeasurable judgement like "useful" or "appropriate" or "helpful" is silly. Someone else's opinions won't apply to you.

Asking if a book is technically correct is more measurable. However. Any competent publisher has a thorough pipeline of editing. It involves at least three steps: Acceptance, Technical Review, and a Final Review. At least three. A good publisher will have multiple technical reviewers. All of this is detailed in the front matter of the book.

Asking someone else if the book was technically correct is like asking if it was reviewed: a silly question. The details of the review process are part of the book. Just check the front matter online before you buy.

It doesn't make sense to ask judgement questions. It doesn't make sense to ask questions answered in the front matter. What can you ask that might be helpful?

I think you might be able to ask completeness questions. "What's omitted from the tutorial?" "What advanced math is assumed?" These are things that can be featured in online reviews.

Irrational Questions

A colleague had some questions about the book named above. Some of which were irrational. I'll try to tackle the rational questions since emphasis my point on ways not to ask questions about books.

2.  Is the Python code good at solidifying the mathematical concepts?

This is a definite maybe situation. The concept of "solidifying" as expressed here bothers me a lot.

Solid mathematics -- to me -- means solid mathematics. Outside any code considerations. I failed a math course in college because I tried to convert everything to algorithms and did not get the math part. A kindly professor explained that "F" very, very clearly. A life lesson. The math exists outside any implementation.

I don't think code can ever "solidify" the mathematics. It goes the other way: the code must properly implement the mathematical concepts. The book depends on scipy, and scipy is a really good implementation of a great deal of advanced math. The implementation of the math sits squarely on the rock-solid foundation of scipy. For me, that's a ringing endorsement of the approach.

If the book reinvented the algorithms available in scipy, that would be reason for concern. The book doesn't reinvent that wheel: it uses scipy to solve problems.

4. Can the code be used to build prototypes?

Um. What? What does the word prototype mean in that question? If we use the usual sense of software prototype, the answer is a trivial "Yes." The examples are prototypes in that sense. That can't be what the question means.

In this context the word might mean "model". Or it might mean "prototype of a model". If we reexamine the question with those other senses of prototype, we might have an answer that's not trivially "yes." Might.

When they ask about prototype, could they mean "model?" The code in the book is a series of models of different kinds of learning. The models are complete, consistent, and work. That can't be what they're asking.

Could they mean "prototype of a model?" It's possible that we're talking about using the book to build a prototype of a model. For example, we might have a large and complex problem with several more degrees of freedom than the text book examples. In this case, perhaps we might want to simplify the complex problem to make it more like one of the text book problems. Then we could use Python to solve that simplified problem as a prototype for building a final model which is appropriate for the larger problem.

In this sense of prototype, the answer remains "What?"  Clearly, the book solves a number of simplified problems and provides code samples that can be expanded and modified to solve larger and more complex problems.

To get past the trivial "yes" for this question, we can try to examine this in a negative sense. What kind of thing is the book unsuitable for? It's unsuitable as a final implementation of anything but the six problems it tackles. It can't be that "prototype" means "final implementation." The book is unsuitable as a tutorial on Python. It's not possible this is what "prototype" means.

Almost any semantics we assign to "prototype" lead to an answer of "yes". The book is suitable for helping someone build a lot of things.

Summary

Those two were the rational questions. The irrational questions made even less sense.

Including the other irrational questions, it appears that the real question might have been this.

Q: "Can I learn Python from this book?"

A: No.

Q: "Can I learn advanced probabilistic modeling with this book?"

A: Above my pay grade. I'm not sure I could learn probabilistic modeling from this book. Maybe I could. But I don't think that I have the depth required.

Q: Can I learn both Python and advanced probabilistic modeling with this book?"

A: Still No.

Gaps In The Book

Here's what I could say about the book.

You won't learn much Python from this book. It assumes Python; it doesn't tutor Python. Indeed, it assumes some working scipy knowledge and a scipy installation. It doesn't include a quick-start tutorial on scipy or any of that other hand-holding.

This is not even a quibble with the presentation. It's just an observation: the examples are all written in Python 2. Small changes are required for Python 3. Scipy will work with Python 3. http://www.scipy.org/scipylib/faq.html#do-numpy-and-scipy-support-python-3-x. Reworking the examples seems to involve only small changes to replace print statements. In that respect, the presentation is excellent.

Thursday, July 17, 2014

New Focus: Data Scientist

Interesting subject areas: Statistics, Machine Learning, Algorithms.

I've had questions about data science from folks who (somehow) felt that calculus and differential equations were important parts of data science. I couldn't figure out how they decided that diffeq's were important. Their weird focus on calculus didn't seem to involve using any data. Odd: wanting to be a data scientist, but being unable to collect actual data.

Folks involved in data science seem to think otherwise. Calculus appears to be a side-issue at best.

I can see that statistics are clearly important for data science. Correlation and regression-based models appear to be really useful. I think, perhaps, that these are the lynch-pins of much data science. Use a sample to develop a model, confirm it over successive samples, then apply it to the population as a whole.

Algorithms become important because doing dumb statistical processing on large data sets can often prove to be intractable. Computing the median of a very large set of data can be essentially impossible if the only algorithm you know is to sort the data and find the middle-most item.

Machine learning and pattern detection may be relevant for deducing a model that offers some predictive power. Personally, I've never worked with this. I've only worked with actuaries and other quants who have a model they want to confirm (or deny or improve.)

Thursday, July 10, 2014

The Permissions Issue

Why?

Why are Enterprise Computers so hard to use? What is it about computers that terrifies corporate IT?

They're paying lots of money to have me sit around and wait for mysterious approver folks to decide if I can be given permission to install development tools. (Of course, the real work is done by off-shore subcontractors who are (a) overworked and (b) simply reviewing a decision matrix.)

And they ask, "Are you getting everything you need?"

The answer is universally "No, I'm not getting what I need." Universally. But I can't say that.

You want me to develop software. And you simultaneously erect massive, institutional roadblocks to prevent me from developing software.

I have yet to work somewhere without roadblocks that effectively prevent development.

And I know that some vague "security considerations" trump any productive approach to doing software development. I know that there's really no point in trying to explain that I'm not making progress because I can't actually do anything. And you're stopping me from doing anything.

My first two weeks at every client:

The client tried to "expedite" my arrival by requesting the PC early, so it would be available on day 1. It wasn't. A temporary PC is -- of course -- useless. But that's the balance of days 1-5: piddling around with the temporary PC. That was ordered two weeks earlier.

Day 6 begins with the real PC. It's actually too small for serious development due to an oversight in bringing me on as a developer, but not ordering a developer's PC for me. I'll deal. Things will be slow. That's okay. Some day, you'll discover that I'm wasting time waiting for each build and unit test suite. Right now, I'm doing nothing, so I have no basis to complain.

Day 7 reveals that I need to fill in a form to have the PC you assigned me "unlocked." Without this, I cannot install any development tools.

In order to fill in the form, I need to run an in-house app. Which is known by several names, none of which appear on the intranet site. Day 8 is lost to searching, making some confused phone calls, and waiting for someone to get back to me with something.

Oh. And the email you sent on Day 9 had a broken link. That's not the in-house app anymore. It may have been in the past. But it's not.

Day 10 is looking good. The development request has been rejected because I -- as an outsider -- can't make the request to unlock a PC directly. It has to be made by someone who's away visiting customers or off-shore developers or something.

Remember. This is the two weeks I'm on site. The whole order started 10 business days earlier with the request for the wrong PC without appropriate developer permissions.

Thursday, July 3, 2014

Project Euler

This is (was?) an epic web site:

Currently, they're struggling with a security problem.

http://forum.projecteuler.net/viewtopic.php?f=5&t=3591

Years ago, I found the site and quickly reached Level 2 by solving a flood of easy problems.

Recently, a recruiter strongly suggested reviewing problems on Project Euler as preparation for a job interview.

It was fun! I restarted my quest for being a higher-level solver.

Then they took the solution checking (and score-keeping) features off-line.

So now I have to content myself with cleaning up my previous solutions to make them neat and readable and improve the performance in some areas.

I -- of course -- cannot share the answers. But, I can (and will) share some advice on how to organize your thinking as you tackle these kinds of algorithmically difficult problems.

My personal preference is to rewrite the entire thing in Django. It would probably take a month or two. Then migrate the data. That way I could use RST markup for the problems and the MathJax add-on that docutils uses to format math. But. That's just me.

I should probably take a weekend and brainstorm the functionality that I can recall and build a prototype. But I'm having too much fun solving the problems instead of solving the problem of presenting the problems.