Saturday, August 9, 2014

Some Basic Statistics

I've always been fascinated by the essential statistical algorithms. While there are numerous statistical libraries, the simple measures of central tendency (mean, media, mode, standard deviation) have some interesting features.

Well.  Interesting to me.

First, some basics.

def s0( samples ):
    return len(samples) # sum(x**0 for x in samples)

def s1( samples ):
    return sum(samples) # sum(x**1 for x in samples)

def s2( samples ):
    return sum( x**2 for x in samples )

Why define these three nearly useless functions? It's the cool factor of how they're so elegantly related.

Once we have these, though, the definitions of mean and standard deviation become simple and kind of cool.

def mean( samples ):
    return s1(samples)/s0(samples)

def stdev( samples ):
    N= s0(samples)
    return math.sqrt((s2(samples)/N)-(s1(samples)/N)**2)

It's not much, but it seems quite elegant. Ideally, these functions could work from iterables instead of sequence objects, but that's impractical in Python. We must work with a materialized sequence even if we replace len(X) with sum(1 for _ in X).

The next stage of coolness is the following version of Pearson correlation. It involves a little helper function to normalize samples.

def z( x, μ_x, σ_x ):
    return (x-μ_x)/σ_x

Yes, we're using Python 3 and Unicode variable names.

Here's the correlation function.

def corr( sample1, sample2 ):
    μ_1, σ_1 = mean(sample1), stdev(sample1)
    μ_2, σ_2 = mean(sample2), stdev(sample2)
    z_1 = (z(x, μ_1, σ_1) for x in sample1)
    z_2 = (z(x, μ_2, σ_2) for x in sample2)
    r = sum( zx1*zx2 for zx1, zx2 in zip(z_1, z_2) )/len(sample1)
    return r

I was looking for something else when I stumbled on this "sum of products of normalized samples" version of correlation. How cool is that? The more text-book versions of this involve lots of sigmas and are pretty bulky-looking. This, on the other hand, is really tidy.

Finally, here's least-squares linear regression.

def linest( x_list, y_list ):
    r_xy= corr( x_list, y_list )
    μ_x, σ_x= mean(x_list), stdev(x_list)
    μ_y, σ_y= mean(y_list), stdev(y_list)
    beta= r_xy * σ_y/σ_x
    alpha= μ_y - beta*μ_x
    return alpha, beta

This, too, was buried at the end of the Wikipedia article. But it was such an elegant formulation for least squares based on correlation. And it leads to a tidy piece of programming. Very tidy.

I haven't taken the time to actually measure the performance of these functions and compare them with more commonly used versions.

But I like the way the Python fits well with the underlying math.

Not shown: The doctest tests for these functions. You can locate sample data and insert your own doctests. It's not difficult.

No comments:

Post a Comment