## Thursday, August 21, 2014

### Permutations, Combinations and Frustrations

The issue of permutations and combinations is sometimes funny.

Not funny weird.  But funny "haha."

I received an email with hundreds of words and 10 attachments. (10. Really.) The subject was how best to enumerate 6! permutations of something or other, with the goal of comparing some optimization algorithm with a brute-force solution. (I don't know why. I didn't ask.)

Apparently, the programmer was not aware that permutation creation is a pretty standard algorithm with a standard solution. Most "real" programming languages have libraries which already solve this in a tidy, efficient, and well-documented way.

For example

https://docs.python.org/2/library/itertools.html#itertools.permutations

I suspect that this is true for every language in common use.

In Python, this doesn't even really involve programming. It's a first-class expression you enter at the Python >>> prompt.

    >>> import itertools
    >>> list(itertools.permutations("ABC"))
    [('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), ('C', 'B', 'A')]
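For the 6! case in the email, a brute-force baseline might look something like the sketch below. The `weights` data and the `cost` function are hypothetical stand-ins (I never learned what was actually being optimized); the only real library call is `itertools.permutations`.

```python
import itertools

# Hypothetical data: six items with made-up weights, purely for illustration.
weights = {"A": 3, "B": 1, "C": 4, "D": 1, "E": 5, "F": 9}

def cost(ordering):
    """Hypothetical objective: position-weighted sum of the item weights."""
    return sum(position * weights[item]
               for position, item in enumerate(ordering, start=1))

# Enumerate all 6! = 720 orderings of the six keys.
all_orderings = list(itertools.permutations(weights))

# Brute force: the ordering with the smallest cost.
best = min(all_orderings, key=cost)
```

Whatever the optimization algorithm produces can then be compared against `cost(best)`. With only 720 candidates, there's nothing to tune.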

What was really important about this question was the obstinate inability of the programmer to realize that their problem had a tidy, well-understood solution. And has had a good solution for decades. Instead they did a lot of programming and sent hundreds of words and 10 attachments. (10. Really.)

The best I could do was provide this link:

Steven Skiena, The Algorithm Design Manual

It appears that too few programmers are aware of how much already exists. They plunge ahead creating a godawful mess when a few minutes of reading would have provided a very nice answer.

Eventually, they sent me this:

http://en.wikipedia.org/wiki/Heap's_algorithm

As a grudging acknowledgement that they had wasted hours failing to reinvent the wheel.
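For the curious, here is a minimal sketch of Heap's algorithm as a recursive Python generator. The function name `heap_permutations` is my own choice, not something from the email or from a library, and in practice `itertools.permutations` remains the thing to use.

```python
import itertools

def heap_permutations(items):
    """Yield every permutation of items using Heap's algorithm (recursive form)."""
    a = list(items)

    def generate(k):
        # Permute the first k elements of a in place.
        if k <= 1:
            yield tuple(a)
            return
        for i in range(k - 1):
            yield from generate(k - 1)
            # The element swapped into the last position depends on the parity of k.
            if k % 2 == 0:
                a[i], a[k - 1] = a[k - 1], a[i]
            else:
                a[0], a[k - 1] = a[k - 1], a[0]
        yield from generate(k - 1)

    yield from generate(len(a))

perms = list(heap_permutations("ABC"))
```

The order differs from what `itertools.permutations` produces, but the set of permutations is the same.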

## Saturday, August 9, 2014

### Some Basic Statistics

I've always been fascinated by the essential statistical algorithms. While there are numerous statistical libraries, the simple descriptive measures (mean, median, mode, standard deviation) have some interesting features.

Well.  Interesting to me.

First, some basics.


    def s0(samples):
        return len(samples)  # sum(x**0 for x in samples)

    def s1(samples):
        return sum(samples)  # sum(x**1 for x in samples)

    def s2(samples):
        return sum(x**2 for x in samples)


Why define these three nearly useless functions? It's the cool factor of how they're so elegantly related.

Once we have these, though, the definitions of mean and standard deviation become simple and kind of cool.

    import math

    def mean(samples):
        return s1(samples)/s0(samples)

    def stdev(samples):
        N = s0(samples)
        # Population standard deviation: sqrt(E[x**2] - E[x]**2)
        return math.sqrt((s2(samples)/N) - (s1(samples)/N)**2)


It's not much, but it seems quite elegant. Ideally, these functions could work from iterables instead of sequence objects, but that's impractical here: each of s0, s1, and s2 makes its own pass over the data, and an iterator is exhausted after the first pass. We must work with a materialized sequence even if we replace len(X) with sum(1 for _ in X).
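If a single pass over an iterable really matters, one option is to give up the three separate functions and accumulate all three sums together. This is only a sketch; the names `sums` and `mean_stdev` are hypothetical and not part of the functions above.

```python
import math

def sums(samples):
    """Accumulate s0, s1 and s2 in one pass over any iterable."""
    n, total, total_sq = 0, 0, 0
    for x in samples:
        n += 1
        total += x
        total_sq += x * x
    return n, total, total_sq

def mean_stdev(samples):
    """Mean and population standard deviation from a single pass."""
    n, total, total_sq = sums(samples)
    mu = total / n
    return mu, math.sqrt(total_sq / n - mu ** 2)

# A generator expression works here because the data is walked only once.
mu, sigma = mean_stdev(x for x in [2, 4, 4, 4, 5, 5, 7, 9])
```

It works, but it loses the tidy s0/s1/s2 symmetry that made the original definitions appealing.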

The next stage of coolness is the following version of Pearson correlation. It involves a little helper function to normalize samples.

    def z(x, μ_x, σ_x):
        return (x - μ_x)/σ_x


Yes, we're using Python 3 and Unicode variable names.

Here's the correlation function.

    def corr(sample1, sample2):
        μ_1, σ_1 = mean(sample1), stdev(sample1)
        μ_2, σ_2 = mean(sample2), stdev(sample2)
        z_1 = (z(x, μ_1, σ_1) for x in sample1)
        z_2 = (z(x, μ_2, σ_2) for x in sample2)
        r = sum(zx1*zx2 for zx1, zx2 in zip(z_1, z_2))/len(sample1)
        return r


I was looking for something else when I stumbled on this "sum of products of normalized samples" version of correlation. How cool is that? The more textbook versions of this involve lots of sigmas and are pretty bulky-looking. This, on the other hand, is really tidy.
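For comparison, here are the two forms written out. The notation is my own (N samples, population mean μ and population standard deviation σ, matching the `mean` and `stdev` functions above). The first line is the normalized-sample form the code implements; the second is the usual bulky computational formula. They're algebraically equivalent.

```latex
r_{xy} = \frac{1}{N} \sum_{i=1}^{N}
         \left( \frac{x_i - \mu_x}{\sigma_x} \right)
         \left( \frac{y_i - \mu_y}{\sigma_y} \right)

r_{xy} = \frac{N \sum x_i y_i - \sum x_i \sum y_i}
              {\sqrt{N \sum x_i^2 - \left( \sum x_i \right)^2}\,
               \sqrt{N \sum y_i^2 - \left( \sum y_i \right)^2}}
```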

Finally, here's least-squares linear regression.

    def linest(x_list, y_list):
        r_xy = corr(x_list, y_list)
        μ_x, σ_x = mean(x_list), stdev(x_list)
        μ_y, σ_y = mean(y_list), stdev(y_list)
        beta = r_xy * σ_y/σ_x
        alpha = μ_y - beta*μ_x
        return alpha, beta



This, too, was buried at the end of the Wikipedia article. But it was such an elegant formulation for least squares based on correlation. And it leads to a tidy piece of programming. Very tidy.

I haven't taken the time to actually measure the performance of these functions and compare them with more commonly used versions.

But I like the way the Python fits well with the underlying math.

Not shown: The doctest tests for these functions. You can locate sample data and insert your own doctests. It's not difficult.
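As a starting point, here's a sketch of what such doctests might look like, with the functions repeated so the module runs on its own. The sample data is made up: the second list lies exactly on y = 2x + 1, so the expected correlation is 1. The floating-point results are compared with a tolerance rather than printed, since the last digit isn't guaranteed.

```python
import math

def s0(samples):
    return len(samples)

def s1(samples):
    return sum(samples)

def s2(samples):
    return sum(x**2 for x in samples)

def mean(samples):
    """
    >>> mean([1, 2, 3, 4, 5])
    3.0
    """
    return s1(samples)/s0(samples)

def stdev(samples):
    """
    >>> stdev([2, 4, 4, 4, 5, 5, 7, 9])
    2.0
    """
    N = s0(samples)
    return math.sqrt((s2(samples)/N) - (s1(samples)/N)**2)

def z(x, μ_x, σ_x):
    return (x - μ_x)/σ_x

def corr(sample1, sample2):
    """
    >>> abs(corr([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]) - 1.0) < 1e-9
    True
    """
    μ_1, σ_1 = mean(sample1), stdev(sample1)
    μ_2, σ_2 = mean(sample2), stdev(sample2)
    z_1 = (z(x, μ_1, σ_1) for x in sample1)
    z_2 = (z(x, μ_2, σ_2) for x in sample2)
    return sum(zx1*zx2 for zx1, zx2 in zip(z_1, z_2))/len(sample1)

def linest(x_list, y_list):
    """
    >>> alpha, beta = linest([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
    >>> abs(alpha - 1.0) < 1e-9 and abs(beta - 2.0) < 1e-9
    True
    """
    r_xy = corr(x_list, y_list)
    μ_x, σ_x = mean(x_list), stdev(x_list)
    μ_y, σ_y = mean(y_list), stdev(y_list)
    beta = r_xy * σ_y/σ_x
    alpha = μ_y - beta*μ_x
    return alpha, beta

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

Running the module directly prints nothing when every doctest passes.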

## Thursday, July 24, 2014

### Building Probabilistic Graphical Models with Python

A deep dive into probability and scipy: https://www.packtpub.com/building-probabilistic-graphical-models-with-python/book

I have to admit up front that this book is out of my league.

The Python is sensible to me. The subject matter -- graph models, learning and inference -- is above my pay grade.

#### Asking About a Book

Let me summarize before diving into details.

Asking someone else if a book is useful is really not going to reveal much. Their background is not my background. That they found it helpful/confusing/incomplete/boring isn't really going to indicate anything about how I'll find it.

Asking someone else for a vague, unmeasurable judgement like "useful" or "appropriate" or "helpful" is silly. Someone else's opinions won't apply to you.

Asking if a book is technically correct is more measurable. However. Any competent publisher has a thorough pipeline of editing. It involves at least three steps: Acceptance, Technical Review, and a Final Review. At least three. A good publisher will have multiple technical reviewers. All of this is detailed in the front matter of the book.

Asking someone else if the book was technically correct is like asking if it was reviewed: a silly question. The details of the review process are part of the book. Just check the front matter online before you buy.

It doesn't make sense to ask judgement questions. It doesn't make sense to ask questions answered in the front matter. What can you ask that might be helpful?

I think you might be able to ask completeness questions. "What's omitted from the tutorial?" "What advanced math is assumed?" These are things that can be featured in online reviews.

Sadly, these are not questions I get asked.

#### Irrational Questions

A colleague had some questions about the book named above. Some of which were irrational. I'll try to tackle the rational questions, since they emphasize my point about how not to ask questions about books.

2.  Is the Python code good at solidifying the mathematical concepts?

This is a definite maybe situation. The concept of "solidifying" as expressed here bothers me a lot.

Solid mathematics -- to me -- means solid mathematics. Outside any code considerations. I failed a math course in college because I tried to convert everything to algorithms and did not get the math part. A kindly professor explained that "F" very, very clearly. A life lesson. The math exists outside any implementation.

I don't think code can ever "solidify" the mathematics. It goes the other way: the code must properly implement the mathematical concepts. The book depends on scipy, and scipy is a really good implementation of a great deal of advanced math. The implementation of the math sits squarely on the rock-solid foundation of scipy. For me, that's a ringing endorsement of the approach.

If the book reinvented the algorithms available in scipy, that would be reason for concern. The book doesn't reinvent that wheel: it uses scipy to solve problems.

4. Can the code be used to build prototypes?

Um. What? What does the word prototype mean in that question? If we use the usual sense of software prototype, the answer is a trivial "Yes." The examples are prototypes in that sense. That can't be what the question means.

In this context the word might mean "model". Or it might mean "prototype of a model". If we reexamine the question with those other senses of prototype, we might have an answer that's not trivially "yes." Might.

When they ask about prototype, could they mean "model?" The code in the book is a series of models of different kinds of learning. The models are complete, consistent, and work. That can't be what they're asking.

Could they mean "prototype of a model?" It's possible that we're talking about using the book to build a prototype of a model. For example, we might have a large and complex problem with several more degrees of freedom than the textbook examples. In this case, perhaps we might want to simplify the complex problem to make it more like one of the textbook problems. Then we could use Python to solve that simplified problem as a prototype for building a final model which is appropriate for the larger problem.

In this sense of prototype, the answer remains "What?"  Clearly, the book solves a number of simplified problems and provides code samples that can be expanded and modified to solve larger and more complex problems.

To get past the trivial "yes" for this question, we can try to examine this in a negative sense. What kind of thing is the book unsuitable for? It's unsuitable as a final implementation of anything but the six problems it tackles. It can't be that "prototype" means "final implementation." The book is unsuitable as a tutorial on Python. It's not possible this is what "prototype" means.

Almost any semantics we assign to "prototype" lead to an answer of "yes". The book is suitable for helping someone build a lot of things.

#### Summary

Those two were the rational questions. The irrational questions made even less sense.

Including the other irrational questions, it appears that the real question might have been this.

Q: "Can I learn Python from this book?"

A: No.

It's possible that the real question was this:

Q: "Can I learn advanced probabilistic modeling with this book?"

A: Above my pay grade. I'm not sure I could learn probabilistic modeling from this book. Maybe I could. But I don't think that I have the depth required.

It's possible that the real question was this:

Q: "Can I learn both Python and advanced probabilistic modeling with this book?"

A: Still No.

#### Gaps In The Book

Here's what I could say about the book.

You won't learn much Python from this book. It assumes Python; it doesn't tutor Python. Indeed, it assumes some working scipy knowledge and a scipy installation. It doesn't include a quick-start tutorial on scipy or any of that other hand-holding.

This is not even a quibble with the presentation. It's just an observation: the examples are all written in Python 2. Small changes are required for Python 3. Scipy will work with Python 3 (see http://www.scipy.org/scipylib/faq.html#do-numpy-and-scipy-support-python-3-x). Reworking the examples seems to involve only small changes to replace print statements. In that respect, the presentation is excellent.
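The kind of change involved is roughly this. The example is my own illustration with a made-up value; it is not taken from the book.

```python
# Python 2 statement form (a SyntaxError under Python 3):
#     print "mean =", 2.5

# Python 3 function form:
mean_value = 2.5  # hypothetical value, purely for illustration
line = "mean = {0}".format(mean_value)
print(line)
```

Under Python 2, adding `from __future__ import print_function` at the top of a module lets the function form work there, too.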

## Thursday, July 17, 2014

### New Focus: Data Scientist

Read this: http://www.forbes.com/sites/emc/2014/06/26/the-hottest-jobs-in-it-training-tomorrows-data-scientists/

Interesting subject areas: Statistics, Machine Learning, Algorithms.

I've had questions about data science from folks who (somehow) felt that calculus and differential equations were important parts of data science. I couldn't figure out how they decided that diffeq's were important. Their weird focus on calculus didn't seem to involve using any data. Odd: wanting to be a data scientist, but being unable to collect actual data.

Folks involved in data science seem to think otherwise. Calculus appears to be a side-issue at best.

I can see that statistics are clearly important for data science. Correlation and regression-based models appear to be really useful. I think, perhaps, that these are the linchpins of much data science. Use a sample to develop a model, confirm it over successive samples, then apply it to the population as a whole.

Algorithms become important because doing dumb statistical processing on large data sets can often prove to be intractable. Computing the median of a very large set of data can be essentially impossible if the only algorithm you know is to sort the data and find the middle-most item.
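As one illustration of "knowing another algorithm," here's a sketch of a median that never sorts the full data set. It assumes the samples take a limited number of distinct values (integer scores, say), so a table of counts fits in memory even when the samples don't. The function name `median_by_counting` is hypothetical. For continuous data, selection algorithms such as quickselect or approximate methods are other options.

```python
from collections import Counter

def median_by_counting(samples):
    """Median of an iterable of discrete values without sorting the samples."""
    counts = Counter(samples)      # one pass; memory grows with distinct values only
    n = sum(counts.values())
    if n == 0:
        raise ValueError("median of empty data")
    # 0-based ranks of the middle item(s); they coincide when n is odd.
    lo_rank, hi_rank = (n - 1) // 2, n // 2
    lo_value = hi_value = None
    seen = 0
    for value in sorted(counts):   # sorts the distinct values, not all n samples
        seen += counts[value]
        if lo_value is None and seen > lo_rank:
            lo_value = value
        if seen > hi_rank:
            hi_value = value
            break
    return (lo_value + hi_value) / 2

# Works directly from a generator: the samples are never materialized.
m = median_by_counting(x % 10 for x in range(1000))
```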

Machine learning and pattern detection may be relevant for deducing a model that offers some predictive power. Personally, I've never worked with this. I've only worked with actuaries and other quants who have a model they want to confirm (or deny or improve).

## Thursday, July 10, 2014

### The Permissions Issue

Why?

Why are Enterprise Computers so hard to use? What is it about computers that terrifies corporate IT?

They're paying lots of money to have me sit around and wait for mysterious approver folks to decide if I can be given permission to install development tools. (Of course, the real work is done by off-shore subcontractors who are (a) overworked and (b) simply reviewing a decision matrix.)

And they ask, "Are you getting everything you need?"

The answer is universally "No, I'm not getting what I need." Universally. But I can't say that.

You want me to develop software. And you simultaneously erect massive, institutional roadblocks to prevent me from developing software.

I have yet to work somewhere without roadblocks that effectively prevent development.

And I know that some vague "security considerations" trump any productive approach to doing software development. I know that there's really no point in trying to explain that I'm not making progress because I can't actually do anything. And you're stopping me from doing anything.

My first two weeks at every client:

The client tried to "expedite" my arrival by requesting the PC early, so it would be available on day 1. It wasn't. A temporary PC is -- of course -- useless. But that's the balance of days 1-5: piddling around with the temporary PC. That was ordered two weeks earlier.

Day 6 begins with the real PC. It's actually too small for serious development due to an oversight in bringing me on as a developer, but not ordering a developer's PC for me. I'll deal. Things will be slow. That's okay. Some day, you'll discover that I'm wasting time waiting for each build and unit test suite. Right now, I'm doing nothing, so I have no basis to complain.

Day 7 reveals that I need to fill in a form to have the PC you assigned me "unlocked." Without this, I cannot install any development tools.

In order to fill in the form, I need to run an in-house app. Which is known by several names, none of which appear on the intranet site. Day 8 is lost to searching, making some confused phone calls, and waiting for someone to get back to me with something.

Oh. And the email you sent on Day 9 had a broken link. That's not the in-house app anymore. It may have been in the past. But it's not.

Day 10 is looking good. The development request has been rejected because I -- as an outsider -- can't make the request to unlock a PC directly. It has to be made by someone who's away visiting customers or off-shore developers or something.

Remember. This is the two weeks I'm on site. The whole order started 10 business days earlier with the request for the wrong PC without appropriate developer permissions.

## Thursday, July 3, 2014

### Project Euler

This is (was?) an epic web site:

http://projecteuler.net/about

Currently, they're struggling with a security problem.

http://forum.projecteuler.net/viewtopic.php?f=5&t=3591

Years ago, I found the site and quickly reached Level 2 by solving a flood of easy problems.

Recently, a recruiter strongly suggested reviewing problems on Project Euler as preparation for a job interview.

It was fun! I restarted my quest for being a higher-level solver.

Then they took the solution checking (and score-keeping) features off-line.

So now I have to content myself with cleaning up my previous solutions to make them neat and readable and improve the performance in some areas.

I -- of course -- cannot share the answers. But, I can (and will) share some advice on how to organize your thinking as you tackle these kinds of algorithmically difficult problems.

My personal preference is to rewrite the entire thing in Django. It would probably take a month or two. Then migrate the data. That way I could use RST markup for the problems and the MathJax add-on that docutils uses to format math. But. That's just me.

I should probably take a weekend and brainstorm the functionality that I can recall and build a prototype. But I'm having too much fun solving the problems instead of solving the problem of presenting the problems.

## Thursday, June 26, 2014

### Package Deal for Learning Python

If you're very new to programming in general, Python's a great place to start.

There are many, many tutorials. I won't even try to summarize them. They're generally good. And the more you read, the more you learn.

Moving past the n00bz needs, there are some more advanced books. Here's a collection for generalists:

My suggestion is to master the general features of the language overall.

Focus on specific things (Django, NLTK, SciPy, Maya, Scrapy, MatPlotLib, etc.) can follow.

I worry that early exposure to some of the details of Python-based packages may obscure the fundamentals of using the language properly. Perhaps that worry is misplaced. I know that the NLTK Book has numerous good examples of Python which are independent of the NLTK focus.