Thursday, July 24, 2014

Building Probabilistic Graphical Models with Python

A deep dive into probability and scipy:

I have to admit up front that this book is out of my league.

The Python is sensible to me. The subject matter -- graph models, learning and inference -- is above my pay grade.

Asking About a Book

Let me summarize before diving into details.

Asking someone else if a book is useful is really not going to reveal much. Their background is not my background. They found it helpful/confusing/incomplete/boring isn't really going to indicate anything about how I'll find it.

Asking someone else for a vague, unmeasurable judgement like "useful" or "appropriate" or "helpful" is silly. Someone else's opinions won't apply to you.

Asking if a book is technically correct is more measurable. However. Any competent publisher has a thorough pipeline of editing. It involves at least three steps: Acceptance, Technical Review, and a Final Review. At least three. A good publisher will have multiple technical reviewers. All of this is detailed in the front matter of the book.

Asking someone else if the book was technically correct is like asking if it was reviewed: a silly question. The details of the review process are part of the book. Just check the front matter online before you buy.

It doesn't make sense to ask judgement questions. It doesn't make sense to ask questions answered in the front matter. What can you ask that might be helpful?

I think you might be able to ask completeness questions. "What's omitted from the tutorial?" "What advanced math is assumed?" These are things that can be featured in online reviews.

Sadly, these are not questions I get asked.

Irrational Questions

A colleague had some questions about the book named above. Some of which were irrational. I'll try to tackle the rational questions since emphasis my point on ways not to ask questions about books.

2.  Is the Python code good at solidifying the mathematical concepts? 

This is a definite maybe situation. The concept of "solidifying" as expressed here bothers me a lot.

Solid mathematics -- to me -- means solid mathematics. Outside any code considerations. I failed a math course in college because I tried to convert everything to algorithms and did not get the math part. A kindly professor explained that "F" very, very clearly. A life lesson. The math exists outside any implementation.

I don't think code can ever "solidify" the mathematics. It goes the other way: the code must properly implement the mathematical concepts. The book depends on scipy, and scipy is a really good implementation of a great deal of advanced math. The implementation of the math sits squarely on the rock-solid foundation of scipy. For me, that's a ringing endorsement of the approach.

If the book reinvented the algorithms available in scipy, that would be reason for concern. The book doesn't reinvent that wheel: it uses scipy to solve problems.

4. Can the code be used to build prototypes? 

Um. What? What does the word prototype mean in that question? If we use the usual sense of software prototype, the answer is a trivial "Yes." The examples are prototypes in that sense. That can't be what the question means.

In this context the word might mean "model". Or it might mean "prototype of a model". If we reexamine the question with those other senses of prototype, we might have an answer that's not trivially "yes." Might.

When they ask about prototype, could they mean "model?" The code in the book is a series of models of different kinds of learning. The models are complete, consistent, and work. That can't be what they're asking.

Could they mean "prototype of a model?" It's possible that we're talking about using the book to build a prototype of a model. For example, we might have a large and complex problem with several more degrees of freedom than the text book examples. In this case, perhaps we might want to simplify the complex problem to make it more like one of the text book problems. Then we could use Python to solve that simplified problem as a prototype for building a final model which is appropriate for the larger problem.

In this sense of prototype, the answer remains "What?"  Clearly, the book solves a number of simplified problems and provides code samples that can be expanded and modified to solve larger and more complex problems.

To get past the trivial "yes" for this question, we can try to examine this in a negative sense. What kind of thing is the book unsuitable for? It's unsuitable as a final implementation of anything but the six problems it tackles. It can't be that "prototype" means "final implementation." The book is unsuitable as a tutorial on Python. It's not possible this is what "prototype" means.

Almost any semantics we assign to "prototype" lead to an answer of "yes". The book is suitable for helping someone build a lot of things.


Those two were the rational questions. The irrational questions made even less sense.

Including the other irrational questions, it appears that the real question might have been this.

Q: "Can I learn Python from this book?"

A: No.

It's possible that the real question was this:

Q: "Can I learn advanced probabilistic modeling with this book?"

A: Above my pay grade. I'm not sure I could learn probabilistic modeling from this book. Maybe I could. But I don't think that I have the depth required.

It's possible that the real questions was this:

Q: Can I learn both Python and advanced probabilistic modeling with this book?"

A: Still No.

Gaps In The Book

Here's what I could say about the book.

You won't learn much Python from this book. It assumes Python; it doesn't tutor Python. Indeed, it assumes some working scipy knowledge and a scipy installation. It doesn't include a quick-start tutorial on scipy or any of that other hand-holding.

This is not even a quibble with the presentation. It's just an observation: the examples are all written in Python 2. Small changes are required for Python 3. Scipy will work with Python 3. Reworking the examples seems to involve only small changes to replace print statements. In that respect, the presentation is excellent.

Thursday, July 17, 2014

New Focus: Data Scientist

Read this:

Interesting subject areas: Statistics, Machine Learning, Algorithms.

I've had questions about data science from folks who (somehow) felt that calculus and differential equations were important parts of data science. I couldn't figure out how they decided that diffeq's were important. Their weird focus on calculus didn't seem to involve using any data. Odd: wanting to be a data scientist, but being unable to collect actual data.

Folks involved in data science seem to think otherwise. Calculus appears to be a side-issue at best.

I can see that statistics are clearly important for data science. Correlation and regression-based models appear to be really useful. I think, perhaps, that these are the lynch-pins of much data science. Use a sample to develop a model, confirm it over successive samples, then apply it to the population as a whole.

Algorithms become important because doing dumb statistical processing on large data sets can often prove to be intractable. Computing the median of a very large set of data can be essentially impossible if the only algorithm you know is to sort the data and find the middle-most item.

Machine learning and pattern detection may be relevant for deducing a model that offers some predictive power. Personally, I've never worked with this. I've only worked with actuaries and other quants who have a model they want to confirm (or deny or improve.)

Thursday, July 10, 2014

The Permissions Issue


Why are Enterprise Computers so hard to use? What is it about computers that terrifies corporate IT?

They're paying lots of money to have me sit around and wait for mysterious approver folks to decide if I can be given permission to install development tools. (Of course, the real work is done by off-shore subcontractors who are (a) overworked and (b) simply reviewing a decision matrix.)

And they ask, "Are you getting everything you need?"

The answer is universally "No, I'm not getting what I need." Universally. But I can't say that.

You want me to develop software. And you simultaneously erect massive, institutional roadblocks to prevent me from developing software.

I have yet to work somewhere without roadblocks that effectively prevent development.

And I know that some vague "security considerations" trump any productive approach to doing software development. I know that there's really no point in trying to explain that I'm not making progress because I can't actually do anything. And you're stopping me from doing anything.

My first two weeks at every client:

The client tried to "expedite" my arrival by requesting the PC early, so it would be available on day 1. It wasn't. A temporary PC is -- of course -- useless. But that's the balance of days 1-5: piddling around with the temporary PC. That was ordered two weeks earlier.

Day 6 begins with the real PC. It's actually too small for serious development due to an oversight in bringing me on as a developer, but not ordering a developer's PC for me. I'll deal. Things will be slow. That's okay. Some day, you'll discover that I'm wasting time waiting for each build and unit test suite. Right now, I'm doing nothing, so I have no basis to complain.

Day 7 reveals that I need to fill in a form to have the PC you assigned me "unlocked." Without this, I cannot install any development tools.

In order to fill in the form, I need to run an in-house app. Which is known by several names, none of which appear on the intranet site. Day 8 is lost to searching, making some confused phone calls, and waiting for someone to get back to me with something.

Oh. And the email you sent on Day 9 had a broken link. That's not the in-house app anymore. It may have been in the past. But it's not.

Day 10 is looking good. The development request has been rejected because I -- as an outsider -- can't make the request to unlock a PC directly. It has to be made by someone who's away visiting customers or off-shore developers or something.

Remember. This is the two weeks I'm on site. The whole order started 10 business days earlier with the request for the wrong PC without appropriate developer permissions.

Thursday, July 3, 2014

Project Euler

This is (was?) an epic web site:

Currently, they're struggling with a security problem.

Years ago, I found the site and quickly reached Level 2 by solving a flood of easy problems.

Recently, a recruiter strongly suggested reviewing problems on Project Euler as preparation for a job interview.

It was fun! I restarted my quest for being a higher-level solver.

Then they took the solution checking (and score-keeping) features off-line.

So now I have to content myself with cleaning up my previous solutions to make them neat and readable and improve the performance in some areas.

I -- of course -- cannot share the answers. But, I can (and will) share some advice on how to organize your thinking as you tackle these kinds of algorithmically difficult problems.

My personal preference is to rewrite the entire thing in Django. It would probably take a month or two. Then migrate the data. That way I could use RST markup for the problems and the MathJax add-on that docutils uses to format math. But. That's just me.

I should probably take a weekend and brainstorm the functionality that I can recall and build a prototype. But I'm having too much fun solving the problems instead of solving the problem of presenting the problems.

Thursday, June 26, 2014

Package Deal for Learning Python

If you're very new to programming in general, Python's a great place to start.

There are many, many tutorials. I won't even try to summarize them. They're generally good. And the more you read, the more you learn.

Moving past the n00bz needs, there are some more advanced books. Here's a collection for generalists:

My suggestion is to master the general features of the language overall.

Focus on specific things (Django, NLTK, SciPy, Maya, Scrapy, MatPlotLib, etc.) can follow.

I worry that early exposure to some of the details of Python-based packages may obscure the fundamentals of using the language properly. Perhaps that worry is misplaced. I know that the NLTK Book has numerous good examples of Python which are independent of the NLTK focus.

Thursday, June 19, 2014

The Swift Programming Language

This lowers the bar for entry to the iOS market.

Does it also lower the bar for Mac OS X?

Can it be used to write command-line command-line applications ("scripts")? It has a REPL, which means it can do a kind of "just-in-time" compile and run. This is how Python works, so perhaps this is a viable mode for using Swift.

Via the Objective-C and C compatibility, it has full access to the POSIX libraries, as well as Cocoa, so it can clearly be used to build command-line apps. It might lack the flexibility of Python, since it's compiled. But C (C++, Objective-C) with automated memory management is still a gigantic victory for writing fast and reliable programs.

Can it be plugged into Apache to write backend applications? It's compiled, and compatible with C and Objective-C. So, one can imagine that a mod-swift component in Apache might be possible. It might be better to work through existing FCGI interfaces and write stand-alone Swift back-ends. This would require a bunch of libraries for database API's, template rendering, request and response processing, and the various bits and pieces that make up a rich web development environment. But this is largely available for C and C++, making it available to Swift-based backends.

Is one language even a desirable goal?

The idea of having one official version of the class definitions seems very helpful for capturing knowledge and managing the intellectual property that is embodied in application logic.

Thursday, June 12, 2014

TDD, API Design and Refactoring

See this short discussion on a Stingray Reader feature:

This turned into an exercise in pure TDD.

I'm not a fan of applying TDD in a strict, death-march fashion.

I see the comments on Stack Overflow that indicate that some folks feel strongly that strict TDD is somehow helpful. While "test before code" is laudable and often helpful, there's no royal road to good software.

Design involves a great deal of back and forth between code and test. A great deal.

It's logically impossible to write a test without having thought about the code. In order to write the test first, there must be a notional API against which the test is written. Anyone who requires that the test file must be written before the notional class or module is just playing at petty tyranny.

The notional design -- the rough outline of the class or module -- can be written into a file before any tests. It's okay. It is still test-driven because the considerations of testability drove the design process.

In particular, when starting "from scratch" -- with nothing -- writing tests first is senseless. Some module or package structure must exist for the test modules to import.

Having ranted, it still arises that the tests do come before any code under some circumstances.

In this case, the requested functionality was quite difficult to visualize. However, it was possible to cobble together a test case that simplified the problem down to something like this this:

01 Some-Record.
     05 Header PIC XXX.
     05 Body PIC X(17).

01 ABC-Segment.
     05 Field-ABC PIC X(17).

01 DEF-Segment.
     05 Field-DEF PIC X(17).

In COBOL, the program would use logic like IF Header EQUALS "ABC" THEN MOVE Body TO ABC-Segment. We need a way to handle something like this in Python so that we can parse the EBCDIC COBOL data.

This summarized example allowed construction of a test case that made use of a API that might have existed. I was pretty sure I had a test case that showed an approach.

What Actually Happened

Since the application already had 178 unit tests, there was plenty of structure that worked.

The single new unit test relied on a notional API that wasn't really in place. The new test bombed grotesquely.

There are two solutions:

  • Modify the test.
  • Fix the notional API so that it works properly.

I started out chasing the second option. I tweaked some things. More tests failed. I tweaked some more things. The new test finally passed, but another test was failing.

Some careful study of the failing test revealed that my approach was wrong. Way wrong.

The notional API was a bad idea.

The tweaks to make it work were a worse idea.

Back to the Lab Bench

At this point, I had made enough changes that the only thing to do was copy the new test and use the Git Revoke on the local changes to unwind the awful mistakes.

Staring again, I had a slightly better grip on the relevant code. I had a failing test. I tried a different approach that wasn't quite so inventive. This meant modifying the test.

I actually went through a few iterations of the test, using the test method as a kind of lab bench.

A more Pythonic approach to the lab bench is to work from the >>> prompt. I think that all of the exemplary projects use the >>> prompt examples in their documentation. This is a way to narrow and clarify the API. As projects get big, they can sprawl. New features can wind up with many imports to pick and choose elements from existing modules.

When it becomes difficult to use the >>> prompt as the lab bench, that's a sign that the API is too complex. Refactoring must happen.

Using the unit test framework as the lab bench was a hint that something had drifted out of tolerance.

However. I did get a test which passed. Yay. Sort of.

The test code was hideous.

TDD and API Design

The point of TDD, however, is that we have a working suite of tests. Refactoring won't break anything.

The point was that the hideous API could be rewritten into something that both

  • Passed all the tests, and
  • Was usable at the >>> prompt.
It's difficult to express how valuable the Python >>> prompt is to help clarify API design issues.

The rule is this:

If the API doesn't make sense at the >>> prompt, it's incomprehensible

Sadly, Java doesn't have this kind of boundary. Java programming can spin into quite complex API's, limited only by the laziness of the programmer who avoids refactoring.

Or the malice of the programmer's manager in not allowing time to refactor.