Thursday, November 20, 2014

MongoDB and Schema Validation

One part of the MongoDB value proposition is being freed from the constraints of a database schema.

There's a "baby and bathwater" issue here. While a schema can become a low-value constraint, we have to be careful about throwing out the baby when we throw out the bathwater. A schema isn't inherently evil. A schema that's hard to modify can become more cost than benefit.

When working with document databases like MongoDB or CouchDB, we're freed from the constraints of a schema.


Do we really want the kind of freedom that can devolve to anarchy?


Do we want some kind of constraint checking capability to provide some additional run-time assurance that the applications are using the database properly?

Read this and this

My thesis is that some schema validation may have some value.

My plan is this.

1. Define the essential collections for the various documents using ordinary document design practices.

2. For each document class, we'll have two closely associated collections:

  • The primary collection, call it "class" because it matches one of the application classes.
  • An additional "class.schema" collection. This collection will contain JSON-schema documents. See for more information.
  • For audit, and sequential key generation, we may have some additional associated collections.
Because JSON schema documents have a "$schema" field, we can replace the "$" with "\uFF04" the "FULLWIDTH DOLLAR SIGN" character when saving the JSON-schema document into a MongoDB database. We can do the inverse operation when finding the schema documents in the database.
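
Here's a minimal sketch of that round trip. The function names are made up for illustration, and it assumes only a leading "$" in a field name needs rewriting; values (like the target of a "$ref") are left alone.

```python
FULLWIDTH_DOLLAR = "\uFF04"

def _rename_keys(doc, old, new):
    """Recursively swap a leading `old` for `new` in every dict key."""
    if isinstance(doc, dict):
        return {
            (new + key[len(old):] if key.startswith(old) else key):
                _rename_keys(value, old, new)
            for key, value in doc.items()
        }
    if isinstance(doc, list):
        return [_rename_keys(item, old, new) for item in doc]
    return doc

def escape_schema(schema):
    """Prepare a JSON-schema document for saving: "$schema" -> "\uFF04schema"."""
    return _rename_keys(schema, "$", FULLWIDTH_DOLLAR)

def unescape_schema(stored):
    """Inverse operation, applied to a schema document found in the database."""
    return _rename_keys(stored, FULLWIDTH_DOLLAR, "$")
```

The result of escape_schema() is an ordinary dict that can be handed to whatever insert call the application already uses; unescape_schema() goes on the result of the matching find.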

3. Use a tool like to validate the schema. The document-level validation could be embedded in the application for each transaction. However, it seems better to trust the code and the unit testing of the code to enforce schema rules. We'd use this validation periodically to check the schema. Significant events should include a validation pass: for example, before and after any schema changes. This way we can be sure that things are continuing to go properly.

It would be strictly an additional layer of checking.
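
A validation pass of this kind might be sketched as follows. Everything here is an assumption for illustration: the function names are invented, the documents are plain dicts standing in for the result of a query, and has_required_fields() is a deliberately tiny stand-in that only looks at the schema's "required" list. A real pass would plug a full JSON-schema validator in as is_valid.

```python
def validation_pass(documents, schema, is_valid):
    """Return the documents for which is_valid(document, schema) is false."""
    return [doc for doc in documents if not is_valid(doc, schema)]

def has_required_fields(document, schema):
    """Stand-in check: enforce only the schema's "required" list."""
    return all(name in document for name in schema.get("required", []))

schema = {"type": "object", "required": ["name", "email"]}
documents = [
    {"name": "a", "email": "a@example.com"},
    {"name": "b"},  # missing "email", so it should be reported
]
failures = validation_pass(documents, schema, has_required_fields)
```

Running this before and after a schema change gives two lists of failures to compare, without touching the application's transaction code.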

Thursday, November 13, 2014

Declarative Programming

I know that some folks swear by declarative programming. They like the ideas behind ant (and make) and SCons and related examples.

You can google for "ant v. maven v. gradle" where people gripe about which is more declarative. The point of the whining being that more declarative == good and any traces of procedural or imperative programming == bad.

All, of course, without any really good justification of why declarative is better. It's assumed that declarative simply has innumerable advantages. And yes, I've started with The issue isn't simply moot; the justification is weak.

Perhaps there's an awful bias toward imperative and functional programming. After all, the big thinkers in computer science tend to favor the imperative and functional schools of thought. Maybe declarative suffers from some bias.

Or maybe declarative has limited utility.

There. I said it. Limited utility.

I think a functional approach might be better, faster and simpler.

Side-bar Ranting

The code is below. You can skip down to the "The Functional Build System" section and not miss much.

Declarative programming seems applicable to the cases where the ordering of operations can be easily deduced. It seems like the significant value of declarative programming is to rely on an optimizing compiler to rearrange the declarations into properly-ordered imperative steps. From this viewpoint, it seems like ant/maven/gradle are optimizers that look at the dependencies among transformation functions and then apply the functions in the proper order.

It seems like we're writing expressions like these:

x.class = java(
xyz.jar = jar(x.class, y.class, z.class, ... )
app.war = war(xyz.jar, abc.jar, ... )

and then turning them over to a clever compiler (like Haskell) to work out a total order among the expressions that will build the right thing for us.
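
Working out that total order is a topological sort. As a rough sketch (not how ant or gradle actually do it), Python 3.9's graphlib can order a dependency table like the one above. The table is illustrative, and "x.java" is an assumed source name.

```python
from graphlib import TopologicalSorter

# target -> the things it is built from
dependencies = {
    "app.war": {"xyz.jar", "abc.jar"},
    "xyz.jar": {"x.class", "y.class", "z.class"},
    "x.class": {"x.java"},
}

def build_order(dependencies):
    """List every node so that each one comes after all of its sources."""
    return list(TopologicalSorter(dependencies).static_order())

order = build_order(dependencies)
```

The declarations can be written in any order; the sorter guarantees that x.class shows up before xyz.jar, and xyz.jar before app.war.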

There's a potential difference between manually structuring a script to get all of the steps in order and allowing the compiler to arrange things properly based on some formal semantics behind each expression.

It's a potential difference because most folks that deal with ant/maven/gradle tend to put things in more-or-less the right order so that others can figure out what the hell is going on. In the trivial cases where we're building simple web sites, the default rules have evolved to the point where they work in almost all cases, so we don't even look at the configuration of the tools. We hit Ctrl+B knowing that it's all set up properly.

Some Requirements

A number of applications have ant-like (or make-like) aspects but don't really cry out for ant with customized actions. We might be doing data warehouse loads which involve an ant-like sequence of processing steps to do transformations, loads, and produce final summaries and confirmations. We can, of course, write this all in first-class Java code. The hard way.

It's not terribly complex. A class to define a dependency. A suite of plug-in strategies. Some static definitions of the actual rules. Been there. Done that.

Pragmatically, the declarative style suffers from a limitation of being rather rigid in applying a fixed set of rules. A more script-like implementation can be more helpful to support reruns, debugging, problem-solving and the inevitable special cases and exceptions. After a storage failure -- and the reruns required to get the warehouse back up-to-date -- one sees more need for script-like flexibility and less need for overly simplistic rigidity.

Another end of the spectrum is individual steps all manually coordinated with a tool like BMC's Control-M. This requires endless manual intervention to make sure all the various tasks are defined properly in Control-M.

Somewhere near the middle is a configurable application with some processing rules to give it flexibility. But some defined structure to remove the need for carefully planned manual intervention and deep expertise.

The Functional Build System

We can imagine an ant-like build system defined functionally.

The core is a function that implements build-if-needed rules:

def build_if_needed( builder, target_file, *source ):
    if target_ok( target_file, *source ):
        return "ok({0},...)".format(target_file)
    builder( target_file, *source )
    return "{0}({1},...)".format(builder.__class__.__name__,target_file)

We can use this function to define the essential dependency: use a builder function to create some target if it's out-of-date with respect to the sources. The return value forms a kind of audit log.

This relies on some helper functions: target_ok() checks the modification times of files. The various builders do the various kinds of operations required to make the target from the sources.

Here's the target_ok() function:

import datetime
import logging
import os

def target_ok( target_file, *source_list, logger=logging ):
    # A missing or unreadable target is never ok: it has to be built.
    try:
        mtime_target= datetime.datetime.fromtimestamp(
            os.path.getmtime( target_file ) )
    except Exception:
        return False
    # If a source doesn't exist, we throw an exception.
    times = (datetime.datetime.fromtimestamp(
            os.path.getmtime( source ) ) for source in source_list)
    return all(mtime_target > mtime_source for mtime_source in times)

I think this function is what started me thinking about a functional approach. It could be a method of a class. But. It seems like a very functional design. It could be reduced to a single (long) expression.
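
One way that single expression might look is sketched below. It's a variant, not a drop-in copy: it compares the raw timestamps instead of datetime objects, and it uses os.path.exists() in place of the try/except, so only a missing target (not every possible error) counts as "not ok".

```python
import os

def target_ok(target_file, *source_list):
    """True when the target exists and is newer than every source."""
    return os.path.exists(target_file) and all(
        os.path.getmtime(target_file) > os.path.getmtime(source)
        for source in source_list
    )
```

A missing source still raises an exception from os.path.getmtime(), which matches the intent of the longer version.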

The builders are composite functions. They need to combine the subprocess.check_call() with a function that builds the command. We can do functional composition several ways in Python: we can combine functions via decorators. We can also combine functions via Callables. We could write a higher-order function that combines the check_call() with a function to create the command.
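
For comparison, the decorator and Callable styles of composition could be sketched like this. The names runs_command and CommandBuilder are invented for the example, and the run parameter is an addition so the sketch can be exercised without launching a real process; it defaults to subprocess.check_call().

```python
import functools
import subprocess

def runs_command(make_command, run=subprocess.check_call):
    """Decorator style: wrap a command-making function into a builder."""
    @functools.wraps(make_command)
    def builder(target_file, *source_list):
        run(make_command(target_file, *source_list))
    return builder

class CommandBuilder:
    """Callable style: hold the command-making function, run it when called."""
    def __init__(self, make_command, run=subprocess.check_call):
        self.make_command = make_command
        self.run = run

    def __call__(self, target_file, *source_list):
        self.run(self.make_command(target_file, *source_list))

@runs_command
def copy_file(output, *inputs):
    # Hypothetical command; nothing runs until copy_file() is called.
    return ["cp", inputs[0], output]
```

Both forms produce something with the same (target_file, *source_list) signature, so either could be handed to build_if_needed().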

We'll opt for the higher-order function and create partially evaluated forms using functools.partial().

Here's a typical case:

import subprocess

def subprocess_builder( make_command, target_file, *source_list ):
    # Compose: build the command list, then run it; a non-zero exit raises.
    command= make_command( target_file, *source_list )
    subprocess.check_call( command )

This is a generic function: it requires a function (or lambda) to build the actual command. We might do something like this to create a specific builder.

from functools import partial

def command_rst2html( output, *input ):
    return ["", "--syntax-highlight=long", "--input-encoding=utf-8", input[0], output]

rst2html= partial( subprocess_builder, command_rst2html )

This rst2html() function can be used to define a dependency rule. We might have something like this:

    files_txt = glob.glob( "*.txt" )
    for f in files_txt:
        build_if_needed( rst2html, ext_to(f,'.html'), f )

This rule specifies that *.html files depend on *.txt files; when a txt file is newer than its html file (or the html file is missing), the rst2html() function rebuilds the html file.

The ext_to() function is a two-liner that changes the extension on a filename. This helps us write "template" build rules rather than exhaustively enumerating a large number of similar files.

def ext_to( filename, new_ext ):
    name, ext = os.path.splitext( filename )
    return name + new_ext

What we've done here is define a few generic functions that form the basis for a functional build system that can compete against ant, make or scons. The system is not even close to declarative. However, we only need to assure that our final build_if_needed() functions have a sensible ordering, something that's rarely a towering intellectual burden.

The individual customizations are the build commands like rst2html() where we created the command-line list of strings for subprocess.check_call(). We can just as easily build functions which run entirely in the process or functions which farm the work out to separate processes via queues or RESTful web services.
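
As a sketch of a builder that runs entirely in the process, here's a hypothetical concatenate() builder. To keep the example self-contained it repeats compact versions of target_ok() and build_if_needed(); the only deliberate change is that the audit string prefers the builder's own name when it has one.

```python
import os

def target_ok(target_file, *source_list):
    """True when the target exists and is newer than every source."""
    return os.path.exists(target_file) and all(
        os.path.getmtime(target_file) > os.path.getmtime(source)
        for source in source_list
    )

def build_if_needed(builder, target_file, *source):
    """Run the builder only when the target is missing or out of date."""
    if target_ok(target_file, *source):
        return "ok({0},...)".format(target_file)
    builder(target_file, *source)
    name = getattr(builder, "__name__", builder.__class__.__name__)
    return "{0}({1},...)".format(name, target_file)

def concatenate(target_file, *source_list):
    """In-process builder: join the sources into the target, no subprocess."""
    with open(target_file, "w", encoding="utf-8") as target:
        for source in source_list:
            with open(source, encoding="utf-8") as text:
                target.write(text.read())
```

A call like build_if_needed( concatenate, "all.txt", "a.txt", "b.txt" ) follows the same rule as the rst2html case; only the builder changed.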

Bottom Lines

It appears that declarative programming isn't terribly helpful. There may be a niche, but it seems to be a small niche to me.

I'm sure that an object-oriented approach to this problem isn't any better. I've written a shabby-make version of this, and it's bigger. There's just more code and it's not significantly more clear what's going on. Inheritance can be difficult to suss out.

Python seems to be a good functional programming language. It did this very nicely.

Thursday, November 6, 2014

Hard Copy Books

I've now got my actual souvenir hard-copies of my two Packt books.

So far, so good. I've got one more title in the works. After that, I think I'll have to take a small break and do some development work and learn more new stuff.

I've been advised to square away my author's page.

I think this will work to help folks post questions, comments, and suggestions.

Thursday, October 30, 2014

My First Webcast

I'm a pretty good public speaker. But I've avoided webcasting and podcasting because it's kind of daunting. In a smaller venue, the audience members are right there, and you can tell if you're not making sense. In a webcast, the feedback will be indirect. In a podcast it seems like it would be nonexistent.

Also, I find that programming is an intensely literate experience. It's about reading and writing. A podcast -- listening and watching -- seems very un-programmerly to me. Perhaps I'm just being an old "get-off-my-lawn-you-kids" fart.

But I'll see how the webcast thing goes in January, and perhaps I'll try to do some podcasts.

Thursday, October 23, 2014

Currying and Partial Function Evaluation

Old. But still interesting.

Partial Function Application is not Currying

It seems like hair-splitting. However, the distinction between bound variables and curried functions does have some practical implications.
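
A small sketch of the distinction, using functools.partial() and a hand-written curried function. The volume() function is a made-up example.

```python
from functools import partial

def volume(length, width, height):
    return length * width * height

# Partial application: bind some arguments now. The result is a single
# function that takes all of the remaining arguments in one call.
slab = partial(volume, 2, 3)

# Currying: restructure the function into a chain of one-argument
# functions, each returning the next function in the chain.
def curried_volume(length):
    return lambda width: lambda height: volume(length, width, height)
```

Here slab(4) and curried_volume(2)(3)(4) compute the same product. The practical difference is in the intermediate values: curried_volume(2)(3) is itself a function waiting for one more argument, while partial() fixes the bound values in a single step.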

I'm looking closely at PyMonad and the built-in functools library.

I'm finding some benefits in understanding functional programming and how to apply functional design patterns in Python. I'm also seeing the important differences between compiled (and optimized) languages and Python's approach. I'm slowly coming to understand how a (simple) recursive design is flattened into a for loop as part of manual tail-recursion optimization.
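
As a small illustration of that flattening (the function names are invented for the example), here's a tail-recursive sum next to the for loop it turns into.

```python
def total_recursive(values, accumulator=0):
    """Tail-recursive sum: the recursive call is the last thing evaluated."""
    if not values:
        return accumulator
    return total_recursive(values[1:], accumulator + values[0])

def total_loop(values):
    """The same design with the tail call rewritten by hand as a for loop."""
    accumulator = 0
    for value in values:
        accumulator += value
    return accumulator
```

Python doesn't eliminate the tail call for us, so the recursive form is bounded by the interpreter's recursion limit. The loop form has no such bound, which is why the rewrite is done manually.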

The functional programming goodness is giving me first-class headaches when trying to apply the lessons learned to Java, however. I suppose I should look closely at and There are claims that it's dangerously inefficient. Also, the customer who insists on Java has a (very) limited set of allowed libraries; if this isn't on the list, then the whole concept is a non-starter.

Thursday, October 16, 2014

Using Bottle as a miniature demo server

Let's talk small.

When writing API's, it sometimes helps to have a small demo web site to show the API in a context that's easy to visualize. API's are sometimes abstract, and without an application to provide some context, it can be unclear why the path looks like that or why the JSON document has those fields.

I want to emphasize the "small" part of the small demo. A small page or two with some HTML forms and a submit button. Really small.

The actual customer-facing apps (mobile, mobile web, and full web site) are being built by lots of other people. Not us. They're big. We build the API's (there are a lot) that support the data structures that support the processing that supports the user experience.

Building fake mobile apps is right out. We're not going to lard on Android SDK or Xcode development environments to our already overburdened laptops. We build backend API's.

Building a fake mobile web or full web site is appealing. What makes it complex is the UX folks are building everything in Angular.js. If we want to properly implement a form, we would have to master Angular just to do a demo for the product owner.

No thanks. Still too far afield for API developers. We're focused on mongo and JSON and performance and scalability. Not Angular.js and the UX.

What we want to do is build a small web server which runs just a few pages plucked out of the UX demo code so that we can show how interactions with a web page put stuff in a database. And vice-versa: stuff in the database will show up on a web page.

"Really?" we get asked. Some folks look askance at us for wanting to put a small demo site together.

"Yes," we answer. "Our product owner has a big vision and we're breaking that into a bunch of little API's. It's not perfectly clear how we're building up to that vision."

It's not perfectly clear how some of this works. Folks outside the scrum team have distracting questions. We want to have a page or two where we can fill in a form and click submit and stuff happens. This is far easier to explain than showing them Postman or SoapUI and claiming that this will support some user stories.

And as we grow toward the epic, the workflow aspects of this will grow. The stuff that admin "A" does after user "U" has made an initial request. Or the stuff that internal user "I" does after external user "X" has done something. But really, it's just a few small web pages. Small.

Imagine the demo. On laptop #1, we'll show user "X". On laptop #2, we're running a Mongo shell to query what's in the db. On laptop #3 we're showing user "I". The focus is really the API's. And how the API's add up to an epic collection of stories.

Serving some HTML pages

Just to make it painful, we can't simply grab the demo web pages out of the UX team's SVN repository. Why not? First, it's an Angular app. We can't just grab some HTML and go. The demo pages are served via node.js with Bower, so it's not even clear (to us) what the complete deployment looks like.

So. We cheated. We took a screen shot. We trimmed the edges of the page as .PNG files. We wrote our own form and cobbled in enough CSS to make it look close. We're not here to fake out the UX. We just want to enter some data and have it tickle our API. (Indeed, we have a "Not The Real Experience" on some pages.)

Initially, some of the team members tried serving these small pages with WebLogic. Then Jetty. It's not bad. But it's Java. It takes forever to build and deploy something after a trivial change. There are a lot of moving parts even with Jetty, and not all of them are obvious.

Since we're building "enterprise" API's, we're deeply enmeshed with every feature of the Spring Framework. Our STS/Eclipse environments are fat with add-ons and features.

While the Spring Framework ideal is to allow a developer to focus on relevant details and have the irrelevant details handled automagically, the magic almost gets in the way. These are small applications that are little more than a few static pages with forms and a submit button. Spring can do it, of course. But we're often testing our actual API's in a Jetty server (or two). If the demo site requires yet another instance of Jetty with yet another configuration, our ability to cope diminishes.

How can we get back to small?

Python and Bottle

Python has several web servers built-in. We can use http.server. We can use wsgiref. Both of these are almost OK for what we want to do.
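
To show what "almost OK" looks like, here's a rough sketch of a one-form demo using only wsgiref from the standard library. The form, the field name, and the port are all made up; a real demo would forward the submitted fields to the API instead of echoing them.

```python
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

FORM = b"""<html><body>
<form method="POST" action="/">
  <input name="name"><input type="submit">
</form></body></html>"""

def demo_app(environ, start_response):
    """Serve one form on GET; echo the submitted field on POST."""
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        fields = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        name = fields.get("name", [""])[0]
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
        return ["Received: {0}".format(name).encode("utf-8")]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [FORM]

def serve(port=8080):
    """Run the demo until interrupted; call this by hand to start it."""
    with make_server("localhost", port, demo_app) as server:
        server.serve_forever()
```

It works, but the request parsing, the routing by method, and the response headers are all done by hand. That bookkeeping is the part we'd rather not write for every demo page.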

We can do better with two small downloads: Bottle and Jinja2. With these, we can build simple HTML pages that show some data. We can build simple servers that collect form data, use http.client to make API requests, and write copious logging details. We can write little bottle apps that handle just GETs and POSTs simply.

This is suitably small.

We can share the module with the Bottle object and the HTML mock-up pages. We can fire up the app in an instant on anyone's laptop, no matter what else they're running. We can tweak the server to adjust the logging or the API request or the form.

We actually run the server from within Idle. Make a change and hit F5 to redeploy. It's small. It's fast. And it doesn't involve the huge complexities associated with Java.

Bottle doesn't do much. But what little it does do is a pretty tidy fit with tiny little demonstrations of super-simple HTML interactions.

Thursday, October 9, 2014

scipy.optimize.anneal Problems

Well, not really "problems" per se. More of a strange kind of whining than a solvable problem.

Here's the bottom line. Two real quotes. Unedited.

Me: "> There's a way to avoid the religious nature of the argument. "
Them: "Please suggest away."

Really. Confronted with choices between anneal and basin hopping, they could only resort to hand-waving and random utterances.

The tl;dr summary is this:
  • "scipy.optimize.anneal only has three hard-wired schedule variants: ‘fast’, ‘cauchy’ or ‘boltzmann’."
  • My initial response was "And..."? 
  • "Not being able to specify my own cooling schedule severely limits the usability of the code"
A complaint that causes me deep pain: "severely limits" with no actual evidence. And no plan to get evidence beyond a religious wars style argument.

There may have been a technical question on the class definitions inside scipy. But that question was overshadowed by the essential problems with what they were doing. Or, more properly, what they were whining about.

Did they really have a problem with a state of the art solution to optimization problems? More specifically:

1. Did they read the "Deprecated" part of the scipy documentation? This is a hint that there are better solutions available. Perhaps they could start there instead of whining.
2. Did they actually read the details of the three schedules in the "Notes" section? Do they seriously think they've got a new approach that does not fit any of the various parameters of the three installed algorithms? I don't mean to be too rude, but... Do they really think they're that scale of genius?
3. Do they have any evidence that their problem is so unlike the typical case handled by basin hopping?
4. Do they have any evidence that their solution totally crushes the already-built code?

I think the answers to all four questions were "no".

I'm not even certain that I could help them with some of the Python technology required to extend scipy. But, I'm sure I cannot actually do anything of value under the circumstances that (a) they have not really tried the established algorithms and (b) they're already sure that the established algorithms can't work based on religious-wars arguments.

It was clear that they never read the "Notes" section on this SciPy page:

One of the emails in the exchange had a kind of hand-waving justification for the problem domain being somehow unique. Lacking any actual evidence, I'm inclined to believe they were just hoping that their problem domain was unique, allowing them to dismiss the available Python solution and do something uniquely bad. 

(Optimization is not my area of expertise. Perhaps I'm way off base; perhaps the existing solutions are so problem-domain specific that everyone has to invent new technology. Maybe established solutions really don't work.)

More importantly: there was no actual evidence that the existing optimization (either annealing or basin hopping) failed to solve their problem.

But the worst part was this:

"From, a business perspective, I need to know about SA because our competitor stole our biggest client using it."

They don't actually want to innovate. They only want to try and catch up by making religious war arguments over the deprecated simulated annealing vs. basin hopping.