Tuesday, March 24, 2015

Configuration Files, Environment Variables, and Command-Line Options

We have three major tiers of configuration for applications. Within each tier, we have sub-tiers, larding on yet more complexity. The organization of the layers is a bit fluid, too. Making good choices can be rather complex because there are so many variations on the theme of "configuration". The desktop GUI app with a preferences file has very different requirements from larger, more complex applications.

The most dynamic configuration options are the command-line arguments. Within this tier of configuration, we have two sub-tiers of default values and user-provided overrides to those defaults. Where do the defaults come from? They might be wired in, but more often they come from environment variables or parameter files or both.

There's some difference of opinion on which tier comes next in order of dynamism. The two choices are configuration files and environment variables. We might consider environment variables easier to edit than configuration files. In some cases, though, configuration files are easier to change than environment variables. Environment variables are typically bound to the process just once (like command-line arguments), whereas configuration files can be read and re-read as needed.

Environment variables have three sub-tiers. System-level environment variables tend to be fixed. The variables set by a .profile or .bashrc tend to be specific to a logged-in user, and are somewhat more flexible than system variables. The current set of environment variables associated with the logged-in session can be modified on the command line, and are as flexible as command-line arguments.

Note that we can do this in Linux:

PYTHONPATH=/path/to/project python3 -m some_app -opts

This will set an environment variable as part of running a command.

The configuration files may also have tiers. We might have a global configuration file in /etc/our-app. We might look for a ~/.our-app-rc as a user's generic configuration. We can also look for our-app.config in the current working directory as the final set of overrides to be used for the current invocation.
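
As a sketch of that layered lookup, here's one way to read all three tiers -- using the standard configparser module purely as an illustration, since any of the representations discussed below would do. ConfigParser.read() silently skips files that don't exist, and later files override earlier ones:

import configparser
import os

config = configparser.ConfigParser()
config.read([
    "/etc/our-app",                        # system-wide defaults
    os.path.expanduser("~/.our-app-rc"),   # per-user settings
    "our-app.config",                      # current-directory overrides
])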

Some applications can be restarted, leading to re-reading the configuration files. We can change the configuration more easily than we can bind in new command-line arguments or environment variables.

Representation Issues

When we think about configuration files, we also have to consider the syntax we want to use to represent configurable parameters. We have five common choices.

Some folks are hopelessly in love with Windows-style .ini files. The configparser module will parse these. I call it hopelessly in love because the syntax is rather limited. Look at the logging.config module to see how complex the .ini file format is for non-trivial cases.

Some folks like Java-style properties files. These have the benefit of being really easy to parse in Python. Indeed, scanning a properties file is a great exercise in functional-style Python programming. I'm not completely sold on these, either, because they don't really handle the non-trivial cases well.
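
As an aside, here's what such a scan might look like, assuming only the simple name=value subset of the properties format (no line continuations or escapes):

def read_properties(path):
    """Build a dict from simple name=value lines."""
    with open(path) as source:
        lines = (line.strip() for line in source)
        pairs = (
            line.split("=", 1)
            for line in lines
            if line and not line.startswith(("#", "!"))
        )
        return {name.strip(): value.strip() for name, value in pairs}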

Using JSON or YAML for properties has some real advantages. There's a lot of sophistication available in these two notations. While JSON has first-class support, YAML requires an add-on module.
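
For example, loading a JSON configuration (the file names here are hypothetical) takes a couple of lines of standard library; the YAML equivalent needs the third-party PyYAML package:

import json

with open("our-app.json") as source:
    settings = json.load(source)

# With PyYAML installed:
# import yaml
# with open("our-app.yaml") as source:
#     settings = yaml.safe_load(source)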

We can also use Python as the language for configuration. For good examples of this, look at the Django project settings file. Using Python has numerous advantages. The only possible disadvantage is the time wasted arguing with folks who call it a "security vulnerability."

Using Python as the configuration language is only considered a vulnerability by people who fail to realize that the Python source itself can be hacked. Why waste time injecting a bug into a configuration file? Why not just hack the source?

My Current Fave 

My current favorite way to handle configuration is to define some kind of configuration class and use the class object throughout the application. Because Python imports a module only once, the class object is effectively a singleton.

We might have a module that defines a hierarchy of configuration classes, each of which layers in additional details.

class Defaults:
    # Baseline settings shared by every environment.
    mongo_uri = "mongodb://localhost:27017"
    some_param = "xyz"

class Dev(Defaults):
    # Development overrides.
    mongo_uri = "mongodb://sandbox:27017"

class QA(Defaults):
    # QA overrides.
    mongo_uri = "mongodb://username:password@qa02:27017/?authMechanism=PLAIN&authSource=$external"

Yes. The password is visible. If we want to mess around with higher levels of secrecy in the configuration files, we can use PyCrypto and a key generator to inject an encrypted password into the URI. That's a subject for another post. The folks who can edit the configuration files often know the passwords. Who are we trying to hide things from?

How do we choose the active configuration to use from among the available choices in this file? We have several ways.
  • Add a line to the configuration module. For example, Config=QA will name the selected environment. We have to change the configuration file as our code marches through environments from development to production. We can use from configuration import Config to get the proper configuration in all other modules of the application.
  • Rely on an environment variable to specify which configuration to use. In enterprise contexts, an environment variable is often available. We can import os, and use Config=globals()[os.environ['OURAPP_ENVIRONMENT']] to pick a configuration based on an environment variable.
  • In some places, we can rely on the host name itself to pick a configuration. We can use os.uname()[1] to get the name of the server. We can add a mapping from server name to configuration class, and use something like Config=host_map.get(os.uname()[1], Defaults); see the sketch just after this list.
  • Use a command-line option like "--env=QA". This can be a little more complex than the above techniques, but it seems to work out nicely in the long run.
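
Here's the sketch of that host-name mapping, sitting at the bottom of the configuration module; the server names are just the hypothetical ones from the classes above:

import os

# Map server names to the configuration classes defined above;
# unknown hosts fall back to Defaults.
host_map = {
    "sandbox": Dev,
    "qa02": QA,
}
Config = host_map.get(os.uname()[1], Defaults)
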
Command-line args to select a specific configuration

To select a configuration using command-line arguments, we must decompose configuration into two parts. The configuration alternatives shown above are placed in a config_params.py module. The config.py module that's used directly by the application will import the config_params.py module, parse the command-line options, and finally pick a configuration. This module can create the required module global, Config. Since it will only execute once, we can import it freely.

The config module will use argparse to create an object named options with the command-line options. We can then do this little dance:

import argparse
import config_params

parser = argparse.ArgumentParser()
parser.add_argument("--env", default="DEV")
options = parser.parse_args()

# Pick the configuration class named by --env and attach the parsed
# options so the rest of the application can see them.
Config = getattr(config_params, options.env)
Config.options = options

This seems to work out reasonably well. We can tweak the config_params.py flexibly. We can pick the configuration with a simple command-line option.

If we want to elegantly dump the configuration, we have a bit of a struggle. Each class in the hierarchy introduces names: it's a bit of work to walk down the __class__.__mro__ lattice to discover all of the available names and values that are inherited and overridden from the parents.

We could do something like this to flatten out the resulting values:

Base = getattr(config_params, options.env)

class Config(Base):
    def __repr__(self):
        # Walk the MRO from most general to most specific so that
        # overridden values replace the inherited ones.
        names = {}
        for cls in reversed(self.__class__.__mro__):
            cls_names = dict(
                (nm, (cls.__name__, val))
                for nm, val in cls.__dict__.items()
                if nm[0] != "_"
            )
            names.update(cls_names)
        return ", ".join(
            "{0}.{1}={2}".format(class_val[0], nm, class_val[1])
            for nm, class_val in names.items()
        )
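
Since __repr__ is an instance method, we need an instance (not the class itself) to see the flattened values; the output in the comment is just illustrative:

print(Config())
# e.g. QA.mongo_uri=mongodb://username:password@qa02:27017/?authMechanism=PLAIN&authSource=$external, Defaults.some_param=xyz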

It's not clear this is required. But it's kind of cool for debugging.

Tuesday, March 17, 2015

Building Skills in Object-Oriented Design

New Kindle Edition of Building Skills in Object-Oriented Design.

It seems to work okay in my Kindle Readers.

I'm not sure it's really formatted completely appropriately. I'm not a book designer. But before I fuss around with font sizes, I think I need to spend some time on several more valuable aspects of a rewrite:

  1. Updating the text and revising for Python 3.
  2. Removing the (complex) parallels between Python and Java. The Java edition can be forked as a separate text. 
  3. Reducing some of the up-front sermonizing and other non-coding nonsense.
  4. Moving the unit testing and other "fit-and-finish" considerations forward.
  5. Looking more closely at the Sphinx epub features and how those work (or don't work) with the KindleGen application which transforms .html to .mobi. This is the last step of technical production, once the content is right.
I have two other titles to finish for Packt Publishing.

Maybe I should pitch this to Packt, since there seems to be interest in the topic? A skilled technical editor from Packt and some reviewers are likely to improve the quality.

The question, though, is how to fit this approach to programming into their product offerings. Since I have two other titles to finish for them, perhaps I'll just set this aside for now.

Tuesday, March 10, 2015

It appears that DevOps may be more symptom than solution

It appears that DevOps may be a symptom of a bigger problem. The bigger problem? Java.

Java development -- with a giant framework like Spring -- seems to accrete layers and layers of stuff. And more stuff. And bonus stuff on top of the excess stuff.

The in-house framework that's used on top of the Spring Framework that's used to isolate us from WebLogic that's used to isolate us from REST seems -- well -- heavy. Very heavy.

And the code cannot be built without a heavy investment in learning Maven. It can't be practically deployed without Hudson, Nexus, and Subversion. And Sonar and Hamcrest and mountains of stuff that's implicitly brought in by the mountain of stuff already listed in the Maven pom.xml files.

The deployment from Nexus artifacts to a WebLogic server also involves uDeploy. Because the whole thing is AWS-based, this kind of overhead seems unavoidable.

Bunches of web-based tools to manage the bunch of server-based tools to build and deploy.

Let me emphasize this: bunches of tools.

Architecturally, we're focused on building "micro services". Consequently, an API takes about a sprint to build. Sometimes a single developer can do the whole CRUD spectrum in a sprint for something relatively simple. That's five APIs by our count: GET one, GET many, POST, PUT and DELETE: each operation counts as a separate API.

Then we're into CI/CD overheads. It's a full sprint of flailing around with deployment onto a dev server to get something testable and get back test results so we can fix problems. A great deal of time is spent making sure that all the right versions of the right artifacts are properly linked. Doesn't work? Oh. Stuff was updated: fix your poms.

It's another sprint after that, flailing around with the CI/CD folks to get onto official QA servers. Not building in Hudson? Wrong Nexus setup in Maven: go back to square one. Not deployable to WebLogic? Spring Beans that aren't relevant when doing unit testing are not being built by WebLogic on the QA server because the XML configuration or the annotations or something is wrong.

What's important here is that ⅔ of the duration is related simply to the complexity of Java.

The DevOps folks are trying really hard to mitigate that complexity. And to an extent, they're sort of successful.

But. Let's take a step back.

  1. We have hellish complexity in our gigantic, layered software toolset.
  2. We've thrown complicated tools at the hellish complexity, creating -- well -- more complexity.

This doesn't seem right. More complexity to solve the problems of complexity just doesn't seem right.

My summary is this: the fact that DevOps even exists seems like an indictment of the awful complexity of the toolset. It feels like DevOps is a symptom and Java is the problem.

Tuesday, March 3, 2015

Let's all Build a Hat Rack

Wound up here: "A Place to Hang Your Hat" and the #LABHR hash tag.

H/T to this post: "Building a Hat Rack."

This is a huge idea. I follow some folks from the Code For America group. The +Mark Headd  Twitter feed (@mheadd) is particularly helpful for understanding this movement. Also, follow +Kevin Curry (@kmcurry) for more insights.

Open Source contributions are often anonymous and the rewards are intangible. A little bit of tangibility is a huge thing.

My (old) open source Python books have a "donate" button. Once in a while I'll collect the comments that come in on the donate button. They're amazingly positive and encouraging. But also private. Since I have a paying gig writing about Python, I don't need any more encouragement than I already have. (Indeed, I probably need less encouragement.)

However.

There are unsung heroes at every hackathon and tech meetup who could benefit from some recognition. Perhaps they're looking for a new job or a transition in their existing job. Perhaps they're looking to break through one of the obscure social barriers that seem to appear in a community where everyone looks alike.

And. There's a tiny social plus to being the Recognizer in Chief. There's a lot to be said in praise of the talent spotters and friction eliminators.

Tuesday, February 24, 2015

Functional Python Programming

New from Packt Publishing: Functional Python Programming.

Also here on Amazon.

The fun part is covering generator functions, iterators, and higher-order functions in some real depth. There's a lot of powerful programming techniques available.

What's challenging is reconciling Python's approach to FP with purely functional languages like Haskell and OCaml. Years ago, I saw some discussion on Stack Overflow that Python simply wasn't a proper functional programming language because it lacked some features. I'm vague on the specifics (perhaps there weren't any), but the gaps between Python and FP are narrow.

As far as I can tell, the biggest missing features are non-strict evaluation coupled with an optimizer that can rearrange expressions for performance. This feature pair also tends to produce nice tail-call optimization of recursions.

Languages which are totally non-strict (or fully lazy) need to introduce monads so that some ordering can be enforced in the cases where ordering really does matter.

Since Python is strict (with only minor exceptions) monads aren't needed. But we also sacrifice some optimization capability because we can't reorder Python's strict expressions. I'm not sure this is a gap which is so huge that we can indict Python as being non-functional or not suitable for a functional approach. I think the lack of an optimizing compiler is a little more than an interesting factoid.

An interesting problem that compiled functional languages have is resolving data types properly. It's a problem that all statically-typed languages share. In order to write a really generic algorithm, we either have to rely on a huge type hierarchy or sophisticated type pattern-matching rules. Python eschews this problem by making all code generic with respect to type. If we've applied a function to an inappropriate object, we find out through unit testing that we have TypeError exceptions.
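
As a tiny illustration of that trade-off -- the mean() function here is purely hypothetical -- the type mismatch surfaces as a TypeError when the tests run, not as a compile-time error:

import unittest

def mean(values):
    """Generic with respect to type: any iterable of numbers will do."""
    values = list(values)
    return sum(values) / len(values)

class TestMean(unittest.TestCase):
    def test_numbers(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_inappropriate_type(self):
        with self.assertRaises(TypeError):
            mean(["a", "b"])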

I think we can (and should) borrow functional programming design patterns and reimplement them in Python. This is quite easy and doesn't involve too much work or overhead. For example, the yield from statement allows us to do manual tail-call optimization rather than trusting that the compiler will recognize the code pattern.
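
As a small sketch of that kind of borrowing -- the flatten() example is mine, not from the book -- a recursive generator can delegate with yield from instead of accumulating nested result lists by hand:

def flatten(nested):
    """Yield atomic items from an arbitrarily nested list structure."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)   # delegate to the recursive call
        else:
            yield item

list(flatten([1, [2, [3, 4]], 5]))   # [1, 2, 3, 4, 5]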

Tuesday, February 17, 2015

Yet Another Complaint about Python in General, SciPy in Particular

The context is an ongoing question about optimization -- not my strong suit -- and the SciPy algorithms for this. See Scipy.optimization.anneal Problems for some additional confusion over simple things.

The new quote is this:
However, firing up Python, NumPy, SciPy and figuring out which solver to use is not convoluted? Keep on writing code and over engineering as opposed to using the minimum tech in order to get the job. After all, we are professionals.
It appears that using a packaged, proven optimizer is somehow "convoluted." Apparently, the Anaconda product never surfaced in a Google search. This seems to indicate that perhaps (a) Google was never used or (b) the author didn't get to page 4 of the search results, or (c) the author never tried another search beyond the single-word "scipy".

I'm guessing they did not google "Python simulated annealing" -- the actual subject -- because there are a fairly large number of existing solutions to this. Lots and lots of lecture notes and tutorials. It seems to be a rich area full of tutorials on both optimization and Python. Reading a few of these would probably have addressed all of the concerns.

Anaconda, BTW, appears to be an amazing product. It seems to be the gold standard for data science. (I know of organizations that have in-house variations on this theme. They bundle Python plus numerous extra packages and a variety of installers for Mac OS X, Windows and Linux.)

The "Keep on writing code" complaint is peculiar. The optimization examples in SciPy seem to involve less than a half-dozen lines of code. Reading a CSV file can be digested down to four lines of code.

import csv
with open("constrains.csv", newline="") as source:
    rdr = csv.DictReader(source)
    data = list(rdr)

I can only guess that the threshold for "over engineering" is a dozen lines of code. Fewer lines are acceptable; more are bad.

I don't know what "using the minimum tech in order to get the job" means, but the context included an example spreadsheet that was somehow a solution to an instance of a problem. I'm guessing from this that "minimum tech" means "spreadsheet."

Read this: When spreadsheets go bad. There are a lot of war stories like this. (For information on the original quote, read 'What is meant by "Now you have two problems"?')

I regret not asking follow-up questions.

The more complete story is this: rather than actually leverage SciPy, the author of the quote appears to be fixated on rewriting a classic Simulated Annealing textbook example into a spreadsheet because reasons. One of which is that more modern algorithms in SciPy aren't actually classic simulated annealing. The newer algorithms may be better, but since they're not literally from the textbook, this is a problem.

And my suggestion -- just use SciPy -- was dismissed as "convoluted", "over-engineering", and -- I guess -- unprofessional.

Tuesday, February 10, 2015