Tuesday, June 23, 2015

Literate Programming and GitHub

I remain captivated by the ideals of Literate Programming. My fork of PyLit (https://github.com/slott56/PyLit-3) coupled with Sphinx seems to handle literate programming in a very elegant way.

It works like this.
  1. Write RST files describing the problem and the solution. This includes the actual implementation code. And everything else that's relevant. 
  2. Run PyLit3 to build final Python code from the RST documentation. This should include the setup.py so that it can be installed properly. 
  3. Run Sphinx to build pretty HTML pages (and LaTeX) from the RST documentation.
I often run the unit tests along with the Sphinx build so that I'm sure that things are working.

The challenge is final presentation of the whole package.

The HTML can be easy to publish, but it can't (trivially) be used to recover the code. We have to upload two separate and distinct things. (We could use BeautifulSoup to recover RST from HTML and then PyLit to rebuild the code. But that sounds crazy.)

The RST is easy to publish, but hard to read and it requires a pass with PyLit to emit the code and then another pass with Sphinx to produce the HTML. A single upload doesn't work well.

If we publish only the Python code, we've defeated the point of literate programming. Even if we focus on the Python, we need a separate upload of HTML to provide the supporting documentation.

After working with this for a while, I've found that it's simplest to have one source and several targets. I use RST ⇒ (.py, .html, .tex). This encourages me to write documentation first. I often fail, and have blocks of code with tiny summaries and non-existent explanations.

PyLit allows one to use .py ⇒ .rst ⇒ .html, .tex. I've messed with this a bit and don't like it as much. Code first leaves the documentation as a kind of afterthought.

How can we publish simply and cleanly: without separate uploads?

Enter GitHub and gh-pages.

See the "sphinxdoc-test" project (https://github.com/daler/sphinxdoc-test) for an example. The bulk of it is useful advice on creating the gh-pages branch from your RST source via Sphinx and some GitHub commands.

Following this line of thinking, we almost have the case for three branches in an LP project.
  1. The "master" branch with the RST source. And nothing more.
  2. The "code" branch with the generated Python code created by PyLit.
  3. The "gh-pages" branch with the generated HTML created by Sphinx.
I think I like this.

We need three top-level directories. One has RST source. A build script would run PyLit to populate the (separate) directory for the code branch. The build script would also run Sphinx to populate a third top-level directory for the gh-pages branch.
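A minimal sketch of such a build script, in Python. The directory names are my own choice, and the exact PyLit invocation is a guess; PyLit's convention of pairing module.py (code) with module.py.txt (RST text) is the assumption driving the file matching:

```python
from pathlib import Path

def build_commands(src="rst", code="code", html="gh-pages"):
    """Plan the PyLit and Sphinx runs for the three-directory layout."""
    commands = []
    for source in sorted(Path(src).glob("**/*.py.txt")):
        # PyLit pairs module.py.txt (the RST text) with module.py (the code).
        target = Path(code) / source.relative_to(src).with_suffix("")
        commands.append(["python3", "-m", "pylit", str(source), str(target)])
    # Sphinx renders the same RST tree into HTML for the gh-pages branch.
    commands.append(["sphinx-build", "-b", "html", src, html])
    return commands

# Each planned command would be handed to subprocess.run(cmd, check=True).
for cmd in build_commands():
    print(" ".join(cmd))
```

Keeping the plan separate from the execution makes it easy to do a dry run before committing anything to the derived branches.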

The downside of this shows up when you need to create a branch for a separate effort. You create a "some-major-change" branch from master. Where's the code? Where's the doco? You don't want to commit either of those derived work products until you merge "some-major-change" back into master.

GitHub Literate Programming

There are many LP projects on GitHub. There are perhaps a dozen which focus on publishing with the Github-flavored Markdown as the source language. Because Markdown is about as easy to parse as RST, the tooling is simple. Because Markdown lacks semantic richness, I'm not switching.

I've found that semantically rich markup is essential. This is a key feature of RST. It's carried forward by Sphinx to create very sophisticated markup. Think :code:`sample` vs. :py:func:`sample` vs. :py:mod:`sample` vs. :py:exc:`sample`. The final typesetting may be similar, but they are clearly semantically distinct and create separate index entries.

A focus on Markdown seems to be a limitation. It's encouraging to see folks experiment with literate programming using Markdown and GitHub. Perhaps other folks will look at more sophisticated markup languages like RST.

Previous Exercises

See https://sourceforge.net/projects/stingrayreader/ for a seriously large literate programming effort. The HTML is also hosted at SourceForge: http://stingrayreader.sourceforge.net/index.html.

This project is awkward because -- well -- I have to do a separate FTP upload of the finished pages after a change. It's done with a script, not a simple "git push." SourceForge hosts a Git repository for the project: https://sourceforge.net/p/stingrayreader/code/ci/master/tree/. But SourceForge doesn't use GitHub.com's UI, so it's not clear whether it supports the gh-pages feature. I assume it doesn't, but maybe it does. (I can't even log in to SourceForge with Safari... I should really stop using SourceForge and switch to GitHub.)

See https://github.com/slott56/HamCalc-2.1 for another complex, LP effort. This predates my dim understanding of the gh-pages branch, so it's got HTML (in doc/build/html), but it doesn't show it elegantly.

I'm still not sure this three-branch Literate Programming approach is sensible. My first step should probably be to rearrange the PyLit3 project into this three-branch structure.

Tuesday, June 16, 2015

A plea to avoid sys.exit() [Updated]

Let me gripe about this for a moment.

sys.exit()

The use case for this function is limited. Very, very limited.

Every place that this appears (except for one) is going to lead to reusability issues.

Consider some obscure little function, deep within the app.

import logging
import sys

def deep_within_the_app(x, y, zed):
    try:
        return do_something(x, y, zed)  # the details don't matter
    except SomeException:
        logging.exception("deep_within_the_app")
        sys.exit(2)

What's so bad about that?

The function seizes control of every app that uses it by raising an unexpected exception.

We can (partially) undo this mischief by wrapping every call to the function in a try/except that catches SystemExit.

def reusing_a_feature():
    for i in range(a_bunch):
        try:
            print(deep_within_the_app(x, y, i))
        except SystemExit:
            print("error on {0}".format(i))

This will defeat the sys.exit(). But the cost is one of clarity. Why SystemExit? Why not some meaningful exception?

This is important: raise the meaningful exception instead of exit.
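Here's a minimal sketch of that advice. The exception name and the arithmetic are invented for illustration; they are not from any particular app:

```python
import logging

class ApplicationError(Exception):
    """A meaningful, catchable failure for this app."""

def deep_within_the_app(x, y, zed):
    try:
        return x / (y - zed)  # stand-in for the real work
    except ZeroDivisionError as ex:
        logging.exception("deep_within_the_app")
        # Raise something the caller can name and handle deliberately.
        raise ApplicationError(x, y, zed) from ex
```

A caller can now write `except ApplicationError:` and know exactly what it is handling; nothing seizes control of the whole process.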

Bottom Line.

The right place for sys.exit() is inside the if __name__ == "__main__": section.
It might look something like this:

if __name__ == "__main__":
    try:
        main()
    except (KnownException, AnotherException) as ex:
        logging.exception(ex)
        sys.exit(2)

Use meaningful exceptions instead of sys.exit().

This permits reuse of everything without a mysterious SystemExit causing confusion.

On "Taste" in Software Design

Read this: http://www.paulgraham.com/taste.html.

I was originally focused on "beauty". Clearly, good design is beautiful. Isn't that obvious? Why so many words to explain the obvious?

The post seemed useless. Why write it in the first place? Why share it? Why share it now, 12 years after it was written?

Because beauty can be elusive to some people. More complete definitions of some attributes of beauty are helpful.

This is not a throw-away concept. These are fourteen essential elements that need to be used as part of every software architectural design review. Indeed, it should be part of every code review. Although code perhaps shouldn't be "daring."

When we adopt an architecture, it should fit these criteria.

This doesn't replace more pragmatic software quality assurance considerations.  See http://www.sei.cmu.edu/reports/95tr021.pdf.

I'm currently delighted with "Good design is redesign."

Tuesday, June 9, 2015

On Waiting to Write "Serious Code"

Someone told me they weren't yet ready to write "serious code." They needed to spend more time doing something that's not coding.

I'm unclear on what they were doing. It appears they have some barriers that I can't see.

They had sample data. They had a problem statement. They had an existing solution that was not very good. I couldn't see any reason for waiting. Indeed, I can't figure out what "serious" code is. Does that mean there's frivolous code?

Because there was a previous solution, they had a minimum viable product already defined: it has to do what the previous version did, only be better in some way. One could trivially transform the previous product into unit test cases and an acceptance test case. Few things could be more amenable to coding than having test cases.
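That transformation can be sketched in a few lines. The data pairs and the function here are hypothetical stand-ins for captured behavior of the previous solution:

```python
import unittest

# Hypothetical (input, expected) pairs captured from the previous
# solution's observed behavior.
LEGACY_CASES = [
    ("widget-001", "WIDGET-001"),
    ("gadget 99", "GADGET 99"),
]

def improved_solution(text):
    # Whatever the rewrite does, it must reproduce the old answers first.
    return text.upper()

class TestAgainstLegacy(unittest.TestCase):
    def test_reproduces_previous_behavior(self):
        for given, expected in LEGACY_CASES:
            with self.subTest(given=given):
                self.assertEqual(improved_solution(given), expected)
```

Once the new code passes the legacy cases, every additional improvement gets its own new test case.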

Since everything necessary seemed to be in place, I had a complete brain cramp when they mentioned they weren't yet ready to write "serious" code. "Serious?" Seriously?

It appears that this developer suffers from a bad case of Fear of Code™. I know some common sources of this fear.
  1. Waterfall Project Experience (WPE™). Old people (like me), who started in Waterfall World, were told that we had to produce mountains of design before we produced any code. No one knew why in any precise way. Indeed, there's ample evidence that too much design is simply a way to introduce noise into the process. In spite of real questions, some folks think that you can write a design so detailed that a coder can just type in the code from the design. (This level of design is isomorphic to code; to avoid ambiguity it must be written as code.)
  2. Relational Database Hegemony (RDH™). Folks (like me) who were DBA's know that databases require a lot of design and a lot of review before they can be created. Writing stored procedures requires even more design and review time. You don't just slap an SP out there. It might be "bad" or "create problems." Also, when you insist on DBA's writing application code, it takes super-detailed, code-level designs. In effect, you must write the code for the DBA to write your code back to you.
  3. One and Done (OAD™). Some people like to feel that they can write code once and it can be a thing of beauty and a joy forever. The idea of a rewrite is anathema to these people. While this is obviously silly, people still like the conceit that they can produce some prototype code that will be a proper part of every future release forever and always. It's not possible to make all of the decisions the first time regarding adoption and scaling and user preferences. Your prototype code will get replaced eventually: get over it. Write the prototype, get funding, move forward. Don't dither trying to make a bunch of future-oriented decisions based on a future you cannot actually foresee. You can't "future-proof" your code.
  4. Learnings are Expensive (LAE™). You can find people who think that the sequence of (spike, POC, version 0, version 1) is too expensive. They are sure that learning is a project drag, since no "tangible" results are created by learning. This means that they don't value intellectual property or knowledge work, either; an attitude that is actually destructive to the organization. Knowledge is everything: software captures knowledge: a spike followed by a POC followed by version zero will arrive on the scene more quickly than any alternative strategy. Don't waste time trying to write version 1 from a position of ignorance.
  5. Tools are Expensive (TAE™). Some people feel that -- since tools are expensive -- they should be used rarely. Back in the olden days, when a compiler took many minutes to produce an error report, you had to be sure the code was good. (I'm old enough that I remember when compiles took hours. Really.) Those days are gone. Most compilers today work at the "speed of light" -- if they were any faster, you couldn't tell, because you can't click any faster. For dynamic languages, like Python, the speed with which code can be emitted makes all tool considerations quaint and silly.
  6. Diagram it to Death (DTD™). Rather than write code, some folks would rather talk about writing code. To them, email, PowerPoint, and whiteboard are cheaper than coding. This is a false economy. Nothing is saved by avoiding code. Time is wasted drawing diagrams of things at a level of detail that mirrors the code. Pictures aren't bad in general. Detailed pictures are simply a stalling tactic.
I find it frustrating when people search for excuses to avoid simply creating code. While I see a number of sources, there are many counter-arguments available. 
  1. Waterfall is dead. Make something minimal that works for this sprint. Call it a "spike" if that makes you happier. Clean it up in the next sprint. Create value early. Expand on the features later.
  2. Databases are free now. SQLite and similar products mean that we can prototype a database without waiting around for DBA's to give us permission to make progress. Build the database now, get something that works. Rework the database as your understanding of the problem matures. Rework the database as the problem itself matures and morphs. Nothing is static; the universe is expanding; do something now.
  3. No code lasts forever. Waiting around to create some kind of perfect value one time only is perfect silliness. Create value early and often. Discarding code means you're making progress. If you think it's important, write "draft" on every electronic document which might get changed. (Hint: version numbers are smarter than putting "draft" everywhere.)
  4. A spike followed by real code arrives more quickly than trying to write the final code directly. It's a matter of technical risk: unfinished work is an "exposure" -- an unrealized investment. Failing soon is better than researching extensively in an effort to prevent a failure that could have been found quickly.
  5. Use a dynamic language and avoid all overheads.
  6. Keep the diagrams high-level. Code is the only way to meaningfully capture details. Code endures better than some out-of-date Visio file that's in SharePoint, completely disconnected from GitHub.
It's imperative to break down the roadblocks. All "pre-coding" activities are little more than emotional props: knock them down and start coding.
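Point 2 above is nearly free to demonstrate: an in-memory SQLite prototype takes seconds to stand up. The schema and rows here are invented for illustration:

```python
import sqlite3

# No server, no DBA, no permission needed: a throwaway in-memory database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reading (stamp TEXT, value REAL)")
db.executemany(
    "INSERT INTO reading VALUES (?, ?)",
    [("2015-06-09T10:00", 21.5), ("2015-06-09T11:00", 22.1)],
)
count, = db.execute("SELECT count(*) FROM reading").fetchone()
print(count)  # → 2
```

When understanding of the problem matures, drop the table and rebuild; nothing is lost but a few seconds.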

Tuesday, June 2, 2015

On Pre-built Binaries for Python Packages

Or.

Why I Hate Windows.

For Mac OS X, you download XCode (for free) and you can build anything. For Linux, you use some kind of yum or rpm installer for the developer tools, and you can build anything.

For Windows...

Pre-built binaries.

And a hope that the version numbers all match up properly.

In many cases, you can use http://www.mingw.org or http://www.cygwin.com. Many projects can work well with one or both of these compilers.

In some cases, however, you have to fork over $$$ for Microsoft Visual Studio to download and build a Python module with a C extension.

The problem is a show-stopper for many n00bz. They are led to believe that pip does everything. And it does -- for Mac OS X and Linux; for Windows, however, it does almost everything. And it's not obvious to the n00bz what the problem is when pip barfs because there's no suitable C compiler.

"Replace that junk Windows PC" is not an appropriate response. Although I often suggest it as the first solution when things won't install.

Often Anaconda is the solution. It includes MinGW and you can (for a fee) buy their bundle of database drivers. The install for Anaconda is breathtakingly simple, removing a great deal of the potential complexity of assembling a tech stack for Python.

In other cases, we have to do some hand-holding to show how to find a pre-built binary for Windows.

Tuesday, May 26, 2015

Regular Expression "Hell"

Actual quote: "they spend a lot of time maintaining regular expressions. So, what are the alternatives to regular expression hell?"

Regular Expression Hell? It's a thing?

I have several thoughts:
  1. Do you have metrics to support "a lot"?  I doubt it. It's very difficult to tease RE maintenance away from code maintenance. Unless you have RE specialists. Maybe there's an RE organization that parallels the DBA org. DBA's write SQL. RE specialists write RE's. If that was true, I could see that you would have metrics, and could justify "a lot." Otherwise, I suspect this is hyperbole. There's frustration, true.
  2. REs are essential to programming.  It's hard to express how fundamental they are. I would suggest that programmers who have serious trouble with RE's have serious trouble with other aspects of the craft, and might need remedial training in RE's (and other things.) There's no shame in getting some training. There are a lot of books that can help. Claiming that there's no time for training (or no budget) is what created RE Hell to begin with. It's a trivial problem to solve. You can spend 16 hours fumbling around, or stop fumbling, spend 16 hours learning, and then press forward with a new skill. The choice is yours.
  3. REs are simply a variant on conventional set theory. They're not hard at all. Set theory is essential to programming, so are RE's. It's as fundamental as boolean algebra. It's as fundamental as getting a loop to terminate properly. It's as fundamental as copy-and-paste from the terminal window. 
  4. REs are universal because they solve a number of problems better than any other technology. Emphasis on better than ANY alternative. RE's are baked into the syntax of languages like awk and perl. They're universal because no one has ever built a sensible alternative. If you want to see even more baked-in regular expression goodness, learn SNOBOL4.
REs are essential. Failure to master REs suggests failure to learn the fundamentals.

RE Hell is like Boolean Algebra Hell. It's like Set Theory Hell. It's like Math Library Hell. It's like Uninitialized Variables Hell. These are things you create through a kind of intentional ignorance.

I'm sorry to sound harsh. But I'm unsympathetic.

The initial regex in question? r"[\( | \$ | \/ |]". This indicates a certain lack of familiarity with the basics. It looks like it started as r"\(|\$|/" and someone put in spaces (perhaps they intended to use the verbose option when compiling it) and/or wrapped the whole in []'s. After trying the []'s, it appeared to work and they called it done.
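A quick session shows what that character class actually matches. (The replacement spelling r"[($/]" is mine; inside a class, those characters need no escaping.)

```python
import re

sloppy = re.compile(r"[\( | \$ | \/ |]")
fixed = re.compile(r"[($/]")  # (, $, and / need no escapes inside [...]

# The sloppy class contains five distinct characters: (, $, /, space,
# and |.  The spaces and pipes match text the author never intended:
assert sloppy.search("a b") is not None   # matches the space
assert sloppy.search("a|b") is not None   # matches the pipe
assert fixed.search("a b") is None
assert fixed.search("a|b") is None

# Both match what was presumably intended:
for text in ["f(x)", "$19.99", "either/or"]:
    assert sloppy.search(text) is not None
    assert fixed.search(text) is not None
```

Spaces and alternation bars are literal characters inside a character class, which is exactly why "it appeared to work" without actually being right.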

The email asked (sort of trivially) if it was true that the last pipe was extraneous. Um. Yes. But.

Follow-up

The hard parts are (1) trying to figure out what the question really is. Why did they remove just the last pipe character? What were they trying to do? What's the goal? Then (2) trying to figure out how much tutorial background is required to successfully answer whatever question is really being asked. A response of r"[\(\$/]" seems like it might not actually be helpful. Acting as a magic oracle that emits mysterious answers would only perpetuate the reigning state of confusion.

The follow-up requests for clarification resulted in (1) an exhaustive list of every book that seems to mention regex, (2) a user story that was far higher level than the context of regex questions. It's difficult to help when there's no focus. Every Book. Generalized "matching" of "data."

The Python connection? I can't completely parse that out, either. It appears that this is part of an ETL pipeline. I can't be sure because the initial user story made so little sense.

Attempts to discuss the supplied user story about "matching" and "data" -- predictably -- led nowhere. It was stopped at "Some of the problems ... aren’t just typos and misspellings." Wait. What? What are they then? If they're not misspellings, what are they? Fraud? Hacking attempts? Denial of Service attacks by tying up time in some matching algorithm?

It's a misspelling. It can't be anything else. Ending the conversation by claiming otherwise is a strange and self-defeating approach to redesigning the software.

More Follow-up

At this point, we seem to be narrowing the domain of discussion to "As time goes on, we have accumulated a lot of the 'standard mistakes'. The question that need help w/ [sic] is how to manage all the code for these 'common mistakes'?" This question was provided in lieu of an actual user story. Lack of a story might mean that we're not interested in actually solving the data matching problem. Instead we're narrowly focused on sprinkling Faerie Dust all over the regexes to make them behave better.

They don't want an alternative to regexes because the problems "aren't just typos and misspellings." They want the regex without the regex hell. 

Tuesday, May 19, 2015

More Thoughts on the friction of DevOps

Read this: How 'DevOps' is Killing the Developer

My pull-out quote:
This is why we see so many developers that can't pass FizzBuzz: they never really had to write any code.
I agree: It appears that DevOps may be more symptom than solution.
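For reference, FizzBuzz -- count up from 1, substituting "Fizz" for multiples of 3, "Buzz" for multiples of 5, and "FizzBuzz" for both -- is this small, which is rather the point:

```python
def fizzbuzz(n):
    # Multiples of both 3 and 5 (i.e. of 15) must be checked first.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

print(", ".join(fizzbuzz(i) for i in range(1, 16)))
# → 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz
```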

I have one tiny objection to an otherwise excellent series of points: I don't like the totem pole analogy.

I prefer a supply-chain:
  • Release Engineers respond to user needs.
  • Quality Engineers respond to the Release Engineers' needs for assurance that something is fit for use.
  • Developers respond to Release Engineers by providing software.
  • Similarly, procurement folks may purchase or lease or download and pay royalties for software. 
I think of it like this:

Developer ⇒ QE ⇒ RE ⇒ Users

No top-to-bottom. More a sequence of more-or-less peers.

I still agree with the central tenet: a developer is able to march the software from concept to user. We don't really expect QE or RE to create software. We might expect some skill sharing between QE and RE.

Many years ago, I posted this: IT’s Drive to Self-Destruction, which is random and whiny but related to this point about DevOps. The idea is that key developers create competitive advantage. Release Engineers put it in the hands of users. Both are important. Without creation there's no deployment. Without dedicated release engineers, creators can be diverted to deployment, so deployment can still go forward, but it will be slower.

The key point is this:
If a developer is spending time with DevOps (and TechOps) trying to get stuff deployed, who's developing the Next Big Thing?