Bio and Publications

Tuesday, May 30, 2017

Ancient Software and A Small Value of "Works"

I'm not sold on this argument at all.

The "constant replacement" issue has two sides to it. If you follow the thread, there's a certain amount of (appropriate) bashing of the acquisitiveness that -- when exploited ruthlessly -- is a damning indictment of capitalism. There's a lot of value in recognizing the core capitalistic "buy more stuff" part of this.

But there's more.

One problem with this tweet is the threshold for "things that work."

Consumer software tends to be rather complex. It's often fat with features. If the features you use aren't obviously broken, you can say the software "works." But that claim is heavily qualified. The feature set you actually use may be rather small compared to the whole.

Also, the technology stack tends to be quite tall. Your consumer software sits on top of a consumer OS and consumer-friendly libraries. All of which have to "work" to claim that your software "works."

As with the app, you're only using a subset of OS and runtime library features. This wraps the threshold for "works" in more and more layers of qualification. Smaller and smaller percentages of the code are involved in "works."

I'd suggest there's no other context where things are quite so complex and interdependent as software. The narrow feature set that appears to "work" may be adjacent to numerous security flaws and bugs and unintended consequences. The constant replacement may be necessary bug fixes.

There's more.

Part of the "constant replacement" situation stems from the cost and complexity of innovation. I think that some companies have a rentier mind-set. Once the software is written, they hope to derive ongoing revenue from it. That doesn't happen, because other companies innovate and the product becomes obsolete. So they scramble around trying to make money without doing too much real work.

There are three scenarios where ancient software gets replaced:
  • A new version includes bug fixes. See above; the previous version didn't really "work" for large values of "works." Blame for using ancient software is deserved. Keeping the old security flaws is not a virtue.
  • A new version is incremental feature creep. This is (potentially) rampant capitalism. Add a little something and sell the product as "new and improved." Keeping the old software because the new features aren't helpful makes sense.
  • Some businesses have a rentier mind-set. They want an ongoing revenue stream. There aren't any material improvements. Keep the old software. Make these people do real work.

Traditional manufacturing business models rely on things wearing out. In the world of atoms, things need to be replaced. A good product has a long future because of parts. And service. Possibly even customization. I live on a sailboat made in 1982. I've been carefully rebuilding and replacing pieces all over the boat.

A software "platform" (e.g. OS, database, etc.) can have good long-term value. A software product, however, lives in a hyper-competitive marketplace where improvements appear constantly. The lazy route of non-innovative upgrades is tempting.

I think that there is a place to blame people for having too low a threshold for "works."

Tuesday, May 23, 2017

Python under XCode?!? Cool. Sort of.

Dan Bader (@dbader_org)
"Running Python in Xcode: Step by Step" ericasadun.com/2016/12/04/run…


Thanks for the link. There's a fair amount of "this doesn't seem to be what Xcode was designed to do." But it does seem to work.

I'll stick with Komodo Edit.

Tuesday, May 16, 2017

Needless Complexity -- Creating Havoc Leads to Mistakes

I received the worst code example ever. The. Worst.

Here's the email.
I have hurriedly created a blog post titled [omitted] at the url below
[also omitted]
...
Unfortunately, I am neither an algorithm expert or a Python expert. However, I am willing to jump in. 
Please review the Python code snippets. I know that they work because I ran them using Python 2.7.6. It was the environment available on my work PC. Speed to get something to the group so that it does not disband outweighs spending time on environments. The goal is not to be Pythonic but to have anyone that has written any code follow the logic.
Also, please review the logic. Somehow, I managed to get the wrong answer. The entire blog post is a build to provide a solution to CLRS exercise 2.3-7. My analysis gave me the answer of O( {n [log n]}**2 ) and the CLRS answer is O(n [log n] ). Where did I screw up my logic?

The referenced blog post is shocking. The "neither an algorithm expert or a Python expert" is an understatement. The "willing to jump in" is perhaps a bad thing. I sent several comments. They were all ignored. I asked for changes a second time. That was also ignored. Eventually, changes were made reluctantly and only after a distressing amount of back-and-forth.

Havoc was created through a process of casually misstating just about everything that could possibly be misstated. It transcended mere "wrong" and entered that space where the whole thing can't even be falsified. It was beyond simply wrong.

My point here is (partially) to heap ridicule on the author. More importantly, I want to isolate a number of issues to show how simple things can become needlessly complex and create havoc.

The Problem

The definition of the problem 2.3-7 seems so straightforward.
"Describe a $\Theta(n \log n)$-time algorithm that, given a set $S$ of $n$ integers and another integer $x$, determines whether or not there exist two elements in $S$ whose sum is exactly $x$."
This brings us to problem #1. The blog post is unable to actually repeat the problem. Here's the quote:
x + y = x0
where
x ∈ N and y ∈ N for some finite set of integer values N
x0: the integer to which x and y sum
x != y
It's not at all clear what's going on here. What's x0? What's N? Why is x!=y even introduced?

This is how havoc starts. The requirements have been restated in a way that makes them more confusing. The original terminology was dropped in favor of random new terminology. There's no reason for restating the problem. The consequence of the bad restatement is to introduce needless features and create confusion.

The Python Nonsense

A quote:
Concepts are demonstrated via code snippets. They code snippets were executed using Python 2.7.6. They were written in such a way that anyone with basic coding skills could read the code. In other words, the goal was not to be Pythonic.
Python 2.7.6 has been obsolete since May of 2014. At the very least, use a current release.

The goal of using Python without being Pythonic seems to be -- well -- schizophrenic. Or it's intentional troll-bait. Hard to say. 

Another paragraph says this.
The Python community will be annoyed because I am using Python 2 and not 3. Their annoyance is appropriate. Unfortunately, I only have Windows machines and can't afford to screw them up at this point in time.
What? That makes no sense at all. It's trivial to install Python 3.6 side-by-side with Python 2. Everyone should. Right now. I'll wait. See https://conda.io/docs/. Start here: https://conda.io/miniconda.html.

If you're going to insist on using the quirky and slow Python 2, you absolutely must use this in all of your code:

from __future__ import print_function, division, absolute_import, unicode_literals

Python 2 code without this is wrong. If you're still using Python 2, add this to all your code, right now. Please. You'll have to fix stuff that breaks, but we'll all thank you for it. The pylint --py3k option will help you locate what needs fixing.
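
As a tiny illustration of what those imports change (mine, not from the email under review): with them in place, division and printing behave the same way under 2.7 and 3.x.

from __future__ import print_function, division, absolute_import, unicode_literals

print(7 / 2)   # 3.5 under both 2.7 and 3.x -- true division, not silent floor division
print(7 // 2)  # 3 under both -- explicit floor division when that's what you mean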

The code with a -2/10 pylint score

I'm trying to reproduce this faithfully. It's hard, because the original blog post has issues with layout.

SomeIntegerList = [1, 2, 3, 4, 5, 6]
DesiredSumOfIntegers = 11
for SomeIntegerA in SomeIntegerList:
 for SomeIntegerB in SomeIntegerList:
  if SomeIntegerA == SomeIntegerB: continue
        SumOfIntegers = SomeIntegerA + SomeIntegerB
        print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
  if DesiredSumOfIntegers == SumOfIntegers:
      print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"

(The original really could not be copied and pasted to create code that could even be parsed. I may have accidentally fixed that. I hope not.)

Almost every line of code has a problem. It gets worse, of course.

There's output in the original blog post that provides a hint as to what's supposed to be happening here.

Addition is Commutative

Yes. There is an entire paragraph plus a spreadsheet which proves that addition is commutative. An. Entire. Paragraph. Plus. A. Spreadsheet.

Meanwhile, factorial, multiplication, and division aren't mentioned. Why do we need a spreadsheet to show that addition is commutative while all the other operators are ignored? No clue. Moving on.

Permutations

A quote:
Now, let's talk about the number of computations involved in using nested for loops to examine all the possible addition permutations. Here I am using the term permutation as it is strictly defined in mathematics.
First. The algorithm uses all combinations. $\textbf{O}(n^2)$.

Second. "as it is strictly defined in mathematics" should go without saying. If you feel the need to say this, it calls the entire blog post into question.

It's like "honestly." Anyone who has to establish their honesty with "can I be honest with you?" is still lying.

If we're being strict here, are we not being strict elsewhere? If we're not being strict, why not?

The algorithm enumerates all combinations of n things taken 2 at a time without replacement. For reasons that aren't clear. The original problem statement permits replacement. The restatement of the problem doesn't permit replacement.

The n things taken r (or 2) at a time problem

There's a table with values for $\frac{n!}{(n-r)!}$

No hint is given as to what this table is or why it's here.  I think it's supposed to be because of this:

$\frac{n!}{r!(n-r)!} \text{ with } r=2 \equiv \frac{n!}{2(n-2)!} \equiv \frac{n\times(n-1)}{2}$

It's hard to say why commutativity of addition gets a paragraph, but this gets no explanation at all. To me, it shows a disregard for the reader: the reader doesn't understand addition, but they totally get factorial.
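
If you want to convince yourself of that identity without a table, a two-line check (my addition, not the original post's) does it:

>>> from itertools import combinations
>>> n = 6
>>> len(list(combinations(range(n), 2))) == n * (n - 1) // 2
True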

Another Perspective

A quote
Another perspective is to note that the nested for loops result in O(n^2). Clearly, the above approach is not scalable.
That's not "another perspective." That's. The. Point. The entire point of the exercise is that the brute force algorithm isn't optimal.

The Worst Code Snippet Ever

This is truly and deeply shocking.

SomeIntegerList = [1, 2, 3, 4, 5, 6]
DesiredSumOfIntegers = 11
i = 0
for SomeIntegerA in SomeIntegerList:
    i = i + 1
    j = 0
 for SomeIntegerB in SomeIntegerList:
        j = j + 1
        if j > i:
            print "i = ", i, ", j = ", j
            SumOfIntegers = SomeIntegerA + SomeIntegerB
            print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
      if DesiredSumOfIntegers == SumOfIntegers:
          print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"


This is what drove me over the edge. This is unconscionably evil programming. It transcends mere "non-Pythonic" and reaches a realm of hellish havoc that can barely be understood as rational. Seriously. This is evil incarnate.

This is the most bafflingly complex version of a half-matrix iteration that I think I've ever seen. I can only guess that it was written by someone uncomfortable with thinking, who copied a block of assembler code and changed the syntax to Python. I can't discern any other way to arrive at this code.

The Big-O Problem

This quote:
Even though the number of computations is cut in half
The rules for Big-O are in the cited CLRS book.  $\textbf{O}(\frac{n^2}{2}) = \textbf{O}(n^2)$.

The "cut in half" doesn't count when describing the overall worst-case complexity. It needs to be emphasized that "cut in half" doesn't matter. Over and over again.

This code doesn't solve the problem. It doesn't advance toward solving the problem. And it's unreadable. Maybe it's a counter-example? An elaborate "don't do this"?

The idea of for i in range(len(S)): for j in range(i): ... seems to be an inescapable approach to processing the upper half of a matrix, and it seems to be obviously $\textbf{O}(n^2)$.
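
Written out plainly (my sketch of the idiom named above, not a quote from the reviewed post), the half-matrix iteration is a handful of lines:

S = [1, 2, 3, 4, 5, 6]
x = 11
for i in range(len(S)):
    for j in range(i):
        # j always trails i, so each unordered pair is visited exactly once.
        if S[i] + S[j] == x:
            print(S[j], S[i], "sums to", x)

It's still $\textbf{O}(n^2)$, of course. itertools.combinations(S, 2) expresses the same pairs without any index arithmetic at all.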

The Binary Search

This quote is perhaps the only thing in the entire blog post that's not utterly wrong.
we can compute the integer value that we need to find. We can than do a search over an ordered list for the integer that we need to find.
Finally. Something sensible. Followed by more really bad code.

The code starts with this

def binarySearch(alist, item):

instead of this

from bisect import bisect

Why does anyone try to write code when Python already provides it?

There's more code, but it's just badly formatted and has a net pylint score that's below zero. We've seen enough.

There's some further analysis that doesn't make any sense at all:
Since the integers that sum must be distinct, the diagnol on the matrix have values of N/A
And this:
Secondly, we should remove the integer that we are on from the binary search
This is a consequence of the initial confusion that decided that $x \neq y$ was somehow part of the problem. When it wasn't. These two sentences indicate a level of profound confusion about the essential requirements. Which leads to havoc.

Added Complication

The whole story is pretty badly confused. Then this arrives.
Complicate Problem by Having Integer List Not Sorted
It's not clear what this is or why it's here. But there it is. 

It leads eventually to this, which also happens to be true. 
The total computation complexity is O(2 * n [log n] ) = O(n [log n] )
That's not bad. However. The email that asked for help claimed O( {n [log n]}**2 ). I have no idea what the email is talking about. Nor could I find out what any of this meant. 

The Kicker

The kicker is some code that solves the problem in $\textbf{O}(n)$ time. Without using a set, which is interesting.

This was not part of the CLRS exercise 2.3-7. I suppose it's just there to point out something about something. Maybe it's an "other people are smarter than CLRS"? Or maybe it's a "just google for the right answer without too much thinking"? Hard to say.

A sentence or two of introduction might be all that's required to see why the other result is there.
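
The post doesn't reproduce that code here, so I can't show it. One well-known approach that fits the description -- linear, no set -- is the two-pointer walk over a sorted list. This is only a sketch of that technique, and I can't say it's what the post actually used; it also assumes the input is already sorted, otherwise the sort puts you back at $\textbf{O}(n \log n)$.

def has_pair_sum(S, x):
    # Two-pointer scan over a sorted list: O(n) once the list is sorted.
    lo, hi = 0, len(S) - 1
    while lo < hi:
        total = S[lo] + S[hi]
        if total == x:
            return True
        elif total < x:
            lo += 1
        else:
            hi -= 1
    return False

print(has_pair_sum([1, 2, 3, 4, 5, 6], 11))  # True
print(has_pair_sum([1, 2, 3, 4, 5, 6], 13))  # False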

Lessons Learned

Some people like to add complexity to the problem. The $x \neq y$ business is fabricated from thin air. It adds to the code complexity, but is clearly not part of the problem space.

This creates havoc. Simple havoc.

Some people appear to act like they're asking for help. But they're not. They may only want affirmation. A nice pat on the head. "Yes, you've written a blog post." Actual criticism isn't expected or desired. This is easy to detect by the volume and vehemence of the replies.

Given a list of numbers, S, and a target, x, determine if two values exist in S that sum to x.

>>> S = [1,2,3,4,5,6]
>>> x=11
>>> [(n, x-n) for n in S if (x-n) in S]
[(5, 6), (6, 5)]
>>> bool([(n, x-n) for n in S if (x-n) in S])
True

This follows directly from the analysis. It doesn't add anything new or different. It just uses Python code rather than indented assembler.

This first example is $\textbf{O}(n^2)$ because the in operator is applied to a list. We can, however, use bisect() instead of the in operator.

>>> [(n, x-n) for n in S if S[bisect(S, (x-n))-1] == x-n]
[(5, 6), (6, 5)]
>>> x=13
>>> [(n, x-n) for n in S if S[bisect(S, (x-n))-1] == x-n]
[]

This achieves the goal -- following the parts of the analysis that aren't riddled with errors -- without so much nonsensical code.

This does require some explanation of what bisect(S, x-n) does. It's important to note that the bisect() function returns the position at which a new value should be inserted to maintain order. It doesn't return the location of a found item. Indeed, if the item isn't found, it will still return a position into which a new item should be inserted.
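
A quick doctest-style illustration of that behavior (mine):

>>> from bisect import bisect
>>> S = [1, 2, 3, 4, 5, 6]
>>> bisect(S, 4)            # 4 is present; the insertion point is just past it
4
>>> S[bisect(S, 4) - 1] == 4
True
>>> bisect(S, 7)            # 7 is absent; the insertion point is the end of the list
6
>>> S[bisect(S, 7) - 1] == 7
False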

If we want this to be $\textbf{O}(n)$, we can use this:

>>> S = [1,2,3,4,5,6]
>>> S_set = set(S)
>>> x=11
>>> bool([(n, x-n) for n in S_set if (x-n) in S_set])
True

This replaces the linear list with a set, S_set. The (x-n) in S_set operation is $\textbf{O}(1)$, leading to the overall operation being $\textbf{O}(n)$.

If you want to shave a little time, you can use any() instead of bool([]). If you're not returning the pairs, you can reduce it to any(x-n in S_set for n in S_set). Try it with timeit to see what the impact is. It's surprisingly small.
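
Here's one way to run that comparison (my sketch; timeit's default of a million iterations is fine for statements this small, and the exact numbers will vary by machine):

from timeit import timeit

setup = "S_set = set([1, 2, 3, 4, 5, 6]); x = 11"
print(timeit("bool([(n, x-n) for n in S_set if (x-n) in S_set])", setup=setup))
print(timeit("any(x-n in S_set for n in S_set)", setup=setup))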

Tuesday, May 9, 2017

Fizz Buzz Overthought, Project Euler #1, and Unit Tests

This. http://www.tomdalling.com/blog/software-design/fizzbuzz-in-too-much-detail/

And many other thoughts on overthinking fizz buzz. I'm going to overthink it, also. Why not?

This is a problem where the obvious unit test may not cover the cases properly. See https://projecteuler.net/problem=1 for a unit test case with a subtle misdirection built in. I love this problem statement deeply because the happy path is incomplete.
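
My reading of that misdirection: the published example only covers values below 10, so it never exercises 15, the first value that matches both the 3 rule and the 5 rule. A unit test built solely from the sample data can pass even when the overlap is mishandled.

>>> sample = [n for n in range(1, 10) if n % 3 == 0 or n % 5 == 0]
>>> sample, sum(sample)
([3, 5, 6, 9], 23)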

Part of my overthinking is overthinking this as a "classification" exercise, where we have simple classifiers that we can apply as functions of an input value. A higher-level function (e.g., map or a generator expression) applies all of these functions to the input. Some match. Some don't.

It shakes out like this.

The core classifier is a function that requires flexible parameter binding.

query = lambda n, t: lambda x: t if x%n == 0 else None

This is a two-step function definition. The outer function binds in two parameters, n and t. The result of this binding is the inner function. If an argument value is a multiple of n, the inner function returns the text, t. We can also think of the result of the outer lambda as a partial function, with some parameters defined, but not all.

We have two concrete results of this query() function:

fizz = query(3, 'fizz') 
buzz = query(5, 'buzz')

We've bound in a number and some text. Here's how the resulting functions work.


>>> fizz(3) 
'fizz' 
>>> fizz(2)

The "trick" in the fizz buzz problem space is recognizing that we're working with the power set of these two rules. There are actually four separate conditions. This is remarkably easy to get wrong even though the sample code may pass a unit test like the Project Euler #1 sample data.

Here's the power set that contains the complete set of all subsets of the rules.

rule_groups = set(powerset([fizz, buzz]))

"Um," you say, "Is that necessary? And powerset() isn't built-in."

Try to add a third or fourth rule and the $\textbf{O}(2^n)$ growth in complexity of checking all combinations of the rules will become readily apparent. For two rules, $4 = \lvert\mathcal{P}(\{q(3), q(5)\})\rvert$. For a general set of rules, $r$, it's $2^{\lvert r \rvert} = \lvert \mathcal{P}(r)\rvert$. Four rules? Sixteen outcomes. It sure seems like the power set is absolutely necessary; it describes the domain of possible outcomes. How could it not be necessary?

Also. There's a nice definition of powerset in the itertools recipes section of the standard library documentation. It's *almost* built-in.
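
For reference, the recipe is short; this is essentially the version shown in the itertools documentation:

from itertools import chain, combinations

def powerset(iterable):
    # powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))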

The domain of possible responses form a power set. However, it's also clear that we aren't obligated to actually enumerate that set for each value we're testing. We do need to be aware that the complexity of the classification output is $\textbf{O}(2^n)$ where $n$ is the number of rules.

The processing to build each set of classifications, however, is $\textbf{O}(n)$. Here's how it looks.

for n in range(20):
    m = set(filter(None, (r(n) for r in [fizz, buzz])))
    print(m if m else n)

This locates the set of all matches, m. We apply the rules, fizz() and buzz(), to the given value. The result of applying the rules is filtered to remove falsy values. The resulting set, m, has the values from all rules which matched. This will be one of the sets from the power set of applying the rules to each value. The match, $m$, is an element of $\mathcal{P} (\{q(3), q(5)\})$.

I'm delighted that Python has some support for creating partial functions in a variety of ways. When things are complex, we can use the def statement. We can use functools.partial(). When things are simple, we can even use lambdas.
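
To make that concrete, here is the same fizz classifier built three ways (a sketch; only the lambda version appears in the post above):

from functools import partial

# 1. Nested lambdas, as above.
query = lambda n, t: lambda x: t if x % n == 0 else None

# 2. An explicit closure with def.
def make_query(n, t):
    def match(x):
        return t if x % n == 0 else None
    return match

# 3. functools.partial over a three-argument function.
def query3(n, t, x):
    return t if x % n == 0 else None

fizz_l = query(3, 'fizz')
fizz_d = make_query(3, 'fizz')
fizz_p = partial(query3, 3, 'fizz')

assert fizz_l(9) == fizz_d(9) == fizz_p(9) == 'fizz'
assert fizz_l(7) is fizz_d(7) is fizz_p(7) is None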

Tuesday, May 2, 2017

Functional Python, Literate Programming & Trello Board Analysis

The general advice to people using Kanban/Agile Project boards of various kinds is this:

Stop Starting -- Start Finishing


etc.

There's a lot of this advice. Some of it is helpful.

Many tools have various dashboards and metrics computations.

However.

The basic velocity calculation -- starts vs. finishes -- is pretty straightforward. The rules to classify a Trello action as "start" or "finish" are actually nice examples of simple functions or lambdas. Which also means that the basic pipeline required to gather the data can be written as a lazy, functional process.
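
To show what those rules might look like, here's a minimal sketch. The field names (type, data, listAfter) are my assumptions about the shape of a Trello action record, not something from the linked project; check them against your own board's JSON.

def moved_to(list_name):
    # Classifier factory: returns a predicate for "card moved into list_name".
    def match(action):
        return (
            action.get("type") == "updateCard"
            and action.get("data", {}).get("listAfter", {}).get("name") == list_name
        )
    return match

is_start = moved_to("In Progress")
is_finish = moved_to("Done")

# A lazy, functional pipeline over a (hypothetical) stream of actions.
actions = [
    {"type": "updateCard", "data": {"listAfter": {"name": "In Progress"}}},
    {"type": "updateCard", "data": {"listAfter": {"name": "Done"}}},
    {"type": "commentCard", "data": {}},
]
starts = sum(1 for a in actions if is_start(a))
finishes = sum(1 for a in actions if is_finish(a))
print(starts, finishes)  # 1 1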

Which leads to writing a Literate Programming version of a small program that gathers data from a Trello board.

https://github.com/slott56/Trello-Action-Counts

It's a kind of deep-dive into some aspects of Python functional-style programming. It's also a dive into Literate Programming via a longish example. And it has a fair number of type hints. It's not perfectly clean under mypy's analysis, so there's some more to do on that front.