
Tuesday, March 26, 2019

Python and pathlib and Windows -- this problem has been solved -- and yet...

The Passive-Aggressive Programmer strikes again. A sad story of sadness.

I tell everyone to stop using os.path and use pathlib. Everyone. Here's the link: https://docs.python.org/3/library/pathlib.html

It's essential to realize the semantic richness of OS filesystem paths. They're not simply strings. They have a string representation, but there's quite a bit going on there that is not captured trivially by strings and string parsing.  "path/basename.extension" is more than just slashes and dots.
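To make that concrete: a minimal sketch of what pathlib already understands about a path (an illustrative path, nothing from the code under review):

from pathlib import Path

p = Path("some/dir/basename.extension")
print(p.parent)                # the containing directory
print(p.stem)                  # basename
print(p.suffix)                # .extension
print(p.with_suffix(".csv"))   # same path, different extension
print(p.parent / "other.txt")  # joining uses the / operator, not string concatenation

None of that comes for free from string parsing.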

Windows users, of course, have a nightmarish problem. Actually many nightmarish problems, one of which is pathnames.

I tell Windows developers to use pathlib; it will make their lives somewhat more bearable.

And Yet. The Passive-Aggressive Programmer insists on using Windows as if it doesn't have a problem with \ in path strings.

Line 110 has a literal r"C:\windows\is\xtreme\evil". Note the \x in the path. Without the r"" prefix, this literal raises a SyntaxError.
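For anyone who hasn't been bitten by this, a two-line illustration of the difference the r"" prefix makes, using the anonymized path from the review:

# Without the r prefix, \x starts a hex escape that needs two hex digits,
# so the plain literal "C:\windows\is\xtreme\evil" won't even compile.
path = r"C:\windows\is\xtreme\evil"   # raw string: the backslashes survive as-is
print(path)                           # C:\windows\is\xtreme\evil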

Line 50 had subprocess.run(r"C:\path\to\xectuable -option" + " " + options + " " + filename). Note the r"" string. Note the Linux-style -option, too. They're wrapping an open source app in a Python shell.

You're with me so far? They're Windows devs. They've managed to use raw strings in two places. Right?

But. They're Passive-Aggressive. They don't like PR comments of any kind. They'll "agree" to a change, but then... This...

Line 50 should change. It needs to use a list.

The Passive Aggressive Programmer can't make a list work.

list_of_arguments = ["C:\path\to\xectuable -option"] + options_list + ["C:\windows\is\xtreme\evil"]

See what they did there? They didn't want to change from string to list. (We had to go over it more than once.) They wanted to leave it alone. Grudgingly, they agreed to change from string to list.

But.

SyntaxError. See? The list just doesn't work. Python is weird. It's an undocumented WAT.

[Yes, some of us know the r's vanished. The author couldn't figure that out, though.]

And the pathlib suggestion?

Since the strings are now a SyntaxError, they need me to fix that for them. I made them change to a list; therefore, I caused the SyntaxError. It would be a distraction to spend time researching pathlib. "I need to Google and think about how to handle the Unicode error" was the response.

Using Path("C:/path/to/xecutable") to avoid any Windows-ism of any kind is an impossible burden. Impossible. It requires them to Google. Instead, the SyntaxError is all my fault.
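For the record, the whole change is small. A sketch of the list-plus-pathlib version, using the anonymized paths from above and placeholder values for the options and the filename:

from pathlib import Path
import subprocess

app = Path("C:/path/to/xecutable")   # forward slashes; pathlib copes with Windows

options_list = ["-option"]           # placeholders for the real arguments
filename = "input.dat"

# Each argument is its own list element: no string pasting, no quoting games.
subprocess.run([str(app), *options_list, filename], check=True)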

The previous examples of the use of raw strings?  Don't know why they're not helpful, but I'm not the one who's struggling to implement a change.

Tuesday, March 19, 2019

Don't Solve My Problem.

Two and a half examples of "Don't solve the problem I described. Provide the implementation I dream about."

Can't Use Enums for Constants

I was asked to look at this because sometimes there's just too much abstraction: https://stackoverflow.com/questions/2668355/how-much-abstraction-is-too-much

The accepted answer links to some useful design principles. Read the answer. It's useful.

The question objects to abstract superclasses without much (or any) actual implementation.  I've seen folks toy around with frameworks where there are classes that introduce a name, but little else. So I understand the complaint. I once tried to use Python classes as surrogates for Java interfaces. It was a bad idea. And irrelevant to solving the underlying problem.

The problem that led to the "yet another abstraction" complaint, however, was not related to a design with empty layers of framework abstractions. It was not related to classes used to define an interface-like feature in Python. It wasn't related to *anything* in the Stack Overflow question or answer.

The "yet another abstraction" complaint was based on Python not having constants. Seriously. How did we get here? Right. They don't want a solution. They want to complain.

I lift this situation up to folks who are trapped in conversations where things devolve into bizarro-world like "Yet Another Abstraction is bad" when we're not talking about abstractions. The solution is simple, but it's not what they wanted, so it's labeled as bad in some way.
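For contrast, the simple solution amounts to roughly this -- a minimal sketch with made-up names:

from enum import Enum

class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

# The members are named, importable constants. Trying to rebind one
# (Color.RED = 4) raises AttributeError.
print(Color.RED, Color.RED.value)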

The solution is bad because it's unexpected. Consequently, peripheral, tangential, weird-ass nonsense will show up in trying to avoid an unexpected solution.

Can't Assign Numbers

There's an API to load some data.  They have 100's of clients happily loading data. In some cases, the clients must assign numbers in addition to names; it's a disambiguation thing. Most of the time, the name is good. In a few cases, (name, number) is a two-part key because they have multiple instances with the same name.

We're good here. The data structure's key can be (name, number) and the default number is zero. Works for almost everyone.
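A sketch of that kind of key, with hypothetical names; the number defaults to zero, so most clients never have to think about it:

from typing import NamedTuple

class Key(NamedTuple):
    name: str
    number: int = 0   # disambiguator; defaults to zero

records = {
    Key("pump"): "payload...",       # the common case: the name alone is enough
    Key("valve", 1): "payload...",   # duplicated names get explicit numbers
    Key("valve", 2): "payload...",
}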

Almost everyone. Except they have one client who cannot count or enumerate their data.

Really.

The client can't even pre-process the data to add numbers because reasons.

The stated reason is "the data originates off-line and the numbers might be inconsistent." The key needs a number. It doesn't need to be consistent. The point is asking the client to own the identity -- a name and a number.

The solution seemed easy. Assign a number. If your data comes from a spreadsheet, use the row number. The =ROW() function works. Use that. If your data doesn't come from a spreadsheet, write a tiny utility to laminate a number into the data. This doesn't seem hard. And then the client owns the object identity.
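The tiny utility really is tiny. A sketch, assuming the off-line data arrives as a CSV file with a name column:

import csv

def laminate_numbers(source, target):
    """Copy a CSV, adding a 'number' column that disambiguates repeated names."""
    counts = {}
    with open(source, newline="") as src, open(target, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["number"])
        writer.writeheader()
        for row in reader:
            row["number"] = counts.get(row["name"], 0)   # first occurrence keeps 0
            counts[row["name"]] = row["number"] + 1
            writer.writerow(row)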

Nope.

Can't do it. The web service will have to assign the number for them.

It's not a difficult feature to add. It's a complicated, stateful default value. This will turn into trouble tickets in the future when the numbers are unacceptable because they change with each load or something even more obscure than that.

Can't Fork a Repo

This isn't recent, and I may not have the details right. But.

The team had evolved an approach where they had several different pieces of software spread among multiple branches in a single Git repository.

This was weird. And they were -- of course -- about to start having CI/CD problems as they moved away from manual builds into a world of git commit hooks and relatively fixed CI/CD pipelines.

And they were really unhappy. They liked having multiple branches in a single repo. The idea of forking this into separate repos was unacceptable. Unworkable. Breaks everything. (Breaks everything they had. Everything they needed to replace. Or so they claimed.)

They had some vision of having the CI/CD jobs all reworked to move beyond the common dev/master world into their uniquely odd world of lots of parallel branches, each its own private "master". But all in one repo.

They seemed to have locked into a strange world view, and weren't happy discarding it. The circular discussions of how multiple repos would break something something in their something were more examples of tangential, irrelevant discussion to cloak empty whining.

tl;dr

I think there are people who don't really want a "solution." They want something else.

There are people who have a vision: How Things Should Be (HTSB™). They seem to be utterly unwilling to consider something that is not literally their (narrow) vision of HTSB.

It's very much a "Don't confuse me with facts, my mind is made up" situation. It's exasperating to me because of the irrelevant side-channel discussions they use to avoid confronting (or even stating) the actual problem.

Tuesday, March 12, 2019

Python's multi-threading and the GIL

Got this in an email.
"Python's multi-threading module seems not efficient because of the global interpreter lock?" 
Yep.
Is the trick to use "Thread-Local Data"?
Nope.

It Gets Worse

Interestingly, there was no further ask. The questioner had decided on thread-local data because the questioner had decided to focus on threads. And they were done making choices at that point.

Sigh.

No question on "What was recommended?" or "What's a common solution?" or "What is Dask?" Nothing other than "confirm my assumptions."

This is swirling around a bunch of emails on trying to determine the maximum number of concurrent threads or processes based on the number of cores or CPU's or something.

Maximum.

I'll repeat that for those who skim.

They think there's a maximum number of concurrent threads or processes.

If you have some computation which (1) makes zero OS requests and (2) is never interrupted, I can imagine you'd like to have all of the cores fully committed to executing that theoretical stream of instructions. You might even be able to split that theoretical workload up based on the number of cores.

Practically, however, that stream of uninterrupted computing rarely exists.

Maybe. Maybe you've got some basin-hopping or gradient-following or random forest ML algorithm which is going to do a burst of computation on an in-memory data structure. In that (rare) case, Dask is still ideal for exploiting all of the cores on your processor.

The upper-bound idea bugs me a lot.

  • Any OS request leads to a context switch. Any context switch leads to waiting. Any waiting means you can have more threads than you have cores. 
  • AFAIK, any memory write outside the local cache will lead to a stall in the pipeline. Another thread can (and should) leap into the core's processing stream. The only way you can create the "all-computing" sequence of instructions bounded by the number of cores is to *also* be sure the entire thing fits in cache. Hahahaha.

What's the maximum number of threads or processes? It depends on the wait times. It depends on memory writes. It depends on the size of the data structure, the size of cache, and the size of the instruction stream.

Because it depends on a lot of things, it's rather difficult to predict. And that makes it rather difficult to determine a maximum.

Replying about the uselessness of trying to establish a maximum, of course, does nothing. AFAIK, they're still assiduously trying to use os.cpu_count() and os.sched_getaffinity() to put an upper bound on the size of a thread pool.

Acting as if Dask doesn't exist.

Solution

Use Dask.

Or

Use a multiprocessing pool.

These are simple things. They don't require a lot of hand-wringing over the GIL and Thread Local Data. They're built. They work. They're simple and effective solutions.
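For the skimmers, a minimal sketch of the pool version; compute() is a stand-in for the real work:

from multiprocessing import Pool

def compute(chunk):
    """Stand-in for the real CPU-bound work on one chunk of data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [range(1_000_000)] * 8
    with Pool() as pool:               # no worker count given: the pool sizes itself
        results = pool.map(compute, chunks)
    print(results)

No hand-rolled upper bound, no GIL hand-wringing.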

Side-bar Nonsense

From "a really smart guy. He got his PhD in quantum mechanics and he got major money to actually go build … . He initially worked for ... and now he is working for .... So, when he says something or asks a question, I listen very carefully."
The laudatory blah-blah-blah doesn't really change the argument. It can be omitted. It is an "Appeal to Authority" fallacy, and the Highest Paid Person's Opinion (HIPPO) organizational pattern. Spare me.

Indeed. Asking for my confirmation of using Thread-Local Data to avoid the GIL is -- effectively -- yet another Appeal to Authority. Don't ask me if you have a good idea. An appeal to me as an authority is exactly as bad as an appeal to some other authority to convince me you've found a corner case that no one has ever seen before.

Worse is to ask me and then blah-blah-blah Steve Lott says blah-blah-blah. Please don't.

I can be (and often am) wrong.

Write your code. Measure your code's performance. Tell me your results. Explore *all* the alternatives while you're at it.

Tuesday, March 5, 2019

Python exceptions considered an anti-pattern

https://sobolevn.me/2019/02/python-exceptions-considered-an-antipattern

While the post is eloquent and thorough, I remain unconvinced that this is a significant improvement over try/except.

It's common enough in some functional languages to have strong support and a long, successful history.

I think it replaces one problem with another. It's not a "solution". It's an alternative. Instead of exceptions being raised, they're returned. Cool.
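In plain Python, the trade looks roughly like this -- a simplified sketch, not the library the article describes:

class Success:
    def __init__(self, value):
        self.value = value

class Failure:
    def __init__(self, error):
        self.error = error

def parse_int(text):
    try:
        return Success(int(text))
    except ValueError as ex:
        return Failure(ex)

# Nothing propagates; the caller has to look at what came back.
result = parse_int("forty-two")
if isinstance(result, Failure):
    print("could not parse:", result.error)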