Tuesday, October 26, 2021

Python is a Bad Programming Language. Wait, wut?

It may help to read Python is a Bad Programming Language, but it's not very useful.

I shouldn't be tempted by click-bait headlines. But. I am drawn in by bad articles on Python.

In particular, any post claiming Python is deficient causes me to look for the concrete PEP's that fix the problems.

Interestingly, there never seem to be any PEP's in any article that bashes Python. This post is yet another example of complaining without offering any practical solutions.

BLUF

The article has a complaining tone, but I can't figure out some of the complaints. It lifts up a confusing collection of features from other languages as if these features are somehow universally desirable. No justification is provided. The author seems to rely exclusively on Stack Overflow answers for information about Python. There are no PEP's proposed to fix Python. There aren't even any examples.

Point-by-Point

I will try to address each point. It's difficult, because some of the points are hard to discern. There's a lot of "Who thought that was a good idea?" which isn't really a specific point that can be refuted. It's a kind of rhetorical flourish that seems to work best with folks that already agree.

Let's start.

A Fragmented Language

This is the result of profound confusion. It's hard to find anyone recommending Python 2 anywhere. The supplied link is 9 years old, making this comment extremely misleading. (I'm being charitable. A nine-year-old link on Stack Overflow requires some curation. This is not a Python problem.)

Ugly Object-Orientation

The inconsistent use of this in C++ and Java is lifted up as somehow good. The consistent use of the self instance variable in Python is somehow less good; perhaps because it's consistent.

"See how I have to both declare and initialize them in the constructor? Another example of Python stupidity." Um. No, I don't actually see you declare them anywhere. I guess you're unaware of what declare means in languages like C++ and why declare isn't a thing in Python.

Somehow using the private keyword is better than __ name mangling. I'm unclear on why it's better, it's simply stated in a way that makes it sound like a long keyword used once is better because it's better. No additional reason or justification is offered. The idea of using __ to emphasize the privacy is somehow inconceivable.

The private and protected keywords are in C++, C#, and Java to optimize recompilation. To an extent, this also permits distribution of libraries in the form of "headers" and obfuscated binaries. None of this makes sense in a Python context. A single example of how the private keyword would be helpful in Python is missing from the original post. There are huge complications of the protected keyword, also; these make the keywords more trouble than they are worth, and any example needs to cover these issues, also.
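For anyone who hasn't seen it in action, here's a minimal sketch of what the __ name mangling actually does, and of attributes being assigned rather than declared. The Account class and its attribute names are invented for this illustration; they are not from the original post.

```python
class Account:
    """Hypothetical class, used only to illustrate name mangling."""

    def __init__(self, owner: str, balance: float) -> None:
        # No declarations: attributes come into existence when assigned.
        self.owner = owner
        self.__balance = balance  # stored on the instance as _Account__balance

    def deposit(self, amount: float) -> float:
        self.__balance += amount
        return self.__balance


acct = Account("pat", 10.0)
acct.deposit(5.0)

# The mangled name is available to anyone who reads the source.
mangled = acct._Account__balance

# The unmangled name does not exist outside the class body.
has_plain_name = hasattr(acct, "__balance")
```

The mangling discourages accidental use and avoids name collisions in subclasses. It doesn't pretend to be a security boundary, which seems about right for a language where the source is always available.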

"In general, when you point out any flaws in their language, Python developers will act hostile and condescending." Sorry, this complaint in the original post sounds hostile and condescending. I'll try to ignore the tone and stick to what content I can find.

Whitespace

"...how is using whitespace any better than curly braces?" has an answer. But. Somehow it can't be chased down and included in the original post. Whitespace (like name mangling) is described as wrong because it's wrong, with no further justification provided.

An example where braces seem to be essential for sorting out syntax would be nice. The entire Python community is waiting for any example where braces were necessary and the indentation wasn't already clear.

"And only in Python will the difference between tabs and spaces cause the interpreter to have a heart attack." Um. A syntax error is a heart attack? I wish I was able to type code without syntax errors. I am humbled thinking about the idea of seeing syntax errors so rarely. I have my editor set up to use spaces instead of tabs, and haven't had a problem in 20 years of using Python.

Dynamic Typing

The opening quote, "Dynamic typing is bad," is stated as if it's axiomatic. The rest of the paragraph seems like vitriol rather than justification. "Some Python programmers have realized that dynamic typing is bad" requires some justification; a link to some documentation to support the claim would be helpful. An example would be good.

I can only assume that code like this is important and needs to be flagged by the compiler or something.

for data in some_list:
    if data == 42:
        print("data is int")
for data in some_other_list:
    if data == "wait":
        print("see the type of data changed.")

This seems like poor programming to begin with. Expecting the compiler to reject this seems weak. It seems better to not reuse variable names in the first place.

Constants

Not sure what the point is here. There's no justification for demanding the inconsistent behavior of a one-time-only assignment statement. There's no reference to how folks can use enums to define constant-like names and values.
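A minimal sketch of the enum approach, assuming a couple of made-up constant names (Limit, MAX_RETRIES, and TIMEOUT_SECONDS are hypothetical, chosen only for this example):

```python
from enum import Enum


class Limit(Enum):
    """Hypothetical constants, used only for illustration."""

    MAX_RETRIES = 3
    TIMEOUT_SECONDS = 30


try:
    Limit.MAX_RETRIES = 99  # rebinding an enum member is rejected
except AttributeError:
    reassigned = False
else:
    reassigned = True
```

The member keeps its value, and the attempted reassignment raises an exception instead of silently succeeding.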

The concluding paragraph "The Emperor Has Not Clothes" is some kind of summary. It says "Python will only grow in popularity as more and more software is written in it," which does seem to be true. I think that might be the single most useful sentence.

What Have We Learned?

First, reading a few Stack Overflow posts can be misleading. Python now is not Python from nine years ago.

Everyone says to use Python3. Really. If you have found a Python2 tutorial, stop now. Don't follow it.
The consistent use of the self variable seems simpler than trying to understand the rules for the this variable.
Variables aren't declared, they're assigned values. It's as simple as it can be and avoids the clutter of variable declarations.
We can read the source; the complexities of private (or protected) instance variables don't really help.
Python's use of whitespace is very simple; most people can indent their code correctly. Anyone who's tried to debug C++ code that's correctly indented but missing a (nearly invisible) } will agree that the indentation is easier to get right.
AFAICT, the reason dynamic typing might be bad is when a function or class reuses the same variable name for multiple different types of data. It seems wrong to reuse a variable name for multiple types. A small effort at inspecting the code can prevent this.
Constants are easily implemented via enum. But. They appear to be useless in a dynamic language where the source is trivially available to be changed. I'm not sure why they seem important to people. And this article provides no help there.
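Going back to the whitespace item in the list: the tabs-versus-spaces "heart attack" is an ordinary, catchable syntax error. Here's a small sketch that feeds the compiler a block indented with a tab on one line and spaces on the next. The source string is made up for the demonstration.

```python
# One block, indented with a tab on the first line and eight spaces on the second.
source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(source, "<example>", "exec")
except TabError as error:
    # TabError is a subclass of IndentationError, which is a subclass of SyntaxError.
    outcome = f"{type(error).__name__}: {error.msg}"
else:
    outcome = "compiled"
```

The error names the problem precisely, and an editor configured to insert spaces avoids it entirely.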

Bottom line: Without concrete PEPs to fix things, or examples of what better might look like, this is click-bait whining.

Starting from C# or Java to locate deficiencies is just as wrong as starting from Dartmouth Basic or FORTH as the standard against which Python is measured.

Tuesday, October 19, 2021

Code so bad it causes me physical pain

Here's the code.

def get_categories(file):
    """
    Get categories.
    """
    verify_file(file)

    categories = set()

    with open(file, "r") as cat_file:
        while line := cat_file.readline().rstrip():
            categories.add(line)

    return categories

To me this was terrible. Truly and deeply horrifying. Let me count the ways.

The docstring repeats the name of the function providing no additional information.
The verify_file() function checks are pure, useless LBYL code. It seemed designed to map a lot of detailed exceptions to a RuntimeError. Which is misleading.
The use of while and readline() to iterate through the lines of a file is -- I guess -- reasonable if we're working in Pascal or Modula-2. But we're not. Use of the walrus operator isn't really getting any bonus points because -- well -- this is terrible.
While pathlib is used elsewhere in this module, it's not used here. This function works with a filename string, assigned to the file parameter.

Actually, taking a step back, it's not that the author is being malicious. They just missed all the features of files and sets. And -- somehow -- were able to learn about the walrus operator while never figuring out how files work.

This is something like:

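from pathlib import Path

source = Path("some_file.txt")
with source.open() as source_file:
    # Strip the trailing newline from each line, as the original did with rstrip().
    categories = {line.rstrip() for line in source_file}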

And that's it.

It Gets Worse

This was part of some category mapping application.

They've got a CSV file with some string values. And they want to map those string values to summary category values.

Most folks think of a dictionary for a mapping from one string to another string.

The code I was sent -- I kid you not -- used a list of two-tuples. I'll repeat that for those who are skimming. It used A LIST OF TWO-TUPLES INSTEAD OF A DICTIONARY. It used a colossally bad search through an unsorted list of tuples to find matches. (The only search that would have been worse was random probes instead of iteration.)

It really did.

I can't even show you that code, it's such a horrifyingly bad design.

They had a question. Was the looping over a list of two-tuples inefficient? That's why they asked for help.

It was like they had never heard of a dictionary. Nor seen a tutorial with a dictionary. Nor read a book that mentioned dictionaries. They had managed to learn enough Python to see the walrus operator without hearing of dictionaries.

A list of two-tuples, when provided to the dict() function, will make a dictionary. They were ignorant of this.

A dictionary does O(1) lookups and avoids looping over a list of two-tuples. This was a mystery to them.
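Since I can't show the real code, here's a small sketch of the difference, using made-up mapping data. The pairs list and the linear_lookup() function are hypothetical stand-ins for what the application did.

```python
# Hypothetical mapping data; the real application's values are not shown here.
pairs = [
    ("apple", "fruit"),
    ("carrot", "vegetable"),
    ("salmon", "fish"),
]


def linear_lookup(key, pairs):
    """The list-of-tuples approach: scan until a match turns up."""
    for name, category in pairs:
        if name == key:
            return category
    return None


# dict() accepts the list of two-tuples directly.
category_map = dict(pairs)

slow = linear_lookup("salmon", pairs)
fast = category_map.get("salmon")
```

Both produce the same answer. The scan touches every pair on a miss; the dictionary doesn't.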

When someone doesn't know the Python dictionary exists, what is the appropriate response?

How can you politely say "Find another tutorial and do the ENTIRE thing all of it!"

That's Not All

There's this nugget of "You can't be serious."

category_counts = {element: 0 for element in categories}

And

category_counts[category] += 1

Yes. They used a dictionary to count instances of the categories. They did not understand collections.defaultdict or collections.Counter. But they understood a dictionary well enough to use it here. But not use it elsewhere for the central functionality of the app.
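For comparison, a minimal sketch of the collections.Counter version. The observed list is invented sample data standing in for whatever the mapping step produced.

```python
from collections import Counter

# Hypothetical category values, as they might come out of the mapping step.
observed = ["fruit", "fish", "fruit", "vegetable", "fruit"]

# No need to pre-populate zeros; missing keys count as 0.
category_counts = Counter(observed)
top = category_counts.most_common(1)
```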

So. They couldn't use a dictionary, but could use a dictionary.

They couldn't use the csv module, so they wrote their own (bad) CSV parser.
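What the csv module would have bought them, sketched with made-up data. The column names, the rows, and the category_map values are all assumptions for this example; io.StringIO stands in for an open file.

```python
import csv
import io
from collections import Counter

# Hypothetical file content; note the commas inside the quoted fields.
raw = io.StringIO(
    "item,notes\n"
    'apple,"crisp, red"\n'
    'salmon,"wild, fresh"\n'
    "apple,green\n"
)
category_map = {"apple": "fruit", "salmon": "fish"}

rows = list(csv.DictReader(raw))
counts = Counter(category_map.get(row["item"], "unknown") for row in rows)
```

The quoted commas are exactly the kind of thing a hand-rolled split(",") parser gets wrong.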

It's almost impossible to write a polite code review.

Tuesday, October 12, 2021

Legacy Software is a Sticky Mess

I'll get to legacy software. First, however, some backstory on observability.

Sailors will sometimes create "Float Plans". Like aircraft flight plans, they have an itinerary to make it slightly easier to find us when something goes wrong. Unlike airspace, which is tightly controlled by the FAA, the seas are more-or-less chaos.

The practice, then, is to create a float plan and give it to a trusted shore-side party, go out sailing, check in periodically, and cancel the whole thing when you're done sailing. If you miss a check-in, they can call an appropriate Search-And-Rescue agency like the US Coast Guard or BASRA or local cops with jurisdiction over a lake or river.

How much detail should be in this plan? For a long or complex trip, it doesn't seem sensible to say "Going to the Bahamas" as your float plan. That's a little thin on details. The bare minimum is to provide an Estimated Time of Arrival (ETA). But. When you summarize 36 hours of sailing to a single ETA, you invite observability problems. It's a sailboat, and you could be becalmed. Things are fine, you're just going to be late.

Late, of course, is relative. Simply late means you missed your ETA. If you're becalmed to the point where you're running low on supplies, then this can become a bit of a problem.

The general policy followed by SAR is to allow several hours past the ETA before activating SAR resources. (The US Coast Guard announces overdue mariners on the VHF radio so others can keep a lookout for them and render assistance.)

If you have a one-checkin-plan that summarizes 36 hours of sailing with a single ETA, you're going to be waiting for many hours after the ETA for help. So. Total systems failure after the first hour means 35 hours of drifting before someone will even alert SAR folks. And then the SAR folks will wait several hours after the ETA in case you're only slow.

What seems better is to have a sequence of waypoints with ETA's at each waypoint. That way you have incremental evidence of success or failure, and you're not waiting a LOOOONG time for your one-and-only ETA to pass without a check-in.

This leads us to software. And legacy software.

Creating the Plan

To create a sensible plan, you have waypoints as Latitude, Longitude pairs. These are angles on a sphere, not distances on a plane, so computing the length of a leg isn't a simple hypotenuse.

It is a lot like a hypotenuse. For short distances, we can assume the earth is more-or-less flat. We can then use a relatively simple conversion (cosine of the latitude) to compress the longitudes toward the poles. We can convert lat and lon to distances and use a hypotenuse and get a really close answer.

def range_bearing(p1: LatLon, p2: LatLon, R: float = NM) -> tuple[float, Angle]:
    """Rhumb-line course from :py:data:`p1` to :py:data:`p2`.

    See :ref:`calc.range_bearing`.
    This is the equirectangular approximation.
    Without even the minimal corrections for non-spherical Earth.

    :param p1: a :py:class:`LatLon` starting point
    :param p2: a :py:class:`LatLon` ending point
    :param R: radius of the earth in appropriate units;
        default is nautical miles.
        Values include :py:data:`KM` for kilometers,
        :py:data:`MI` for statute miles and :py:data:`NM` for nautical miles.
    :returns: 2-tuple of range and bearing from p1 to p2.

    """
    d_NS = R * (p2.lat.radians - p1.lat.radians)
    d_EW = (
        R
        * math.cos((p2.lat.radians + p1.lat.radians) / 2)
        * (p2.lon.radians - p1.lon.radians)
    )
    d = math.hypot(d_NS, d_EW)
    tc = math.atan2(d_EW, d_NS) % (2 * math.pi)
    theta = Angle(tc)
    return d, theta

This means we can't trivially write down a list of waypoints. We need to do some fancy math to compute distances.
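To make the arithmetic concrete without the LatLon and Angle classes, here's a self-contained sketch of the same equirectangular calculation using plain degrees. The function name leg_range_bearing() and the value of the NM constant are assumptions made for this example; they aren't taken from navtools.

```python
import math

# Assumed mean earth radius in nautical miles (roughly 6371 km / 1.852).
NM = 3440.07


def leg_range_bearing(lat1, lon1, lat2, lon2, R=NM):
    """Equirectangular range and true bearing (in degrees) between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_NS = R * (phi2 - phi1)
    # Longitude differences shrink by the cosine of the (mean) latitude.
    d_EW = R * math.cos((phi1 + phi2) / 2) * math.radians(lon2 - lon1)
    distance = math.hypot(d_NS, d_EW)
    bearing = math.degrees(math.atan2(d_EW, d_NS)) % 360
    return distance, bearing


# One minute of latitude due north should be very close to one nautical mile.
distance, bearing = leg_range_bearing(25.0, -77.0, 25.0 + 1 / 60, -77.0)
```

This is a sanity check more than anything: a leg of one minute of latitude comes out at about one nautical mile, heading north.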

For years and years. (Since our first "big" trip in 2007.) I've used spreadsheets in various forms to work out the waypoints, distances, estimated time enroute (ETE) and ETA.

The math isn't too far beyond what a spreadsheet can do. But. There's a complication.

Complications

File formats are a complication.

There are KML files, GPX files, and CSV files that are used by various pieces of software. This is only the tip of the iceberg, because some Navionics devices have an even more interesting USR file that contains everything in your chartplotter. It's cool. But complicated.

The file formats are -- clearly -- way outside the box for a spreadsheet.

Python to the rescue.

Since I'm a Python hack (and have been since well before 2007) I've got all kinds of file conversion tools. See https://github.com/slott56/navtools.
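As a flavor of what this kind of conversion involves, here's a minimal sketch that pulls route points out of a GPX 1.1 document with the standard library. This is not the navtools code; the route_points() function and the sample route (names and positions) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written GPX 1.1 route; the waypoint names and positions are made up.
GPX_TEXT = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <rte>
    <name>Sample</name>
    <rtept lat="25.7617" lon="-80.1918"><name>Start</name></rtept>
    <rtept lat="25.7267" lon="-79.2970"><name>Finish</name></rtept>
  </rte>
</gpx>
"""

NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def route_points(gpx_text):
    """Return (name, lat, lon) tuples for every route point in the document."""
    root = ET.fromstring(gpx_text)
    points = []
    for rtept in root.findall("gpx:rte/gpx:rtept", NS):
        name = rtept.findtext("gpx:name", default="", namespaces=NS)
        points.append((name, float(rtept.get("lat")), float(rtept.get("lon"))))
    return points


points = route_points(GPX_TEXT)
```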

But.

And here's where legacy enters the picture. (Music Cue.)

Fear that rattles in men's ears
And rears its hideous head
Dread ... Death ... in the wind ...

Spreadsheets.

Up until yesterday, the final planning tool was a spreadsheet with waypoints and times. Mac OS X Numbers is GREAT for this. I can pile in boat information, crew information, safety information, the itinerary, and SAR contact details in one spreadsheet, save it as a PDF, and email it to my shore-side contacts.

The BEST part of this was tinkering with the departure time while we waited for weather. We could plug in the day we're leaving, get revised ETA's for the waypoints, push the document, and take off.

(We use an old Spot Navigator to provide notifications at midnight to show progress. We're going to upgrade to a SpotX so we can send messages a little more flexibly.)

The Legacy Spreadsheet

The legacy spreadsheet has a lot of good UX features. It's really adequate for some user stories. Save as PDF rocks.

However.

For the more advanced route planning, it isn't ideal. Specifically, spreadsheets can be weak on multiple "what-if" scenarios.

The genesis of spreadsheets (I'm old, I was there, I remember VisiCalc) was "what-if" analysis. Change an assumption and follow the consequences through the lattice of dependent cells. These are hard to save. You can "Save As" to make a copy of the spreadsheet. You can save pages within a single spreadsheet. These are terrible because you can't really make a more fundamental change very easily. You have to make the same change to all the copies in your pile of "what-if" alternatives.

To be very specific. I often need to plan for different boat speeds. We have a sailboat; wind and water matter a lot. Slow is about 5 knots. Fast is about 6 knots. Our theoretical top speed is 8 knots, but we've rarely seen that without a river flowing along with us. Sailing at that speed means a lot of sail wrestling, something we'd rather not do.

Fine. That's 3 scenarios, one for each speed: 5, 5.5, and 6. No big deal.
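In code, the three scenarios are one function applied to three speeds. A minimal sketch, assuming a made-up list of legs and a made-up departure time (the itinerary() function and all the values are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical legs: (waypoint name, distance in nautical miles from the previous point).
legs = [("Inlet", 12.0), ("Mid-stream", 30.0), ("Harbor", 18.0)]
departure = datetime(2021, 10, 12, 6, 0)


def itinerary(legs, departure, speed_knots):
    """Return (name, ETA) pairs, accumulating time enroute leg by leg."""
    eta = departure
    plan = []
    for name, distance in legs:
        eta += timedelta(hours=distance / speed_knots)
        plan.append((name, eta))
    return plan


# One itinerary per speed assumption, all built from the same list of legs.
scenarios = {speed: itinerary(legs, departure, speed) for speed in (5.0, 5.5, 6.0)}
```

Adding or moving a waypoint means changing the legs list once; all three itineraries follow.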

Until we add a waypoint. Or move a waypoint. Now we have to reset all three spreadsheets with a different itinerary. Since it's a different number of rows, we have the usual copy-and-paste problems in spreadsheets.

What's Better?

Jupyter notebooks crush the life out of spreadsheets.

Here's the revised workflow.

Create the route. Use tools like OpenCPN so the route can be exported as a GPX or CSV file.
Use a notebook to parse the route file, creating an internal Route object.
Manipulate the Route object, providing different ETA's and speed assumptions. These assumptions lead to multiple cells in the notebook. They can all share details so that one fundamental change leads to lots and lots of recomputation of itineraries. We can include all kinds of headings and markdown notes and thoughts and considerations.
Finalize a route that's part of the plan. Still working in the confines of a longish notebook.
Emit a Markdown file with Vessel Identification, Itinerary, Notes, and SAR Contact sections. Run pandoc to make a PDF. (This is the foundation for the nbconvert utility.)
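The last step of the workflow can be sketched as a function that builds the Markdown text. The float_plan_markdown() function, the layout of the document, and every value passed to it are assumptions for this example, not my actual notebook.

```python
from datetime import datetime


def float_plan_markdown(vessel, itinerary, notes, sar_contact):
    """Build a Markdown float plan with the four sections named in the workflow."""
    lines = [
        "# Float Plan", "",
        "## Vessel Identification", "", vessel, "",
        "## Itinerary", "",
        "| Waypoint | ETA |",
        "|---|---|",
    ]
    lines += [f"| {name} | {eta:%Y-%m-%d %H:%M} |" for name, eta in itinerary]
    lines += ["", "## Notes", "", notes, "", "## SAR Contact", "", sar_contact, ""]
    return "\n".join(lines)


# Hypothetical values, for illustration only.
text = float_plan_markdown(
    vessel="S/V Example, 42-foot sloop",
    itinerary=[
        ("Inlet", datetime(2021, 10, 12, 8, 0)),
        ("Harbor", datetime(2021, 10, 12, 16, 0)),
    ],
    notes="Check-in at each waypoint.",
    sar_contact="Shore-side contact calls the appropriate SAR agency.",
)

# Writing text to float_plan.md and then running
#     pandoc float_plan.md -o float_plan.pdf
# produces the PDF, provided pandoc and a PDF engine are installed.
```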

This workflow creates two categories of results:

One result is a Notebook with all of the planning details and thoughts and contingencies and considerations.

The other result(s) are float plan documents as PDF's that can be shared widely.

Why did this take so long?

I used spreadsheets from 2007 to 2021. Why switch now? Some reasons.

Legacy solutions are sticky. This has a lot of consequences. I built up "expertise" in making the legacy work. I had become an "expert" in working around the hinky little problems with multiple what-if scenarios and propagating changes from the route into the what-ifs. For example, I limited the number of what-if scenarios I would consider because more than two got confusing.

New solutions are sometimes invisible. I only learned about Jupyter Notebooks about three years ago. I did not realize how powerful they were. I've since rearranged my thinking.

I've known about RST and Markdown and Pandoc for years. But. Getting from spreadsheet-like flexibility to a Markdown document was never a clear step. Without something like Jupyter Lab.

Pulling it all together

Does it require some kind of catalyst to force change?

Is it a slow accretion of evidence that the legacy software isn't working?

I'm pretty sure I had a long, slow Aha! moment as I realized that the Numbers spreadsheet was a large pain in the ass and a notebook would be simpler. It took a few days of fiddling to become really, really sure Numbers was not working out.

I think one of the biggest issues was a third "what-if" scenario. It was helpful to visualize arrival times. But. It was a huge pain in the neck to fiddle with the spreadsheets to get the right waypoints in there and summarize the alternatives.

I think the lesson here is to avoid automating anything unless you actually are the user.

If an organization wants software, a developer needs to do the job manually to *really* understand what the pain points are. Users develop expertise in the wrong things. And they want automation where the benefits are minor. Automating the spreadsheet-to-PDF is wrong. Replacing the spreadsheet is right.

Tuesday, October 5, 2021

New to Python -- How to manage architecture choices

This is a problem folks new to Python have, and sometimes can't articulate that they have it.

They don't know which package is the "right" one to use. Tox vs. Nox? Click vs. Argparse vs. getopt? What's the "best" choice?

Response 1. The whole Python ecosystem is chaos and the language is just a "toy". You don't have this many choices with (pick your language: e.g., Go or Rust or Scala).
Response 2. We need a way to make architectural choices that the team understands and can use.

Response 1 is remarkably common. It's hard to argue against. If someone thinks innovation is chaos, they -- perhaps -- shouldn't be in technology to begin with. Innovation IS chaos. That's the essential definition!

However, they may be a project owner (or the manager of an old-school waterfall-style project) or -- worse -- someone responsible for architecture, and complain about chaos. If so, they're not really cut out for managing rapid technological change, and they need to be bypassed.

Yes. Bypassed. Ignore them. Go to their meetings. Nod politely when they rant about chaos. Then build working software. Eventually, they'll grow to understand that a large ecosystem is NOT chaos. Rapid innovation is not chaos. They may come to understand that filters are required to reject some of the noise that comes from innovation.

Response 2 -- How do we make choices?

Glad you asked.

I have seen four common approaches.

HiPPO: The Highest-Paid Person's Opinion.
Tech Oracle.
HashTAG: Hyperconnected And Socially Helpful Tech Advisory Group.
Peer Pressure.

Let's look at each of these.

HiPPO

The Highest-Paid Person's Opinion isn't easy to dismiss. They're an executive or the product owner and they think their position in the company gives them a magical ability to somehow predict the technical shortcomings of a component or a framework or a language.

Once upon a time, when all components were licensed, someone negotiated contracts for support and training. The contracts (and negotiations) were a Big Sweaty Deal (BSD™). The HiPPO needed to justify all the time and money spent with the vendor. Okay. Sure. Then their opinion on continuing to invest in a losing proposition makes a lot of sense. Since they've already spent money with the vendor, they'd like us to continue to spend money with the vendor, even if the vendor's product isn't really very good.

Those times are past. Most everything is open source nowadays, and we pay for support reluctantly. We often have POC's to choose among alternatives. We can fire a vendor quickly. We don't invest heavily in inking a deal. (In the olden days, I got lots of plane rides and hotels from vendors. Getting to a deal was fun back then.)

The HiPPO needs to be informed that their opinion isn't helpful unless they can back it up with a POC. If they can't supply the POC, then the technical folks will keep arguing until they have competing POC's to help make a technical choice.

With good languages (like Python) and large ecosystems, POC's are cheap insurance to back up an opinion.

Tech Oracle

The Tech Oracle is expected to provide an opinion on everything. In many cases, this can work.

If the architecture is reasonably well known to the Oracle, then picking open-source projects to help build a solution isn't too difficult. Filters like "date of last update" and "volume of changes on GitHub" can be useful ways for the Oracle to locate better components and frameworks.

The Oracle should be producing POC's. This makes it hard for them also to produce production code. Not impossible, but hard. Their role isn't quite the same as other devs, since they have to provide up-front justification before too much real work is invested.

If the Oracle can't provide POC's, that's a bit of a problem. I've met architects who don't code. I couldn't find a way to trust them. Yes, they may know a lot. They're wonderfully articulate. Great slide decks. Good choices of lunch places where they try to influence you. But... I don't trust architects who don't code. Sorry. Personal weakness.

Architecture diagrams are an essential work product in addition to POC's. Usually, they're focused on a specific project, rather than providing general-purpose guidance. Generally, the overall ecosystem moves so quickly that the idea of a general-purpose, one-size-fits-all architecture isn't a good idea.

HashTAG

The Hyperconnected And Socially Helpful Tech Advisory Group is often a really good thing. It's best when there are multiple teams who need to coordinate. It can be slow-ish, however, and time needs to be invested in this. TAG meetings deserve stories on the storyboard. TAG time needs to be prioritized above individual team needs.

A TAG needs to look at choices, and publish recommendations. That often means reviewing POC's. And that means folks have to take POC's to the TAG for them to weigh in on the difficult-to-quantify "better solution".

These are interesting demos. The TAG should be looking at the same (or similar) functionality from competing POC's to render a final, binding judgement. There needs to be an agenda, strict time-lines for the presentation, and a final -- almost objective -- score-card to show the elements of the final decision.

Decisions are an essential work product. Published. Socialized. Well-known. Easy-to-find. A whole GitHub repo with decisions is essential.

Architecture diagrams are also an essential work product. These should provide general-purpose guidance. A team should be able to start with one of these, eliminate the parts they don't need, plug in their product name, and move forward quickly.

Peer Pressure

This is the HashTAG reduced to a single team. Given a choice, the members of the team need to look at filters the way the Tech Oracle should. They need to weigh things like More stars in GitHub? Fewer bug reports? Documentation? And they need to capture the decision in something more than a conversation.

If it's hard to reach consensus, this means the team has to commit to dueling POC's. This needs to be time-boxed work. It's enough of a POC to show how competing libraries or frameworks *could* be used in the implementation. It's important not to run down the road to a candidate implementation. The POC should point the implementers in the right direction.

(A candidate implementation becomes a kind of fait accompli: "I've already built it, we might as well use it." This dilutes consensus in favor of fast coding.)

Ideally, the POC shows what code could look like. It might include benchmarks. Test cases. Concrete things that can be compared -- line by line if necessary -- to show some measurable aspect of "better."

The decision and the diagram are part of the team's legacy. They have to live with the code. The number of decisions that get redebated after a few sprints needs to be minimal. It's never zero, but the team needs to put stories on the board for finalizing tech documentation with architectural decisions, reasons, and links to the POC that backs up the decision.

Wait. What about Python?

This, clearly, has nothing to do with Python.

The vastness and rapidity of change in the Python ecosystem surfaces a need for some kind of formal decision-making process.

But Python isn't the cause of the problem. All open source software moves quickly. A popular language like Python has more potential sources of confusion than a more specialized language/framework like R.

Embrace the community nature of decision-making. Python is about community building and collective solutions to difficult problems.

But. All those Proofs of Concept...

Yes, there will be POC's. In the case of a HashTAG or TechOracle, these need to be preserved and maintained and upgraded all the time. It's real work. It's a lot of real work.

Remember, the Python ecosystem moves rapidly. There's a lot of innovation, and it needs to be actively tracked. (Unlike the olden days where a C compiler update was an annual affair buried in an annual OS upgrade.)

This leads to defining projects via project templates. See https://cookiecutter.readthedocs.io/en/1.7.2/ for a good approach to this. You want to create cookie cutters that include enough skeleton code that you can run a complete 100% code coverage unit test.

You can then use tox (or nox) to define your component and framework versions as variant virtual environments. As components evolve, you update the versions and rerun your test suite. You can publish internal update trackers for project teams to make sure they're testing with the latest-and-greatest environments.

You'll also have to watch Python version changes. These can creep up on organizations. The PEP's and the schedules need to be central to folks using Python. See https://endoflife.date/python for a handy visualization.

The Billboard

Enterprise developers all discover that there's no way to share code easily within an enterprise. Everyone is isolated in their teams, and each team winds up reinventing some wheel or other. It's been an ongoing problem since IT organizations grew beyond a single team.

Python is no different. Teams solving related problems don't talk enough. If you have lots of meetings to share things, no real work gets done.

Python uses a Package Index to track popular useful packages. Visit https://pypi.org if you haven't seen it yet. You have three paths forward in an enterprise.

- Your own PyPI. This is easy and fun. You can have the internal PyPI shadow the global PyPI.

- Use JFrog Artifactory. https://jfrog.com/artifact-management/ This involves spending money to track in-house artifacts as well as global PyPI artifacts.

- A GitHub Billboard organization. This is an organization that serves as a place to post links to other repos. It needs a little bit of curation. As an implementation, the organization's repositories are a lot of small project advertisements. The degenerate case is a README.md. A better case is the POC repo showing how to use the real project. In the middle is a cookie cutter. This is your in-house advertising. It's relatively easy to search because you're looking at one organization's list of repositories. Each is a pithy, focused summary of another project. Choose names that reflect why someone wants to look more deeply at the project.

The point here is to embrace the chaos that stems from innovation and make it visible.

S.Lott-Software Architect

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.