Tuesday, May 26, 2015

Regular Expression "Hell"

Actual quote: "they spend a lot of time maintaining regular expressions. So, what are the alternatives to regular expression hell?"

Regular Expression Hell? It's a thing?

I have several thoughts:
  1. Do you have metrics to support "a lot"?  I doubt it. It's very difficult to tease RE maintenance away from code maintenance. Unless you have RE specialists. Maybe there's an RE organization that parallels the DBA org. DBA's write SQL. RE specialists write RE's. If that was true, I could see that you would have metrics, and could justify "a lot." Otherwise, I suspect this is hyperbole. There's frustration, true.
  2. REs are essential to programming.  It's hard to express how fundamental they are. I would suggest that programmers who have serious trouble with RE's have serious trouble with other aspects of the craft, and might need remedial training in RE's (and other things.) There's no shame in getting some training. There are a lot of books that can help. Claiming that there's no time for training (or no budget) is what created RE Hell to begin with. It's a trivial problem to solve. You can spend 16 hours fumbling around, or stop fumbling, spend 16 hours learning, and then press forward with a new skill. The choice is yours.
  3. REs are simply a variant on conventional set theory. They're not hard at all. Set theory is essential to programming, so are RE's. It's as fundamental as boolean algebra. It's as fundamental as getting a loop to terminate properly. It's as fundamental as copy-and-paste from the terminal window. 
  4. REs are universal because they solve a number of problems better than any other technology. Emphasis on better than ANY alternative. RE's are baked into the syntax of languages like awk and perl. They're universal because no one has ever built a sensible alternative. If you want to see even more baked-in regular expression goodness, learn SNOBOL4.
REs are essential. Failure to master REs suggests failure to learn the fundamentals.

RE Hell is like Boolean Algebra Hell. It's like Set Theory Hell. It's like Math Library Hell. It's like Uninitialized Variables Hell. These are things you create through a kind of intentional ignorance.

I'm sorry to sound harsh. But I'm unsympathetic.

The initial regex in question? r"[\( | \$ | \/ |]". This indicates a certain lack of familiarity with the basics. It looks like it started as r"\(|\$|/" and someone put in spaces (perhaps they intended to use the verbose option when compiling it) and/or wrapped the whole in []'s. After trying the []'s, it appeared to work and they called it done.

The email asked (sort of trivially) if it was true that the last pipe was extraneous. Um. Yes. But.
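Why does the last pipe matter less than the rest of the pattern? Because inside a character class, | and spaces are ordinary characters. A quick demonstration (the sample strings are invented for illustration):

```python
import re

# The pattern as asked about: a character class containing ( $ / plus
# the spaces and | characters that were (probably) meant as alternation.
messy = re.compile(r"[\( | \$ | \/ |]")
# What was (probably) intended: match any one of ( $ /
clean = re.compile(r"[($/]")

print(messy.search("a|b"))  # matches the | -- almost certainly a surprise
print(messy.search("a b"))  # matches the space -- also a surprise
print(clean.search("a|b"))  # None
print(clean.search("x(y"))  # matches the (
```

Removing the last pipe changes nothing visible, because the class still contains two other pipes and three spaces, all matched literally.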

Follow-up

The hard parts are (1) trying to figure out what the question really is. Why did they remove just the last pipe character? What were they trying to do? What's the goal? Then (2) trying to figure out how much tutorial background is required to successfully answer whatever question is really being asked. A response of r"[\(\$/]" seems like it might not actually be helpful. Acting as a magic oracle that emits mysterious answers would only perpetuate the reigning state of confusion.

The follow-up requests for clarification resulted in (1) an exhaustive list of every book that seems to mention regex, and (2) a user story pitched far above the level of the regex questions. It's difficult to help when there's no focus. Every Book. Generalized "matching" of "data."

The Python connection? Can't completely parse that out, either. It appears that this is part of an ETL pipeline. I can't be sure because the initial user story made so little sense.

Attempts to discuss the supplied user story about "matching" and "data" -- predictably -- led nowhere. It stopped at "Some of the problems ... aren’t just typos and misspellings." Wait. What? What are they then? If they're not misspellings, what are they? Fraud? Hacking attempts? Denial of Service attacks by tying up time in some matching algorithm?

It's a misspelling. It can't be anything else. Ending the conversation by claiming otherwise is a strange and self-defeating approach to redesigning the software.

More Follow-up

At this point, we seem to be narrowing the domain of discussion to "As time goes on, we have accumulated a lot of the 'standard mistakes'. The question that need help w/ [sic] is how to manage all the code for these 'common mistakes'?" This question was provided in lieu of an actual user story. Lack of a story might mean that we're not interested in actually solving the data matching problem. Instead we're narrowly focused on sprinkling Faerie Dust all over the regexes to make them behave better.

They don't want an alternative to regexes because the problems "aren't just typos and misspellings." They want the regex without the regex hell. 

Tuesday, May 19, 2015

More Thoughts on the friction of DevOps

Read this: How 'DevOps' is Killing the Developer

My pull-out quote:
This is why we see so many developers that can't pass FizzBuzz: they never really had to write any code.
I agree: It appears that DevOps may be more symptom than solution.

I have one tiny objection to an otherwise excellent series of points: I don't like the totem pole analogy.

I prefer a supply-chain:
  • Release Engineers respond to user needs.
  • Quality Engineers respond to the Release Engineers' needs for assurance that something is fit for use.
  • Developers respond to Release Engineers by providing software.
  • Similarly, procurement folks may purchase or lease or download and pay royalties for software. 
I think of it like this:

Developer ⇒ QE ⇒ RE ⇒ Users

No top-to-bottom. More a sequence of more-or-less peers.

I still agree with the central tenet: a developer is able to march the software from concept to user. We don't really expect QE or RE to create software. We might expect some skill sharing between QE and RE.

Many years ago, I posted this: IT’s Drive to Self-Destruction, which is random and whiny but related to this point about DevOps. The idea is that key developers create competitive advantage. Release Engineers put it in the hands of users. Both are important. Without creation there's no deployment. Without dedicated deployment folks, creators can be diverted to do the deploying; deployment still goes forward, but more slowly.

The key point is this:
If a developer is spending time with DevOps (and TechOps) trying to get stuff deployed, who's developing the Next Big Thing? 

Tuesday, May 12, 2015

Class Design Strategies -- analysis vs. synthesis

The conventional wisdom on class design is to model real-world things. This is the Domain-Driven Design Approach. It's what we used to teach as Rumbaugh's OMT (prior to the Three Amigos creating UML.)

The idea is simple: Look at the real world objects. Describe them.

Classes will have attributes and behaviors. They will have relationships. Rumbaugh was very careful about keeping object and class separate. A class had associations; an object had links. The association was the abstraction, the link the concrete implementation. A class offered an operation; an object provided a method as the implementation.

As powerful as this is, I'm not sure it's the final word.

The only problem it has is that people often get confused by "real world objects." I've seen a number of places where folks completely fail to distinguish Enterprise IT implementation choices from actual things that reflect actual objects in the actual world.

Users and Credentials, for example. Users are real human beings. You find them in the hallways, standing around. They take up space in conference rooms. Credentials are a peculiar security-focused way to summarize a person as something public (username) and something private (password.) You don't find a stack of credentials tying up a room at the end of the hallway. Indeed, you can't physically stack credentials. While something a user knows is important, it isn't the entirety of a User. The attributes and behaviors of credentials aren't a good model for a User. But you still have this argument periodically when developing a class model or a noSQL database document model.
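A minimal sketch of the distinction, with invented attribute names: credentials summarize a person for security purposes, but a User *has* credentials; a User isn't credentials.

```python
from dataclasses import dataclass

@dataclass
class Credentials:
    username: str       # something public
    password_hash: str  # something private (never the plaintext)

@dataclass
class User:
    name: str
    department: str
    credentials: Credentials  # a security summary, not the whole person

# The real-world person carries the credentials; the credentials
# don't stand in for the person.
u = User("Pat", "Actuarial", Credentials("pat42", "9f8a..."))
```

Keeping the two classes separate keeps the domain model honest: the User class can grow hallway-and-conference-room attributes without dragging security plumbing along.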

I'd like to emphasize that this is -- as far as I can tell -- the only problem with domain driven modeling. Some people don't see the domain very clearly because they tend to stick to a technology-driven world view.

However. That doesn't mean that drawing on the white board is the only way to discover the domain.

Building Classes from Functions

As a heretical alternative, allow me to suggest an alternative to the whiteboard.

Once upon a time, the whiteboard was the only way to do object modeling. The successor to the whiteboard (I use Argo UML as well as Pocketworks yuml) is a diagramming tool that -- ideally -- helps you understand the domain before committing to the cost and complexity of writing code.

Wait a second, though.

The "cost and complexity of writing code"? Java programmers know what I mean. If you don't have your classes understood, you should not start slapping code together.

Python programmers have no idea what "cost and complexity of writing code" means. They slap classes together faster than I can draw the damn pictures on Argo.

Indeed, the pictures can become a kind of burden. The picture shows "x.X", therefore, the module must include "x.X". Even though there might be a better way using classes in separate modules "a.Y" and "b.Z". But changing the cluster of pictures that comprises a fairly complete UML diagram isn't easy.

[Clearly, this depends on how much you tried to show. If your diagrams are really spare, refactoring is no problem. If you include parts of the object model in the component diagram or activity diagram, you're in trouble.]

This leads to an alternative to the whiteboard. And the diagramming tool.

Code. [Cue Orchestral Hit: Ta-daaa!]

Yes. Code. [Cue Orchestral Hit: Ta-ta-ta-daaa!]

When you can slap together a spike solution in Python you have a sensible alternative to the whiteboard.

You can build some classes, write some demonstration code to show how they work together. Don't like it? Start again from another base of classes. You can do this as a Mob Programming exercise. It fits somewhere between grooming a story and finishing an MVP release. Indeed, it may be a good way to do specific, concrete grooming.

In some cases, though, you can't build classes. You don't really know (or can't agree) on what the real world things are.

Rather than debate, shift the focus. Just write functions.

In Python, this is easy, since functions are first-class inhabitants of the programming model. In Java, this isn't easy at all. Functions aren't proper things: they must be part of a class, and you can't agree on what the classes are -- the Java stalemate. [Yes, Java 8 introduces free-standing functions.]
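"First-class" means functions can be stored, passed, and composed like any other value. A tiny sketch (the function names are illustrative only):

```python
# Behavior first, classes later: a pipeline built from plain functions.
def clean(text):
    return text.strip().lower()

def tokenize(text):
    return text.split()

# Functions sit in a list like any other value.
pipeline = [clean, tokenize]

value = "  Hello World  "
for step in pipeline:
    value = step(value)
print(value)  # ['hello', 'world']
```

No class design debate was required to get working, testable behavior.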

How This Works In Practice

In many cases, it makes sense to punt on the "big picture." You're not really sure what you even have.  Yes, you know you have eight individual CSV files that reflect events that happened somewhere in cyberspace. (Let's just say they were the output from stored procedure triggers; the only record of changes made to crucial data.)

You can wring your hands over the eight-table join required to reconstruct the pre-change and post-change views of the objects. You can wring your hands over the way it's really three (or is it four?) different navigation paths from I to II to IV to (V to VI to VII) union I to II to IV to (V to VI to VII) union I to III to IV to oh my god I'm so confused.

Or.

You can get the sample data.  You can read it using the CSV module.

DictReader can be awkward. It can be fixed, however. If your column titles are legal Python identifiers, you can use this to create a namespace reader from a DictReader. This allows you to say row.ATTRIBUTE instead of row['ATTRIBUTE'].

from types import SimpleNamespace

def nsreader(dictreader):
    # Wrap each row dict in a SimpleNamespace for attribute access.
    return (SimpleNamespace(**row) for row in dictreader)

We can then turn to working out the various join algorithms on real data. Each step builds objects based on types.SimpleNamespace.

You start with the simplest possible join algorithm: load everything into dictionaries based on their keys.

I_map = { row.KEY: row for row in nsreader(table_I_dict_reader) }
II_map = { row.SOMETHING: row for row in nsreader(table_II_dict_reader) }

Once you have sample data in memory, you can figure out what the actual, working relationships are. You can tinker with navigation paths through the tangled mess of tables. You can explore that data. You can do data profiling to find out how many misses there are.
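A runnable sketch of the load-and-explore approach. The column names (KEY, I_KEY, DETAIL) and the inline sample data are invented stand-ins for the real CSV extracts:

```python
import csv
import io
from types import SimpleNamespace

def nsreader(dictreader):
    return (SimpleNamespace(**row) for row in dictreader)

# Stand-ins for two of the eight CSV files.
table_I = io.StringIO("KEY,NAME\n1,alpha\n2,beta\n")
table_II = io.StringIO("I_KEY,DETAIL\n1,x\n1,y\n9,orphan\n")

# Load table I into a dictionary keyed by its key.
I_map = {row.KEY: row for row in nsreader(csv.DictReader(table_I))}

# Navigate II -> I and profile the misses.
hits = misses = 0
for row in nsreader(csv.DictReader(table_II)):
    if row.I_KEY in I_map:
        hits += 1
    else:
        misses += 1
print(hits, misses)  # 2 1
```

The hit/miss counts are exactly the data profiling mentioned above: you learn which navigation paths actually hold up before committing to a design.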

If the tables are smallish (10,000 rows each) it all fits in memory nicely. No need for fancy database connections and no need to reason out join algorithms that don't tie up too much memory. You're not writing a database server. You're writing an application.

Look For Common Features

The design issue for classes is locating common features: common attributes or common methods. We often start down the road of common attributes. Because. Well... it seems logical.

Focus on attributes is a bias.

Classification of objects isn't based mostly on attributes. It isn't even a 50-50 split between attributes and behaviors.

We tend to focus on attributes -- I think -- out of habit. Data structures mean "common data", right? Databases include tables of commonly-structured data.

But this isn't a requirement -- nor is it even important. It's just a habit.

We can conceive of a class hierarchy based around common behavior, too. This may require a very flexible collection of attributes. On the other hand, there's no a priori reason not to define classes based on their behavior.

That's why the idea of building functions first doesn't seem too far-fetched.

First, we can build working functions.  We can have test cases and everything.

Then we can look for commonality. We can refactor into classes. We can start with a Flyweight design pattern. As common attributes emerge, we can refactor to store more state in the class, and less state somewhere else. The API changes while we do this.

Then we examine it for the "is this a thing" criteria. Last, not first. We may need to make a few more tweaks to reflect the thing we discovered scattered around the functions. The thing may be a checklist or a recipe or a procedure: something active instead of simply stateful.
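The refactoring path above can be sketched like this; the function names, class, and row attributes are invented for illustration:

```python
from types import SimpleNamespace

# Step one: free functions over a namespace row. Test them, use them.
def full_name(row):
    return f"{row.FIRST} {row.LAST}"

def is_active(row):
    return row.STATUS == "A"

# Step two: the commonality is visible, so the same behavior migrates
# into a class -- the "thing" discovered scattered around the functions.
class Person:
    def __init__(self, row):
        self.first, self.last, self.status = row.FIRST, row.LAST, row.STATUS

    def full_name(self):
        return f"{self.first} {self.last}"

    def is_active(self):
        return self.status == "A"

row = SimpleNamespace(FIRST="Ada", LAST="Lovelace", STATUS="A")
assert full_name(row) == Person(row).full_name()
```

The class is judged by the "is this a thing" criterion last, after the behavior has already proven itself.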

This tends to make RESTful web services a bit of a head scratcher. If we have an active thing, what is the state that we can represent and transfer? The state may be very small; the active agency may be quite sophisticated. This shouldn't be too baffling, but it can be confusing when the GET request response is either 200 or 403/409: OK or Forbidden/Conflict. Or there are multiple shades of 200: 200 OK with a body that indicates success, vs. 200 OK with a body that indicates something more needs to be done, vs. warnings, vs. exceptions, vs. other nuanced results.

Summary -- tl;dr

I think there's a place for code-first design. Build something to explore the space and learn about the problem domain. Refactor. Or Delete and Start Again. In modern languages (i.e., Python) code is cheap. Whiteboard design may not actually save effort.

I think there's a place for building functions and refactoring them into classes. I think the Java pre-8 "All Classes or Burn In Hell" approach is misleading. Functional programming and languages like Python show that functions should be a first-class part of programming.

I think there's too much emphasis on stateful objects. The DDD warnings about "anemic" classes seem to come from a habitual over-emphasis on state and an under-emphasis on operations. I think that active classes (as much as they push the REST envelope) might be a good thing.


Tuesday, May 5, 2015

Scrum, Agile, and Modern Tools

Required Reading: https://www.pandastrike.com/posts/20150304-agile

My takeaway quote? "Scrum lags behind the modern toolchain enough that there can be a Potemkin village vibe to the whole thing."

I was clued into this from another takeaway quote someone seconded on Twitter: "Waterfall used too much written communication, but Agile doesn't use enough."

Also read this: http://caines.ca/blog/2014/12/02/i-dont-miss-the-sprint/

Is "sprint" misleading? What about "sprint commitment?"

I'm not sure I object to "sprint" per se.

But I have seen "sprint commitment" turned into an organizational problem, removing what could have been a helpful tool. Folks who start harping on sprint commitments in the sense of "we committed to this, will we meet the deadline?" tend to create a toxic environment. I think the people who hype commitment the most really liked the non-Agile environments: they try to bend Agile to meet their Waterfall concepts.

The problem is the word. A "sprint commitment" shouldn't be used like a legally binding "do it or pay penalties" commitment. It should be a metric used to gauge progress. More like a "sprint outcome".

The commitment hype can lead to stories, epics and detailed technical tasks getting muddied up terribly. The story becomes an epic. Little tiny technical tasks get inflated into big important stories. A proper user story gets replaced with nonsense about prepping a database for production rollout, or resolving defects found in QA, or things that -- obviously -- aren't user stories, but are taking up a lot of time.

When it appears that a story is going nowhere, the scrum master breaks it down into things whose status changes frequently. The sense of end-user meaning behind the actual story gets lost in a haze of technical considerations and tasks that show activity rather than accomplishment or value.

"As an actuary, I want to know that the developers have written syntactically correct DML for my database, so that the product owner doesn't have to wait as long for the DBA's to build the database."

Really?