Tuesday, February 11, 2020

Interesting Data Restructuring Problem

This seemed like an interesting problem. I hope this isn't someone's take-home homework or an interview question. It seemed organic enough when I found out about it.

Given a document like this...

doc = {
    "key": "the key",
    "tag1": ["list", "of", "values"],
    "tag2": ["another", "list", "here"],
    "tag3": ["lorem", "ipsum", "dolor"],
}


We want a document like this...

doc = {
    "key": "the key",
    "values": [
        {"tag1": "list", "tag2": "another", "tag3": "lorem"},
        {"tag1": "of", "tag2": "list", "tag3": "ipsum"},
        {"tag1": "values", "tag2": "here", "tag3": "dolor"},
    ]
}


In effect, rotating the structure from Dict[str, List[Any]] to List[Dict[str, Any]].
Bonus, we need to limiting the rotation to those keys with a value of List[Any], ignoring keys with atomic values (int, str, etc.).

Step 1. Key Partitioning

We need to distinguish the keys to be rotated from the other keys in the dict.
We start with Dict[str, Union[List[Any], Any]]. We need to distinguish the two subtypes in the union.

from itertools import filterfalse
list_of_values = lambda x: isinstance(doc[x], list)
lov_keys = list(filter(list_of_values, doc.keys()))
non_lov_keys = list(filterfalse(list_of_values, doc.keys()))

This gets two disjoint subsets of keys: those which have a list and all the others. The others, presumably, are strings or integers or something irrelevant.

List lengths

There's no requirement for the lists to be the same lengths. We have three choices here:
  • insist on uniformity,
  • truncate the long ones,
  • pad the short ones.

We'll opt for uniformity in this example. Truncating is what zip() normally does. Padding is what itertools.ziplongest() does.

lengths = (len(doc[k]) for k in lov_keys)
sample = next(lengths)
assert all(l == sample for l in lengths), "Inconsistent lengths"

Some folks don't like using assert for this. This can be a more elaborate if-raise ValueError() if that's necessary.

Use zip() to merge data values

We have several List[Any] instances in the document. The intermediate goal is a List[Tuple[Any, ...]] structure where the items from each tuple are chosen from the source lists. This gets us a sequence of tuples that have parallel selections of items from each of the source lists.

The zip(list, list) function produces pairs from each of the two lists. In our case, we have n lists in the original document. A zip(*lists) will produce a sequence of items selected from each list.

Here's what it looks like:

list(zip(*(doc[k] for k in lov_keys)))

We can also use zip(key-list, value-list) to make a list of key-value pairs from a tuple of the keys and a tuple of values. zip(Tuple[Any, ...], Typle[Any, ...]]) gives us a List[Tuple[Any, Any]] structure. These objects can be turned into dictionaries with the dict() function.

It looks like this:

list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))

Assemble the parts

The final document, then, is built from untouched keys and touched keys.

d1 = {
    k: doc[k] for k in non_lov_keys
}
d2 = {
    "values": list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))
}
d1.update(d2)

It might be slightly easier to "somehow" build this as s single dictionary, but the two subsets of keys make it seem more sensible to build the resulting document in two parts.

The code I was asked to comment on was quite complex. It built a large number of intermediate structures rather than building a List[Dict] using a list comprehension.

What's important about this problem is the complexity of the list comprehension. In particular, the keys are used twice in the comprehension. One use extracts the source lists from the original document. The second use attaches the key to each value from the original list.

It almost seems like the Python 3.8 "Walrus" operator might be a handy way to shrink this code down from about 14 lines. I'm not sure it's helpful to make this any shorter. Indeed, I'm not 100% sure this compact form is really optimal. The fact that I had to expand things as part of an explanation suggests that separate lines of code are as important as separate subsections of this blog post.

Tuesday, February 4, 2020

Dictionary clear() as a code smell

Using the clear() method of a dict isn't *wrong*. But. The reasons have to be investigated. I got a question about this code not working "properly." ("Properly"? Seems too vague to be useful.)

Here's a summary of the example.

final_list = []
temp_dict = {}
for obj in some_source:
    cool_function(obj, temp_dict)
    final_list.append(temp_dict)
    temp_dict.clear()  # Ready for reuse, right?


This can't work.

(Bonus points if you suspect that list.append() is a smell, too. There may be a list comprehension solution that's tidier than this.)

It's not always easy to get to a succinct statement of what doesn't work "properly," or what's confusing about the Python list structure. Getting useful information can be hard. Why?
  • Some programmers are "Assumptions First" kind of people, and their complaint is often "doesn't match my assumption" not "doesn't actually work."
  • Some people live in "All Details Matter" world. Rather than create the smallest example of code that's confusing, they send the *entire* project. The problem is buried in a log, wrapped with "Why is the list of dictionaries not being properly updated?" In an email that provides background details. For a Trello story that links to background details. Details. None of which point to the problem. 

"Properly?" What does that even mean?

 Confronted with hundreds of lines of impenetrable code, I asked for a definition of "properly" and got these exact seven words: "Properly is defined as correctly or satisfactorily."

So... 

They have no idea what's wrong, can't summarize the code that's broken, and it's my fault because I'm the Python guru.

Why Won't My Code Work?

The short answer is "Because You're Making an Assumption."

Of course, anyone who puts their assumptions first is as blind to their assumptions as we are to the air that surrounds us. Assumptions are just there. All around them. They breathe their assumptions in and out without seeing them.

The long answer is Python uses references.

If you apply the id() function to the items in the resulting final_list, you'll see that it's reference after reference of one object, temp_dict.  Not copies of individually populated dictionaries, but multiple references to the same dictionary. The same dictionary which was cleared and reloaded over and over again.

The very first log, crammed with useless details, had output from print() functions. It showed multiple copies of the same dict. 

Because they assumed Python is making copies, there was no explanation for why the list of dictionaries was broken. Clearly, it couldn't be in their code. They assume their code is correct. The only choice has to be an undocumented mystery in Python. And I'm the Python guru, so it's my problem.

The presence of duplicates in the output meant "something" to them. They could point it out as somehow wrong. But the idea that their assumptions might be wrong? That was a nope.

They wanted it to be the list object, final_list, which didn't append dictionaries the way they assumed it would. They needed it to be a Python internals problem. They needed it to be a bad documentation problem. (Seriously. These convos have spun out of control in the past.)

tl;dr

Using the clear() method of a dict may indicate the developer is hoping Python shares copies, not references. Either add an explicit copy() (or deepcopy.copy()) or fix things to create new, fresh dictionaries each time. Objects are cheap. Why reuse them?

(Indeed, an interesting side-bar question I did not ask is "In what god-forsaken programming language does this 'clear-and-reuse' a data structure even make sense? FORTRAN?)

The list comprehension solution to this problem will have to wait. Stay tuned. I want to disentangle the algorithmic design problem from the "why aren't my assumptions correct?" problem..