Wednesday, December 4, 2019

Creating Palindromes -- if possible -- from a string of letters.

This can be an interesting exercise. I think it is something that can help people learn to code well. I found this in the LinkedIn Python community:  https://www.linkedin.com/groups/25827/

The Palindrome Problem:
Make a function that makes a palindrome out of the letters in a string and
returns -1 if this is not possible.
Convert a list of strings with the function.
Some test cases:

>>> palify('eedd')
'edde' (or 'deed')
>>> palify('wgerar')
>>> palify('uiuiqii')
'uiiqiiu' or several similar variants.



Let's not get too carried away. I like *some* of this problem.

I don't like the idea of Union[str, int] as a return type from this function. Yes, it's valid Python, but it seems like a code smell. Since the intent is to build lists, a None would be more sensible than a number; we'd have Optional[str] which seems better overall.

The solution that was posted was interesting. It did way too much work, but it was acceptable-looking Python. (It started with a big block comment with "#" on each line instead of a docstring, so... there were minor style problems, but otherwise, it was not bad.)

Here's what popped into my head, to act as a concrete response to the request for comments.

"""
Make a function that makes a palindrome out of the letters in a string and
returns -1 if this is not possible.
Convert a list of strings with the function.
Some test cases:

>>> palify('eedd')
'edde'
>>> palify('wgerar')
>>> palify('uiuiqii')
'uiiqiiu'
"""
from typing import Optional, Set


def palify(source: str) -> Optional[str]:
    """Core palindromic conversion."""
    singletons: Set[str] = set()
    pairs = list()
    for c in source:
        if c in singletons:
            pairs.append(c)
            singletons.remove(c)
        else:
            singletons.add(c)

    if pairs and len(singletons) <= 1:
        # presuming a single letter can't be palindromic.
        return ''.join(pairs+list(singletons)+pairs[::-1])
    return None

if __name__ == "__main__":
    s = ['eedd', 'wgerar', 'uiuiqii']
    p = list(map(palify, s))
    print(f"from {s=}, we get {p=}")


The core problem statement is interesting. And the ancillary requirement is almost as interesting as the problem.

The simple-seeming "Make a palindrome out of the letters of the string" has two parts. First, there's the question of "can it even become a palindrome"? Which implies validating the source data against some set of rules. After that, we have to emit one of the many possible palindromes from the source material.

The original post had a complicated survey of the data. This was followed by an elegant way of creating a palindrome from the survey data. Since we're looking for a bunch of pairs and a singleton, I elided the more complex survey and opted to collect pairs and singletons into two separate collections.

When we've consumed the input, we will have partitioned the characters into their two pools and we can decide if the pools have the right sizes to proceed. The emission of the palindrome is a lazy assembly of the resulting data, first as a list, and then transformed to a single string.

The ancillary requirement is interesting in its own right. When a bundle of letters can't form a palindrome, that seems like a ValueError exception to me. Doing bulk transformations in the presence of ValueErrors seems wrong-ish. I already griped about the -1 response above: it seems very bad. A None is less bad than -1. An Exception, however, seems like a more right thing to do.
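
To make that trade-off concrete, here's a minimal sketch built on the palify() above; palify_or_raise() and palify_bulk() are invented names, not part of the original problem:

from typing import List, Optional

def palify_or_raise(source: str) -> str:
    """Raise ValueError instead of returning None."""
    result = palify(source)
    if result is None:
        raise ValueError(f"no palindrome possible from {source!r}")
    return result

def palify_bulk(strings: List[str]) -> List[Optional[str]]:
    """Bulk conversion has to absorb the exceptions somewhere."""
    results: List[Optional[str]] = []
    for s in strings:
        try:
            results.append(palify_or_raise(s))
        except ValueError:
            results.append(None)
    return results

The exception makes the single-string case cleaner, but the bulk conversion ends up re-introducing the None (or skipping the item), which is exactly the wrong-ish feeling described above.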

Code Review Response

I think my response to the original code should be follow-up questions on why a defaultdict(int) was used to survey the data in the first place. A Counter() is a better idea, and requires less code.

The survey involved trying to locate singletons -- a laudable goal. There may have been a better approach to looking for the presence of a singleton letter in the Counter values.
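
Here's a minimal sketch of that Counter-based survey; palify_counter() is an invented name, and it keeps the same presumption that a single letter can't be palindromic:

from collections import Counter
from typing import Optional

def palify_counter(source: str) -> Optional[str]:
    """Counter-based survey: a palindrome allows at most one odd letter count."""
    counts = Counter(source)
    odd = [letter for letter, n in counts.items() if n % 2 == 1]
    if len(odd) > 1 or len(source) < 2:
        return None
    half = ''.join(letter * (n // 2) for letter, n in counts.items())
    middle = odd[0] if odd else ''
    return half + middle + half[::-1]

The len(odd) > 1 test is the entire validation step; the emission builds half of the palindrome from the counts and mirrors it around the optional singleton.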

More fundamentally, there are only a few states for each letter. There are two stark algorithmic choices: a structure keyed by letter, or collections of letters. I've shown the collections of letters; the Counter sketch above is the structure keyed by letter. The student response used a structure keyed by letter.

I think this problem serves as a good discussion for algorithmic alternatives. The core problem of detecting the possibility of palindromicity for a bunch of letters is cool. There are two choices. The handling of the exceptional case (-1, None or ValueError) is another bundle of choices.

Tuesday, December 3, 2019

Functional programming design pattern: Nested Iterators == Flattening

Here's a functional programming design pattern I uncovered. This may not be news to you, but it was a surprise to me. It cropped up when looking at something that needs parallelization to reduce the elapsed run time.

Consider this data collection process.

for h in some_high_level_collection(arg1):
    for l in h.some_low_level_collection(arg2):
        if some_filter(l):
            logger.info("Processing %s %s", h, l)
            some_function(h, l)

This is pretty common in the devops world. You might be looking at all repositories in all GitHub organizations. You might be looking at all keys in all AWS S3 buckets under a specific account. You might be looking at all tables owned by all schemas in a database.

It's helpful -- for the moment -- to stay away from taller tree structures like the file system. Traversing the file system involves recursion, and the pattern is slightly different there. We'll get to it, but what made this clear to me was a "simpler" walk through a two-layer hierarchy. 

The nested for-statements aren't really ideal. We can't apply any itertools techniques here. We can't trivially change this to use a multiprocessing Pool.map().

In fact, the more we look at this, the worse it is.

Here's something that's a little easier to work with:

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            if some_filter(l):
                logger.info("Processing %s %s", h, l)
                yield h, l

# starmap() is lazy; wrap it in list() (or iterate it) so the work actually happens.
list(itertools.starmap(some_function, h_l_iter(arg1, arg2)))

The data gathering has expanded to a few more lines of code. It gained a lot of flexibility. Once we have something that can be used with starmap, it can also be used with other itertools functions to do additional processing steps without breaking the loops into horrible pieces.
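
For example -- a sketch in the same stand-in style as the code above -- the flat iterator composes directly with other itertools functions:

import itertools

pairs = h_l_iter(arg1, arg2)                      # the flat (h, l) stream
first_100 = itertools.islice(pairs, 100)          # an extra processing step
results = list(itertools.starmap(some_function, first_100))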

I think the pattern here is a kind of "Flattened Map" transformation. The initial design, with nested loops wrapping a process wasn't a good plan. A better plan is to think of the nested loops as a way to flatten the two tiers of the hierarchy into a single iterator. Then a mapping can be applied to process each item from that flat iterator.

Extracting the Filter

We can now tease apart the nested loops to expose the filter. In the version above, the body of the h_l_iter() function binds log-writing with the yield. If we take those two apart, we gain the flexibility of being able to change the filter (or the logging) without an awfully complex rewrite.

import itertools
from typing import Iterable, Iterator, TypeVar

T = TypeVar('T')
def logging_iter(source: Iterable[T]) -> Iterator[T]:
    for item in source:
        logger.info("Processing %s", item)
        yield item

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            yield h, l

raw_data = h_l_iter(arg1, arg2)
filtered_subset = logging_iter(filter(some_filter, raw_data))
# Again, starmap() is lazy; consuming it here does the work.
list(itertools.starmap(some_function, filtered_subset))

Yes, this is still longer, but all of the details are now exposed in a way that lets me change filters without further breakage.

Now, I can introduce various forms of multiprocessing to improve concurrency.
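
For instance -- a minimal sketch, reusing the stand-in names from above and assuming the work in some_function() is I/O-bound (GitHub, S3, or database calls) -- concurrent.futures can consume the flat iterator directly:

from concurrent.futures import ThreadPoolExecutor

raw_data = h_l_iter(arg1, arg2)
filtered_subset = logging_iter(filter(some_filter, raw_data))

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() consumes the flat iterator; the lambda unpacks each (h, l) pair.
    results = list(pool.map(lambda pair: some_function(*pair), filtered_subset))

For CPU-bound work, a ProcessPoolExecutor is the analogous choice, with the usual caveat that the function and the items have to be picklable.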

This transformed a hard-wired set of nested loops, an if-statement, and a function evaluation into a "Flattener" that can be combined with off-the-shelf filtering and mapping functions.

I've snuck in a kind of "tee" operation that writes an iterable sequence to a log. This can be injected at any point in the processing.

Logging the entire "item" value isn't really a great idea. Another mapping is required to create sensible log messages from each item. I've left that out to keep this exposition more focused.

I'm sure others have seen this pattern, but it was eye-opening to me.

Full Flattening

The body of h_l_iter() is really just a generator expression; a full function isn't needed.

h_l_iter = (
    (h, l)
    for h in some_high_level_collection(arg1)
    for l in h.some_low_level_collection(arg2)
)

This simplification doesn't add much value, but it seems to be a general truth. In Python, it's a small change in syntax and therefore an easy optimization to make.

What About The File System?

When we're working with a more deeply nested structure, like the file system, we'll make a small change. We'll replace the h_l_iter() function with a recursive_walk() function.

from pathlib import Path
from typing import Iterator

def recursive_walk(path: Path) -> Iterator[Path]:
    for item in path.iterdir():
        if item.is_file():
            yield item
        elif item.is_dir():
            yield from recursive_walk(item)

This function has, effectively, the same signature as h_l_iter(). It walks a complex structure, yielding a flat sequence of items. The other functions used for filtering, logging, and processing don't change, allowing us to build new features from various combinations of these functions.
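
Here's a hedged sketch of that reuse; is_python_file() and count_lines() are invented stand-ins for the filter and the per-item processing:

from pathlib import Path

def is_python_file(path: Path) -> bool:
    return path.suffix == ".py"

def count_lines(path: Path) -> int:
    return len(path.read_text().splitlines())

paths = recursive_walk(Path.cwd())
interesting = logging_iter(filter(is_python_file, paths))
line_counts = list(map(count_lines, interesting))

Since each item is a single Path, a plain map() replaces starmap() here; the filtering and logging pieces are unchanged.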

tl;dr

The too-long version of this is:

Replace for item in iter: process(item) with map(process, iter).

This pattern works for simple, flat items, nested structures, and even recursively-defined trees. It introduces flexibility with no real cost.

The other pattern in play is:

Any for item in iter: for sub-item in item: ... processing is "flattening" a hierarchy into a sequence. Replace it with (sub_item for item in iter for sub_item in item).

These felt like blinding revelations to me.

Tuesday, November 26, 2019

Refactoring

Follow my Patreon: Become a Patron!

I'll try to focus on my Building Skills in OO Design book there. I'm thinking of adding some more code examples. Is that a good idea?

Maybe that should be the higher-level Patreon benefit?

Tuesday, November 19, 2019

Python 3.8 features

Real Python (@realpython)
📺🐍 Cool New Features in Python 3.8
realpython.com/courses/cool-n…

The Cool New Features in Python 3.8 lists some of the things that could become part of The Modern Python Cookbook 2e. I'm working with Packt on a plan for creating a new edition focused on Python 3.8 features, and using mypy to validate the type hints.

I'm learning a lot about my previously sketchy designs and potential problems with some of them. There are a number of things in Python which "work" in a vague hand-wavey way. But they don't work in a "I can convince mypy this will work" way. 

The additional "convince mypy" rigor can separate potentially sketchy design from an unassailable design.


Tuesday, October 29, 2019

Building Skills in OO Design

See https://www.patreon.com/posts/30995708

I've (finally) gotten the book content upgraded to Python 3.7.

I've also deleted all the previous versions of the book. I had been keeping them on my web server because -- well -- because I don't know why. They go back to at least 2011, some of the content may be even older than that.

I've also deleted some previous self-published content.

(I started writing about Python almost 20 years ago. Some of the content could have been that old. It deserves to be deleted.)


Tuesday, October 22, 2019

State Change and NoSQL Databases

Let's take another look at F. L. Stevens's spreadsheet of agencies and agents. It's -- of course -- an unholy mess. Why? It's difficult to handle state change and deduplication.

Let's look at state changes.

The author needs URLs and names and a list of genres the agent is interested in. This is more-or-less static data. It changes rarely. What changes more often is an agent being closed or open to queries.

Another state change is the query itself. Once the email has been sent, the agent (and their agency) should not be bothered again for at least sixty days. After an explicit rejection, there's little point in making any contact with the agent; they're effectively out of the market for a given manuscript.

There are some other stateful rules; we don't need all the details to see the potential complexities here.

A spreadsheet presents a particularly odious non-solution to the problem of state and state change. There's good and there's bad. Mostly bad.
  • On the good side, you can edit a single cell, changing the state. You can define a drop-down list of states, or radio buttons with alternative states. 
  • On the bad side, you're often limited to editing a single cell when you want to change the state. You want dates filled in automatically on state change. You want a history of state changes. Excel hackers try to write macros to automate filling in the date. History, however... History is a problem.
We can try to spread history across the row. This rapidly becomes horrifying -- the rows are uneven in length, breaking a First Normal Form rule for spreadsheets.

We can try to spread history down the rows of a column. Wow this is bad. We can try to use the hierarchy features to make history a bunch of folded-up details underneath a heading row. This is microscopically better, but still difficult to manage with all the unfolding and folding required to change state after a rejection.

We can blow up a single cell to have non-atomic data -- all of the history with events and dates in a long, ";" delimited list.

There's no good way to represent this in a spreadsheet.

What to do?

The relational database people love the master-detail relationship. Agency has Agent. Agent has History. The history is a bunch of rows in the history table, with a foreign key relationship with the agent.

The rigidity of the SQL schema is a barrier here. We're dealing with some sloppy data handling practices in the legacy spreadsheet. We don't want to have to tweak the SQL each time we find some new subtlety that's poorly represented in the spreadsheet data.

We're also handling a number of data sources, each with a unique schema. We need a way to unify these flexibly, so we can fold in additional data sources, once the broken spreadsheet is behind us.

(There are a yet more problems with the relational model in general, those are material for a separate blog post. For now, the rigidity and complexity are a big enough pair of problems.)

SQL is Out. What Else?

A document store is pretty nice for this.  The rest of this section is an indictment of SQL. Feel free to skip it. It's widely known, and well supported elsewhere.

We have an Agency as the primary document. Within an Agency, there are a number of individual Agents. Within each agent is a series of Events. Some Agents aren't even interested in the genre F. L. Stevens writes, so they're closed. Some Agents are temporarily closed. The rest are open.

The author can get a list of open agents, following a number of rules, including waiting after the last contact, and avoiding working with multiple agents within a single agency. After sending query letters, the event history gets an entry, and those agents are in another state, query pending.

One common complaint I hear about a document store is the "cost" of updating a large-ish document. The implicit assumption seems to be that an update operation can't locate the relevant sub-document and can't make incremental changes. Having worked with both SQL and NoSQL, this "cost of document update" seems to be immeasurably small.
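
For example -- a minimal sketch, assuming the Agency documents described below live in a MongoDB collection; the database, collection, key, and field names here are illustrative, not part of the original design:

from datetime import date
from pymongo import MongoClient

agencies = MongoClient().writers.agencies

# Push one new Event into one Agent's history inside one Agency document.
# The server locates the sub-document; the client doesn't rewrite the whole thing.
agencies.update_one(
    {"name": "Example Literary"},
    {"$push": {"agents.jdoe.events": {"status": "query sent",
                                      "date": date.today().isoformat()}}},
)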

Another cluster of common questions hovers around locking and concurrency. Most of them are nonsensical because they come from the world of fragmented data in a SQL database. When the relevant object (i.e. Agency) is spread over a lot of rows of several tables, locking is essential. When the relevant object is a single document, locks aren't as important. If two people are updating the same document at the same time, that's a document design issue, or a control issue in the application.

Finally, there are questions about "update anomalies." This is a sensible question. In the relational world, we often have shared "lookup" data. A single change to a lookup row will have a ripple effect to all rows using the lookup row's foreign key.

Think of changing zip code 12345 from Schenectady, NY to Scotia, NY. Everyone sharing the foreign key reference via the zip code has been moved with a single update. Except, of course, nothing is visible until a query reconstructs the desired document from the fragmented pieces.

We've traded a rare sweeping update across many documents for a sweeping, complex join operation to build the relevant document from the normalized pieces. Queries are expensive, complex, and often wrong. They're so painful, we use ORMs to mask the queries and give us the documents we wanted all along.

What's It Look Like?

This:

import datetime
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Agency:
    """A collection of individual agents."""
    name : str
    url : Optional[str] = field(default=None)
    agents : Dict[str, 'Agent'] = field(init=False, default_factory=dict)

@dataclass
class Agent:
    """An Agent with a sequence of events: actions and state changes."""
    name : str
    url : str
    email : str
    fiction_genres : List[str]
    query_details : str = field(default_factory=str)
    events : List['Event'] = field(init=False, default_factory=list)

@dataclass
class Event:
    """An action or state change.
    status = 'open', 'closed', 'query sent', 'query outcome', 'closed until', etc.

    Depending on the status, there may be additional details.
    For 'query sent', there's 'date'.
    For 'query outcome', there's 'outcome' and an optional 'date'.
    for 'closed until', there's 'reason' and an optional 'date'.
    """
    status : str
    date : Optional[datetime.date] = field(default=None)
    outcome : Optional[str] = field(default=None)
    reason : Optional[str] = field(default=None)

    def __repr__(self):
        return f"{self.status} {self.date} {self.outcome} {self.reason}"


We have three classes here. Agency is the parent document. Each Agency contains one or more Agent instances. Each Agent contains one or more Events.

When we fetch an agent's data, we fetch the entire agency, since the "business" rules preclude querying more than one agent in an agency. The queries involve a nuanced state change: a rejection by one agent opens another in the same agency. Rather than doing additional SQL-style queries to locate the parent and the other children of that parent, we just read the whole thing at once.
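
Here's a hedged sketch of how one of these documents might be assembled and serialized for a document store; the agency, agent, and event values are invented examples:

import datetime
import json
from dataclasses import asdict

agency = Agency(name="Example Literary", url="https://example.com")
agent = Agent(
    name="Jane Doe",
    url="https://example.com/jane",
    email="jane@example.com",
    fiction_genres=["mystery", "thriller"],
)
agent.events.append(Event(status="query sent", date=datetime.date.today()))
agency.agents[agent.name] = agent

# One write saves the whole Agency -- agents, events, and all.
print(json.dumps(asdict(agency), default=str, indent=2))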

In later posts, we'll look at deduplication and some other processing. But this seems to be all the schema we'll ever need. The type hints provide mypy with some evidence of what we intend to do with these documents.