Tuesday, October 29, 2019
See https://www.patreon.com/posts/30995708
I've (finally) gotten the book content upgraded to Python 3.7.
I've also deleted all the previous versions of the book. I had been keeping them on my web server because -- well -- because I don't know why. They go back to at least 2011, some of the content may be even older than that.
I've also deleted some previous self-published content.
(I started writing about Python almost 20 years ago. Some of the content could have been that old. It deserves to be deleted.)
Tuesday, October 22, 2019
State Change and NoSQL Databases
Let's take another look at F. L. Stevens's spreadsheet of agencies and agents. It's -- of course -- an unholy mess. Why? It's difficult to handle state change and deduplication.
Let's look at state changes.
The author needs URLs, names, and a list of genres each agent is interested in. This is more-or-less static data. It changes rarely. What changes more often is an agent being closed or open to queries.
Another state change is the query itself. Once the email has been sent, the agent (and their agency) should not be bothered again for at least sixty days. After an explicit rejection, there's little point in making any contact with the agent; they're effectively out of the market for a given manuscript.
There are other stateful rules; we don't need all the details to see the potential complexity here.

A spreadsheet presents a particularly odious non-solution to the problem of state and state change. There's a good side and a bad side. Mostly bad.
- On the good side, you can edit a single cell, changing the state. You can define a drop-down list of states, or radio buttons with alternative states.
- On the bad side, you're often limited to editing a single cell when you want to change the state. You want dates filled in automatically on state change. You want a history of state changes. Excel hackers try to write macros to automate filling in the date. History, however... History is a problem.

We can try to spread history down the rows of a column. Wow, this is bad. We can try to use the hierarchy features to make history a bunch of folded-up details underneath a heading row. This is microscopically better, but still difficult to manage, with all the unfolding and folding required to change state after a rejection.

We can blow up a single cell to hold non-atomic data -- all of the history, with events and dates in a long ";"-delimited list.
There's no good way to represent this in a spreadsheet.
What to do?
The relational database people love the master-detail relationship. Agency has Agent. Agent has History. The history is a bunch of rows in the history table, with a foreign key referring to the agent.

The rigidity of the SQL schema is a barrier here. We're dealing with some sloppy data-handling practices in the legacy spreadsheet. We don't want to have to tweak the SQL each time we find some new subtlety that's poorly represented in the spreadsheet data.
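For concreteness, here's a minimal sketch of that master-detail design in sqlite3. The table and column names are my assumptions for illustration; nothing here comes from the original spreadsheet.

import sqlite3

# Hypothetical master-detail schema: Agency has Agent, Agent has History.
SCHEMA = """
CREATE TABLE agency (
    agency_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url TEXT
);
CREATE TABLE agent (
    agent_id INTEGER PRIMARY KEY,
    agency_id INTEGER NOT NULL REFERENCES agency,
    name TEXT NOT NULL,
    email TEXT
);
CREATE TABLE history (
    event_id INTEGER PRIMARY KEY,
    agent_id INTEGER NOT NULL REFERENCES agent,
    status TEXT NOT NULL,  -- 'open', 'closed', 'query sent', ...
    date TEXT
);
"""

connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)

Every new subtlety discovered in the spreadsheet means another ALTER TABLE here, which is exactly the rigidity problem.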
We're also handling a number of data sources, each with a unique schema. We need a way to unify these flexibly, so we can fold in additional data sources, once the broken spreadsheet is behind us.
(There are yet more problems with the relational model in general; those are material for a separate blog post. For now, the rigidity and complexity are a big enough pair of problems.)
SQL is Out. What Else?
A document store is pretty nice for this. (The rest of this section is an indictment of SQL. Feel free to skip it; the argument is widely known and well supported elsewhere.)

We have an Agency as the primary document. Within an Agency, there are a number of individual Agents. Within each Agent is a series of Events. Some Agents aren't even interested in the genre F. L. Stevens writes, so they're closed. Some Agents are temporarily closed. The rest are open.
The author can get a list of open agents, following a number of rules, including waiting after the last contact and avoiding working with multiple agents within a single agency. After sending query letters, the event history gets an entry, and those agents move to another state: query pending.
One common complaint I hear about a document store is the "cost" of updating a large-ish document. The implicit assumption seems to be that an update operation can't locate the relevant sub-document and can't make incremental changes. Having worked with both SQL and NoSQL, I find this "cost of document update" too small to measure.
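For what it's worth, here's a minimal sketch of such an incremental sub-document update, assuming a MongoDB-style store accessed through pymongo; the post doesn't name a product, so the collection layout is my assumption.

from pymongo import MongoClient

# Assumption: one document per Agency in an "agencies" collection.
client = MongoClient()
agencies = client.writing.agencies

# Locate one agent inside one agency document and append an event
# in place; the "$" positional operator targets the matched agent.
agencies.update_one(
    {"name": "Example Agency", "agents.name": "Example Agent"},
    {"$push": {"agents.$.events": {"status": "query sent", "date": "2019-10-22"}}},
)

The server locates the sub-document and appends to it; the client never rewrites the whole document.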
Another cluster of questions hovers around locking and concurrency. Most of them are nonsensical because they come from the world of fragmented data in a SQL database. When the relevant object (e.g., an Agency) is spread over many rows of several tables, locking is essential. When the relevant object is a single document, locks aren't as important. If two people are updating the same document at the same time, that's a document design issue, or a control issue in the application.
Finally, there are questions about "update anomalies." This is a sensible concern. In the relational world, we often have shared "lookup" data. A single change to a lookup row has a ripple effect on all rows using the lookup row's foreign key.
Think of changing zip code 12345 from Schenectady, NY to Scotia, NY. Everyone sharing the foreign key reference via the zip code has been moved with a single update. Except, of course, nothing is visible until a query reconstructs the desired document from the fragmented pieces.
We've traded a rare sweeping update across many documents for a sweeping, complex join operation to build the relevant document from the normalized pieces. Queries are expensive, complex, and often wrong. They're so painful, we use ORMs to mask the queries and give us the documents we wanted all along.
What's It Look Like?
This:

import datetime
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Agency:
    """A collection of individual agents."""
    name: str
    url: Optional[str] = field(default=None)
    agents: Dict[str, 'Agent'] = field(init=False, default_factory=dict)

@dataclass
class Agent:
    """An Agent with a sequence of events: actions and state changes."""
    name: str
    url: str
    email: str
    fiction_genres: List[str]
    query_details: str = field(default_factory=str)
    events: List['Event'] = field(init=False, default_factory=list)

@dataclass
class Event:
    """An action or state change.

    status = 'open', 'closed', 'query sent', 'query outcome', 'closed until', etc.
    Depending on the status, there may be additional details.
    For 'query sent', there's 'date'.
    For 'query outcome', there's 'outcome' and an optional 'date'.
    For 'closed until', there's 'reason' and an optional 'date'.
    """
    status: str
    date: Optional[datetime.date] = field(default=None)
    outcome: Optional[str] = field(default=None)
    reason: Optional[str] = field(default=None)

    def __repr__(self):
        return f"{self.status} {self.date} {self.outcome} {self.reason}"
We have three classes here. Agency is the parent document. Each Agency contains one or more Agent instances. Each Agent contains one or more Events.
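A quick usage sketch, with invented names, shows the nesting:

agency = Agency(name="Example Literary", url="https://example.com")
agent = Agent(
    name="A. Reader",
    url="https://example.com/a-reader",
    email="a.reader@example.com",
    fiction_genres=["mystery", "thriller"],
)
agency.agents[agent.name] = agent
agent.events.append(Event(status="open"))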
When we fetch an agent's data, we fetch the entire agency, since the "business" rules preclude querying more than one agent in an agency. The queries involve a nuanced state change: a rejection by one agent opens another in the same agency. Rather than do additional SQL queries to locate the parent and the parent's other children, we just read the whole thing at once.
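Here's a hedged sketch of that rule, using the dataclasses above; the helper names and the idea that openness is derived by scanning events are my assumptions, not the author's implementation.

import datetime

def record_rejection(agency: Agency, agent_name: str) -> None:
    """Mark one agent's query as rejected. No explicit unlock step
    is needed: openness within the agency is derived from events."""
    agency.agents[agent_name].events.append(
        Event(status="query outcome", outcome="rejected",
              date=datetime.date.today())
    )

def agency_is_free(agency: Agency) -> bool:
    """True when no agent in the agency has a pending query."""
    return not any(
        agent.events and agent.events[-1].status == "query sent"
        for agent in agency.agents.values()
    )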
In later posts, we'll look at deduplication and some other processing. But this seems to be all the schema we'll ever need. The type hints give mypy some evidence of what we intend to do with these documents.
Tuesday, October 15, 2019
Apple's Numbers and the All-in-One CSV export
Author F. L. Stevens has a hellishly complex (and irregular) spreadsheet with agents, agencies, and query status. (This is how fiction gets marketed: querying agents.) The spreadsheet has become unmanageably complex, with multiple pages. Each page has multiple tables. Buried in this are three "interesting" tables with agent query information.
Can we talk about drama? This is the dark night of the soul for anyone interested in regular, normalized data.
We have some fundamental choices for working with this mess:
1. Export each relevant table to a separate file. Lots of manual pointy-clicky and opportunities for making mistakes.
2. Export the whole thing to separate files. Less pointy-clicky.
3. Export the whole thing to one file. About the same pointy-clicky and error vulnerability as #2. But. Simpler still because there's only one file to take care of. Something a fiction author should be able to handle.
In the all-in-one export, each table gets a prefix and a suffix. The prefix is a line with "Sheet: Table". Yes, there's a ": " (colon-space) separator. The suffix is a simple blank line, essentially indistinguishable from a blank line within a table.

If the table were originally in strict first normal form (1NF), each row would have the same number of commas. If cells are merged, however, there can be fewer commas. This makes it potentially difficult to distinguish blank rows in a table from blank lines between tables.

It's generally easiest to ignore the blank lines entirely. We can distinguish table headers because they're a single cell with the "sheet: table" format. We're left hoping no table contains a value that happens to have this format.
We have two ways to walk through the values:
- Preserving the Sheet, Table, Row hierarchy. We can think of this as the for s in sheet: for t in table: for r in rows structure. The sheet iterator is Iterator[Tuple[str, Table_Iterator]]. The Table_Iterator is similar: Iterator[Tuple[str, Row_Iterator]]. The Row_Iterator, is the most granular Iterator[Dict[str, Any]].
- Flattening this into a sequence of "(Sheet name, Table Name, Row)" triples. Since a sheet and table have no other attributes beyond a name, this seems advantageous to me.
The hierarchical form requires a number of generator functions for Sheet-from-CSV, Table-from-CSV, and Row-from-CSV. Each of these works with a single underlying iterator over the source file and a fairly complex hand-off of state. If we only use the sheet iterator, the tables and rows are skipped. If we use the table within a sheet, the first table name comes from the header that started a sheet; the table names come from distinct headers until the sheet name changes.
The table-within-sheet iteration is very tricky. The first table is a simple yield of information gathered by the sheet iterator. Any subsequent table, however, may be based on one of two conditions: either no rows have been consumed, in which case the table iterator consumes (and ignores) rows; or all the rows of the table have been consumed and the current row is another "sheet: table" header.
The code sample below involves a fair amount of repetition. It's not appealing to refactor this because it's ungainly in its complexity, and doesn't create any tangible value. (I haven't even tried to get the type hints right.)
import csv
from pathlib import Path

class SheetTable:
    def __init__(self, source_path: Path) -> None:
        self.path: Path = source_path
        self.csv_source = None
        self.rdr = None
        self.header = None
        self.row = None

    def __enter__(self) -> "SheetTable":
        self.csv_source = self.path.open()
        self.rdr = csv.reader(self.csv_source)
        self.header = None
        self.row = next(self.rdr)
        return self

    def __exit__(self, *args) -> None:
        self.csv_source.close()

    def _sheet_header(self) -> bool:
        # A "Sheet: Table" header is a single cell with a ": " separator.
        return len(self.row) == 1 and ': ' in self.row[0]

    def sheet_iter(self):
        while True:
            # Skip rows until the next "sheet: table" header.
            while not (self._sheet_header()):
                try:
                    self.row = next(self.rdr)
                except StopIteration:
                    return
            self.sheet, _, self.table = self.row[0].partition(": ")
            self.header = next(self.rdr)
            self.row = next(self.rdr)
            yield self.sheet, self.table_iter()

    def table_iter(self):
        # The first table was set up by sheet_iter().
        yield self.table, self.row_iter()
        while not (self._sheet_header()):
            try:
                self.row = next(self.rdr)
            except StopIteration:
                return
        next_sheet, _, next_table = self.row[0].partition(": ")
        # Subsequent tables belong to this sheet until the sheet name changes.
        while next_sheet == self.sheet:
            self.table = next_table
            self.header = next(self.rdr)
            self.row = next(self.rdr)
            yield self.table, self.row_iter()
            while not (self._sheet_header()):
                try:
                    self.row = next(self.rdr)
                except StopIteration:
                    return
            next_sheet, _, next_table = self.row[0].partition(": ")

    def row_iter(self):
        while not self._sheet_header():
            yield dict(zip(self.header, self.row))
            try:
                self.row = next(self.rdr)
            except StopIteration:
                return
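To see what consuming this looks like, here's a usage sketch assuming the class above (the file name is invented); each level's iterator has to be consumed in strict order, because all three generators share the reader's state.

with SheetTable(Path("export.csv")) as sheets:
    for sheet_name, tables in sheets.sheet_iter():
        for table_name, rows in tables:
            for row in rows:
                print(sheet_name, table_name, row)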
Clearly, this is craziness. Flattening is much nicer.
import csv
from pathlib import Path
from typing import Any, Dict, Iterator, Tuple

def sheet_table_iter(source_path: Path) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    with source_path.open() as csv_source:
        rdr = csv.reader(csv_source)
        header = None
        for row in rdr:
            if len(row) == 0:
                continue
            elif len(row) == 1 and ": " in row[0]:
                sheet, table = row[0].split(": ", maxsplit=1)
                header = next(rdr)
                continue
            else:
                # Inject headers to create a dict from the row.
                yield sheet, table, dict(zip(header, row))
This provides a relatively simple way to find the relevant tables and sheets. We can use something as simple as the following to locate the relevant data.
for sheet, table, row in sheet_table_iter(source_path):
    if sheet == 'AgentQuery' and table == 'agent_query':
        agent = agent_query_row(database, row)
    elif sheet == 'AAR-2019-03' and table == 'Table 1':
        agent = aar_2019_row(database, row)
This lets us write pleasant functions that handle exactly one row from the source table. We'll have one of these for each target table. In the above example, we've only shown two; you get the idea. Each new source table, with its unique headers, can be accommodated.
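A sketch of one such row handler: the column names and the database shape are my assumptions; only the one-row-in, one-agent-out shape comes from the text.

from typing import Any, Dict

def agent_query_row(database: Dict[str, Any], row: Dict[str, Any]) -> Dict[str, Any]:
    """Map one AgentQuery row to a uniform agent record.
    The header names used here are hypothetical."""
    agent = {
        "name": row.get("Agent Name", "").strip(),
        "agency": row.get("Agency", "").strip(),
        "email": row.get("Email", "").strip(),
    }
    database[agent["name"]] = agent
    return agent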
Tuesday, October 8, 2019
Spreadsheet Regrets
I can't emphasize this enough.
Some people, when confronted with a problem, think
“I know, I'll use a spreadsheet.” Now they have two problems.
(This was originally about regular expressions. And AWK. See http://regex.info/blog/2006-09-15/247)
Fiction writer F. L. Stevens got a list of literary agents from AAR Online. This became a spreadsheet driving queries for representation. After a bunch of rejections, another query against AAR Online provided a second list of agents.
Apple's Numbers product will readily translate the AAR Online HTML table into a usable spreadsheet table. But after initial success, the spreadsheet as tool of choice collapses into a pile of rubble. The spreadsheet data model is hopelessly ineffective for the problem domain.
What is the problem domain?
There are two user stories:
- Author needs to deduplicate agents and agencies. It's considered poor form to badger agents with repeated queries for the same title. It's also bad form to query two agents at the same agency. You have to get rejected by one before contacting the other.
- Author needs to track activities at the Agent and Agency level to optimize querying. This mostly involves sending queries and tracking rejections. Ideally, an agent acceptance should lead to notification to other agents that the manuscript is being withdrawn. This is so rare as to not require much automation.
Agents come and go. Periodically, an agent will be closed to queries for some period of time, and then reopen. Their interests vary with the whims of the marketplace they're trying to serve. Traditional fiction publishing is quite complex; agents are the gatekeepers.
To an extent, we can decompose the processing like this.
1. Sourcing. There are several sources: AAR Online and Agent Query are two big sources. These sites have usable query engines and the HTML can be scraped to get a list of currently active agents with a uniform representation. This is elegant Python and Beautiful Soup.
2. Deduplication. Agency and Agent deduplication is central. Query results may involve state changes to an agent (open to queries, interested in new genres.) Query results may involve simple duplicates, which have to be discarded to avoid repeated queries. It's a huge pain when attempted with a spreadsheet. The simplistic string-equality test for name matching is defeated by whitespace variations, for example. This is elegant Python, however; see the normalization sketch after this list.
3. Agent web site checks. These have to be done manually. Agency web pages are often art projects, larded up with JavaScript that produces elegant rolling animations of books, authors, agents, background art, and text. These sites aren't really set up to help authors. It's impossible to automate a check to confirm the source query results. This has to be done manually: F. L. is required to click and update status.
4. State Changes. Queries and Rejections are the important state changes. Open and Closed to queries is also part of the state that needs to be tracked. Additionally, there's a multiple agent per agency check that makes this more complex. The state changes are painful to track in a simple spreadsheet-like data structure: a rejection by one agent can free up another agent at the same agency. This multi-row state change is simply horrible to deal with.
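For step 2, here's a minimal sketch of the kind of name normalization that defeats the whitespace problem; the specific rules are illustrative assumptions.

import unicodedata

def normalized_name(name: str) -> str:
    """Collapse the variations that defeat naive string equality:
    stray whitespace, case, and Unicode lookalikes."""
    text = unicodedata.normalize("NFKC", name)
    return " ".join(text.casefold().split())

# Two spreadsheet variants of one agent now compare equal.
assert normalized_name("Jane  Q.\u00a0Doe ") == normalized_name("jane q. doe")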
Bonus confusion! Time-to-Live rules: a query over 60 days old is more-or-less a de facto rejection. This means that periodic scans of the data are required to close a query to one agent in an agency, freeing up subsequent agents in the same agency.
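And a hedged sketch of that time-to-live check; the sixty-day cutoff is the only rule taken from the text, the names are mine.

import datetime
from typing import Optional

SIXTY_DAYS = datetime.timedelta(days=60)

def is_de_facto_rejection(query_date: datetime.date,
                          today: Optional[datetime.date] = None) -> bool:
    """A query over 60 days old is treated as a rejection,
    freeing up the next agent at the same agency."""
    today = today or datetime.date.today()
    return today - query_date > SIXTY_DAYS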
Manuscript Wish Lists (MSWLs) are a source for agents actively searching for manuscripts. This is more-or-less a Twitter query. Using the various aggregating web sites seems slightly easier than using Twitter directly. However, additional Twitter lookups are required to locate agent details, so this is interesting web-scraping.
Of course F. L. Stevens has a legacy spreadsheet with at least four "similar" (but not really identical) tabs filled with agencies, agents, and query status.
I don't have an implementation to share -- yet. I'm working on it slowly.
I think it will be an interesting tutorial in cleaning up semi-structured data.