Tuesday, October 15, 2019

Apple's Numbers and the All-in-One CSV export

Author F. L. Stevens has a hellishly complex (and irregular) spreadsheet with agents, agencies, and query status. (This is how fiction gets marketed: querying agents.) The spreadsheet has become unmanageably complex, with multiple pages. Each page has multiple tables. Buried in this are three "interesting" tables with agent query information.

Can we talk about drama? There is the dark night of the soul for anyone interested in regular, normalized data.

We have some fundamental choices for working with this mess:
  1. Export each relevant table to separate files. Lots of manual pointy-clicky and opportunities for making mistakes.
  2. Export the whole thing to separate files. Less pointy-clicky.
  3. Export the whole thing to one file. About the same pointy-clicky and error vulnerability as #2. But. Simpler still because there's one file to take care of. Something a fiction author should be able to handle.
The all-in-one CSV export is (initially) exasperating. Each table is wrapped in a prefix and suffix.
The prefix is a line with "Sheet: Table" Yes. There's a ": " (colon space) separator. The suffix is a simple blank line, essentially indistinguishable from a blank line within a table.

If the table was originally in strict first normal form (1NF) each row would have the same number of commas. If cells are merged, however, the number of commas can be fewer. This makes it potentially difficult to distinguish blank rows in a table from blank lines between tables.

It's generally easiest to ignore the blank lines entirely. We can distinguish table headers because they're a single cell with a sheet: table format. We are left hoping there aren't any tables that have values that have this format.

We have two ways to walk through the values:
  • Preserving the Sheet, Table, Row hierarchy. We can think of this as the for s in sheet: for t in table: for r in rows structure. The sheet iterator is Iterator[Tuple[str, Table_Iterator]]. The Table_Iterator is similar: Iterator[Tuple[str, Row_Iterator]]. The Row_Iterator, is the most granular Iterator[Dict[str, Any]].
  • Flattening this into a sequence of "(Sheet name, Table Name, Row)" triples. Since a sheet and table have no other attributes beyond a name, this seems advantageous to me.
The hierarchical form requires a number of generator functions for Sheet-from-CSV, Table-from-CSV, and Row-from-CSV. Each of these works with a single underlying iterator over the source file and a fairly complex hand-off of state. If we only use the sheet iterator, the tables and rows are skipped. If we use the table within a sheet, the first table name comes from the header that started a sheet; the table names come from distinct headers until the sheet name changes. 

The table-within-sheet iteration is very tricky. The first table is a simple yield of information gathered by the sheet iterator. Any subsequent tables, however, may be based one one of two conditions: either no rows have been consumed, in which case the table iterator consumes (and ignores) rows; or, all the rows of the table have been consumed and the current row is another "sheet: table" header. 

The code sample below involves a fair amount of repetition. It's not appealing to refactor this because it's ungainly in its complexity, and doesn't create any tangible value. (I haven't even tried to get the type hints right.)

class SheetTable:
    def __init__(self, source_path: Path) -> None:
        self.path: Path = source_path
        self.csv_source = None
        self.rdr = None
        self.header = None
        self.row = None

    def __enter__(self) -> None:
        self.csv_source = self.path.open()
        self.rdr = csv.reader(self.csv_source)
        self.header = None
        self.row = next(self.rdr)
        return self

    def __exit__(self, *args) -> None:

    def _sheet_header(self) -> bool:
        return len(self.row) == 1 and ': ' in self.row[0]

    def sheet_iter(self):
        while True:
            while not (self._sheet_header()):
                    self.row = next(self.rdr)
                except StopIteration:
            self.sheet, _, self.table = self.row[0].partition(": ")
            self.header = next(self.rdr)
            self.row = next(self.rdr)
            yield self.sheet, self.table_iter()

    def table_iter(self):
        yield self.table, self.row_iter()
        while not (self._sheet_header()):
                self.row = next(self.rdr)
            except StopIteration:
        next_sheet, _, next_table = self.row[0].partition(": ")
        while next_sheet == self.sheet:
            self.table = next_table
            self.header = next(self.rdr)
            self.row = next(self.rdr)
            yield self.table, self.row_iter()
            while not (self._sheet_header()):
                    self.row = next(self.rdr)
                except StopIteration:
            next_sheet, _, next_table = self.row[0].partition(": ")

    def row_iter(self):
        while not self._sheet_header():
            yield dict(zip(self.header, self.row))
                self.row = next(self.rdr)
            except StopIteration:

Clearly, this is craziness.

Flattening is much nicer.

def sheet_table_iter(source_path: Path) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    with source_path.open() as csv_source:
        rdr = csv.reader(csv_source)
        header = None
        for row in rdr:
            if len(row) == 0:
            elif len(row) == 1 and ": " in row[0]:
                sheet, table = row[0].split(": ", maxsplit=1)
                header = next(rdr)
                # Inject headers to create dict from row
                yield sheet, table, dict(zip(header, row))

This provides a relatively simple way to find the relevant tables and sheets. We can use something as simple as the following to locate the relevant data.

    for sheet, table, row in sheet_table_iter(source_path):
        if sheet == 'AgentQuery' and table == 'agent_query':
            agent = agent_query_row(database, row)
        elif sheet == 'AAR-2019-03' and table == 'Table 1':
            agent = aar_2019_row(database, row)

This lets us write pleasant functions that handle exactly one row from the source table. We'll have one of these for each target table. In the above example, we've only shown two, you get the idea. Each new source table, with its unique headers can be accommodated.

Tuesday, October 8, 2019

Spreadsheet Regrets

I can't emphasize this enough.

Some people, when confronted with a problem, think
“I know, I'll use a spreadsheet.”   Now they have two problems.

(This was originally about regular expressions. And AWK. See http://regex.info/blog/2006-09-15/247)

Fiction writer F. L. Stevens got a list of literary agents from AAR Online. This became a spreadsheet driving queries for representation. After a bunch of rejections, another query against AAR Online provided a second list of agents.

Apple's Numbers product will readily translate the AAR Online HTML table into a usable spreadsheet table. But after initial success the spreadsheet as tool of choice collapses into a pile of rubble. The spreadsheet data model is hopelessly ineffective for the problem domain.

What is the problem domain?

There are two user stories:
  1. Author needs to deduplicate agents and agencies. It's considered poor form to badger agents with repeated queries for the same title. It's also bad form to query two agents at the same agency. You have to get rejected by one before contacting the other. 
  2. Author needs to track activities at the Agent and Agency level to optimize querying. This mostly involves sending queries and tracking rejections. Ideally, an agent acceptance should lead to notification to other agents that the manuscript is being withdrawn. This is so rare as to not require much automation.
Agents come and go. Periodically, an agent will be closed to queries for some period of time, and then reopen. Their interests vary with the whims of the marketplace they're trying to serve. Traditional fiction publishing is quite complex; agents are the gatekeepers.

To an extent, we can decompose the processing like this. 

1. Sourcing. There are several sources: AAR Online and Agent Query are two big sources. These sites have usable query engines and the HTML can be scraped to get a list of currently active agents with a uniform representation. This is elegant Python and Beautiful Soup. 

2. Deduplication. Agency and Agent deduplication is central. Query results may involve state changes to an agent (open to queries, interested in new genres.) Query results may involve simple duplicates, which have to be discarded to avoid repeated queries. It's a huge pain when attempted with a spreadsheet. The simplistic string equality test for name matching is defeated by whitespace variations, for example. This is elegant Python, however. 

3. Agent web site checks. These have to be done manually. Agency web pages are often art projects, larded up with javascript that produces elegant rolling animations of books, authors, agents, background art, and text. These sites aren't really set up to help authors. It's impossible to automate a check to confirm the source query results. This has to be done manually: F. L. is required to click and update status. 

4. State Changes. Queries and Rejections are the important state changes. Open and Closed to queries is also part of the state that needs to be tracked. Additionally, there's a multiple agent per agency check that makes this more complex. The state changes are painful to track in a simple spreadsheet-like data structure: a rejection by one agent can free up another agent at the same agency. This multi-row state change is simply horrible to deal with.

Bonus confusion! Time-to-Live rules: a query over 60 days old is more-or-less a de facto rejection. This means that periodic scans of the data are required to close a query to one agent in an agency, freeing up subsequent agents in the same agency.

Manuscript Wish Lists (MSWLs) are a source for agents actively searching for manuscripts. This is more-or-less a Twitter query. Using the various aggregating web sites seems slightly easier than using Twitter directly. However, additional Twitter lookups are required to locate agent details, so this is interesting web-scraping.

Of course F. L. Stevens has a legacy spreadsheet with at least four "similar" (but not really identical) tabs filled with agencies, agents, and query status.

I don't have an implementation to share -- yet. I'm working on it slowly.

I think it will be an interesting tutorial in cleaning up semi-structured data.

Tuesday, September 10, 2019

Tuesday, September 3, 2019

Finally Planning the Rewrite of Building Skills in Object-Oriented Design

See Building Skills in Object-Oriented Design for the old content, which has a number of features that hold up well over time.

  1. A graduated series of exercises to build up large, complete applications is important. 
  2. It covers a lot of skills essential to building real applications -- unit testing, integration, code reuse. I want to expand on this to include more testing strategies, and final documentation.
  3. It's so popular, I've got enough donations to move forward on a rewrite.
Previously, it was hosted out of my ancient web site as HTML and PDF download. That hasn't aged well.

Also, it was originally Python 2, and that ship sailed years ago.

I'm leaning toward hosting the content on GitHub.

One idea is to have a complex project with the following top-level folders:
  • A docs folder that has the HTML as well as PDF (and maybe an ebook format, too.)
  • A src folder with seed files for the various packages and modules.
  • A tests folder with seed tests.
Someone could fork and then build on the framework.

It's possible to put the exposition into the wiki pages associated with the repo. This has the advantage of keeping the meta-level documentation and individual project requirements separate from the project itself.

Before I go too far, I'll need to experiment a bit to see what the editing process is like. The Github wiki pages are their own git branch, and are easy to edit off-line and push to the repo. Some of the fancy Sphinx markup features vanish, replaced with basic RST. This may not be all bad, since the baseline content is not *very* complex.

Stand by for more.

Tuesday, August 13, 2019

Coping with Windows via AWS

For a training class, I needed to address The Windows Problem™. TWP is the my summary of all the non-standard features of Windows in its various inconsistent incarnations.

Any training class that involves "install Python" inevitably involves at least one Windows user who can't get their PATH set correctly. It's an eternal mystery to me, since the installers all seem to take care of this, but, some people are able to click the wrong thing somewhere in a simple installation.

The uninstall and start again sometimes helps. Having a Windows expert in the room sometimes helps.

(The hapless flailing I sometimes observe is my personal problem to deal with. I should not let myself get short-tempered with people who try things exactly once and then stop, unable to discern a single branch in the path they chose. It's as if the sequence of dialog boxes with different choices never existed for them, and I need to respect the fact that they did not read the messages and did not think of themselves as actively making choices.)

I have to say that being able to get a Windows machine in free tier of AWS is a wonderful resource.

I can screen capture the installation on Windows. 

I can narrate the sequence of choices I made. I'm hoping this prevents TWP from side-tracking a person who's struggling with Windows.

I haven't used the movies yet. But it's so handy to be able to spin up a Windows machine in the cloud and run the Conda install. It's a whole lot nicer than buying a throw-away Windows machine to do screen shots on.

Friday, July 5, 2019

Mastering Object-Oriented Python 2nd ed

The book https://www.packtpub.com/programming/mastering-object-oriented-python-second-edition

Some new chapters. 

Type hints almost everywhere.

If you want to write a review, DM me on twitter @s_lott I can add you to the list for freebies in exchange for a review.

Thursday, June 20, 2019

HumbleBundle -- Functional Python Programming -- Through July 1

See this https://www.humblebundle.com

This is amazing to me. 
Humble Bundle sells games, ebooks, software, and other digital content. Our mission is to support charity while providing awesome content to customers at great prices. We launched in 2010 with a single two-week Humble Indie Bundle, but we have humbly grown into a store full of games and bundles, a subscription service, a game publisher, and more.
Currently, Packt is offering some of my books.

Want a *ton* of technical books and donate to charity?

Click Now. Thank me later.