Tuesday, September 24, 2013

Introduction to Programming: iBook Edition for Python 3.2

That was challenging.

I rewrote almost all of my Introduction to Programming book into an iBook. Trimmed it down. Refocused it. Changed from Python 2.6 to 3.2. A complete refactoring from which almost nothing of the original book survives except the goals.

Look for it October 1st in the iTunes bookstore. Programming for Absolute Beginners: Building Skills in Programming.

[My intent is to have several Building Skills titles. We'll see how far I get.]

The rewrite involved three substantial changes.

  1. I removed all of the duplicative reference material. The Python library reference is (now) outstandingly good. When I started using Python over ten years ago, it was not very good, and I started writing a Python reference of my own merely to organize the documentation. The books grew from there; the reference aspect is now useless.
  2. I dropped almost all Python 2 references. There's a little bit of Python 2 history, but that's it. It's time to move forward, now that most of the major packages seem to have made the switch. 
  3. I changed the focus from processing to data.

Processing vs. Data

When looking at a multi-faceted language like Python, it's difficult to know what's the most gentle introduction to software development.

Historically, the procedural, imperative style of programming appears the most appealing. The roots of Python come from procedural programming. It reaches back to Pascal (and even Algol 60) by elegantly restating the core principles of those languages with an easier-to-read syntax.

Indeed, if you read classic foundational CACM articles where essential algorithms were first formally described, they used a neatly typeset variant on Algol that (for the early years of my career) was the gold standard in how code should look. Python follows this tradition nicely.

But.

That doesn't mean that procedural programming is the best way to introduce the language.

Data First

I think that it may be possible to introduce the language with a focus on data objects first and the procedural/imperative statements as a secondary consideration.

When it comes to anything beyond trivial Rate-Time-Distance calculations, the data structure matters more than almost any other aspect of the software. The objects, their relationships, their operations and their attributes are core to the problem. The presentation, user actions and persistent representation are secondary considerations after the structure of the data.

It seems like the data structures should "drive" the presentation. The outline of the book should introduce each of the important and visible builtin data structures in turn. The most frequently used library extensions can be introduced as well.

Definitional features (def, return, yield, class, and the ideas of module and package) are central, but a step behind the builtin data structures.

Procedural features (if, for, while, break, continue, with, etc.) are clearly second-class; they exist only to support access to the data structures. A for statement makes a "for all" assertion about a data structure. A for with a break (or a while) makes a "there exists" assertion about a data structure. The data is central. The imperative statements are secondary.
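To make the two kinds of assertion concrete, here's a minimal sketch. The temperatures list and the thresholds are made up for illustration; they aren't from the book.

```python
# Hypothetical sample data, for illustration only.
temperatures = [68, 71, 74, 69]

# "For all": the loop visits every item, so the result speaks for the whole list.
all_mild = True
for t in temperatures:
    all_mild = all_mild and (60 <= t <= 80)

# "There exists": the loop stops at the first item that qualifies.
found = None
for t in temperatures:
    if t > 72:
        found = t
        break

# The same two assertions, stated with generator expressions.
assert all_mild == all(60 <= t <= 80 for t in temperatures)
assert (found is not None) == any(t > 72 for t in temperatures)
```

In both loops the statement is really a claim about the data; the for and break are only the mechanics of checking it.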

Other features (global, nonlocal, del, raise, try, etc.) are tertiary, and exist to create more elegant programs that don't annoy the other developers as much. 

This also means that generator expressions and comprehensions are first-class, front-and-center features of the language. This parallels the approach in the NLTK Book, which puts the focus on generator expressions as a way to clearly state the processing.

Other Forms

Currently, I only have the iBook available.

The iBook Author application can (and does) produce a PDF. I think I may offer that separately through www.lulu.com.

Tuesday, September 17, 2013

iWeb File Extract and XML Iterators

Once upon a time, Apple offered iBlog. Then they switched to iWeb. Then they abandoned that market entirely.

That leaves some of us with content in iBlog as well as iWeb. Content we'd like to work with without doing extensive cutting and pasting. Or downloading from a web server. After all, the files are on our computer.

The iWeb files are essentially XML, making them relatively easy to work with. We can reduce the huge, and hugely complex iWeb XML to a simple iterator and use a simple for statement to extract the content.

[Historical note. I wrote a Python script to convert iBlog to RST. It worked reasonably well, all things considered. This is not the first time I've tried to preserve old content from obsolete tools. Sigh.]

Some tools (like SandVox) have an "extract iWeb content" mode. But that's not what we want. We don't want to convert from iWeb to another blog. We want to convert from iWeb to CSV or some other more useful format so we can do some interesting processing, not simple web presentation.

This is a note on how to read iWeb files to get at the content. And further, how to get at XML content in the form of a simple iterator.

Opening The Package

Here's how to overview the package.

    import os

    path="~/Documents/iWeb/Domain"
    path_full= os.path.expanduser(path+".sites2")
    for filename in os.listdir(path_full):
        name, ext = os.path.splitext( filename )
        # Skip the media files; we're only interested in the XML content.
        if ext.lower() in ( ".jpg", ".png", ".mov", ".m4v", ".tiff", ".gif", ".m4a", ".mpg", ".pdf" ): continue
        print( filename )

This will reveal the files; we only really care about the "index.xml.gz" file since that has the bulk of the content.

    from contextlib import closing
    import gzip
    import xml.etree.ElementTree as xml
    with closing( gzip.GzipFile( os.path.join(path_full,"index.xml.gz") ) ) as index:
        index_doc= xml.parse( index )
        index_root= index_doc.getroot()

This gets us the XML version of the blog.

Finding the Pages

We can use the following to thread through the XML. We're looking for a particular "Domain", a "Site" and a particular blog page within that site. The rest of the blog is mostly text. This portion of the blog is more structured.

For some reason, the domain is "Untitled". The site is "Cruising", and the blog page is "Travel 2012-2013". We insert these target names into XPath search strings to locate the relevant content.


search= '{{http://developer.apple.com/namespaces/bl}}domain[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(domain_name)
domain= index_root.find( search )
mdu_uuid_tag= domain.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
domain_filename= "domain-{0}".format( mdu_uuid_value )

search= './/{{http://developer.apple.com/namespaces/bl}}site[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(site_name)
cruising= domain.find(search)
mdu_uuid_tag= cruising.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
site_filename= "site-{0}".format(mdu_uuid_value)

search= '{{http://developer.apple.com/namespaces/bl}}site-blog[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(site_blog_name)
site_nodes= cruising.find('{http://developer.apple.com/namespaces/bl}site-nodes')
travel= site_nodes.find(search)
mdu_uuid_tag= travel.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
site_blog_filename= "site-blog-{0}".format(mdu_uuid_value)


This will allow us to iterate through the blog entries, called "pages". Each page, it turns out, is stored in a separate XML file with the page details and styles. A lot of styles. We have to assemble the path from the base path, the domain, site,  site-blog and site-page names. We'll find an ".xml.gz" file that has the individual blog post.

    for site_page in travel.findall('{http://developer.apple.com/namespaces/bl}series/{http://developer.apple.com/namespaces/bl}site-page'):
        mdu_uuid_tag= site_page.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
        mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
        site_page_filename= "site-page-{0}".format(mdu_uuid_value)

        blog_path= os.path.join(path_full, domain_filename, site_filename, site_blog_filename, site_page_filename )
        with closing( gzip.GzipFile( os.path.join(blog_path,site_page_filename+".xml.gz") ) ) as child:
            child_doc= xml.parse( child )
            child_root= child_doc.getroot()
        main_layer= child_root.find( '{http://developer.apple.com/namespaces/bl}site-page/{http://developer.apple.com/namespaces/bl}drawables/{http://developer.apple.com/namespaces/bl}main-layer' )

Once we have access to the page XML document, we can extract the content. At this point, we could define a function which simply yielded the individual site_page tags.
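Here's a rough sketch of what that function might look like, along with a helper that folds up the repeated MDUUID lookup. The names mdu_uuid() and site_page_iter() are hypothetical, and the tiny sample document is a stand-in that only mimics the nesting used in the searches above.

```python
import xml.etree.ElementTree as xml

BL = "{http://developer.apple.com/namespaces/bl}"
SFA = "{http://developer.apple.com/namespaces/sfa}"

def mdu_uuid(element):
    """Return the MDUUID string from an element's metadata (hypothetical helper)."""
    tag = element.find(BL + "metadata/" + BL + "MDUUID")
    return tag.find(BL + "string").get(SFA + "string")

def site_page_iter(site_blog):
    """Yield (directory name, site-page element) for each page in a site-blog."""
    for site_page in site_blog.findall(BL + "series/" + BL + "site-page"):
        yield "site-page-{0}".format(mdu_uuid(site_page)), site_page

# Stand-in document, not real iWeb output.
sample = xml.fromstring(
    '<bl:site-blog xmlns:bl="http://developer.apple.com/namespaces/bl"'
    ' xmlns:sfa="http://developer.apple.com/namespaces/sfa">'
    '<bl:series><bl:site-page><bl:metadata><bl:MDUUID>'
    '<bl:string sfa:string="ABC-123"/>'
    '</bl:MDUUID></bl:metadata></bl:site-page></bl:series>'
    '</bl:site-blog>'
)
names = [name for name, page in site_page_iter(sample)]
```

The caller would still have to open each page's ".xml.gz" file; this only isolates the part that walks the index.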

Summary Iterable

The most useful form for the pages is an iterable that yields the date, title and content text. In this case, we're not going to preserve the internal markup, we're just going to extract the text in bulk.


        content_map = {}
        for ds in main_layer.findall( '{http://developer.apple.com/namespaces/sf}drawable-shape' ):
            style_name= ds.get('{http://developer.apple.com/namespaces/sf}name')
            if style_name is None:
                #xml.dump( ds ) # Never has any content.
                continue
            for tb in ds.findall('{http://developer.apple.com/namespaces/sf}text/{http://developer.apple.com/namespaces/sf}text-storage/{http://developer.apple.com/namespaces/sf}text-body' ):
                # Simply extract text. Markup is lost.
                content_map[style_name] = tb.itertext()
        yield content_map


This works because the text has no useful semantic markup. It's essentially HTML formatting full of span and div tags.

Note that this could be a separate generator function, or it could be merged into the loop that finds the site-page tags. It's unlikely we'd ever have another source of site-page tags. But, it's very likely that we'd have another function for extracting the text, date and title from a site-page tag. Therefore, we should package this as a separate generator function. We didn't, however. It's just a big old function named postings_iter().

There are three relevant style names. We're not sure why these are used, but they're completely consistent indicators of the content.
  • "generic-datefield-attributes (from archive)"
  • "generic-title-attributes (from archive)"
  • "generic-body-attributes (from archive)"
These become keys of the content_map mapping. The values are iterators over the text.

Processing The Text

Here's an iterator that makes use of the postings_iter() function shown above.

import datetime

def flatten_posting_iter( postings=None ):
    """Minor cleanup to the postings to parse the date and flatten out the title."""
    # A generator used as a default value is created once, when the def is executed,
    # and is exhausted after its first use. Build the default here instead.
    if postings is None:
        postings= postings_iter(path="~/Documents/iWeb/Domain")
    for content_map in postings:
        date_text= " ".join( content_map['generic-datefield-attributes (from archive)'] )
        date= datetime.datetime.strptime( date_text, "%A, %B %d, %Y" ).date()
        title= " ".join( content_map['generic-title-attributes (from archive)'] )
        body= content_map['generic-body-attributes (from archive)']
        yield date, title, body

This will parse the dates, compress the title to remove internal markup, but otherwise leave the content untouched. 
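As a quick check of the two cleanup steps, here's a small sketch. The sample strings are assumptions, borrowed from this blog's own headings rather than from an actual iWeb page. Note that %A and %B match English day and month names only under an English (or default "C") locale.

```python
import datetime

# The values in content_map are iterators over text fragments; join() consumes them.
date_fragments = iter(["Tuesday, September 17, 2013"])
title_fragments = iter(["iWeb File Extract", "and XML Iterators"])

date_text = " ".join(date_fragments)
date = datetime.datetime.strptime(date_text, "%A, %B %d, %Y").date()
title = " ".join(title_fragments)
```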

Now we can use the following kind of loop to examine each posting.

    flat_postings=flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
    for date, title, text_iter in sorted(flat_postings):
        for text in text_iter:
            pass # examine the text for important content.

We've sorted the postings into date order. Now we can process the text elements to look for the relevant content.
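One caveat with the sort: sorted() compares whole tuples, so two postings with the same date and the same title would fall through to comparing the body iterators, and Python 3 can't order those. Here's a minimal sketch of a sort key that sidesteps the problem. The postings are made up for illustration.

```python
import datetime
from operator import itemgetter

# Hypothetical postings: two share a date and title; the bodies are iterators.
postings = [
    (datetime.date(2013, 9, 17), "Underway", iter(["second body"])),
    (datetime.date(2013, 9, 17), "Underway", iter(["first body"])),
    (datetime.date(2013, 9, 10), "Departure", iter(["earlier body"])),
]

# Sort on date and title only, so the body iterators are never compared.
ordered = sorted(postings, key=itemgetter(0, 1))
titles = [title for date, title, body in ordered]
```

Because the sort is stable, postings that tie on date and title keep their original relative order.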

In this case, we're looking for Lat/Lon coordinates, which have rather complex (but easy to spot) regular expressions. So the "examine" part is a series of RE matches to collect the data points we're looking for.

We'll leave off those application-specific details. We'll leave it at the following outline of the processing.

def fact_iter( flat_postings=None ):
    # Build the default here; a generator default would be exhausted after the first call.
    if flat_postings is None:
        flat_postings= flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
    for date, title, text_iter in sorted(flat_postings):
        fact= Fact()
        for text in text_iter:
            # examine the text for important content, set attributes of fact
            if fact.complete():
                yield fact
                fact= Fact()

This iterates through focused data structures that include the requested lat/lon points.
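For readers who want something to run, here's one way the Fact class might be shaped. Everything in it is an assumption: the attribute names, the examine() method, and especially the coordinate layout (degrees, decimal minutes, hemisphere letter), which is just one common way of writing a position. The real patterns are application-specific.

```python
import re

# Assumed layout: degrees, decimal minutes, hemisphere, e.g. "37 50.2N 076 16.5W".
LAT_PAT = re.compile(r"(\d{1,2})\s+(\d{1,2}\.\d+)\s*([NS])")
LON_PAT = re.compile(r"(\d{1,3})\s+(\d{1,2}\.\d+)\s*([EW])")

def to_degrees(match):
    """Convert a (degrees, minutes, hemisphere) match to signed decimal degrees."""
    deg, minutes, hemi = match.groups()
    value = int(deg) + float(minutes) / 60
    return -value if hemi in "SW" else value

class Fact:
    """Hypothetical stand-in for the Fact class used above."""
    heading = ["lat", "lon"]

    def __init__(self):
        self.lat = None
        self.lon = None

    def examine(self, text):
        """Set whichever attributes the text supplies."""
        lat = LAT_PAT.search(text)
        if lat:
            self.lat = to_degrees(lat)
        lon = LON_PAT.search(text)
        if lon:
            self.lon = to_degrees(lon)

    def complete(self):
        return self.lat is not None and self.lon is not None

    def as_dict(self):
        return {"lat": self.lat, "lon": self.lon}

fact = Fact()
fact.examine("Anchored at 37 50.2N 076 16.5W")
```

The heading attribute and as_dict() line up with what csv.DictWriter expects: a list of field names and a dictionary per row.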

Final Application

The final application function that uses all of these iterators has the following kind of structure.


import csv

source= flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
# The csv module asks for newline='' so it can manage line endings itself.
with open('target.csv', 'w', newline='') as target:
    wtr= csv.DictWriter( target, Fact.heading )
    wtr.writeheader()
    for fact in fact_iter( source ):
        wtr.writerow( fact.as_dict() )


We're simply iterating through the facts and writing them to a CSV file.

We can also simplify the last bit to this.

wtr.writerows( f.as_dict() for f in fact_iter( source ) )
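Here's a self-contained sketch of that last form. The dictionaries are made-up stand-ins for what fact.as_dict() would produce, and io.StringIO stands in for the target file so the result can be inspected.

```python
import csv
import io

# Hypothetical rows standing in for fact.as_dict() results.
facts = [
    {"lat": 37.5, "lon": -76.25},
    {"lat": 36.75, "lon": -76.0},
]

# An in-memory text buffer in place of the real output file.
target = io.StringIO(newline='')
wtr = csv.DictWriter(target, ["lat", "lon"])
wtr.writeheader()
wtr.writerows(row for row in facts)  # any iterable of dictionaries will do
output = target.getvalue()
```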

The iWeb XML structure, while bulky and complex, can easily be reduced to a simple iterator. That's why I love Python.

Thursday, September 12, 2013

Omni Outliner, XML Processing, and Recursive Generators

First, and most important, Omni Outliner is a super-flexible tool. Crazy levels of flexibility. It's very much a generic-all-singing-all-dancing information management tool.

It has a broad spectrum of file export alternative formats. Most of which are fine for import into some kind of word processor.

But what if the data is more suitable for a spreadsheet or some more structured environment? What if it was a detailed log or a project outline decorated with a column of budget numbers?

We have two approaches: one is workable but not great; the other has numerous advantages.

In the previous post, "Omni Outliner and Content Conversion", we read an export in tab-delimited format. It was workable but icky.

Here's the alternative. This uses a recursive generator function to flatten out the hierarchy. There's a trick to recursion with generator functions.

Answer 2: Look Under the Hood

At the Mac OS X level, an Omni Outline is a "package". A directory that appears to be a single file icon to the user. If we open that directory, however, we can see that there's an XML file inside the package that has the information we want.

Here's how we can process that file.

import xml.etree.ElementTree as xml
import os
import gzip

packagename= "{0}.oo3".format(filename)
assert 'contents.xml' in os.listdir(packagename)
with gzip.GzipFile(packagename+"/contents.xml", 'rb' ) as source:
   self.doc= xml.parse(source)

This assumes it's compressed on disk. The outlines don't have to be compressed. It's an easy try/except block to attempt the parsing without unzipping. We'll leave that as an exercise for the reader.
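For the impatient, here's one possible shape for that exercise. The function name parse_contents() is hypothetical. It relies on gzip raising OSError when the data isn't gzip-compressed, which is how Python 3.3 and later behave.

```python
import gzip
import xml.etree.ElementTree as xml

def parse_contents(path):
    """Parse an XML file that may or may not be gzip-compressed."""
    try:
        with gzip.GzipFile(path, 'rb') as source:
            return xml.parse(source)
    except OSError:
        # Not gzip data; fall back to reading the file as plain XML.
        return xml.parse(path)
```

A missing file also raises OSError on the first attempt; the fallback then raises it again, so that error isn't hidden.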

And here's how we can get the column headings: they're easy to find in the XML structure.

self.heading = []
for c in self.doc.findall(
        "{http://www.omnigroup.com/namespace/OmniOutliner/v3}columns"
        "/{http://www.omnigroup.com/namespace/OmniOutliner/v3}column"):
    # print( c.tag, c.attrib, c.text )
    if c.attrib.get('is-note-column','no') == "yes":
        pass
    else:
        # is-outline-column == "yes"? May be named "Topic".
        # other columns don't have a special role
        title= c.find("{http://www.omnigroup.com/namespace/OmniOutliner/v3}title")
        name= "".join( title.itertext() )
        self.heading.append( name )

Now that we have the column titles, we're able to walk the outline hierarchy, emitting normalized data. The indentation depth number is provided to distinguish the meaning of the data.

This involves a recursive tree-walk. Here's the top-level method function.

def __iter__( self ):
    """Find the root for the outline itself. Each item has values and children.
    Recursive walk from root of outline down through the structure.
    """
    root= self.doc.find("{http://www.omnigroup.com/namespace/OmniOutliner/v3}root")
    for item in root.findall("{http://www.omnigroup.com/namespace/OmniOutliner/v3}item"):
        for row in self._tree_walk(item):
            yield row

Here's the internal method function that does the real work.

    def _tree_walk( self, node, depth=0 ):
        """Iterator through item structure; descends recursively.
        """
        note= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}note' )
        if note is not None:
            note_text= "".join( note.itertext() )
        else:
            note_text= None
        data= []
        values= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}values' )
        if values is not None:
            for c in values:
                if c.tag == "{http://www.omnigroup.com/namespace/OmniOutliner/v3}text":
                    text= "".join( c.itertext() )
                    data.append( text )
                elif c.tag == "{http://www.omnigroup.com/namespace/OmniOutliner/v3}null":
                    data.append( None )
                else:
                    raise Exception( c.tag )
        yield depth, note_text, data
        children= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}children' )
        if children is not None:
            for child in children.findall( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}item' ):
                for row in self._tree_walk( child, depth+1 ):
                    yield row

This gets us the data in a form that doesn't require a lot of external schema information.

Each row has the indentation depth number, the note text, and the various columns of data. The only thing we need to know is which of the data columns has the indented outline.

The Trick

Here's the tricky bit.

When we recurse using a generator function, we have to explicitly iterate through the recursive result set. This is different from recursion in simple (non-generator) functions. In a simple function, it looks like this.

def function( args ):
    if base case: return value
    else:
        return calculation on function( other args )
     
When there's a generator involved, we have to do this instead.

def function_iter( args ):
    if base case: yield value
    else:
        for x in function_iter( other args ):
            yield x
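Here's the same pattern as a small runnable sketch. The (label, children) tuples are an assumed stand-in for the outline items, not the Omni Outliner XML. The second function shows the yield from statement, which Python 3.3 and later offer as a shorthand for the inner loop; it isn't available in 3.2.

```python
def flatten(node, depth=0):
    """Yield (depth, label) for a (label, children) tuple and all its descendants."""
    label, children = node
    yield depth, label
    for child in children:
        # Explicitly re-yield each row produced by the recursive call.
        for row in flatten(child, depth + 1):
            yield row

def flatten_33(node, depth=0):
    """The same walk using `yield from` (Python 3.3 and later)."""
    label, children = node
    yield depth, label
    for child in children:
        yield from flatten_33(child, depth + 1)

# Hypothetical outline, for illustration only.
outline = ("2009", [("November", [("Item", [])]), ("December", [])])
rows = list(flatten(outline))
```

Calling flatten(child, depth + 1) without the loop would create a generator object and then throw it away; nothing would be yielded from the children.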

Columnizing a Hierarchy

The depth number makes our data look like this.

0, "2009"
1, "November"
2, "Item in Outline"
3, "Subitem in Outline"
1, "December"
2, "Another Item"
3, "More Details"

We can normalize this into columns. We can take the depth number as a column number. When the depth numbers are increasing, we're building a row. When the depth number decreases, we've finished a row and are starting the next row.

"2009", "November", "Item in Outline", "Subitem in Outline"
"2009", "December", "Another Item", "More Details"

The algorithm works like this.

row, depth_prev = [], -1
for depth, text in source:
    while len(row) <= depth+1: row.append(None)
    if depth <= depth_prev: yield row
    row[depth:]= [text]+(len(row)-depth-1)*[None]
    depth_prev= depth
yield row


The yield will have to also handle the non-outline columns that may also be part of the Omni Outliner extract.
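Here's the outline-column part of the algorithm wrapped up as a runnable generator, as a sketch only. The name columnize() is hypothetical. It differs from the fragment above in two labeled ways: it pads only up to the current depth, and it yields a copy of the row, because the working list keeps changing after each yield. It doesn't attempt the non-outline columns.

```python
def columnize(source):
    """Yield one list per completed outline path from (depth, text) pairs."""
    row, depth_prev = [], -1
    for depth, text in source:
        # Make sure the row is wide enough to hold this depth.
        while len(row) <= depth:
            row.append(None)
        # A non-increasing depth means the previous row is finished.
        if depth <= depth_prev:
            yield list(row)  # a copy; row is modified again below
        # Replace this level and blank out everything deeper.
        row[depth:] = [text] + (len(row) - depth - 1) * [None]
        depth_prev = depth
    if row:
        yield list(row)

sample = [
    (0, "2009"), (1, "November"), (2, "Item in Outline"), (3, "Subitem in Outline"),
    (1, "December"), (2, "Another Item"), (3, "More Details"),
]
table = list(columnize(sample))
```

Without the copy, collecting the results into a list would give several references to the same final row.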

Tuesday, September 10, 2013

Omni Outliner and Content Conversion

First, and most important, Omni Outliner is a super-flexible tool. Crazy levels of flexibility. It's very much a generic-all-singing-all-dancing information management tool.

It has a broad spectrum of file export alternative formats. Most of which are fine for import into some kind of word processor.

But what if the data is more suitable for a spreadsheet or some more structured environment? What if it was a detailed log or a project outline decorated with a column of budget numbers?

We have two approaches: one is workable but not great; the other has numerous advantages.

Answer 1: Workable

Sure, you say, that's easy. Export into a Plain Text with Tabs (or HTML or OPML) and then parse the resulting tab-delimited file.

In Python. Piece of cake.

import csv

class Tab_Delim(csv.Dialect):
    delimiter='\t'
    doublequote=False
    escapechar='\\'
    lineterminator='\n'
    quotechar=None  # no quoting at all; Python 3.11 and later reject an empty string here
    quoting=csv.QUOTE_NONE
    skipinitialspace=True
    
rdr= csv.reader( source, dialect=Tab_Delim )
column_names= next(rdr)
for row in rdr:
    print( row ) # Boom. There it is.


That gets us started. But.

Each row is variable length. The number of columns varies with the level of indentation. The good news is that the level of indentation is consistent. Very consistent. Year, Month, Topic, Details in this case.

[When an outline is super consistent, one wonders why a spreadsheet wasn't used.]

Each outline node in the export is prefaced with "- ".

It looks pretty when printed. But it doesn't parse cleanly, since the data moves around.

Further, it turns out that "notes" (blocks of text attached to an outline node, but not part of the outline hierarchy) show up in the last column along with the data items that properly belong in the last column.

Sigh.

The good news is that notes seem to appear on a line by themselves, where the data elements seem to be properly attached to outline nodes. It's still possible to have a "blank" outline node with data in the columns, but that's unlikely.

We have to do some cleanup.

Answer 1A: Cleanup In Column 1 

We want to transform indented data into proper first-normal form schema with a consistent number of fixed columns. Step 1 is to know the deepest indent. Step 2 is to then fill each row with enough empty columns to normalize the rows.

Each specific outline has a kind of schema that defines the layout of the export file. One of the tab-delimited columns will be the "outline" column: it will have tabs and leading "-" to show the outline hierarchy. The other columns will be non-outline columns. There may be a notes column and there will be the interesting data columns which are non-notes and non-outline.

In our tab-delimited export, the outline ("Topic") is first, followed by two data columns. The minimal row size, then, will be three columns. As the topics are indented more and more, the number of columns will appear to grow. To normalize, then, we need to pad, pushing the last two columns of data to the right.

That leads to a multi-part cleanup pipeline. First, figure out how many columns there are.

    rows= list( rdr )
    width_max= max( len(r) for r in rows )+1

This allows us to use the following two generator functions to fill each row and strip "- ".

def filled( source, width, data_count ):
    """Iterable with each row filled to given width.
    Rightmost {data_count} columns are pushed right to preserve
    their position.
    """
    for r_in in source:
        yield r_in[:-data_count] + ['']*(width-len(r_in)) + r_in[-data_count:]

def cleaned( source ):
    """Iterable with each column cleaned of leading "- "
    """
    def strip_dash( c ):
        return c[2:] if c.startswith('- ') else c

    for row in source:
        yield list( strip_dash(c) for c in row )
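To see what the two generators do to the data, here's a small worked sketch. The two functions are restated in compact form so the snippet runs on its own. The three sample rows are an assumption about what the export looks like with the outline column first and two data columns after it.

```python
def filled(source, width, data_count):
    """Pad each row to `width`, keeping the last `data_count` columns rightmost."""
    for r_in in source:
        yield r_in[:-data_count] + [''] * (width - len(r_in)) + r_in[-data_count:]

def cleaned(source):
    """Strip the leading "- " outline marker from every column."""
    for row in source:
        yield [c[2:] if c.startswith('- ') else c for c in row]

# Hypothetical export rows: deeper outline levels push the data columns right.
rows = [
    ['- 2009', '', ''],
    ['', '- November', '', ''],
    ['', '', '- Item', '10', '20'],
]
width_max = max(len(r) for r in rows) + 1
normalized = list(cleaned(filled(rows, width_max, 2)))
```

Every row comes out six columns wide, with the two data columns always in the last two positions.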

That gets us to the following main loop in a conversion function.

    for row in cleaned( filled( rows, width_max, len(columns) ) ):
        # Last column may have either a note or column data.
        # If all previous columns empty, it's probably a note, not numeric value.
        if all( len(c)==0 for c in row[:-1] ):
            row[4]= row[-1]
            row[-1]= ''
        yield row

Now we can do some real work with properly normalized data. With overheads, we have an 80-line module that lets us process the outline extract in a simple, civilized CSV-style loop.

The Ick Factor

What's unpleasant about this is that it requires a fair amount of configuration.

The conversion from tab-delim outline to normalized data requires some schema information that's difficult to parameterize.

1. Which column has the outline.
2. Are there going to be notes on lines by themselves.

We can deduce how many columns of ancillary data are present, but the order of the columns is a separate piece of logical schema that we can't deduce from the export itself.