Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy site, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label xml. Show all posts

Tuesday, September 18, 2018

Data Modeling Nightmare -- XML, HTML, and Markdown

Here's a particularly tangled and difficult problem. It arises because I have another blog. Specifically this: Team Red Cruising. And it's an unholy mess.

There are two important features of the Team Red Cruising blog.
  1. It's managed with off-line editor(s) so I can write posts from the boat and then upload them when I get connectivity. Welcome to being a technomad -- I don't always have a web-based blog editor available.
  2. It was actually created with two different off-line editors over a period of years: iWeb and Sandvox. iWeb is long dead. Sandvox hasn't seen many updates recently, and I think I'd like to move on to something newer and "better". 
(In this case, "better" means iOS-friendly, e.g., Blogo or BlogPad Pro. Also, Blogo's support site seems to be a right mess. Not a good look. They're working on it.)

The blog isn't the unholy mess. We'll get to the mess below. First, however, some background on the overall strategy. I want to move my content. What's involved? There are several things in play: the hosting, the target, and the source. So. Essentially. Everything.

Changing the Hosting Platform

Both of my legacy tools would export and upload the changes to my hosting service directly, avoiding the overheads of having any complex hosting software. The site was static and served simply from the filesystem via Apache httpd. Publishing was an SFTP transfer to the server. Nothing more. The "platform" was almost nothing.

(I could switch to using an Amazon S3 bucket and a DNS entry and it would work nicely.)

Both of these offline editing tools have a tiny bias toward working with hosting services like WordPress. Blogo claims it can work with Medium and Blogger as well.

This means running WordPress on top of my default SFTP/Apache configuration. I use A2 Hosting, so this is really easy to do.

So. The hosting is more-or-less settled. I'll do very little. (Dealing with breaking links is a separate hand-wringing exercise.)

In order to move from iWeb and Sandvox to another tool, and start using WordPress, I have two strategies for converting the content.
  1. Ignore my legacy content. Leave it where it is, more-or-less uneditable. The tool(s) are gone, all that's left is the static HTML output from the tool. 
  2. Gather the legacy content and migrate it to WordPress and then pick an offline tool that works with WordPress. 
I've already done strategy #1, when I converted from iWeb to Sandvox. I left the old iWeb stuff out there, and moved to a new URL path with new content. While a clever menu structure can make it look like it's all one multi-year blog, the pages themselves are vastly different in the way they look. There's no comprehensive search. And, of course, I can't easily maintain the old iWeb stuff.

Having done #1, I'm now sure that's a bad idea.

An advantage of moving to WordPress is the ability to have all of the content in one, uniform database. WordPress has export functionality, so the next tool is a distinct possibility.

Note that Sandvox seems to have a distinct problem trying to import iWeb's published content. They have a cool HTML scraper, but iWeb relies on JavaScript, and the scraper doesn't do well.

Getting to WordPress

What we're looking at is a fairly complex data structure. While I'd like to look at this from a vast and reserved distance (i.e., in the abstract) I have a very concrete problem. So, we're forced to consider this from the WordPress POV.

We have a WordPress "Site" with a long series of posts and some pages.


The essence here is that the content can -- to an extent -- be converted to Markdown. The titles and dates are easy to preserve. The body? Not so much.

We can, as an alternative to Markdown, use some kind of skinny HTML that WordPress supports. I think WP can handle a structure free of class names, using most of the available HTML tags.

Most of the blog content is relatively flat. The block structure is generally limited to images, block quotes, paragraphs, ordered and unordered lists. The inline tags in use seem to be a, img, strong, em, and a few span tags for font changes.

The complexity, then, is building a useful content model from the source. There are a few AST's for Markdown. commonmark.py might have a useful AST.  It's not complex, so it may be simpler to define my own.

It's hard to understand the inline blocks in mistletoe. The python-markdown project uses ElementTree objects to build the AST. I'm not a fan of this because I'm not parsing Markdown.
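
For a sense of what "defining my own" might look like: a minimal content model, roughly what the flattened posts seem to need. This is a sketch, not the converter's actual code, and the class names are mine.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Inline:
        """A run of text with an optional style: 'strong', 'em', 'a', or None."""
        text: str
        style: Optional[str] = None
        href: Optional[str] = None

    @dataclass
    class Block:
        """A flat block: 'p', 'blockquote', 'li', or 'img'."""
        kind: str
        spans: List[Inline] = field(default_factory=list)

    @dataclass
    class Post:
        title: str
        date: str
        body: List[Block] = field(default_factory=list)

Flat blocks holding flat runs of styled text is about all the source content appears to require.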

Starting From -- Well, it's Complicated

There are -- as noted above -- two sources:
  • Sandvox.
  • iWeb.
The Sandvox desktop "database" structure is opaque. The media is easy to find. The content is some kind of binary-encoded data with headers that tell me a little about the XCode environment, but nothing else.

To read this, I have to scrape the HTML using Beautiful Soup. It involves processing like this:

    content = soup.html.body.find("div", id="main-content")
    article = content.find(class_="article-content").find(class_="RichTextElement").div

Find a nested <div> with a target ID. Inside that <div> is where the article can be found.

This seems to work out pretty well. Almost everything I want to preserve can be -- sort of -- mushed into Markdown.
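
The "mushing" is mostly a walk over a small set of tags. Here's a sketch of the kind of flattening involved, assuming the article <div> located above; it's illustrative, not the real conversion code.

    from bs4 import NavigableString

    def inline_md(node):
        """Flatten one inline node to Markdown text; nesting is discarded."""
        if isinstance(node, NavigableString):
            return str(node)
        text = "".join(inline_md(child) for child in node.children)
        if node.name in ("strong", "b"):
            return "**{0}**".format(text)
        if node.name in ("em", "i"):
            return "*{0}*".format(text)
        if node.name == "a":
            return "[{0}]({1})".format(text, node.get("href", ""))
        if node.name == "img":
            return "![{0}]({1})".format(node.get("alt", ""), node.get("src", ""))
        return text  # spans and anything else: keep the words, drop the markup

    def block_md(tag):
        """Render one flat block as a Markdown chunk."""
        body = "".join(inline_md(child) for child in tag.children).strip()
        if tag.name == "blockquote":
            return "> " + body
        if tag.name == "li":
            return "- " + body
        return body

Joining block_md over the article's block-level children gives something close enough to review by hand.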

The iWeb desktop "database" is XML. The published HTML depends on Javascript and is hard to work with. The XML is -- of course -- densely wordy and convoluted as can be. But the words and markup are there.  I can use ElementTree to walk down through XML to locate the right tags.

There's a lot of code like this

    main_layer = child_root.find('ns0:site-page/ns0:drawables/ns0:main-layer', ns)

This example digs into site pages, and nested drawables, and main layers of content.  Eventually, we wind up looking at <p>, <span>, <attachment-ref>, and <link> tags in the XML to build the relevant content.

The nuance is styles. They're not part of the inline markup. They're stored separately, and included by reference. Each of the four tags that seem to be in use has a style attribute that references styles defined within the posting. Once these references are resolved, I think Markdown can be generated.
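
The resolution step amounts to building a map from style identifier to properties, then consulting it for each tag. The element and attribute names below are placeholders, not iWeb's real schema; the shape of the lookup is the point.

    # Placeholder names throughout: iWeb's actual style elements and attributes differ.
    def build_style_map(post_root):
        """Map each style's identifier to a dict of its properties."""
        styles = {}
        for style in post_root.iter('{http://developer.apple.com/namespaces/sf}style'):
            ident = style.get('{http://developer.apple.com/namespaces/sfa}ID')
            styles[ident] = {p.tag.rpartition('}')[2]: p.get('value') for p in style}
        return styles

    def render_span(span, style_map):
        """Follow the span's style reference and pick a Markdown treatment."""
        props = style_map.get(span.get('{http://developer.apple.com/namespaces/sf}style'), {})
        text = span.text or ""
        if props.get('bold') == 'true':
            return "**{0}**".format(text)
        if props.get('italic') == 'true':
            return "*{0}*".format(text)
        return text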

The Unholy Mess

The hateful part of this is the disconnect between HTML (and XML) and Markdown. The source data permits indefinite nesting of tags. Semantically meaningless <p><p>words</p></p> are legal. The "flattening" from HTML/XML to Markdown is worrisome: what if I trash an entry by missing something important?

Ideally, it's this:



Pragmatically, HTML/XML can be more complex. This diagram assumes we won't have paragraphs inside list items. HTML permits it. It's redundant in Markdown.

Worse, of course, are the inline tags. HTML has a kabillion of them. The software I've been using seems to limit me to <img>, <strong>, <em>, and <a>. HTML/XML allows nesting. Markdown doesn't.

Ideally, I can reframe the inline tags to create a flat sequence of styled-text objects within any of the tags.

Right now. Headaches.

Working on the code. It's not a general solution to anyone else's problem. But. I'm hoping -- as I beat the problem into submission -- to find a way to make some useful tutorial materials on mapping between complex, and different, data structures.

Tuesday, August 1, 2017

JSON vs. XML: The battle for format supremacy may be wasted energy - SD Times

http://sdtimes.com/json-vs-xml-battle-format-supremacy-may-wasted-energy/

This article seems silly. Perhaps I missed something important.

I'm not sure who's still litigating JSON vs. XML, but it seems like it's more-or-less done.

XHTML/XML for HTML things.

JSON for everything else.

Maybe there are people still wringing their hands over this. AFAIK, the last folks using SOAP/XML services are commercial and governmental agencies where change tends to happen very slowly.

I remember when Sun Microsystems was a company and had the Java Composite Applications Suite. Very XML. That was -- perhaps -- ten years ago. Since then, I think the problem has been solved. I'm not sure who's battling for supremacy or why.

Tuesday, March 1, 2016

Dexy and word-processing toolchains

See http://www.dexy.it

Wow. That seems cool.

I write. A lot.

I've tried a lot of tool chains. A lot. And I mean non-trivial "try". Whole books.  Hundreds of pages.

LEO + my own HTML Templates. A lot of fun at first.  An outliner that generates RST is a very, very handy thing for technical writing.

An XML editor (I forget which one. Maybe http://www.xmlmind.com/xmleditor/?) with the DocBook XML and XSLT tool-chain. This produced HTML from the XML. I think it could also produce LaTeX. Again, the outlining and structuring were kind of handy. What was particularly cool was the diverse semantic markup tags available in DocBook. Getting the tag containment right was a large pain even with a handy GUI editor.

Plain Text using RST "the hard way;" i.e., without LEO. This isn't too bad, it turns out. The outlining features of LEO -- while fun -- aren't essential. A simple RST toolchain is easy to concoct. I used SCONS to rebuild HTML and LaTeX from the RST.

LaTeX. Once I had the base LaTeX from RST, I could then edit that to produce an even richer document using lots of LaTeX add-on packages. I use MacTeX and it is truly great. The downside of LaTeX -- for me -- was no trivial way to go back to HTML from the LaTeX. There are some complex back-and-forth toolchains, but it's easier to just produce a PDF. PDF looks great, but wasn't really my goal.

RST with Sphinx. Wow. This is elegant. I often produce chapter drafts here, and then copy and paste the HTML into the word processors preferred by the publishing industry.

[They insist on applying their goofy markup templates to the text. It's a subset of meaningful semantic markup used by Sphinx, but somehow their toolchain must start with .DOCX files, and nothing else will do.]

Dexy was cool right up until I read this: "Dexy is a Python package (Python 2.6-2.7 only)".

Ouch. Web.py Utilities include DBUtils which won't install under Python 3.5. So that put the kibosh on Dexy. Sadface.

Tuesday, September 17, 2013

iWeb File Extract and XML Iterators

Once upon a time, Apple offered iBlog. Then they switched to iWeb. Then they abandoned that market entirely.

That leaves some of us with content in iBlog as well as iWeb. Content we'd like to work with without doing extensive cutting and pasting. Or downloading from a web server. After all, the files are on our computer.

The iWeb files are essentially XML, making them relatively easy to work with. We can reduce the huge, and hugely complex iWeb XML to a simple iterator and use a simple for statement to extract the content.

[Historical note. I wrote a Python script to convert iBlog to RST. It worked reasonably well, all things considered. This is not the first time I've tried to preserve old content from obsolete tools. Sigh.]

Some tools (like Sandvox) have an "extract iWeb content" mode. But that's not what we want. We don't want to convert from iWeb to another blog. We want to convert from iWeb to CSV or some other more useful format so we can do some interesting processing, not simple web presentation.

This is a note on how to read iWeb files to get at the content. And further, how to get at XML content in the form of a simple iterator.

Opening The Package

Here's how to overview the package.

    path="~/Documents/iWeb/Domain"
    path_full= os.path.expanduser(path+".sites2")
    for filename in os.listdir(path_full):
        name, ext = os.path.splitext( filename )
        if ext.lower() in ( ".jpg", ".png", ".mov", ".m4v", ".tiff", ".gif", ".m4a", ".mpg", ".pdf" ): continue
        print( filename )

This will reveal the files; we only really care about the "index.xml.gz" file since that has the bulk of the content.

    with closing( gzip.GzipFile( os.path.join(path_full,"index.xml.gz") ) ) as index:
        index_doc= xml.parse( index )
        index_root= index_doc.getroot()

This gets us the XML version of the blog.

Finding the Pages

We can use the following to thread through the XML. We're looking for a particular "Domain", a "Site" and a particular blog page within that site. The rest of the blog is mostly text. This portion of the blog is more structured.

For some reason, the domain is "Untitled". The site is "Cruising", and the blog page is "Travel 2012-2013". We insert these target names into XPath search strings to locate the relevant content.


search= '{{http://developer.apple.com/namespaces/bl}}domain[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(domain_name)
domain= index_root.find( search )
mdu_uuid_tag= domain.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
domain_filename= "domain-{0}".format( mdu_uuid_value )

search= './/{{http://developer.apple.com/namespaces/bl}}site[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(site_name)
cruising= domain.find(search)
mdu_uuid_tag= cruising.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
site_filename= "site-{0}".format(mdu_uuid_value)

search= '{{http://developer.apple.com/namespaces/bl}}site-blog[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(site_blog_name)
site_nodes= cruising.find('{http://developer.apple.com/namespaces/bl}site-nodes')
travel= site_nodes.find(search)
mdu_uuid_tag= travel.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
site_blog_filename= "site-blog-{0}".format(mdu_uuid_value)


This will allow us to iterate through the blog entries, called "pages". Each page, it turns out, is stored in a separate XML file with the page details and styles. A lot of styles. We have to assemble the path from the base path, the domain, site,  site-blog and site-page names. We'll find an ".xml.gz" file that has the individual blog post.

    for site_page in travel.findall('{http://developer.apple.com/namespaces/bl}series/{http://developer.apple.com/namespaces/bl}site-page'):
        mdu_uuid_tag= site_page.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
        mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
        site_page_filename= "site-page-{0}".format(mdu_uuid_value)

        blog_path= os.path.join(path_full, domain_filename, site_filename, site_blog_filename, site_page_filename )
        with closing( gzip.GzipFile( os.path.join(blog_path,site_page_filename+".xml.gz") ) ) as child:
            child_doc= xml.parse( child )
            child_root= child_doc.getroot()
        main_layer= child_root.find( '{http://developer.apple.com/namespaces/bl}site-page/{http://developer.apple.com/namespaces/bl}drawables/{http://developer.apple.com/namespaces/bl}main-layer' )

Once we have access to the page XML document, we can extract the content. At this point, we could define a function which simply yielded the individual site_page tags.
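
That generator is little more than the loop above with a yield at the bottom. A sketch, reusing the imports already in play, with parameter names of my own choosing:

    def site_page_iter(travel, path_full, domain_filename, site_filename, site_blog_filename):
        """Yield the parsed root of each site-page's own .xml.gz document."""
        bl = '{http://developer.apple.com/namespaces/bl}'
        sfa = '{http://developer.apple.com/namespaces/sfa}'
        for site_page in travel.findall(bl + 'series/' + bl + 'site-page'):
            mdu_uuid_tag = site_page.find(bl + 'metadata/' + bl + 'MDUUID')
            mdu_uuid_value = mdu_uuid_tag.find(bl + 'string').get(sfa + 'string')
            site_page_filename = "site-page-{0}".format(mdu_uuid_value)
            blog_path = os.path.join(path_full, domain_filename, site_filename,
                                     site_blog_filename, site_page_filename)
            with closing(gzip.GzipFile(os.path.join(blog_path, site_page_filename + ".xml.gz"))) as child:
                yield xml.parse(child).getroot()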

Summary Iterable

The most useful form for the pages is an iterable that yields the date, title and content text. In this case, we're not going to preserve the internal markup, we're just going to extract the text in bulk.


        content_map = {}
        for ds in main_layer.findall( '{http://developer.apple.com/namespaces/sf}drawable-shape' ):
            style_name= ds.get('{http://developer.apple.com/namespaces/sf}name')
            if style_name is None:
                #xml.dump( ds ) # Never has any content.
                continue
            for tb in ds.findall('{http://developer.apple.com/namespaces/sf}text/{http://developer.apple.com/namespaces/sf}text-storage/{http://developer.apple.com/namespaces/sf}text-body' ):
                # Simply extract text. Markup is lost.
                content_map[style_name] = tb.itertext()
        yield content_map


This works because the text has no useful semantic markup. It's essentially HTML formatting full of span and div tags.

Note that this could be a separate generator function, or it could be merged into the loop that finds the site-page tags. It's unlikely we'd ever have another source of site-page tags. But, it's very likely that we'd have another function for extracting the text, date and title from a site-page tag. Therefore, we should package this as a separate generator function. We didn't, however. It's just a big old function named postings_iter().

There are three relevant style names. We're not sure why these are used, but they're completely consistent indicators of the content.
  • "generic-datefield-attributes (from archive)"
  • "generic-title-attributes (from archive)"
  • "generic-body-attributes (from archive)"
These become keys of the content_map mapping. The values are iterators over the text.

Processing The Text

Here's an iterator that makes use of the postings_iter() function shown above.

def flatten_posting_iter( postings=postings_iter(path="~/Documents/iWeb/Domain") ):
    """Minor cleanup to the postings to parse the date and flatten out the title."""
    for content_map in postings:
        date_text= " ".join( content_map['generic-datefield-attributes (from archive)'] )
        date= datetime.datetime.strptime( date_text, "%A, %B %d, %Y" ).date()
        title= " ".join( content_map['generic-title-attributes (from archive)'] )
        body= content_map['generic-body-attributes (from archive)']
        yield date, title, body

This will parse the dates, compress the title to remove internal markup, but otherwise leave the content untouched. 

Now we can use the following kind of loop to examine each posting.

    flat_postings=flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
    for date, title, text_iter in sorted(flat_postings):
        for text in text_iter:
           # examine the text for important content.

We've sorted the postings into date order. Now we can process the text elements to look for the relevant content.

In this case, we're looking for Lat/Lon coordinates, which have rather complex (but easy to spot) regular expressions. So the "examine" part is a series of RE matches to collect the data points we're looking for.
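
Purely for flavor, one such pattern might look like the following; the real postings use several formats, so the real code carries several expressions.

import re

# Matches positions like 37 48.12N 122 25.90W (degree symbols optional).
LAT_LON = re.compile(
    r"(?P<lat_d>\d{1,2})[\s°](?P<lat_m>\d{1,2}\.\d+)\s*(?P<lat_h>[NS])"
    r"[,\s]+"
    r"(?P<lon_d>\d{1,3})[\s°](?P<lon_m>\d{1,2}\.\d+)\s*(?P<lon_h>[EW])")

def positions(text):
    """Yield (lat, lon) as signed decimal degrees for each match."""
    for m in LAT_LON.finditer(text):
        lat = int(m.group("lat_d")) + float(m.group("lat_m")) / 60
        lon = int(m.group("lon_d")) + float(m.group("lon_m")) / 60
        yield (lat if m.group("lat_h") == "N" else -lat,
               lon if m.group("lon_h") == "E" else -lon)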

We'll leave off those application-specific details. We'll leave it at the following outline of the processing.

def fact_iter( flat_postings=flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain")) ):
    for date, title, text_iter in sorted(flat_postings):
        fact= Fact()
        for text in text_iter:
            # examine the text for important content, set attributes of fact
            if fact.complete():
                yield fact
                fact= Fact()

This iterates through focused data structures that include the requested lat/lon points.
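
The Fact class isn't shown here. Something with a heading, a completeness test, and a dict conversion is all the surrounding code requires; a plausible stand-in, purely for illustration:

class Fact:
    """Hypothetical stand-in: one position report assembled from a posting."""
    heading = ["date", "title", "latitude", "longitude"]
    def __init__(self):
        self.date = self.title = self.latitude = self.longitude = None
    def complete(self):
        return None not in (self.date, self.title, self.latitude, self.longitude)
    def as_dict(self):
        return dict(date=self.date, title=self.title,
                    latitude=self.latitude, longitude=self.longitude)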

Final Application

The final application function that uses all of these iterators has the following kind of structure.


source= flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
with open('target.csv', 'w', newline='') as target:
    wtr= csv.DictWriter( target, Fact.heading )
    wtr.writeheader()
    for fact in fact_iter( source ):
        wtr.writerow( fact.as_dict() )


We're simply iterating through the facts and writing them to a CSV file.

We can also simplify the last bit to this.

wtr.writerows( f.as_dict() for f in fact_iter( source ) )

The iWeb XML structure, while bulky and complex, can easily be reduced to a simple iterator. That's why I love Python.

Thursday, September 12, 2013

Omni Outliner, XML Processing, and Recursive Generators

First, and most important, Omni Outliner is a super-flexible tool. Crazy levels of flexibility. It's very much a generic-all-singing-all-dancing information management tool.

It has a broad spectrum of file export alternative formats. Most of which are fine for import into some kind of word processor.

But what if the data is more suitable for a spreadsheet or some more structured environment? What if it was a detailed log or a project outline decorated with a column of budget numbers?

We have two approaches, one is workable, but not great, the other has numerous advantages.

In the previous post, "Omni Outliner and Content Conversion", we read an export in tab-delimited format. It was workable but icky.

Here's the alternative. This uses a recursive generator function to flatten out the hierarchy. There's a trick to recursion with generator functions.

Answer 2: Look Under the Hood

At the Mac OS X level, an Omni Outline is a "package". A directory that appears to be a single file icon to the user. If we open that directory, however, we can see that there's an XML file inside the package that has the information we want.

Here's how we can process that file.

import xml.etree.ElementTree as xml
import os
import gzip

packagename= "{0}.oo3".format(filename)
assert 'contents.xml' in os.listdir(packagename)
with gzip.GzipFile(packagename+"/contents.xml", 'rb' ) as source:
    self.doc= xml.parse(source)

This assumes it's compressed on disk. The outlines don't have to be compressed. It's an easy try/except block to attempt the parsing without unzipping. We'll leave that as an exercise for the reader.
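
For the record, a minimal version of that exercise might look like this; it relies on GzipFile raising IOError when the file isn't actually compressed. A sketch, not the post's original code.

import gzip
import xml.etree.ElementTree as xml

def parse_contents(packagename):
    """Parse contents.xml whether or not the outline was saved compressed."""
    path = packagename + "/contents.xml"
    try:
        with gzip.GzipFile(path, 'rb') as source:
            return xml.parse(source)
    except IOError:
        # Not gzipped after all; parse it as plain XML.
        return xml.parse(path)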

And here's how we can get the column headings: they're easy to find in the XML structure.

self.heading = []
for c in self.doc.findall(
        "{http://www.omnigroup.com/namespace/OmniOutliner/v3}columns"
        "/{http://www.omnigroup.com/namespace/OmniOutliner/v3}column"):
    # print( c.tag, c.attrib, c.text )
    if c.attrib.get('is-note-column','no') == "yes":
        pass
    else:
        # is-outline-column == "yes"? May be named "Topic".
        # other columns don't have a special role
        title= c.find("{http://www.omnigroup.com/namespace/OmniOutliner/v3}title")
        name= "".join( title.itertext() )
        self.heading.append( name )

Now that we have the columns titles, we're able to walk the outline hierarchy, emitting normalized data. The indentation depth number is provided to distinguish the meaning of the data.

This involves a recursive tree-walk. Here's the top-level method function.

def __iter__( self ):
    """Find  for outline itself. Each item has values and children.
    Recursive walk from root of outline down through the structure.
    """
    root= self.doc.find("{http://www.omnigroup.com/namespace/OmniOutliner/v3}root")
    for item in root.findall("{http://www.omnigroup.com/namespace/OmniOutliner/v3}item"):
        for row in self._tree_walk(item):
            yield row

Here's the internal method function that does the real work.

    def _tree_walk( self, node, depth=0 ):
        """Iterator through item structure; descends recursively.
        """
        note= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}note' )
        if note is not None:
            note_text= "".join( note.itertext() )
        else:
            note_text= None
        data= []
        values= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}values' )
        if values is not None:
            for c in values:
                if c.tag == "{http://www.omnigroup.com/namespace/OmniOutliner/v3}text":
                    text= "".join( c.itertext() )
                    data.append( text )
                elif c.tag == "{http://www.omnigroup.com/namespace/OmniOutliner/v3}null":
                    data.append( None )
                else:
                    raise Exception( c.tag )
        yield depth, note_text, data
        children= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}children' )
        if children is not None:
            for child in children.findall( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}item' ):
                for row in self._tree_walk( child, depth+1 ):
                    yield row

This gets us the data in a form that doesn't require a lot of external schema information.

Each row has the indentation depth number, the note text, and the various columns of data. The only thing we need to know is which of the data columns has the indented outline.

The Trick

Here's the tricky bit.

When we recurse using a generator function, we have to explicitly iterate through the recursive result set. This is different from recursion in simple (non-generator) functions. In a simple function, it looks like this.

def function( args ):
    if base case: return value
    else:
        return calculation on function( other args )
     
When there's a generator involved, we have to do this instead.

def function_iter( args ):
    if base case: yield value
    else:
        for x in function_iter( other args )
            yield x
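
A concrete, if trivial, example of the pattern; since Python 3.3 the inner loop can also be spelled yield from function_iter( other args ).

def flatten_iter(item):
    """Recursive generator: yield every leaf of an arbitrarily nested list."""
    if isinstance(item, list):
        for sub in item:
            for leaf in flatten_iter(sub):  # re-yield the recursive results
                yield leaf
    else:
        yield item

print(list(flatten_iter([1, [2, [3, 4]], 5])))  # [1, 2, 3, 4, 5]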

Columnizing a Hierarchy

The depth number makes our data look like this.

0, "2009"
1, "November"
2, "Item In Outline"
3, "Subitem in Outline"
1, "December"
2, "Another Item"
3, "More Details"

We can normalize this into columns. We can take the depth number as a column number. When the depth numbers are increasing, we're building a row. When the depth number decreases, we've finished a row and are starting the next row.

"2009", "November", "Item in Outline", "Subitem in Outline"
"2009", "December", "Another Item", "More Details"

The algorithm works like this.

def columnize(source):
    """Convert (depth, text) pairs into complete rows of columns."""
    row, depth_prev = [], -1
    for depth, text in source:
        while len(row) <= depth+1: row.append(None)
        if depth <= depth_prev: yield list(row)  # emit a copy; row is reused
        row[depth:]= [text]+(len(row)-depth-1)*[None]
        depth_prev= depth
    yield list(row)


The yield will also have to handle the non-outline columns that may be part of the Omni Outliner extract.

Monday, October 4, 2010

.xlsm and .xlsx Files -- Finally Reaching Broad Use

For years, I've been using Apache POI in Java and XLRD in Python to read spreadsheets. Finally, now that .XLSX and .XLSM files are in more widespread use, we can move away from those packages and their reliance on successful reverse engineering of undocumented features.

Spreadsheets are -- BTW -- the universal user interface. Everyone likes them, they're almost inescapable. And they work. There's no reason to attempt to replace the spreadsheet with a web page or a form or a desktop application. It's easier to cope with spreadsheet vagaries than to replace them.

The downside is, of course, that users often tweak their spreadsheets, meaning that you never have a truly "stable" interface. However, transforming each row of data into a Python dictionary (or Java mapping) often works out reasonably well to make your application mostly immune to the common spreadsheet tweaks.

Most of the .XLSX and .XLSM spreadsheets we process can be trivially converted to CSV files. It's manual, yes, but a quick audit can check the counts and totals.

Yesterday we got an .XLSM with over 80,000 rows. It couldn't be trivially converted to CSV by my installation of Excel.

What to do?

Python to the Rescue

Step 1. Read the standards. Start with the Wikipedia article: "Office Open XML". Move to the ECMA 376 standard.

Step 2. It's a zip archive. So, to process the file, we need to locate the various bits inside the archive. In many cases, the zip members can be processed "in memory". In the case of our 80,000+ row spreadsheet, the archive is 34M. The sheet in question expands to a 215M beast. The shared strings are 3M. This doesn't easily fit into memory.

Further, a simple DOM parser, like Python's excellent ElementTree, won't work on files this huge.
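
(ElementTree's iterparse is the other streaming option: it parses incrementally and stays small as long as processed elements are cleared. A sketch of that alternative, with a placeholder sheet name; the rest of this post sticks with SAX.)

import xml.etree.ElementTree as ET

# Placeholder path; the actual worksheet names come from the zip archive listing.
for event, elem in ET.iterparse("xl/worksheets/sheet1.xml", events=("end",)):
    if elem.tag.endswith("}row"):
        # ... pull the cell values out of elem here ...
        elem.clear()  # release the finished row so memory stays bounded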

Expanding an XLSX or XLSM file

Here's step 2. Expanding the zip archive to locate the shared strings and sheets.
import zipfile

def get_worksheets(name):
    arc= zipfile.ZipFile( name, "r" )
    member= arc.getinfo("xl/sharedStrings.xml")
    arc.extract( member )
    for member in arc.infolist():
        if member.filename.startswith("xl/worksheets") and member.filename.endswith('.xml'):
            arc.extract(member)
            yield member.filename

This does two things. First, it locates the shared strings and the various sheets within the zip archive. Second, it expands the sheets and shared strings into the local working directory.

There are many other parts to the workbook archive. The good news is that we're not interested in complex workbooks with lots of cool Excel features. We're interested in workbooks that are basically file-transfer containers. Usually a few sheets with a consistent format.

Once we have the raw files, we have to parse the shared strings first. Then we can parse the data. Both of these files are simple XML. However, they don't fit in memory. We're forced to use SAX.

Step 3 -- Parse the Strings

Here's a SAX ContentHandler that finds the shared strings.
import xml.sax
import xml.sax.handler

class GetStrings( xml.sax.handler.ContentHandler ):
    """Locate Shared Strings."""
    def __init__( self ):
        xml.sax.handler.ContentHandler.__init__(self)
        self.context= []
        self.count= 0
        self.string_dict= {}
    def path( self ):
        return [ n[1] for n in self.context ]
    def startElement( self, name, attrs ):
        print( "***Non-Namespace Element", name )
    def startElementNS( self, name, qname, attrs ):
        self.context.append( name )
        self.buffer= ""
    def endElementNS( self, name, qname ):
        if self.path() == [u'sst', u'si', u't']:
            self.string_dict[self.count]= self.buffer
            self.buffer= ""
            self.count += 1
        while self.context[-1] != name:
            self.context.pop(-1)
        self.context.pop(-1)
    def characters( self, content ):
        if self.path() == [u'sst', u'si', u't']:
            self.buffer += content

This handler collects the strings into a simple dictionary, keyed by their relative position in the XML file.

This handler is used as follows.
string_handler= GetStrings()
rdr= xml.sax.make_parser()
rdr.setContentHandler( string_handler )
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
rdr.parse( "xl/sharedStrings.xml" )
We create the handler, create a parser, and process the shared strings portion of the workbook. When this is done, the handler has a dictionary of all strings. This is string_handler.string_dict. Note that a shelve database could be used if the string dictionary was so epic that it wouldn't fit in memory.

The Final Countdown

Once we have the shared strings, we can then parse each worksheet, using the share string data to reconstruct a simple CSV file (or JSON document or something more usable).

The Content Handler for the worksheet isn't too complex. We only want cell values, so there's little real subtlety. The biggest issue is coping with the fact that sometimes the content of a tag is reported in multiple parts.

import re

class GetSheetData( xml.sax.handler.ContentHandler ):
    """Locate column values."""
    def __init__( self, string_dict, writer ):
        xml.sax.handler.ContentHandler.__init__(self)
        self.id_pat = re.compile( r"(\D+)(\d+)" )
        self.string_dict= string_dict
        self.context= []
        self.row= {}
        self.writer= writer
        self.buffer= "" # text accumulator; characters() can fire before the first <v>
    def path( self ):
        return [ n[1] for n in self.context ]
    def startElement( self, name, attrs ):
        print( "***Non-Namespace Element", name )
    def startElementNS( self, name, qname, attrs ):
        self.context.append( name )
        if name[1] == "row":
            self.row_num = attrs.getValueByQName(u'r')
        elif name[1] == "c":
            if u't' in attrs.getQNames():
                self.cell_type = attrs.getValueByQName(u't')
            else:
                self.cell_type = None # default, not a string
            self.cell_id = attrs.getValueByQName(u'r')
            id_match = self.id_pat.match( self.cell_id )
            self.row_col = self.make_row_col( id_match.groups() )
        elif name[1] == "v":
            self.buffer= "" # Value of a cell
        else:
            pass # might do some debugging here.
    @staticmethod
    def make_row_col( col_row_pair ):
        col = 0
        for c in col_row_pair[0]:
            col = col*26 + (ord(c)-ord("A")+1)
        return int(col_row_pair[1]), col-1
    def endElementNS( self, name, qname ):
        if name[1] == "row":
            # write the row to the CSV result file; +1 so the last column is included.
            self.writer.writerow( [ self.row.get(i) for i in xrange(max(self.row.keys())+1) ] )
            self.row= {}
        elif name[1] == "v":
            if self.cell_type is None:
                try:
                    self.value= float( self.buffer )
                except ValueError:
                    print( self.row_num, self.cell_id, self.cell_type, self.buffer )
                    self.value= None
            elif self.cell_type == "s":
                try:
                    self.value= self.string_dict[int(self.buffer)]
                except ValueError:
                    print( self.row_num, self.cell_id, self.cell_type, self.buffer )
                    self.value= None
            elif self.cell_type == "b":
                self.value= bool(self.buffer)
            else:
                print( self.row_num, self.cell_id, self.cell_type, self.buffer, self.string_dict.get(int(self.buffer)) )
                self.value= None
            self.row[self.row_col[1]] = self.value
        while self.context[-1] != name:
            self.context.pop(-1)
        self.context.pop(-1)
    def characters( self, content ):
        self.buffer += content

This class and the shared string handler could be refactored to eliminate a tiny bit of redundancy.

This class does two things. At the end of a tag, it determines what data was found. It could be a number, a boolean value or a shared string. At the end of a tag, it writes the row to a CSV writer.

This handler is used as follows.
rdr= xml.sax.make_parser()
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
for s in sheets:
    with open(s+".csv","wb") as result:
        handler= GetSheetData(string_handler.string_dict,csv.writer(result))
        rdr.setContentHandler( handler )
        rdr.parse( s )

This iterates through each sheet, transforming it into a simple .CSV file. Once we have the file in CSV format, it's smaller and simpler. It can easily be processed by follow-on applications.

The overall loop actually looks like this.

sheets= list( get_worksheets(name) )

string_handler= GetStrings()
rdr= xml.sax.make_parser()
rdr.setContentHandler( string_handler )
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
rdr.parse( "xl/sharedStrings.xml" )

rdr= xml.sax.make_parser()
rdr.setFeature( xml.sax.handler.feature_namespaces, True )
for s in sheets:
    with open(s+".csv","wb") as result:
        handler= GetSheetData(string_handler.string_dict,csv.writer(result))
        rdr.setContentHandler( handler )
        rdr.parse( s )

This expands the shared strings and individual sheets. It iterates through the sheets, using the shared strings, to create a bunch of .CSV files from the .XLSM data.

The resulting .CSV -- stripped of the XML overheads -- is 80,000+ rows and only 39M. Also, it can be processed with the Python csv library.

CSV Processing

This, after all, was the goal. Read the CSV file and do some useful work.
def csv_rows(source):
    rdr= csv.reader( source )
    headings = []
    for n, cols in enumerate( rdr ):
        if n < 4:
            if headings:
                headings = [ (top+' '+nxt).strip() for top, nxt in zip( headings, cols ) ]
            else:
                headings = cols
            continue
        yield dict(zip(headings,cols))

We locate the four header rows and build labels from the four rows of data. Given these big, complex headers, we can then build a dictionary from each data row. The resulting structure is exactly like the results of a csv.DictReader, and can be used to do the "real work" of the application.

Thursday, November 26, 2009

Python Book -- Version 2.6

Completely revised the Building Skills in Python book.

It now covers Python 2.6 and is much, much easier to maintain in ReStructured Text markup, formatted with Sphinx and LaTeX (via TeXLive), than it was in XML.

XML -- while modern and clean and uniform -- isn't as convenient as LaTeX and RST.

Monday, August 17, 2009

Building Skills Books Toolset (Update)

I wrote the first Building Skills books using Appleworks. It wasn't too bad to organize the styles around basic semantics of the subject area. It's an easy, productive writing environment. Except, of course, for internal cross-references, indexes, and tables of contents.

I converted to DocBook XML markup. The conversion was arduous, but well worth it. I got better semantic markup. I used the DocBook XSL tools to convert to HTML both as a single document, and a chunked presentation. It worked out pretty well.

Two things don't work out well. First, the FOP processing is shaky. The books are big, and rather complex, with a fair number of embedded fonts. I have been unable to get the embedded fonts to work correctly with FOP.

The second thing that doesn't work out well is the language-specific markup. DocBook is biased toward C. There aren't enough tags for Python markup (module and library tags are missing, for example) and the syntax-oriented statement, class and function markup is all over the map in DocBook.

Objectives

My goal is to have the books in four formats: XML, single HTML file, chunked HTML and PDF. Of these, the single HTML is the least appealing. The chunked HTML is a great carrier for Adsense ads. The PDF is what I should be selling.

I don't mind writing in XML. Using XMLMind XMLEditor is generally pretty nice. Running the XSL-based tool chain to convert to HTML, and chunked HTML is easy.

Currently, I'm using FireFox to create the PDF. It's quick, but dirty. I'm not sure how many of the print-formatting CSS options FireFox can handle, so I haven't really customized the CSS for printing. However, the FireFox PDF has properly embedded fonts and cross-reference links.

Choices

Apple's Pages does a lot. It's a very nice product. But I'm not sure that the PDF and Chunked HTML will work out all that well.

The DocBook tool chain has problems identified above. The PDF output doesn't work because it overwhelms FOP.
  • Currently, I use FireFox to create PDF's. I could dress up the CSS to make it look a little better.
  • An alternative is to use Pisa to transform the XHTML into PDF. I started using Flying Saucer on another project and the XHTML to PDF idea has some appeal. This requires debugging the print-media CSS, which doesn't seem too bad.
On the other hand, RST can have almost all the semantic richness of XML. I've decided to redo Building Skills in Programming entirely in Sphinx, using RST. This has the advantage of being Python-specific, making heavy use of pygments for syntax coloring.

Also, I could stick with XML and use a different tool-chain to go from DocBook XML to LaTeX. The dblatex package may do this nicely.

Tradeoffs

If I switch to Sphinx, editing is much easier. The source is plain text.

The chunked HTML created by Sphinx is outstanding. It's far better than the DocBook HTML. It's much easier to customize than the DocBook XSL, allowing use of Adsense ads with relatively little work.

On the other hand, to produce PDF, I have to go through LaTeX. This means that I have to find a nice LaTeX to PDF tool.

Currently, Sphinx doesn't easily produce a single HTML file. There may be ways around this; perhaps by using an alternate `.. toctree::` directive. But this is also a low-priority requirement, so this may have to be dropped in favor of a better-looking PDF page.

LaTeX to PDF

A Google search for "mac os x latex to pdf" turns up some interesting results.



This list of references makes it look appealing to start with TeXShop and seeing if the LaTeX output from Sphinx can be used to produce PDF.

The TeXLive distribution includes a basic pdfTeX utility that might emit a nice PDF from the Sphinx LaTeX output.

Additionally, there is iTeXMac, which may also convert my Sphinx LaTeX to PDF.

These, however, seem to be largely WYSIWYG editing. While editing LaTeX isn't too bad, I want to work from a single RST source.

Python Solutions

The "python latex to pdf" Google search turns up the following projects for doing LaTeX processing in Python. These look very nice. In particular, they get away from manual editing of LaTeX.



Better Still

Finally, I located the following: http://jimmyg.org/blog/2009/sphinx-pdf-generation-with-latex.html. This makes it clear that Sphinx expects TeXLive. This leads me to MacTeX, which -- it appears -- is what Sphinx expects.

Sphinx generates a makefile to create PDF from the LaTeX. Hopefully, this is not highly Linux-specific and will use the TeXlive distribution on Mac OS X.

Bonus Feature

Switching to LaTeX may also give me a better way to handle the formulas in the exercise sections. Currently, I have to write them and save the images. I don't know how many different equation editors I've used.

Alternative RST to PDF

There's an rst2pdf tool that may make it possible to go from Sphinx RST directly to PDF. Hopefully, this honors all the Sphinx extensions.

Wednesday, July 22, 2009

Software Overdesign -- An Update

Saw a horrifying design document recently. One that was at the "gouge out my eyes" level of badness. That's one step below "drink until I forget what I saw", but one level above "beat the author with a tire iron."

They were -- I'm guessing here -- trying to develop their own Document Object Model. Distinct from any established DOM. The Wikipedia entry on DOM provides several examples of existing DOM's. Why reinvent?

The application is -- ultimately -- going to be in Python. There are two candidate DOM's that could have been used: the XML DOM and the RST DOM as implemented in Docutils nodes module. Instead, they were reinventing: they appear to have spent a great deal of time writing use cases for "editor". I expect there was a use case for "wheel" and "fire" in there also.

What scared me was the "flatness" of the model. Every buzzword had its own class. There was no inheritance or reuse anywhere in the diagram. Parts of the model were influenced by the DocBook schemas. The actual DTD could have been turned into the model, but wasn't.

Further, undefinable terms like "sentence" showed up as class definitions. XML's DOM treats all text as -- well -- text. Any language structure is outside the DOM. RST, similarly, treats text as a container "... children are all `Text` or `Inline` subclass nodes."

All I could suggest was "locate common superclasses" and "until you can define 'sentence', omit it". And then run outside and gouge out my eyes.

It's hard to criticize something like that in a truly helpful manner. Fixing the model is merely putting lipstick on a pig.

As far as I can tell, the application is -- somehow -- an editor that imposes a severe set of structural constraints on what the author can do. It's as if RST, docutils and Sphinx don't exist. The real solution isn't "fix your object model" it's "fix your problem statement and learn RST."



Post Script

Check out this reply:
The advantage of having an "outliner data model" and a "document data model" like DocBook XML is that your outliner functionality is not limited by the DocBook XML. The downside is that you have to create and support a second model as well as provide a mapping between the two.
In other words, rather than simplify, we'll (1) insist the eye-gougingly bad model is "better", (2) justify the complexity ("not limited by DocBook [DOM]"), and (3) plan to add some more complexity to map an overly complex (and atypical) DOM to a standard DOM.

Not a very parsimonious approach to design.

Wednesday, June 24, 2009

Semantic Markup -- RST vs. XML

I have very mixed feelings about XML's usability.

An avowed goal of the inventors of XML was "XML documents should be human-legible and reasonably clear." While I like to think that "legible" means usable, I'm feeling that legibility is really a minimal standard; I think it's a polite way of saying "viewable with any text editor."

I've got some content (my Building Skills books) that I've edited with a number of tools. As I've changed tools, I've come to really understand what semantic markup means.

Once Upon A Time

When I started -- back in '00 or '01 -- I was taking notes on Python using BBEdit and other text-editor tools. That doesn't really count.

The first drafts of the Python book were written using AppleWorks; the predecessor to Apple's iWork Pages product. Any Mac text editor is a joy to use. Except, of course, that AppleWorks semantic markup wasn't the easiest thing to use. It was little more than the visual styles with meaningful names.

Then I converted the whole thing to XML.

DocBook Semantic Markup

The DocBook XML-based markup seemed to be the best choice for what I was doing. It was reasonably technically focused, and provided a degree of structure and formality.

To convert from AppleWorks, I exported the entire thing as text and then used the LEO Outlining Editor to painstakingly -- manually -- rework it into XML.

At this point, the XML tags were a visible part of the document, and editing the document means touching the tags. Not the easiest thing to do.

I switched to XMLmind's XXE. This was nice -- in a way. I didn't have to see the XML tags, but I was heavily constrained by the clunky way they handle the XML document structure. Double-clicking a word can lead to ambiguity on which level of tag you wanted to talk about.

The XML was "invisible" but the many-layered hierarchical structure was very much in my face.

RST Semantic Markup

After becoming a heavy user of Sphinx, I realized that I might be able to simplify my life by switching from XML to RST.

There are a number of gains when moving to RST.
  1. The document is simpler. It's approximately plain text, with a number of simple constraints.
  2. Editing is easier because the markup is both explicit and simple.
  3. The tooling is simpler. Sphinx pretty much does what I want with respect to publication.
There is just one big loss: semantic markup. DocBook documents are full of <acronym>TLA</acronym> to provide some meaningful classification behind the various words. It's relatively easy to replace these with RST's Interpreted Text Roles. The revised markup is :acronym:`TLA`.

The smaller, less relevant loss, is the inability to nest inline markup. I used nested markup to provide detailed <function><parameter>a</parameter></function> kind of descriptions. I think :code:`function(x)` is just as meaningful when it comes to analyzing and manipulating the XML with automated tools.

The Complete Set of Roles

I haven't finished the XML -> Sphinx transformation. However, I do have a list of roles that I'm working with.

Here's the list of literal conversions. Some of these have obvious Sphinx/RST replacements. Some don't. I haven't defined CSS markup styles for all of these -- but I could. Instead, I used the existing roles for presentation.

.. role:: parameter(literal)
.. role:: replaceable(literal)
.. role:: function(literal)
.. role:: exceptionname(literal)
.. role:: classname(literal)
.. role:: methodname(literal)
.. role:: varname(literal)
.. role:: envar(literal)
.. role:: filename(literal)
.. role:: code(literal)

.. role:: prompt(literal)
.. role:: userinput(literal)
.. role:: computeroutput(literal)

.. role:: guimenu(strong)
.. role:: guisubmenu(strong)
.. role:: guimenuitem(strong)
.. role:: guibutton(strong)
.. role:: guilabel(strong)
.. role:: keycap(strong)

.. role:: application(strong)
.. role:: command(strong)
.. role:: productname(strong)

.. role:: firstterm(emphasis)
.. role:: foreignphrase(emphasis)
.. role:: attribution
.. role:: abbrev

The next big step is to handle roles that are more than a simple style difference. My benchmark is the :trademark: role.

Adding A Role

Here's what you do to add semantic markup role to your document processing tool stack.

First, write a small module to define the role.

Second, update Sphinx's conf.py to name your module. It goes in the extensions list.

Here's my module to define the trademark role.

import docutils.nodes
from docutils.parsers.rst import roles

def trademark_role(role, rawtext, text, lineno, inliner,
                   options={}, content=[]):
    """Build text followed by inline substitution '|trade|'
    """
    roles.set_classes(options)
    word= docutils.nodes.Text( text, rawtext )
    symbol= docutils.nodes.substitution_reference( '|trade|', 'trade', refname='trade' )
    return [word,symbol], []

def setup( app ):
    app.add_role( "trademark", trademark_role )

Here's the tweak I made to my conf.py

import sys, os
project=os.path.join( "")
sys.path.append("/Users/slott/Documents/Writing/NonProg2.5/source")
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.ifconfig', 'docbook_roles' ]

That's it. Now I have semantic markup that produces additional text (in this case the TM symbol). I don't think there are too many more examples like this. I'm still weeks away from finishing the conversion (and validating all the code samples again.)

But I think I've preserved the semantic content of my document in a simpler, easier to use set of tools.

Wednesday, May 27, 2009

Semantic Markup with Docutils Interpreted Text Roles

A resume is a slippery thing -- a package of semi-structured data.

It has a kind of database-like feel to it, but there are so many exceptions and special cases that the database never works out quite the way you wanted.

For example, I've got -- essentially -- one employer over the past 30+ years.  But I've been on hundreds of projects for almost 100 different clients.  Since projects overlap, there's no tidy timeline.  The database has a token "Employer" table, a "Client" table, a "Project", which is an association between "Client" and "Employer".  For each "Project" I can have a number of roles or positions.  Most importantly, each project has a large number of hardware, software, skill, language and other "features" to it.

Relax

A more relaxed model is some kind of markup so that keywords can be identified semantically and culled out to create tag clouds or indices.

The usual culprit for mixed-content models like this is XML.  We would define a DTD or XSD with our tags in a new namespace.  Sadly, this also means that I have to rewrite my resume into XML.  Not that bad, but still...

Can we do similarly detailed semantic markup in RST?

What Role Does These Words Play?

RST offers a flexible mechanism they called Interpreted Text Roles.  There are two parts to getting started with this.

1.  Name the role in a .. role:: name directive.
2.  Markup your content with :name:`words`.  

By default, the role name is the class name that will be put into the HTML <span> tag when the document is written in HTML.  If you want, you can supply special formatting in addition to marking the words with a role.

You can do considerably more with interpreted roles, but we'll look at creating a tag cloud.

Gathering Data

The gathering part is easy.  You can snarf out the interpreted text roles with a simple visitor-based design.

import sys
from collections import defaultdict
from docutils.core import publish_doctree
from docutils.nodes import SparseNodeVisitor

class RoleVisitor( SparseNodeVisitor ):
    def __init__( self, role="skill", *args, **kw ):
        SparseNodeVisitor.__init__( self, *args, **kw )
        self.role= role
        self.cloud = defaultdict(int)
    def visit_inline( self, aNode ):
        if self.role in aNode['classes']:
            self.cloud[ aNode.astext() ] += 1


This visitor will accumulate a map with tag and frequency for a given role.

We can parse the RST resume file and accumulate the tag cloud statistics as follows.

def tagFreq( aFile ):
    source= aFile.read()
    structure= publish_doctree( source )

    skills= RoleVisitor( "skill", structure)

    structure.walkabout(skills)
    return skills.cloud

Once we have the data we can emit a tag cloud.

Frequency to Font Size

Converting frequencies to font sizes is a little alignment exercise.   A clever page designer might have clever style names based on the tag frequency.  I decided to name the styles after the font-sizes, since that seems simple.

def sizeMap( cloud ):
    """Many common tags piled into xx-large."""
    size_name = [ 'xx-small', 'x-small', 'small', 'medium', 'large',
         'x-large', 'xx-large' ]
    freq=list(set(cloud.values()))
    offset = max( 0, (len(size_name)-len(freq))//2 )
    size_map= {}
    for sz, f in enumerate(sorted(freq)):
        size_map[f]= size_name[sz+offset] if sz+offset < len(size_name) else size_name[-1]
    #print size
    return size_name, size_map

This assigns all the words that occur just once to the smallest font.  There are usually a large number of tags that occur just once.  A few tags will have a large number of occurrences; these will all wind up with 'xx-large' as their class.

Emitting The Cloud

Writing the tag cloud (in RST) looks like this.

def rst( names, sizes, cloud, destination ):
    sys.stdout= destination
    for s in names:
        print "..  role::", s # The formatting roles that match our CSS.
    print "\n----------\n"
    for k in sorted(cloud):
        print ':%s:`%s`' % ( sizes[cloud[k]], k, )

We can then tack this cloud onto the end of the resume to get a summary of skills, frameworks, OS's, languages and the like.

Style Points

The docutils section on overriding the style sheet suggests we include something like the following in the working directory.

resume.css

@import url(html4css1.css);

span.xx-small { font-size:0.65em; font-family:sans-serif }
span.x-small { font-size:0.7em; font-family:sans-serif }
span.small { font-size:0.85em; font-family:sans-serif }
span.medium { font-size:1em; font-family:sans-serif }
span.large { font-size:1.3em; font-family:sans-serif }
span.x-large { font-size:1.6em; font-family:sans-serif }
span.xx-large { font-size:1.9em; font-family:sans-serif }

We include this with the following command: rst2html.py --stylesheet-path=resume.css

Workflow

This makes it much more pleasant to edit my resume.  

1.  Make the changes.
2.  Run the tag-cloud script.
3.  Run rst2html. 

Now I just have to remember to do it more often than once every five years.

Monday, May 25, 2009

ReStructured Text markup and Content Management

I can't say enough good things about ReStructuredText (RST).  I've used all of the available markup languages (SGML, HTML and XML).  They have their place, but they all fall short of being truly usable.

In "This sounds complicated, because it is" I reviewed some of my history of cheap content management.

In looking at content of all kinds, I'm finding that RST is much, much easier to work with than SGML, HTML or XML.  In short, I think that RST makes the file system into a really good content management system (CMS).  Unstructured content is a big win.  Structured content is a "don't care".  But there's a middle ground of semi-structured content that requires sophisticated semantic markup.

SGML At The Dawn Of Time

When the web started its ascent (back in the 90's), I was lucky.  I had already been working with folks that did military contracting, and folks there had introduced me to SGML.  When I moved from SGML to HTML, I saw it as a pleasant simplification because it had a more-or-less fixed DTD.

My first personal web pages were lovingly hand-crafted HTML masterpieces.  (Okay, they were lovingly hand-crafted.)   There was  a lot of work involved in markup, cross-references, and presentation. 

HTML via a Class Hierarchy

My first templating was via proper Python classes.  I created class hierarchies that embodied the page template and filled in required data.  The heart of each class was an emit method that wrote the final HTML.

Variant page layouts and special cases were easily handled by Python simple inheritance.  
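
Reconstructed from memory, the shape was roughly this; the names are illustrative, not the original code.

class Page(object):
    """One page of the site; subclasses override the pieces that vary."""
    def __init__(self, title, body):
        self.title = title
        self.body = body
    def head(self):
        return "<head><title>{0}</title></head>".format(self.title)
    def content(self):
        return "<div>{0}</div>".format(self.body)
    def emit(self, target):
        target.write("<html>{0}<body>{1}</body></html>".format(
            self.head(), self.content()))

class PhotoPage(Page):
    """A variant layout: override only the content block."""
    def __init__(self, title, body, image):
        Page.__init__(self, title, body)
        self.image = image
    def content(self):
        return '<img src="{0}"/>'.format(self.image) + Page.content(self)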

Of course, the big problem is that HTML is just representation.  There's often some bleed-through between the problem domain model and the HTML representation of that underlying model.  You don't want your problem domain objects to encode any HTML.  You can have a generic Tag class, but the Page class is specific to your problem domain.

The Python class structure is nice, but it's only suitable for structured content management.  When you have semi-structured and unstructured data -- the strong suit of HTML -- you find the class hierarchy to be too rigid.

Some time in the early 00's, I discovered Cheetah.

HTML via Templates

Cheetah (and template engines like Mako, Jinja, and numerous others) did what I wanted.  A base template was -- effectively -- a superclass.  Each block in that template could be overridden by a subclass.

The content, then, becomes a relatively simple template file that extends a page layout.  You can handle unstructured and semi-structured content very nicely.  I changed my ways of working with HTML to leverage this elegant, extensible view of the world.  I redid my personal web site: the content became a collection of Cheetah templates that contained all the content.

Note that I've *added* a markup language.  In addition to HTML, I also have some Cheetah markup on each page.  While this got me consistency and flexibility (and a reduction in the volume of stuff on each page) it did make things slightly more complex.

Look at http://cadesignquilts.com/ for another example of an all-Cheetah static site.  I did several sites like this.  The workflow involved (1) designing the overall page, (2) getting the data into a usable form, (3) generating the page-level template files, and (4) running Cheetah to emit HTML from the templates.  All static content.  Runs like lightning.

The JSP Distraction

Eventually, I started doing development with Struts, which depends heavily on JSP.  You have HTML commingled with Java code.  Plus, you've got custom actions via a tag library to extend JSP processing.  You can create page-level templates with a reasonably smart JSP tag library.

This template solution doesn't work well for unstructured or semi-structured data.  It's a pure programming solution.

DocBook XML and Semantic Markup

I wrote Building Skills in Python entirely in Appleworks.  That was pretty well unmaintainable and unpublishable in that form.

I converted the text to DocBook XML.  I used the Leo outliner to manage the document as a whole.  I wrote my own publishing workflow to transform the XML to HTML and PDF.   It worked reasonably well.

More important, using DocBook reinforced the importance of semantic markup.  It took me back to my SGML days.  It also showed why font and style tags and other HTML presentation things have to be moved out of the document and into the stylesheet.

This was a very nice way to handle the semi-structured and unstructured content in a book.  Direct use of XML is a pain in the neck.  XML has a lot of syntax.  It's much nicer to do your thinking with something lighter weight.  

ReStructured Text (RST) for Unstructured Content

Somewhere in the late 00's, I found Python's docutils and RST.  I can't figure out when I started -- precisely -- but using RST as part of content management didn't fully click at first.

After reworking my personal site, which includes a lot of really unstructured ("random" might be a better word) content, I'm seeing the value in RST + Filesystem as a CMS.  I think the Sphinx folks are right.  If you have a simple markup system and all the filesystem tools that have evolved over the past few decades, you're covered.

Further, on larger projects, I've found that I can pop out a nice template documentation tree with a simple .. toctree:: directive on the index.rst page and generate a tidy, complete documentation package without much pain.

Structured Content

For structured data, you have ordinary classes and programs.  You have SQL databases, ORM to map to classes; all of that technology.  It's easy to write applications that emit RST which you can then publish.  

Most structured content can be boiled down to tables and charts.  The .. csv-table:: directive makes it easy to have an application emit data that you fold into a more elegant-looking report.
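
A few lines of Python are enough to turn a query result into a directive the rest of the document can absorb. A sketch; the function name and sample data are mine.

import sys

def emit_csv_table(rows, header, title, target=sys.stdout):
    """Write rows of structured data as an RST .. csv-table:: directive."""
    target.write(".. csv-table:: {0}\n".format(title))
    target.write("    :header: {0}\n\n".format(
        ", ".join('"{0}"'.format(h) for h in header)))
    for row in rows:
        target.write("    {0}\n".format(", ".join('"{0}"'.format(c) for c in row)))

emit_csv_table([("2009", 12), ("2010", 7)], ["Year", "Posts"], "Posts per year")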

The Nuance -- Semi-Structured Data

My worst-case scenarios are my résumés: sailing, programming and writing.  The data has deep semantic meaning:  it isn't just words.  On the other hand, the data has lots of special-cases and exceptions: it isn't totally amenable to a database.

The absolute best part of docutils is that the parser's output is available for processing.  You can -- easily -- add directives and text roles to create semantic meaning.

I experimented with XML and YAML for my résumés.  The XML is cumbersome.  The YAML requires a fairly sophisticated class model to make use of the information.  

RST with a few text roles, however, rocks.  The .. role:: directive makes it easy to throw roles into a document for later use by applications.