Wednesday, May 27, 2009

Semantic Markup with Docutils Interpreted Text Roles

A resume is a slippery thing -- a package of semi-structured data.

It has a kind of database-like feel to it, but there are so many exceptions and special cases that the database never works out quite the way you wanted.

For example, I've got -- essentially -- one employer over the past 30+ years.  But I've been on hundreds of projects for almost 100 different clients.  Since projects overlap, there's no tidy timeline.  The database has a token "Employer" table, a "Client" table, a "Project", which is an association between "Client" and "Employer".  For each "Project" I can have a number of roles or positions.  Most importantly, each project has a large number of hardware, software, skill, language and other "features" to it.


A more relaxed model is some kind of markup so that keywords can be identified semantically and culled out to create tag clouds or indices.

The usual culprit for mixed-content models like this is XML.  We would define a DTD or XSD with our tags in a new namespace.  Sadly, this also means that I have to rewrite my resume into XML.  Not that bad, but still...

Can we do similarly detailed semantic markup in RST?

What Role Does These Words Play?

RST offers a flexible mechanism they called Interpreted Text Roles.  There are two parts to getting started with this.

1.  Name the role in a .. role:: name directive.
2.  Markup your content with :name:`words`.  

By default, the role name is the class name that will be put into the HTML <span> tag when the document is written in HTML.  If you want, you can supply special formatting in addition to marking the words with a role.

You can do considerably more with interpreted roles, but we'll look at creating a tag cloud.

Gathering Data

The gathering part is easy.  You can snarf out the interpreted text roles with a simple visitor-based design.

import sys
from collections import defaultdict
from docutils.core import publish_doctree
from docutils.nodes import SparseNodeVisitor

class RoleVisitor( SparseNodeVisitor ):
    def __init__( self, role="skill", *args, **kw ):
        SparseNodeVisitor.__init__( self, *args, **kw )
        self.role= role = defaultdict(int)
    def visit_inline( self, aNode ):
        if self.role in aNode['classes']:
  [ aNode.astext() ] += 1

This visitor will accumulate a map with tag and frequency for a given role.

We can parse the RST resume file and accumulate the tag cloud statistics as follows.

def tagFreq( aFile ):
    structure= publish_doctree( source )

    skills= RoleVisitor( "skill", structure)


Once we have the data we can emit a tag cloud.

Frequency to Font Size

Converting frequencies to font sizes is a little alignment exercise.   A clever page designer might have clever style names based on the tag frequency.  I decided to name the styles after the font-sizes, since that seems simple.

def sizeMap( cloud ):
    """Many common tags piled into xx-large."""
    size_name = [ 'xx-small', 'x-small', 'small', 'medium', 'large',
         'x-large', 'xx-large' ]
    offset = max( 0, (len(size_name)-len(freq))//2 )
    size_map= {}
    for sz, f in enumerate(sorted(freq)):
        size_map[f]= size_name[sz+offset] if sz <>
    #print size
    return size_name, size_map

This assigns all the words that occur just once to the smallest font.  There are usually a large number of tags that occur just once.  A few tags will have a large number of occurrences; these will all wind up with 'xx-large' as their class.

Emitting The Cloud

Writing the tag cloud (in RST) looks this this.

def rst( names, sizes, cloud, destination ):
    sys.stdout= destination
    for s in names:
        print "..  role::", s # The formatting roles that match our CSS.
    print "\n----------\n"
    for k in sorted(cloud):
        print ':%s:`%s`' % ( sizes[cloud[k]], k, )

We can then tack this cloud onto the end of the resume to get a summary of skills, frameworks, OS's, languages and the like.

Style Points

The docutils section on overriding the style sheet suggests we include something like the following in the working directory.


@import url(html4css1.css);

span.xx-small { font-size:0.65em; font-family:sans-serif }
span.x-small { font-size:0.7em; font-family:sans-serif }
span.small { font-size:0.85em; font-family:sans-serif }
span.medium { font-size:1em; font-family:sans-serif }
span.large { font-size:1.3em; font-family:sans-serif }
span.x-large { font-size:1.6em; font-family:sans-serif }
span.xx-large { font-size:1.9em; font-family:sans-serif }

We include this with the following command: --stylesheet-path=resume.css


This makes it much more pleasant to edit my resume.  

1.  Make the changes.
2.  Run the tag-cloud script.
3.  Run rst2html. 

Now I just have to remember to do it more often than once every five years.