Tuesday, January 31, 2012

Enriched Details of Apache Common Log Format

See Apache Log Parsing for the background.

Here's a generator function which expands a simple Access into a more detailed named tuple.


import datetime
import urllib.parse
from collections import namedtuple

Access_Details= namedtuple( 'Access_Details',
    ['access', 'time', 'method', 'url', 'protocol'] )
            
def details_iter( iterable ):
    def parse_time_linux( ts ):
        return datetime.datetime.strptime( ts, "%d/%b/%Y:%H:%M:%S %z" )

    def parse_time_macos( ts ):
        dt= datetime.datetime.strptime( ts[:-6], "%d/%b/%Y:%H:%M:%S" )
        tz_text= ts[-5:] # e.g. "-0700"; ts[:-6] also dropped the separating space
        sign, hh, mm = tz_text[:1], int(tz_text[1:3]), int(tz_text[3:])
        minutes= (hh*60+mm) * (-1 if sign == '-' else +1)
        offset = datetime.timedelta(minutes = minutes)
        tz= datetime.timezone( offset, tz_text )
        return dt.replace(tzinfo=tz)

    first, last = None, None
    for access in iterable:
        meth, uri, protocol = access.request.split()
        dt= parse_time_macos( access.time )
        first= min(dt,first) if first else dt
        last= max(dt,last) if last else dt
        yield Access_Details(
            access= access,
            time= dt,
            method= meth,
            url= urllib.parse.urlparse(uri),
            protocol= protocol )

    print( "Log Data from", first, "to", last, 'duration', last-first )


This "wraps" the original Access object with an Access_Details that includes information that isn't trivially parsed from the access row.
  • The datetime object with the real timestamp.  Note the Mac OS subtlety: due to platform issues, the %z strptime format doesn't seem to work in Python 3.2.
  • The three fields from the request: method, URL and protocol.  
  • The URL is parsed into its individual fields.
Note that the Access_Details object is pickle-able.  While seemingly irrelevant, it turns out that having something which can be pickled means that we can use multiprocessing to create a multi-staged concurrent pipeline of log analysis.
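Here's a quick check of that pickle claim, with simplified stand-in values for the fields (the real fields hold an Access tuple, a datetime and a urllib.parse result, all of which also pickle):

```python
import pickle
from collections import namedtuple

Access_Details = namedtuple('Access_Details',
    ['access', 'time', 'method', 'url', 'protocol'])

# Stand-in values for illustration only.
detail = Access_Details(access=None, time=None, method='GET',
                        url='/index.html', protocol='HTTP/1.1')

# multiprocessing moves objects between processes by pickling them;
# a named tuple of picklable fields survives the round trip unchanged.
copy = pickle.loads(pickle.dumps(detail))
print(copy == detail)
```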

What's important here is that we're adding functionality without redefining the underlying Access class. Indeed, the underlying Access object is immutable.  The idea of stateless values comes from the functional programming crowd.  It seems to work out really well because the functionality seems to accrete in relatively simple layers.
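The layering idea can be boiled down to a tiny sketch (these are simplified stand-ins, not the full classes above): the original tuple is never modified, it's just carried along inside the wrapper.

```python
from collections import namedtuple

# Simplified stand-ins for the Access and Access_Details layers.
Access = namedtuple('Access', ['time', 'request'])
Access_Details = namedtuple('Access_Details',
    ['access', 'method', 'url', 'protocol'])

def details_iter(iterable):
    # The wrapping layer: Access stays immutable; the new layer
    # carries the derived fields alongside the original tuple.
    for access in iterable:
        method, url, protocol = access.request.split()
        yield Access_Details(access, method, url, protocol)

raw = [Access('10/Oct/2000:13:55:36 -0700', 'GET /index.html HTTP/1.1')]
for d in details_iter(raw):
    print(d.method, d.url, d.access.request)
```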

Thursday, January 26, 2012

Apache Log Parsing

How much do I love Python?  Consider this little snippet that parses Apache logs.


import re
from collections import defaultdict, namedtuple

format_pat= re.compile( 
    r"(?P<host>[\d\.]+)\s" 
    r"(?P<identity>\S*)\s" 
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]\s"
    r'"(?P<request>.*?)"\s'
    r"(?P<status>\d+)\s"
    r"(?P<bytes>\S*)\s"
    r'"(?P<referer>.*?)"\s' # [SIC]
    r'"(?P<user_agent>.*?)"\s*' 
)

Access = namedtuple('Access',
    ['host', 'identity', 'user', 'time', 'request',
    'status', 'bytes', 'referer', 'user_agent'] )

def access_iter( source_iter ):
    for log in source_iter:
        for line in (l.rstrip() for l in log):
            match= format_pat.match(line)
            if match:
                yield Access( **match.groupdict() )


That's about it.  The access log rows are now first-class Access-class objects that can be processed pleasantly by high-level Python applications.

Cool things.
  1. The adjacent string concatenation means that the regular expression can be broken up into bits to make it readable.
  2. When the named tuple attributes match the regular expression names, we can trivially turn the match.groupdict() into a named tuple. 
  3. By using a generator, the other parts of the application can simply loop through the results without tying up memory to create vast intermediate structures.
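To see the whole thing work end-to-end, here's the snippet exercised against a single made-up log line (the log line itself is invented for the example):

```python
import re
from collections import namedtuple

# The pattern and named tuple from the snippet above.
format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s"
    r"(?P<identity>\S*)\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]\s"
    r'"(?P<request>.*?)"\s'
    r"(?P<status>\d+)\s"
    r"(?P<bytes>\S*)\s"
    r'"(?P<referer>.*?)"\s'
    r'"(?P<user_agent>.*?)"\s*'
)

Access = namedtuple('Access',
    ['host', 'identity', 'user', 'time', 'request',
     'status', 'bytes', 'referer', 'user_agent'])

def access_iter(source_iter):
    for log in source_iter:
        for line in (l.rstrip() for l in log):
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())

# A made-up line in Common Log Format.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://example.com/start.html" "Mozilla/4.08"\n')
accesses = list(access_iter([[sample]]))
print(accesses[0].host, accesses[0].status)
```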
A couple of years back, a sysadmin was trying to justify spending money on a log analyzer product.  I suggested they (at the very least) get an open source log analyzer.

I also suggested that they learn Python and save themselves the pain of working with a (potentially) complex tool.  Given this as a common library module, log analysis applications are remarkably easy to write.


Tuesday, January 24, 2012

Building Skills in Programming

I've revised (and streamlined) my Building Skills in Programming book.

The 2.6.2 edition will simply replace the 2.6.1 edition, leading to the possibility of broken bookmarks because of the changes.

Currently, the non-programmer book accounts for under 10% of the hits on the http://www.itmaybeaback.com/book site.  Consequently, I'm not very worried about the breakage.  I know someone will hate me for messing with the content just as they were starting to understand it.

I'm indebted to all my readers for the numerous suggestions, corrections and compliments that I've received.  In order to simplify the correction process, I've put the source onto SourceForge.  See the Programming Book-Python 2.6 project.

The next step will be to add a PayPal donations button.

And... If I can get the PDF into really good shape, I may post it on Lulu for folks who really want a hardcopy.

Traffic

For a random 2-day period, the usage looks like this:

246 distinct "users" (really IP addresses).


{'html only': 161,
 ('both', 'oodesign-java-2.1'): 9,
 ('both', 'oodesign-python-2.1'): 5,
 ('both', 'programming-2.6'): 3,
 ('both', 'python-2.6'): 7,
 ('pdf only', 'oodesign-java-2.1'): 21,
 ('pdf only', 'oodesign-python-2.1'): 14,
 ('pdf only', 'programming-2.6'): 11,
 ('pdf only', 'python-2.6'): 37}


The "both" is a count of users reading HTML as well as the identified PDF editions.
For the "html only" and "both" users, there's a detailed list of particular books and sections.  Too large and boring to repeat here.

One interesting part is this detail:

82.128.23.63 {'programming-2.6': 37, 'oodesign-java-2.1': 28, 'python-2.6': 37, 'oodesign-python-2.1': 24} 

Apparently, Nigeria needs a lot of copies of the PDF.  I think I might want to block them, because this can't be anything sensible except endless polling by some botnet script.

Since we don't drop off cookies, we can't really identify user sessions.  Maybe in the future, I'll wrap the static content download with a simple WSGI application to drop off and collect cookies to track users instead of IP addresses.
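A minimal sketch of what that WSGI wrapper might look like — all the names here are invented, and real session tracking would also need to persist the cookie values somewhere:

```python
import uuid

def cookie_tracker(app):
    # Hypothetical wrapper: hand each new visitor a session cookie,
    # then delegate to the wrapped static-content application.
    def tracker(environ, start_response):
        new_session = 'session=' not in environ.get('HTTP_COOKIE', '')
        session = uuid.uuid4().hex
        def _start(status, headers, exc_info=None):
            if new_session:
                headers = list(headers) + [
                    ('Set-Cookie', 'session=' + session)]
            return start_response(status, headers, exc_info)
        return app(environ, _start)
    return tracker

def static_app(environ, start_response):
    # Stand-in for the static book content.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']

tracked = cookie_tracker(static_app)
```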

Thursday, January 19, 2012

Python 2.7 CSV files with Unicode Characters

The csv module in Python 2.7 is more-or-less hard-wired to work with ASCII and only ASCII.

Sadly, we're often confronted with CSV files that include Unicode characters.  There are numerous Stack Overflow questions on this topic.  http://stackoverflow.com/search?q=python+csv+unicode

What to do?  Since csv is married to seeing ASCII/bytes, we must explicitly decode the column values.

One solution is to wrap csv.DictReader, something like the following.  We need to decode each individual column before attempting to do anything with the value.

import csv
import codecs

class UnicodeDictReader( object ):
    def __init__( self, *args, **kw ):
        # Pop our extra keyword argument before delegating to DictReader.
        self.encoding= kw.pop('encoding', 'mac_roman')
        self.reader= csv.DictReader( *args, **kw )
    def __iter__( self ):
        decode= codecs.getdecoder( self.encoding )
        for row in self.reader:
            # decode() returns a (value, length_consumed) pair.
            t= dict( (k,decode(row[k])[0]) for k in row )
            yield t

This new object is an iterable which contains a DictReader. We could subclass DictReader, also.

The use case, then, becomes something simple like this.
with open("some.csv","rU") as source:
    rdr= UnicodeDictReader( source )
    for row in rdr:
        pass # process the row

We can now get Unicode characters from a CSV file.

Tuesday, January 17, 2012

Python 3.2 CSV Module -- Very, very nice

A common (and small) task is reformatting a file that's in some variant of CSV.  It could be a SQL database extract, or an export from an application that works well with CSV files.

In Python 2.x, a CSV file with Unicode was a bit of a problem.  The CSV module isn't happy with Unicode.  The documentation is quite clear that many files need to be opened with a mode of 'rb' to correctly handle Windows line-endings.

Because of this, a CSV file with Unicode required using an explicit decoder on the individual columns (not the line as a whole!)

But with Python 3.2, that's all behind us.

Here's something I did recently.  The file has six columns that are relevant.  One of them (the "NOTE") column has a big block of text with details buried inside using a kind of RST markup.  The data might be three lines with a value like this "words words\n:budget: 1500\nwords words".

The file is UTF-8, and the words have non-ASCII unicode characters randomly through it.

 
import re
import csv

def details( source ):
    relevant = ( "TASK", "FOLDER", "CONTEXT", "PRIORITY", "STAR", )
    parse= "NOTE"
    data_pat= re.compile( r"^:(\w+):\s*(.*)\s*$" )
    rdr= csv.DictReader( source )
    for row in rdr:
        txt= row[parse]
        lines= ( data_pat.match(l) for l in txt.splitlines() )
        matches= ( m.groups() for m in lines if m )
        result= dict( (k, row[k]) for k in relevant) 
        result.update( dict(matches) )
        yield result

How much do I love Python? Let me count the ways.
  1. The assignment of lines was fun.  The "NOTE" column, in row[parse], contains the extra fields.  They'll be on a separate line with the :word:value format as shown in the data_pat pattern.  We create a generator which will split the text field into lines and apply the pattern to each line.
  2. The assignment to matches was equally fun.  If the lines generator produced a match object, the matches generator will gather the two groups from the line.
  3. The assignment to result creates a dictionary from the relevant columns.  
  4. The second assignment to result updates this dictionary with data parsed out of the "NOTE" column.
That makes it quite pleasant (and fast) to process an extract file, reformatting a "big blob of text" into individual columns.

The rest of the app boils down to this.


import io
import sys
import csv

def rewrite( input, target=sys.stdout ):
    with io.open(input, 'r', encoding='UTF-8') as source:
        data= list( details( source ) )
    headers= set( k for row in data for k in row  )
    wtr= csv.DictWriter( target, sorted(headers) )
    wtr.writeheader( )
    wtr.writerows( data )

This gathers the raw data into a big old sequence in memory, and then writes that big old sequence back out to a file.  If we knew the headers buried in the "NOTE" field, we could do the entire thing in a single pass just using generators.
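If the extra header names were known up front, that single-pass version might look like the following sketch.  The header names here are invented; the real :word: keys depend on the data.

```python
import csv
import io
import sys

# Invented header names for illustration.
KNOWN_EXTRAS = ['budget', 'status']
RELEVANT = ['TASK', 'FOLDER', 'CONTEXT', 'PRIORITY', 'STAR']

def rewrite_single_pass(row_iter, target=sys.stdout):
    # With the headers known up front, each row streams straight
    # from the details() generator to the writer; nothing accumulates
    # in memory.  Missing keys are filled with DictWriter's restval ('').
    wtr = csv.DictWriter(target, sorted(RELEVANT + KNOWN_EXTRAS))
    wtr.writeheader()
    for row in row_iter:
        wtr.writerow(row)
```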

We have to explicitly provide the encoding because the file was created via a download and the encoding isn't properly set on the client machine.  The important thing is that we can do this when it's necessary.  And we no longer have to explicitly decode fields.

Since we don't know the headers in the "NOTE" field, we're forced to create the headers set by examining each row dictionary for its keys.

Thursday, January 12, 2012

Multiprocessing and Shared Objects [Revised]

Read this: Shared Counter with Python Multiprocessing.

Brilliant.  Thank you for this.

Too many of the questions on StackOverflow that involve multithreading are better approached as multiprocessing.  In Linux, there are times when all threads of a single process are stopped while the process (as a whole) waits for system services to complete.  It's a consequence of the way select and poll work.  An example of the kind of sophisticated design required to avoid this can be found here.  Most I/O-intensive applications should be done via multiprocessing, not multithreading.

And.  The kind of shared objects that multithreading allows are rarely needed and always require locks.

So, simplify your life.  When you hear about "threads", replace the word with "processes" and move on.  The implementation will be much nicer.

The standard gripe is that process creation is so expensive, and thread creation is relatively cheap.  All true.  That's why folks use process pools: to amortize the creation cost over a long period of operation.
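A minimal sketch of the pool idea (the worker function here is a made-up stand-in for real analysis work):

```python
from multiprocessing import Pool

def analyze(n):
    # Stand-in for the per-item analysis a pipeline stage would do.
    return n * n

if __name__ == '__main__':
    # The worker processes are created once, then reused for every
    # piece of work: the creation cost is paid once, not per item.
    pool = Pool(processes=4)
    try:
        print(pool.map(analyze, range(10)))
    finally:
        pool.close()
        pool.join()
```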

Tuesday, January 10, 2012

Innovation is the punctuation at the end of a string of failures

Read this in Forbes: "Innovation's Return on Failure: ROF".

Also, this: "The Necessity of Failure in Innovation (+ more on CDOs)".

This, too: "Why innovation efforts fail".

While we're at it: "Accepting Failure is Key to Good Overall Returns on High-Risk Development Programs".

I can't say enough about the value of "failure".  The big issue here is the label.

A project with a grand scope and strategic vision gets changed to make it smaller and more focused.  Did it "fail" to deliver on the original requirements?  Or did someone learn that the original grand scope was wrong?

A project that changes isn't failure.  It's just lessons learned.  Canceling, re-scoping, de-scoping, and otherwise modifying a project is what innovation looks like.  It should not be counted as a "failure".

A project "Death March" occurs because failure is not an option, and change will be labeled as failure.

Thursday, January 5, 2012

Self-Direction, Mastery, Purpose

Watch this: http://www.youtube.com/watch?v=u6XAPnuFjJc

Brilliant summary of what really motivates people.

The most important advice: provide a sense of purpose and get out of people's way so that they can do the right thing.

Micromanagement, incentives, annual performance reviews and the like aren't as useful as providing the sense of purpose, the opportunity for mastery and the freedom of self-direction.

Tuesday, January 3, 2012

Epic indictment of Waterfall Methods

Saw this recently.

I think this aptly summarizes the results of a waterfall methodology.
  1. You wrote a lot of requirements, not fully understanding the actors or their use cases.
  2. Your vendor implemented those requirements because they were contractual obligations not because they created value for the actors.
The government's not the only offender.  They're just more visible and more bound up in a legally-mandated purchasing cycle that makes the waterfall desirable and more Agile methods undesirable.