Rants on the daily grind of building software. This has been moved to https://slott56.github.io. Fix your bookmarks.
Tuesday, December 17, 2013
Apple's Feckless Download Protocol
When there's any disruption, Apple's download protocol simply discards the data it has and starts again.
How stupid. How blatantly and infuriatingly stupid.
If I pause a download, it will resume. If it breaks, it will not resume. WTF?
For some things, I can use BitTorrent, which tolerates noisy links. But for proper AppStore Apps, their protocol is the pits.
Anyone know anyone at Apple who's able to work on a solution to this?
Thursday, December 12, 2013
Secure Salted Password Hashing
https://crackstation.net/hashing-security.htm
This was really quite nice. It didn't have a Python version, but the clarity of the exposition makes the Python easy to write.
A few months back, I had this mystery conversation: http://slott-softwarearchitect.blogspot.com/2012/08/password-encryption-short-answer-dont.html.
While this is not going to produce identical results to the code shown in the blog post, it seems to fit the requirements.
from hashlib import sha256
import os

class Authentication:
    iterations= 1000
    def __init__( self, username, password ):
        """Works with bytes. Not Unicode strings."""
        self.username= username
        self.salt= os.urandom(24)
        self.hash= self._iter_hash( self.iterations, self.salt, username, password )
    @staticmethod
    def _iter_hash( iterations, salt, username, password ):
        seed= salt+b":"+username+b":"+password
        for i in range(iterations):
            seed= sha256( seed ).digest()
        return seed
    def __eq__( self, other ):
        return self.username == other.username and self.hash == other.hash
    def __hash__( self ):
        return hash(self.hash)
    def __repr__( self ):
        salt_x= "".join( "{0:x}".format(b) for b in self.salt )
        hash_x= "".join( "{0:x}".format(b) for b in self.hash )
        return "{username} {iterations:d}:{salt}:{hash}".format(
            username=self.username, iterations=self.iterations,
            salt=salt_x, hash=hash_x)
    def match( self, password ):
        test= self._iter_hash( self.iterations, self.salt, self.username, password )
        return self.hash == test # Constant Time is Best
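Using it is straightforward. A quick sanity check might look like this (my own example, not from the article):

auth = Authentication(b"slott", b"OldPassword")
print(auth.match(b"OldPassword"))   # True: same salt, same iteration count, same hash
print(auth.match(b"Guess"))         # False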
It may be helpful to use __slots__ with this to reduce the storage and make the object less mutable.
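A minimal sketch of what that might look like, assuming we only want the three attributes the class actually assigns:

class Authentication:
    __slots__ = ("username", "salt", "hash")
    iterations = 1000
    # ... the rest of the class is unchanged ...

With __slots__ in place there's no per-instance __dict__, so a typo like auth.hsh = ... raises AttributeError instead of silently creating a new attribute.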
Perhaps I didn't google well enough to find a clear explanation that also included Python code samples.
Tuesday, December 3, 2013
Python vs. R for Data Science
Recently, I've had a former colleague asking questions about Data Science. See Obstinate Idiocy.
They -- weirdly -- insisted that the only language that made sense to them was Excel.
My response was a blunt "What?"
The Python vs. R post cited above clarifies the reasons why a programming language is a better choice than a "tool" or "platform".
Tuesday, November 26, 2013
Mac OS X 10.9 and Python 3.3
The previously documented patch (http://slott-softwarearchitect.blogspot.com/2013/10/mac-os-x-109-mavericks-crashes-python.html) is no longer required.
Time to start incrementally installing all the various add-on components: docutils, PyYaml, Django, Jinja2, SQLAlchemy, etc.
Also, time to put more focus into rewriting various projects to finally cut the cord with Python2. At this point, there's no longer a reason to be looking backwards.
Tuesday, October 29, 2013
When to choose Python over Java and vice versa ??: A Very Silly Question
In spite of this.
(A) The question gets asked.
And worse.
(B) It gets answered. And people take their answers seriously. As if there are Profound Differences among programming languages.
Among Turing Complete programming languages there are few Profound Differences.
The pragmatic differences are the relative ease (or pain) of expressing specific algorithms or data structures.
This means that there's no easy, blanket, one-size-fits-all answer to such a silly question.
You can have some code (or data) which is painful in Java and less painful in Python.
But.
You can also find an extension library that makes it much, much less painful.
This, alone, makes the question largely moot.
When it comes to a specific project, the team's skills, the existing infrastructure, and any integration requirements are the driving considerations.
Because of this, an incumbent language has huge advantages.
If you've already got a dozen web sites in Java, there's no good reason to flip-flop between Java and Python.
If you're going to switch from some Java Framework to Django, however, you'd do this as part of a strategic commitment to drop Java and convert to Python.
To read the discussion, see LinkedIn Python Community. http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=5796513808700682243&qid=8612bee7-76e1-4a35-9c80-c163494191b0&trk=groups_most_popular-0-b-ttl&goback=%2Egmp_25827
Friday, October 25, 2013
Mac OS X 10.9 "Mavericks" Crashes Python -- Patch Available
Thursday, October 24, 2013
Required Reading for everyone who said "If it ain't broke, don't fix it."
Here's an important lesson.
Code Rot is Real. It Leads to Subtle and Expensive Bugs.
Claiming that code cleanup is just pointless "gold plating" is the kind of thing that can drive a company out of business.
Tuesday, October 15, 2013
Literate Programming: PyLit3
See https://github.com/slott56/PyLit-3
The code seems to pass all the unit tests.
The changes include Python3 revisions, plus a small change to handle trailing spaces in a slightly cleaner fashion. This was necessary because I have most of my editors set to remove trailing spaces from the files I create, and PyLit tended to create trailing spaces. This made the expected output from the unit tests not precisely match the actual output.
Thursday, October 3, 2013
Literate Programming and PyLit
Mostly, I followed the Web/Weave world view and cribbed their markup syntax. It's not bad, but the PyWeb markup is based on some presumptions about literate programming that were, perhaps, true with some languages, but are not true at all when working with Python.
- The source presentation order may be incomprehensible. To fix this, we create a literate programming document, and from that tangle the source into an order that's acceptable to the compiler, but perhaps hard to understand for people. We weave a document that's easy for people to understand.
- The source syntax may be incomprehensible. To fix this, we have fine grained substitution. The target source can be built at any level of syntax (token, line, or higher-level language construct.) We can assure that the woven document for people is written using elegant symbols even if the tangled source code uses technical gibberish.
- The woven documentation needs a lot of additional output markup. The original web/weave toolset created extensive TeX markup. Later tools reduced the markup to allow HTML or XML, minimizing the added markup in a woven document.
- Use six.py to make a single version that covers both Python2 and Python3.
- Rewrite PyLit for Python3 and move forward.
- Remove print statement and exec statements.
- Replace string formatting % with .format().
- Replace raise statements and except statements with Python3 (and Python2.7) syntax.
- Upgrade for dict method changes in Python3.
- Replace DefaultDict with collections.defaultdict.
- Replace optparse with argparse.
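None of the following is PyLit code; it's a toy line counter, purely to show the flavor of those edits. The comments name the Python2 idioms each line replaces.

import argparse
from collections import defaultdict

def count_lines(paths):
    """Count lines in each file, using Python3 idioms."""
    totals = defaultdict(int)                # was a home-grown DefaultDict
    for path in paths:
        try:
            with open(path) as source:
                totals[path] = sum(1 for line in source)
        except IOError as err:               # was "except IOError, err:"
            raise RuntimeError("cannot read {0}: {1}".format(path, err))  # was "raise ..., msg" and "%s" % path
    for path, count in totals.items():       # was .iteritems()
        print("{0}: {1:d} lines".format(path, count))  # was a print statement with % formatting
    return totals

if __name__ == "__main__":
    parser = argparse.ArgumentParser()       # was optparse.OptionParser()
    parser.add_argument("paths", nargs="+")
    args = parser.parse_args()
    count_lines(args.paths)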
Also, none of this addresses the files with names that differ only in case. There are two graphics files in the /trunk/rstdocs/logo/ path that differ only in the case of their letters. Bad, but acceptable for Linux. Fatal for Mac OS X with the default filesystem.
- Fork PyLit 0.7.5 to create PyLit3 release 1.0? A whole new project.
- Try to use six.py to create a 2-3 compatible source file and call this PyLit 0.8?
Tuesday, September 24, 2013
Introduction to Programming: iBook Edition for Python 3.2
I rewrote almost all of my Introduction to Programming book into an iBook. Trimmed it down. Refocused it. Changed from Python 2.6 to 3.2. A complete refactoring from which almost nothing of the original book survives except the goals.
Look for it October 1st in the iTunes bookstore. Programming for Absolute Beginners: Building Skills in Programming.
[My intent is to have several Building Skills titles. We'll see how far I get.]
The rewrite involved three substantial changes.
- I removed all of the duplicative reference material. The Python library reference is (now) outstandingly good. When I started using Python over ten years ago, it was not very good, and I started writing a Python reference of my own merely to organize the documentation. The books grew from there; the reference aspect is now useless.
- I dropped almost all Python 2 references. There's a little bit of Python 2 history, but that's it. It's time to move forward, now that most of the major packages seem to have made the switch.
- I changed the focus from processing to data.
Tuesday, September 17, 2013
iWeb File Extract and XML Iterators
That leaves some of us with content in iBlog as well as iWeb. Content we'd like to work with without doing extensive cutting and pasting. Or downloading from a web server. After all, the files are on our computer.
The iWeb files are essentially XML, making them relatively easy to work with. We can reduce the huge, and hugely complex iWeb XML to a simple iterator and use a simple for statement to extract the content.
[Historical note. I wrote a Python script to convert iBlog to RST. It worked reasonably well, all things considered. This is not the first time I've tried to preserve old content from obsolete tools. Sigh.]
Some tools (like SandVox) have an "extract iWeb content" mode. But that's not what we want. We don't want to convert from iWeb to another blog. We want to convert from iWeb to CSV or some other more useful format so we can do some interesting processing, not simple web presentation.
This is a note on how to read iWeb files to get at the content. And further, how to get at XML content in the form of a simple iterator.
Opening The Package
Here's how to get an overview of the package.
import os
import gzip
from contextlib import closing
import xml.etree.ElementTree as xml

path="~/Documents/iWeb/Domain"
path_full= os.path.expanduser(path+".sites2")
for filename in os.listdir(path_full):
    name, ext = os.path.splitext( filename )
    if ext.lower() in ( ".jpg", ".png", ".mov", ".m4v", ".tiff", ".gif", ".m4a", ".mpg", ".pdf" ): continue
    print( filename )
This will reveal the files; we only really care about the "index.xml.gz" file since that has the bulk of the content.
with closing( gzip.GzipFile( os.path.join(path_full,"index.xml.gz") ) ) as index:
    index_doc= xml.parse( index )
    index_root= index_doc.getroot()
This gets us the XML version of the blog.
Finding the Pages
We can use the following to thread through the XML. We're looking for a particular "Domain", a "Site" and a particular blog page within that site. The rest of the blog is mostly text. This portion of the blog is more structured.
For some reason, the domain is "Untitled". The site is "Cruising", and the blog page is "Travel 2012-2013". We insert these target names into XPath search strings to locate the relevant content.
search= '{{http://developer.apple.com/namespaces/bl}}domain[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(domain_name)
domain= index_root.find( search )
mdu_uuid_tag= domain.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
domain_filename= "domain-{0}".format( mdu_uuid_value )
search= './/{{http://developer.apple.com/namespaces/bl}}site[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(site_name)
cruising= domain.find(search)
mdu_uuid_tag= cruising.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
site_filename= "site-{0}".format(mdu_uuid_value)
search= '{{http://developer.apple.com/namespaces/bl}}site-blog[@{{http://developer.apple.com/namespaces/sf}}name="{0}"]'.format(site_blog_name)
site_nodes= cruising.find('{http://developer.apple.com/namespaces/bl}site-nodes')
travel= site_nodes.find(search)
mdu_uuid_tag= travel.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
site_blog_filename= "site-blog-{0}".format(mdu_uuid_value)
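The three-line MDUUID lookup repeats for the domain, the site and the blog page. A small helper could factor it out; this is a hypothetical refactoring, not part of the script as written:

BL = '{http://developer.apple.com/namespaces/bl}'
SFA = '{http://developer.apple.com/namespaces/sfa}'

def mdu_uuid(node):
    """Return the MDUUID string value of a domain, site, site-blog or site-page tag."""
    tag = node.find(BL + 'metadata/' + BL + 'MDUUID')
    return tag.find(BL + 'string').get(SFA + 'string')

domain_filename = "domain-{0}".format(mdu_uuid(domain))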
This will allow us to iterate through the blog entries, called "pages". Each page, it turns out, is stored in a separate XML file with the page details and styles. A lot of styles. We have to assemble the path from the base path, the domain, site, site-blog and site-page names. We'll find an ".xml.gz" file that has the individual blog post.
for site_page in travel.findall('{http://developer.apple.com/namespaces/bl}series/{http://developer.apple.com/namespaces/bl}site-page'):
    mdu_uuid_tag= site_page.find('{http://developer.apple.com/namespaces/bl}metadata/{http://developer.apple.com/namespaces/bl}MDUUID')
    mdu_uuid_value= mdu_uuid_tag.find('{http://developer.apple.com/namespaces/bl}string').get('{http://developer.apple.com/namespaces/sfa}string')
    site_page_filename= "site-page-{0}".format(mdu_uuid_value)
    blog_path= os.path.join(path_full, domain_filename, site_filename, site_blog_filename, site_page_filename )
    with closing( gzip.GzipFile( os.path.join(blog_path,site_page_filename+".xml.gz") ) ) as child:
        child_doc= xml.parse( child )
        child_root= child_doc.getroot()
        main_layer= child_root.find( '{http://developer.apple.com/namespaces/bl}site-page/{http://developer.apple.com/namespaces/bl}drawables/{http://developer.apple.com/namespaces/bl}main-layer' )
Once we have access to the page XML document, we can extract the content. At this point, we could define a function which simply yielded the individual site_page tags.
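A sketch of what that function might look like, simply packaging the loop above as a generator. The name and signature here are my own; the original script keeps everything inline.

def main_layer_iter(travel, path_full, domain_filename, site_filename, site_blog_filename):
    """Yield the main-layer element of each blog page's own .xml.gz document."""
    bl = '{http://developer.apple.com/namespaces/bl}'
    sfa = '{http://developer.apple.com/namespaces/sfa}'
    for site_page in travel.findall(bl + 'series/' + bl + 'site-page'):
        mdu_uuid_tag = site_page.find(bl + 'metadata/' + bl + 'MDUUID')
        mdu_uuid_value = mdu_uuid_tag.find(bl + 'string').get(sfa + 'string')
        site_page_filename = "site-page-{0}".format(mdu_uuid_value)
        blog_path = os.path.join(path_full, domain_filename, site_filename,
                                 site_blog_filename, site_page_filename)
        with closing(gzip.GzipFile(os.path.join(blog_path, site_page_filename + ".xml.gz"))) as child:
            child_root = xml.parse(child).getroot()
            yield child_root.find(bl + 'site-page/' + bl + 'drawables/' + bl + 'main-layer')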
Summary Iterable
The most useful form for the pages is an iterable that yields the date, title and content text. In this case, we're not going to preserve the internal markup, we're just going to extract the text in bulk.
content_map = {}
for ds in main_layer.findall( '{http://developer.apple.com/namespaces/sf}drawable-shape' ):
    style_name= ds.get('{http://developer.apple.com/namespaces/sf}name')
    if style_name is None:
        #xml.dump( ds ) # Never has any content.
        continue
    for tb in ds.findall('{http://developer.apple.com/namespaces/sf}text/{http://developer.apple.com/namespaces/sf}text-storage/{http://developer.apple.com/namespaces/sf}text-body' ):
        # Simply extract text. Markup is lost.
        content_map[style_name] = tb.itertext()
yield content_map
This works because the text has no useful semantic markup. It's essentially HTML formatting full of span and div tags.
Note that this could be a separate generator function, or it could be merged into the loop that finds the site-page tags. It's unlikely we'd ever have another source of site-page tags. But it's very likely that we'd have another function for extracting the text, date and title from a site-page tag. Therefore, we should package this as a separate generator function. We didn't, however. It's just a big old function named postings_iter().
There are three relevant style names. We're not sure why these are used, but they're completely consistent indicators of the content.
- "generic-datefield-attributes (from archive)"
- "generic-title-attributes (from archive)"
- "generic-body-attributes (from archive)"
def flatten_posting_iter( postings=postings_iter(path="~/Documents/iWeb/Domain") ):
    """Minor cleanup to the postings to parse the date and flatten out the title."""
    for content_map in postings:
        date_text= " ".join( content_map['generic-datefield-attributes (from archive)'] )
        date= datetime.datetime.strptime( date_text, "%A, %B %d, %Y" ).date()
        title= " ".join( content_map['generic-title-attributes (from archive)'] )
        body= content_map['generic-body-attributes (from archive)']
        yield date, title, body
Now we can use the following kind of loop to examine each posting.
flat_postings=flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
for date, title, text_iter in sorted(flat_postings):
    for text in text_iter:
        # examine the text for important content.
We've sorted the posting into date order. Now we can process the text elements to look for the relevant content.
In this case, we're looking for Lat/Lon coordinates, which have rather complex (but easy to spot) regular expressions. So the "examine" part is a series of RE matches to collect the data points we're looking for.
We'll leave off those application-specific details. We'll leave it at the following outline of the processing.
def fact_iter( flat_postings=flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain")) ):
    for date, title, text_iter in sorted(flat_postings):
        fact= Fact()
        for text in text_iter:
            # examine the text for important content, set attributes of fact
            if fact.complete():
                yield fact
                fact= Fact()
This iterates through focused data structures that include the requested lat/lon points.
Final Application
The final application function that uses all of these iterators has the following kind of structure.
source= flatten_posting_iter(postings_iter(path="~/Documents/iWeb/Domain"))
with open('target.csv', 'w', newline='') as target:
    wtr= csv.DictWriter( target, Fact.heading )
    wtr.writeheader()
    for fact in fact_iter( source ):
        wtr.writerow( fact.as_dict() )
We're simply iterating through the facts and writing them to a CSV file.
We can also simplify the last bit to this.
wtr.writerows( f.as_dict() for f in fact_iter( source ) )
The iWeb XML structure, while bulky and complex, can easily be reduced to a simple iterator. That's why I love Python.
Thursday, September 12, 2013
Omni Outliner, XML Processing, and Recursive Generators
It has a broad spectrum of alternative file export formats. Most of them are fine for import into some kind of word processor.
But what if the data is more suitable for a spreadsheet or some more structured environment? What if it was a detailed log or a project outline decorated with a column of budget numbers?
We have two approaches: one is workable but not great; the other has numerous advantages.
In the previous post, "Omni Outliner and Content Conversion", we read an export in tab-delimited format. It was workable but icky.
Here's the alternative. This uses a recursive generator function to flatten out the hierarchy. There's a trick to recursion with generator functions.
Answer 2: Look Under the Hood
At the Mac OS X level, an Omni Outline is a "package". A directory that appears to be a single file icon to the user. If we open that directory, however, we can see that there's an XML file inside the package that has the information we want.
Here's how we can process that file.
import xml.etree.ElementTree as xml
import os
import gzip
packagename= "{0}.oo3".format(filename)
assert 'contents.xml' in os.listdir(packagename)
with gzip.GzipFile(packagename+"/contents.xml", 'rb' ) as source:
    self.doc= xml.parse(source)
This assumes it's compressed on disk. The outlines don't have to be compressed. It's an easy try/except block to attempt the parsing without unzipping. We'll leave that as an exercise for the reader.
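One plausible way to write that exercise, assuming gzip's IOError on a non-gzipped file is the signal to fall back to plain XML:

try:
    with gzip.GzipFile(packagename+"/contents.xml", 'rb') as source:
        self.doc = xml.parse(source)
except IOError:
    # Not gzipped after all; parse the XML directly.
    with open(packagename+"/contents.xml", 'rb') as source:
        self.doc = xml.parse(source)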
And here's how we can get the column headings: they're easy to find in the XML structure.
self.heading = []
for c in self.doc.findall(
        "{http://www.omnigroup.com/namespace/OmniOutliner/v3}columns"
        "/{http://www.omnigroup.com/namespace/OmniOutliner/v3}column"):
    # print( c.tag, c.attrib, c.text )
    if c.attrib.get('is-note-column','no') == "yes":
        pass
    else:
        # is-outline-column == "yes"? May be named "Topic".
        # other columns don't have a special role
        title= c.find("{http://www.omnigroup.com/namespace/OmniOutliner/v3}title")
        name= "".join( title.itertext() )
        self.heading.append( name )
Now that we have the column titles, we're able to walk the outline hierarchy, emitting normalized data. The indentation depth number is provided to distinguish the meaning of the data.
This involves a recursive tree-walk. Here's the top-level method function.
def __iter__( self ):
    """Find for outline itself. Each item has values and children.
    Recursive walk from root of outline down through the structure.
    """
    root= self.doc.find("{http://www.omnigroup.com/namespace/OmniOutliner/v3}root")
    for item in root.findall("{http://www.omnigroup.com/namespace/OmniOutliner/v3}item"):
        for row in self._tree_walk(item):
            yield row
Here's the internal method function that does the real work.
def _tree_walk( self, node, depth=0 ):
    """Iterator through item structure; descends recursively.
    """
    note= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}note' )
    if note is not None:
        note_text= "".join( note.itertext() )
    else:
        note_text= None
    data= []
    values= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}values' )
    if values is not None:
        for c in values:
            if c.tag == "{http://www.omnigroup.com/namespace/OmniOutliner/v3}text":
                text= "".join( c.itertext() )
                data.append( text )
            elif c.tag == "{http://www.omnigroup.com/namespace/OmniOutliner/v3}null":
                data.append( None )
            else:
                raise Exception( c.tag )
    yield depth, note_text, data
    children= node.find( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}children' )
    if children is not None:
        for child in children.findall( '{http://www.omnigroup.com/namespace/OmniOutliner/v3}item' ):
            for row in self._tree_walk( child, depth+1 ):
                yield row
This gets us the data in a form that doesn't require a lot of external schema information.
Each row has the indentation depth number, the note text, and the various columns of data. The only thing we need to know is which of the data columns has the indented outline.
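Assuming the methods above live on a class named Outline (a hypothetical name; the real class isn't shown here), using it is a simple loop:

outline = Outline("budget-log")          # opens budget-log.oo3/contents.xml
print(outline.heading)                   # the column titles
for depth, note, data in outline:
    print("    " * depth, data, note or "")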
The Trick
Here's the tricky bit.
When we recurse using a generator function, we have to explicitly iterate through the recursive result set. This is different from recursion in simple (non-generator) functions. In a simple function, it looks like this.
def function( args ):
    if base case: return value
    else:
        return calculation on function( other args )
When there's a generator involved, we have to do this instead.
def function_iter( args ):
    if base case: yield value
    else:
        for x in function_iter( other args ):
            yield x
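A concrete, if trivial, example of the pattern, flattening nested lists:

def flatten_iter(items):
    """Recursive generator: yield leaf values from arbitrarily nested lists."""
    for item in items:
        if isinstance(item, list):
            for x in flatten_iter(item):   # explicitly iterate the recursive call
                yield x
            # Python 3.3 adds "yield from flatten_iter(item)" as a shorthand.
        else:
            yield item

print(list(flatten_iter([1, [2, [3, 4]], 5])))   # [1, 2, 3, 4, 5]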
Columnizing a Hierarchy
The depth number makes our data look like this.
0, "2009"
1, "November"
2, "Item In Outline"
3, "Subitem in Outline"
1, "December"
2, "Another Item"
3, "More Details"
We can normalize this into columns. We can take the depth number as a column number. When the depth numbers are increasing, we're building a row. When the depth number decreases, we've finished a row and are starting the next row.
"2009", "November", "Item in Outline", "Subitem in Outline"
"2009", "December", "Another Item", "More Details"
The algorithm works like this.
row, depth_prev = [], -1
for depth, text in source:
    while len(row) <= depth: row.append(None)
    if depth <= depth_prev: yield row[:]
    row[depth:]= [text]+(len(row)-depth-1)*[None]
    depth_prev= depth
yield row[:]
The yield will have to also handle the non-outline columns that may also be part of the Omni Outliner extract.
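Ignoring those extra columns, here's the fragment packaged as a self-contained generator and run against the sample data above. This is my own packaging, not code from the project.

def columnized(source):
    """Convert (depth, text) pairs into complete, normalized rows."""
    row, depth_prev = [], -1
    for depth, text in source:
        while len(row) <= depth:
            row.append(None)
        if depth <= depth_prev:
            yield row[:]                 # a completed row; copy before reusing the list
        row[depth:] = [text] + (len(row) - depth - 1) * [None]
        depth_prev = depth
    yield row[:]

sample = [(0, "2009"), (1, "November"), (2, "Item In Outline"), (3, "Subitem in Outline"),
          (1, "December"), (2, "Another Item"), (3, "More Details")]
for r in columnized(sample):
    print(r)
# ['2009', 'November', 'Item In Outline', 'Subitem in Outline']
# ['2009', 'December', 'Another Item', 'More Details']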
Tuesday, September 10, 2013
Omni Outliner and Content Conversion
It has a broad spectrum of alternative file export formats. Most of them are fine for import into some kind of word processor.
But what if the data is more suitable for a spreadsheet or some more structured environment? What if it was a detailed log or a project outline decorated with a column of budget numbers?
We have two approaches: one is workable but not great; the other has numerous advantages.
Answer 1: Workable
Sure, you say, that's easy. Export into a Plain Text with Tabs (or HTML or OPML) and then parse the resulting tab-delimited file.
In Python. Piece of cake.
import csv
class Tab_Delim(csv.Dialect):
    delimiter='\t'
    doublequote=False
    escapechar='\\'
    lineterminator='\n'
    quotechar=''
    quoting=csv.QUOTE_NONE
    skipinitialspace=True

rdr= csv.reader( source, dialect=Tab_Delim )
column_names= next(rdr)
for row in rdr:
    # Boom. There it is.
That gets us started. But.
Each row is variable length. The number of columns varies with the level of indentation. The good news is that the level of indentation is consistent. Very consistent. Year, Month, Topic, Details in this case.
[When an outline is super consistent, one wonders why a spreadsheet wasn't used.]
Each outline node in the export is prefaced with "- ".
It looks pretty when printed. But it doesn't parse cleanly, since the data moves around.
Further, it turns out that "notes" (blocks of text attached to an outline node, but not part of the outline hierarchy) show up in the last column along with the data items that properly belong in the last column.
Sigh.
The good news is that notes seem to appear on a line by themselves, where the data elements seem to be properly attached to outline nodes. It's still possible to have a "blank" outline node with data in the columns, but that's unlikely.
We have to do some cleanup.
Answer 1A: Cleanup In Column 1
We want to transform indented data into proper first-normal form schema with a consistent number of fixed columns. Step 1 is to know the deepest indent. Step 2 is to then fill each row with enough empty columns to normalize the rows.
Each specific outline has a kind of schema that defines the layout of the export file. One of the tab-delimited columns will be the "outline" column: it will have tabs and leading "-" to show the outline hierarchy. The other columns will be non-outline columns. There may be a notes column and there will be the interesting data columns which are non-notes and non-outline.
In our tab-delimited export, the outline ("Topic") is first. Followed by two data columns. The minimal row size, then, will be three columns. As the topics are indented more and more, then the number of columns will appear to grow. To normalize, then, we need to pad, pushing the last two columns of data to the right.
That leads to a multi-part cleanup pipeline. First, figure out how many columns there are.
rows= list( rdr )
width_max= max( len(r) for r in rows )+1
This allows us the following two generator functions to fill each row and strip "-".
def filled( source, width, data_count ):
    """Iterable with each row filled to given width.
    Rightmost {data_count} columns are pushed right to preserve
    their position.
    """
    for r_in in source:
        yield r_in[:-data_count] + ['']*(width-len(r_in)) + r_in[-data_count:]

def cleaned( source ):
    """Iterable with each column cleaned of leading "- "
    """
    def strip_dash( c ):
        return c[2:] if c.startswith('- ') else c
    for row in source:
        yield list( strip_dash(c) for c in row )
That gets us to the following main loop in a conversion function.
for row in cleaned( filled( rows, width_max, len(columns) ) ):
    # Last column may have either a note or column data.
    # If all previous columns empty, it's probably a note, not numeric value.
    if all( len(c)==0 for c in row[:-1] ):
        row[4]= row[-1]
        row[-1]= ''
    yield row
Now we can do some real work with properly normalized data. With overheads, we have an 80-line module that lets us process the outline extract in a simple, civilized CSV-style loop.
The Ick Factor
What's unpleasant about this is that it requires a fair amount of configuration.
The conversion from tab-delim outline to normalized data requires some schema information that's difficult to parameterize.
1. Which column has the outline.
2. Are there going to be notes on lines by themselves.
We can deduce how many columns of ancillary data are present, but the order of the columns is a separate piece of logical schema that we can't deduce from the export itself.
Tuesday, August 27, 2013
Obstinate Idiocy, Expanded
See Obstinate Idiocy for some background.
Here are three warning signs I was able to deduce.
- No Rational Justification
- Ineffective Tool Choice
- Random Whining
To which I can now add two more.
Symptom 4 of Obstinate Idiocy is that all questions are rhetorical and they often come with pre-argued answers.
Actual email quote:
Me: ">Excel is almost the stupidest choice possible"
OI: "What criteria are you using to make that statement? My criteria was that I needed a way for non-tech people and non-programmers..."
And on the email spins, pre-arguing points and pre-justifying a bad answer. Since their argument is already presented (in mind-numbing detail), there's no effective way to answer the question they asked. Indeed, there's little point in trying to answer, since the pre-argued response is likely to be the final response.
In order to answer, we have to get past the pre-argued response. And this can be difficult because this devolves to "it's political, you don't need the details." So, if it's not technical, why am I involved?
Symptom 5 of Obstinate Idiocy is Learning is Impossible. This may actually be the root cause for Symptom 3, Ineffective Tool Choice. It now seems to me that the tool was chosen to minimize learning. I had suggested using Mathematica. I got this response: " I don't know Python or R or SAS." The answer seems like a non-sequitur because it is. It's justification for a bad decision.
The problem they're trying to solve is gnarly; perhaps it's time to consider learning a better toolset.
Excel has already failed the OI. They asked for an opinion ("Q2: What do you believe are the pros/cons of ... using Excel with "Excel Solver" ...?") that seems to ignore the fact that they already failed trying to use Excel. They already failed, and they followed up by asking for the pros and cons of a tool they already failed with.
From this limited exchange it appears that they're so unwilling to learn that they can't gather data from their own experience and learn from it.
Thursday, August 8, 2013
Negative Requirements
An actual quote.
... don't screw up cutting and pasting and the "/" vs "\" depending on unix / windows.
Why not list everything that's not supposed to happen?
- No fire in the server room.
- No anthrax outbreak.
- No Zombie apocalypse.
The list could go on. I wonder why it doesn't.
Tuesday, August 6, 2013
How to Manage Risk
Orders of Ignorance and Risk Management.
- Check the facts,
- plan specific contingencies,
- use Agile methods because of their built-in ability to manage change.
Thursday, July 25, 2013
Database Conversion or Schema Migration
HPL related a tale of woeful conversion problems from a mismanaged schema migration.
While I could feel HPL's pain, the reasons given for their pain were wrong. They didn't quite get the lessons they had learned. Consequently, HPL sounded like someone doomed to repeat the mistake, or—worse—unlearning their lessons.
Here's HPL's most distressing comment.
"we can't migrate over the weekend and be done w/ it."Apparently, the horror of a weekend migration was somehow desirable to HPL. Who wants a lost weekend? And who wants to put all of the eggs in a single basket?
Anyone who's done more than one "lost weekend migration"—and who's also honest—knows that they don't go well. There are always subsets of data that (a) don't get converted properly and (b) have to get swept under the carpet in order to claim to meet the schedule.
It's a standard situation to have less than 100% of the data successfully converted and still call the effort a success. If 100% was not required, why lose a weekend over it?
Good Plans and Bad Plans
From far wiser people than me, I learned an important lesson in schema migration.
These Wiser Heads ran a "conversion" business. They moved data and applications from platform to platform. They knew a lot about database schema migrations. A lot.
Their standard plan was to build a schema migration script (usually a sequence of apps) that could be run to convert the database (or files or whatever) from old to new schema as often as was necessary.
I'll repeat that.
As often as was necessary.
They debugged the script to get to an acceptable level of conversion. The data conversion (or schema migration) was perfectly repeatable. Of course, they longed for 100% conversion; but pragmatically, the legacy software had bad data. So some fraction would not convert. And once that fraction was found, the schema migration applications could be modified to treat the non-convertible data in some intelligent way.
Their stated goal was to convert data and run parallel testing with that converted data as often as necessary to create confidence that the new data was as perfect a conversion as was possible. At some point, the confidence became certainty and the parallel testing was deemed complete. Since they were parallel testing with live data, the decision amounted to a formalized "commissioning" of the new application. And by then, the new application was already being used.
There are bad ways to do schema migration, of course. HPL listed many.
Horrible Mistakes
The horror story from HPL included this:
"For the migrated tables, create views in the old system and create instead of triggers on those views to ship data to the new system."It appears that they used views and triggers to create a new system "facade" over the legacy system. Apparently, they wanted both suites of application software to coexist. Not a good approach to schema migration. It appeared that they were trying to share one database with two application schema.
This seems like it's doomed. Unless they're all geniuses.
Wiser Heads would have strongly suggested that the new system use an extract of the old system's data.
HPL goes on to complain,
"Sometimes we can take over a column or 2 and sometimes we can only take over some of the data in the table".HPL emphasizes this point with "This is not that far fetched". I'm not sure why the emphasis was needed.
This is not "far fetched". It doesn't need emphasis. It's not really much of a problem, either. It's a standard part of schema migration. Extracting a copy of the data makes this quite easy. Triggers and views to create some kind of active SQL-based Facade is what created the complexity. Not the number of columns involved.
HPL summarizes,
"So you end up w/ [many] tables/views triggers all moving data back and forth and faking stuff out"Back and forth. A fundamental mistake. A copy can be much easier to manage. One way data movement: Legacy to New.
HPL concludes with a litany of errors of various types: performance, change management, file system issues, error logging and auditing. Blah blah blah. Yes, it was a nightmare. I feel their pain.
What About Coexistence?
It appears that HPL was involved in a project where the entire old and new applications were supposed to somehow coexist during the conversion.
It appeared that they failed to do any kind of partitioning.
Coexistence is not a trivial exercise. Nor is it a monolith where the entire legacy application suite must coexist with the entire new schema and the entire new application suite.
Pragmatically, coexistence usually means that some portion of the legacy must be kept running while some other portion is modernized. This means the coexistence requires that the application portfolio be partitioned.
Step 1: Some suite of functionality is migrated. That means data from the legacy database/file system is copied to new. That also means some data from new is copied back into the legacy database file/system. Copied.
Step 2: Some other suite of functionality is migrated. As functionality is moved, less and less data is copied back to the legacy.
At some point, this copying back is of no value and can be discontinued.
What About Timing?
This copying clearly requires some coordination. It's not done haphazardly.
Does it require "real time" data movement? i.e. triggers and views?
Rarely is real time movement required. This is the point behind partitioning wisely. Partitioning includes timing considerations as well as data quality and functionality considerations.
It's remotely possible that timing and partitioning are so pathological that data is required in both legacy and new applications concurrently. This is no reason to throw the baby out with the bathwater. This is nothing more than an indication that the data is being copied back to the legacy application close to real time.
This also means performance must be part of the test plan. As well as error handling and diagnostic logging. None of this is particularly difficult. It simply requires care.
Lessons Learned
HPL appeared to make the claim that schema migration is super hard. Or maybe that coexistence is really hard.
Worse, HPL's horror story may be designed to indicate that a horrifying lost weekend is the only way to do schema migration.
Any or all of these are the wrong lessons to learn.
I think there are several more valuable lessons here.
- Schema migration can and should be done incrementally. It's ideally tackled as an Agile project using Scrum techniques. It's okay to have release cycles that are just days apart as each phase of the conversion is run in parallel and tested to the user's satisfaction.
- Coexistence requires partitioning to copy any data back to unconverted legacy components. Triggers and views and coexistence of entire suites of software make a difficult problem harder.
- The conversion script is just another first-class application. The same quality features apply to the conversion as to every other component of the app suite.
- The conversion must be trivially repeatable. It must be the kind of thing that can be run as often as necessary to move legacy data to the new schema.
Tuesday, July 23, 2013
Almost a good idea
See http://en.wikipedia.org/wiki/AppleWorks#End_of_Appleworks
Which is fine unless you have an old computer with old applications that still works. For example, a 2002-vintage iMac G4 http://www.imachistory.com/2002/ still works. Slowly.
When someone jumps 11 years to a new iMac, they find that their 2002 iMac with 2007 apps has files which are essentially unreadable by modern applications.
How can someone jump a decade and preserve their content?
1. iWork Pages is cheap. Really. $19.99. I could have used it to convert their files to their new iMac and then told them to ignore the app. Pages can be hard to learn. For someone jumping from 2007-vintage apps, it's probably too much. However, they can use TextEdit once the files are converted to RTF format.
2. iWork for iCloud may be a better idea. But they have to wait a while for it to come out. And they want their files now.
3. Try to write a data extractor.
Here are some places to start.
- https://github.com/teacurran/appleworks-parser
- http://fossies.org/linux/misc/abiword-2.9.4.tar.gz/dox/ie__imp__ClarisWorks_8cpp_source.html This appears to have a known bug in chaining through the ETBL resources.
- https://github.com/joshenders/appleworks_format This project is more notes and examples than useful code.
In the long run $19.99 for a throw-away copy of Pages is probably the smartest solution.
import argparse
import struct
import sys
import os
from io import open
class CWK:
    """Analyzes a .CWK file; must be given a file opened in "rb" mode.
    """
    DSET = b"DSET"
    BOBO = b"BOBO"
    ETBL = b"ETBL"
    def __init__( self, open_file ):
        self.the_file= open_file
        self.data= open_file.read()
    def header( self ):
        self.version= self.data[0:4]
        #print( self.version[:3] )
        bobo= self.data[4:8]
        assert bobo == self.BOBO
        version_prev= self.data[8:12]
        #print( version_prev[:3] )
        return self.version
    def margins( self ):
        self.height_page= struct.unpack( ">h", self.data[30:32] )
        self.width_page= struct.unpack( ">h", self.data[32:34] )
        self.margin_1= struct.unpack( ">h", self.data[34:36] )
        self.margin_2= struct.unpack( ">h", self.data[36:38] )
        self.margin_3= struct.unpack( ">h", self.data[38:40] )
        self.margin_4= struct.unpack( ">h", self.data[40:42] )
        self.margin_5= struct.unpack( ">h", self.data[42:44] )
        self.margin_6= struct.unpack( ">h", self.data[44:46] )
        self.height_page_inner= struct.unpack( ">h", self.data[46:48] )
        self.width_page_inner= struct.unpack( ">h", self.data[48:50] )
    def dset_iter( self ):
        """First DSET appears to have content.
        This DSET parsing may not be completely correct.
        But it finds the first DSET, which includes all
        of the content except for headers and footers.
        It seems wrong to simply search for DSET; some part of the
        resource directory should point to this or provide an offset to it.
        """
        for i in range(len(self.data)-4):
            if self.data[i:i+4] == self.DSET:
                #print( "DSET", i, hex(i) )
                pos= i+4
                for b in range(5): # Really? Always 5?
                    size, count= struct.unpack( ">Ih", self.data[pos:pos+6] )
                    pos += size+4
                #print( self.data[i:pos] )
                yield pos
    def content_iter( self, position ):
        """A given DSET may have multiple contiguous blocks of text."""
        done= False
        while not done:
            size= struct.unpack( ">I", self.data[position:position+4] )[0]
            content= self.data[position+4:position+4+size].decode("MacRoman")
            #print( "ENDING", repr(self.data[position+4+size-1]) )
            if self.data[position+4+size-1] == 0:
                yield content[:-1]
                done= True
                break
            else:
                yield content
            position += size+4
The function invoked from the command line is this.
def convert( *file_list ):
    for f in file_list:
        base, ext = os.path.splitext( f )
        new_file= base+".txt"
        print( '"Converting {0} to {1}"'.format(f,new_file) )
        with open(f,'rb') as source:
            cwk= CWK( source )
            cwk.header()
            with open(new_file,'w',encoding="MacRoman") as target:
                position = next( cwk.dset_iter() )
                for content in cwk.content_iter(position):
                    # print( content.encode("ASCII",errors="backslashreplace") )
                    target.write( content )
        atime, mtime = os.path.getatime(f), os.path.getmtime(f)
        os.utime( new_file, (atime,mtime) )
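The argparse wiring that invokes convert() isn't shown; a minimal sketch of what it might look like (my guess, not the original script):

def main():
    parser = argparse.ArgumentParser(description="Convert AppleWorks .cwk files to plain text")
    parser.add_argument("files", nargs="+", help=".cwk files to convert")
    args = parser.parse_args()
    convert(*args.files)

if __name__ == "__main__":
    main()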
This is brute-force. But. It seemed to work. Buying Pages would have been less work and probably produced better results.
This does have the advantage of producing files with the original date stamps. Other than that, it seems an exercise in futility because there's so little documentation.
What's potentially cool about this is the sane way that Python3 handles bytes as input. Particularly pleasant is the way we can transform the file-system sequence of bytes into proper Python strings with a very simple bytes.decode().
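For example, a toy snippet (not from the converter):

data = b"r\x8esum\x8e for 2002"      # raw bytes as read from the old file
text = data[:6].decode("MacRoman")   # slicing stays in bytes; decode makes a str
print(text)                          # résumé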
Thursday, July 18, 2013
NoSQL Befuddlement: DML and Persistence
I got an email in which the simple concepts of "data manipulation" and "persistence" had become entangled with SQL DML to a degree that the conversation failed to make sense to me.
They had been studying Pandas and had started to realize that the RDBMS and SQL were not an essential feature of all data processing software.
I'll repeat that with some emphasis to show what I found alarming.
They had started to realize that the RDBMS and SQL were not an essential feature of all data processing.
Started to realize.
They were so entrenched in RDBMS thinking that the very idea of persistent data outside the RDBMS was novel to them.
They asked me about extending their growing realization to encompass other SQL DML operations: INSERT, UPDATE and DELETE. Clearly, these four verbs were all the data manipulation they could conceive of.
This request meant several things, all of which are unnerving.
- They were sure—absolutely sure—that SQL DML was essential for all persistent data. They couldn't consider read-only data? After all, a tool like Pandas is clearly focused on read-only processing. What part of that was confusing to them?
- They couldn't discuss persistence outside the narrow framework of SQL DML. It appears that they had forgotten about the file system entirely.
- They conflated data manipulation and persistence, seeing them as one thing.
Persistence and Manipulation
We have lots of persistent data and lots of manipulation. Lots. So many that it's hard to understand what they were asking for.
Here's some places to start looking for hints on persistence.
http://docs.python.org/3/library/persistence.html
http://docs.python.org/3/library/archiving.html
http://docs.python.org/3/library/fileformats.html
http://docs.python.org/3/library/netdata.html
http://docs.python.org/3/library/markup.html
http://docs.python.org/3/library/mm.html
This list might provide some utterly random hints as to how persistent data is processed outside of the narrow confines of the RDBMS.
For manipulation... Well... Almost the entire Python library is about data manipulation. Everything except itertools is about stateful objects and how to change state ("manipulate the data.")
Since the above lists are random, I asked them for any hint as to what their proper use cases might be. It's very difficult to provide generic hand-waving answers to questions about concepts as fundamental as state and persistence. State and persistence pervade all of data processing. Failure to grasp the idea of persistence outside the database almost seems like a failure to grasp persistence in the first place.
The Crazy Request
Their original request was—to me—incomprehensible. As far as I can tell, they appeared to want the following.
I'm guessing they were hoping for some kind of matrix showing how DML or CRUD mapped to other non-RDBMS persistence libraries.
So, it would be something like this.
SQL | OS | Pandas | JSON | CSV |
---|---|---|---|---|
CREATE | file() | some pandas request | json.dump() | csv.writer() |
INSERT | file.write() | depends on the requirements | could be anything | csv.writerow() |
UPDATE | file.seek(); file.write() | doesn't make sense | not something that generalizes well | depends on the requirements |
DELETE | file.seek(); file.write() | inappropriate for analysis | depends on the requirements | hard to make this up without more details |
APPEND -- not part of DML | file.write() | depends on requirements | could be anything | csv.writerow() |
The point here is that data manipulation, state and persistence are intimately tied to the application's requirements and processing.
All of which presumes you are persisting stateful objects. It is entirely possible that you're persisting immutable objects, and state change comes from appending new relationships, not changing any objects.
The SQL reductionist view isn't really all that helpful. Indeed, it appears to have been deeply misleading.
The Log File
Here's an example that seems to completely violate the spirit of their request. This is ordinary processing that doesn't fit the SQL DML mold very well at all.
Let's look at log file processing.
- Logs can be persisted as simple files in simple directories. Compressed archives are even better than simple files.
- For DML, a log file is append-only. There is no insert, update or delete.
- For retrieval, a query-like algorithm can be elegantly simple.
Interestingly, we would probably lose considerable performance if we tried to load a log file into an RDBMS table. Why? The RDBMS file for a table that represents a given log file is much, much larger than the original file. Reading a log file directly involves far fewer physical I/O operations than the table.
Here's something that I can't answer for them without digging into their requirements.
A "filter" could be considered as a DELETE. Or a DELETE can be used to implement a filter. Indeed, the SQL DELETE may work by changing a row's status, meaning the the SQL DELETE operation is actually a filter that rejects deleted records from future queries.
Which is it? Filter or Delete? This little conundrum seems to violate the spirit of their request, also.
Python Code
Here's an example of using persistence to filter the "raw" log files. We keep the relevant events and write these in a more regular, easier-to-parse format. Or, perhaps, we delete the irrelevant records. In this case, we'll use CSV file (with quotes and commas) to speed up future parsing.
We might have something like this:
import re
import csv

log_row_pat= re.compile( r'(\d+\.\d+\.\d+\.\d+) (\S+?) (\S+?) (\[[^\]]+?]) ("[^"]*?") (\S+?) (\S+?) ("[^"]*?") ("[^"]*?")' )

def log_reader( row_source ):
    for row in row_source:
        m= log_row_pat.match( row )
        if m is not None:
            yield m.groups()

def some_filter( source ):
    for row in source:
        if some_condition(row):
            yield row

with open( subset_file, "w", newline="" ) as target:
    with open( source_file ) as source:
        rdr= log_reader( source )
        wtr= csv.writer( target )
        wtr.writerows( some_filter( rdr ) )
This is amazingly fast and very simple. It uses minimal memory and results in a subset file that can be used for further analysis.
Is the filter operation really a DELETE?
This should not be new; it should not even be interesting.
As far as I can tell, they were asking me to show them how data processing can be done outside a relational database. This seems obvious beyond repeating. Obvious to the point where it's hard to imagine what knowledge gap needs to be filled.
Conclusion
Persistence is not a thing you haphazardly laminate onto an application as an afterthought.
Data Manipulation is not a reductionist thing that has exactly four verbs and no more.
Persistence—like security, auditability, testability, maintainability—and all the quality attributes—is not a checklist item that you install or decline.
Without tangible, specific use cases, it's impossible to engage in general hand-waving about data manipulation and persistence. The answers don't generalize well and depend in a very specific way on the nature of the problem and the use cases.
Tuesday, July 16, 2013
How Managers Say "No": The RDBMS Hegemony Example
"Their response was nice but can you flush [sic] it out more"
There's a specific suggestion for this "more". But it indicates a profound failure to grasp the true nature of the problem. It amounts to a drowning person asking us to throw them a different colored brick. It's a brick! You want a life preserver! "No," they insist, "I want a brick to build steps to climb out."
For example, trying to mash relatively free-form "documents" into an RDBMS is simple craziness. Documents—you know, the stuff created by word processors—are largely unstructured or at best semi-structured. For most RDBMS's, they're represented as Binary Large Objects (BLOBs). To make it possible to process them, you can decorate each document with "metadata" or tags and populate a bunch of RDBMS attributes. Which is fine for the first few queries. Then you realize you need more metadata. Then you need more flexible metadata. Then you need interrelated metadata to properly reflect the interrelationships among the documents. Maybe you flirt with a formal ontology. Then you eventually realize you really should have started with document storage, not a BLOB in an RDBMS.
Yes, some companies offer combo products that do both. The point is this: avoiding the RDBMS pitfall in the first place would have been a big time and money saver. Google Exists. The RDBMS is not the best choice for all problems.
- Getting away from RDBMS Hegemony requires management thinking and action.
- Management thinking is a form of pain.
- Management action is a form of pain.
- Managers hate pain.
"I finally convinced my current client that RDBMS's are expensive in terms of adding another layer to the archtiecture [sic] and then trying to maintain it."
We'll return to the "more information" part below.
It was good to start the conversation.
It's good to continue the conversation. But the specific request was silliness.
Exposing the Existing Pain
The best place to look for avoidable labor is break-fix problem reports, bugs and enhancements. Another good source of avoidable costs are schema migrations: waiting for the DBA's to add columns to a table, or add tables to a database.
If you can point to specific trouble tickets that come from wrong use of an RDBMS, then you might be able to get a manager to think about it.
The Airtight Case
Your goal on breaking RDBMS Hegemony is to have a case that is "airtight". Ideally, so airtight that the manager in question sits up, takes notice, and demands that a project be created to rip out the database and save the company all that cost. Ideally, their action at the end of the presentation is to ask how long it will take to realize the savings.
Ideally.
It is actually pretty easy to make an airtight case. There are often a lot of trouble tickets and project delays due to overuse and misuse of the RDBMS.
However.
Few managers will actually agree to remove the RDBMS from an application that's limping along. Your case may be airtight, and compelling, and backed with solid financials, but that's rarely going to result in actual action.
"If it ain't broke, don't fix it," is often applied to projects with very high thresholds for broken. Very high.
This is another way management says "no". By claiming that the costs are acceptable or the risk of change is unacceptable. Even more farcical claims will often be made in favor of the status quo. They may ask for more cost data, but it's just an elaborate "no".
It's important to make the airtight case.
It's important to accept the "no" gracefully.
Management Rewards
When you look at the management reward structure, project managers and their ilk are happiest when they have a backlog of huge, long-running projects that involve no thinking and no action. Giant development efforts with stable requirements, unchallenging users, mature technology and staff who don't mind multiple-hour status meetings.
A manager with a huge long-running project feels valuable. When the requirements, people and technology are stable, then thinking is effectively prevented.
Suggesting that technology choices are not stable introduces thinking. Thinking is pain. The first response to pain is "no". Usually in the form of "get more data."
Making a technology choice may require that a manager facilitate a conversation which selects among competing technology choices. That involves action. And possible thinking.
Real Management Pain. The response? Some form of "no".
Worse. (And it does get worse.)
Technology selection often becomes highly political. The out-of-favor project managers won't get projects approved because of "risky technology." More Management Pain.
War story. Years ago, I watched the Big Strategic Initiative shot down in flames because it didn't have OS/370 as the platform. The "HIPPO" (Highest Paid Person's Opinion) was that Unix was "too new" and that meant risk. Unix predates OS/370 by many years. When it comes to politics, facts are secondary.
Since no manager wants to think about potential future pain, no manager is going to look outside the box. Indeed, they're often unwilling to look at the edge of the box. The worst are unwilling to admit there is a box.
The "risk" claim is usually used to say "no" to new technology. Or. To say "no" to going back to existing, well-established technology. Switching from database BLOBs to the underlying OS file system can turn into a bizzaro-world conversation where management is sure that the underlying OS file system is somehow less trustworthy than RDBMS BLOBs. The idea that the RDBMS is using the underlying file system for persistence isn't a compelling argument.
It's important to challenge technology choices for every new project every time.
It's necessary to accept the "no" gracefully.
The "stop using the database for everything" idea takes a while to sink in.
Proof Of Concept
The only way to avoid management pain (and the inaction that comes from pain avoidance) is to make the technology choice a fait accompli.
You have to actually build something that actually works and passes unit tests and everything.
Once you have something which works, the RDBMS "question" will have been answered. But—and this is very important—it will involve no management thought or action. By avoiding pain, you also default into a kind of management buy-in.
War Story
The vendors send us archives of spreadsheets. (Really.) We could unpack them and load them into the RDBMS. But. Sadly. The spreadsheets aren't consistent. We either have a constant schema migration problem adding yet another column for each spreadsheet, or we have to get rid of the RDBMS notion of a maximalist schema. We don't want the schema to be an "at most" definition; we'd need the schema to be an "at least" that tolerates irregularity.
It turns out that the RDBMS is utterly useless anyway. We're barely using any SQL features. The vendor data is read-only. We can't UPDATE, INSERT or DELETE under any circumstances. The delete action is really a ROLLBACK when we reject their file and a CREATE when they send us a new one.
We're not using any RDBMS features, either. We're not using long-running locks for our transactions; we're using low-level OS locks when creating and removing files. We're not auditing database actions; we're doing our own application logging on several levels.
All that's left are backups and restores. File system backups and restores. It turns out that a simple directory tree handles the vendor-supplied spreadsheet issue gracefully. No RDBMS used.
We had—of course—originally designed a lot of fancy RDBMS tables for loading up the vendor-supplied spreadsheets. Until we were confronted with reality and the inconsistent data formats.
We quietly stopped using the RDBMS for the vendor-supplied data. We wrote some libraries to read the spreadsheets directly. We wrote application code that had methods with names like "query" and "select" and "fetch" to give a SQL-like feel to the code.
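I can't show the real library, but the flavor was something like this sketch. The names are hypothetical, and the sheets are pretended to be CSV for brevity.

import csv

class SheetTable:
    """Read-only, SQL-flavored access to one vendor spreadsheet (saved as CSV)."""
    def __init__(self, path):
        self.path = path
    def select(self, **criteria):
        """Yield rows (as dicts) matching all of the given column=value criteria."""
        with open(self.path, newline="") as source:
            for row in csv.DictReader(source):
                if all(row.get(k) == v for k, v in criteria.items()):
                    yield row
    def fetch(self, **criteria):
        """Return the first matching row, or None."""
        for row in self.select(**criteria):
            return row
        return None

# invoices = SheetTable("vendor/2013-07-invoices.csv")
# for row in invoices.select(status="OPEN"):
#     ...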
Management didn't need to say "no" by asking for more information. They couldn't say no because (a) it was the right thing to do and (b) it was already done. It was cheaper to do it than to talk about doing it.
Failure To See The Problem
The original email continued to say this:
"how you can achieve RDBMS like behavior w/out an actual RDBMS"What? Or perhaps: Why?
If you need RDBMS-like behavior, then you need an RDBMS. That request makes precious little sense as written. So. Let's dig around in the email for context clues to see what they really meant.
"consider limting [sic] it to
1) CREATE TABLE
2) INSERT
3) UPDATE
An update requires a unique key. Let's limit the key to contain only 1 column.
4) DELETE
A delete requires a unique key. Let's limit the key to contain only 1 column."