
Wednesday, November 4, 2009

Parsing HTML from Microsoft Products (Like Front Page, etc.)

Ugh. When you try to parse MS-generated HTML, you find some extension syntax that is completely befuddling.

I've tried a few things in the past; none of them were particularly good.

In reading a file recently, I found that even Beautiful Soup was unable to prettify or parse it.
The document was filled with <!--[if...]>...<![endif]--> constructs that looked vaguely directive or comment-like, but still managed to stump the parser.

The BeautifulSoup parser has a markupMassage parameter that applies a sequence of regexps to the source document to clean up things that would otherwise baffle it. Some things, however, are too complex for simple regexps. Specifically, these nested comment-like things were totally confusing.
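
For what it's worth, here's roughly how extra massage rules get supplied in BeautifulSoup 3: a list of (pattern, replacement function) pairs, as best I recall the API, so check it against your version. A flat substitution like this can strip an individual marker, but it has no way to track how the markers nest.

import re
from BeautifulSoup import BeautifulSoup

page = open( "msword_export.html" ).read()  # placeholder file name

# Append one extra rule to the stock massage list: blindly delete
# any <![endif]--> marker before parsing.
extra_massage = BeautifulSoup.MARKUP_MASSAGE + [
    ( re.compile( r"<!\[endif\]-*>" ), lambda match: "" ),
]
soup = BeautifulSoup( page, markupMassage=extra_massage )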

Here's what I did. I wrote a simple generator which emits only the text not guarded by these constructs. The resulting sequence of text blocks can then be assembled into a document that BeautifulSoup can parse.

import re

def clean_directives( page ):
    """
    Stupid Microsoft "Directive"-like comments!
    Must remove all <!--[if...]>...<![endif]--> sequences. Which can be nested.
    Must remove all <![if...]>...<![endif]> sequences. Which appear to be the nested version.
    """
    # Matches the opening [if ...] marker and the closing [endif] marker,
    # with or without the surrounding comment dashes.
    if_endif_pat= re.compile( r"(<!-*\[if .*?\]>)|(<!\[endif\]-*>)" )
    context= []   # stack of currently open [if ...] markers, to handle nesting
    start= 0      # where the current unguarded block of text begins
    for m in if_endif_pat.finditer( page ):
        if "[if" in m.group(0):
            if start is not None:
                # Emit the unguarded text collected so far.
                yield page[start:m.start()]
            context.append(m)
            start= None
        elif "[endif" in m.group(0):
            context.pop()
            if len(context) == 0:
                # Outermost [endif] just closed; resume collecting text here.
                start= m.end()
    if start is not None:
        yield page[start:]
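
Assembling the blocks is then just a join. Something like this (the file name is a placeholder):

from BeautifulSoup import BeautifulSoup

page = open( "msword_export.html" ).read()  # placeholder file name
cleaned = "".join( clean_directives( page ) )
soup = BeautifulSoup( cleaned )
print soup.prettify()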

2 comments:

  1. Those if...endif (blogspot won't let me post the real syntax...grrr) things are called "conditional comments", and are used to do browser detection and to try to make up for the fact that MS couldn't be bothered to follow web standards for a really long time.

    see: http://www.quirksmode.org/css/condcom.html

    Also, I'm curious, did you try using lxml.html? It's often handy when dealing with broken pages, and sometimes it can even deal with pages that BeautifulSoup chokes on.

    I do like your solution, though.

  2. +1 on lxml; it's the Swiss Army knife (tm) of xml/html parsing and munging. I'll let blogger screw up the formatting - this is pretty easy:

    # this strips all styles, id and class attributes
    from lxml.html import clean, fromstring, tostring
    cleaner = clean.Cleaner(page_structure=False,
                            style=True,
                            safe_attrs_only=True,
                            comments=True,
                            remove_unknown_tags=True,
                            remove_tags=['span',])

    doc = fromstring(open('mswordexport.html').read().decode('windows-1252'))
    cleaner(doc)
    # clear certain attributes
    for el in doc.xpath('.//*'):
        el.attrib.pop('id', None)
        el.attrib.pop('class', None)
        el.attrib.pop('style', None)

    print tostring(doc).encode('utf-8')

