Tuesday, September 18, 2018

Data Modeling Nightmare -- XML, HTML, and Markdown

Here's a particularly tangled and difficult problem. It arises because I have another blog. Specifically this: Team Red Cruising. And it's an unholy mess.

There are two important features of the Team Red Cruising blog.
  1. It's managed with off-line editor(s) so I can write posts from the boat and then upload them when I get connectivity. Welcome to being a technomad -- I don't always have a web-based blog editor available.
  2. It was actually created with two different off-line editors over a period of years: iWeb and Sandvox. iWeb is long dead. Sandvox hasn't seen many updates recently, and I think I'd like to move on to something newer and "better". 
(In this case, "better" means iOS-friendly. e.g., Blogo or BlogPad ProAlso. Blogo's support site seems to be a right mess. Not a good look. They're working on it.)

The blog isn't the unholy mess. We'll get to the mess below. First, however some background on the overall strategy. I want to move my content. What's involved? There are several things in play: the hosting, the target, and the source. So. Essentially. Everything.

Changing the Hosting Platform

Both of my legacy tools would export and upload the changes to my hosting service directly, avoiding the overheads of having any complex hosting software. The site was static and served simply from the filesystem via Apache httpd. Publishing was an SFTP transfer to the server. Nothing more. The "platform" was almost nothing.

(I could switch to using an Amazon S3 bucket and a DNS entry and it would work nicely.)

Both of these offline editing tools have a tiny bias toward working with hosting services like WordPress. Blogo claims it can also work with Medium, and Blogger, as well.

This means running Wordpress on top of my default SFTP/Apache configuration. I use A2 Hosting, so this is really easy to do.

So. The hosting is more-or-less settled. I'll do very little. (Dealing with breaking links is a separate hand-wringing exercise.)

In order to move from iWeb and Sandvox to another tool, and start using WordPress, I have two strategies for converting the content.
  1. Ignore my legacy content. Leave it where it is, more-or-less uneditable. The tool(s) are gone, all that's left is the static HTML output from the tool. 
  2. Gather the legacy content and migrate it to WordPress and then pick an offline tool that works with WordPress. 
I've already done strategy #1, when I converted from iWeb to Sandvox. I left the old iWeb stuff out there, and moved to a new URL path with new content. While a clever menu structure can make it look like it's all one multi-year blog, the pages themselves are vastly different in the way they look. There's no comprehensive search. And, of course, I can't easily maintain the old iWeb stuff.

Having one #1, I'm now sure that's a bad idea.

An advantage of moving to WordPress is the ability to have all of the content in one, uniform database. WordPress has export functionality, so the next tool is a distinct possibility.

Note that SandVox seems to have a distinct problem trying to import iWeb's published content. They have a cool HTML scraper, but iWeb relies on JavaScript, and scraper doesn't do well.

Getting to WordPress

What we're looking at is a fairly complex data structure. While I'd like to look at this from a vast and reserved distance (i.e., in the abstract) I have a very concrete problem. So, we're forced to consider this from the WordPress POV.

We have a WordPress "Site" with a long series of posts and some pages.

The essence here is that the content can -- to an extent -- be converted to Markdown. The titles and dates are easy to preserve. The body? Not so much.

We can, as an alternative to Markdown, use some kind of skinny HTML that WordPress supports. I think WP can handle a structure free of class names, and using a most of the available HTML tags.

Most of the blog content is relatively flat. The block structure is generally limited to images, block quotes, paragraphs, ordered and unordered lists. The inline tags in use seem to be a, img, strong, em, and a few span tags for font changes.

The complexity, then, is building a useful content model from the source. There are a few AST's for Markdown. commonmark.py might have a useful AST.  It's not complex, so it may be simpler to define my own.

It's hard to understand the inline blocks in mistletoe. The python-markdown project uses ElementTree objects to build the AST. I'm not a fan of this because I'm not parsing Markdown.

Starting From -- Well, it's Complicated

There are -- as noted above -- two sources:
  • Sandvox.
  • iWeb.
The Sandvox desktop "database" structure is opaque. The media is easy to find. The content is some kind of binary-encoded data with headers that tell me a little about the XCode environment, but nothing else.

To read this, I have to scrape the HTML using Beautiful Soup. It involves processing like this:

    content = soup.html.body.find("div", id="main-content")
    article = content.find(class_="article-content").find(class_="RichTextElement").div

Find a nested <div> with a target ID. Inside that <div> is where the article can be found.

This seems to work out pretty well. Almost everything I want to preserve can be -- sort of -- mushed into Markdown.

The iWeb desktop "database" is XML. The published HTML depends on Javascript and is hard to work with. The XML is -- of course -- densely wordy and convoluted as can be. But the words and markup are there.  I can use ElementTree to walk down through XML to locate the right tags.

There's a lot of code like this

    main_layer = child_root.find('ns0:site-page/ns0:drawables/ns0:main-layer', ns)

This example digs into site pages, and nested drawables, and main layers of content.  Eventually, we wind up looking at <p>, <span>, <attachment-ref>, and <link> tags in the XML to build the relevant content.

The nuance is style. They're not part of the inline markup. They're stored separately, and included by reference. Each of the four tags that seem to be in use have a style attribute that references styles defined within the posting. Once these references are resolved, I think Mardown can be generated.

The Unholy Mess

The hateful part of this is the disconnect between HTML (and XML) and Markdown. The source data permits indefinite nesting of tags. Semantically meaningless <p><p>words</p></p> are legal. The "flattening" from HTML/XML to Markdown is worrisome: what if I trash an entry by missing something important?

Ideally, it's this:

Pragmatically, HTML/XML can be more complex. This diagram assumes we won't have paragraphs inside list items. HTML permits it. It's redundant in Markdown.

Worse, of course, are the inline tags. HTML has a kabillion of them. The software I've been using seems to limit me to <img>, <strong>, <em>, and <a>. HTML/XML allows nesting. Markdown doesn't.

Ideally, I can reframe the inline tags to create a flat sequence of styled-text objects within any of the tags.

Right now. Headaches.

Working on the code. It's not a general solution to anyone else's problem. But. I'm hoping -- as I beat the problem into submission -- to find a way to make some useful tutorial materials on mapping between complex, and different, data structures.

Tuesday, September 11, 2018

Code Review

I can't actually share all the code. So this is feels incomplete. But I can share what I said about the code. Then you can look at your code and decide if you've got similar problems to fix.

My responses were these. I'll expand on them below.
  1. This appears to be a single cell in a Jupyter notebook? Why isn’t it a script?
  2. The code doesn’t look like any effort was made to follow any conventions. Use black. Or pylint. Make the code look conventional. 
  3. There don’t appear to be any docstring comments. That’s really a very bad practice. 
  4. The design appears untestable. That’s a very bad practice. 
  5. If this is an example of “production” code, I would suggest it needs a lot of rework.
Let's review these in a little more detail.

Number 1 was based on the file name being something_p36.ipynb.txt. The Jupyter notebookiness of the name is a problem. The _p36 is extra creepy, and indicates either a severe problem understanding how bash "shebang" comments work, or a blatant refusal to simply use Python3. It's hard to say what's going on, and I didn't even try to ask because... well... too many other things weren't clear.

Don't make up complex, weird naming rules. Use something.py. Simple. Flat. Pythonic.

Number 2 was based on things like this: def PrintParameters(pca): I hate to get super-pure PEP-8, but this kind of thing is simply hard to read. There were a LOT of other troubling aspects to the code. Once this is corrected, some of the other problems will go away, and we could move forward to more substantial issues.

Follow existing code styles. Find Python code. The standard library has a LOT of examples already part of your installation. Read it. Enjoy it. Mimic it.

Use pylint. Always.

Number 3 and Number 4 are consequences of the bulk of the code being a flat script with few class or function definitions. Actually, there were one of each. One class. One function. 240 or so lines of code. There was no separate __name__ == "__main__" section, so I was generally unhappy with the overall design.

Also. There's code like this

if True:

Yes.  That's a real line of code. Sigh.

Here's an ancillary problem. If you need to write something like this, you're doing it wrong.

 -- init Stuff

The code that follows one of these "big billboard comment" sections *must* be part of a function or class. It can't be left floating around with a billboard for demarcation. It should be refactored into a function (or method of a class), documented, and tested.

Did I mention tested?

It's untestable as written. Sigh.

Number 5 may be a misunderstanding on my part. The email had this: "They have produced production code that mathematically optimizes stuff for [redacted]. So, they are heads up type of people."

I'm guessing this is relevant because the team has some "production" code in Python and consider themselves knowledgeable. Otherwise, this is noise, and I should have ignored it.

I'm hopeful they'll use black, make the code minimally readable, and we can move on to substantial issues regarding design for testability and overall possible correctness issues.

It wasn't the worst code I've seen. But. It shows a lot of room for growth and improvement.

Tuesday, September 4, 2018

Handy Flask Configuration -- Bookmark the original article

Pycoders Weekly (@pycoders)
Configure Python 3, Flask and Gunicorn on Ubuntu 18.04 LTS – bit.ly/2vRZYQR

We worked through this about a year ago, without the help of this post. Having the article would have saved us some time and effort. You should bookmark it.

We liked this tech stack because it was simple and effective.

The team I'm on now is using NGINX and uWSGI as well as Python3 and Flask. It's also effective and it's also pretty simple. It has a few more moving parts, but works reliably.