Tuesday, November 13, 2018

Using Python instead of bash

See Bashing the Bash — Replacing Shell Scripts with Python for some concrete examples of stuff you can do in Python or the shell.

And yes, it's a good, workable idea. 

1. It's unit testable.
2. It's easier to read.
3. It may be faster. Not that you'd notice unless you've really made a terrible mistake and written some gigantic application as a shell script.
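
To make the "unit testable" point concrete, here's a minimal sketch of a shell one-liner (roughly `wc -l *.log`) rewritten as a Python function. The function name `count_lines` and the `*.log` pattern are invented for illustration; they don't come from the article.

```python
from pathlib import Path
from typing import Dict


def count_lines(directory: Path, pattern: str = "*.log") -> Dict[str, int]:
    """Map each matching file name to its line count, roughly like `wc -l *.log`."""
    counts: Dict[str, int] = {}
    for path in sorted(directory.glob(pattern)):
        with path.open() as source:
            # Count lines without reading the whole file into memory.
            counts[path.name] = sum(1 for _ in source)
    return counts
```

Because it's an ordinary function that returns a dictionary, a test can point it at a temporary directory and compare the result, which is awkward to do with the shell version.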

While you're at it, check out the overall blog: https://medium.com/capital-one-tech. There's a lot going on.

Tuesday, November 6, 2018

PyData 2018 Washington, DC

See https://pydata.org/dc2018/

You do need to get your tickets ASAP. The schedule is fabulous.

Hotel rooms are still available, so don't waste any time getting connected.

Tuesday, October 30, 2018

The SourceForge vs. GitHub Conundrum


Or "When is it time to move?"

I've got https://sourceforge.net/projects/stingrayreader/ which has been on SourceForge since forever. 

Really since about 2014. Not that long. But. Maybe long enough?

The velocity of change is relatively slow.

However. 

(And this is a big however.) SourceForge seems kind of complicated when compared with GitHub. 

It's not a completely fair comparison. SourceForge has a *lot* of features. I don't use very many of those features. 

The troubling issues are these.

1. Documentation. SourceForge -- while it has a Git interface -- doesn't handle my documentation very well. Instead of a docs directory, I do a separate upload of the HTML. It's inelegant. SourceForge may handle this more smoothly nowadays. Or maybe I should switch to readthedocs? 

2. The Literate Programming Workflow. There's an extra step (or two) in LP workflows: the PyLit3 synchronization to create the working Python from the RST source. This is followed by the ubiquitous steps: creation of a release, creation of a distribution, and the upload to PyPI. I don't have an elegant handle on this because my velocity of change is so low. SourceForge imposed a "make your own ZIP file" mentality that could be replaced by a nicer "use PyPI" approach.

3. Clunky Design Issue. I've uncovered a clunky, stateful design problem in the StingrayReader. I really really really need to fix it. And while fixing it, why not move to GitHub?

4. Compatibility Testing. The StingrayReader seems to work with Python 3.5 and up. I don't have a formal tox suite. I think it works with a number of versions of XLRD. And it *should* be amenable to other tools for Excel processing. Not sure. And (until I start using tox) I can't tell. 

5. Type Hints. See #3. The stateful design problem can be finessed into a much more elegant use of NamedTuples. And then mypy can be used.

6. Unit Tests. Currently, the testing is all unittest.TestCase. I really want to convert to pytest and simplify all of it.

7. Lack of a proper workflow in the first place. See #2. It's more-or-less sitting in the master branch of a git repo that's part of SourceForge. That's kind of shabby. 

8. Version Numbering Vagueness. When I was building my own ZIP archives from the code manually (because that's the way SourceForge worked), I wasn't super careful about semantic versioning, and I've been releasing patch-number versions for a while. Which is wrong. A few of those versions included new features. Minor, but features. 

But. One tiny new feature. So. It will be release 4.5.
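
As a rough illustration of items 5 and 6 above, here's a sketch of an immutable NamedTuple with type hints and a pytest-style test function. The `Cell` class and `cells()` function are invented for this example. They are not taken from StingrayReader, and the actual redesign may look quite different.

```python
from typing import Iterable, Iterator, NamedTuple


class Cell(NamedTuple):
    """Hypothetical immutable cell: a position plus the raw value found there."""
    row: int
    column: int
    value: str


def cells(rows: Iterable[Iterable[str]]) -> Iterator[Cell]:
    """Yield one Cell per value; no reader state is mutated along the way."""
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            yield Cell(r, c, value)


def test_cells() -> None:
    # pytest-style: a plain function with a bare assert, no TestCase class needed.
    result = list(cells([["a", "b"], ["c"]]))
    assert result == [Cell(0, 0, "a"), Cell(0, 1, "b"), Cell(1, 0, "c")]
```

Since every field carries an annotation, mypy has something to check, and the test needs no setup or teardown.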

See also https://sourceforge.net/p/stingrayreader/blog/2018/10/moving-to-github/ for status.

Tuesday, October 16, 2018

The Edge of the Envelope

I don't -- generally -- think of myself as an edge-of-the-envelope developer. I'm a tried-and-proven kind of engineer. I want stuff that's been around for years, with a long history of changes.

Except.

Today.

Currently, I'm revising Mastering Object-Oriented Python. Second Edition.

That means upgrading everything to Python 3.7 with full type hints throughout almost all of the 18 chapters. (SQLAlchemy presents some problems, so we're not going deep there.)

The chapter on foundational WSGI applications is *totally* broken. I can't get anything to work with mypy. (The unit tests run, but mypy complains. Loudly.) Of course, I tried every wrong thing for three solid days. Then I pulled the stub file from typeshed and realized how dumb I was.

Okay. I finally got the correct type hints. Yay!

But.

Something in mypy is balking at the start_response() function calls. Too many arguments.

Read the issues. Hm. Stack Overflow. Hm.

Just to be sure, I updated to the new 0.630 release from September 2018.

Problem solved. So. I've arrived at the edge of the envelope. I now require the absolutely latest and greatest mypy release. By the time I'm done with the rewrites, this release will be ancient history. But today, it was wonderful to get past the examples.
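
For readers who haven't met the problem, here's a minimal sketch of a type-hinted WSGI application. The aliases `Environ` and `StartResponse` are simplified stand-ins defined here for illustration. They are not the typeshed definitions, and this isn't the code from the book.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Simplified, hypothetical aliases; typeshed's own WSGI types are more precise.
Environ = Dict[str, Any]
Headers = List[Tuple[str, str]]
StartResponse = Callable[..., Any]  # called as start_response(status, headers)


def hello_app(environ: Environ, start_response: StartResponse) -> Iterable[bytes]:
    """A tiny WSGI application that always answers with plain text."""
    body = b"Hello, world\n"
    headers: Headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The loose `Callable[..., Any]` alias sidesteps the optional third `exc_info` argument of `start_response()`. A stricter hint has to account for it.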

Tuesday, September 18, 2018

Data Modeling Nightmare -- XML, HTML, and Markdown

Here's a particularly tangled and difficult problem. It arises because I have another blog. Specifically this: Team Red Cruising. And it's an unholy mess.

There are two important features of the Team Red Cruising blog.
  1. It's managed with off-line editor(s) so I can write posts from the boat and then upload them when I get connectivity. Welcome to being a technomad -- I don't always have a web-based blog editor available.
  2. It was actually created with two different off-line editors over a period of years: iWeb and Sandvox. iWeb is long dead. Sandvox hasn't seen many updates recently, and I think I'd like to move on to something newer and "better". 
(In this case, "better" means iOS-friendly, e.g., Blogo or BlogPad Pro. Also, Blogo's support site seems to be a right mess. Not a good look. They're working on it.)

The blog isn't the unholy mess. We'll get to the mess below. First, however, some background on the overall strategy. I want to move my content. What's involved? There are several things in play: the hosting, the target, and the source. So. Essentially. Everything.

Changing the Hosting Platform

Both of my legacy tools would export and upload the changes to my hosting service directly, avoiding the overheads of having any complex hosting software. The site was static and served simply from the filesystem via Apache httpd. Publishing was an SFTP transfer to the server. Nothing more. The "platform" was almost nothing.

(I could switch to using an Amazon S3 bucket and a DNS entry and it would work nicely.)

Both of these offline editing tools have a tiny bias toward working with hosting services like WordPress. Blogo claims it can also work with Medium and Blogger.

This means running WordPress on top of my default SFTP/Apache configuration. I use A2 Hosting, so this is really easy to do.

So. The hosting is more-or-less settled. I'll do very little. (Dealing with breaking links is a separate hand-wringing exercise.)

In order to move from iWeb and Sandvox to another tool, and start using WordPress, I have two strategies for converting the content.
  1. Ignore my legacy content. Leave it where it is, more-or-less uneditable. The tool(s) are gone, all that's left is the static HTML output from the tool. 
  2. Gather the legacy content and migrate it to WordPress and then pick an offline tool that works with WordPress. 
I've already done strategy #1, when I converted from iWeb to Sandvox. I left the old iWeb stuff out there, and moved to a new URL path with new content. While a clever menu structure can make it look like it's all one multi-year blog, the pages themselves are vastly different in the way they look. There's no comprehensive search. And, of course, I can't easily maintain the old iWeb stuff.

Having done #1, I'm now sure that's a bad idea.

An advantage of moving to WordPress is the ability to have all of the content in one, uniform database. WordPress has export functionality, so the next tool is a distinct possibility.

Note that Sandvox seems to have a distinct problem trying to import iWeb's published content. They have a cool HTML scraper, but iWeb relies on JavaScript, and the scraper doesn't do well.

Getting to WordPress

What we're looking at is a fairly complex data structure. While I'd like to look at this from a vast and reserved distance (i.e., in the abstract) I have a very concrete problem. So, we're forced to consider this from the WordPress POV.

We have a WordPress "Site" with a long series of posts and some pages.


The essence here is that the content can -- to an extent -- be converted to Markdown. The titles and dates are easy to preserve. The body? Not so much.

We can, as an alternative to Markdown, use some kind of skinny HTML that WordPress supports. I think WP can handle a structure free of class names, using most of the available HTML tags.

Most of the blog content is relatively flat. The block structure is generally limited to images, block quotes, paragraphs, ordered and unordered lists. The inline tags in use seem to be a, img, strong, em, and a few span tags for font changes.

The complexity, then, is building a useful content model from the source. There are a few ASTs for Markdown. commonmark.py might have a useful AST. The content model isn't complex, so it may be simpler to define my own.

It's hard to understand the inline blocks in mistletoe. The python-markdown project uses ElementTree objects to build the AST. I'm not a fan of this because I'm not parsing Markdown.

Starting From -- Well, it's Complicated

There are -- as noted above -- two sources:
  • Sandvox.
  • iWeb.
The Sandvox desktop "database" structure is opaque. The media is easy to find. The content is some kind of binary-encoded data with headers that tell me a little about the Xcode environment, but nothing else.

To read this, I have to scrape the HTML using Beautiful Soup. It involves processing like this:

    content = soup.html.body.find("div", id="main-content")
    article = content.find(class_="article-content").find(class_="RichTextElement").div

Find a nested <div> with a target ID. Inside that <div> is where the article can be found.

This seems to work out pretty well. Almost everything I want to preserve can be -- sort of -- mushed into Markdown.

The iWeb desktop "database" is XML. The published HTML depends on JavaScript and is hard to work with. The XML is -- of course -- densely wordy and convoluted as can be. But the words and markup are there. I can use ElementTree to walk down through XML to locate the right tags.

There's a lot of code like this:

    main_layer = child_root.find('ns0:site-page/ns0:drawables/ns0:main-layer', ns)

This example digs into site pages, and nested drawables, and main layers of content.  Eventually, we wind up looking at <p>, <span>, <attachment-ref>, and <link> tags in the XML to build the relevant content.
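
Here's a self-contained sketch of that kind of namespace-aware lookup. The namespace URI and the sample document are placeholders invented for illustration. The real iWeb files use their own URI and a far more convoluted structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the "ns0" prefix maps to whatever the real file declares.
ns = {"ns0": "http://example.com/hypothetical-iweb"}

SAMPLE = """\
<root xmlns="http://example.com/hypothetical-iweb">
  <site-page>
    <drawables>
      <main-layer><p>First paragraph.</p><p>Second.</p></main-layer>
    </drawables>
  </site-page>
</root>
"""

child_root = ET.fromstring(SAMPLE)
# Walk from the root down to the main layer using a prefixed path.
main_layer = child_root.find("ns0:site-page/ns0:drawables/ns0:main-layer", ns)
# Collect the text of each <p> directly inside the main layer.
paragraphs = [p.text for p in main_layer.findall("ns0:p", ns)]
```

The `ns` mapping is what lets the path use short prefixes instead of spelling out `{uri}tag` for every step.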

The nuance is styles. They're not part of the inline markup. They're stored separately, and included by reference. Each of the four tags that seem to be in use has a style attribute that references styles defined within the posting. Once these references are resolved, I think Markdown can be generated.

The Unholy Mess

The hateful part of this is the disconnect between HTML (and XML) and Markdown. The source data permits indefinite nesting of tags. Semantically meaningless <p><p>words</p></p> are legal. The "flattening" from HTML/XML to Markdown is worrisome: what if I trash an entry by missing something important?

Ideally, it's this:



Pragmatically, HTML/XML can be more complex. This diagram assumes we won't have paragraphs inside list items. HTML permits it. It's redundant in Markdown.

Worse, of course, are the inline tags. HTML has a kabillion of them. The software I've been using seems to limit me to <img>, <strong>, <em>, and <a>. HTML/XML allows nesting. Markdown doesn't.

Ideally, I can reframe the inline tags to create a flat sequence of styled-text objects within any of the tags.
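
One way this flattening might look, as a sketch only: walk the nested inline tags and emit a flat run of text pieces, each carrying the set of tags that enclosed it. `StyledText` and `flatten_inline()` are names invented for this example. This isn't the code I'm working on.

```python
import xml.etree.ElementTree as ET
from typing import FrozenSet, Iterator, NamedTuple


class StyledText(NamedTuple):
    """A run of text plus the names of the inline tags that wrapped it."""
    text: str
    styles: FrozenSet[str]


def flatten_inline(
    container: ET.Element, styles: FrozenSet[str] = frozenset()
) -> Iterator[StyledText]:
    """Yield flat StyledText runs for the text nested anywhere inside container."""
    if container.text:
        yield StyledText(container.text, styles)
    for child in container:
        # Text inside the child picks up the child's tag as one more style.
        yield from flatten_inline(child, styles | {child.tag})
        # Tail text follows the child's closing tag, so it keeps the outer styles.
        if child.tail:
            yield StyledText(child.tail, styles)


sample = ET.fromstring("<p>Plain <strong>bold <em>both</em> bold</strong> tail</p>")
runs = list(flatten_inline(sample))
```

Here `runs` holds five pieces. The word "both" carries the styles `strong` and `em`, while the leading and trailing text carry none. The nesting is gone, but nothing about the styling is lost.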

Right now. Headaches.

Working on the code. It's not a general solution to anyone else's problem. But. I'm hoping -- as I beat the problem into submission -- to find a way to make some useful tutorial materials on mapping between complex, and different, data structures.

Tuesday, September 11, 2018

Code Review

I can't actually share all the code. So this feels incomplete. But I can share what I said about the code. Then you can look at your code and decide if you've got similar problems to fix.

My responses were these. I'll expand on them below.
  1. This appears to be a single cell in a Jupyter notebook? Why isn’t it a script?
  2. The code doesn’t look like any effort was made to follow any conventions. Use black. Or pylint. Make the code look conventional. 
  3. There don’t appear to be any docstring comments. That’s really a very bad practice. 
  4. The design appears untestable. That’s a very bad practice. 
  5. If this is an example of “production” code, I would suggest it needs a lot of rework.
Let's review these in a little more detail.

Number 1 was based on the file name being something_p36.ipynb.txt. The Jupyter notebookiness of the name is a problem. The _p36 is extra creepy, and indicates either a severe problem understanding how bash "shebang" comments work, or a blatant refusal to simply use Python3. It's hard to say what's going on, and I didn't even try to ask because... well... too many other things weren't clear.

Don't make up complex, weird naming rules. Use something.py. Simple. Flat. Pythonic.

Number 2 was based on things like this: def PrintParameters(pca): I hate to get super-pure PEP-8, but this kind of thing is simply hard to read. There were a LOT of other troubling aspects to the code. Once this is corrected, some of the other problems will go away, and we could move forward to more substantial issues.

Follow existing code styles. Find Python code. The standard library has a LOT of examples already part of your installation. Read it. Enjoy it. Mimic it.

Use pylint. Always.

Number 3 and Number 4 are consequences of the bulk of the code being a flat script with few class or function definitions. Actually, there was one of each. One class. One function. 240 or so lines of code. There was no separate __name__ == "__main__" section, so I was generally unhappy with the overall design.

Also. There's code like this:

    if True:

Yes.  That's a real line of code. Sigh.

Here's an ancillary problem. If you need to write something like this, you're doing it wrong.

    ##########################
     -- init Stuff
    ##########################

The code that follows one of these "big billboard comment" sections *must* be part of a function or class. It can't be left floating around with a billboard for demarcation. It should be refactored into a function (or method of a class), documented, and tested.

Did I mention tested?

It's untestable as written. Sigh.
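
As a rough sketch of the alternative, the code under an "init Stuff" billboard could become a documented function, with the script's top-level work moved into main(). The names `init_parameters` and `transform` and the parameter values are invented for illustration. They have nothing to do with the code under review.

```python
from typing import Dict


def init_parameters(scale: float = 1.0) -> Dict[str, float]:
    """Build the starting parameters a billboard section would have set up inline."""
    return {"scale": scale, "offset": 0.0}


def transform(x: float, params: Dict[str, float]) -> float:
    """Apply the scale and offset held in params to a single value."""
    return x * params["scale"] + params["offset"]


def main() -> None:
    """The script's top-level work, kept out of module scope."""
    params = init_parameters(scale=2.0)
    print(transform(3.0, params))


if __name__ == "__main__":
    main()
```

Now a test can import the module and call init_parameters() and transform() directly, without running the whole script.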

Number 5 may be a misunderstanding on my part. The email had this: "They have produced production code that mathematically optimizes stuff for [redacted]. So, they are heads up type of people."

I'm guessing this is relevant because the team has some "production" code in Python and considers itself knowledgeable. Otherwise, this is noise, and I should have ignored it.

I'm hopeful they'll use black, make the code minimally readable, and we can move on to substantial issues regarding design for testability and overall possible correctness issues.

It wasn't the worst code I've seen. But. It shows a lot of room for growth and improvement.

Tuesday, September 4, 2018

Handy Flask Configuration -- Bookmark the original article

Pycoders Weekly (@pycoders)
Configure Python 3, Flask and Gunicorn on Ubuntu 18.04 LTS – bit.ly/2vRZYQR

We worked through this about a year ago, without the help of this post. Having the article would have saved us some time and effort. You should bookmark it.

We liked this tech stack because it was simple and effective.

The team I'm on now is using NGINX and uWSGI as well as Python3 and Flask. It's also effective and it's also pretty simple. It has a few more moving parts, but works reliably.