OpenGraph and RSS Improvements for Gensite

Andrew Stephens, Tuesday the 4th of July, 2017 in Computing

There are reasons why most people stick to established website generators. Back in the day it was a perfectly normal thing to fire up a text editor, bash out a few lines of HTML and then start writing whatever you wanted to say. Save, upload, job done.

Not any more. Any self-respecting page needs huge chunks of customized markup to be considered a first-class citizen. Omit these lines and you risk being marginalized with low search rankings and ugly social media links.

What happened? The web got fancy. No matter how interesting your content is, it is hard to compete with the slickly produced pages with nice formatting, appropriate fonts, and eye-catching images. So everyone upped their game, and little by little it became very difficult to do all the little things that make websites look so nice these days.

Gensite (the software that produces these pages) was written to help me manage the complexity of a modern webpage by automatically generating many of the annoying little details that are, if not strictly required, highly desirable. I have just added one of the last missing features: proper OpenGraph support.

<meta name="author" content="Andrew Stephens" />
<meta property="og:url"                
      content="https://sheep.horse/2017/6/other_minds-_the_octopus,_the_sea,_and_the_deep_or.html" />
<meta property="og:type"               
      content="article" />
<meta property="og:title"              
      content="Other Minds: The Octopus, the Sea, and the Deep Origins of Consciousness" />
<meta property="og:description"        
      content="There is something endearing about octopuses. A wet, spidery, amorphous creature 
      should provoke an instinctive fear response in humans but for some reason the cephalopods 
      (octopuses, cuttlefish, and nautiluses) fascinate..." />
<meta property="og:image"              
      content="https://sheep.horse/2017/6/other_minds_cover.jpg" />
<meta property="twitter:card"          
      content="summary" />
<meta property="twitter:site"   
      content="@SandflyAndrew" />
<meta property="twitter:title"  
      content="Other Minds: The Octopus, the Sea, and the Deep Origins of Consciousness" />

OpenGraph specifies a bunch of meta tags that you are supposed to include in every page so that other pages can easily generate a visual representation of your site. This is purely so that links to your site look nice when shared on Facebook or Twitter.
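Tags like the ones above are easy to produce mechanically once the page's URL, title, description, and image are known. Here is a minimal sketch of how that might look. The function name opengraph_tags and its parameters are made up for illustration and are not Gensite's actual code.

```python
from html import escape


def opengraph_tags(url, title, description, image=None, og_type="article"):
    """Return OpenGraph <meta> tags as a string, one tag per line.

    Hypothetical helper for illustration only, not part of Gensite.
    """
    properties = [
        ("og:url", url),
        ("og:type", og_type),
        ("og:title", title),
        ("og:description", description),
    ]
    # og:image is only emitted when the page actually has an image
    if image:
        properties.append(("og:image", image))

    lines = []
    for prop, content in properties:
        # escape quotes and ampersands so the attribute value stays valid HTML
        lines.append('<meta property="{}" content="{}" />'.format(
            prop, escape(content, quote=True)))
    return "\n".join(lines)
```

Escaping the content matters more than it looks: titles and summaries routinely contain quotes and ampersands, and an unescaped quote would cut the attribute short.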

What all that code looks like on Facebook

It is annoying that I have to do all this extra work just to help Facebook out a little, but I actually do enjoy when other people share my stuff on social media so I guess the effort will pay off.

The really annoying bits to generate are the default og:image and og:description tags. Gensite generates HTML from Markdown, so the only bulletproof way of determining which images are included in the database is to perform the markdown->html transformation, parse the resulting HTML fragments, and do XPath lookups on the resulting soup. I generate the summary (really just the first 30 words) the same way.

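# module-level imports needed by this method
import os

import lxml.etree
import lxml.html

def summarize_markup(self):
    """ parse the rendered markup, collect image URLs, and extract a short text summary """
    try:
        elements = lxml.html.fragments_fromstring(self.processed_text)
    except lxml.etree.XMLSyntaxError:
        print("XMLSyntaxError when parsing markup")
        return

    paragraphs = []

    for e in elements:
        # fragments_fromstring may return leading text as a plain string; skip it
        if isinstance(e, str):
            continue

        # record the absolute URL of every image inside this fragment
        for i in e.findall(".//img"):
            folder = os.path.split(self.dest_file_name())[0]
            image_url = self.site_config.root_url + folder + "/" + i.get("src")
            self.images.append(image_url)

        # only top-level paragraphs contribute to the summary
        if e.tag != "p":
            continue

        # drop margin notes so they do not leak into the summary text
        for t in e.findall(".//span"):
            c = t.get("class")
            if c and ("sidenote" in c or "importantmarginnote" in c):
                t.drop_tree()

        paragraphs.append(e.text_content())

    # grab the first 30 words; joining with a space keeps paragraphs from running together
    words = " ".join(paragraphs).split()
    self.summary = " ".join(words[:30]) + "..."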

This works out OK, and saves me having to manually set the image and description on each article, but it does slow things down a little. Still, with over 300 articles, sheep.horse only takes a couple of seconds to compile so it is not a huge problem for now.

The RSS and Atom feeds are also much improved along similar lines (summary text, etc). Gensite is getting almost, dare I type it, feature complete.
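To give a rough idea of what summary text in a feed looks like, here is a minimal sketch of building a single RSS item with Python's standard xml.etree.ElementTree module. Gensite's real feed code is not shown in this post, so the function name rss_item and the sample values are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def rss_item(title, link, summary):
    """Build an RSS <item> element whose <description> carries the summary.

    Hypothetical helper for illustration only, not part of Gensite.
    """
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # the same auto-generated summary used for og:description can be reused here
    ET.SubElement(item, "description").text = summary
    return item


# example usage with made-up values
example = rss_item("An Article", "https://example.com/an_article.html",
                   "The first thirty words of the article...")
print(ET.tostring(example, encoding="unicode"))
```

ElementTree escapes special characters in the text for you, so the summary can be passed through as plain text.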

Next up, a large reorganization of the code.