Importing Old Blog Posts with Python

, in Computing

I believe that this world deserves to see the literary treasure trove of posts that disappeared when my old sandfly.net.nz blog fell offline due to lack of internal and external interest.

I kept a backup of the database, and thanks to a quick piece of Python I now have the text of those posts in a somewhat tractable form. Much of what I wrote is either out-of-date or just rubbish, but I plan on moving at least a few representative posts over to this new system.

Ripping the articles out of my WordPress database backup turned out to be easier than I had anticipated. Python has some good libraries for accessing MySQL, and the WordPress schema is pretty straightforward.

Incidentally, this post seems like as good a place as any to try out the code-formatting markdown syntax, so here is the Python script I whipped up to do the job.

# Export posts from a WordPress database to a directory structure with
# formatted gensite-style files

import argparse
import pymysql
import files

def export(host, port, user, password, database, output_dir, author):
  connection = pymysql.connect(host=host, port=port, user=user, password=password, database=database)
  try:
    with connection.cursor() as c:
      c.execute("select post_title, post_content, post_modified_gmt from wp_posts where post_type='post'")
      while True:
        result = c.fetchone()
        if result is None:
          break  # no more rows
        if result[2] is None:
          continue  # skip posts without a modification date
        title = result[0]
        # Normalise Windows line endings
        content = result[1].replace("\r\n", "\n")
        # Insert the note just before the first full stop, if there is one
        note = " "
        pos = content.find(".")
        if pos != -1:
          content = content[:pos] + note + content[pos:]
        t = result[2].timetuple()
        p = files.create_new_article(output_dir, title, author, t, initial_contents=content)
        print(p, "created")
  finally:
    connection.close()

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.add_argument("host", help="DB Server Host")
  parser.add_argument("port", type=int, help="DB Service Port")
  parser.add_argument("user", help="User name")
  parser.add_argument("password", help="Password")
  parser.add_argument("database", help="Database")
  parser.add_argument("dest", help="Destination directory")
  parser.add_argument("author", help="Post author")

  args = parser.parse_args()

  export(args.host, args.port, args.user, args.password, args.database, args.dest, args.author)

That did the bulk of the work. There is still some manual fiddling with each file before gensite will produce good output. I have all the images in a separate folder, but they need to be copied across to each article by hand, and some of the formatting looks weird with the new stylesheet, so some judicious hand-editing is sometimes required.
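The image copying could probably be scripted too. The following is only a rough sketch and not part of the script above: it assumes the old images sit in one flat folder and that posts reference them with ordinary `src="..."` attributes. The function name `copy_referenced_images` and both directory arguments are made up for illustration.

```python
import re
import shutil
from pathlib import Path

# Matches src="..." or src='...' and captures the URL
# (assumption: images are referenced HTML-style in the post text)
SRC_PATTERN = re.compile(r"""src=["']([^"']+)["']""")

def copy_referenced_images(article_text, image_dir, article_dir):
    """Copy images referenced in article_text from image_dir to article_dir.

    Returns the file names that were referenced but not found in image_dir.
    """
    image_dir = Path(image_dir)
    article_dir = Path(article_dir)
    article_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for url in SRC_PATTERN.findall(article_text):
        # Only the file name is used; the rest of the old URL is ignored
        name = url.rsplit("/", 1)[-1]
        source = image_dir / name
        if source.is_file():
            shutil.copy2(source, article_dir / name)
        else:
            missing.append(name)
    return missing
```

The returned list of missing names gives a quick indication of which posts still need attention by hand.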

That said, it is still a quick job to import one of my old articles and I plan on bringing most of the archive back online in dribs and drabs over the next month or so.
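For anyone adapting the script, the text clean-up in the middle of `export` can be tried without a database by pulling it into a small function. This is a sketch, not the code that was actually run: `prepare_content` is a hypothetical name, and the note text in the usage line is just a placeholder. It also leaves the text alone when there is no full stop, a case where `str.find` returns -1.

```python
def prepare_content(raw, note):
    """Normalise line endings and insert note before the first full stop."""
    content = raw.replace("\r\n", "\n")
    pos = content.find(".")
    if pos == -1:
        # No full stop anywhere: return the text without the note
        return content
    return content[:pos] + note + content[pos:]

# Placeholder note text, purely for illustration
print(prepare_content("Hello world.\r\nSecond line.", " (imported)"))
```

Keeping this step separate from the database loop makes it easy to check the output on a few sample strings before running a full export.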