Blackrod progress

My XML database project, which aims to provide better data access to the Federal Hansard, is really coming along. I’ve now got the indexing code working quite reliably. I’m still coming up with new ideas and tweaking it, but it’s settling down. Here’s an example index definition:

from blackrod.index import *
import datetime
import pytz


def talk_date(speech_date, talk_time):
    # Expect exactly one date match and one time stamp match.
    if len(speech_date) != 1 or len(talk_time) != 1:
        return None
    # times are in EST (local time in Canberra)
    est = pytz.timezone('Australia/Canberra')
    try:
        dt = est.localize(
            datetime.datetime.strptime(
                ' '.join((speech_date[0], talk_time[0])),
                '%Y-%m-%d %H:%M:%S'))
    except ValueError:
        # The date or time stamp did not match the expected format.
        return None
    # Convert to UTC for the index.
    return dt.astimezone(pytz.utc)


class Talker(Index):
    query = '//talker[not(ancestor::continue)]'
    talker_id = StringField('name.id/text()')
    name = StringField('name[@role="metadata"][position()=1]/text()')
    when = DateField('/hansard/session.header/date/text()',
                     'time.stamp/text()', map_fn=talk_date)
    role = StringField('role/text()')
    party = StringField('party/text()')
    in_gov = BooleanField('in.gov/text()', from_int=True)
    first_speech = BooleanField('first.speech/text()', from_int=True)
    text = StringField(
      "../para//text()|../../para//text()|" \
      "../../continue/talk.start/para/text()|" \
      "../../continue/para//text()",
     join='\n')

If you’ve ever used Django you’ll see that I’ve ripped off their way of doing model definition. There’s no code in common, I just like the approach so I’ve come up with something similar. Anyway, here’s what that looks like when running on some of the Hansard:

{'_id': 'Talker0.64',
 'name': 'SPEAKER, Mr',
 'next': 'Talker0.123',
 'party': '',
 'prev': 'Talker0.47',
 'role': '',
 'root': '0.64',
 'talker_id': '10000',
 'text': '—Order! The member for Brisbane has been most accommodating. He is entitled at least to the attention of those currently congregating in the aisles.',
 'type': 'Talker'}

(Source: The Parliament of Australia, Creative Commons 3.0 Attribution-NonCommercial-NoDerivs)
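
To make the mechanics a bit more concrete: the query picks out one root node per index document, and each field's expression is then evaluated relative to that node. Here is a minimal sketch of that idea using only the standard library. It is not Blackrod's code. The XML fragment, the FIELDS mapping and the index_talkers function are all made up for illustration, and because ElementTree only supports a subset of XPath (no text(), no ancestor axis), the "not inside a continue element" filter is done by hand in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment; the real Hansard XML is richer.
SAMPLE = """
<hansard>
  <speech>
    <talk.start>
      <talker>
        <name.id>10000</name.id>
        <name role="metadata">SPEAKER, Mr</name>
        <name role="display">The SPEAKER</name>
      </talker>
    </talk.start>
    <continue>
      <talk.start>
        <talker>
          <name.id>10000</name.id>
          <name role="metadata">SPEAKER, Mr</name>
        </talker>
      </talk.start>
    </continue>
  </speech>
</hansard>
"""

# Each field is a path evaluated relative to a root <talker> node.
FIELDS = {
    "talker_id": "name.id",
    "name": "name[@role='metadata']",
}


def index_talkers(xml_text):
    root = ET.fromstring(xml_text)
    # ElementTree has no ancestor axis, so collect the talkers nested in
    # <continue> by hand; this stands in for not(ancestor::continue).
    continued = {id(t) for c in root.iter("continue") for t in c.iter("talker")}
    docs = []
    for node in root.iter("talker"):
        if id(node) in continued:
            continue
        # findtext() returns the text of the first match, or None.
        docs.append({field: node.findtext(path) for field, path in FIELDS.items()})
    return docs


docs = index_talkers(SAMPLE)
print(docs)
```

Only the first talker is indexed here, because the second one sits inside a continue element. A library with full XPath support would let the filter live in the query string itself, as in the Talker definition above.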

The indexer is working well enough that I can leave it for now and concentrate on the CouchDB side of things. Full text search will be provided by ElasticSearch; I just need to learn how to use CouchDB views to provide other types of query access.

My remaining todo items before I launch the website are:

  1. set up some CouchDB views so you can run queries such as ‘Everything that XX has ever said’. I also need utility views, e.g. ‘delete all index documents of type XX’.
  2. figure out how to safely expose CouchDB and ElasticSearch to the Internet; I think this will likely be a little bit of Python or node.js which proxies between them
  3. write an example front end in HTML+JavaScript to show what this code can do

… so really, there’s not that much left to do. Expect to see some amusing graphs generated from the Hansard soon, as I get the views up and running.
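
For item 2 in the list above, the core of such a proxy is a check that decides whether a request may be forwarded at all. The following is only a rough sketch of that check, assuming a read-only allow-list of path prefixes. The function name is_allowed, the /couch and /search mount points, and the database, design document and index names are all invented for illustration. Nothing here is hardened for production; percent-encoded path tricks, for example, are not handled.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: read-only access to the views of one CouchDB
# design document and to one ElasticSearch search endpoint.
ALLOWED_PREFIXES = (
    "/couch/hansard/_design/talkers/_view/",
    "/search/hansard/_search",
)


def is_allowed(method, url):
    """Return True if the proxy should forward this request."""
    # Only reads are exposed; anything that could modify data is refused.
    if method.upper() != "GET":
        return False
    path = urlsplit(url).path
    # Refuse literal parent-directory segments before matching prefixes.
    if ".." in path.split("/"):
        return False
    return any(path.startswith(prefix) for prefix in ALLOWED_PREFIXES)


print(is_allowed("GET", "/couch/hansard/_design/talkers/_view/by_talker?key=%2210000%22"))
print(is_allowed("DELETE", "/couch/hansard/_design/talkers/_view/by_talker"))
```

Starting from "deny everything" and listing the few paths that are safe means a newly added database or index is not exposed by accident.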


5 Responses to Blackrod progress

  1. Keith says:

    Shiny :D

    I like that you can do CSS-style selectors, but how on earth does this line work?

    name = StringField('name[@role="metadata"][position()=1]/text()')

    Where do position() and text() come from? (Not a Python aficionado)

    Would be very interested to see how you go with node.js – I tried using it for a project about a year ago but didn’t get very far. At the time there wasn’t a lot available in the way of resources for helping to install and learn it though. It’s got massive potential – especially for near-real-time client-server communication.

    If you’re after a graphing package the JavaScript InfoVis Toolkit (http://thejit.org/demos/) would be worth a look. I’ve been wanting to use it on a project for a while now based on the awesomeness of the demos alone :)

    • grahame says:

      It’s just an XPath query:
      http://en.wikipedia.org/wiki/XPath

      The query= bit of the definition sets the root node for each object. The other queries for each of the fields are then run relative to that node. So in this case I find all the talker nodes (that aren’t in a continuation of another talker node) then grab all the fields you can see :-) I forgot to link the code, but it’s over here on bitbucket:
      https://bitbucket.org/angrygoat/blackrod

      It’s a bit of a pain to get going as I ported some of the deps to Python 3 myself, I’ll make a bundle with deps included at some point!

      I’m thinking node is the right fit. I just want to do a little bit of simple transformation: determine whether the URL to couch/elasticsearch is allowed, then grab the result and, if needed, look up the XML from the source data. I’ll let you know how I go :)

      Cheers for the graphing toolkit link, that’s amazing! I looked probably two years ago and they weren’t this far along, wow!

  2. Keith says:

    Ahh, XPath, cool! Coming from JavaScript land I associate all that with CSS selectors :)

    Node does indeed sound like a good choice. As best I know it’s the most lightweight web server around (though I guess it isn’t *technically* a web server). LightTPD is pretty good too (in my brief experience) – and probably a bit more Apache-like (and a bit easier to understand at first?) than Node.

    Aye, same, I first took a look when they announced it but they’ve made a lot of progress since then. *raises hand* Frontend monkey willing to help with graphs here :) I foresee much fun with searches relating to the Member for Sturt :D

    • grahame says:

      Help would be great! I’ll get it set up on a server this week and give you an SSH account – I just need to get a little more done so that there’s a useful backend to run a frontend against :-)

      Woo!

  3. Keith says:

    Cool :)

    Incidentally, was having a go at replacing some old flash graphing software I’m using and ran across this package: http://www.highcharts.com/demo/ (free for personal use). It looks *very* shiny indeed :D

    Had a bit of a play with JIT while I was looking too. Pretty easy to use but doesn’t natively support line graphs (which was mostly what I was after).
