Tossing up between Redis and MongoDB (or postgres)

I’m working on an XML indexer. I’m running a bunch of XPath queries, which return a list of nodes. For each of these nodes, I then run subsequent XPath queries which return textual keys. For example:

talker: //talker
    time = time.stamp/text()
    name = name[@role="metadata"]/text()

I want to be able to search on these keys. I’ve got a system which allows me to address nodes in the XML document as a short string (eg. ‘12.2235’). The two keys in my example are a date (effectively an integer) and name (a string). I’d like to provide fast glob or regular expression search on the string keys in an index, and fast range limiting on integer keys.

My approach using redis would be to create a list of all possible values for both “time” and “name”.  Integer range queries could be achieved using a sorted list, and the globs or regular expressions by just retrieving all the values and running the match. Probably not too slow (names will be in the order of high hundreds of unique values, times in the millions of values).

I’d then retrieve keys such as talker:name:Bowland which would return a set of XML node addresses. I’d take either the union or intersection of these sets as appropriate and use my XML addressing system to pull the nodes out.

In my dataset I have 2.6 million ‘talker’ nodes. A data search over a year could return 300,000 results. A search for ‘everything said by Bowland’ might return 50,000 results. As the set intersection in Redis is O(N*M) that’s going to be prohibitively expensive. I’m thinking this rules out the approach I’ve described, but I’m stuck trying to think of any better way to do things.

In MongoDB, I could  have a ‘document’ for every talker node and store the current speaker and time within the document. I’d then use Mongo’s queries (with its indexing support) to do fast searches. The problem of set intersection goes away, as I’d just store the XML address on the ‘document’ and read it out of the query results.

A big advantage of MongoDB is that I wouldn’t have to think about how users might query so much. Each ‘talker’ has seven attributes, and I’d like fast search on each of those attributes. I’m also unsure what use people will make of this XML database, and want it to be open ended and easy for others to adapt.

A friend has suggested I implement an R-tree myself. An R-tree would require that all of my terms are sortable, so I’d have to give away the idea of glob/regexp search for strings and restrict myself to a prefix search – probably sufficient. I’m hesitant to take this approach as I don’t want to reinvent the wheel.

Can anyone suggest a better approach? I’m not totally against SQL, but given that RAM is cheap and my dataset is only 1.2GB of XML with extremely infrequent updates, I don’t need a lot of the functionality it would provide. If I can get everything in RAM it’ll be incredibly fast – I’m happy to just throw away the index and rebuild when new documents are added to the DB.

I want this to be as fast as possible as I want to keep down hosting costs. Although I’ve been oblique about my purpose, all the software I write as part of this project will be released under the MIT license once I have something useful to show :-) Thanks in advance for any help!

Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

2 Responses to Tossing up between Redis and MongoDB (or postgres)

  1. Alex Popescu says:

    I might be missing some details, but why not using a search library for it? (Lucene, Solr, ElasticSearch)

    • grahame says:

      I can’t remember why I decided against a full-text-search system now… and you make a good point, I should just use one. ElasticSearch looks great, I’ll see if I can use that. Cheers!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s