I’m working on an XML indexer. I run a set of XPath queries, each of which returns a list of nodes. For each of those nodes I then run further XPath queries which return textual keys. For example:
    talker: //talker
        time = time.stamp/text()
        name = name[@role="metadata"]/text()
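To make that concrete, here’s a toy version of the extraction step using the standard library’s ElementTree (the sample XML layout and the short-string node addresses are invented for illustration; ElementTree supports only a subset of XPath, but enough for these queries):

```python
import xml.etree.ElementTree as ET

# Invented sample data matching the query spec above.
DOC = """
<log>
  <talker>
    <time.stamp>20100612</time.stamp>
    <name role="metadata">Bowland</name>
  </talker>
  <talker>
    <time.stamp>20100613</time.stamp>
    <name role="metadata">Smith</name>
  </talker>
</log>
"""

root = ET.fromstring(DOC)
records = []
for i, talker in enumerate(root.iterfind('.//talker')):
    addr = "1.%d" % i  # stand-in for the real short-string node address
    time = int(talker.find('time.stamp').text)
    name = talker.find('name[@role="metadata"]').text
    records.append((addr, time, name))
# records: [('1.0', 20100612, 'Bowland'), ('1.1', 20100613, 'Smith')]
```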
I want to be able to search on these keys. I’ve got a system which lets me address nodes in the XML document as a short string (e.g. ‘12.2235’). The two keys in my example are a date (effectively an integer) and a name (a string). I’d like an index that provides fast glob or regular-expression search on the string keys, and fast range limiting on the integer keys.
My approach using Redis would be to create a list of all possible values for both “time” and “name”. Integer range queries could be handled with a sorted set, and globs or regular expressions by just retrieving all the values and running the match. That’s probably not too slow (names will be on the order of high hundreds of unique values, times in the millions).
I’d then retrieve keys such as talker:name:Bowland which would return a set of XML node addresses. I’d take either the union or intersection of these sets as appropriate and use my XML addressing system to pull the nodes out.
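Here’s a rough in-memory sketch of that plan, with plain Python sets and a sorted list standing in for Redis sets and a Redis sorted set (all the sample values are invented):

```python
import bisect
from fnmatch import fnmatch

# talker:name:<value> -> set of node addresses (stand-in for Redis sets)
by_name = {
    "Bowland": {"12.2235", "12.2301"},
    "Smith":   {"12.2240"},
}
# talker:time:<value> -> set of node addresses
by_time = {
    20100612: {"12.2235", "12.2240"},
    20100613: {"12.2301"},
}
times = sorted(by_time)  # stand-in for a Redis sorted set of times

def glob_names(pattern):
    """Union of address sets for every name matching a glob pattern."""
    hits = set()
    for name, addrs in by_name.items():
        if fnmatch(name, pattern):
            hits |= addrs
    return hits

def time_range(lo, hi):
    """Union of address sets for times in [lo, hi] (ZRANGEBYSCORE-style)."""
    hits = set()
    for t in times[bisect.bisect_left(times, lo):bisect.bisect_right(times, hi)]:
        hits |= by_time[t]
    return hits

# 'Everything said by Bow* on 20100612' is the intersection of the two sets.
result = glob_names("Bow*") & time_range(20100612, 20100612)  # {'12.2235'}
```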
In my dataset I have 2.6 million ‘talker’ nodes. A date search over a year could return 300,000 results; a search for ‘everything said by Bowland’ might return 50,000. As set intersection in Redis is O(N*M), that’s going to be prohibitively expensive. I think this rules out the approach I’ve described, but I’m stuck trying to think of a better way to do things.
In MongoDB, I could have a ‘document’ for every talker node and store the current speaker and time within the document. I’d then use Mongo’s queries (with its indexing support) to do fast searches. The problem of set intersection goes away, as I’d just store the XML address on the ‘document’ and read it out of the query results.
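A pure-Python stand-in for that document model (the field names are my invention); Mongo would answer the same query from its B-tree indexes rather than a scan:

```python
# One 'document' per talker node, carrying the indexed fields plus the
# XML node address, so no set intersection is needed afterwards.
talkers = [
    {"addr": "12.2235", "time": 20100612, "name": "Bowland"},
    {"addr": "12.2240", "time": 20100612, "name": "Smith"},
    {"addr": "12.2301", "time": 20100613, "name": "Bowland"},
]

def find(name=None, lo=None, hi=None):
    """Naive scan, roughly equivalent to
    db.talkers.find({'name': name, 'time': {'$gte': lo, '$lte': hi}})."""
    for doc in talkers:
        if name is not None and doc["name"] != name:
            continue
        if lo is not None and doc["time"] < lo:
            continue
        if hi is not None and doc["time"] > hi:
            continue
        yield doc["addr"]

addrs = list(find(name="Bowland", lo=20100612, hi=20100613))  # ['12.2235', '12.2301']
```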
A big advantage of MongoDB is that I wouldn’t have to think so much about how users might query. Each ‘talker’ has seven attributes, and I’d like fast search on each of those attributes. I’m also unsure what use people will make of this XML database, and want it to be open-ended and easy for others to adapt.
A friend has suggested I implement an R-tree myself. An R-tree would require that all of my terms are sortable, so I’d have to give up the idea of glob/regexp search for strings and restrict myself to a prefix search – probably sufficient. I’m hesitant to take this approach as I don’t want to reinvent the wheel.
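If the terms are kept sorted, prefix search is just a range query, so you get it almost for free without writing a full R-tree. A minimal sketch with the standard library’s bisect (sample names invented):

```python
import bisect

# Sorted term list: every prefix match occupies a contiguous range,
# so two binary searches bound it.
names = sorted(["Abbot", "Bowland", "Bowles", "Brown", "Smith"])

def prefix_matches(prefix):
    lo = bisect.bisect_left(names, prefix)
    # '\uffff' sorts after any BMP character, so this lands just past
    # the last string starting with the prefix.
    hi = bisect.bisect_left(names, prefix + "\uffff")
    return names[lo:hi]

prefix_matches("Bow")  # -> ['Bowland', 'Bowles']
```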
Can anyone suggest a better approach? I’m not totally against SQL, but given that RAM is cheap and my dataset is only 1.2GB of XML with extremely infrequent updates, I don’t need a lot of the functionality it would provide. If I can get everything in RAM it’ll be incredibly fast – I’m happy to just throw away the index and rebuild when new documents are added to the DB.
I want this to be as fast as possible because I want to keep hosting costs down. Although I’ve been oblique about my purpose, all the software I write as part of this project will be released under the MIT license once I have something useful to show :-) Thanks in advance for any help!