So in setting up a search engine for my little artsy project I ran across Hype, a Python wrapper API for the Hyper Estraier full-text search system. Actually, I first ran across this blog post that, forgive the pun, hyped it up. It's a straightforward-enough API. The relevant sections of the code I put together for it looked like this:
...
db = hype.Database(indexPath)
for artFile in self.artCollect():
    print '%s found' % os.path.basename(artFile)
    self.index(artFile, db)
db.optimize()
db.close()
...
def index(self, artFile, db):
    # grab values and fill the index
    # (`binding` is the parsed article; the code that builds it isn't shown)
    doc = hype.Document()
    # "@" keys are short, non-tokenized attributes
    doc['@uri'] = unicode(os.path.basename(artFile)[:-4])
    doc['@created'] = binding.article.created
    doc['@modified'] = binding.article.modified
    doc['@title'] = unicode(binding.article.title)
    doc.add_text(unicode(binding.article.title))
    doc['@description'] = unicode(binding.article.description)
    doc.add_text(unicode(binding.article.description))
    # keywords go in as one space-separated attribute and as tokenized text
    keywords = [unicode(keyword) for keyword in binding.article.keyword]
    doc['@keyword'] = u' '.join(keywords)
    for keyword in keywords:
        doc.add_text(keyword)
    # everything under /article/content becomes tokenized text
    for content in binding.xml_xpath(u'/article/content//*'):
        doc.add_text(unicode(content))
    db.put_doc(doc)
    print '%s added to index' % os.path.basename(artFile)
I didn't get around to a search CGI, but it would look something like this:
db = hype.Database(indexPath)
# some regex madness i'll figure out later would go here
hits = db.search(unicode(query)).add(unicode(attBits))
out = ''
for hit in hits:
    out += '<a href="%s">%s</a>' % \
        (hit['@uri'], hit['@title'])
So the thing is that with Hype a document has short, non-tokenized attributes, stored under keys that start with the "@" sign, and everything else is tokenized text (see the usage doc). This obviously mirrors the way most of us search: we go for the general and do a full-text search for concepts or in recollection of a few words from an article or whatnot, or we go for the particular and look for a date range or something in the title, or we combine the two approaches.
That model works fine most of the time, but it constrains me in a way the PyLucene API doesn't (really it's the Lucene API, but since I'm using PyLucene I'll use that as the reference). In PyLucene I can choose whether or not to tokenize each field; I'm not forced into a number of non-tokenized attributes and one catch-all tokenized text field. Also, there's a clearer way to involve different tokenizers and analyzers in PyLucene, but I haven't really messed around with that yet so let's just focus on which fields can do what for now.
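To make the "choose per field" idea concrete, here's a toy in-memory index in plain Python. It is neither the Hype nor the PyLucene API; `ToyIndex`, `tokenize`, and the field names are made up for illustration, and the tokenizer is a crude assumption (lowercase, split on anything that isn't a letter or digit).

```python
import re


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r'[^a-z0-9]+', text.lower()) if t]


class ToyIndex(object):
    """Hypothetical index where each field is either tokenized or stored whole."""

    def __init__(self):
        self.tokenized = {}  # (field, token) -> set of doc ids
        self.exact = {}      # (field, whole value) -> set of doc ids

    def add(self, doc_id, field, value, tokenized):
        if tokenized:
            # one posting per token, so single words can be found later
            for token in tokenize(value):
                self.tokenized.setdefault((field, token), set()).add(doc_id)
        else:
            # attribute-like field: only the whole value is findable
            self.exact.setdefault((field, value), set()).add(doc_id)

    def search_token(self, field, word):
        return self.tokenized.get((field, word.lower()), set())

    def search_exact(self, field, value):
        return self.exact.get((field, value), set())


index = ToyIndex()
index.add('art1', 'uri', 'art1', tokenized=False)
index.add('art1', 'description', 'A short, punctuated description.', tokenized=True)
index.add('art1', 'contents', 'Title words, keywords and the body text.', tokenized=True)
```

The point is only that `description` gets its own tokenized field here, next to the catch-all `contents` field, while `uri` stays attribute-like. In the attributes-and-text model, `description` would have to live on the `uri` side of that split.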
The truth is that even though I'm using PyLucene I'll still create a catch-all tokenized text field, and most everything else about these article files can be put into non-tokenized attribute-like fields. The exception is the "description" field. I know it's an edge case, so it shouldn't be of much concern, but what if someone wants to search for something that only appears in the description of an article? With Hype, that means searching for it as an attribute, and that means it's not tokenized, which sucks because the description has sentences with punctuation and such that screw with STRINC searches (I guess part of the problem is also Hyper Estraier's limited attribute searching, but that too comes from the attributes-and-text document model).
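Here's a rough sketch of why punctuation hurts, again in plain Python rather than either library. I'm assuming STRINC behaves like a plain case-sensitive substring test, so `strinc` below is a stand-in and not Hyper Estraier's actual implementation; `tokens` and `token_match` are hypothetical helpers that mimic a tokenized field.

```python
import re


def strinc(attribute_value, query):
    """Stand-in for a STRINC-style attribute search (assumed: plain substring test)."""
    return query in attribute_value


def tokens(text):
    """Crude tokenizer (an assumption): lowercase, split on non-alphanumerics."""
    return set(t for t in re.split(r'[^a-z0-9]+', text.lower()) if t)


def token_match(field_value, query):
    """Tokenized-field search: every query word must appear among the field's tokens."""
    return tokens(query) <= tokens(field_value)


description = u'Ink, paper, and glue: a collage.'
query = u'paper and glue'

substring_hit = strinc(description, query)      # the comma after "paper" breaks the match
tokenized_hit = token_match(description, query)  # punctuation is gone after tokenizing
```

With these made-up values `substring_hit` comes out `False` and `tokenized_hit` comes out `True`: the words are all there, but the comma sitting in the raw attribute string gets in the way of the substring test.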
Don't get me wrong, I'm not bashing Hyper Estraier or Hype, and the API presented could be the perfect fit for a lot of projects, but for me the extra bit of control that the PyLucene API presents is something that I can't go without. Now, if only Lucene weren't written in Java, and PyLucene didn't require an old, old version of GCJ and the new, new Python 2.4 (well it's not quite so new but it's still not the default in Debian) in order to run. Can't we please get a decent, native Python text indexer/searcher? Maybe I'll resurrect Lupy and turn it into the fastest, rockingest one ever. Guess I better start doing some research on how these things really work.