I’ve yet to actually use full-text indexing on a project, but I’m continually drawn to the idea. I find myself a bit surprised that it isn’t being used nearly as much as (I thought) it should be. The two obvious places where I thought full-text indexing would surely take off are blogs and wikis. More than anything, these two systems are designed to hold lots and lots of words. If the rise of Google has taught us anything, it is that being able to easily, accurately and quickly search through huge amounts of words is vital to how useful those words are. What’s the point of having wikis and blogs store their data forever (more or less) if their own built-in search features can’t find what you’re looking for? Some of you will immediately raise your hand to tell me that simply letting Google index your wiki or blog is much better than trying to build a good built-in search. That may or may not be true, but it only works for publicly available resources. What about all of those wikis and blogs that are for internal use only? So once again we are back to relying on the built-in search features of these tools.
So if I were to write a wiki or blog system, using full-text indexing seems like the obvious choice when it comes to searching. So I started looking around to see whether something like WordPress or MediaWiki does this. WordPress doesn’t appear to make any use of full-text indexing, but MediaWiki does. Some quick searching didn’t turn up any other wiki or blog system that makes use of it. I thought this was very strange, since most of these systems use MySQL as their back-end database. MySQL has had support for full-text indexing for years and it looks pretty easy to use, so why isn’t it seeing more use? I think the answer can be found in the MySQL Full-Text docs: full-text indexing only works for MyISAM table types. For some reason I had never come across this limitation before; I just assumed that it also worked with InnoDB tables. InnoDB support is on their ToDo list, but there doesn’t appear to be an expected timeline for it.
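For what it’s worth, the MySQL docs make the setup look pretty painless. A rough sketch (table and column names are my own invention, and I haven’t actually run this):

```sql
-- FULLTEXT indexes require the MyISAM table type
CREATE TABLE posts (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200),
    body TEXT,
    FULLTEXT (title, body)
) ENGINE=MyISAM;

-- Natural-language search; rows come back ranked by relevance
SELECT id, title
FROM posts
WHERE MATCH (title, body) AGAINST ('full-text indexing');
```

The index is declared right in the table definition, just like a regular index, which is part of why it looks so approachable.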
So anyone who wants to make use of full-text indexing has to use MyISAM on those tables under MySQL. This seems like a huge limitation. One of the reasons MySQL has advanced so much (in my view) is because of the features that InnoDB has brought to the table.
This brings me to PostgreSQL and full-text indexing, made possible by the tsearch2 extension. The immediate difference that jumps out is that MySQL has full-text indexing built in and PostgreSQL has it as a third-party module. The install instructions look pretty easy, but I haven’t tried them yet so I can’t say for sure. Both of these approaches have advantages, but the advantages are targeted at different groups. For PostgreSQL, having a third party develop this extension makes things easier for the PostgreSQL developers and harder for the users of software that uses PostgreSQL as its backend. Now not only do I have to make sure that my web host supports PostgreSQL (that’s another story in itself), but I also have to make sure that they’ve installed tsearch2. For MySQL there is likely more work for the MySQL developers, because now they have one more thing to support and test for each release. The benefit of that work is that full-text indexing is available on any install of MySQL. This is a grossly oversimplified look at things, but I believe it fits the mindset of those who want to run software X on some random web host.
Unlike MySQL, PostgreSQL doesn’t have different table types. This means all of its features and abilities are available to you whether you want to use full-text indexing or not. This advantage would almost completely dry up if MySQL’s full-text indexing were supported in InnoDB. There are also some implementation details that differ between the two. MySQL uses (at least in syntax) something that looks like what you would use to create a regular index. In tsearch2 you end up adding a field to your table and then creating an index on that field.
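Based on my reading of the tsearch2 documentation, the field-plus-index approach looks something like this sketch (names are invented and I haven’t tried it myself):

```sql
-- Add a tsvector column, fill it, and index it
ALTER TABLE posts ADD COLUMN body_fti tsvector;
UPDATE posts SET body_fti = to_tsvector('default', body);
CREATE INDEX posts_fti_idx ON posts USING gist (body_fti);

-- tsearch2 ships a trigger function to keep the column up to date
CREATE TRIGGER posts_fti_update
    BEFORE INSERT OR UPDATE ON posts
    FOR EACH ROW EXECUTE PROCEDURE tsearch2(body_fti, body);

-- Query with the @@ match operator
SELECT id, title
FROM posts
WHERE body_fti @@ to_tsquery('indexing');
```

More moving parts than the MySQL version, clearly, but nothing here requires a special table type.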
So which is better? I don’t know; I hope to try out both of them in the future, just to see how they work. Based on what I’ve read, I don’t think either of these solutions provides simple, powerful full-text indexing that can be used widely. MySQL’s support excludes its most powerful table type, InnoDB, and PostgreSQL requires the installation of an additional module.
Perhaps full-text indexing is a sufficiently complex task that there will never be a true poor man’s solution.
Update (3:16 pm 12 Aug 2004): I guess it really is all about timing. There’s an intro article on PostgreSQL full-text indexing by Joshua Drake over at DevX dated 10 Aug 2004.
6 replies on “The State Of Full-Text Indexing For The Poor Man”
Have you looked at swish-e? It’s an open source full text indexing engine: http://www.swish-e.org/
I absolutely love it – it’s what I used to build the search engine for http://www2.ljworld.com/search/ and http://www.visitlawrence.com/search/ , among others. You need to run the indexer periodically (we do it once a day) so it doesn’t search your “live” data, rather a snapshot from last time you ran the indexer.
I don’t remember coming across swish-e before; it certainly looks interesting. I’m not sure how exactly it would integrate into a purely dynamic, database-driven site though.
I’m glad you mentioned this though because I’ve got another project at work where this will come in very handy.
You can import data into swish-e by writing a script that pulls things from your database and spits them out to standard output one at a time as simple HTML files; you then pipe that straight into the swish-e indexer. All of our sites are dynamic / database driven, but we use the swish-e indexer to build an index of our content once a day. Then we use the Python swish-e binding to query the index for serving up actual search results (there are bindings for other languages as well).
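The "spit out HTML on stdout" step described above could be sketched like this. It is a minimal illustration assuming swish-e’s `prog` input convention (`Path-Name` and `Content-Length` header lines, a blank line, then the document body); the table rows and field names are invented stand-ins for a real database query:

```python
import sys

def emit_record(out, path, title, body):
    # One document in swish-e "prog" input format:
    # header lines, a blank line, then exactly Content-Length bytes of body.
    doc = ("<html><head><title>%s</title></head>"
           "<body>%s</body></html>\n") % (title, body)
    out.write("Path-Name: %s\n" % path)
    out.write("Content-Length: %d\n" % len(doc.encode("utf-8")))
    out.write("\n")
    out.write(doc)

# In practice these rows would come from a SELECT against the blog/wiki tables.
rows = [(1, "Hello", "First post."),
        (2, "Search", "Notes on full-text indexing.")]
for row_id, title, body in rows:
    emit_record(sys.stdout, "db/%d.html" % row_id, title, body)
```

You would pipe this script’s output into the indexer, then hit the resulting index from your search page, so live page views never touch the indexer.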
I am currently using Lucene plus either MySQL or Postgres. However, I must say, it would be a lot better if MySQL could integrate CLucene into its implementation.
I use InnoDB for MySQL and I replicate all varchar/text data to a search index table each time there is an update or create. I think it is the best way because you get InnoDB and text search…until MySQL releases full text for InnoDB.
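A minimal sketch of that workaround (schema names invented): the real data lives in InnoDB, and the application mirrors the searchable text into a MyISAM shadow table on every insert or update:

```sql
-- MyISAM shadow table holding only the searchable text
CREATE TABLE posts_search (
    post_id INT NOT NULL PRIMARY KEY,
    content TEXT,
    FULLTEXT (content)
) ENGINE=MyISAM;

-- Run by the application after each insert/update of a post
REPLACE INTO posts_search (post_id, content)
VALUES (42, 'title and body text concatenated together');

-- Searches hit only the shadow table
SELECT post_id
FROM posts_search
WHERE MATCH (content) AGAINST ('some query');
```

The obvious cost is keeping the two tables in sync yourself, and the shadow table falls outside InnoDB’s transactions.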
swish-e does not support Unicode. End of story.
> In tsearch2 you end up adding a field to your table and then creating an index on that field.
Yes, and that is why it works much better. When you search and rank results, MySQL reads the index and then has to reparse the text, whereas tsearch2 uses the canned pre-parsed tsvector in your table. Besides, since Postgres compresses the text and the vector, the resulting table size is about the same as MySQL’s. Postgres also does lexemization, so your search results are much better.
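To illustrate, a ranked query with tsearch2 looks something like this (assuming a pre-parsed tsvector column named body_fti on a posts table; names invented, and syntax from memory of the tsearch2 docs):

```sql
-- Rank matches straight from the stored tsvector; the text is never reparsed
SELECT id, title, rank(body_fti, q) AS score
FROM posts, to_tsquery('search & words') AS q
WHERE body_fti @@ q
ORDER BY score DESC;
```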
Some benchmarks for you. Read on 😉
http://peufeu.free.fr/ftsbench/