archivemati.ca

Zend Search Lucene, Symfony and the ICA-AtoM application

March 8th, 2007 · 25 Comments · ICA-AtoM

About six months ago Dave Dash posted a great little tutorial demonstrating how to integrate the Zend Framework’s Search component into a Symfony application. That is exactly what I did for the ICA-AtoM archival description application that I am developing using the Symfony platform. Now that I am working on the next version of this application I have upgraded to the latest version (0.8.0) of Zend Search. This upgrade adds proximity, grouped and boolean searches as well as term rank boosting to the ICA-AtoM application. At the end of this post I have some tips to add to Dave’s tutorial that anyone upgrading from older Zend Search versions should be aware of.

The Apache Lucene search engine is probably the most widely adopted open-source search engine. It is, in fact, gaining huge popularity in the digital collections world (as evidenced, for example, by the Lucene workshop at the recent Code4Lib conference). As Mark Jordan noted in his report on the Access 2006 library conference, “Lucene is emerging as the indexer of choice for a number of open source and commercial products since it provides fast searches, can search across separate indexes, and facilitates faceted browsing.”

The beauty of the Zend_Search_Lucene component is that it is intended to be a direct PHP port of the Java-based Apache Lucene search engine. The index files it generates are natively accessible to Java-based Lucene applications as well as handy little Lucene utilities like Luke. So we’ll be getting all the functionality of a world-class search engine with the power and flexibility of the web-ready, object-oriented PHP5 language.

Zend_Search_Lucene is one of several very useful components found in the Zend Framework. Given Symfony’s flexibility it is very simple to integrate such a component into a Symfony-based application. By the way, I don’t see the Zend Framework as a direct competitor to Symfony (unlike CakePHP, for example). The Zend Framework provides very granular components that can be easily integrated into lightweight PHP applications as well as more structured MVC platforms like Symfony. If you are really motivated you can use it to assemble your very own MVC platform, but why would you when you can use such a well-designed, richly documented and proven platform as Symfony?

Zend_Search_Lucene tips for Symfony application developers

O.K. fine. I like Symfony, I like Lucene and I like Zend_Search_Lucene for the ICA-AtoM application. Let’s move on to the nitty-gritty. If you are looking to use Zend_Search_Lucene in a Symfony application for the first time, I’ll post my helper class with some basic instructions further below.

First, though, if you’ve already used Dave Dash’s tutorial to integrate an older version of Zend_Search_Lucene, here are the changes to keep in mind when you upgrade:

  1. There was a bug related to deleting documents in the original 0.8 release. This bug was fixed but you have to download one of the post-0.8 nightly snapshots to get it. I am using snapshot# 20070306-3762 which is performing fine.
  2. There is no need for the require_once calls. Just drop the Zend Framework library in one of Symfony’s class autoloading directories or use Symfony’s Zend Framework Bridge. The only files and directories required for Zend_Search_Lucene are:
    1. Zend.php
    2. /Zend/Search/
    3. /Zend/Exception.php

    It will save you about 8MB to get rid of the rest of the Zend Framework component directories if you are not going to use them.

  3. Use the Zend_Search_Lucene::create method to instantiate a new Zend_Search_Lucene object when rebuilding an index or creating a new one. Use the Zend_Search_Lucene::open method to add documents to an existing index. This replaces the use of the ‘true’ parameter to indicate whether a new index is getting created, e.g.:
    $index = Zend_Search_Lucene::create('index_location');

    not

    $index = new Zend_Search_Lucene('index_location', true); 
  4. There is no longer any need for the $index->commit() call. This is handled automatically by the addDocument() method. Exactly when documents are committed to the index depends on your index optimization settings; tweaking the buffer sizes trades indexing speed against query speed. Regardless of those settings, documents are indexed and available for querying as soon as the addDocument() call has completed.
  5. There is no need anymore for the one mega ‘contents’ field to act as an aggregate default search field. Zend_Search_Lucene will search through all fields by default now.
  6. It is still not possible to update documents in the index. You have to delete them first then add them anew. This is where you’ll hit a major snag if you followed Dave’s tutorial like I did. The default Zend_Search_Lucene analyzer (which turns field contents into searchable index terms) now ignores numbers. Therefore, if you try to use the my_symfony_object_id field to pull individual documents out of the index it won’t work. There are two possible solutions:
    1. Use the Query API to turn the Id value into a valid query term:
      $term =  new Zend_Search_Lucene_Index_Term($my_symfony_object->getId(), 'my_symfony_object_id');
      $query = new Zend_Search_Lucene_Search_Query_Term($term);
      $hits = array();
      $hits  = $index->find($query); 
    2. Change the default analyzer to one that treats numbers as index terms. If you change the default analyzer you need to set it explicitly each time you create or update the index and before you submit a query:
       Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num());

      or

       Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum());

      I am using the Utf8Num analyzer so that I can index Utf8 scripts and get hits on number characters (e.g. dates) in all my fields. This way I can also use the _id field as a term-specific query in my application’s standard search box (e.g. retrieve a specific object by id). However, the Utf8 analyzer is case-sensitive so you have to use a strtolower() workaround, as well as store some extra, unindexed copies of the fields you want to use in your hit display so that they are displayed in their original case. In the tests I’ve done so far, Utf8 indexing, querying and display seem to be working correctly. I did have problems with one obscure script which, I think, is due to the fact that it is a right-to-left script.

    I use both solutions together, just in case I want to change the default analyzer at some point.

  7. ‘_id’ is no longer reserved in index document field names, so you can use it in the name of the field that holds your application’s unique identifier (e.g. my_symfony_object_id). Keep in mind, though, that ‘id’ is still reserved as a unique identifier within the index and the array that is returned for hits.
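
One caveat on the strtolower() workaround from tip 6: PHP’s strtolower() works on single bytes, so it leaves accented and non-Latin characters in UTF-8 text unchanged (and, depending on the locale, older PHP versions may even mangle them). If your data contains such characters, the multibyte-aware mb_strtolower() is probably the safer choice. Below is a minimal sketch, assuming the mbstring extension is installed; normalizeForIndex() is a hypothetical helper name used for illustration, not part of my helper class or of Zend_Search_Lucene.

```php
<?php

/**
 * Hypothetical helper (an assumption for this sketch): lower-cases text
 * before it is indexed or queried, without breaking multibyte UTF-8
 * characters. Requires PHP's mbstring extension.
 */
function normalizeForIndex($text)
{
    // mb_strtolower() is encoding-aware, unlike the byte-oriented strtolower()
    return mb_strtolower((string) $text, 'UTF-8');
}

// prints "église saint-étienne, 1907"
echo normalizeForIndex('Église Saint-Étienne, 1907'), "\n";
```

The same normalization would have to be applied to the query string as well, since index terms and query terms only match if they were lower-cased in the same way.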

My Zend_Search_Lucene helper class

Here is a copy of the helper class that I created for the ICA-AtoM application. In the example below I’ve changed the object name from ‘archival_material’ to ‘my_symfony_object’ to make it a generic example. I’ve also reduced and renamed the fields that get added as terms to the document for simplicity.

I call the methods in the mySearchIndex helper from my actions whenever I create, update or delete a ‘my_symfony_object’:

mySearchIndex::updateIndexDocument($my_symfony_object->getId());

or

mySearchIndex::deleteIndexDocument($my_symfony_object->getId());

Of course, these could also be refactored down to the model classes but the timing of my index delete/update calls varies depending on a number of factors (namely whether or how many-to-many relationships are added/deleted) so I need to keep these helper calls in my actions for now.

<?php

class mySearchIndex
{

  // location of the index directory inside the Symfony project
  public static function getIndexLocation()
  {
    return SF_ROOT_DIR.DIRECTORY_SEPARATOR.'data'.DIRECTORY_SEPARATOR.'search_index';
  }

  // analyzer used for both indexing and querying
  public static function getIndexAnalyzer()
  {
    return new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num();
  }

  // (re)build the whole index from scratch
  public static function BuildIndex()
  {
    $index = Zend_Search_Lucene::create(self::getIndexLocation());
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(self::getIndexAnalyzer());

    $my_symfony_objects = mySymfonyObjectPeer::doSelect(new Criteria());
    foreach ($my_symfony_objects as $my_symfony_object)
    {
      $index->addDocument(self::createIndexDocument($my_symfony_object));
    }
  }

  // replace the index document for a single object
  public static function updateIndexDocument($id)
  {
    $index = Zend_Search_Lucene::open(self::getIndexLocation());
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(self::getIndexAnalyzer());

    $my_symfony_object = mySymfonyObjectPeer::retrieveByPk($id);

    //first delete existing index entries for this my_symfony_object
    $term = new Zend_Search_Lucene_Index_Term($my_symfony_object->getId(), 'my_symfony_object_id');
    $query = new Zend_Search_Lucene_Search_Query_Term($term);
    foreach ($index->find($query) as $hit)
    {
      $index->delete($hit->id);
    }

    //create and add document to index
    $index->addDocument(self::createIndexDocument($my_symfony_object));
  }

  private static function createIndexDocument($my_symfony_object)
  {
    $doc = new Zend_Search_Lucene_Document();

    //the id field name must match the one used in the delete queries and in the search result template
    $doc->addField(Zend_Search_Lucene_Field::Keyword('my_symfony_object_id', $my_symfony_object->getId()));
    $doc->addField(Zend_Search_Lucene_Field::Unstored('title', strtolower($my_symfony_object->getTitle())));
    $doc->addField(Zend_Search_Lucene_Field::Unstored('author', strtolower($my_symfony_object->getAuthor())));
    $doc->addField(Zend_Search_Lucene_Field::Unstored('description', strtolower($my_symfony_object->getDescription())));
    $doc->addField(Zend_Search_Lucene_Field::Unstored('subjects', strtolower($my_symfony_object->getSubjects())));

    //add unindexed, case-sensitive copies of fields for use in hit display
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('display_title', $my_symfony_object->getTitle(), 'utf-8'));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('display_description', $my_symfony_object->getDescription(), 'utf-8'));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('display_subjects', $my_symfony_object->getSubjects(), 'utf-8'));

    return $doc;
  }

  // remove the index document(s) for a single object
  public static function deleteIndexDocument($my_symfony_object_id)
  {
    $index = Zend_Search_Lucene::open(self::getIndexLocation());
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(self::getIndexAnalyzer());

    $term = new Zend_Search_Lucene_Index_Term($my_symfony_object_id, 'my_symfony_object_id');
    $query = new Zend_Search_Lucene_Search_Query_Term($term);
    foreach ($index->find($query) as $hit)
    {
      $index->delete($hit->id);
    }
  }

}

My application’s Search action looks something like this:

public function executeSearch()
{
  $this->query = $this->getRequestParameter('search_query');

  if ($this->query)
  {
    $this->getResponse()->setTitle('Search for \'' . $this->query . '\'', true);
  }

  $search_index = mySearchIndex::getIndexLocation();
  Zend_Search_Lucene_Analysis_Analyzer::setDefault(mySearchIndex::getIndexAnalyzer());

  $hits = array();

  if ($this->query)
  {
    $index = Zend_Search_Lucene::open($search_index);
    //field contents were lower-cased at index time, so lower-case the query too
    $hits = $index->find(strtolower($this->query));
  }

  $this->hits = $hits;

  //create search-info string
  $this->searchinfo = 'search for \'' . $this->query . '\' resulted in ' . count($hits) . ' hits';
}

I use the following code to list the hits in my application’s Search Result template:


<?php foreach ($hits as $hit): ?>
<?php echo link_to($hit->display_title, 'my_symfony_object/show?id='.$hit->my_symfony_object_id) ?>
<?php echo truncate_text($hit->display_description, 250) ?>
<?php endforeach ?>

25 responses so far ↓

  • 1 Dave Dash // Mar 9, 2007 at 12:08 pm

    In regards to point 5, is there a cleaner way of adding more value to certain fields than other?

    E.g. in the simple user table, is there a way that if someone searches for ‘dave’ that it’ll give more weight to a name that has the word dave, versus an email, or something else.

    Previously I would use str_repeat to repeat the instances of ‘Dave’ to whatever weight I wanted in the contents string… but there should be a cleaner way to do this.

  • 2 Peter Van Garderen // Mar 9, 2007 at 4:10 pm

    I’m still trying to figure that out myself.

    The documentation mentions boosting a query term but it is not clear that it is possible to boost a field when you are adding it to the index.

    It also talks about creating your own Similarity class to modify the scoring algorithm but the examples don’t seem to address field boosting.

    Also, I found this thread in the Zend Framework forum that seems to suggest that it is possible but, again, the details are vague.

  • 3 Peter Van Garderen // Mar 12, 2007 at 11:12 am

    Turns out it is pretty simple. I posted this question to the Zend Framework forum and got a swift reply from Alexander Veremyev, the Zend_Search_Lucene lead developer.

    To boost the terms in the Title field by a factor of 1.5, for example:

    $titleField = Zend_Search_Lucene_Field::Unstored('title', strtolower($my_symfony_object->getTitle()));

    $titleField->boost = 1.5;

    $doc->addField($titleField);

  • 4 lynx // Mar 14, 2007 at 12:46 pm

    Thank you for the great write-up and add-on to the other integration post about Symfony and Zend. How did you originally build your index when you had data already populated? Did you call your BuildIndex method in your helper class from a certain action and then remove the call after the index was built?

  • 5 Peter Van Garderen // Mar 14, 2007 at 1:15 pm

    Hi lynx. I’ve got an action called ‘BuildIndex’ that just calls the BuildIndex method:

    SearchIndex::BuildIndex();

    The BuildIndexSuccess.php template just reads:

    index build completed.

    While developing and testing the Search Index I just load this page into my browser to do the initial index build and any subsequent rebuilds.

    I am also going to add a link to this action/template combo from my application’s maintenance page so the administrator can trigger the Build Index process if that is ever necessary while the application is in live production.

  • 6 Jérôme Charron // Apr 3, 2007 at 2:59 am

    I tried to boost some fields as explained in your previous comment with the Zend_Platform 0.9.1, but it doesn’t work => there are no changes in my results with or without boost on some fields.
    Did you try it successfully?

  • 7 Peter Van Garderen // Apr 3, 2007 at 6:35 am

    It does work successfully for me. However, I found that I needed to use a boost factor of about 5 rather than the 1.5 that is used in the example above. Of course, the ranking algorithm is quite complex and other factors can come into play depending on the nature of the data in your index.

  • 8 Jérôme Charron // Apr 4, 2007 at 3:32 am

    Yes, it works fine… it was due to an error on my side.
    Thanks for your great post.

  • 9 Full-text search using Apache Lucene search engine | my-whiteboard // Apr 6, 2007 at 6:00 pm

    [...] 5, Zend Search Lucene, Symfony and the ICA-AtoM application [...]

  • 10 hope // Apr 10, 2007 at 7:32 am

    Hi,
    Thanks for the tuto,was really useful. I was wondering if there’s a way after creating the index to search only one of the fields.
    For example, on the search page there’s a dropdown (in your example,would contain title author, description…) where the user can specify the field he wants to search .
    Thanks again,

  • 11 Jérôme Charron // Apr 10, 2007 at 7:45 am

    hope, simply use the standard Lucene syntax:
    title:word description:word

  • 12 Peter Van Garderen // Apr 10, 2007 at 10:22 am

    Yes, the user can simply type the index field name into the query string as Jérôme indicated. Of course, you can also create separate text boxes for each term, each with its own label (e.g. title, description) and then simply prepend the field name to the query term before sending it to the search engine.

  • 13 sfZendPlugin at Spindrop // Apr 10, 2007 at 4:33 pm

    [...] I originally intended to rewrite my Zend Search Lucene tutorial, but Peter Van Garderen covered the bulk of what’s changed and I was too busy developing search functionality for lyro.com (not to mention finding inconsistencies with the Zend Search Lucene port and Lucene) to finish the tutorial. So I broke it up into smaller pieces. [...]

  • 14 hope // Apr 11, 2007 at 7:55 am

    worked great,
    Thanks!
    Another question: is it possible to limit the number of results? right now i’m just showing the first nth items in $hits.

  • 15 Peter Van Garderen // Apr 11, 2007 at 8:48 am

    I don’t think so Hope. It doesn’t mention this option in the Zend Search Documentation:
    http://framework.zend.com/manual/en/zend.search.html

    You could ask in the Zend Framework Forums:
    http://www.nabble.com/Zend-Framework-f15440.html

    I am planning on adding a Pager for my search results in the near future. I checked out the PEAR Pager component which looks like a possible candidate for integration:
    http://pear.php.net/package/Pager

  • 16 developercast.com » Spindrop.us: sfZendPlugin (a Zend Framework plugin for Symfony) // Apr 12, 2007 at 4:38 am

    [...] I originally intended to rewrite my Zend Search Lucene tutorial, but Peter Van Garderen covered the bulk of what’s changed and I was too busy developing search functionality for lyro.com (not to mention finding inconsistencies with the Zend Search Lucene port and Lucene) to finish the tutorial. So I broke it up into smaller pieces. [...]

  • 17 hope // Apr 13, 2007 at 8:48 am

    Pager would be nice! I use a mod of the propelPager, and it’s really not efficient, since I need to load all the search result first, and display them accordingly to the page number.

  • 18 t8d blog » Blog Archiv » Zend_Search_Lucene und Symfony - Teil 1 // Jul 10, 2007 at 1:54 pm

    [...] Integrating Zend_Search_Lucene into the PHP framework Symfony (see also the article by Andreas Stephan) is very simple if you take the route that Peter Van Garderen describes. After copying the required classes into the lib folder of the Symfony project in question, Lucene is available globally in the application. It naturally makes sense to also place a class in lib that provides the functionality for maintaining the index and for searching. Creating an index is simple: [...]

  • 19 Jules // Aug 2, 2007 at 11:19 pm

    Does it return several times the same items, or did I do something wrong?

  • 20 hope // Aug 2, 2007 at 11:21 pm

    No, it shouldn’t.

  • 21 Jules // Aug 2, 2007 at 11:25 pm

    Nevermind. I was inserting several times the same symfony object. It works now. Great tutorial, thanks.

  • 22 Jules // Aug 2, 2007 at 11:31 pm

    @hope: waouh, thanks for the fast feedback :)

  • 23 hope // Aug 2, 2007 at 11:34 pm

    no prob, it’s 4am and i’m still struggling with some voice frames drops in Asterisk; cool u could find out the prob, symfony rocks:)

  • 24 Medieval Programming » Blog Archive » Integrating Lucene into Symfony - a wrap up // Sep 15, 2007 at 10:06 am

    [...] Peter van Garderen uses Daves tutorial and adds some comments for newer versions. [...]

  • 25 Kevin // Sep 28, 2007 at 5:03 am

    Regarding the PEAR Pager at the link:
    http://pear.php.net/package/Pager

    Is there any update?

    I am now also trying to integrate it. But regarding the following part, how should I modify it so I could get the $hit->title, $hit->url, etc?
    ————–

    ————–

    Thanks a lot.