Blacklight, ActiveFedora and Shelver: Interplay between Searching, Managing and Indexing in a Repository Solution
OpenRepositories 2010 Presentation Proposal (Long Version)
Any repository solution provides facilities for Creation, Management, & Editing of Content as well as facilities for Searching & Browsing through that content. Experience has shown that when a solution binds these two areas of functionality together too tightly, the system becomes brittle and unworkable, discouraging innovation. Our work on the Hydra project has produced a flexible and intuitive solution that combines these two areas in an almost entirely decoupled fashion. This solution, which is already working in multiple Hydra applications, is built on a three-part pattern where Blacklight handles Search & Discovery, ActiveFedora handles Creation, Management and Editing of Content, and a small application called Shelver supplies the crossover point by indexing the content into Solr so that it will show up in Blacklight. This three-part approach reflects a strong pattern for designing and/or improving repository solutions. The main pivot of this approach is to treat indexing as its own separate part of the application and to allow the indexing process to evolve constantly as part of the application development cycle.
This work is the product of combining established best practices, best of breed software, and lessons learned from an iterative approach to application development. While our implementation is focused on Fedora Repositories, the software could be used in multiple contexts and the pattern is certainly applicable to any content-oriented application.
The anatomy of a Hydra Application
Note: This is a working model of the functional structure of a Hydra application. The complete designs for the final features and functionality of Hydra applications reach far beyond what is presented here. For more information on the greater vision around the Hydra project, please refer to the Hydra Project pages on the Fedora Commons wiki.
- The portion of a Hydra application that handles Creation, Management, & Editing of content is provided by the Hydra Core, which consists of ActiveFedora along with a few Hydra “helpers” which integrate ActiveFedora into Ruby on Rails.
- The Search & Discovery portion of a Hydra application is a Blacklight installation – nothing more, nothing less. As with any Blacklight installation, its behavior and appearance are likely to be customized but otherwise there is nothing Hydra-specific about it.
- Shelver (which can be run from within the application, from the command line, or as a JMS listener) indexes content and its salient metadata in Solr, usually pulling that content from Fedora.
These three components (Blacklight, Hydra Core and Shelver) work in concert to present a consolidated repository solution to the end user. Meanwhile, the three components are sufficiently decoupled that each could be run as a freestanding application. They interoperate based on a minimal contract that revolves around decisions about what information should be in Solr and how it should be represented in the Solr index in order to achieve the ideal search experience.
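To make the idea of a "minimal contract" concrete, here is a small sketch in Ruby. It is not taken from Hydra, Blacklight or Shelver: the field names (title_t, format_facet) and the function missing_contract_fields are assumptions chosen for illustration. The point is only that the agreement between the components can be as small as a list of Solr fields that the indexer promises to fill and the search interface expects to find.

```ruby
# Hypothetical contract: the Solr fields the search side expects.
# These field names are illustrative assumptions, not Hydra's schema.
REQUIRED_FIELDS = %w[id title_t format_facet].freeze

# Returns the contract fields that are missing or blank in a Solr document,
# where the document is represented as a plain Hash.
def missing_contract_fields(solr_doc)
  REQUIRED_FIELDS.select do |field|
    values = Array(solr_doc[field]).map { |v| v.to_s.strip }
    values.reject(&:empty?).empty?
  end
end

complete = { "id" => "demo:1", "title_t" => "A title", "format_facet" => ["Image"] }
partial  = { "id" => "demo:2", "title_t" => "  " }

puts missing_contract_fields(complete).inspect
puts missing_contract_fields(partial).inspect
```

A check like this could run in the indexer's test suite, so that a change on either side of the contract is noticed before it reaches the search interface.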
In the process of customizing or extending a Hydra application, some changes require modifications to all three components, but most changes impact only one or two of the components at a time. This makes it very easy to iteratively improve the application and adapt to real world needs.
This structure grew naturally out of a process of exploration. In early 2009 developers at UVA and Stanford discovered that it was relatively easy to put Blacklight on top of a Solr index that had in turn been populated by ActiveFedora, effectively turning Blacklight into a search & discovery interface for that Fedora repository. Based on that, we tried dropping ActiveFedora-driven views & controllers for editing Fedora content into the same Ruby on Rails application as an existing copy of Blacklight. It worked like a charm. The two systems happily coexist. What we found was that as long as we could change and refine how the metadata percolates from Fedora into Solr, getting Blacklight to operate together with the ActiveFedora management component was completely straightforward.
With most Hydra applications, all content is stored in a Fedora Repository. However, there is nothing to prevent you from adding non-Fedora content to Solr and having it show up in the (Blacklight) search & discovery views. Of course, that content will not be editable unless you implement the code to integrate with that content’s host system.
Best of Breed: Blacklight & ActiveFedora
Blacklight is a next generation Search & Discovery tool. It was intentionally designed to serve a single purpose – Search & Discovery – without having any knowledge of indexing, cataloging, or even the location of the content it’s searching through. Whatever information you have in your Solr index, Blacklight will help you expose a rich, faceted search interface for exploring through that information and displaying detail views of individual records. This open-ended design made it very easy for us to integrate Blacklight directly into our Hydra applications as-is. The ease with which we achieved this seamless integration is a testament to the quality of Blacklight’s design.
ActiveFedora is a Ruby library that encapsulates the details of interacting with a Fedora repository and provides high-level tools for defining data models, creating Fedora objects, and modifying the data associated with those objects. While opening the door to rapid, iterative application development, ActiveFedora also attempts to expose and accentuate many of the strong design patterns inherent in Fedora. ActiveFedora’s emphasis on flexibility and design patterns provided us with many opportunities to make our Hydra applications robust and re-usable. In particular, ActiveFedora makes it possible for multiple Hydra (and non-Hydra) applications to operate on top of a single Fedora repository, thus achieving the goal of providing many lightweight views onto complex, heterogeneous repository content.
Hydra Core: the building blocks of an interface for creating & editing Fedora content
Hydra Core provides the Ruby on Rails code that handles Creation, Management, & Editing of Fedora content. This primarily consists of Rails helpers for generating edit interfaces and Rails controllers to handle the submissions from those interfaces.
Fedora allows a great amount of freedom with respect to data models and metadata. As a result, we could not simply create a single generic content management interface in Hydra. Instead, we created a number of “helpers” that allow you to deal with your Fedora content and its metadata at a high level of abstraction. For example, the editable_metadata_field helper generates the HTML for displaying an editable version of whichever metadata field you specify. All you have to know is what field you want to display and where it is stored within the object. Everything else is handled for you.
The forms generated by the Hydra helpers need somewhere to submit their data. This is handled by the Rails controllers provided by Hydra Core.
Underneath the helpers and controllers, Hydra Core relies on ActiveFedora to handle connecting with Fedora, modeling Fedora objects & metadata, and performing the basic operations of creation, retrieval, updating and deletion.
Shelver: a script that brought unexpected freedom
When we wrote Shelver, we didn’t anticipate how integral it would become to the application development process. Shelver started out as something extremely simple. A developer at Stanford initially wrote it in order to populate a Solr index with some working data from a Fedora Repository. Over time, as needs arose, we built out the script to be more robust. It soon became apparent how crucial it is to be able to modify and/or augment the behavior of your indexing tool. In most other systems, the indexing tool is either implicit (relational databases) or external to your application and difficult to re-configure (e.g. Fedora GSearch). As a result, when working with other systems, discussion of the indexing strategy, and changes to it, are kept to a minimum. In contrast, since we had Shelver at our disposal, we found ourselves constantly tweaking it to support new functionality. This ability to tweak our indexing routines gives us radical freedom to explore new features, improve the search experience, and increase the quality of search results.
Eventually, we pulled Shelver into the Hydra app itself so that we could trigger it as part of the save/update process, though we retained the hooks for running it as a command line tool as well. We also did this because we found that changes we made to Shelver often corresponded to changes in the search interface. Shelver was continually evolving in conjunction with the application, so it made sense to track the code with a single versioning system.
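The arrangement described above, where saving an object triggers indexing while the indexer stays callable on its own, can be sketched roughly as follows. This is not Shelver's or ActiveFedora's actual code: FakeIndex, SketchIndexer and SketchObject are hypothetical names, the index is an in-memory Hash standing in for Solr, and the title_t field name is an assumption.

```ruby
# Hypothetical in-memory stand-in for a Solr index.
class FakeIndex
  attr_reader :docs

  def initialize
    @docs = {}
  end

  # Adding a document with an existing id replaces the old one.
  def add(doc)
    @docs[doc["id"]] = doc
  end
end

# Hypothetical indexer playing the role the text describes for Shelver.
class SketchIndexer
  def initialize(index)
    @index = index
  end

  # Can be called from a save hook, or directly (e.g. from a script).
  def index_object(obj)
    @index.add("id" => obj.pid, "title_t" => obj.title)
  end
end

# Hypothetical repository object; a real one would persist to Fedora.
class SketchObject
  attr_reader :pid
  attr_accessor :title

  def initialize(pid, title, indexer)
    @pid = pid
    @title = title
    @indexer = indexer
  end

  # Saving triggers indexing, so the search side sees every update.
  def save
    # (writing to the repository would happen here)
    @indexer.index_object(self)
    true
  end
end

index = FakeIndex.new
indexer = SketchIndexer.new(index)
object = SketchObject.new("demo:1", "First title", indexer)
object.save
object.title = "Revised title"
object.save
puts index.docs["demo:1"]["title_t"]
```

Because the indexer is an ordinary object handed to the model, the same index_object call can be made from a command line script without going through save.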
Approaches to Indexing: from RDBMS to Fedora + Shelver
RDBMS (data model & search index combined)
If you rely solely on a Relational Database to drive your application, your data model (the database schema) is also your indexing model: any search-oriented changes necessitate changes to your data model. This makes it difficult to refine and extend the search & discovery portion of the application without impacting other areas of functionality.
RDBMS + Solr (separate search index from data without much thought to the conceptual differences)
A number of tools exist for pulling content from a relational database into Solr. This achieves the goal of separating the search index from the data itself, allowing you to have an indexing model separate from your data model. However, often with these systems the indexing methodology remains tightly bound to the data model. This is more of a conceptual stumbling block than a technical one. It’s easy to underestimate the complexity and distinctiveness of indexing for search & discovery. It is not enough to index your data in Solr; you must think differently about how and why you put it there. This “thinking” must be manifest somewhere in the application’s code, ideally separated from the rest of the application.
Fedora + GSearch + Solr (freestanding tool specifically handles indexing)
Fedora is explicitly designed with the idea that you should separate your data model from your indexing solution(s). This allows us to use any variety of content models and metadata schemas to represent our content in Fedora while pulling that information into any number of indexes to suit specific searching needs. The most common indexing approach with Fedora repositories is to use Fedora GSearch to pull Fedora content into a Lucene, Solr or Zebra index. This approach has the benefits of completely separating the data from the index while also providing a freestanding, configurable tool to handle the process of indexing.
GSearch was designed with the goal of enabling 1) full-text searching of Fedora content and 2) indexing of arbitrary XML metadata from Fedora objects. It runs as a web application alongside Fedora, listening for JMS messages or REST API commands telling it to (re)index Fedora objects. The process by which GSearch indexes the content is implemented as a mix of XSLT and Java code.
GSearch establishes the strong best practice of decoupling both the search index and the indexing process from the data itself. This pattern was part of Fedora’s design all along, but thus far GSearch has been the clearest manifestation of it.
Because it was designed specifically to enable full-text indexing using XSL Transformations (XSLT), GSearch operates on the premise that you are transforming the content in order to put it in the index. In a basic system, transformations are sufficient. However, most repository solutions eventually need to actively process the data when indexing it, performing complex actions in order to decide how to populate the search index. Because XSLT does not lend itself to performing such complex processes, you must modify Java code if you want to implement this type of processing in GSearch. Modifying that code has proven daunting for most. Very few projects have taken on the challenge of modifying the GSearch code itself. Those that have modified the code have only done so in minimal and relatively stable ways.
Fedora + Shelver + Solr (allowing the indexing methodology to constantly evolve)
If you want to provide a great search & discovery experience in your application, you must make it easy to iteratively “massage” the indexes. Anyone who manages a Blacklight or VuFind installation on top of their ILS (or anyone who participates in Code4Lib) can attest to the fact that in order to achieve a truly successful search & discovery experience you must continually refine the way you index your metadata. Little changes in your indexing methodology can bear tangible results for end-users.
In building SALT, the first Hydra application to combine Blacklight with ActiveFedora, we created Shelver as an alternative to GSearch because we wanted to be able to specify our indexing process in Ruby code and, where possible, we wanted to use simple mapping files rather than being forced to use XSLT and Java to perform those actions. We assumed that Shelver would be a relatively simple application whose code rarely changed. After all, when using GSearch we rarely changed the XSLT and basically never changed the Java code. We expected that the same would be true with Shelver. We were wrong. Shelver is constantly changing because we are constantly coming up with new things that we want to do to improve the search & discovery utilities in our Hydra applications. As time passes, the code of Shelver itself has stabilized, but the instructions for how to index specific data from Fedora continually morph as a regular part of the application development process. In fact, touching the Shelver code has become such an integral part of our work that we can’t imagine building a repository solution without this kind of freedom.
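As an illustration of what "simple mapping files" plus Ruby code can look like, here is a minimal sketch. It is not Shelver's real mapping format or code: FIELD_MAP, build_solr_doc and every field name in it are assumptions made for this example, and the mapping is written as a Ruby Hash rather than loaded from a file. The mapping covers the plain source-to-index copying, while the decade facet shows the kind of small processing step that is awkward in XSLT but trivial in Ruby.

```ruby
# Hypothetical mapping from source metadata fields to Solr fields.
# In practice this could live in a separate mapping file.
FIELD_MAP = {
  "title"   => ["title_t", "title_facet"],
  "creator" => ["creator_t"]
}.freeze

# Builds a Solr document (a plain Hash) from a metadata Hash.
def build_solr_doc(pid, metadata)
  doc = { "id" => pid }

  # Declarative part: copy mapped fields into their target Solr fields.
  FIELD_MAP.each do |source, targets|
    next unless metadata.key?(source)
    targets.each { |target| doc[target] = metadata[source] }
  end

  # Procedural part: derive a decade facet from the first 4-digit year found.
  year = metadata["date"].to_s[/\d{4}/]
  doc["decade_facet"] = "#{year.to_i / 10 * 10}s" if year

  doc
end

puts build_solr_doc("demo:1", "title" => "Map of Virginia", "date" => "1987-03-01").inspect
```

Adding a new facet then means adding a line to the mapping or a few lines of Ruby, which is the kind of small, frequent change the paragraph above describes.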
Conclusion, Observations and Best Practices
To review, some of the recommendations coming out of this work are to:
- use indexing as the crossover point between decoupled solutions for searching through and managing your content
- make the indexer an explicit, evolving part of your application
- use flexible components that were designed with iterative development in mind
- re-use established best practices where possible
- combine best of breed solutions for astounding results
We were pleasantly surprised to discover how easy it was to combine Blacklight and ActiveFedora into a single Fedora solution. The three-part pattern that emerged out of this effort, which now constitutes a basic Hydra application, builds on well-established practices and serendipitously combines them in a stable, intuitive way. This in turn provides a strong base for us all to carry out a great amount of innovative work in the coming years.