Technology Evaluation - Static Site Search

Introduction

Within the GPII and Fluid communities, we use static sites for demos, complex reports, blog posts, and documentation. The highest-profile example is the documentation for Infusion, the framework that is the foundation for nearly all of our work on the GPII. Implementers like myself commonly use the site when writing code, reviewing pull requests, and troubleshooting problems.

The Infusion documentation site is presented as a hierarchy of pages and categories, and readers can only search within the current page using their browser's find feature. Knowledgeable developers can download the source for the Infusion documentation or search the repository on GitHub, but it would be very helpful to have in-context searching in the documentation site itself. The same is true of various smaller documentation efforts, demonstrations, internal blogs, and other static sites.

This document reviews the available options in depth and suggests a candidate solution. Although there is a range of approaches, I am assuming for the purposes of this review that:

  1. We want performant full text searching of our static sites.
  2. We want a solution that is licensed in a compatible way.
  3. We want a solution that does not impose unusual memory requirements or a lot of dependencies on our end users' browsers.
  4. We want a solution which does not commit us to supporting a new range of build tools and/or languages, i.e. which is written in JavaScript.
  5. We would rather not commit to a third-party indexing service like Google Custom Search or Algolia.
  6. We do not want to introduce any requirement for hosting our own server-side architecture.
  7. We want a solution that is widely used, well maintained, and under active development by a healthy community of contributors.

Based on the above requirements, I will be reviewing the JavaScript-based approaches to indexing and searching full-text content, and rating their appropriateness for our purposes.

Initial Review

I initially compiled a list of candidates pulled from Google searches and a review of the "search" and "searching" categories in npm, to screen for candidates that should be tested in depth. Here is a summary of the solutions in alphabetical order.

Solution | License | Health | Recommendation
Fuse.js | Apache | Actively maintained, although the volume of changes is low. Reasonably sized community of contributors. | Candidate for further evaluation.
fuzzy-search | ISC | Much smaller and much less active community than Lunr.js. | Too small/poorly adopted to consider.
fuzzysearch | MIT | Very small number of contributors, and no real activity since early 2015. | Too small/poorly adopted to consider.
fuzzysort | MIT | Only two contributors, a small amount of effort up front, and then increasingly little development in 2018. No development in 2019. | Too small/inactive to consider.
js-search | MIT | A somewhat smaller and much less active community than Lunr.js. | Candidate for further evaluation.
Lunr.js | MIT | The largest community of the field. Seems to consist of a single major contributor and a wide range of less frequent contributors. They have been and remain active in adding features and fixing bugs, and their reply time for even simple usage questions also seems good. | Candidate for further evaluation.

Based on the above review, it seems like Fuse.js, js-search, and Lunr.js are worth evaluating further.

Detailed Review

First, I read through the documentation and source code and tried to use each of the solutions. Here are more detailed observations on each.

Fuse.js

Geared towards loading the full content in the client as JSON and searching that. Does not help with pre-indexing. Professional-looking documentation, but no detailed API docs and only limited examples. Incredibly fast indexing (often < 1 ms), but much slower searching than Lunr.js (139 ms vs. 3 ms).
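As a rough illustration of the "load the content as JSON and search it" model described above, here is a minimal sketch. The `Fuse` constructor is passed in as a parameter (so the same function could be used with `require("fuse.js")` in Node or the browser global); the `title` and `body` field names and the `createSearcher` helper are assumptions made for illustration, not part of our documentation data.

```javascript
// Minimal sketch: wrap an in-memory document list in a Fuse.js searcher.
// `Fuse` is injected rather than required here; `title` and `body` are
// hypothetical field names for the documents being searched.
function createSearcher(Fuse, docs) {
  // `keys` tells Fuse.js which document properties to search.
  const fuse = new Fuse(docs, { keys: ["title", "body"] });
  // Return a function that runs one query against the full list.
  return (query) => fuse.search(query);
}

// Possible usage in Node, assuming the fuse.js package is installed:
//   const Fuse = require("fuse.js");
//   const search = createSearcher(Fuse, docsLoadedFromJson);
//   const results = search("component");
```

Because there is no separate index artifact in this model, the whole document set has to be delivered to the browser before the first search can run.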

js-search

A faster alternative to Lunr.js for individual searches (1 ms vs. 3 ms), but with less robust documentation and examples. It cannot load a prebuilt index and can only index the full set of documents in the client, which means that it would take more than 7 times as long as Lunr.js with a prebuilt index to be ready to search (2,721 ms vs. 376 ms).

Lunr.js

Lunr.js supports deeply configurable full-text searching from within the browser. A key feature is the ability to build the index in Node and then load the prebuilt index on the client side. In testing, loading a prebuilt index is over 8 times faster than indexing the full data from scratch (384 ms vs. 3,419 ms in the speed comparison table).
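The prebuilt index workflow can be sketched roughly as follows, assuming the Lunr.js 2.x API. The `lunr` function is passed in as a parameter so that the same code could be used with `require("lunr")` at build time and the browser global at load time; the helper names and the `id`, `title`, and `body` fields are hypothetical.

```javascript
// Build step (Node): index every document once and serialize the result.
// `id`, `title`, and `body` are assumed field names for illustration.
function buildSerializedIndex(lunr, docs) {
  const index = lunr(function () {
    this.ref("id");      // value returned in search results
    this.field("title"); // fields whose text is indexed
    this.field("body");
    for (const doc of docs) {
      this.add(doc);
    }
  });
  // A Lunr index defines toJSON(), so it can be stringified directly.
  return JSON.stringify(index);
}

// Client step (browser): restore the index without re-indexing the documents.
function loadIndex(lunr, serialized) {
  return lunr.Index.load(JSON.parse(serialized));
}

// Possible usage, assuming the lunr package is installed:
//   const lunr = require("lunr");
//   const serialized = buildSerializedIndex(lunr, docs);
//   const index = loadIndex(lunr, serialized);
//   const results = index.search("component");
```

Search results carry only the `ref` value of each match, so the page would still need some way to map a `ref` back to a title and URL for display.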

A short-term advantage is that this is the only solution that already has a plugin for DocPad, which we use to build our documentation site.

Speed Comparison

Although speed is not the only consideration, it is a major one, especially if we decide to have search integrated into and loaded on all parts of our documentation site. To compare the solutions side-by-side, I wrote a script for each solution that pulls in the same 6,000+ JSON documents, indexes them, and then performs a search. I took timing data over multiple runs to compare them. Here is a breakdown of the indexing and search speeds:

Solution | Index Speed | Search Speed
Fuse.js | 0-1 ms | 139 ms
js-search | 2,483 ms | 2 ms
Lunr.js (full index) | 3,419 ms | 3-5 ms
Lunr.js (cached index) | 384 ms | 3-5 ms
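For readers who want to repeat this kind of measurement, the following is a minimal sketch of a timing harness for Node. It is not the script used to produce the figures above: `buildIndex` and `runSearch` are hypothetical stand-ins (a naive word-to-document map) that would be swapped for the indexing and search calls of the library under test, and the sample documents are invented.

```javascript
// Time a synchronous function; return its result and elapsed milliseconds.
function timeMs(fn) {
  const start = process.hrtime.bigint();
  const result = fn();
  const elapsedNs = process.hrtime.bigint() - start;
  return { result, ms: Number(elapsedNs) / 1e6 };
}

// Hypothetical stand-in for a library's indexing step:
// map each lowercased word to the ids of the documents containing it.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const word of doc.body.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(doc.id);
    }
  }
  return index;
}

// Hypothetical stand-in for a library's search step: exact word lookup.
function runSearch(index, term) {
  return [...(index.get(term.toLowerCase()) || [])];
}

// Repeat index + search several times so outliers are visible.
function benchmark(docs, term, runs = 5) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const indexed = timeMs(() => buildIndex(docs));
    const searched = timeMs(() => runSearch(indexed.result, term));
    samples.push({
      indexMs: indexed.ms,
      searchMs: searched.ms,
      hits: searched.result.length,
    });
  }
  return samples;
}

const sampleDocs = [
  { id: 1, body: "Infusion framework documentation" },
  { id: 2, body: "Static site search evaluation" },
];
const samples = benchmark(sampleDocs, "search");
console.log(samples);
```

Timing the indexing and search steps separately matters here because the candidates trade one cost against the other.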

Recommendation

Of the three candidates, js-search is the least well supported and least used, and (because it cannot be used with a pregenerated index) is also the slowest to become ready to search once Lunr.js is given a prebuilt index. Based on that, I would recommend using one of the other two solutions.

Both Fuse.js and Lunr.js are solid solutions with an active community of developers and adopters. The option to index documents on the server side with Lunr.js is appealing, as it lets us use the full range of processing plugins available in the Node ecosystem, and does not limit us to browser-based solutions. The indexing speed of Fuse.js is compelling, but its per-search speed is much slower.

If we would like to have a search interface that refreshes itself in real time as people type, Lunr.js is probably the better option, as it would be faster after as few as four refreshes. If we would like to have a search that is a separate standalone page, then Fuse.js is probably the better option, as the time to load the page, index content, and perform the search is less than half what Lunr.js requires.
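The trade-off between setup cost and per-search cost can be checked with a little arithmetic. The sketch below uses only figures from the speed comparison table (taking the upper bound of each range) and ignores download and script parsing time, so the exact crossover on a real page may differ; the function names are invented for illustration.

```javascript
// Timings in milliseconds, taken from the speed comparison table.
// Upper bounds are used where the table gives a range (an assumption).
const timings = {
  fuse: { setup: 1, search: 139 },
  lunrCached: { setup: 384, search: 5 },
};

// Total time spent after one setup and a given number of searches.
function cumulativeMs(timing, searches) {
  return timing.setup + timing.search * searches;
}

// First search count at which `candidate` has a lower total than `baseline`.
function firstSearchWhereFaster(baseline, candidate, limit = 1000) {
  for (let n = 1; n <= limit; n++) {
    if (cumulativeMs(candidate, n) < cumulativeMs(baseline, n)) return n;
  }
  return null;
}

const crossover = firstSearchWhereFaster(timings.fuse, timings.lunrCached);
console.log("Lunr.js (cached index) pulls ahead at search number:", crossover);
console.log("Single search, Fuse.js:", cumulativeMs(timings.fuse, 1), "ms");
console.log("Single search, Lunr.js:", cumulativeMs(timings.lunrCached, 1), "ms");
```

With these inputs a single search costs 140 ms for Fuse.js against 389 ms for Lunr.js, and Lunr.js pulls ahead at the third search, which is in the same range as the estimate above.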

Of the two, I would argue for using a standalone search page, as it avoids the complexity of repeatedly announcing dynamic content. It also keeps the vast majority of the documentation site simple, as most pages only need a link or a form input that points to the new search page and don't need to load indexing and search libraries themselves. Unless the Lunr.js DocPad plugin makes some of the initial integration work easier, I would propose creating the new page using Fuse.js.