Elasticsearch

Koha has two search engine options, Zebra and Elasticsearch. Zebra is a simpler search engine that works well with MARC records, but it can be a bit slow because of how Koha is structured. Elasticsearch is a search engine that has more flexibility in how it is configured to search a dataset. Koha is unique in that its data is not stored in one single index, but rather in multiple smaller indexes. Library catalogs that use Elasticsearch tend to return more relevant results more quickly than other types of search engines.

System Preferences

Which search engine Koha uses is determined by the system preference SearchEngine. There are only two choices here, Zebra and Elastic. Koha defaults to Zebra. Switching to Elastic is not quite as easy as just flipping this switch, though. There’s some setup that needs to happen on your server to get Elastic installed and running. If you flip this switch before that’s been done, you won’t get any search results at all.

Once Elastic has been set up on your server and we’ve flipped this switch for you, Zebra is going to continue to run and keep itself updated. For now, Koha’s still using Zebra in a few places outside of catalog searching, so it’s still around. That means that one could use the system preference to go back to Zebra at any time -- though we’d generally prefer that you didn’t, as we’d like to see and fix any Elastic issues you’re having. If you do switch back to Zebra, Elastic will not continue to keep itself updated while you use Zebra. That means switching back to Elastic requires a little extra work on the server end that we can take care of for you.

ElasticsearchCrossFields

This system preference was created in response to a change between Elastic versions 5 and 6. Currently, Koha is capable of using either version of Elastic. You can see your Elastic version in the About Koha page of your staff client. All ByWater partner libraries currently using Elastic are using Elastic 6.

If your Koha system is using Elastic 5 or earlier, you need the ElasticsearchCrossFields system preference to be set to "Disable." Otherwise, all searches will fail.

If your Koha system is using Elastic 6 or later, you need the ElasticsearchCrossFields system preference to be set to "Enable." Otherwise, keyword searches will return inaccurate results.

More specifically, on Elastic 6 and later, if ElasticsearchCrossFields is turned off, Koha will look for all of your search terms in all of your search indices, but only return titles where all of your terms were in the same index. So a search for "bram stoker dracula" would not find the novel (as "bram stoker" is in the author and "dracula" is in the title), but might find you the 1992 film titled Bram Stoker's Dracula.
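The difference between the two behaviors can be sketched with a toy model in Python. This is only an illustration of the matching rule described above, not Koha's or Elastic's actual code; the two records and the function names are made up for the example, and the film title is simplified to plain words.

```python
# Toy records: each key is a search index, each value is the set of words in it.
NOVEL = {"author": {"bram", "stoker"}, "title": {"dracula"}}
FILM = {"author": {"coppola"}, "title": {"bram", "stoker", "dracula"}}


def matches_same_index(record, terms):
    """Cross fields off: every term must be found together in a single index."""
    return any(all(term in words for term in terms) for words in record.values())


def matches_cross_fields(record, terms):
    """Cross fields on: each term may be found in any index of the record."""
    all_words = set().union(*record.values())
    return all(term in all_words for term in terms)


terms = ["bram", "stoker", "dracula"]
print(matches_same_index(NOVEL, terms))    # the novel is missed
print(matches_same_index(FILM, terms))     # the film is found
print(matches_cross_fields(NOVEL, terms))  # the novel is found
```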

Search Engine Configuration

Once you’ve switched to Elastic, you’ll see a new link in the Catalog section of Administration: Search Engine Configuration. This is where Koha shows you which parts of your MARC records get indexed for searching and then how those search indexes are weighted for use in relevancy ranking of search results. The former replaces the search index documentation from the Koha manual. The latter is something we’ve never had clear access to in Zebra. As of Koha version 19.11, most things on this page require some action on your server when they are changed, so we encourage you to contact us before making any changes here. That said, it’s still a great resource for checking which parts of your records are being indexed and how you might search them directly.

Mappings

The tabs labeled “Bibliographic records” and “Authorities” contain your mappings. On each tab the “Search field” column contains search index names and the “Mapping” column contains MARC fields and subfields. In the screenshot above, we can see that the author index contains the 100$a, 110$a, 111$a, 245$c, and 700$a. So an author search will return records with our search terms in any of those fields. Many MARC fields are included in several different indexes. If you’re uncertain of how to search for values in a particular MARC field, you can come to this page and use ctrl-F to look for the MARC field in question.

Not every MARC field is indexed by default. If you want to index a new MARC field or change which indexes include which MARC fields, we can change these values for you. For example, one of our public library testers separated the 245$n and 245$p into a new index called “title-part” so they could search season and volume numbers separately from the rest of the title. A special library partner requested a new index called “product-type” for the 513$a, where they’d been keeping internal data for local records.

The remaining three columns in these tabs have limited functionality. They describe whether or not these fields can be used to sort results, generate facets, or suggest other searches, respectively.

Currently, ByWater is maintaining two default mapping sets. You can see our academic library mapping here and our public library mapping here.

Notify your support vendor if you wish to make any changes to your mappings. Some changes can break functionality. Other changes also require backend configuration.

Weighting

The tab labeled “Search fields” shows the weightings applied to generate relevancy rankings in search results. With no weighting values assigned, all fields are considered equally important in your search results. Essentially, a blank here is the same as entering a one. By increasing the weight value, we give that field more importance so that records with our search terms in the weighted fields are pushed higher in our results than records with our search terms in unweighted fields. In our testing, we settled on the following default weights:

title: 32
author: 16
subject: 8
title-series: 4
contents: 2

These weightings are based on the assumption that your patrons are most likely to provide a title or author when searching. If not a title or author, they may give a subject or series title or some words contained in the 5XX fields. Remember you can always check the mappings to see exactly which fields are included in each of these weighted indices. Weights are applied at search time, so they can be adjusted without requiring a reindex. Feel free to play with these settings, but please let us know if you settle on a different weighting so we can make sure to record this.
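As a rough illustration of what the weights do, here is a small Python sketch. The "index^weight" notation is the boost syntax Elastic accepts in a list of fields; whether Koha passes the weights in exactly this form is an assumption here. The toy_score function is a made-up simplification, since real relevancy scores involve much more than adding up weights.

```python
DEFAULT_WEIGHTS = {"title": 32, "author": 16, "subject": 8, "title-series": 4, "contents": 2}


def boosted_fields(weights):
    """Write each index with its weight in Elastic's field boost notation."""
    return [f"{index}^{weight}" for index, weight in weights.items()]


def toy_score(record, term, weights):
    """Add up the weight of every index containing the term (blank weight = 1)."""
    return sum(weights.get(index, 1) for index, words in record.items() if term in words)


in_title = {"title": {"dracula"}, "contents": {"castle"}}
in_contents = {"title": {"vampires"}, "contents": {"dracula"}}

print(boosted_fields(DEFAULT_WEIGHTS))
# The record with the term in its title outranks the one with it in the contents.
print(toy_score(in_title, "dracula", DEFAULT_WEIGHTS) > toy_score(in_contents, "dracula", DEFAULT_WEIGHTS))
```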

Jumping back to the “Bibliographic records” tab, you’ll find a small table at the very bottom of the page that controls the display of your facets. We can uncheck the box for a given facet to disable it completely or drag the facets up and down the list to change the order. Tell us if you want to change things here so we can make sure those changes are retained should we need to rebuild your Elastic indices.

Elastic changes facets in one more way; it builds your facets from all of the search results it has. That means it does not use the MaxRecordsForFacets system preference at all.

Searching

As with Zebra, the default search in Elastic is a general keyword search. If we do not specify a search index, Koha interprets that as a search in the keyword index.

In Zebra, this meant searching the entire MARC record. In Elasticsearch the keyword index contains only the MARC fields that are included in at least one other index. Generally, this is a helpful thing as it allows us to wholly ignore parts of our records that we don’t care about. Koha has the option to specify which of our search indices are included in a keyword search and to make that differ between the staff and public catalogs. That would allow us, for example, to make the 952$x (non-public item note) factor into staff searches but not public searches. Additionally, there is an option in advanced search to search the entire MARC record as Zebra did, if you prefer.

Specifying an index

When performing a search in Koha, the user can specify which index to search. Again, the basic process for this hasn’t changed in Elastic. To search the title index, you can set the search dropdown to “Title.”

Or, if you prefer, you can leave the dropdown alone and specify your index using CCL (Common Command Language).

Saying “title:batman and robin” tells Elastic to look for “batman and robin” in the title index. Your search engine configuration page will give you a list of all your search indices and which MARC fields they include. While the search dropdown has a default set of options, we can add any you like, if there’s something you search often and would prefer not to use CCL for.

Once you’ve specified an index, it will be used for all following terms in your query until you specify something else.

That means “title: batman harley” looks for both “batman” and “harley” in the title.

Whereas “title:batman kw:harley” looks for “batman” in the title and “harley” in the keyword index.

Of course, you could accomplish that same search with “harley title:batman.” Since “harley” comes before we specify the title field, it defaults to keyword.

You can mix and match any indices in this way. Searching “title:batman author:dini” gives us records with “batman” in the title and “dini” in the author.
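The "index sticks until you change it" rule can be sketched in a few lines of Python. This is a toy reading of the rule as described above, not the parser Koha or Elastic actually uses; the function name is made up, and it deliberately ignores Boolean operators, quotation marks, and parentheses.

```python
def assign_indexes(query, default="kw"):
    """Pair each search word with the index it would be looked up in."""
    pairs = []
    current = default
    for token in query.split():
        if ":" in token:
            prefix, _, word = token.partition(":")
            current = prefix
            if not word:
                # "title: batman" leaves the word in the next token.
                continue
        else:
            word = token
        pairs.append((current, word))
    return pairs


print(assign_indexes("title: batman harley"))
print(assign_indexes("title:batman kw:harley"))
print(assign_indexes("harley title:batman"))
```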

Boolean Operators

In Koha, Elastic is configured to assume all our search terms are connected with AND operators. We can specify different Boolean operators just as we did in Zebra, either by using the dropdown on the advanced search page or by typing the operators into the Koha search bar. The typical Boolean operators are AND, OR, and NOT. Koha will return results for a lowercase "and," but the Elasticsearch documentation gives the operators in all capital letters (such as OR or AND), and the capitalized versions return more complete results.

In advanced search, set your Boolean dropdown to the operator you want. The screenshot above shows a search that will bring back records with either “batman” or “superman” (or both).

The same search can be done via CCL by typing “batman OR superman.” Your Boolean operator needs to be in all caps. A search for “batman or superman” will look for the word “or.”

Your operators will be applied in the order AND, OR, NOT. This order of operations can result in some unexpected processing. You can use parentheses to force explicit grouping, which can clarify things.

This returns records that contain “wonder woman” and either “batman” or “superman.”

A Boolean operator makes Elastic forget which index you were looking in, so the query “title:batman OR superman” is the same as “title:batman OR kw:superman.” You can correct for this by repeating your index -- “title:batman OR title:superman” -- or by using parentheses -- “title:(batman OR superman).”
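If you build searches in a script or a saved link, a small helper can apply the parentheses fix for you. The Python sketch below is only a convenience for assembling the query text; the function name is made up and nothing here is part of Koha.

```python
def scoped(index, terms, operator="OR"):
    """Build a query that keeps every term inside one index, e.g. title:(a OR b)."""
    if operator not in {"AND", "OR"}:
        # Lowercase "or" would be searched as a word, so refuse it here.
        raise ValueError("operator must be AND or OR, in all caps")
    joined = f" {operator} ".join(terms)
    return f"{index}:({joined})"


print(scoped("title", ["batman", "superman"]))
```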

Hyphens

In Elasticsearch, a hyphen can operate as an "or." A search for a term like steam-boiler will return results that contain either steam or boiler. To get the expected results, put the term in quotation marks ("steam-boiler") to force Elasticsearch to treat it as a single term.

Another example is searching for U-2 boats by typing U2 rather than U-2. U2 is a single term, but it does not return the expected results because it matches neither "U-2" nor a separate U or 2. One option is to add U2 to a 246 (varying form of title) tag in the record so that the record appears in search results.

Wildcards and truncation

Elastic supports two different wildcard characters.

A question mark stands in for one character. So “batm?n” will match “batman” or “batmen” or any other word you can make by shoving a letter or number between “batm” and “n.” But remember it will look for exactly one character to replace that question mark. That means “bat?man” won’t find “batman” or “batwoman” but would find “bathman.”

An asterisk stands in for zero or more characters. So “bat*man” will find “batman,” “batwoman,” “bathman,” and even more things. Any word that starts with “bat” and ends in “man.”
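Python's standard fnmatch module happens to use the same meaning for these two characters, so it makes a handy way to try patterns out. This is only a stand-in for experimenting, not how Elastic evaluates wildcards; fnmatch also gives square brackets a special meaning that the patterns here avoid.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, word):
    """True when the word fits the pattern: ? is one character, * is zero or more."""
    return fnmatchcase(word, pattern)


words = ["batman", "batmen", "batwoman", "bathman"]
for pattern in ["batm?n", "bat?man", "bat*man"]:
    print(pattern, [w for w in words if wildcard_match(pattern, w)])
```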

Stemming

Zebra has a feature called stemming that’s related to truncating. It’s controlled by the QueryStemming system preference. It does things like returning “enabled” when you search for “enabling.” Koha’s Elastic searching doesn’t currently do anything like that, and the QueryStemming system preference has no effect while you’re using Elastic. However, Elastic does have its own stemming options and that’s something we expect to explore more fully going forward.

Phrase searching

In Elastic, you can force an exact match using quotation marks.

Searching for “batman superman” with quotation marks only returns records with those two words next to each other in that order. In Zebra, you had to select a special “as phrase” search option from your search dropdown to do this (like “title as phrase” or “subject as phrase”). Those options still exist in Elastic, but all they do is insert quotation marks around your search for you.

Be aware that Elastic will ignore any wildcards within quotation marks, since quotation marks mean you want exact matches only.

Ranges

You can make Elastic search a range of values in several ways. This mostly applies to indices of numeric fields, like date-of-publication, which holds the publication year from the 008 field.

Square brackets like I’ve used here are inclusive, so my search is for anything published in 2010, 2011, or 2012. If you use curly brackets like “{2010 TO 2012}” the range would be exclusive, meaning it would only find things published in 2011. You can even get fancy and mix them up, like “[2010 TO 2012},” which would be inclusive on the low end but exclusive on the high end. Just like your Boolean operators, “TO” needs to be all caps.

You can also use greater and lesser than symbols for number-based searches. The search above returns everything with a publication date greater than or equal to 2010.
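The inclusive and exclusive bracket rules can be checked with a short Python sketch. It only models the bracket logic described above for whole numbers; the function name is made up, and it is not how Elastic itself parses ranges.

```python
import re

# Opening bracket, low value, the word TO in caps, high value, closing bracket.
RANGE = re.compile(r"^([\[{])\s*(\d+)\s+TO\s+(\d+)\s*([\]}])$")


def in_range(expression, value):
    """True when value falls in a range such as "[2010 TO 2012}"."""
    match = RANGE.match(expression)
    if match is None:
        raise ValueError(f"not a range expression: {expression!r}")
    opening, low, high, closing = match.groups()
    low, high = int(low), int(high)
    above = value >= low if opening == "[" else value > low
    below = value <= high if closing == "]" else value < high
    return above and below


years = range(2008, 2015)
for expression in ["[2010 TO 2012]", "{2010 TO 2012}", "[2010 TO 2012}"]:
    print(expression, [year for year in years if in_range(expression, year)])
```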

These searches won’t work well with fields that aren’t strictly numeric. If your 245 tags contain a subfield for data like “Volume 6” or “Season 2,” Elastic won’t know how to discard the word and look only at the number. However, this is exactly the sort of functionality folks in the community are currently working on, so that may change!

Negating and requiring search terms

If you want to make sure your results don’t include a specific term, you can negate it with a minus sign.

A search for “batman -joker” returns records that contain “batman” but not “joker.” You can also make terms required by adding a plus sign, but that’s redundant because we default to connecting all of our terms with AND, which also makes them required.

Fuzzy searching

In Zebra, turning on the QueryFuzzy system preference made all of your searches look for similarly-spelled words. It was sort of ill-defined and unpredictable and we tended to suggest folks not use it. In Elastic, turning QueryFuzzy on doesn’t change your search results on its own, but it gives you the option of making any individual term in your search fuzzy by putting a tilde after it.

A search for “batman azzarelo~” looks for records with “batman” spelled just as you’ve spelled it but “azzarelo” with some spelling variation. How much variation is allowed is based on how long your fuzzy word is: a word six or more characters long allows up to two changes, a word three to five characters long allows one change, and a word just one or two characters long doesn’t allow any changes (so making it fuzzy doesn’t do anything). A change here means swapping two neighboring letters, replacing a letter, adding a letter, or removing a letter. So two changes in “azzarelo” is enough to make it find the correct spelling of this author’s name, “azzarello.”
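To see how the counting works, here is a Python sketch that applies the length rule and counts changes between two words. The change counter is a standard textbook edit distance that treats a swap of two neighboring letters as one change; it is meant as an illustration of the idea and is not Elastic's own code. The function names are made up.

```python
def allowed_edits(word):
    """Number of changes a fuzzy term may use, based on its length."""
    if len(word) <= 2:
        return 0
    if len(word) <= 5:
        return 1
    return 2


def edit_distance(a, b):
    """Count replacements, additions, removals, and neighboring swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Two neighboring letters swapped counts as a single change.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(search_term, indexed_word):
    """True when the indexed word is within the changes allowed for the term."""
    return edit_distance(search_term, indexed_word) <= allowed_edits(search_term)


print(edit_distance("azzarelo", "azzarello"), fuzzy_match("azzarelo", "azzarello"))
```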

Proximity

A proximity search lets you find two words within a certain distance of each other.

To perform a proximity search, put your terms in quotes and then add a tilde and a number. So “batman robin”~1 gives us records in which “batman” and “robin” appear within one word of each other. That would include “batman and robin” or “batman & robin.” Note that when we say words here we’re using the term loosely, basically to mean a group of characters separated from other characters by spaces. So in this context an ampersand is considered a word.

Now, technically the number here isn’t a count of words between our terms. It’s a count of changes needed to make our record match our search (sort of like how fuzziness counted changes to letters). It takes one edit (removing the “and”) to make “batman and robin” match “batman robin.” Following this idea of counting edits, two edits allows us to transpose our words. So “batman robin”~2 would match “robin batman.” And “batman robin”~3 would match “robin and batman.”
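The edit counting above can be reproduced with a simplified model in Python: for each search word, compare where it sits in the record with where it sits in the phrase, then take the spread of those differences. This is an approximation for short phrases where each word appears once, offered as an assumption about how the counting works rather than as Elastic's actual scoring code. The function name is made up.

```python
def slop_needed(phrase, text):
    """Smallest number after the tilde that lets the phrase match the text."""
    phrase_words = phrase.split()
    text_words = text.split()
    shifts = []
    for offset, word in enumerate(phrase_words):
        if word not in text_words:
            return None  # a search word is missing, so no number will help
        # How far the word sits from where the phrase expects it.
        shifts.append(text_words.index(word) - offset)
    return max(shifts) - min(shifts)


for text in ["batman robin", "batman and robin", "robin batman", "robin and batman"]:
    print(text, slop_needed("batman robin", text))
```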

Boosting relevancy

In the Weighting section above, we talk about how to define weightings to configure how your search results are ordered. Elastic will also let you use the boost function to give a specific term some extra importance in any given search.

To boost a term in your search, follow it with a caret and a number. A search for “batman robin stephanie^10” returns records with those three words, but makes “stephanie” more important in deciding which order to display your results in. Because the default weighting is 1, you can also use a boost value between 0 and 1 to reduce a term’s importance in your search results.

Escaping punctuation

Many of the search features discussed here use punctuation marks to let Elastic know you’re doing something special. If you want to perform a search that includes one of these punctuation marks, you need to tell Elastic to ignore the punctuation’s special meaning. In coding, this is referred to as escaping the punctuation mark.

To escape a punctuation mark, put a backslash before it. So a search for “title:batman\: year one” searches for “batman: year one” directly without trying to use the colon to do something special. The following punctuation marks need to be escaped if included in your search: +, -, =, &&, ||, >, <, !, (, ), {, }, [, ], ^, ", ~, *, ?, :, \, and /.

Of course, you should usually just be able to leave those punctuation marks out of your search entirely, rather than worrying about escaping them. In my example above, “title: batman year one” without the colon would have found the same title without a problem.
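If you are assembling searches in a script, the escaping can be automated. The Python sketch below puts a backslash in front of each special character from the list above. One assumption to flag: Elastic's own documentation says the greater-than and less-than signs cannot be escaped and have to be removed instead, so this sketch drops them. The function name is made up.

```python
# Characters that take a backslash; & and | are escaped one character at a time.
RESERVED = set('+-=&|!(){}[]^"~*?:\\/')
# Assumption from Elastic's documentation: these can only be removed.
UNESCAPABLE = set("<>")


def escape_query(text):
    """Return the text with special search punctuation neutralized."""
    escaped = []
    for character in text:
        if character in UNESCAPABLE:
            continue
        if character in RESERVED:
            escaped.append("\\" + character)
        else:
            escaped.append(character)
    return "".join(escaped)


print("title:" + escape_query("batman: year one"))
```

Only the search words themselves should be passed through a function like this, not the index prefix, or the colon after "title" would be escaped too.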

Ghost Records

A ghost record occurs when the request to add or delete a record does not make it to the Elasticsearch cluster. This is essentially a type of timeout error.

Syntax

Syntax Cheatsheet
