How to Prevent Diacritics from Misbehaving in Koha

What are diacritics?

Have you ever noticed a funky symbol in your catalog that looks like a diamond with a question mark in the middle? Here’s an example:

Gabriel Garc�ia M�arquez

Are you wondering what this symbol means and why it appears in your records? This is a diacritic behaving badly. Instead of the diamond with the question mark, you should be seeing an umlaut, an acute accent, or any of a number of other diacritics (most commonly found in non-English languages).

If you see these diacritics gone wrong in your catalog, it probably happened when the record was imported.
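To see why the mismatch produces that diamond symbol, here is a minimal Python sketch. It uses Latin-1 as a stand-in for a non-UTF-8 encoding (Python has no built-in MARC-8 codec), but the mechanism is the same: bytes written in one encoding and read back as another turn into U+FFFD, the Unicode "replacement character" that renders as the diamond with the question mark.

```python
name = "Gabriel García Márquez"

# Encode in Latin-1, standing in for any non-UTF-8 encoding such as MARC-8.
raw_bytes = name.encode("latin-1")

# Reading those bytes back as UTF-8 fails on the accented letters,
# and each invalid byte becomes the replacement character U+FFFD.
garbled = raw_bytes.decode("utf-8", errors="replace")
print(garbled)  # the í and á come out as the diamond symbol
```

This is exactly what happens on import: the software cannot tell you the encoding is wrong, so it substitutes the replacement character wherever the bytes make no sense.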

Cataloging

The problem above arises when you export records from OCLC in one encoding and then import them into Koha as another. The encoding must be the same at both ends. The two encoding formats most commonly used in Koha are:
  1. MARC-8
  2. UTF-8
To find out more about these character encodings, see the Resources section below.

When you import records into Koha from Tools > Stage MARC Records for Import, you can choose what type of encoding to use: MARC-8 or UTF-8. In Koha, UTF-8 is the default import encoding.

Below is a screenshot of importing records into Koha, where the encoding option lives:



The Koha manual has more info on importing MARC records.

Next, double-check your OCLC export settings and verify that the default export encoding is also UTF-8. To modify your export settings in OCLC:
  1. Go to the Export Options screen.
  2. On the General tab, click Admin.
  3. On the Preferences screen, click Export Options.
  4. When you're done modifying these preferences, select "Save My Default" so this setting applies every time you export.
Please see the OCLC manual for more info on exporting MARC records.

Remember: the important thing is that the encoding matches! If it's UTF-8 coming out of OCLC, it also needs to be UTF-8 going into Koha. You must use the same encoding all the way through the pipeline. Otherwise, it's like asking your computer to use a French dictionary to translate Russian words, and the � symbol peppered throughout your catalog is effectively the computer asking, "What IS this character, and what am I supposed to do with it?!"

If you are seeing bad diacritics and your library does not get records from OCLC, check with the vendor supplying your records and find out what encoding they are using.
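If you want to check a vendor file yourself, MARC 21 records declare their own encoding: position 9 of the 24-byte leader is "a" for Unicode (UTF-8) and a blank for MARC-8. A small sketch along these lines can read that flag from the first record (the function name and path here are illustrative, not part of Koha):

```python
def declared_marc_encoding(path):
    """Report the encoding declared in a MARC file's first record.

    In MARC 21, leader position 9 is 'a' for Unicode (UTF-8)
    and a blank for MARC-8.
    """
    with open(path, "rb") as f:
        leader = f.read(24)  # a MARC leader is always 24 bytes
    flag = chr(leader[9])
    if flag == "a":
        return "UTF-8"
    if flag == " ":
        return "MARC-8"
    return f"unknown flag {flag!r}"
```

Note this only reports what the record claims; a vendor can still label records incorrectly, so it is worth spot-checking a few diacritics after a test import.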

If your data contains special characters or diacritics, make sure your file is encoded in UTF-8. Otherwise, the special characters will not be imported correctly.
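A quick way to verify this before staging the file is to try decoding the whole thing as UTF-8. This is an illustrative helper, not part of Koha: if the decode fails, the file is in some other encoding, such as MARC-8 or Latin-1.

```python
def is_valid_utf8(path):
    """Return True if the entire file decodes cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

If this returns False, either re-export the file in UTF-8 or choose the matching encoding on Koha's staging screen instead.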

Resources

What is Marc8?

The MARC-8 encoding standard was developed in 1968 specifically for libraries, alongside the original MARC format.

What is UTF-8?

Much later, in 1993, the UTF-8 encoding standard was released. UTF-8 supports every character in the Unicode character set, which includes over 110,000 characters covering 100 scripts. UTF-8 supports far more characters than MARC-8 and has become the dominant character encoding for the internet.
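One reason UTF-8 can cover so many scripts is that it is variable-width, which a short Python sketch makes concrete: plain ASCII letters stay one byte, accented Latin letters take two, and many other scripts take three or more.

```python
# UTF-8 uses a different number of bytes depending on the character.
for ch in ("e", "é", "中"):
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
```

Those multi-byte sequences are exactly what gets mangled when a file is read with the wrong encoding: the extra bytes are misinterpreted as separate, invalid characters.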

How Do I Prevent Z39.50 Records from Coming in with Bad Diacritics?

As explained above, diacritics pop up in Koha when a record encoded in one format is brought into Koha with a different encoding.

Fortunately, in a Z39.50 search, you can preview the MARC record to know that an encoding mismatch has occurred, like this example Pokémon graphic novel:


Take note of which Z39.50 target this record came from.

Under Koha Administration > Additional parameters, select Z39.50/SRU servers. You will see a list of your current Z39.50 targets. Find the target whose records had bad diacritics, and in the Actions menu in the far-right column of the results, choose Edit. About halfway down the list of settings is Encoding.



Most targets' encoding will be either UTF-8 or MARC-8. If the target was set to UTF-8 when you spotted the bad diacritics, switch it to MARC-8 and save; if it was set to MARC-8, switch it to UTF-8.

Navigate back to your Z39.50 search and look up the same record. The diacritics should now display and import correctly.
Related Articles

    • Importing Records into Koha

      A library that receives a file of MARC records from a vendor can import them into Koha. This is a simple process done through Koha's Cataloging Module The file of MARC records should be a file with the file extension of .mrc Go To Cataloging > Stage ...
    • Cataloging Concerns

      This feature can be set up for either staff or OPAC (or both!) and will allow staff and/or patrons to report issues with catalog records. The reported concerns will be visible in the dashboard on the main page of the staff interface and available ...
    • Monograph Cataloging

      Monograph cataloging is used to catalog a non-serial publication contained in one volume (book) that may be part of a set. Harry Potter could be cataloged as a monograph set with volumes, though monographs are typically academic titles containing ...
    • Fast Add Cataloging

      Within the Cataloging module, an option to utilize the Fast Add option appears. Fast Add cataloging is useful in a few ways. Many libraries use Fast Add for their ILL process. Academic Libraries may use Fast Add when adding in faculty-owned material ...
    • Analytical Cataloging

      Libraries sometimes make journal articles and articles within monographs and serials accessible to library patrons through analytical cataloging. Analytical cataloging creates separate bibliographic records for these articles, chapters, sections, ...