Laura Buszard-Welcher, Rosetta Project/UC Berkeley


FIELD (Field Input Environment for Linguistic Data):
An Ontology-Backed Lexical Database Tool


As linguists strive to document endangered languages, there is a pressing need for tools that facilitate data collection and analysis while conforming to best practice in digital archiving. To address this need, the E-MELD project (Electronic Metastructure for Endangered Languages Data) has produced FIELD, a Web-based software tool that supports the development and sharing of lexical databases, while creating digital archives in accordance with the best practice recommendations of the E-MELD community. Originally developed as an in-house tool to facilitate the conversion of legacy datasets for endangered languages, FIELD has grown in functionality to become a sophisticated lexical database tool. It has the ease of use of other popular database programs, as well as the added benefit of exporting best practice lexical data for archival purposes.

FIELD’s support for the E-MELD best practice recommendations is briefly outlined below:

1) Irreplaceable data on endangered languages should be archived in an XML file with a schema that conforms to best practice. FIELD users can export their data at any time as an XML file. (An option to export data as a tab-delimited text file is also available, since this format is widely supported by commercially available database programs.) The XML file is validated against a schema file, which enables data interchange with third-party software (for example, software for text annotation and analysis). The XML file can then be rendered by stylesheets into viewer-friendly formats for on-line or print display. (The E-MELD School of Best Practice has several examples of stylesheets, each designed for a different purpose.)

2) XML markup tags should be provided by a common linguistic ontology. When creating a new FIELD lexicon, the user must first set up a ‘language profile’, choosing the set of grammatical concepts found in the language (lexical and morphosyntactic categories) and mapping their own terms for these concepts to those provided by a common linguistic ontology (GOLD). When users export their data, this mapping is incorporated into the XML markup of the archive file, ensuring that the markup will be intelligible to future generations. Another benefit of terminology mapping is that it enables searches across electronic language resources. The FIELD ‘search across languages’ function makes use of the mapping, and we anticipate that other ontology-based search engines will be developed in the near future.

3) Data should be encoded in Unicode. FIELD fully supports Unicode, using it for data input, display, and storage. In addition, we have developed several means of facilitating the entry of IPA and other commonly used international characters from within FIELD. The first is Charwrite, a program that opens an interactive IPA chart when users double-click in a text field. If a user enters a character and right-clicks (or control-clicks on a Mac), Charwrite opens a pop-up menu of similar characters, any of which can be selected to replace the entered character. FIELD also allows users to define a set of keyboard shortcuts to speed up the entry of international characters.
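
The three recommendations above can be illustrated together with a minimal Python sketch. It is not FIELD's implementation and does not reproduce the FIELD export schema: the element names (lexicon, entry, headword, pos, gloss), the concept attribute, the profile labels, and the sample words are all invented for illustration, and the concept names are placeholders that have not been checked against GOLD. The sketch assumes only that a language profile maps the linguist's own category labels to ontology concepts, that the mapping travels with each exported entry, and that the archive file is Unicode (UTF-8) XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical language profile: the linguist's own labels mapped to
# ontology concept names (placeholders, not verified GOLD identifiers).
PROFILE = {"n": "Noun", "v": "Verb"}


def export_lexicon(rows, profile):
    """Serialize (headword, pos_label, gloss) rows as UTF-8 XML bytes.

    Each <pos> element keeps the linguist's label as text and carries the
    mapped ontology concept in a 'concept' attribute.
    """
    root = ET.Element("lexicon")
    for headword, pos_label, gloss in rows:
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "headword").text = headword
        pos = ET.SubElement(entry, "pos", {"concept": profile[pos_label]})
        pos.text = pos_label
        ET.SubElement(entry, "gloss").text = gloss
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def headwords_with_concept(xml_bytes, concept):
    """Return headwords whose part of speech maps to the given concept."""
    root = ET.fromstring(xml_bytes)
    return [
        entry.findtext("headword")
        for entry in root.iter("entry")
        if entry.find("pos").get("concept") == concept
    ]


if __name__ == "__main__":
    # Invented sample forms containing IPA characters; not real language data.
    sample = [("tɑŋa", "n", "stone"), ("miʃu", "v", "run")]
    archive = export_lexicon(sample, PROFILE)
    print(archive.decode("utf-8"))
    print(headwords_with_concept(archive, "Verb"))
```

Because the search function looks only at the concept attribute, it would work unchanged on a lexicon whose profile uses different labels for the same concepts, which is the property that a search across languages relies on.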

This poster presentation will demonstrate the main functionalities of the FIELD program using screen shots and descriptive text, as follows:
  • setting up a ‘language profile’ that links language-specific grammatical concepts and terminology to a morphosyntactic ontology (GOLD),
  • data entry screens and data upload,
  • keyboard entry of international characters and search options,
  • data output as XML for archival purposes,
  • stylesheets for on-line and print presentation forms, and
  • a new feature that enables collaborative lexicographical work.

The development version of FIELD is available in the E-MELD School of Best Practice at http://emeld.org/school/workroom/lexicon/index.html. FIELD currently houses substantial lexical data for six typologically diverse languages: Biao Min (Hmong-Mien), Mocoví (Guaicuruan), Potawatomi and Ottawa (Algonquian), Monguor (Southeastern Mongolic), and Ega (Kwa), and we anticipate the addition of another three languages. These language databases are available for public searching at http://emeld.org/school/search/searchlang/.