The Unified Scottish Gaelic Wordnet

The Unified Scottish Gaelic Wordnet (USGW) is a freely available, English-aligned lexico-semantic resource for the Scottish Gaelic (Gàidhlig) language. It was built by merging two sources:

  • the larger part (~6,500 words) consists of original lexical translations and glosses from English, provided by professional translators who are fluent or native speakers of Gaelic;
  • the smaller part (~4,300 words) comes from the existing wordnet of the Extended Open Multilingual Wordnet project, which was converted from Wiktionary data.

The translated data was thoroughly validated by professionals and can be considered accurate.

[Figure: decrease in the number of Gaelic speakers in Scotland over the last hundred years]

Scottish Gaelic, a Celtic language, derives from Middle Irish but is today considered a distinct language. It was once spoken throughout Scotland, but its current speakers are estimated at fewer than 60,000, and it is considered an endangered language.

This new, considerably extended wordnet targets several communities: speakers and learners of the language, linguists, and computer scientists working on natural language processing. By publishing it as a freely downloadable resource, we hope to contribute to the long-term preservation of Scottish Gaelic as a living language, both offline and on the Web.

Download

The wordnet can be downloaded below in TAB format (the same format used by the Open Multilingual Wordnet). For details on the TAB format, see here. Note that we extended the three tab-separated columns of the TAB format with a fourth column that provides succinct provenance information for lemmas, glosses, and examples. In this provenance column, ‘EOMW’ marks entries retrieved from the Extended Open Multilingual Wordnet, while ‘DS’ marks entries provided by the DataScientia Foundation, i.e. our original translations.
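
As an illustration of the four-column layout described above, here is a minimal Python sketch that reads such lines and counts entries per provenance label. It is a sketch under assumptions, not an official loader: the column order (synset ID, entry type, value, provenance), the ‘gla:lemma’-style type labels, the ‘#’ comment prefix, and all sample values are assumed for the example and are not taken from the actual file.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    synset: str      # synset identifier (assumed first column)
    kind: str        # entry type, e.g. "gla:lemma" (assumed label style)
    value: str       # the lemma, gloss, or example text
    provenance: str  # "EOMW" or "DS", as described in the text


def parse_tab(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed four-column line."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # This sketch silently ignores lines that do not have exactly four columns.
        if len(fields) != 4:
            continue
        yield Entry(*fields)


# Invented placeholder lines, used only to exercise the parser.
sample = [
    "# hypothetical header line\n",
    "00000001-n\tgla:lemma\tplaceholder-lemma-1\tDS\n",
    "00000001-n\tgla:def\tplaceholder gloss text\tDS\n",
    "00000002-v\tgla:lemma\tplaceholder-lemma-2\tEOMW\n",
    "a line with too few columns\n",
]

entries = list(parse_tab(sample))
provenance_counts = Counter(entry.provenance for entry in entries)
```

To apply this to the downloaded file, one could pass an open file handle (opened with `encoding="utf-8"`) to `parse_tab` in place of `sample`; `provenance_counts` then gives the number of entries per source.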

Licence

The wordnet is distributed under the CC BY-SA 3.0 licence.

Statistics

The current version of the wordnet contains:

  • 10,472 words;
  • 15,264 senses;
  • 12,953 synsets;
  •  8,247 glosses;
  • XXX lexical gaps.

As of 2019, this resource is among the 30 largest wordnets in the world.

Credits

The creation of this wordnet was co-funded by Heriot-Watt University and the DataScientia initiative of the University of Trento. Part of the funding came from the EU’s Horizon 2020 programme under Grant Agreement number 823783, within the Future and Emerging Technologies (FET) research funding scheme.

People who have worked on this project so far:

  • Rody Gorman: translation;
  • Kirsty MacDonald: validation and translation;
  • Gábor Bella: research and supervision;
  • Fiona McNeill: research and supervision;
  • Caoimhín Ó Donnaíle: research and consulting;
  • Yamini Chandrashekar: development;
  • Abed Alhakim Freihat: development.