The Unified Scottish Gaelic Wordnet (USGW) is a freely available, English-aligned lexico-semantic resource for the Scottish Gaelic (Gàidhlig) language. It was built by merging two sources:
- in a larger part (~6,500 words), original lexical translations and glosses from English, provided by professional translators who are fluent or native speakers of Gaelic;
- in a smaller part (~4,300 words), the existing wordnet from the Extended Open Multilingual Wordnet project, converted from Wiktionary data.
The translated data was thoroughly validated by professionals and can be considered accurate.
Scottish Gaelic, a Celtic language, derives from Middle Irish but is today considered a distinct language. Although it was once spoken throughout Scotland, its current speakers are estimated to number fewer than 60,000, and it is considered an endangered language.
This new, considerably extended wordnet targets multiple communities: language speakers and learners, linguists, and computer scientists working on natural language processing. By publishing it as a freely downloadable resource, we hope to contribute to the long-term preservation of Scottish Gaelic as a living language, both offline and on the Web.
The wordnet can be downloaded below in TAB format (the same format used by the Open Multilingual Wordnet). For details on the TAB format, see here. We extended the three tab-separated columns of the TAB format with a fourth column that provides succinct provenance information for lemmas, glosses, and examples. In this provenance column, ‘EOMW’ marks entries retrieved from the Extended Open Multilingual Wordnet, while ‘DS’ marks entries provided by the DataScientia Foundation, i.e. our original translations.
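As a rough illustration of how the extended format might be read, here is a minimal Python sketch. It assumes that each data line starts with a synset identifier and an entry type, ends with the provenance code, and that lines beginning with `#` are header or comment lines; the names `Entry`, `parse_tab`, and `provenance_counts` are hypothetical helpers, and the sample rows are invented for illustration rather than copied from the resource. Please check the exact column layout against the downloaded file.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    synset: str      # first column: synset identifier (assumed layout)
    kind: str        # second column: entry type, e.g. lemma, gloss, example
    value: str       # everything between the type and the provenance column
    provenance: str  # last column: "EOMW" or "DS"


def parse_tab(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per data line, skipping blank and '#' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            # No provenance column on this line; skip it in this sketch.
            continue
        yield Entry(fields[0], fields[1], "\t".join(fields[2:-1]), fields[-1])


def provenance_counts(entries: Iterable[Entry]) -> Counter:
    """Count how many entries come from each provenance source."""
    return Counter(entry.provenance for entry in entries)


if __name__ == "__main__":
    # Invented sample rows; the identifiers and type labels are placeholders.
    sample = [
        "# header line (skipped)",
        "00000001-n\tgla:lemma\tcù\tDS",
        "00000002-n\tgla:lemma\tuisge\tEOMW",
    ]
    for source, count in sorted(provenance_counts(parse_tab(sample)).items()):
        print(source, count)
```

To read the real file, the same function could be passed an open file handle, for example `parse_tab(open(path, encoding="utf-8"))`, where `path` points to the downloaded TAB file.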
If you use this resource in your research or project, or find it relevant to your work, please cite the following LREC paper, which describes in detail the contents of the wordnet and how it was built and validated.
Gábor Bella, Fiona McNeill, Rody Gorman, Caoimhín Ó Donnaíle, Kirsty MacDonald, Yamini Chandrashekar, Abed Alhakim Freihat, and Fausto Giunchiglia: A Major Wordnet for a Minority Language: Scottish Gaelic. 12th Language Resources and Evaluation Conference (LREC), 2020, Marseille, France.
The wordnet is distributed under the CC BY-SA 3.0 licence.
The current version of the wordnet contains:
- over 10k words;
- over 15k word senses;
- over 13k synsets;
- over 8k glosses;
- over 600 Gaelic lexical gaps (English words without Gaelic equivalents);
- over 70 English lexical gaps (Gaelic words without English equivalents).
As of 2019, this resource is among the 30 largest wordnets in the world.
The creation of this wordnet was co-funded by Heriot-Watt University and the DataScientia initiative of the University of Trento. Part of the funding came from the European Union’s Horizon 2020 research and innovation programme under grant agreement number 826106.
People who have worked on this project so far:
- Rody Gorman: translation;
- Kirsty MacDonald: validation and translation;
- Gábor Bella: research and supervision;
- Fiona McNeill: research and supervision;
- Caoimhín Ó Donnaíle: research and consulting;
- Yamini Chandrashekar: development;
- Abed Alhakim Freihat: development.