CogNet

map-based illustration of cognate data

What is CogNet? It is a large-scale database of cognate pairs, containing 5.9 million cognates across 338 languages, 38 writing systems, and 91,285 concepts. It was constructed automatically from the wordnets and dictionaries contained in the UKC resource, as described in our paper.

What are cognates? In short, cognates are words in different languages that share a common origin and the same meaning, such as the English "letter" and the French "lettre". CogNet links its cognates to Princeton WordNet synsets, making the shared meanings explicit.

Why are cognates important? Cognates have been extensively studied in the fields of language typology and historical linguistics, as they are considered useful for researching the relatedness of languages. Cognates are also used in computational linguistics, e.g., for lexicon extension or to improve cross-lingual NLP tasks such as machine translation or bilingual word recognition.

How is CogNet licensed? Under CC-BY-SA-NC-4.0.

Where can I download CogNet? Use the links below.

  • CogNet, a cognate database v1.0 [github]
  • WikTra, a transliteration tool used for building CogNet [github]
While CogNet is free to use, we ask you to cite the following paper if you use it or the WikTra transliteration tool in your research:
Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. CogNet: A Large-Scale Cognate Database. Proceedings of ACL 2019, Florence, Italy.

How can I explore cognate data? Besides downloading the entire CogNet as a structured text file, you can use the Linguarena website to browse cognate data interactively on a world map, as shown in the figure above. Note that Linguarena currently displays an older version of the data.

How is CogNet structured? Each row in the resource represents one cognate pair and consists of the following tab-separated columns:

Column              Description
concept             the code used by Princeton WordNet 3.0 to represent a synset
language 1          the 3-letter ISO code of the first language
word 1              a word in language 1
language 2          the 3-letter ISO code of the second language
word 2              a word in language 2
evidence            direct etymological (ETY) or indirect algorithmic (ALG)
transliteration 1   the romanized form of the first word
transliteration 2   the romanized form of the second word

Example

concept lang1 word1 lang2 word2 evidence transliteration1 transliteration2
n14996158 glg polipropileno jpn ポリプロピレン ETY NO_TRANSLIT poripuropiren
n06566077 nep सफ्टवेर kas سافٹویٚیَر ALG saphtawera saftoeyar
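
As a minimal sketch of how rows in this format could be read, the Python snippet below parses tab-separated lines into named records. The names `CognatePair` and `read_cognates` are hypothetical and not part of CogNet. The sketch also assumes that a header row, if present, starts with "concept", and that NO_TRANSLIT marks a word for which no romanized form is given; both are guesses based on the example above rather than documented behaviour.

```python
import io
from typing import Iterator, NamedTuple, Optional, TextIO

# Assumption: this placeholder means that no transliteration is provided.
NO_TRANSLIT = "NO_TRANSLIT"


class CognatePair(NamedTuple):
    """One row of the resource (hypothetical record type)."""
    concept: str
    lang1: str
    word1: str
    lang2: str
    word2: str
    evidence: str
    translit1: Optional[str]
    translit2: Optional[str]


def read_cognates(stream: TextIO) -> Iterator[CognatePair]:
    """Yield one CognatePair per well-formed, tab-separated line."""
    for line in stream:
        fields = line.rstrip("\r\n").split("\t")
        # Skip malformed lines and an optional header row (assumed layout).
        if len(fields) != 8 or fields[0] == "concept":
            continue
        translit1, translit2 = (
            None if value == NO_TRANSLIT else value for value in fields[6:8]
        )
        yield CognatePair(*fields[:6], translit1, translit2)


# The two example rows from above, rebuilt with explicit tab separators.
sample = "\n".join(
    "\t".join(row)
    for row in [
        ["concept", "lang1", "word1", "lang2", "word2", "evidence",
         "transliteration1", "transliteration2"],
        ["n14996158", "glg", "polipropileno", "jpn", "ポリプロピレン",
         "ETY", "NO_TRANSLIT", "poripuropiren"],
        ["n06566077", "nep", "सफ्टवेर", "kas", "سافٹویٚیَر",
         "ALG", "saphtawera", "saftoeyar"],
    ]
)

pairs = list(read_cognates(io.StringIO(sample)))
for pair in pairs:
    print(pair.concept, pair.lang1, pair.word1, pair.lang2, pair.word2, pair.evidence)
```

To read the downloaded file instead of the in-memory sample, the same function can be given a file opened with `open(path, encoding="utf-8")`, where `path` points to wherever the CogNet file was saved.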

 

Acknowledgements