What is CogNet? It is a large-scale database of cognate pairs: it contains 5.9 million cognates in 338 languages, 38 writing systems, and 91285 concepts. It was automatically constructed from wordnets and dictionaries contained within the UKC resource, as described in our paper.
What are cognates? In short, cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre. CogNet links its cognates to Princeton WordNet synsets, making the shared meanings explicit.
Why are cognates important? Cognates have been extensively studied in the fields of language typology and historical linguistics, as they are considered useful for researching the relatedness of languages. Cognates are also used in computational linguistics, e.g., for lexicon extension or to improve cross-lingual NLP tasks such as machine translation or bilingual word recognition.
How is CogNet licenced? Under CC-BY-SA-NC-4.0.
Where can I download CogNet? All versions of CogNet can be downloaded from the CogNet project GitHub page. See the table below for the contents of each version.
|Version||# Cognates||# Words||# Concepts||Precision||Kinds of evidence used|
|CogNet v2||8.14 M||1.07 M||91 k||95.62%||etymological, phonetic, orthographic, geographic|
|CogNet v1||3.16 M||570 k||81 k||93.94%||etymological, orthographic, geographic|
|CogNet v0||60 k||90 k||9 k||100%||etymological only|
You can also download WikTra, the Wiktionary-based transliteration tool used for building CogNet.
How can I explore cognate data? Besides downloading the entire CogNet as a structured text file, you can also use the Linguarena website to display and browse (currently an older version of) cognate data interactively on a world map, as also shown in the figure above.
How is CogNet structured? Each row in the resource represents a cognate instance which is formed by the following individuals (columns are tab-separated):
|concept||A code used by Princeton WordNet 3.0 to represent a synset.|
|language 1||the 3-letter iso code for the first language|
|word 1||a word in the language 1|
|language 2||the 3-letter iso code for the second language|
|word 2||a word in the language 2|
|evidence||direct etymological or indirect algorithmic|
|transliteration 1||a romanized word for the first word|
|tranlisteration 2||a romanized word for the second word|