Automated Language Identification
Introduction
Assumptions
- Data is correct Unicode for the language (except possibly transliterated text).
Methods
Expert Rule Based
Learned Rule Based
Weighting Rules
- Different Weighting for Different Texts
- Points System
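A points system with per-text weighting could be sketched roughly as follows. The rule names, languages, point values and the `weights` parameter are all illustrative assumptions, not part of any existing language model.

```python
from collections import defaultdict

# Hypothetical rules: (name, test, language, points). All values are illustrative.
RULES = [
    ("has-eszett", lambda t: "ß" in t, "German", 3),
    ("has-n-tilde", lambda t: "ñ" in t, "Spanish", 3),
    ("word-der", lambda t: "der" in t.lower().split(), "German", 2),
    ("word-the", lambda t: "the" in t.lower().split(), "English", 2),
    ("word-el", lambda t: "el" in t.lower().split(), "Spanish", 2),
]


def score(text, rules=RULES, weights=None):
    """Add up the points of every rule that fires, grouped by language.

    `weights` maps a rule name to a multiplier, so that a rule can count
    for more or less depending on the kind of text being examined.
    """
    weights = weights or {}
    totals = defaultdict(float)
    for name, test, language, points in rules:
        if test(text):
            totals[language] += points * weights.get(name, 1.0)
    return dict(totals)


def identify(text, weights=None):
    """Return the language with the most points, or None if no rule fired."""
    totals = score(text, weights=weights)
    return max(totals, key=totals.get) if totals else None
```

For example, `identify("the ß")` favours German under the default weights, while passing `weights={"has-eszett": 0.5}` lets the word rule win instead.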
Writing Systems / Orthography
Verb Endings, Agglutination, Morphology
Identification of Language Groups/Families
- Levenshtein distance
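For reference, a minimal sketch of Levenshtein (edit) distance using the standard dynamic-programming formulation. How the distance would be used to group languages (for example, by comparing word lists) is left open here; the function name is just a placeholder.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]
```

With this definition, `levenshtein("nacht", "night")` is 2, whereas unrelated words of the same length tend to score close to their full length.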
Non-Language Specific Words (Proper Nouns, Jargon and Text Noise)
- Junk Characters/Words
Character Encoding
- Different Unicode encodings (UTF-8, UCS-2, UCS-4 etc.)
- Other encodings (KOI8-R etc.)
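One possible way to get from raw bytes to Unicode before any language rules run is sketched below: check for a byte order mark, then try strict UTF-8, then fall back to a legacy encoding. The function name and the choice of KOI8-R as the default fallback are assumptions for illustration; only Python's standard codecs are used.

```python
import codecs

# UTF-32 must be checked before UTF-16, because the UTF-32 LE byte order
# mark begins with the same two bytes as the UTF-16 LE one.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_guess(data: bytes, fallback: str = "koi8_r"):
    """Return (text, encoding_name) using a simple guessing order."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return data.decode(name), name
    try:
        # UTF-8 is strict: most legacy 8-bit text is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; a real system would need to choose between
        # several legacy encodings, which this sketch does not attempt.
        return data.decode(fallback), fallback
```

A single-byte encoding such as KOI8-R will accept nearly any input, which is why it is tried last here rather than first.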
Transliteration
- Identifying transliterated languages (add convention to language model?)
- PhoneticAlphabet
Unique Characters (Accents, Umlauts and Знаки)
Orthographic diacritics
Most languages that use the Latin/Roman alphabet, with the exception of English, use diacritics on characters to denote a different pronunciation or sound. In some cases, such as Icelandic, the character and diacritic together are treated as a whole new letter, whereas in French the diacritic is considered merely an add-on.
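At the Unicode level, the split between a base letter and its diacritic can be inspected with canonical decomposition. The sketch below uses Python's standard `unicodedata` module; the helper name is hypothetical.

```python
import unicodedata


def split_diacritics(ch: str):
    """Return (base, marks): the base letters of ch and the Unicode names
    of any combining marks found in its canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]
    return base, marks
```

Note that this reflects only how Unicode encodes the character. An accented letter decomposes the same way whether or not a given language counts it as a separate letter, and letters such as ð or þ have no decomposition at all, so the language-level distinction would still have to come from the language model.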
Latin
ß àáâãäåāăąȁȃ çćĉċč ďđ èéêëēĕėęěȅȇ ĝğġģ ĥħ ìíîïĩīĭįıȉȋ ĵ ķĸ ĺļľŀł ð ñńņňʼnŋ òóôõöøōŏőȍȏ ŕŗřȑȓ śŝşšș ţťŧț ùúûüũūŭůűųȕȗ ŵẅẃẁ ýŷÿỳ źżž þ ẛ
Albanian / A B C Ç D E Ë F G H I J K L M N O P Q R S T U V X Y Z
Czech / A Á B C Č D Ď E É Ě F G H I Í J K L M N Ň O Ó P Q R Ř S Š T Ť U Ú Ů V W X Y Ý Z Ž
Danish / A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å
Estonian / A B C D E F G H I J K L M N O P Q R S Š Z Ž T U V W Õ Ä Ö Ü X Y
Finnish / A B C D E F G H I J K L M N O P Q R S T U V X Y Z Å Ä Ö
Hungarian / A Á B C D E É F G H I Í J K L M N O Ó Ö Ő R S T Ty U Ú Ü Ű V Z
Icelandic / A Á B D Ð E É F G H I Í J K L M N O Ó P R S T U Ú V X Y Ý Þ Æ Ö
Norwegian / A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å
Latvian / A Ā B C Č D E Ē F G Ģ H I Ī J K Ķ L Ļ M N Ņ O P R S Š T U Ū V Z Ž
Portuguese / A B C D E F G H I J L M N O P Q R S T U V X Z Á É Í Ó Ú Â Ê Ô Ã Õ À Ü Ç
Romanian / A Ă Â B C D E F G H I Î J K L M N O P R S Ș T Ţ U V X Z
Slovenian / A B C Č D E F G H I J K L M N O P R S Š T U V Z Ž
Spanish / A B C D E F G H I J K L M N Ñ O P Q R S T U V W X Y Z
Swedish / A B C D E F G H I J K L M N O P Q R S T U V X Y Z Å Ä Ö
Turkish / A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z
Vietnamese / A Ă Â B C D Đ E Ê G H I K L M N O Ô Ơ P Q R S T U Ư V X Y
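Alphabet lists like these can be used to rule languages out: any language whose alphabet does not contain every letter of the text is dropped. A minimal sketch, using four of the alphabets above exactly as listed; the function and variable names are placeholders, and input is assumed to be in composed (NFC) form.

```python
# Alphabets copied from the list above; only four are included for brevity.
ALPHABETS = {
    "Danish": "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ",
    "Swedish": "ABCDEFGHIJKLMNOPQRSTUVXYZÅÄÖ",
    "Spanish": "ABCDEFGHIJKLMNÑOPQRSTUVWXYZ",
    "Czech": "AÁBCČDĎEÉĚFGHIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ",
}


def candidate_languages(text, alphabets=ALPHABETS):
    """Return the languages whose alphabet covers every letter in text."""
    letters = {ch for ch in text.upper() if ch.isalpha()}
    return [
        language
        for language, alphabet in alphabets.items()
        if letters <= set(alphabet)
    ]
```

This only narrows the field. A short text with no distinctive letters, such as "hotel", remains compatible with every alphabet, so a filter like this would most likely feed into a scoring step rather than replace it.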
Cyrillic
In Unicode, Cyrillic characters extend from U+0400 to U+052F (the Cyrillic block, U+0400–U+04FF, followed by the Cyrillic Supplement, U+0500–U+052F). The characters in the range U+0400–U+045F are essentially the characters from ISO 8859-5 moved upward by 864 positions. The characters in the range U+0460–U+0489 are historic letters that are no longer in use. The characters in the range U+048A–U+052F are additional letters for various languages that are written in Cyrillic script.
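The ranges described above translate directly into a code point check. The sketch below is illustrative only: the function names and the labels returned are assumptions, and the offset helper ignores the few ISO 8859-5 positions that do not map to Cyrillic letters.

```python
ISO_8859_5_OFFSET = 0x0400 - 0xA0  # 864 positions


def cyrillic_range(ch: str):
    """Classify a character by the Cyrillic ranges described above."""
    cp = ord(ch)
    if 0x0400 <= cp <= 0x045F:
        return "basic"     # repertoire inherited from ISO 8859-5
    if 0x0460 <= cp <= 0x0489:
        return "historic"  # letters no longer in use
    if 0x048A <= cp <= 0x052F:
        return "extended"  # additional letters for other languages
    return None            # not in the Cyrillic ranges


def from_iso_8859_5(byte: int) -> str:
    """Map an ISO 8859-5 letter byte to Unicode by adding the offset."""
    return chr(byte + ISO_8859_5_OFFSET)
```

A text containing "extended" characters is a hint that the language is not Russian, Bulgarian or another language covered by the basic range, which could serve as an early rule.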
ѐ ђ ѓґ ї ј љ њ ќ ѝ ў џ
Belarusian / А Б В Г Д Е Ё Ж З І Й К Л М Н О П Р С Т У Ў Ф Х Ц Ч Ш Ы Ь Э Ю Я
Bulgarian / А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ь Ю Я
Macedonian / А Б В Г Д Ѓ Е Ж З Ѕ И Ј К Л Љ М Н Њ О П Р С Т Ќ У Ф Х Ц Ч Џ Ш
Russian / А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
Serbian / А Б В Г Д Ђ Е Ж З И Ј К Л Љ М Н Њ О П Р С Т Ћ У Ф Х Ц Ч Џ Ш
Ukrainian / А Б В Г Ґ Д Е Є Ж З И І Ї Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ю Я Ь
Kazakh / А Ә Б В Г Ғ Д Е Ё Ж З И Й К Қ Л М Н Ң О П Ө Р С Т У Ұ Ү Ф Х Һ Ц Ч Ш Щ Ъ Ы І Ь Э Ю Я
Kyrgyz / А Б В Г Д Е Ё Ж З И Й К Л М Н Ң О Ө Р С Т У Ү Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
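Given a set of alphabets, the letters that occur in only one of them can be computed mechanically. Those letters are the strongest single-character evidence for a language. The sketch below uses six of the Cyrillic alphabets above as listed (Kazakh and Kyrgyz are left out for brevity); the names are placeholders, and the result depends entirely on which alphabets are compared.

```python
# Alphabets copied from the list above.
CYRILLIC_ALPHABETS = {
    "Belarusian": "АБВГДЕЁЖЗІЙКЛМНОПРСТУЎФХЦЧШЫЬЭЮЯ",
    "Bulgarian": "АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЬЮЯ",
    "Macedonian": "АБВГДЃЕЖЗЅИЈКЛЉМНЊОПРСТЌУФХЦЧЏШ",
    "Russian": "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
    "Serbian": "АБВГДЂЕЖЗИЈКЛЉМНЊОПРСТЋУФХЦЧЏШ",
    "Ukrainian": "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЮЯЬ",
}


def unique_letters(alphabets=CYRILLIC_ALPHABETS):
    """For each language, return the letters found in no other alphabet."""
    result = {}
    for language, alphabet in alphabets.items():
        others = set()
        for other_language, other_alphabet in alphabets.items():
            if other_language != language:
                others |= set(other_alphabet)
        result[language] = set(alphabet) - others
    return result
```

Among these six, Russian and Bulgarian end up with no unique letters at all, so telling them apart from each other would need other evidence such as letter combinations or common words.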
References
Encodings / Character Sets
- http://216.239.59.104/search?q=cache:9d97oR9Y8F8J:ws.edu.isoc.org/workshops/2004/ICANN-KL/ICANN-ISOC-KL-IDN.ppt+unicode+universal+declaration+of+human+rights&hl=en
Linguistics / Orthography
- http://www.phon.ucl.ac.uk/home/wells/dia/diacritics-revised.htm
Competitors
Transliteration
- http://en.wikipedia.org/wiki/Transliteration
- http://pubs.usgs.gov/of/of95-807/geoicelandic.html
Corpus
- http://www.rawbw.com/~emuller/unicode/
- http://www.lexilogos.com/declaration/index.htm
- http://www.geonames.de/udhr.html
- http://www.gaydzer.am/Armenian_Cause.htm
- http://www.freeserbs.org/~mladich/udhr/udhr_hyu
Todo
- UNUDHR (UniversalDeclarationOfHumanRights) in Unicode for all major languages
- Write Language Model DTD