Automated Language Identification

Introduction

 

Assumptions

  • Data is valid Unicode in the language's native script (except possibly for transliterated text).

 

Methods

Expert Rule Based

Learned Rule Based

 

Weighting Rules

  • Different Weighting for Different Texts
  • Points System
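
One way to read the two bullets above is as a scoring scheme: each rule that fires awards points to a language, and the point values are scaled according to the type of text. The sketch below is only an illustration of that idea. The rules, point values and text-type multipliers are all hypothetical and would need to be tuned against a real corpus.

```python
from collections import defaultdict

# Hypothetical rules: (kind, language, trigger substring, base points).
RULES = [
    ("char", "German", "ß", 5.0),
    ("char", "Spanish", "ñ", 5.0),
    ("word", "German", " und ", 2.0),
    ("word", "Spanish", " los ", 2.0),
]

# Hypothetical multipliers: short headlines contain few function words,
# so character evidence is trusted more and word evidence less.
TEXT_TYPE_WEIGHTS = {
    "prose": {"char": 1.0, "word": 1.0},
    "headline": {"char": 1.5, "word": 0.5},
}


def score_text(text, text_type="prose"):
    """Return {language: points} for every rule that fires on the text."""
    weights = TEXT_TYPE_WEIGHTS[text_type]
    padded = f" {text.lower()} "  # padding lets word rules match at the edges
    scores = defaultdict(float)
    for kind, language, trigger, points in RULES:
        if trigger in padded:
            scores[language] += points * weights[kind]
    return dict(scores)


def best_guess(text, text_type="prose"):
    """Return the highest-scoring language, or None if no rule fired."""
    scores = score_text(text, text_type)
    return max(scores, key=scores.get) if scores else None
```

For example, best_guess("Der Hund und die Straße") picks German because both a character rule and a word rule fire for it.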

 

Writing Systems / Orthography

 

Verb Endings, Agglutination, Morphology

 

Identification of Language Groups/Families

  • Levenshtein distance
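
Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch of the standard dynamic-programming computation follows. How the distances would be aggregated over word lists to group languages into families is left open here, and the idea of comparing cognate word pairs is an assumption, not something these notes specify.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute = cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

As an illustration, levenshtein("nacht", "night") is 2, a small distance of the kind that might hint at related languages when it recurs across many word pairs.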

 

Non-Language Specific Words (Proper Nouns, Jargon and Text Noise)

  • Junk Characters/Words

 

Character Encoding

  • Different Unicode encodings (UTF-8, UCS-2, UCS-4 etc.)
  • Other encodings (KOI8-R etc.)
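
Before any language rule can run, the raw bytes have to be decoded. The sketch below shows one possible order of checks: look for a byte order mark, then try strict UTF-8, then fall back to a single-byte encoding. The fallback to KOI8-R is an assumption for illustration, and Python's utf-16 and utf-32 codecs are used as stand-ins for UCS-2 and UCS-4. Note that a single-byte codec such as KOI8-R accepts almost any byte sequence, so a successful fallback decode does not prove the guess was right.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_decode(data: bytes, fallback: str = "koi8_r"):
    """Return (codec name, decoded text) using BOM, then UTF-8, then fallback."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data.decode(name)  # these codecs strip the BOM
    try:
        return "utf-8", data.decode("utf-8")  # strict: fails on invalid bytes
    except UnicodeDecodeError:
        # Assumed fallback; the result still needs checking by other means.
        return fallback, data.decode(fallback, errors="replace")
```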

 

Transliteration

  • Identifying transliterated languages (add the transliteration convention to the language model?)
  • PhoneticAlphabet
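
One possible reading of "add the convention to the language model" is to store a transliteration table with each language, so that native-script training words can be converted into the romanized forms the identifier might encounter. The sketch below assumes that reading. The table is a hypothetical, partial Russian-to-Latin mapping chosen for illustration and does not follow any particular published standard.

```python
# Hypothetical, partial Russian-to-Latin convention (illustration only).
RU_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh",
}


def transliterate(text: str, table=RU_TO_LATIN) -> str:
    """Map each character through the table; unknown characters pass through."""
    out = []
    for ch in text:
        mapped = table.get(ch.lower())
        if mapped is None:
            out.append(ch)                   # not covered by the convention
        elif ch.isupper():
            out.append(mapped.capitalize())  # "Ч" -> "Ch"
        else:
            out.append(mapped)
    return "".join(out)
```

Applied to a native-script word list, this would yield entries such as transliterate("Москва"), giving "Moskva", which could then be matched against Latin-script input.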

 

Unique Characters (Accents, Umlauts and Знаки [signs])

 

Orthographic diacritics

Most languages written in the Latin (Roman) alphabet, English being a notable exception, use diacritics on characters to mark a different pronunciation or sound. In some languages, such as Icelandic, a character and its diacritic together are treated as a distinct letter, whereas in French the diacritic is considered merely a modification of the base letter.
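
Unicode's canonical decomposition (NFD) gives a mechanical view of this: a precomposed letter is split into a base letter plus combining marks. The sketch below uses Python's standard unicodedata module. Whether a given language treats the combination as a separate letter is an orthographic convention that Unicode does not record, so that knowledge would still have to live in the language model. Some letters, such as ø, ð and þ, have no decomposition at all.

```python
import unicodedata


def split_diacritics(ch: str):
    """Return (base letters, names of combining marks) for a character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]
    return base, marks
```

For example, split_diacritics("é") separates the letter e from its acute accent, while split_diacritics("ø") returns the letter unchanged with no marks.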

 

Latin

ß

àáâãäåāăąȁȃ

çćĉċč

ďđ

èéêëēĕėęěȅȇ

ĝğġģ

ĥħ

ìíîïĩīĭįıȉȋ

ĵ

ķĸ

ĺļľŀł

ð

ñńņňʼnŋ

òóôõöøōŏőȍȏ

ŕŗřȑȓ

śŝşšș

ţťŧț

ùúûüũūŭůűųȕȗ

ŵẅẃẁ

ýŷÿỳ

źżž

þ

ẛ

Albanian   / A B C Ç D E Ë F G H I J K L M N O P Q R S T U V X Y Z
Czech      / A Á B C Č D Ď E É Ě F G H I Í J K L M N Ň O Ó P Q R Ř S Š T Ť U Ú Ů V W X Y Ý Z Ž 
Danish     / A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å 
Estonian   / A B C D E F G H I J K L M N O P Q R S Š Z Ž T U V W Õ Ä Ö Ü X Y 
Finnish    / A B C D E F G H I J K L M N O P Q R S T U V X Y Z Å Ä Ö 
Hungarian  / A Á B C D E É F G H I Í J K L M N O Ó Ö Ő P R S T Ty U Ú Ü Ű V Z 
Icelandic  / A Á B D Ð E É F G H I Í J K L M N O Ó P R S T U Ú V X Y Ý Þ Æ Ö 
Norwegian  / A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å 
Latvian    / A Ā B C Č D E Ē F G Ģ H I Ī J K Ķ L Ļ M N Ņ O P R S Š T U Ū V Z Ž
Portuguese / A B C D E F G H I J L M N O P Q R S T U V X Z Á É Í Ó Ú Â Ê Ô Ã Õ À Ü Ç
Romanian   / A Ă Â B C D E F G H I Î J K L M N O P R S Ș T Ț U V X Z
Slovenian  / A B C Č D E F G H I J K L M N O P R S Š T U V Z Ž
Spanish    / A B C D E F G H I J K L M N Ñ O P Q R S T U V W X Y Z 
Swedish    / A B C D E F G H I J K L M N O P Q R S T U V X Y Z Å Ä Ö
Turkish    / A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z 
Vietnamese / A Ă Â B C D Đ E Ê G H I K L M N O Ô Ơ P Q R S T U Ư V X Y
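
A table like the one above can serve as an elimination filter: a language stays a candidate only if every letter in the text belongs to its alphabet. The sketch below copies three rows from the table to show the idea. It is deliberately strict, so a single loanword or proper noun would wrongly eliminate a language, which is one reason a weighted scheme may be preferable in practice. Case mapping is also an approximation here, since str.upper() does not preserve the Turkish distinction between dotted and dotless i.

```python
# Three rows copied from the table above; extend with the remaining rows.
ALPHABETS = {
    "Danish": "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å",
    "Swedish": "A B C D E F G H I J K L M N O P Q R S T U V X Y Z Å Ä Ö",
    "Turkish": "A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z",
}
LETTERS = {lang: set(row.split()) for lang, row in ALPHABETS.items()}


def candidates(text: str):
    """Return the languages whose alphabet contains every letter in the text."""
    used = {ch.upper() for ch in text if ch.isalpha()}
    return sorted(lang for lang, letters in LETTERS.items() if used <= letters)
```

A word with no distinctive letters, such as "hund", leaves all three languages in play, whereas "smörgåsbord" leaves only Swedish.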

 

Cyrillic

In Unicode, Cyrillic letters occupy U+0400 to U+052F: the Cyrillic block (U+0400–U+04FF) followed by the Cyrillic Supplement block (U+0500–U+052F). The characters in the range U+0400–U+045F are essentially the characters of ISO 8859-5 moved upward by 864 positions. The range U+0460–U+0489 holds mostly historic letters and signs that are no longer in use. The range U+048A–U+052F holds additional letters for the various languages written in Cyrillic script.
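
The offset of 864 is simply 0x0400 minus 0xA0, the start of the upper half of ISO 8859-5. The sketch below checks the relationship using Python's built-in iso8859_5 codec and collects the byte values that do not follow it. A few positions in ISO 8859-5 hold symbols instead of letters, so a small number of exceptions is expected.

```python
OFFSET = 0x0400 - 0xA0  # 864 in decimal


def offset_exceptions():
    """Return (byte, character) pairs in 0xA0..0xFF that break the offset rule."""
    exceptions = []
    for byte in range(0xA0, 0x100):
        ch = bytes([byte]).decode("iso8859_5")
        if ord(ch) != byte + OFFSET:
            exceptions.append((byte, ch))
    return exceptions
```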

ѐ

ђ

ѓґ

ї

ј

љ

њ

ќ

ѝ

ў

џ
Belarusian / А Б В Г Д Е Ё Ж З І Й К Л М Н О П Р С Т У Ў Ф Х Ц Ч Ш Ы Ь Э Ю Я
Bulgarian  / А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ь Ю Я
Macedonian / А Б В Г Д Ѓ Е Ж З Ѕ И Ј К Л Љ М Н Њ О П Р С Т Ќ У Ф Х Ц Ч Џ Ш
Russian    / А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
Serbian    / А Б В Г Д Ђ Е Ж З И Ј К Л Љ М Н Њ О П Р С Т Ћ У Ф Х Ц Ч Џ Ш 
Ukrainian  / А Б В Г Ґ Д Е Є Ж З И І Ї Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ю Я Ь

Kazakh     / А Ә Б В Г Ғ Д Е Ё Ж З И Й К Қ Л М Н Ң О Ө П Р С Т У Ұ Ү Ф Х Һ Ц Ч Ш Щ Ъ Ы І Ь Э Ю Я 
Kyrgyz     / А Б В Г Д Е Ё Ж З И Й К Л М Н Ң О Ө П Р С Т У Ү Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
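
Letters that occur in exactly one of the listed alphabets are strong positive evidence for that language. The sketch below derives them from the six Slavic rows of the table above. The result depends entirely on which alphabets are included: adding more languages would shrink the sets, and languages with no unique letter in this set would need other evidence such as word frequencies.

```python
from collections import Counter

# Rows copied from the table above.
ALPHABETS = {
    "Belarusian": "А Б В Г Д Е Ё Ж З І Й К Л М Н О П Р С Т У Ў Ф Х Ц Ч Ш Ы Ь Э Ю Я",
    "Bulgarian": "А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ь Ю Я",
    "Macedonian": "А Б В Г Д Ѓ Е Ж З Ѕ И Ј К Л Љ М Н Њ О П Р С Т Ќ У Ф Х Ц Ч Џ Ш",
    "Russian": "А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я",
    "Serbian": "А Б В Г Д Ђ Е Ж З И Ј К Л Љ М Н Њ О П Р С Т Ћ У Ф Х Ц Ч Џ Ш",
    "Ukrainian": "А Б В Г Ґ Д Е Є Ж З И І Ї Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ю Я Ь",
}


def unique_letters(alphabets):
    """Return {language: letters found in that alphabet and no other}."""
    sets = {lang: set(row.split()) for lang, row in alphabets.items()}
    counts = Counter(letter for letters in sets.values() for letter in letters)
    return {
        lang: sorted(letter for letter in letters if counts[letter] == 1)
        for lang, letters in sets.items()
    }
```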

 

References

 

Encodings / Character Sets

  • http://216.239.59.104/search?q=cache:9d97oR9Y8F8J:ws.edu.isoc.org/workshops/2004/ICANN-KL/ICANN-ISOC-KL-IDN.ppt+unicode+universal+declaration+of+human+rights&hl=en

 

Linguistics / Orthography

  • http://www.phon.ucl.ac.uk/home/wells/dia/diacritics-revised.htm

 

Competitors

 

Transliteration

  • http://en.wikipedia.org/wiki/Transliteration
  • http://pubs.usgs.gov/of/of95-807/geoicelandic.html

 

Corpus

  • http://www.rawbw.com/~emuller/unicode/
  • http://www.lexilogos.com/declaration/index.htm
  • http://www.geonames.de/udhr.html
  • http://www.gaydzer.am/Armenian_Cause.htm
  • http://www.freeserbs.org/~mladich/udhr/udhr_hyu

 

Todo

  • UNUDHR (UniversalDeclarationOfHumanRights) in Unicode for all major languages
  • Write Language Model DTD