Spectre/AutomatedLanguageIdentification

Category: MajorProjects   Page type: Normal

Automated Language Identification

An attempt to automatically identify the language of a text file.

Contents

Introduction

Assumptions

Methods

Expert Rule Based

Learned Rule Based

Weighting Rules

Writing Systems / Orthography

Verb Endings, Agglutination, Morphology

Identification of Language Groups/Families

Non-Language Specific Words (Proper Nouns, Jargon and Text Noise)

Character Encoding

Transliteration

Unique Characters (Accents, Umlauts and Знаки)

Orthographic diacritics

Most languages written in the Latin (Roman) alphabet, with the exception of English, use diacritics on characters to denote a different pronunciation or sound. In some cases, such as Icelandic, the character and diacritic together are treated as a separate letter, whereas in French the diacritic is considered merely an addition to the base letter.
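The difference between a diacritic as an add-on and a letter in its own right is partly visible in Unicode itself. The following is a minimal sketch, not part of the project code: the function name split_diacritics is made up for illustration, and it relies only on Python's standard unicodedata module. One assumption to keep in mind is that Unicode decomposition reflects encoding conventions rather than any single language's alphabet, so å still decomposes even though Danish and Swedish treat it as a separate letter.

```python
import unicodedata


def split_diacritics(ch):
    """Split one character into its base letter(s) and the names of its combining marks.

    Hypothetical helper for illustration only. Uses canonical decomposition (NFD).
    """
    decomposed = unicodedata.normalize("NFD", ch)
    # Combining marks have a non-zero canonical combining class.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]
    return base, marks


if __name__ == "__main__":
    for ch in "éåðþø":
        print(ch, split_diacritics(ch))
```

Here é splits into e plus a combining acute accent, while ð, þ and ø have no decomposition and come back unchanged, which matches their status as independent letters.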

Latin

ß

àáâãäåāăąȁȃ

çćĉċč

ďđ

èéêëēĕėęěȅȇ

ĝğġģ

ĥħ

ìíîïĩīĭįıȉȋ

ĵ

ķĸ

ĺļľŀł

ð

ñńņňʼnŋ

òóôõöøōŏőȍȏ

ŕŗřȑȓ

śŝşšș

ţťŧț

ùúûüũūŭůűųȕȗ

ŵẅẃẁ

ýŷÿỳ

źżž

þ

ẛ

Albanian   / A B C Ç D E Ë F G H I J K L M N O P Q R S T U V X Y Z
Czech      / A Á B C Č D Ď E É Ě F G H I Í J K L M N Ň O Ó P Q R Ř S Š T Ť U Ú Ů V W X Y Ý Z Ž 
Danish     / A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å 
Estonian   / A B C D E F G H I J K L M N O P Q R S Š Z Ž T U V W Õ Ä Ö Ü X Y 
Finnish    / A B C D E F G H I J K L M N O P Q R S T U V X Y Z Å Ä Ö 
Hungarian  / A Á B C D E É F G H I Í J K L M N O Ó Ö Ő P R S T Ty U Ú Ü Ű V Z 
Icelandic  / A Á B D Ð E É F G H I Í J K L M N O Ó P R S T U Ú V X Y Ý Þ Æ Ö 
Norwegian  / A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å 
Latvian    / A Ā B C Č D E Ē F G Ģ H I Ī J K Ķ L Ļ M N Ņ O P R S Š T U Ū V Z Ž
Portuguese / A B C D E F G H I J L M N O P Q R S T U V X Z Á É Í Ó Ú Â Ê Ô Ã Õ À Ü Ç
Romanian   / A Ă Â B C D E F G H I Î J K L M N O P R S Ș T Ț U V X Z
Slovenian  / A B C Č D E F G H I J K L M N O P R S Š T U V Z Ž
Spanish    / A B C D E F G H I J K L M N Ñ O P Q R S T U V W X Y Z 
Swedish    / A B C D E F G H I J K L M N O P Q R S T U V X Y Z Å Ä Ö
Turkish    / A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z 
Vietnamese / A Ă Â B C D Đ E Ê G H I K L M N O Ô Ơ P Q R S T U Ư V X Y
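The alphabet lists above can be used directly to narrow down candidate languages: any language whose alphabet lacks a letter found in the text can be ruled out. The following is only a rough sketch under stated assumptions. The names ALPHABETS and candidate_languages are hypothetical, only four rows of the table are copied in, and the text is assumed to contain no loanwords, proper nouns or noise (which a real identifier would have to weight rather than treat as hard evidence).

```python
import unicodedata

# A small subset of the table above, copied as plain strings (illustrative only).
ALPHABETS = {
    "Danish": "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ",
    "Icelandic": "AÁBDÐEÉFGHIÍJKLMNOÓPRSTUÚVXYÝÞÆÖ",
    "Swedish": "ABCDEFGHIJKLMNOPQRSTUVXYZÅÄÖ",
    "Turkish": "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ",
}


def candidate_languages(text):
    """Return the languages whose alphabet contains every letter used in text."""
    # Normalise to NFC so that letter + combining mark compares as one character.
    text = unicodedata.normalize("NFC", text).upper()
    letters = {ch for ch in text if ch.isalpha()}
    return sorted(
        lang
        for lang, alphabet in ALPHABETS.items()
        if letters <= set(unicodedata.normalize("NFC", alphabet))
    )


if __name__ == "__main__":
    for sample in ("þú ert", "smörgås", "ağaç", "hat"):
        print(sample, candidate_languages(sample))
```

A word using only unaccented letters, such as "hat", leaves all four languages as candidates, which is why character evidence needs to be combined with the other methods listed in the contents.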


Cyrillic

In Unicode, Cyrillic letters occupy U+0400 to U+052F, covering the Cyrillic block (U+0400–U+04FF) and the Cyrillic Supplement block (U+0500–U+052F). The characters in the range U+0400–U+045F are essentially the characters of ISO 8859-5 shifted upward by 864 positions. The characters in the range U+0460–U+0489 are historic letters that are no longer in use. The characters in the range U+048A–U+052F are additional letters for various languages written in Cyrillic script.
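The ranges described above translate into a simple code point test. The sketch below is illustrative only: the function names and the labels "basic", "historic" and "extended" are made up here, not taken from Unicode or from the project. The second function checks the 864-position shift against Python's built-in iso8859_5 codec and shows why the shift only "basically" holds: a few byte values in ISO 8859-5 are symbols rather than Cyrillic letters.

```python
ISO_8859_5_OFFSET = 864  # 0x0401 - 0xA1, the shift mentioned in the text


def cyrillic_range(ch):
    """Classify a character by the Cyrillic ranges described above (labels are hypothetical)."""
    cp = ord(ch)
    if 0x0400 <= cp <= 0x045F:
        return "basic"
    if 0x0460 <= cp <= 0x0489:
        return "historic"
    if 0x048A <= cp <= 0x052F:
        return "extended"
    return None  # not in the ranges covered here


def follows_offset(byte):
    """True if this ISO 8859-5 byte maps to the code point byte + 864."""
    return ord(bytes([byte]).decode("iso8859_5")) == byte + ISO_8859_5_OFFSET


if __name__ == "__main__":
    for ch in "ЯѠҚA":
        print(ch, cyrillic_range(ch))
    # Upper half of ISO 8859-5: list the bytes that do not follow the shift.
    print([hex(b) for b in range(0xA1, 0x100) if not follows_offset(b)])
```

A letter landing in the "extended" range is a useful hint that the text is not Russian or Bulgarian, since letters such as Қ or Ң belong to languages like Kazakh and Kyrgyz.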

ѐ

ђ

ѓґ

ї

ј

љ

њ

ќ

ѝ

ў

џ

Belarusian / А Б В Г Д Е Ё Ж З І Й К Л М Н О П Р С Т У Ў Ф Х Ц Ч Ш Ы Ь Э Ю Я
Bulgarian  / А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ь Ю Я
Macedonian / А Б В Г Д Ѓ Е Ж З Ѕ И Ј К Л Љ М Н Њ О П Р С Т Ќ У Ф Х Ц Ч Џ Ш
Russian    / А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
Serbian    / А Б В Г Д Ђ Е Ж З И Ј К Л Љ М Н Њ О П Р С Т Ћ У Ф Х Ц Ч Џ Ш 
Ukrainian  / А Б В Г Ґ Д Е Є Ж З И І Ї Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ю Я Ь

Kazakh     / А Ә Б В Г Ғ Д Е Ё Ж З И Й К Қ Л М Н Ң О Ө П Р С Т У Ұ Ү Ф Х Һ Ц Ч Ш Щ Ъ Ы І Ь Э Ю Я 
Kyrgyz     / А Б В Г Д Е Ё Ж З И Й К Л М Н Ң О Ө П Р С Т У Ү Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
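One way to turn the alphabet lists above into candidate rules is to compute which letters occur in exactly one alphabet of a given set. The following is a sketch under assumptions: the names CYRILLIC and unique_letters are hypothetical, and only the six Slavic rows of the table are included, so "unique" means unique within this set only.

```python
import unicodedata
from collections import Counter

# The six Slavic alphabets from the table above, copied as plain strings.
CYRILLIC = {
    "Belarusian": "АБВГДЕЁЖЗІЙКЛМНОПРСТУЎФХЦЧШЫЬЭЮЯ",
    "Bulgarian": "АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЬЮЯ",
    "Macedonian": "АБВГДЃЕЖЗЅИЈКЛЉМНЊОПРСТЌУФХЦЧЏШ",
    "Russian": "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
    "Serbian": "АБВГДЂЕЖЗИЈКЛЉМНЊОПРСТЋУФХЦЧЏШ",
    "Ukrainian": "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЮЯЬ",
}


def unique_letters(alphabets):
    """Map each language to the letters found in no other alphabet of the set."""
    # NFC keeps letters such as Й and Ё as single code points.
    sets = {
        lang: set(unicodedata.normalize("NFC", alphabet))
        for lang, alphabet in alphabets.items()
    }
    counts = Counter(ch for letters in sets.values() for ch in letters)
    return {
        lang: sorted(ch for ch in letters if counts[ch] == 1)
        for lang, letters in sets.items()
    }


if __name__ == "__main__":
    for lang, letters in unique_letters(CYRILLIC).items():
        print(lang, " ".join(letters) or "(none)")
```

Within this set, Russian and Bulgarian have no letter of their own, so a single character can confirm Serbian or Ukrainian but can only rule Russian or Bulgarian out, never in.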

References

Encodings / Character Sets


Linguistics / Orthography

Competitors

Transliteration

Corpus

Todo