Tamil character tables

This document lists the per-character shaping information needed to shape Tamil text.

Tamil character table

Tamil characters should be classified as shown in the following table. Codepoints in the Tamil block with no assigned meaning are designated as unassigned in the Unicode category column.

Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine. Note that this includes some valid codepoints, such as punctuation and the Om sign (U+0BD0).

Note: the NUMBER and SYMBOL Shaping classes are important during syllable identification, but generally evoke no further special behavior during the rest of the shaping process.

The Mark-placement subclass column indicates mark-placement positioning for codepoints in the Mark category. Assigned, non-mark codepoints have a null in this column and evoke no special mark-placement behavior. Marks tagged with [Mn] in the Unicode category column are categorized as non-spacing; marks tagged with [Mc] are categorized as spacing-combining.

Some codepoints in the following table use a Shaping class that differs from the codepoint's Unicode General Category. The Shaping class takes precedence during OpenType shaping, as it captures more specific, script-aware behavior.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+0B80 unassigned
U+0B81 unassigned
U+0B82 Mark [Mn] BINDU TOP_POSITION ஂ Anusvara
U+0B83 Letter MODIFYING_LETTER null ஃ Visarga
U+0B84 unassigned
U+0B85 Letter VOWEL_INDEPENDENT null அ A
U+0B86 Letter VOWEL_INDEPENDENT null ஆ Aa
U+0B87 Letter VOWEL_INDEPENDENT null இ I
U+0B88 Letter VOWEL_INDEPENDENT null ஈ Ii
U+0B89 Letter VOWEL_INDEPENDENT null உ U
U+0B8A Letter VOWEL_INDEPENDENT null ஊ Uu
U+0B8B unassigned
U+0B8C unassigned
U+0B8D unassigned
U+0B8E Letter VOWEL_INDEPENDENT null எ E
U+0B8F Letter VOWEL_INDEPENDENT null ஏ Ee
U+0B90 Letter VOWEL_INDEPENDENT null ஐ Ai
U+0B91 unassigned
U+0B92 Letter VOWEL_INDEPENDENT null ஒ O
U+0B93 Letter VOWEL_INDEPENDENT null ஓ Oo
U+0B94 Letter VOWEL_INDEPENDENT null ஔ Au
U+0B95 Letter CONSONANT null க Ka
U+0B96 unassigned
U+0B97 unassigned
U+0B98 unassigned
U+0B99 Letter CONSONANT null ங Nga
U+0B9A Letter CONSONANT null ச Ca
U+0B9B unassigned
U+0B9C Letter CONSONANT null ஜ Ja
U+0B9D unassigned
U+0B9E Letter CONSONANT null ஞ Nya
U+0B9F Letter CONSONANT null ட Tta
U+0BA0 unassigned
U+0BA1 unassigned
U+0BA2 unassigned
U+0BA3 Letter CONSONANT null ண Nna
U+0BA4 Letter CONSONANT null த Ta
U+0BA5 unassigned
U+0BA6 unassigned
U+0BA7 unassigned
U+0BA8 Letter CONSONANT null ந Na
U+0BA9 Letter CONSONANT null ன Nnna
U+0BAA Letter CONSONANT null ப Pa
U+0BAB unassigned
U+0BAC unassigned
U+0BAD unassigned
U+0BAE Letter CONSONANT null ம Ma
U+0BAF Letter CONSONANT null ய Ya
U+0BB0 Letter CONSONANT null ர Ra
U+0BB1 Letter CONSONANT null ற Rra
U+0BB2 Letter CONSONANT null ல La
U+0BB3 Letter CONSONANT null ள Lla
U+0BB4 Letter CONSONANT null ழ Llla
U+0BB5 Letter CONSONANT null வ Va
U+0BB6 Letter CONSONANT null ஶ Sha
U+0BB7 Letter CONSONANT null ஷ Ssa
U+0BB8 Letter CONSONANT null ஸ Sa
U+0BB9 Letter CONSONANT null ஹ Ha
U+0BBA unassigned
U+0BBB unassigned
U+0BBC unassigned
U+0BBD unassigned
U+0BBE Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ா Sign Aa
U+0BBF Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ி Sign I
U+0BC0 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ீ Sign Ii
U+0BC1 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ு Sign U
U+0BC2 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ூ Sign Uu
U+0BC3 unassigned
U+0BC4 unassigned
U+0BC5 unassigned
U+0BC6 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ெ Sign E
U+0BC7 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ே Sign Ee
U+0BC8 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ை Sign Ai
U+0BC9 unassigned
U+0BCA Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ொ Sign O
U+0BCB Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ோ Sign Oo
U+0BCC Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ௌ Sign Au
U+0BCD Mark [Mn] VIRAMA TOP_POSITION ் Virama
U+0BCE unassigned
U+0BCF unassigned
U+0BD0 Letter null null ௐ Om
U+0BD1 unassigned
U+0BD2 unassigned
U+0BD3 unassigned
U+0BD4 unassigned
U+0BD5 unassigned
U+0BD6 unassigned
U+0BD7 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ௗ Au Length Mark
U+0BD8 unassigned
U+0BD9 unassigned
U+0BDA unassigned
U+0BDB unassigned
U+0BDC unassigned
U+0BDD unassigned
U+0BDE unassigned
U+0BDF unassigned
U+0BE0 unassigned
U+0BE1 unassigned
U+0BE2 unassigned
U+0BE3 unassigned
U+0BE4 unassigned
U+0BE5 unassigned
U+0BE6 Number NUMBER null ௦ Digit Zero
U+0BE7 Number NUMBER null ௧ Digit One
U+0BE8 Number NUMBER null ௨ Digit Two
U+0BE9 Number NUMBER null ௩ Digit Three
U+0BEA Number NUMBER null ௪ Digit Four
U+0BEB Number NUMBER null ௫ Digit Five
U+0BEC Number NUMBER null ௬ Digit Six
U+0BED Number NUMBER null ௭ Digit Seven
U+0BEE Number NUMBER null ௮ Digit Eight
U+0BEF Number NUMBER null ௯ Digit Nine
U+0BF0 Number NUMBER null ௰ Number Ten
U+0BF1 Number NUMBER null ௱ Number One Hundred
U+0BF2 Number NUMBER null ௲ Number One Thousand
U+0BF3 Symbol SYMBOL null ௳ Day Sign
U+0BF4 Symbol SYMBOL null ௴ Month Sign
U+0BF5 Symbol SYMBOL null ௵ Year Sign
U+0BF6 Symbol SYMBOL null ௶ Debit Sign
U+0BF7 Symbol SYMBOL null ௷ Credit Sign
U+0BF8 Symbol SYMBOL null ௸ As Above Sign
U+0BF9 Symbol SYMBOL null ௹ Tamil Rupee Sign
U+0BFA Symbol SYMBOL null ௺ Number Sign
U+0BFB unassigned
U+0BFC unassigned
U+0BFD unassigned
U+0BFE unassigned
U+0BFF unassigned
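
A shaping engine would typically hold these rows in a lookup structure keyed by codepoint. The following is a minimal sketch in Python under that assumption; the names `CharInfo`, `TAMIL_TABLE`, and `classify` are hypothetical and chosen only for illustration, and only a handful of rows from the table above are transcribed.

```python
from typing import NamedTuple, Optional


class CharInfo(NamedTuple):
    """One row of the character table (illustrative type, not a real API)."""
    category: str                   # "Unicode category" column
    shaping_class: Optional[str]    # None stands for a null Shaping class
    mark_placement: Optional[str]   # None stands for a null subclass


# A few rows transcribed from the Tamil character table above.
TAMIL_TABLE = {
    0x0B82: CharInfo("Mark [Mn]", "BINDU", "TOP_POSITION"),
    0x0B83: CharInfo("Letter", "MODIFYING_LETTER", None),
    0x0B85: CharInfo("Letter", "VOWEL_INDEPENDENT", None),
    0x0B95: CharInfo("Letter", "CONSONANT", None),
    0x0BBE: CharInfo("Mark [Mc]", "VOWEL_DEPENDENT", "RIGHT_POSITION"),
    0x0BCA: CharInfo("Mark [Mc]", "VOWEL_DEPENDENT", "LEFT_AND_RIGHT_POSITION"),
    0x0BCD: CharInfo("Mark [Mn]", "VIRAMA", "TOP_POSITION"),
    0x0BD0: CharInfo("Letter", None, None),
    0x0BE6: CharInfo("Number", "NUMBER", None),
    0x0BF9: CharInfo("Symbol", "SYMBOL", None),
}

UNASSIGNED = CharInfo("unassigned", None, None)


def classify(codepoint: int) -> CharInfo:
    """Return the table row for a codepoint in the Tamil block.

    Because this sketch transcribes only part of the table, rows that are
    not listed here also fall through to UNASSIGNED; a complete table
    would list every assigned codepoint.
    """
    if not 0x0B80 <= codepoint <= 0x0BFF:
        raise ValueError(f"U+{codepoint:04X} is outside the Tamil block")
    return TAMIL_TABLE.get(codepoint, UNASSIGNED)
```

Storing the Shaping class next to the Unicode category keeps the precedence rule explicit: callers read `shaping_class` for shaping decisions and consult `category` only for general-purpose text handling.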

Tamil Supplement character table

Tamil text runs may also include historical symbols and fractions from the Tamil Supplement block. These characters should be classified as follows.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+11FC0 Number NUMBER null 𑿀 Fraction One Three-Hundred-And-Twentieth
U+11FC1 Number NUMBER null 𑿁 Fraction One One-Hundred-And-Sixtieth
U+11FC2 Number NUMBER null 𑿂 Fraction One Eightieth
U+11FC3 Number NUMBER null 𑿃 Fraction One Sixty-Fourth
U+11FC4 Number NUMBER null 𑿄 Fraction One Fortieth
U+11FC5 Number NUMBER null 𑿅 Fraction One Thirty-Second
U+11FC6 Number NUMBER null 𑿆 Fraction Three Eightieths
U+11FC7 Number NUMBER null 𑿇 Fraction Three Sixty-Fourths
U+11FC8 Number NUMBER null 𑿈 Fraction One Twentieth
U+11FC9 Number NUMBER null 𑿉 Fraction One Sixteenth-1
U+11FCA Number NUMBER null 𑿊 Fraction One Sixteenth-2
U+11FCB Number NUMBER null 𑿋 Fraction One Tenth
U+11FCC Number NUMBER null 𑿌 Fraction One Eighth
U+11FCD Number NUMBER null 𑿍 Fraction Three Twentieths
U+11FCE Number NUMBER null 𑿎 Fraction Three Sixteenths
U+11FCF Number NUMBER null 𑿏 Fraction One Fifth
U+11FD0 Number NUMBER null 𑿐 Fraction One Quarter
U+11FD1 Number NUMBER null 𑿑 Fraction One Half-1
U+11FD2 Number NUMBER null 𑿒 Fraction One Half-2
U+11FD3 Number NUMBER null 𑿓 Fraction Three Quarters
U+11FD4 Number NUMBER null 𑿔 Fraction Downscaling Factor Kiizh
U+11FD5 Symbol SYMBOL null 𑿕 Sign Nel
U+11FD6 Symbol SYMBOL null 𑿖 Sign Cevitu
U+11FD7 Symbol SYMBOL null 𑿗 Sign Aazhaakku
U+11FD8 Symbol SYMBOL null 𑿘 Sign Uzhakku
U+11FD9 Symbol SYMBOL null 𑿙 Sign Muuvuzhakku
U+11FDA Symbol SYMBOL null 𑿚 Sign Kuruni
U+11FDB Symbol SYMBOL null 𑿛 Sign Pathakku
U+11FDC Symbol SYMBOL null 𑿜 Sign Mukkuruni
U+11FDD Symbol SYMBOL null 𑿝 Sign Kaacu
U+11FDE Symbol SYMBOL null 𑿞 Sign Panam
U+11FDF Symbol SYMBOL null 𑿟 Sign Pon
U+11FE0 Symbol SYMBOL null 𑿠 Sign Varaakan
U+11FE1 Symbol SYMBOL null 𑿡 Sign Paaram
U+11FE2 Symbol SYMBOL null 𑿢 Sign Kuzhi
U+11FE3 Symbol SYMBOL null 𑿣 Sign Veli
U+11FE4 Symbol SYMBOL null 𑿤 Wet Cultivation Sign
U+11FE5 Symbol SYMBOL null 𑿥 Dry Cultivation Sign
U+11FE6 Symbol SYMBOL null 𑿦 Land Sign
U+11FE7 Symbol SYMBOL null 𑿧 Salt Pan Sign
U+11FE8 Symbol SYMBOL null 𑿨 Traditional Credit Sign
U+11FE9 Symbol SYMBOL null 𑿩 Traditional Number Sign
U+11FEA Symbol SYMBOL null 𑿪 Current Sign
U+11FEB Symbol SYMBOL null 𑿫 And Odd Sign
U+11FEC Symbol SYMBOL null 𑿬 Spent Sign
U+11FED Symbol SYMBOL null 𑿭 Total Sign
U+11FEE Symbol SYMBOL null 𑿮 In Possession Sign
U+11FEF Symbol SYMBOL null 𑿯 Starting From Sign
U+11FF0 Symbol SYMBOL null 𑿰 Sign Muthaliya
U+11FF1 Symbol SYMBOL null 𑿱 Sign Vakaiyaraa
U+11FF2 unassigned
U+11FF3 unassigned
U+11FF4 unassigned
U+11FF5 unassigned
U+11FF6 unassigned
U+11FF7 unassigned
U+11FF8 unassigned
U+11FF9 unassigned
U+11FFA unassigned
U+11FFB unassigned
U+11FFC unassigned
U+11FFD unassigned
U+11FFE unassigned
U+11FFF Punctuation null null 𑿿 End Of Text
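
Because the Tamil Supplement rows fall into a few contiguous runs, the table can also be stored as sorted ranges rather than one entry per codepoint. The sketch below assumes that representation; `RANGES` and `classify_supplement` are hypothetical names, and the range boundaries are read directly from the table above.

```python
import bisect

# (first, last, Unicode category, Shaping class), sorted by first codepoint.
RANGES = [
    (0x11FC0, 0x11FD4, "Number", "NUMBER"),
    (0x11FD5, 0x11FF1, "Symbol", "SYMBOL"),
    (0x11FFF, 0x11FFF, "Punctuation", None),
]
STARTS = [first for first, _, _, _ in RANGES]


def classify_supplement(codepoint):
    """Return (category, shaping class) for a Tamil Supplement codepoint."""
    if not 0x11FC0 <= codepoint <= 0x11FFF:
        raise ValueError(f"U+{codepoint:04X} is outside the Tamil Supplement block")
    # Find the last range whose first codepoint is <= the query.
    i = bisect.bisect_right(STARTS, codepoint) - 1
    first, last, category, shaping_class = RANGES[i]
    if first <= codepoint <= last:
        return category, shaping_class
    # The codepoint sits in a gap between ranges.
    return "unassigned", None
```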

Grantha marks character table

Tamil text runs may also include diacritical and syllable-modifier marks from the Grantha block. These characters should be classified as follows.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+11301 Mark [Mn] BINDU TOP_POSITION 𑌁 Grantha Candrabindu
U+11303 Mark [Mc] VISARGA RIGHT_POSITION 𑌃 Grantha Visarga
U+1133B Mark [Mn] NUKTA BOTTOM_POSITION 𑌻 Combining Bindu Below
U+1133C Mark [Mn] NUKTA BOTTOM_POSITION 𑌼 Grantha Nukta

Vedic Extensions character table

Sanskrit runs written in the Tamil script may also include characters from the Vedic Extensions block. These characters should be classified as follows.

Note: See the Vedic Extensions document for additional information.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+1CD0 Mark [Mn] CANTILLATION TOP_POSITION ᳐ Tone Karshana
U+1CD1 Mark [Mn] CANTILLATION TOP_POSITION ᳑ Tone Shara
U+1CD2 Mark [Mn] CANTILLATION TOP_POSITION ᳒ Tone Prenkha
U+1CD3 Punctuation null null ᳓ Sign Nihshvasa
U+1CD4 Mark [Mn] CANTILLATION OVERSTRUCK ᳔ Tone Midline Svarita
U+1CD5 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳕ Tone Aggravated Independent Svarita
U+1CD6 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳖ Tone Independent Svarita
U+1CD7 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳗ Tone Kathaka Independent Svarita
U+1CD8 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳘ Tone Candra Below
U+1CD9 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳙ Tone Kathaka Independent Svarita Schroeder
U+1CDA Mark [Mn] CANTILLATION TOP_POSITION ᳚ Tone Double Svarita
U+1CDB Mark [Mn] CANTILLATION TOP_POSITION ᳛ Tone Triple Svarita
U+1CDC Mark [Mn] CANTILLATION BOTTOM_POSITION ᳜ Tone Kathaka Anudatta
U+1CDD Mark [Mn] CANTILLATION BOTTOM_POSITION ᳝ Tone Dot Below
U+1CDE Mark [Mn] CANTILLATION BOTTOM_POSITION ᳞ Tone Two Dots Below
U+1CDF Mark [Mn] CANTILLATION BOTTOM_POSITION ᳟ Tone Three Dots Below
U+1CE0 Mark [Mn] CANTILLATION TOP_POSITION ᳠ Tone Rigvedic Kashmiri Independent Svarita
U+1CE1 Mark [Mc] CANTILLATION RIGHT_POSITION ᳡ Tone Atharvavedic Independent Svarita
U+1CE2 Mark [Mn] AVAGRAHA OVERSTRUCK ᳢ Sign Visarga Svarita
U+1CE3 Mark [Mn] null OVERSTRUCK ᳣ Sign Visarga Udatta
U+1CE4 Mark [Mn] null OVERSTRUCK ᳤ Sign Reversed Visarga Udatta
U+1CE5 Mark [Mn] null OVERSTRUCK ᳥ Sign Visarga Anudatta
U+1CE6 Mark [Mn] null OVERSTRUCK ᳦ Sign Reversed Visarga Anudatta
U+1CE7 Mark [Mn] null OVERSTRUCK ᳧ Sign Visarga Udatta With Tail
U+1CE8 Mark [Mn] AVAGRAHA OVERSTRUCK ᳨ Sign Visarga Anudatta With Tail
U+1CE9 Letter SYMBOL null ᳩ Sign Anusvara Antargomukha
U+1CEA Letter null null ᳪ Sign Anusvara Bahirgomukha
U+1CEB Letter null null ᳫ Sign Anusvara Vamagomukha
U+1CEC Letter SYMBOL null ᳬ Sign Anusvara Vamagomukha With Tail
U+1CED Mark [Mn] AVAGRAHA BOTTOM_POSITION ᳭ Sign Tiryak
U+1CEE Letter SYMBOL null ᳮ Sign Hexiform Long Anusvara
U+1CEF Letter null null ᳯ Sign Long Anusvara
U+1CF0 Letter null null ᳰ Sign Rthang Long Anusvara
U+1CF2 Letter CONSONANT_DEAD null ᳲ Sign Ardhavisarga
U+1CF3 Letter CONSONANT_DEAD null ᳳ Sign Rotated Ardhavisarga
U+1CF4 Mark [Mn] CANTILLATION TOP_POSITION ᳴ Tone Candra Above
U+1CF5 Letter CONSONANT_WITH_STACKER null ᳵ Sign Jihvamuliya
U+1CF6 Letter CONSONANT_WITH_STACKER null ᳶ Sign Upadhmaniya
U+1CF7 Mark [Mc] null null ᳷ Sign Atikrama
U+1CF8 Mark [Mn] CANTILLATION null ᳸ Tone Ring Above
U+1CF9 Mark [Mn] CANTILLATION null ᳹ Tone Double Ring Above
U+1CFA Letter PLACEHOLDER null ᳺ Sign Double Anusvara Antargomukha
U+1CFB unassigned
U+1CFC unassigned
U+1CFD unassigned
U+1CFE unassigned
U+1CFF unassigned
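
The Unicode category column, including the [Mn] and [Mc] tags, can be cross-checked against the Unicode Character Database. The sketch below uses Python's standard `unicodedata` module for that purpose; `category_tag` is a hypothetical helper, and its output depends on the Unicode version bundled with the Python runtime, so individual rows may differ from the table. The Shaping class column cannot be derived this way and must still come from the table.

```python
import unicodedata

# Major general-category letters mapped to the labels used in the tables.
MAJOR_LABELS = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def category_tag(codepoint: int) -> str:
    """Render a codepoint's general category in the style of the tables."""
    gc = unicodedata.category(chr(codepoint))
    if gc == "Cn":
        # Cn is the general category of unassigned codepoints.
        return "unassigned"
    if gc in ("Mn", "Mc"):
        # Non-spacing and spacing-combining marks keep their bracketed tag.
        return f"Mark [{gc}]"
    return MAJOR_LABELS.get(gc[0], gc)


if __name__ == "__main__":
    for cp in (0x1CD0, 0x1CD3, 0x1CE1):
        print(f"U+{cp:04X}", category_tag(cp))
```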

Miscellaneous character table

In addition to general punctuation, runs of Tamil text often use the danda (U+0964) and double danda (U+0965) punctuation marks from the Devanagari block. Tamil text can also incorporate the udatta (U+0951) and anudatta (U+0952) signs from the Devanagari block.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+0951 Mark [Mn] CANTILLATION TOP_POSITION ॑ Udatta
U+0952 Mark [Mn] CANTILLATION BOTTOM_POSITION ॒ Anudatta
U+0964 Punctuation null null । Danda
U+0965 Punctuation null null ॥ Double Danda

Other important characters that may be encountered when shaping runs of Tamil text include the dotted-circle placeholder (U+25CC), the zero-width joiner (U+200D) and zero-width non-joiner (U+200C), and the no-break space (U+00A0).

The dotted-circle placeholder is frequently used when displaying a dependent vowel (matra) or a combining mark in isolation. Real-world text syllables may also use other characters, such as hyphens or dashes, in a similar placeholder fashion; shaping engines should cope with this situation gracefully.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+00A0 Separator PLACEHOLDER null   No-break space
U+00B2 Number SYLLABLE_MODIFIER TOP_POSITION ² Superscript Two
U+00B3 Number SYLLABLE_MODIFIER TOP_POSITION ³ Superscript Three
U+200C Other NON_JOINER null ‌ Zero-width non-joiner
U+200D Other JOINER null ‍ Zero-width joiner
U+2010 Punctuation PLACEHOLDER null ‐ Hyphen
U+2011 Punctuation PLACEHOLDER null ‑ No-break hyphen
U+2012 Punctuation PLACEHOLDER null ‒ Figure dash
U+2013 Punctuation PLACEHOLDER null – En dash
U+2014 Punctuation PLACEHOLDER null — Em dash
U+2074 Number SYLLABLE_MODIFIER TOP_POSITION ⁴ Superscript Four
U+2082 Number SYLLABLE_MODIFIER TOP_POSITION ₂ Subscript Two
U+2083 Number SYLLABLE_MODIFIER TOP_POSITION ₃ Subscript Three
U+2084 Number SYLLABLE_MODIFIER TOP_POSITION ₄ Subscript Four
U+25CC Symbol DOTTED_CIRCLE null ◌ Dotted circle
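
One way to cope with placeholder usage is to treat every PLACEHOLDER or DOTTED_CIRCLE codepoint from the table above as a valid base for a following mark. The sketch below illustrates that idea only; `PLACEHOLDER_BASES`, `DEMO_MARKS`, and `marks_on_placeholders` are hypothetical names, and `DEMO_MARKS` holds just a few Tamil marks for demonstration rather than the full set.

```python
# Codepoints classified as PLACEHOLDER or DOTTED_CIRCLE in the table above.
PLACEHOLDER_BASES = frozenset(
    {0x00A0, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x25CC}
)

# A small sample of Tamil marks (not the complete list).
DEMO_MARKS = frozenset({0x0B82, 0x0BBE, 0x0BBF, 0x0BC0, 0x0BCD})


def marks_on_placeholders(text):
    """Return (index, base, mark) for each mark that directly follows a
    placeholder base in the given string."""
    found = []
    for i in range(1, len(text)):
        base, mark = ord(text[i - 1]), ord(text[i])
        if base in PLACEHOLDER_BASES and mark in DEMO_MARKS:
            found.append((i, base, mark))
    return found
```

For example, `marks_on_placeholders("\u25CC\u0BBE")` reports the Sign Aa at index 1 attached to the dotted circle, while a mark that follows a consonant is not reported.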

The zero-width joiner (ZWJ) is primarily used to prevent the formation of a conjunct from a "Consonant,Halant,Consonant" sequence. The sequence "Consonant,Halant,ZWJ,Consonant" blocks the formation of a conjunct between the two consonants.

Note, however, that the "Consonant,Halant" subsequence in the above example may still trigger a half-forms feature. To prevent the application of the half-forms feature in addition to preventing the conjunct, the zero-width non-joiner (ZWNJ) must be used instead. The sequence "Consonant,Halant,ZWNJ,Consonant" should produce the first consonant in its standard form, followed by an explicit "Halant".

A secondary usage of the zero-width joiner is to prevent the formation of "Reph". An initial "Ra,Halant,ZWJ" sequence should not produce a "Reph", where an initial "Ra,Halant" sequence without the zero-width joiner otherwise would.
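
The joiner rules described above can be summarized as a small decision table. The following sketch encodes them over lists of class names; `JoinerEffect`, `joiner_effect`, and `reph_allowed` are hypothetical names, and whether a half form or "Reph" actually appears still depends on the features the active font provides.

```python
from typing import List, NamedTuple


class JoinerEffect(NamedTuple):
    conjunct_allowed: bool    # may the two consonants form a conjunct?
    half_form_allowed: bool   # may the half-forms feature still apply?
    explicit_halant: bool     # must the Halant be shown explicitly?


def joiner_effect(seq: List[str]) -> JoinerEffect:
    """Map a consonant-cluster sequence to the behavior described above."""
    if seq == ["Consonant", "Halant", "Consonant"]:
        return JoinerEffect(True, True, False)
    if seq == ["Consonant", "Halant", "ZWJ", "Consonant"]:
        # ZWJ blocks the conjunct but leaves half forms possible.
        return JoinerEffect(False, True, False)
    if seq == ["Consonant", "Halant", "ZWNJ", "Consonant"]:
        # ZWNJ blocks both and calls for an explicit Halant.
        return JoinerEffect(False, False, True)
    raise ValueError(f"unsupported sequence: {seq}")


def reph_allowed(seq: List[str]) -> bool:
    """True for an initial Ra,Halant that is not followed by ZWJ."""
    return seq[:2] == ["Ra", "Halant"] and seq[2:3] != ["ZWJ"]
```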

The no-break space (NBSP) is primarily used to display, in an isolated context, those codepoints that are defined as non-spacing: marks, dependent vowels (matras), below-base consonant forms, and post-base consonant forms. It serves as an alternative to displaying them superimposed on the dotted-circle placeholder. These sequences will match "NBSP,ZWJ,Halant,Consonant", "NBSP,mark", or "NBSP,matra".
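
The three NBSP patterns can be recognized with a regular expression over a sequence of class names. The sketch below assumes the input has already been classified; the one-letter codes in `CODES` and the function `is_isolated_display` are arbitrary choices made for this illustration.

```python
import re

# One-letter codes for the class names used in the patterns above.
CODES = {
    "NBSP": "N",
    "ZWJ": "J",
    "Halant": "H",
    "Consonant": "C",
    "mark": "M",
    "matra": "V",
}

# NBSP,ZWJ,Halant,Consonant | NBSP,mark | NBSP,matra
ISOLATED = re.compile(r"N(?:JHC|M|V)")


def is_isolated_display(classes):
    """True if the class sequence is one of the NBSP display patterns."""
    # Unknown class names encode as "?" and therefore never match.
    encoded = "".join(CODES.get(name, "?") for name in classes)
    return ISOLATED.fullmatch(encoded) is not None
```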

Tamil text sometimes uses the Latin numerals 2, 3, and 4 in superscript or subscript positions to annotate Sanskrit. When used in this fashion, the superscripts and subscripts are treated as SYLLABLE_MODIFIER signs for shaping purposes.