Telugu character tables

This document lists the per-character shaping information needed to shape Telugu text.

Telugu character table

Telugu characters should be classified as in the following table. Codepoints in the Telugu block with no assigned meaning are designated as unassigned in the Unicode category column.

Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine. Note that this does include some valid codepoints, such as currency marks, punctuation, and other symbols.

Note: the NUMBER and SYMBOL Shaping classes are important during syllable identification, but generally evoke no further special behavior during the rest of the shaping process.

The Mark-placement subclass column indicates mark-placement positioning for codepoints in the Mark category. Assigned, non-mark codepoints have a null in this column and evoke no special mark-placement behavior. Marks tagged with [Mn] in the Unicode category column are categorized as non-spacing; marks tagged with [Mc] are categorized as spacing-combining.

Some codepoints in the following table use a Shaping class that differs from the codepoint's Unicode General Category. The Shaping class takes precedence during OpenType shaping, as it captures more specific, script-aware behavior.
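As a rough illustration of how the Unicode category column relates to Unicode's General Category values, the sketch below derives a column-style label from Python's standard `unicodedata` module. The function name `unicode_category_column` and the `MAJOR_CLASS` mapping are hypothetical helpers introduced here, and the result depends on the Unicode version bundled with the Python interpreter. It does not produce the Shaping class, which must still come from the table.

```python
import unicodedata

# Hypothetical mapping from the first letter of the General Category
# to the labels used in the "Unicode category" column.
MAJOR_CLASS = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def unicode_category_column(ch: str) -> str:
    """Return a label in the style of the 'Unicode category' column."""
    gc = unicodedata.category(ch)      # e.g. 'Mn', 'Mc', 'Lo', 'Nd', 'Cn'
    if gc == "Cn":                     # codepoint has no assigned meaning
        return "unassigned"
    label = MAJOR_CLASS[gc[0]]
    # Marks carry their subtype: [Mn] non-spacing, [Mc] spacing-combining.
    return f"{label} [{gc}]" if gc[0] == "M" else label
```

For example, `unicode_category_column("\u0C3E")` reports a non-spacing mark and `unicode_category_column("\u0C41")` a spacing-combining mark, matching the rows for Sign Aa and Sign U.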

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+0C00 Mark [Mn] BINDU TOP_POSITION ఀ Combining Candrabindu Above
U+0C01 Mark [Mc] BINDU RIGHT_POSITION ఁ Candrabindu
U+0C02 Mark [Mc] BINDU RIGHT_POSITION ం Anusvara
U+0C03 Mark [Mc] VISARGA RIGHT_POSITION ః Visarga
U+0C04 Mark [Mn] BINDU TOP_POSITION ఄ Combining Anusvara Above
U+0C05 Letter VOWEL_INDEPENDENT null అ A
U+0C06 Letter VOWEL_INDEPENDENT null ఆ Aa
U+0C07 Letter VOWEL_INDEPENDENT null ఇ I
U+0C08 Letter VOWEL_INDEPENDENT null ఈ Ii
U+0C09 Letter VOWEL_INDEPENDENT null ఉ U
U+0C0A Letter VOWEL_INDEPENDENT null ఊ Uu
U+0C0B Letter VOWEL_INDEPENDENT null ఋ Vocalic R
U+0C0C Letter VOWEL_INDEPENDENT null ఌ Vocalic L
U+0C0D unassigned
U+0C0E Letter VOWEL_INDEPENDENT null ఎ E
U+0C0F Letter VOWEL_INDEPENDENT null ఏ Ee
U+0C10 Letter VOWEL_INDEPENDENT null ఐ Ai
U+0C11 unassigned
U+0C12 Letter VOWEL_INDEPENDENT null ఒ O
U+0C13 Letter VOWEL_INDEPENDENT null ఓ Oo
U+0C14 Letter VOWEL_INDEPENDENT null ఔ Au
U+0C15 Letter CONSONANT null క Ka
U+0C16 Letter CONSONANT null ఖ Kha
U+0C17 Letter CONSONANT null గ Ga
U+0C18 Letter CONSONANT null ఘ Gha
U+0C19 Letter CONSONANT null ఙ Nga
U+0C1A Letter CONSONANT null చ Ca
U+0C1B Letter CONSONANT null ఛ Cha
U+0C1C Letter CONSONANT null జ Ja
U+0C1D Letter CONSONANT null ఝ Jha
U+0C1E Letter CONSONANT null ఞ Nya
U+0C1F Letter CONSONANT null ట Tta
U+0C20 Letter CONSONANT null ఠ Ttha
U+0C21 Letter CONSONANT null డ Dda
U+0C22 Letter CONSONANT null ఢ Ddha
U+0C23 Letter CONSONANT null ణ Nna
U+0C24 Letter CONSONANT null త Ta
U+0C25 Letter CONSONANT null థ Tha
U+0C26 Letter CONSONANT null ద Da
U+0C27 Letter CONSONANT null ధ Dha
U+0C28 Letter CONSONANT null న Na
U+0C29 unassigned
U+0C2A Letter CONSONANT null ప Pa
U+0C2B Letter CONSONANT null ఫ Pha
U+0C2C Letter CONSONANT null బ Ba
U+0C2D Letter CONSONANT null భ Bha
U+0C2E Letter CONSONANT null మ Ma
U+0C2F Letter CONSONANT null య Ya
U+0C30 Letter CONSONANT null ర Ra
U+0C31 Letter CONSONANT null ఱ Rra
U+0C32 Letter CONSONANT null ల La
U+0C33 Letter CONSONANT null ళ Lla
U+0C34 Letter CONSONANT null ఴ Llla
U+0C35 Letter CONSONANT null వ Va
U+0C36 Letter CONSONANT null శ Sha
U+0C37 Letter CONSONANT null ష Ssa
U+0C38 Letter CONSONANT null స Sa
U+0C39 Letter CONSONANT null హ Ha
U+0C3A unassigned
U+0C3B unassigned
U+0C3C unassigned
U+0C3D Letter AVAGRAHA null ఽ Avagraha
U+0C3E Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ా Sign Aa
U+0C3F Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ి Sign I
U+0C40 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ీ Sign Ii
U+0C41 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ు Sign U
U+0C42 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ూ Sign Uu
U+0C43 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ృ Sign Vocalic R
U+0C44 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ౄ Sign Vocalic Rr
U+0C45 unassigned
U+0C46 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ె Sign E
U+0C47 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ే Sign Ee
U+0C48 Mark [Mn] VOWEL_DEPENDENT TOP_AND_BOTTOM_POSITION ై Sign Ai
U+0C49 unassigned
U+0C4A Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ొ Sign O
U+0C4B Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ో Sign Oo
U+0C4C Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ౌ Sign Au
U+0C4D Mark [Mn] VIRAMA TOP_POSITION ్ Virama
U+0C4E unassigned
U+0C4F unassigned
U+0C50 unassigned
U+0C51 unassigned
U+0C52 unassigned
U+0C53 unassigned
U+0C54 unassigned
U+0C55 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ౕ Length Mark
U+0C56 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ౖ Ai Length Mark
U+0C57 unassigned
U+0C58 Letter CONSONANT null ౘ Tsa
U+0C59 Letter CONSONANT null ౙ Dza
U+0C5A Letter CONSONANT null ౚ Rrra
U+0C5B unassigned
U+0C5C unassigned
U+0C5D unassigned
U+0C5E unassigned
U+0C5F unassigned
U+0C60 Letter VOWEL_INDEPENDENT null ౠ Vocalic Rr
U+0C61 Letter VOWEL_INDEPENDENT null ౡ Vocalic Ll
U+0C62 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ౢ Sign Vocalic L
U+0C63 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ౣ Sign Vocalic Ll
U+0C64 unassigned
U+0C65 unassigned
U+0C66 Number NUMBER null ౦ Digit Zero
U+0C67 Number NUMBER null ౧ Digit One
U+0C68 Number NUMBER null ౨ Digit Two
U+0C69 Number NUMBER null ౩ Digit Three
U+0C6A Number NUMBER null ౪ Digit Four
U+0C6B Number NUMBER null ౫ Digit Five
U+0C6C Number NUMBER null ౬ Digit Six
U+0C6D Number NUMBER null ౭ Digit Seven
U+0C6E Number NUMBER null ౮ Digit Eight
U+0C6F Number NUMBER null ౯ Digit Nine
U+0C70 unassigned
U+0C71 unassigned
U+0C72 unassigned
U+0C73 unassigned
U+0C74 unassigned
U+0C75 unassigned
U+0C76 unassigned
U+0C77 Punctuation null null ౷ Sign Siddham
U+0C78 Number NUMBER null ౸ Fraction Zero Odd P
U+0C79 Number NUMBER null ౹ Fraction One Odd P
U+0C7A Number NUMBER null ౺ Fraction Two Odd P
U+0C7B Number NUMBER null ౻ Fraction Three Odd P
U+0C7C Number NUMBER null ౼ Fraction One Even P
U+0C7D Number NUMBER null ౽ Fraction Two Even P
U+0C7E Number NUMBER null ౾ Fraction Three Even P
U+0C7F Symbol SYMBOL null ౿ Tuumu
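
The table above could be encoded as a range-based lookup, as in the following minimal sketch. The names `TELUGU_RANGES` and `classify_telugu` are hypothetical, and the ranges are simply a compressed transcription of the rows above; a real shaping engine would likely use a more compact or generated data structure.

```python
from typing import Optional, Tuple

# (first, last, shaping_class, mark_placement), transcribed from the table.
# None stands for a "null" cell.
TELUGU_RANGES = [
    (0x0C00, 0x0C00, "BINDU", "TOP_POSITION"),
    (0x0C01, 0x0C02, "BINDU", "RIGHT_POSITION"),
    (0x0C03, 0x0C03, "VISARGA", "RIGHT_POSITION"),
    (0x0C04, 0x0C04, "BINDU", "TOP_POSITION"),
    (0x0C05, 0x0C0C, "VOWEL_INDEPENDENT", None),
    (0x0C0E, 0x0C10, "VOWEL_INDEPENDENT", None),
    (0x0C12, 0x0C14, "VOWEL_INDEPENDENT", None),
    (0x0C15, 0x0C28, "CONSONANT", None),
    (0x0C2A, 0x0C39, "CONSONANT", None),
    (0x0C3D, 0x0C3D, "AVAGRAHA", None),
    (0x0C3E, 0x0C40, "VOWEL_DEPENDENT", "TOP_POSITION"),
    (0x0C41, 0x0C44, "VOWEL_DEPENDENT", "RIGHT_POSITION"),
    (0x0C46, 0x0C47, "VOWEL_DEPENDENT", "TOP_POSITION"),
    (0x0C48, 0x0C48, "VOWEL_DEPENDENT", "TOP_AND_BOTTOM_POSITION"),
    (0x0C4A, 0x0C4C, "VOWEL_DEPENDENT", "TOP_POSITION"),
    (0x0C4D, 0x0C4D, "VIRAMA", "TOP_POSITION"),
    (0x0C55, 0x0C55, "VOWEL_DEPENDENT", "TOP_POSITION"),
    (0x0C56, 0x0C56, "VOWEL_DEPENDENT", "BOTTOM_POSITION"),
    (0x0C58, 0x0C5A, "CONSONANT", None),
    (0x0C60, 0x0C61, "VOWEL_INDEPENDENT", None),
    (0x0C62, 0x0C63, "VOWEL_DEPENDENT", "BOTTOM_POSITION"),
    (0x0C66, 0x0C6F, "NUMBER", None),
    (0x0C77, 0x0C77, None, None),          # Sign Siddham: assigned, null class
    (0x0C78, 0x0C7E, "NUMBER", None),
    (0x0C7F, 0x0C7F, "SYMBOL", None),
]

def classify_telugu(cp: int) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (shaping_class, mark_placement), or None if the codepoint is
    unassigned in the table or lies outside the Telugu block."""
    for first, last, shaping_class, placement in TELUGU_RANGES:
        if first <= cp <= last:
            return (shaping_class, placement)
    return None
```

Returning `None` for unassigned codepoints, but `(None, None)` for assigned codepoints with null cells, keeps the distinction the table draws between the two cases.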

Vedic Extensions character table

Sanskrit runs written in the Telugu script may also include characters from the Vedic Extensions block. These characters should be classified as follows.

Note: See the Vedic Extensions document for additional information.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+1CD0 Mark [Mn] CANTILLATION TOP_POSITION ᳐ Tone Karshana
U+1CD1 Mark [Mn] CANTILLATION TOP_POSITION ᳑ Tone Shara
U+1CD2 Mark [Mn] CANTILLATION TOP_POSITION ᳒ Tone Prenkha
U+1CD3 Punctuation null null ᳓ Sign Nihshvasa
U+1CD4 Mark [Mn] CANTILLATION OVERSTRUCK ᳔ Tone Midline Svarita
U+1CD5 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳕ Tone Aggravated Independent Svarita
U+1CD6 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳖ Tone Independent Svarita
U+1CD7 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳗ Tone Kathaka Independent Svarita
U+1CD8 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳘ Tone Candra Below
U+1CD9 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳙ Tone Kathaka Independent Svarita Schroeder
U+1CDA Mark [Mn] CANTILLATION TOP_POSITION ᳚ Tone Double Svarita
U+1CDB Mark [Mn] CANTILLATION TOP_POSITION ᳛ Tone Triple Svarita
U+1CDC Mark [Mn] CANTILLATION BOTTOM_POSITION ᳜ Tone Kathaka Anudatta
U+1CDD Mark [Mn] CANTILLATION BOTTOM_POSITION ᳝ Tone Dot Below
U+1CDE Mark [Mn] CANTILLATION BOTTOM_POSITION ᳞ Tone Two Dots Below
U+1CDF Mark [Mn] CANTILLATION BOTTOM_POSITION ᳟ Tone Three Dots Below
U+1CE0 Mark [Mn] CANTILLATION TOP_POSITION ᳠ Tone Rigvedic Kashmiri Independent Svarita
U+1CE1 Mark [Mc] CANTILLATION RIGHT_POSITION ᳡ Tone Atharvavedic Independent Svarita
U+1CE2 Mark [Mn] AVAGRAHA OVERSTRUCK ᳢ Sign Visarga Svarita
U+1CE3 Mark [Mn] null OVERSTRUCK ᳣ Sign Visarga Udatta
U+1CE4 Mark [Mn] null OVERSTRUCK ᳤ Sign Reversed Visarga Udatta
U+1CE5 Mark [Mn] null OVERSTRUCK ᳥ Sign Visarga Anudatta
U+1CE6 Mark [Mn] null OVERSTRUCK ᳦ Sign Reversed Visarga Anudatta
U+1CE7 Mark [Mn] null OVERSTRUCK ᳧ Sign Visarga Udatta With Tail
U+1CE8 Mark [Mn] AVAGRAHA OVERSTRUCK ᳨ Sign Visarga Anudatta With Tail
U+1CE9 Letter SYMBOL null ᳩ Sign Anusvara Antargomukha
U+1CEA Letter null null ᳪ Sign Anusvara Bahirgomukha
U+1CEB Letter null null ᳫ Sign Anusvara Vamagomukha
U+1CEC Letter SYMBOL null ᳬ Sign Anusvara Vamagomukha With Tail
U+1CED Mark [Mn] AVAGRAHA BOTTOM_POSITION ᳭ Sign Tiryak
U+1CEE Letter SYMBOL null ᳮ Sign Hexiform Long Anusvara
U+1CEF Letter null null ᳯ Sign Long Anusvara
U+1CF0 Letter null null ᳰ Sign Rthang Long Anusvara
U+1CF1 Letter SYMBOL null ᳱ Sign Anusvara Ubhayato Mukha
U+1CF2 Letter CONSONANT_DEAD null ᳲ Sign Ardhavisarga
U+1CF3 Letter CONSONANT_DEAD null ᳳ Sign Rotated Ardhavisarga
U+1CF4 Mark [Mn] CANTILLATION TOP_POSITION ᳴ Tone Candra Above
U+1CF5 Letter CONSONANT_WITH_STACKER null ᳵ Sign Jihvamuliya
U+1CF6 Letter CONSONANT_WITH_STACKER null ᳶ Sign Upadhmaniya
U+1CF7 Mark [Mc] null null ᳷ Sign Atikrama
U+1CF8 Mark [Mn] CANTILLATION null ᳸ Tone Ring Above
U+1CF9 Mark [Mn] CANTILLATION null ᳹ Tone Double Ring Above
U+1CFA Letter PLACEHOLDER null ᳺ Sign Double Anusvara Antargomukha
U+1CFB unassigned
U+1CFC unassigned
U+1CFD unassigned
U+1CFE unassigned
U+1CFF unassigned

Miscellaneous character table

In addition to general punctuation, runs of Telugu text often use the danda (U+0964) and double danda (U+0965) punctuation marks from the Devanagari block. Telugu text can also incorporate the udatta (U+0951) and anudatta (U+0952) signs from the Devanagari block.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+0951 Mark [Mn] CANTILLATION TOP_POSITION ॑ Udatta
U+0952 Mark [Mn] CANTILLATION BOTTOM_POSITION ॒ Anudatta
U+0964 Punctuation null null । Danda
U+0965 Punctuation null null ॥ Double Danda

Other important characters that may be encountered when shaping runs of Telugu text include the dotted-circle placeholder (U+25CC), the zero-width joiner (U+200D) and zero-width non-joiner (U+200C), and the no-break space (U+00A0).

The dotted-circle placeholder is frequently used when displaying a dependent vowel (matra) or a combining mark in isolation. Real-world text syllables may also use other characters, such as hyphens or dashes, in a similar placeholder fashion; shaping engines should cope with this situation gracefully.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+00A0 Separator PLACEHOLDER null   No-break space
U+200C Other NON_JOINER null ‌ Zero-width non-joiner
U+200D Other JOINER null ‍ Zero-width joiner
U+2010 Punctuation PLACEHOLDER null ‐ Hyphen
U+2011 Punctuation PLACEHOLDER null ‑ No-break hyphen
U+2012 Punctuation PLACEHOLDER null ‒ Figure dash
U+2013 Punctuation PLACEHOLDER null – En dash
U+2014 Punctuation PLACEHOLDER null — Em dash
U+25CC Symbol DOTTED_CIRCLE null ◌ Dotted circle
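
One way to cope with placeholder characters is to treat every PLACEHOLDER or DOTTED_CIRCLE codepoint as a possible base for an isolated mark or matra. The sketch below transcribes the Shaping class column of the table above; `MISC_CLASSES` and `can_carry_isolated_mark` are hypothetical names used only for illustration.

```python
# Shaping classes transcribed from the table above.
MISC_CLASSES = {
    0x00A0: "PLACEHOLDER",    # No-break space
    0x200C: "NON_JOINER",     # Zero-width non-joiner
    0x200D: "JOINER",         # Zero-width joiner
    0x2010: "PLACEHOLDER",    # Hyphen
    0x2011: "PLACEHOLDER",    # No-break hyphen
    0x2012: "PLACEHOLDER",    # Figure dash
    0x2013: "PLACEHOLDER",    # En dash
    0x2014: "PLACEHOLDER",    # Em dash
    0x25CC: "DOTTED_CIRCLE",  # Dotted circle
}

def can_carry_isolated_mark(ch: str) -> bool:
    """True if ch is classed as a placeholder base in the table above."""
    return MISC_CLASSES.get(ord(ch)) in ("PLACEHOLDER", "DOTTED_CIRCLE")
```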

The zero-width joiner (ZWJ) is primarily used to prevent the formation of a conjunct from a "Consonant,Halant,Consonant" sequence, where "Halant" is the virama (U+0C4D). The sequence "Consonant,Halant,ZWJ,Consonant" blocks the formation of a conjunct between the two consonants.

Note, however, that the "Consonant,Halant" subsequence in the above example may still trigger a half-forms feature. To prevent the application of the half-forms feature in addition to preventing the conjunct, the zero-width non-joiner (ZWNJ) must be used instead. The sequence "Consonant,Halant,ZWNJ,Consonant" should produce the first consonant in its standard form, followed by an explicit "Halant".

A secondary usage of the zero-width joiner is to prevent the formation of "Reph". An initial "Ra,Halant,ZWJ" sequence should not produce a "Reph", where an initial "Ra,Halant" sequence without the zero-width joiner otherwise would.
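
The joiner rules described above can be summarized in a small sketch. It inspects a sequence that begins with "Consonant,Halant" and reports which forms the joiner leaves available; whether a font actually supplies those forms is a separate question. The function `halant_behaviour`, its dictionary keys, and the helper `is_consonant` are hypothetical names, and the consonant ranges are taken from the CONSONANT rows of the Telugu table.

```python
VIRAMA = "\u0C4D"  # "Halant" in the sequences above
ZWNJ = "\u200C"
ZWJ = "\u200D"
RA = "\u0C30"

def is_consonant(ch: str) -> bool:
    """True for codepoints with the CONSONANT class in the Telugu table."""
    cp = ord(ch)
    return (0x0C15 <= cp <= 0x0C28 or 0x0C2A <= cp <= 0x0C39
            or 0x0C58 <= cp <= 0x0C5A)

def halant_behaviour(seq: str) -> dict:
    """Report which forms remain permitted after Consonant,Halant[,joiner]."""
    if len(seq) < 2 or not is_consonant(seq[0]) or seq[1] != VIRAMA:
        raise ValueError("expected a sequence starting with Consonant,Halant")
    joiner = seq[2] if len(seq) > 2 and seq[2] in (ZWJ, ZWNJ) else None
    return {
        # Either joiner blocks the conjunct.
        "conjunct_allowed": joiner is None,
        # Only ZWNJ also blocks the half form (explicit Halant is shown).
        "half_form_allowed": joiner != ZWNJ,
        # ZWJ blocks Reph; ZWNJ keeps the consonant in its standard form.
        "reph_allowed": seq[0] == RA and joiner is None,
    }
```

For instance, `halant_behaviour("\u0C15\u0C4D\u200D\u0C37")` reports that the conjunct is blocked while the half form is still permitted.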

The no-break space (NBSP) is primarily used to display those codepoints that are defined as non-spacing (marks, dependent vowels (matras), below-base consonant forms, and post-base consonant forms) in an isolated context, as an alternative to displaying them superimposed on the dotted-circle placeholder. These sequences will match "NBSP,ZWJ,Halant,Consonant", "NBSP,mark", or "NBSP,matra".
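
The three NBSP patterns above can be expressed as a regular expression, as in this minimal sketch. The character classes are assumptions drawn from the Telugu table: "matra" is taken to mean the VOWEL_DEPENDENT rows and "mark" the BINDU and VISARGA rows (U+0C00 through U+0C04); cantillation marks from other blocks are left out for brevity. The names `NBSP_SEQUENCE` and `is_isolated_form` are hypothetical.

```python
import re

# Character-class bodies built from the Telugu table (assumed coverage).
CONSONANT = "\u0C15-\u0C28\u0C2A-\u0C39\u0C58-\u0C5A"
MATRA = "\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4C\u0C55\u0C56\u0C62\u0C63"
MARK = "\u0C00-\u0C04"

# NBSP,ZWJ,Halant,Consonant  |  NBSP,mark  |  NBSP,matra
NBSP_SEQUENCE = re.compile(
    "\u00A0(?:\u200D\u0C4D[" + CONSONANT + "]|[" + MARK + MATRA + "])"
)

def is_isolated_form(text: str) -> bool:
    """True if text is exactly one of the NBSP-based isolated sequences."""
    return NBSP_SEQUENCE.fullmatch(text) is not None
```

A sequence such as NBSP followed directly by a consonant does not match, since a consonant needs no placeholder to be displayed on its own.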