Malayalam character tables

This document lists the per-character shaping information needed to shape Malayalam text.

Malayalam character table

Malayalam characters should be classified as shown in the following table. Codepoints in the Malayalam block with no assigned meaning are marked as unassigned in the Unicode category column.

Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine. Note that this includes some valid codepoints, such as currency marks, punctuation, and other symbols.

Note: the NUMBER and SYMBOL Shaping classes are important during syllable identification, but generally evoke no further special behavior during the rest of the shaping process.

The Mark-placement subclass column indicates mark-placement positioning for codepoints in the Mark category. Assigned, non-mark codepoints have a null in this column and evoke no special mark-placement behavior. Marks tagged with [Mn] in the Unicode category column are categorized as non-spacing; marks tagged with [Mc] are categorized as spacing-combining.

Some codepoints in the following table use a Shaping class that differs from the codepoint's Unicode General Category. The Shaping class takes precedence during OpenType shaping, as it captures more specific, script-aware behavior.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+0D00 Mark [Mn] BINDU TOP_POSITION ഀ Combining Anusvara Above
U+0D01 Mark [Mn] BINDU TOP_POSITION ഁ Candrabindu
U+0D02 Mark [Mc] BINDU RIGHT_POSITION ം Anusvara
U+0D03 Mark [Mc] VISARGA RIGHT_POSITION ഃ Visarga
U+0D04 Letter BINDU null ഄ Vedic Anusvara
U+0D05 Letter VOWEL_INDEPENDENT null അ A
U+0D06 Letter VOWEL_INDEPENDENT null ആ Aa
U+0D07 Letter VOWEL_INDEPENDENT null ഇ I
U+0D08 Letter VOWEL_INDEPENDENT null ഈ Ii
U+0D09 Letter VOWEL_INDEPENDENT null ഉ U
U+0D0A Letter VOWEL_INDEPENDENT null ഊ Uu
U+0D0B Letter VOWEL_INDEPENDENT null ഋ Vocalic R
U+0D0C Letter VOWEL_INDEPENDENT null ഌ Vocalic L
U+0D0D unassigned
U+0D0E Letter VOWEL_INDEPENDENT null എ E
U+0D0F Letter VOWEL_INDEPENDENT null ഏ Ee
U+0D10 Letter VOWEL_INDEPENDENT null ഐ Ai
U+0D11 unassigned
U+0D12 Letter VOWEL_INDEPENDENT null ഒ O
U+0D13 Letter VOWEL_INDEPENDENT null ഓ Oo
U+0D14 Letter VOWEL_INDEPENDENT null ഔ Au
U+0D15 Letter CONSONANT null ക Ka
U+0D16 Letter CONSONANT null ഖ Kha
U+0D17 Letter CONSONANT null ഗ Ga
U+0D18 Letter CONSONANT null ഘ Gha
U+0D19 Letter CONSONANT null ങ Nga
U+0D1A Letter CONSONANT null ച Ca
U+0D1B Letter CONSONANT null ഛ Cha
U+0D1C Letter CONSONANT null ജ Ja
U+0D1D Letter CONSONANT null ഝ Jha
U+0D1E Letter CONSONANT null ഞ Nya
U+0D1F Letter CONSONANT null ട Tta
U+0D20 Letter CONSONANT null ഠ Ttha
U+0D21 Letter CONSONANT null ഡ Dda
U+0D22 Letter CONSONANT null ഢ Ddha
U+0D23 Letter CONSONANT null ണ Nna
U+0D24 Letter CONSONANT null ത Ta
U+0D25 Letter CONSONANT null ഥ Tha
U+0D26 Letter CONSONANT null ദ Da
U+0D27 Letter CONSONANT null ധ Dha
U+0D28 Letter CONSONANT null ന Na
U+0D29 Letter CONSONANT null ഩ Nnna
U+0D2A Letter CONSONANT null പ Pa
U+0D2B Letter CONSONANT null ഫ Pha
U+0D2C Letter CONSONANT null ബ Ba
U+0D2D Letter CONSONANT null ഭ Bha
U+0D2E Letter CONSONANT null മ Ma
U+0D2F Letter CONSONANT null യ Ya
U+0D30 Letter CONSONANT null ര Ra
U+0D31 Letter CONSONANT null റ Rra
U+0D32 Letter CONSONANT null ല La
U+0D33 Letter CONSONANT null ള Lla
U+0D34 Letter CONSONANT null ഴ Llla
U+0D35 Letter CONSONANT null വ Va
U+0D36 Letter CONSONANT null ശ Sha
U+0D37 Letter CONSONANT null ഷ Ssa
U+0D38 Letter CONSONANT null സ Sa
U+0D39 Letter CONSONANT null ഹ Ha
U+0D3A Letter CONSONANT null ഺ Ttta
U+0D3B Mark [Mn] PURE_KILLER TOP_POSITION ഻ Vertical Bar Virama
U+0D3C Mark [Mn] PURE_KILLER TOP_POSITION ഼ Circular Virama
U+0D3D Letter AVAGRAHA null ഽ Avagraha
U+0D3E Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ാ Sign Aa
U+0D3F Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ി Sign I
U+0D40 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ീ Sign Ii
U+0D41 Mark [Mn] VOWEL_DEPENDENT RIGHT_POSITION ു Sign U
U+0D42 Mark [Mn] VOWEL_DEPENDENT RIGHT_POSITION ൂ Sign Uu
U+0D43 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ൃ Sign Vocalic R
U+0D44 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ൄ Sign Vocalic Rr
U+0D45 unassigned
U+0D46 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION െ Sign E
U+0D47 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION േ Sign Ee
U+0D48 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ൈ Sign Ai
U+0D49 unassigned
U+0D4A Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ൊ Sign O
U+0D4B Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ോ Sign Oo
U+0D4C Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ൌ Sign Au
U+0D4D Mark [Mn] VIRAMA TOP_POSITION ് Virama
U+0D4E Letter CONSONANT_PRE_REPHA null ൎ Dot Reph
U+0D4F Symbol SYMBOL null ൏ Para
U+0D50 unassigned
U+0D51 unassigned
U+0D52 unassigned
U+0D53 unassigned
U+0D54 Letter CONSONANT_DEAD null ൔ Chillu M
U+0D55 Letter CONSONANT_DEAD null ൕ Chillu Y
U+0D56 Letter CONSONANT_DEAD null ൖ Chillu Lll
U+0D57 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ൗ Au Length Mark
U+0D58 Number NUMBER null ൘ Fraction 1/160
U+0D59 Number NUMBER null ൙ Fraction 1/40
U+0D5A Number NUMBER null ൚ Fraction 3/80
U+0D5B Number NUMBER null ൛ Fraction 1/20
U+0D5C Number NUMBER null ൜ Fraction 1/10
U+0D5D Number NUMBER null ൝ Fraction 3/20
U+0D5E Number NUMBER null ൞ Fraction 1/5
U+0D5F Letter VOWEL_INDEPENDENT null ൟ Archaic Ii
U+0D60 Letter VOWEL_INDEPENDENT null ൠ Vocalic Rr
U+0D61 Letter VOWEL_INDEPENDENT null ൡ Vocalic Ll
U+0D62 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ൢ Sign Vocalic L
U+0D63 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ൣ Sign Vocalic Ll
U+0D64 unassigned
U+0D65 unassigned
U+0D66 Number NUMBER null ൦ Digit Zero
U+0D67 Number NUMBER null ൧ Digit One
U+0D68 Number NUMBER null ൨ Digit Two
U+0D69 Number NUMBER null ൩ Digit Three
U+0D6A Number NUMBER null ൪ Digit Four
U+0D6B Number NUMBER null ൫ Digit Five
U+0D6C Number NUMBER null ൬ Digit Six
U+0D6D Number NUMBER null ൭ Digit Seven
U+0D6E Number NUMBER null ൮ Digit Eight
U+0D6F Number NUMBER null ൯ Digit Nine
U+0D70 Number NUMBER null ൰ Number Ten
U+0D71 Number NUMBER null ൱ Number One Hundred
U+0D72 Number NUMBER null ൲ Number One Thousand
U+0D73 Number NUMBER null ൳ Fraction 1/4
U+0D74 Number NUMBER null ൴ Fraction 1/2
U+0D75 Number NUMBER null ൵ Fraction 3/4
U+0D76 Number NUMBER null ൶ Fraction 1/16
U+0D77 Number NUMBER null ൷ Fraction 1/8
U+0D78 Number NUMBER null ൸ Fraction 3/16
U+0D79 Symbol SYMBOL null ൹ Date Mark
U+0D7A Letter CONSONANT_DEAD null ൺ Chillu Nn
U+0D7B Letter CONSONANT_DEAD null ൻ Chillu N
U+0D7C Letter CONSONANT_DEAD null ർ Chillu Rr
U+0D7D Letter CONSONANT_DEAD null ൽ Chillu L
U+0D7E Letter CONSONANT_DEAD null ൾ Chillu Ll
U+0D7F Letter CONSONANT_DEAD null ൿ Chillu K
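
As an illustration of how the table above might be consulted, the following minimal Python sketch pairs the Unicode General Category (from the standard `unicodedata` module) with a Shaping class lookup. The `MALAYALAM_SHAPING` dictionary and the `classify` function are hypothetical names introduced here, and only a handful of rows from the table are reproduced.

```python
import unicodedata

# Hypothetical excerpt of the table above:
# codepoint -> (Shaping class, Mark-placement subclass). None stands for null.
MALAYALAM_SHAPING = {
    0x0D02: ("BINDU", "RIGHT_POSITION"),            # Anusvara
    0x0D15: ("CONSONANT", None),                    # Ka
    0x0D30: ("CONSONANT", None),                    # Ra
    0x0D3E: ("VOWEL_DEPENDENT", "RIGHT_POSITION"),  # Sign Aa
    0x0D46: ("VOWEL_DEPENDENT", "LEFT_POSITION"),   # Sign E
    0x0D4D: ("VIRAMA", "TOP_POSITION"),             # Virama
    0x0D4E: ("CONSONANT_PRE_REPHA", None),          # Dot Reph
    0x0D7B: ("CONSONANT_DEAD", None),               # Chillu N
}


def classify(ch):
    """Return (general category, shaping class, mark-placement subclass).

    Codepoints missing from the excerpt yield (category, None, None).
    """
    shaping_class, placement = MALAYALAM_SHAPING.get(ord(ch), (None, None))
    return unicodedata.category(ch), shaping_class, placement


if __name__ == "__main__":
    for ch in "\u0D15\u0D4D\u0D4E":
        print(f"U+{ord(ch):04X}", classify(ch))
```

Dot Reph (U+0D4E) shows why the Shaping class is tracked separately: its General Category only says that it is a letter, while the Shaping class records its script-specific role.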

Vedic Extensions character table

Sanskrit runs written in the Malayalam script may also include characters from the Vedic Extensions block. These characters should be classified as follows.

Note: See the Vedic Extensions document for additional information.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+1CD0 Mark [Mn] CANTILLATION TOP_POSITION ᳐ Tone Karshana
U+1CD1 Mark [Mn] CANTILLATION TOP_POSITION ᳑ Tone Shara
U+1CD2 Mark [Mn] CANTILLATION TOP_POSITION ᳒ Tone Prenkha
U+1CD3 Punctuation null null ᳓ Sign Nihshvasa
U+1CD4 Mark [Mn] CANTILLATION OVERSTRUCK ᳔ Tone Midline Svarita
U+1CD5 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳕ Tone Aggravated Independent Svarita
U+1CD6 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳖ Tone Independent Svarita
U+1CD7 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳗ Tone Kathaka Independent Svarita
U+1CD8 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳘ Tone Candra Below
U+1CD9 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳙ Tone Kathaka Independent Svarita Schroeder
U+1CDA Mark [Mn] CANTILLATION TOP_POSITION ᳚ Tone Double Svarita
U+1CDB Mark [Mn] CANTILLATION TOP_POSITION ᳛ Tone Triple Svarita
U+1CDC Mark [Mn] CANTILLATION BOTTOM_POSITION ᳜ Tone Kathaka Anudatta
U+1CDD Mark [Mn] CANTILLATION BOTTOM_POSITION ᳝ Tone Dot Below
U+1CDE Mark [Mn] CANTILLATION BOTTOM_POSITION ᳞ Tone Two Dots Below
U+1CDF Mark [Mn] CANTILLATION BOTTOM_POSITION ᳟ Tone Three Dots Below
U+1CE0 Mark [Mn] CANTILLATION TOP_POSITION ᳠ Tone Rigvedic Kashmiri Independent Svarita
U+1CE1 Mark [Mc] CANTILLATION RIGHT_POSITION ᳡ Tone Atharvavedic Independent Svarita
U+1CE2 Mark [Mn] AVAGRAHA OVERSTRUCK ᳢ Sign Visarga Svarita
U+1CE3 Mark [Mn] null OVERSTRUCK ᳣ Sign Visarga Udatta
U+1CE4 Mark [Mn] null OVERSTRUCK ᳤ Sign Reversed Visarga Udatta
U+1CE5 Mark [Mn] null OVERSTRUCK ᳥ Sign Visarga Anudatta
U+1CE6 Mark [Mn] null OVERSTRUCK ᳦ Sign Reversed Visarga Anudatta
U+1CE7 Mark [Mn] null OVERSTRUCK ᳧ Sign Visarga Udatta With Tail
U+1CE8 Mark [Mn] AVAGRAHA OVERSTRUCK ᳨ Sign Visarga Anudatta With Tail
U+1CE9 Letter SYMBOL null ᳩ Sign Anusvara Antargomukha
U+1CEA Letter null null ᳪ Sign Anusvara Bahirgomukha
U+1CEB Letter null null ᳫ Sign Anusvara Vamagomukha
U+1CEC Letter SYMBOL null ᳬ Sign Anusvara Vamagomukha With Tail
U+1CED Mark [Mn] AVAGRAHA BOTTOM_POSITION ᳭ Sign Tiryak
U+1CEE Letter SYMBOL null ᳮ Sign Hexiform Long Anusvara
U+1CEF Letter null null ᳯ Sign Long Anusvara
U+1CF0 Letter null null ᳰ Sign Rthang Long Anusvara
U+1CF2 Letter CONSONANT_DEAD null ᳲ Sign Ardhavisarga
U+1CF3 Letter CONSONANT_DEAD null ᳳ Sign Rotated Ardhavisarga
U+1CF4 Mark [Mn] CANTILLATION TOP_POSITION ᳴ Tone Candra Above
U+1CF5 Letter CONSONANT_WITH_STACKER null ᳵ Sign Jihvamuliya
U+1CF6 Letter CONSONANT_WITH_STACKER null ᳶ Sign Upadhmaniya
U+1CF7 Mark [Mc] null null ᳷ Sign Atikrama
U+1CF8 Mark [Mn] CANTILLATION null ᳸ Tone Ring Above
U+1CF9 Mark [Mn] CANTILLATION null ᳹ Tone Double Ring Above
U+1CFA Letter PLACEHOLDER null ᳺ Sign Double Anusvara Antargomukha
U+1CFB unassigned
U+1CFC unassigned
U+1CFD unassigned
U+1CFE unassigned
U+1CFF unassigned

Miscellaneous character table

In addition to general punctuation, runs of Malayalam text often use the danda (U+0964) and double danda (U+0965) punctuation marks from the Devanagari block. Malayalam text can also incorporate the udatta (U+0951) and anudatta (U+0952) signs from the Devanagari block.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+0951 Mark [Mn] CANTILLATION TOP_POSITION ॑ Udatta
U+0952 Mark [Mn] CANTILLATION BOTTOM_POSITION ॒ Anudatta
U+0964 Punctuation null null । Danda
U+0965 Punctuation null null ॥ Double Danda

Other important characters that may be encountered when shaping runs of Malayalam text include the dotted-circle placeholder (U+25CC), the zero-width joiner (U+200D) and zero-width non-joiner (U+200C), and the no-break space (U+00A0).

The dotted-circle placeholder is frequently used when displaying a dependent vowel (matra) or a combining mark in isolation. Real-world text syllables may also use other characters, such as hyphens or dashes, in a similar placeholder fashion; shaping engines should cope with this situation gracefully.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+00A0 Separator PLACEHOLDER null   No-break space
U+200C Other NON_JOINER null ‌ Zero-width non-joiner
U+200D Other JOINER null ‍ Zero-width joiner
U+2010 Punctuation PLACEHOLDER null ‐ Hyphen
U+2011 Punctuation PLACEHOLDER null ‑ No-break hyphen
U+2012 Punctuation PLACEHOLDER null ‒ Figure dash
U+2013 Punctuation PLACEHOLDER null – En dash
U+2014 Punctuation PLACEHOLDER null — Em dash
U+25CC Symbol DOTTED_CIRCLE null ◌ Dotted circle
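
The placeholder usage described above can be sketched as a simple scan for combining marks that directly follow a placeholder character. This is a minimal sketch, not a shaping-engine implementation: `PLACEHOLDERS` and `marks_on_placeholders` are hypothetical names, and "mark" is approximated here by the Unicode General Categories Mn and Mc.

```python
import unicodedata

# Codepoints tagged PLACEHOLDER or DOTTED_CIRCLE in the table above.
PLACEHOLDERS = {0x00A0, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x25CC}


def marks_on_placeholders(text):
    """Return (placeholder, mark) pairs where a mark directly follows a placeholder."""
    found = []
    for prev, cur in zip(text, text[1:]):
        # Assumption: Mn/Mc stands in for "dependent vowel or combining mark".
        if ord(prev) in PLACEHOLDERS and unicodedata.category(cur) in ("Mn", "Mc"):
            found.append((prev, cur))
    return found


if __name__ == "__main__":
    # Dotted circle + Sign Aa, then en dash + Virama.
    print(marks_on_placeholders("\u25CC\u0D3E \u2013\u0D4D"))
```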

The zero-width joiner (ZWJ) is primarily used to prevent the formation of a conjunct from a "Consonant,Halant,Consonant" sequence. The sequence "Consonant,Halant,ZWJ,Consonant" blocks the formation of a conjunct between the two consonants.

Note, however, that the "Consonant,Halant" subsequence in the above example may still trigger a half-forms feature. To prevent the application of the half-forms feature in addition to preventing the conjunct, the zero-width non-joiner (ZWNJ) must be used instead. The sequence "Consonant,Halant,ZWNJ,Consonant" should produce the first consonant in its standard form, followed by an explicit "Halant".
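
The difference between the two joiners can be summarized in a small lookup. The sketch below is illustrative only: `joiner_effect` and `is_consonant` are hypothetical helpers, "Halant" is taken to be the Malayalam virama (U+0D4D), and consonants are approximated by the contiguous CONSONANT rows U+0D15 through U+0D3A of the Malayalam table.

```python
VIRAMA = "\u0D4D"  # "Halant" in the text above
ZWJ = "\u200D"
ZWNJ = "\u200C"


def is_consonant(ch):
    # Assumption: only the contiguous CONSONANT rows U+0D15..U+0D3A are covered.
    return 0x0D15 <= ord(ch) <= 0x0D3A


def joiner_effect(text):
    """Describe the expected treatment of a Consonant,Halant[,joiner],Consonant sequence."""
    if (len(text) == 3 and is_consonant(text[0])
            and text[1] == VIRAMA and is_consonant(text[2])):
        return "conjunct allowed"
    if (len(text) == 4 and is_consonant(text[0])
            and text[1] == VIRAMA and is_consonant(text[3])):
        if text[2] == ZWJ:
            return "conjunct blocked, half form still possible"
        if text[2] == ZWNJ:
            return "conjunct and half form blocked, explicit halant"
    return "not a consonant-halant-consonant sequence"


if __name__ == "__main__":
    ka, ssa = "\u0D15", "\u0D37"
    for seq in (ka + VIRAMA + ssa, ka + VIRAMA + ZWJ + ssa, ka + VIRAMA + ZWNJ + ssa):
        print([f"U+{ord(c):04X}" for c in seq], "->", joiner_effect(seq))
```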

A secondary usage of the zero-width joiner is to prevent the formation of "Reph". An initial "Ra,Halant,ZWJ" sequence should not produce a "Reph", where an initial "Ra,Halant" sequence without the zero-width joiner otherwise would.
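
The "Reph" rule above amounts to a prefix check on the syllable. The following sketch only restates that rule; `initial_reph_candidate` is a hypothetical helper, and whether a "Reph" is actually rendered also depends on the font and on the rest of the shaping process.

```python
RA = "\u0D30"
VIRAMA = "\u0D4D"  # "Halant" in the text above
ZWJ = "\u200D"


def initial_reph_candidate(syllable):
    """True if the syllable starts with Ra,Halant and no ZWJ follows immediately."""
    if not syllable.startswith(RA + VIRAMA):
        return False
    return not syllable[2:].startswith(ZWJ)


if __name__ == "__main__":
    print(initial_reph_candidate(RA + VIRAMA + "\u0D15"))        # Ra,Halant,Ka
    print(initial_reph_candidate(RA + VIRAMA + ZWJ + "\u0D15"))  # Ra,Halant,ZWJ,Ka
```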

The no-break space (NBSP) is primarily used to display codepoints that are defined as non-spacing in an isolated context, as an alternative to displaying them superimposed on the dotted-circle placeholder. This covers marks, dependent vowels (matras), below-base consonant forms, and post-base consonant forms. These sequences will match "NBSP,ZWJ,Halant,Consonant", "NBSP,mark", or "NBSP,matra".
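
The three NBSP patterns can be sketched as a regular expression over one-letter class symbols. This is a simplified illustration under stated assumptions: `symbol_for` and `is_isolated_form` are hypothetical names, consonants are limited to U+0D15 through U+0D3A, and marks and matras are both approximated by the General Categories Mn and Mc.

```python
import re
import unicodedata

NBSP, ZWJ, VIRAMA = "\u00A0", "\u200D", "\u0D4D"


def symbol_for(ch):
    """Map a character to a one-letter class symbol used by the pattern below."""
    if ch == NBSP:
        return "N"
    if ch == ZWJ:
        return "J"
    if ch == VIRAMA:          # "Halant"; checked before the generic mark test
        return "H"
    if 0x0D15 <= ord(ch) <= 0x0D3A:
        return "C"            # Assumption: contiguous CONSONANT rows only
    if unicodedata.category(ch) in ("Mn", "Mc"):
        return "M"            # Assumption: covers both "mark" and "matra"
    return "x"


# NBSP,ZWJ,Halant,Consonant  or  NBSP,mark  or  NBSP,matra
ISOLATED = re.compile(r"N(?:JHC|M)")


def is_isolated_form(text):
    symbols = "".join(symbol_for(ch) for ch in text)
    return ISOLATED.fullmatch(symbols) is not None


if __name__ == "__main__":
    print(is_isolated_form(NBSP + "\u0D3E"))                 # NBSP + Sign Aa
    print(is_isolated_form(NBSP + ZWJ + VIRAMA + "\u0D2F"))  # NBSP,ZWJ,Halant,Ya
```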