This page, written in HTML 4, tests the diacritic characters commonly used to transliterate Sanskrit and other Indic languages, and explains how to use them within HTML documents on the WWW.


Introduction

Indologists, Sanskritists and others have long wanted a standard way of encoding, in electronic documents and word-processor files, the diacritic characters they use for the transliteration of Sanskrit and other Indic languages. Previously these diacritic characters were not found in any standard character set, so scholars had to resort either to ASCII representations of these characters (e.g. the Kyoto-Harvard convention) or to the FONT FACE tag combined with ad-hoc conventions such as the "Classical Sanskrit" (CS) and "Classical Sanskrit Extended" (CSX) conventions [also known as the IASS (International Association of Sanskrit Studies) conventions]. These conventions rely on fonts in which certain character glyphs have been replaced by glyphs for the diacritic characters, so that an encoding position defined for one character is occupied by a completely different character.

It has also been difficult to exchange electronic documents with others who may be using a different type of computer system and/or may not have the same fonts installed. Another result of using a non-standardized representation for these characters is that it has been difficult to search documents on the WWW for words containing them, since the same word may be encoded in different ways in different documents.

A solution is now available!

HTML 4 was designed so that documents may be unambiguously written in every language and be transported easily around the world. This was accomplished by incorporating [RFC2070], which deals with the internationalization of HTML.

One important step was the adoption of the ISO/IEC 10646 standard (see [ISO10646]) as the basic document character set for HTML 4. This is the world's most inclusive standard dealing with the representation of international characters, text direction, punctuation, and other world language issues. With greater support for diverse human languages within an HTML document, more effective indexing of documents for search engines, higher-quality typography, better text-to-speech conversion, better hyphenation, etc. will be possible.

Within a few years, these features will also allow us to use Devanagari, other Indic scripts and Tibetan in a standard way on the WWW. However, the practical use of these scripts requires that the application (i.e. the browser) or rendering system handle complex context-sensitive glyph substitution and character substitution transparently, and such features have not yet been widely implemented across a broad range of operating systems and applications, let alone web browsers.

However, these complex rendering problems do not arise for the diacritic characters. All you need is to understand the proper way of representing or encoding these characters within HTML, and to have access to an HTML 4 compliant Web browser such as Microsoft IE4 or Navigator 4, along with a font in which the necessary characters are properly encoded.

Font

With the wide availability of multi-script computing over the next few years, most Roman-script fonts will probably be updated to include a comprehensive repertoire of diacritic character glyphs, including those used for the transliteration of Sanskrit. Right now there are a few partial "Unicode" fonts, but these are not widely available and most do not yet contain all the diacritic characters used for the transliteration of Sanskrit. Moreover, the sheer size of some "Unicode fonts", which may contain outlines for tens of thousands of individual character glyphs, currently taxes the resources of many computer systems.

So that people can start using Sanskrit diacritics properly with HTML 4, we have created a Roman font (Nitartha Indic Roman) which contains glyphs for these diacritic characters along with those of the standard ISO Latin-1 character set. The font can be downloaded and installed on any system using Windows 95, NT 4 or above; it should also work on Macs using System 8.5+. This is just a demonstration font, not a commercial font with professional hinting, and hence may not render quite as well on screen. In the near future high-quality commercial fonts containing these characters at the same codepoints will undoubtedly be released.

Character Representation

A simple one-byte-per-character encoding is not sufficient for text over a character repertoire as large as ISO 10646. There are several different encodings of parts of ISO 10646 (such as Unicode) in addition to encodings of the entire character set (such as UCS-4). To maintain backward compatibility with older browsers, and to allow the safe transmission of documents containing multibyte characters over the Internet, which was designed only for 7-bit or 8-bit character encodings, HTML 4 allows multibyte characters either to be referenced by their numeric codepoint in the ISO 10646 standard or to be encoded using the UTF-8 transformation format, a method of representing multibyte characters as a series of single bytes.
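The two options just described can be illustrated with a minimal Python sketch. Python is used here only for illustration and is not part of the original page; the sample word and the variable names are assumptions.

```python
# Illustrative sample word (an assumption): "Kṛṣṇa", written with escapes
# so that the precomposed characters U+1E5B, U+1E63 and U+1E47 are used.
word = "K\u1e5b\u1e63\u1e47a"

# A one-byte-per-character encoding such as ISO Latin-1 cannot hold these characters.
try:
    word.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# Option 1: keep the document pure ASCII and use numeric character references.
as_references = word.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Option 2: encode the text as UTF-8, i.e. as a series of single bytes.
as_utf8 = word.encode("utf-8")

print(fits_in_latin1)   # False
print(as_references)    # K&#7771;&#7779;&#7751;a
print(len(as_utf8))     # 11
```

Both forms carry the same five characters; the first survives any 7-bit channel, while the second is more compact and stays readable in a UTF-8 aware editor.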


Note. To view the rest of this page properly you must be using a browser that supports HTML 4 and have the Nitartha Indic-Times font installed on your system. If you do not have the font installed, another font with the necessary characters properly encoded should work, but only if it is set up as the default font for your browser.

If you know a little HTML and want to understand how the characters on this page are encoded you will probably want to look at it with your browser's "View Page Source" feature or in a plain text editor. Caution: Simply opening this page in some HTML editors will destroy the character encoding.


Decimal Character References:

The syntax is "&#D;", where D is a decimal number; it refers to the character whose Unicode code point is D in decimal.

Ā ā Ī ī Ṛ ṛ Ṝ ṝ Ḷ ḷ Ḹ ḹ Ṃ ṃ Ḥ ḥ Ñ ñ Ṭ ṭ Ḍ ḍ Ṅ ṅ Ṇ ṇ Ś ś Ṣ ṣ

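  Ā  &#256;      ā  &#257;
  Ī  &#298;      ī  &#299;
  Ū  &#362;      ū  &#363;
  Ē  &#274;      ē  &#275;
  Ō  &#332;      ō  &#333;
  Ṛ  &#7770;     ṛ  &#7771;
  Ṝ  &#7772;     ṝ  &#7773;
  Ḷ  &#7734;     ḷ  &#7735;
  Ḹ  &#7736;     ḹ  &#7737;
  Ṃ  &#7746;     ṃ  &#7747;
  Ḥ  &#7716;     ḥ  &#7717;
  Ñ  &#209;      ñ  &#241;
  Ṭ  &#7788;     ṭ  &#7789;
  Ḍ  &#7692;     ḍ  &#7693;
  Ṅ  &#7748;     ṅ  &#7749;
  Ṇ  &#7750;     ṇ  &#7751;
  Ś  &#346;      ś  &#347;
  Ṣ  &#7778;     ṣ  &#7779;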
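As a rough sketch of how such decimal references could be produced and checked mechanically, the following Python snippet converts non-ASCII characters to "&#D;" form and decodes them again with the standard library's html.unescape. The function names and the sample word are assumptions made for illustration.

```python
import html


def to_decimal_reference(ch: str) -> str:
    """Return the HTML decimal character reference for a single character."""
    return f"&#{ord(ch)};"


def encode_diacritics(text: str) -> str:
    """Replace every non-ASCII character with its decimal reference."""
    return "".join(ch if ord(ch) < 128 else to_decimal_reference(ch) for ch in text)


sample = "\u0100tman"  # "Ātman", with A-macron at U+0100
encoded = encode_diacritics(sample)

print(encoded)                           # &#256;tman
print(html.unescape(encoded) == sample)  # True
```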

Hexadecimal Character References:

The syntax is "&#xH;" or "&#XH;", where H is a hexadecimal number; it refers to the character whose Unicode code point is H in hexadecimal. Hexadecimal numbers in numeric character references are case-insensitive. See: HTML 4 Spec: 5.3.1 Numeric character references

At the time this page was created (June, 1998) neither Netscape Navigator 4 nor Microsoft Internet Explorer 4 supported hexadecimal character representation. Therefore the samples in this section have been commented out since they will not display properly in either of these browsers.

Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention would be particularly useful if it worked since character standards generally use hexadecimal representations.
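Since the hexadecimal samples are not shown on this page, here is a small hedged sketch in Python of what the hexadecimal form looks like for one of the characters. The helper name is an assumption; html.unescape is the standard-library decoder, which accepts both the "x" and "X" prefixes and either case for the digits.

```python
import html


def to_hex_reference(ch: str) -> str:
    """Return the HTML hexadecimal character reference for a single character."""
    return f"&#x{ord(ch):X};"


r_dot_below = "\u1e5a"  # capital R with dot below
reference = to_hex_reference(r_dot_below)
print(reference)  # &#x1E5A;

# Case-insensitivity: all of these decode to the same character.
variants = ["&#x1E5A;", "&#x1e5a;", "&#X1E5A;"]
print(all(html.unescape(v) == r_dot_below for v in variants))  # True
```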


UTF-8:

The 2 byte (UCS-2) encoding of ISO 10646 is identical to the Unicode standard.

The UTF-8 encoding of ISO 10646/Unicode allows standard 8-bit systems, ASCII text editors, browsers and so on to continue to work with HTML and other text files containing encoded multibyte characters.

With UTF-8 encoding, ASCII text (0x0..0x7F) is left unchanged, and every character from 0x80 to 0x7FFFFFFF is encoded as a sequence of between two and six bytes. If the most significant bit of the first byte is 0, the remaining seven bits are interpreted as an ASCII character. Otherwise, the number of leading 1 bits gives the total number of (8-bit) bytes in the sequence, including the first. There is always a 0 bit between the count bits and any data.

The first byte takes one of the following forms. The X indicates bits available to encode the character.

  0XXXXXXX  only one byte        0..0x7F (ASCII)
  110XXXXX  two bytes            Maximum character value is 0x7FF 
  1110XXXX  three bytes          Maximum character value is 0xFFFF
  11110XXX  four bytes           Maximum character value is 0x1FFFFF
  111110XX  five bytes           Maximum character value is 0x3FFFFFF
  1111110X  six bytes            Maximum character value is 0x7FFFFFFF

All following bytes have this format: 10XXXXXX
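The bit layout above can be turned into a short encoder. The following Python function is only a sketch of the scheme as tabulated here (including the five- and six-byte forms); its name and structure are assumptions, and in practice Python's built-in str.encode("utf-8") would be used instead.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the first-byte table above (1 to 6 bytes)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        return bytes([code_point])  # 0XXXXXXX: plain ASCII
    # (maximum character value, first-byte marker) for 2..6 byte sequences
    forms = [
        (0x7FF, 0xC0),
        (0xFFFF, 0xE0),
        (0x1FFFFF, 0xF0),
        (0x3FFFFFF, 0xF8),
        (0x7FFFFFFF, 0xFC),
    ]
    for following, (maximum, marker) in enumerate(forms, start=1):
        if code_point <= maximum:
            tail = []
            for _ in range(following):
                # Each following byte is 10XXXXXX and carries six data bits.
                tail.append(0x80 | (code_point & 0x3F))
                code_point >>= 6
            # The remaining high bits go into the first byte after the marker.
            return bytes([marker | code_point] + tail[::-1])
    raise ValueError("code point out of range")


print(utf8_encode(0x0100).hex())  # c480
print(utf8_encode(0x1E5A).hex())  # e1b99a
```

For the characters on this page the result agrees with the built-in encoder, for example utf8_encode(0x1E5A) == "\u1e5a".encode("utf-8").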

A two-byte example. The encoding position for N tilde (Ñ) is 209 in both ISO Latin-1 (8859-1) and ISO 10646. In hexadecimal, it is 0xD1. In HTML, it is &#209; or &Ntilde;. In UTF-8 it has the following two-byte encoding: 0xC3, 0x91.
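To see where the two bytes 0xC3, 0x91 come from, the sequence can be unpacked by hand. This Python sketch is illustrative only; the variable names are assumptions.

```python
first, second = 0xC3, 0x91

# 0xC3 = 110 00011: the 110 prefix announces a two-byte sequence.
is_two_byte_lead = (first >> 5) == 0b110
# 0x91 = 10 010001: the 10 prefix marks a following byte.
is_following_byte = (second >> 6) == 0b10

# Join the five data bits of the first byte with the six of the second.
code_point = ((first & 0x1F) << 6) | (second & 0x3F)

print(is_two_byte_lead, is_following_byte)
print(code_point, hex(code_point))
```

The data bits 00011 and 010001 combine to 00011010001, which is 209 (0xD1), the position of N tilde.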

For more information on UTF-8 see: ISO/IEC JTC1/SC2/WG2 N 1036.

For background information which led to the development of UTF-8, see the proposal that describes the File System Safe UTF (FSS-UTF).

A full set of UTF-8 Test Pages can be found at: http://titus.uni-frankfurt.de/unicode/index.htm

Ā ā Ī ī Ṛ ṛ Ṝ ṝ Ḷ ḷ Ḹ ḹ Ṃ ṃ Ḥ ḥ Ñ ñ Ṭ ṭ Ḍ ḍ Ṅ ṅ Ṇ ṇ Ś ś Ṣ ṣ

     Hex               Dec                 Hex               Dec
  Ā  0xC4, 0x80        196, 128         ā  0xC4, 0x81        196, 129
  Ī  0xC4, 0xAA        196, 170         ī  0xC4, 0xAB        196, 171
  Ū  0xC5, 0xAA        197, 170         ū  0xC5, 0xAB        197, 171
  Ē  0xC4, 0x92        196, 146         ē  0xC4, 0x93        196, 147
  Ō  0xC5, 0x8C        197, 140         ō  0xC5, 0x8D        197, 141
  Ṛ  0xE1, 0xB9, 0x9A  225, 185, 154    ṛ  0xE1, 0xB9, 0x9B  225, 185, 155
  Ṝ  0xE1, 0xB9, 0x9C  225, 185, 156    ṝ  0xE1, 0xB9, 0x9D  225, 185, 157
  Ḷ  0xE1, 0xB8, 0xB6  225, 184, 182    ḷ  0xE1, 0xB8, 0xB7  225, 184, 183
  Ḹ  0xE1, 0xB8, 0xB8  225, 184, 184    ḹ  0xE1, 0xB8, 0xB9  225, 184, 185
  Ṃ  0xE1, 0xB9, 0x82  225, 185, 130    ṃ  0xE1, 0xB9, 0x83  225, 185, 131
  Ḥ  0xE1, 0xB8, 0xA4  225, 184, 164    ḥ  0xE1, 0xB8, 0xA5  225, 184, 165
  Ñ  0xC3, 0x91        195, 145         ñ  0xC3, 0xB1        195, 177
  Ṭ  0xE1, 0xB9, 0xAC  225, 185, 172    ṭ  0xE1, 0xB9, 0xAD  225, 185, 173
  Ḍ  0xE1, 0xB8, 0x8C  225, 184, 140    ḍ  0xE1, 0xB8, 0x8D  225, 184, 141
  Ṅ  0xE1, 0xB9, 0x84  225, 185, 132    ṅ  0xE1, 0xB9, 0x85  225, 185, 133
  Ṇ  0xE1, 0xB9, 0x86  225, 185, 134    ṇ  0xE1, 0xB9, 0x87  225, 185, 135
  Ś  0xC5, 0x9A        197, 154         ś  0xC5, 0x9B        197, 155
  Ṣ  0xE1, 0xB9, 0xA2  225, 185, 162    ṣ  0xE1, 0xB9, 0xA3  225, 185, 163
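Byte values like those in the table are easy to mistype, so it can help to regenerate them. The following Python sketch prints the hexadecimal and decimal UTF-8 bytes for a few of the characters; the function name and the choice of sample characters are assumptions.

```python
def utf8_row(ch: str) -> tuple:
    """Return (hex bytes, decimal bytes) of a character's UTF-8 encoding as strings."""
    data = ch.encode("utf-8")
    hex_bytes = ", ".join(f"0x{b:02X}" for b in data)
    dec_bytes = ", ".join(str(b) for b in data)
    return hex_bytes, dec_bytes


# A-macron, N-tilde, d with dot below, s with dot below
for ch in "\u0100\u00d1\u1e0d\u1e63":
    hex_bytes, dec_bytes = utf8_row(ch)
    print(f"U+{ord(ch):04X}  {hex_bytes}  {dec_bytes}")
```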

- Chris Fynn

  Nitartha international, New York, New York
  Web pages © Nitartha international.
  Photographs, drawings and images © Dzogchen Ponlop, Rinpoche or the artist
  Web design by Martin Marvet
  Comments may be sent to webmaster@nitartha.org
 