W3C

Character Model for the World Wide Web

World Wide Web Consortium Working Draft 25-February-1999

This version:
http://www.w3.org/TR/1999/WD-charmod-19990225
Latest version:
http://www.w3.org/TR/WD-charmod
Editor:
Martin J. Dürst (W3C) <duerst@w3.org>

Status of this document

This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is published as part of the Internationalization Activity (see http://www.w3.org/International/Activity) by the Internationalization Working Group (I18N WG), with the help of the Internationalization Interest Group (I18N IG). The I18N WG will not allow early implementation to constrain its ability to make changes to this specification prior to final release. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C working drafts can be found at http://www.w3.org/TR/.

Among other things, this document addresses the requirements laid out in Requirements for String Identity and Character Indexing Definitions for the WWW. It also contains lists of topics for explicit formulation of the character model used by W3C specifications; these lists will be expanded in the next version. Comments on this Working Draft are very welcome. Comments should be sent to i18n-editor@w3.org. Public discussion of internationalization issues of the WWW takes place on www-international@w3.org.

Abstract

This document defines various aspects of a character model for the WWW. It contains basic definitions and models, specifications to be used by other specifications or directly by implementations, and explanatory material. In particular, early uniform normalization, string identity matching, string indexing, and conventions for URIs are addressed.

Character Model for the World Wide Web

Table of Contents


1. Introduction

1.1 Background

Starting with [RFC 2070], a character model for the WWW has been emerging, with the use of the UCS (Universal Character Set, [ISO 10646]/[Unicode]) as a common reference. As long as data transfer on the WWW was primarily unidirectional (from server to browser), and the main purpose was rendering, the direct use of the UCS as a common reference posed no problems.

However, from early on, the WWW included bidirectional data transfer (forms,...). Recently, purposes other than rendering are becoming more and more important. The WWW has traditionally been seen as a collection of applications exchanging data based on protocols. However, it can also be seen as a single, very large application [Nicol]. The second view is becoming more and more important due to the following developments:

In this context, some properties of the UCS become relevant and have to be addressed. It should be noted that such properties also exist in legacy encodings, and in many cases have been inherited by the UCS in one way or another from such legacy encodings. In particular, these properties are:

This means that in order to ensure consistent behaviour on the WWW, some additional specifications, based on the UCS, are necessary. This is also taken as an occasion to clearly define the basic model and its use in various specifications and implementations.

1.2 Potential Users of this Specification

This specification has a wide range of potential users, some of which are listed below.

The adoption, wherever appropriate, of the specifications and guidelines given in this document by work outside the W3C is strongly encouraged.

1.3 Structure of this Document

Section 2 of this document is intended to give a general introduction to the treatment of characters in protocols and formats, and corresponding guidelines. Section 3 defines and explains the concept of early uniform normalization for string identity matching. Section 4 deals with string indexing. Character encodings in URIs are discussed in Section 5. A glossary gives additional explanations for some of the terms used in this document.

1.4 Notation

Words written in all capitals, such as MUST and SHOULD, are used as defined in [RFC 2119]. UCS codepoints are denoted as U+hhhh, where hhhh is a sequence of hexadecimal digits.

Where this specification contains procedural descriptions, they are to be understood as a way to specify the desired external behavior. As long as the observable behavior is not affected, implementations may use other means of achieving the same results.

2. Use of UCS as a Common Reference

This section defines a general model, e.g. in the sense of the reference processing model in [RFC 2070], and general guidelines, e.g. similar to those in [RFC 2130] and [RFC 2277]. The current version of this document only gives lists of topics to be addressed for each subsection; the WG plans to address the topics in more detail in the next version of this document.

2.1 Characters and Bytes

In the next version of this document, the WG plans to address the following topics in this subsection:

2.2 UCS as a Common Reference

Since [RFC 2070], [ISO 10646]/[Unicode] (hereafter denoted as UCS, Universal Character Set) has been used as a common reference for character encoding in W3C specifications (see [HTML 4.0], [XML 1.0], and [CSS2]). This choice was motivated by the fact that the UCS:

In the next version of this document, the WG plans to address the following topic in this subsection:

2.3 Identification of Character Encodings

In the next version of this document, the WG plans to address the following topics in this subsection:

2.4 Character Escaping

In the next version of this document, the WG plans to address the following topics in this subsection:

3. Webwide Early Uniform Normalization

Text data interchange using W3C protocols and formats is based on the principle of early normalization. This section gives a short overview of the reasons for using webwide early normalization (Section 3.1), defines the exact form to which text data has to be normalized (Section 3.2), and the cases in which normalization must be applied (Section 3.3). Section 3.4 explicitly discusses String Identity Matching, and Section 3.5 contains additional advice for compatibility equivalents and control characters.

3.1 Rationale for Webwide Early Uniform Normalization

From early on, the WWW included bidirectional data transfer (forms,...). Recently, purposes other than rendering are becoming more and more important. The WWW has traditionally been seen as a collection of applications exchanging data based on protocols. However, it can also be seen as a single, very large application [Nicol]. The second view is becoming more and more important due to the following developments:

In this context, some properties of the UCS become relevant and have to be addressed. It should be noted that such properties also exist in legacy encodings, and in many cases have been inherited by the UCS in one way or another from such legacy encodings. In particular, these properties are:

In particular, string identity matching [CharReq] is a basic operation that is carried out frequently. String identity matching is a subset of the more general problem of string matching. There are various degrees of specificity for string matching, from approximate matching such as regular expressions or phonetic matching for English, to more specific matches such as accent-insensitive or case-insensitive matching. String identity matching is concerned only with strings that contain no user-identifiable distinctions.
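
Example (non-normative): The following Python sketch shows two UCS representations of the same user-perceived text, one using the precomposed character U+00E9 and one using U+0065 followed by the combining acute accent U+0301. A user cannot distinguish the two strings, but a naive codepoint-by-codepoint comparison treats them as different.

  # Two UCS representations of the user-perceived string "café":
  precomposed = "caf\u00E9"         # ..., U+00E9 LATIN SMALL LETTER E WITH ACUTE
  decomposed  = "cafe\u0301"        # ..., U+0065 followed by U+0301 COMBINING ACUTE ACCENT

  print(precomposed)                # café
  print(decomposed)                 # café (rendered identically)
  print(precomposed == decomposed)  # False: codepoint-by-codepoint comparison fails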

At various places in the WWW infrastructure, strings, and in particular identifiers, are compared for identity. If different places use different definitions of string identity matching, or if they rely on different mechanisms to test identity, the results are undesired unpredictability and unnecessary conversions. To solve the problem of string identity matching, the following issues have to be addressed:

  1. Which representations to treat as equivalent (and which not)
  2. Which components in the WWW architecture to make responsible for equivalence:
    1. Each individual component that performs a string identity check has to take equivalents into account (late normalization)
    2. Duplicates and ambiguities are removed as close to their source as possible (early normalization)
  3. Which way to normalize (if early normalization (2.2) is chosen)

For the following reasons, early uniform normalization was chosen:

3.2 W3C Text Normalization

Text data is in normalized form according to this specification if all of the following apply:

Note: The list of control codepoints to exclude, and of others to advise against, is still under discussion. See also Section 3.5.

Text data is also considered to be in normalized form for the purpose of this specification if all of the following apply:

Note: It is possible, in theory, that legacy encodings also exhibit the problem of duplicate encodings as described in Section 3.1. In this case, it would be appropriate if a corresponding normalization were applied. However, no such legacy encoding is currently known.

3.3 Application of Early Uniform Normalization

Where this section uses the term normalized form, this means the form defined in Section 3.2.

Applications or tools transcoding from a legacy encoding to an encoding based on UCS MUST ensure that their output is in normalized form.
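
Example (non-normative): The following Python sketch shows a transcoder that decodes a legacy encoding and emits UTF-8 in normalized form. The function name is illustrative, and Unicode Normalization Form C ("NFC") is used only as a stand-in for the normalized form of Section 3.2, which is still under discussion.

  import unicodedata

  def transcode_to_utf8(data, legacy_encoding="iso-8859-1"):
      """Decode legacy-encoded bytes and return UTF-8 bytes in normalized form.

      Illustrative only: NFC stands in for the normalized form of Section 3.2.
      """
      text = data.decode(legacy_encoding)         # legacy bytes -> UCS codepoints
      text = unicodedata.normalize("NFC", text)   # normalize as close to the source as possible
      return text.encode("utf-8")                 # serialize in an encoding based on UCS

  print(transcode_to_utf8(b"caf\xe9"))            # b'caf\xc3\xa9'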

The producer of text data MUST ensure that data is produced or sent out in normalized form. For the purpose of W3C specifications and their implementations, the producer of text data is the sender of the data in the case of protocols. In the case of formats, it is the tool that produces the data.

Implementors of producer software in the above sense are encouraged to delegate normalization to their respective data sources wherever possible. (Examples of data sources would be: operating system, libraries, keyboard drivers.)

If any intermediate recipient of text data applies any operations, it MUST ensure that the results of these operations are again in normalized form, provided the incoming data is in that form. Intermediate recipients may provide additional normalization towards the normalized form, as a side-effect of their operations and/or as an additional service.
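
Example (non-normative): Concatenation is a typical operation for which this requirement matters: even if both inputs are in normalized form, the result need not be, e.g. when the second string starts with a combining character. The Python sketch below (again with NFC standing in for the form of Section 3.2, and an illustrative function name) re-normalizes after the operation.

  import unicodedata

  def concat_normalized(a, b):
      """Concatenate two normalized strings and re-normalize the result."""
      return unicodedata.normalize("NFC", a + b)

  a = "cafe"                                 # in normalized form
  b = "\u0301!"                              # in normalized form, but starts with U+0301 COMBINING ACUTE ACCENT
  print(concat_normalized(a, b))             # café! -- the accent composes with the final "e"
  print(a + b == concat_normalized(a, b))    # False: the plain concatenation is not normalized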

If intermediate recipients do not touch the data but just pass it on, they are not required to check normalization or to normalize data. (Example: caching proxies)

The recipients of text data SHOULD assume that data is normalized. Recipients MAY provide normalization as an add-on service and safety measure. Recipients used in connection with text generation tools SHOULD NOT provide normalization.

Example: An authoring tool should normalize all the text it produces, but must not apply normalization in those of its operations that are the same as in a browser, so that potential problems are caught early.

Tools or operations that just do string identity matching, and that have both strings to be matched available in the same encoding, SHOULD perform the match by binary comparison.

Note: Generators, intermediate recipients, and transcoders must support a repertoire of Unicode codepoints that is complete with respect to normalization, i.e. if any arbitrary sequence of codepoints in the repertoire is normalized, the codepoints needed for the normalization must also be part of the repertoire.

3.4 String Identity Matching

String identity matching on the World Wide Web is based on the following steps:

Conversion to UCS, and to the same encoding for both strings, assures that text strings and not just bytes are compared. Early normalization places the responsibility for avoiding duplicate encodings on the data producer and ensures that a minimum of effort is spent on solving the problem.
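
Example (non-normative): The Python sketch below illustrates these steps; the function name is illustrative, and UTF-8 is chosen arbitrarily as the common UCS-based encoding. No normalization step appears because early uniform normalization is assumed to have been carried out by the producers of both strings.

  def identical(bytes_a, enc_a, bytes_b, enc_b):
      """String identity matching: convert both strings to the same
      UCS-based encoding, then compare byte-for-byte."""
      return bytes_a.decode(enc_a).encode("utf-8") == bytes_b.decode(enc_b).encode("utf-8")

  # The same text, received in two different encodings:
  print(identical(b"caf\xe9", "iso-8859-1", b"caf\xc3\xa9", "utf-8"))   # True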

3.5 Compatibility Equivalents and Control Characters

This specification does not address compatibility equivalents. Compatibility as listed in the Unicode database covers a wide range of similarities/distinctions. Depending on the situation, some distinctions are needed, and others will be confusing. To specify all these situations in a single place seems premature. In the absence of any further specifications, implementations are advised to generate the non-compatibility equivalent if they do not explicitly need the compatibility character. A compatibility character here is a character that disappears when applying Unicode Compatibility Composition (Normalization Form CC of Unicode [TR #15]). A non-compatibility equivalent is the character resulting from applying Unicode Compatibility Composition.
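
Example (non-normative): The Python sketch below shows that the ligature "ﬁ" (U+FB01) is a compatibility character in the above sense: it disappears under compatibility composition, leaving the non-compatibility equivalent "fi". The "NFKC" form implemented in Python's unicodedata module is used here only as an approximation of the draft Compatibility Composition of [TR #15].

  import unicodedata

  ligature = "\uFB01"                                   # "ﬁ" LATIN SMALL LIGATURE FI
  equivalent = unicodedata.normalize("NFKC", ligature)  # compatibility composition (approximation)
  print(equivalent)                                     # fi
  print(equivalent != ligature)                         # True: U+FB01 is a compatibility character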

Specifications are advised to exclude compatibility characters in the syntactic elements of the formats they define if this is reasonable (e.g. exclusion of compatibility characters for GIs in XML). In the future, compatibility characters should be replaced by appropriate style or markup information wherever possible.

This specification does not address any further equivalents, such as case equivalents, the equivalence between katakana and hiragana, the equivalence between accented and un-accented characters, the equivalence between full characters and fallbacks (e.g. "ö" vs. "oe"), and the equivalence between various spellings and morphological forms (e.g. color vs. colour). Such equivalence is on a higher level; whether and where it is needed depends on the language, the application, and the preferences of the user.

This specification does not address all characters with control functions (see Section 3.2 for some). In general, applications are advised to reduce the use of control characters to a minimum. Specifications are advised to exclude control characters in the syntactic elements of the formats they define if this is reasonable (e.g. exclusion of control characters for GIs in XML). Control characters should be replaced by appropriate markup or style information wherever possible.

4. String Indexing

On many occasions, in order to access a substring or a character, it is necessary to identify positions (between "characters") in a text string/sequence/array. Where such indices are exchanged between components of the WWW, there is a need for a uniform definition of string indexing in order to ensure consistent behavior. The requirements for string indexing are discussed in [CharReq, section 4].

Because of the wide variability of scripts and characters, and because of tradeoffs between user friendliness and implementation efficiency, different operations may be required to work at different levels of aggregation or subdivision. At least the following levels can be distinguished:

Note: In many cases, it is highly preferable to use non-numeric ways of identifying substrings. The specification of string indexing for the WWW should not be seen as a general recommendation for the use of string indexing for substring identification. As an example, in the case of translation of a document from one language to another, identification of substrings based on document structure can be expected to be much more stable than identification based on string indexing.

Note: The issue of indexing origin, i.e. whether the first character in a string is indexed as character number 0 or as character number 1, will not be addressed here. In general, even individual characters should be understood and processed as substrings, identified by a position before and a position after the substring. In this case, starting with an index of 0 for the position at the start of the string is the best solution.
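
Example (non-normative): The Python sketch below illustrates this convention: indices are 0-origin and denote positions between characters, and a substring is identified by the position before and the position after it. Python slices happen to follow this model at the level of UCS codepoints; other levels of aggregation (bytes, grapheme-like units) would yield different indices for the same text.

  s = "W3C"
  # positions:  0 'W' 1 '3' 2 'C' 3
  print(s[0:1])   # 'W'  -- the character between positions 0 and 1
  print(s[1:3])   # '3C' -- the substring between positions 1 and 3
  print(s[0:0])   # ''   -- an empty substring at the very start of the string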

5. Character Encoding in URIs

According to the current definition [RFC 2396], URIs are restricted to a subset of US-ASCII. There is also an escaping mechanism to encode arbitrary byte values using the %HH convention, but because, in general, the mapping from characters to bytes is not defined, this is of limited use. To avoid future incompatibilities, W3C specifications MUST include the following paragraph by reference:

For all syntactic elements in the format/protocol which are being interpreted as URIs, characters that are syntactically not allowed by the generic URI syntax (i.e. all non-ASCII characters plus the excluded characters [RFC 2396, Section 2.4.3]) MUST be treated as follows: Each such character is represented in UTF-8 as one or more bytes, each of these bytes is escaped with the URI escaping mechanism (i.e. converted to %HH, where HH is the hexadecimal notation of the byte value), and the original character is replaced by the resulting character sequence.

Example: In the URI <http://www.w3.org/People/Dürst/>, the character "ü" is not allowed. The representation of "ü" in UTF-8 consists of two bytes with the values 0xC3 and 0xBC. The URI is therefore converted to <http://www.w3.org/People/D%C3%BCrst/>.
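
Example (non-normative): The Python sketch below performs the conversion described above using the quote function of the standard urllib.parse module, which by default represents non-ASCII characters in UTF-8 and escapes each byte as %HH. A full implementation must take care to leave characters that are already legal in a URI (such as ":" and "/") untouched.

  from urllib.parse import quote

  # The character "ü" is not allowed in a URI; quote() represents it in
  # UTF-8 (0xC3 0xBC) and escapes each byte as %HH:
  print(quote("\u00FC"))                                   # %C3%BC

  # Applied to the example above:
  print("http://www.w3.org/People/D" + quote("\u00FC") + "rst/")
  # http://www.w3.org/People/D%C3%BCrst/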

Note: The intent of this is not to freeze the definition of URIs to a subset of US-ASCII characters forever, but to ensure that W3C technology correctly and predictably interacts with systems that are based on the current definition of URIs while not inhibiting a future extension of the URI definition.

Note: Current W3C specifications already contain provisions in accordance with the above. For [XML 1.0], please see Section 4.2.2, External Entities. For [HTML 4.0], please see Appendix B.2.1: Non-ASCII characters in URI attribute values, which also contains some provisions for backwards compatibility. Further information and links can be found at [I18NURI].

Glossary

This glossary does not provide exact definitions of terms but gives some background on how certain words are used in this document.

Character
Used in a loose sense to denote small units of text, where the exact definition of these units is still open.
Early Normalization
See Early Uniform Normalization.
Early Uniform Normalization
Duplicates and ambiguities are removed as close to their source as possible. This is done by normalizing them to a single representation. Because the normalization is not done by the component that carries out the identity check, normalization has to be done uniformly for all the components of the WWW.
Encoding based on UCS
An encoding that uses UCS codepoints in a reasonably simple manner. Examples: UTF-8, UTF-7 (discouraged), UTF-16, UCS-2, UCS-4. (Note: For the latter three, an encoding definition also needs to include provisions for defining and identifying the serialization of 16-bit or 31-bit values into byte sequences.)
Late Normalization
Each individual component that performs a string identity check has to take equivalence into account. This would be done by normalizing each string to a preferred representation that eliminates duplicates and ambiguities. Because, with late normalization, normalization is done locally and on the fly, there is no need to specify a webwide uniform normalization.
Legacy Encoding
An encoding not based on UCS. Examples: ISO-8859-1, EUC-KR.
String Identity Matching
Exact matching of strings, except for encoding duplicates indistinguishable to the user. See Section 3.4.
String Indexing
Indexing into a string to address a character or a sequence of characters. See Section 4.
Transcoding
The process of changing text data from one encoding to another.
UCS
Universal Character Set, the character repertoire defined in parallel by [ISO 10646] and [Unicode].
URI
Uniform Resource Identifier, see [RFC 2396].
WWW
World Wide Web, the collection of technologies built up starting with HTML, HTTP, and URIs, the corresponding software (servers, browsers,...), and/or the corresponding content.


References

[CSS2]
Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation 12-May-1998, <http://www.w3.org/TR/REC-CSS2>.
[DOM]
Vidur Apparao et al., Document Object Model (DOM) Level 1 Specification, W3C Recommendation 1 October, 1998, <http://www.w3.org/TR/REC-DOM-Level-1/>.
[I18NURI]
Internationalization: URIs and other identifiers <http://www.w3.org/International/O-URL-and-ident>.
[ISO 6937]
ISO/IEC 6937:1994, Information technology -- Coded graphic character set for text communication -- Latin alphabet.
[ISO 8859]
ISO 8859 (various parts and publication dates), Information technology -- 8-bit single-byte coded graphic character sets.
[ISO 10646]
ISO/IEC 10646-1:1993, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane.
[HTML 4.0]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation 18-Dec-1997 (revised on 24-Apr-1998), <http://www.w3.org/TR/REC-html40/>.
[MIME]
Ned Freed, Nathaniel Borenstein, Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, RFC 2045, November 1996, <http://www.ietf.org/rfc/rfc2045.txt>.
[Nicol]
Gavin Nicol, The Multilingual World Wide Web, Chapter 2: The WWW As A Multilingual Application, <http://www.mind-to-mind.com/documents/i18n/multilingual-www.html#ID-2A08F773>.
[CharReq]
Martin J. Dürst, Requirements for String Identity and Character Indexing Definitions for the WWW, <http://www.w3.org/TR/WD-charreq>.
[RFC 2070]
F. Yergeau, G. Nicol, G. Adams, M. Dürst, Internationalization of the Hypertext Markup Language, RFC 2070, January 1997, <http://www.ietf.org/rfc/rfc2070.txt>.
[RFC 2119]
S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, <http://www.ietf.org/rfc/rfc2119.txt>.
[RFC 2130]
C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin, P. Svanberg, The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996, RFC 2130, April 1997, <http://www.ietf.org/rfc/rfc2130.txt>.
[RFC 2277]
H. Alvestrand, IETF Policy on Character Sets and Languages, RFC 2277 / BCP 18, January 1998, <http://www.ietf.org/rfc/rfc2277.txt>.
[RFC 2396]
T. Berners-Lee, R. Fielding, L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax, August 1998, <http://www.ietf.org/rfc/rfc2396.txt>.
[TR #15]
Mark Davis, Unicode Normalization Forms, Draft Unicode Technical Report #15, December 1998, <http://www.unicode.org/unicode/reports/tr15/tr15-10.html>.
[Unicode 2.0]
The Unicode Consortium, The Unicode Standard, Version 2.0, Addison-Wesley, Reading, MA, 1996.
[Unicode 2.1]
Lisa Moore, Unicode Technical Report # 8, The Unicode Standard, Version 2.1, September 1998, <http://www.unicode.org/unicode/reports/tr8.html>.
[XML 1.0]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998, <http://www.w3.org/TR/REC-xml>.