WG14 Document Number: N1026
Document Date: 15 Sep 2003

ISO/IEC JTC 1/SC22
Programming languages, their environments and system software interfaces
Secretariat: U.S.A. (ANSI)

ISO/IEC JTC 1/SC22 N3649

TITLE: Summary of Voting on SC 22 N 3579 - Concurrent Registration and Approval Ballot for PDTR 19769, Specification for additional character data types to the programming language C (Type 2 TR)

DATE ASSIGNED: 2003-09-15

SOURCE: SC 22 Secretariat

BACKWARD POINTER: N/A

DOCUMENT TYPE: Summary of Voting

PROJECT NUMBER: 22.20.01

STATUS: The results of this ballot are forwarded to SC 22/WG 14 for review, production of a disposition of comments report, and preparation of the DTR text.

ACTION IDENTIFIER: ACT

DUE DATE:

DISTRIBUTION: Text

CROSS REFERENCE: N/A

DISTRIBUTION FORM: Def

Address reply to:
ISO/IEC JTC 1/SC22 Secretariat
Matt Deane
ANSI
25 West 43rd Street
New York, NY 10036
Telephone: (212) 642-4992
Fax: (212) 840-2298
Email: mdeane@ansi.org

_____ end of cover page, beginning of registration summary_________

SUMMARY OF VOTING ON

Letter Ballot Reference No: SC22 N3579
Circulated by: JTC 1/SC22
Circulation Date: 2002-05-22
Closing Date: 2002-08-22

SUBJECT: Summary of Voting on SC 22 N3579, Concurrent Registration and Approval Ballot for PDTR 19769, Specification for additional character data types to the programming language C (Type 2 TR)

----------------------------------------------------------------------

The following responses have been received on the subject of registration:

"P" Members supporting registration without comments
   11 (Canada, China, Czech Republic, Denmark, Italy, Japan, Republic of Korea, Netherlands, Norway, Russian Federation, USA)

"P" Members supporting registration with comments
   -

"P" Members not supporting registration
   -

"P" Members abstaining
   2 (Switzerland, UK)

"P" Members not voting
   12 (Austria, Belgium, Brazil, Egypt, Finland, France, Germany, Ireland, DPR of Korea, Romania, Slovenia, Ukraine)

__end of registration summary, beginning of
approval summary____

SUMMARY OF VOTING ON

Letter Ballot Reference No: SC22 N3579
Circulated by: JTC 1/SC22
Circulation Date: 2002-05-22
Closing Date: 2002-08-22

SUBJECT: Summary of Voting on SC 22 N3579, Concurrent Registration and Approval Ballot for PDTR 19769, Specification for additional character data types to the programming language C (Type 2 TR)

----------------------------------------------------------------------

The following responses have been received on the subject of approval:

"P" Members supporting approval without comment
   8 (Canada, China, Czech Republic, Denmark, Italy, Republic of Korea, Norway, Russian Federation)

"P" Members supporting approval with comments
   3 (Japan, Netherlands, USA)

"P" Members not supporting approval
   -

"P" Members abstaining
   2 (Switzerland, UK)

"P" Members not voting
   12 (Austria, Belgium, Brazil, Egypt, Finland, France, Germany, Ireland, DPR of Korea, Romania, Slovenia, Ukraine)

___________ end of summary, beginning of NB comments _____________

Japan

The name of the macro "__STDC_UTF_16" in the second paragraph of chapter 4, Encoding, should be changed to "__STDC_UTF_16__".

Netherlands

Care should be taken to ensure that the emphasis of the proposed extension is on the support of ISO/IEC 10646, rather than on the support of the Unicode standards. The 1st paragraph of the introduction should be modified to reflect this.

United States

1. The proposal should also include adding a -3 return code from mbrtowc, as proposed by Clive D.W. Feather. This permits general N-to-M mappings, not just 1-to-N and N-to-1.

___________________

1. 1 Introduction

   The sentence

      The C language has matured over the last decades, yet the character concept has remained stable.

   is troubling. Surely one would expect that as a language matures, its fundamental concepts would remain stable.

2. 1 Introduction

   The sentence

      Various code pages and multibyte libraries have been introduced in the past; however, the character data type in the C language has remained 8 bit based.

   presents several problems. First, there are today C implementations for machines with 36-bit words. The C standard requires characters for such machines to be at least 9 bits (see Sections 5.2.4.2.1 and 6.2.6.1). Perhaps the text should say they are "byte based" instead of "8 bit based." Even then, there is the problem that the concrete definition of "character" given in the C standard (see Section 3.7.1) specifies single-byte characters. I fear that the different meanings of the word "character" in the standard and this TR are liable to lead to confusion and defect reports. The TR might be able to avoid this confusion by pointing out which uses of the word "character" in the C standard retain the old meaning, and which take on the new meaning.

   The mention of code pages seems irrelevant to the point being made. All uses of code pages I have seen have been 8 bit based.

3. 2.1 Scope

   The statement that the TR "specifies two character data types" conflicts with the existing C standard. The TR should either include edits to Sections 3.7 and 6.2.5 to resolve the conflicts, or it should use new terminology to refer to these new types.

5. 3 The new typedefs

   I assume there is nothing special about the typedefs given beyond providing more convenient names for referring to the integer types uint_least16_t and uint_least32_t. If I am wrong and special properties are tied to the typedefs, are those properties propagated through further typedefs? For example, if a user provides a typedef

      typedef char16_t c16_t;

   will c16_t have all the properties of char16_t?

6. 4 Encoding

   What does the statement

      If the macro __STDC_UTF_16, the type char16_t shall have the UTF-16 encoding.

   mean? According to Section C.1 of ISO/IEC 10646-1:2000, there are 16-bit values in UTF-16 that must be paired with other 16-bit values to be valid in UTF-16. Can a scalar of type char16_t take on one of those values when the macro __STDC_UTF_16 is defined? Can a value that does not correspond to a Unicode or ISO/IEC 10646 character be assigned to a variable of type char16_t?

7. 4 Encoding

   The statement

      In the absence of the mentioned macros, an implementation may define other macro's (sic) to indicate a different encoding; ...

   implies but does not state that in the presence of those macros an implementation shall not define other macros to indicate a different encoding. Isn't such a statement void? As long as the macro chosen is a name reserved to the implementation, no strictly conforming program could tell such a macro had been provided.

8. 6 Library functions

   Use of the functions specified in the TR could be facilitated by providing a feature test macro for the presence of those functions.

9. 6 Library functions

   Does the use of char16_t and char32_t place any restriction on the integer values passed to and returned from the functions beyond the restrictions that would apply if uint_least16_t and uint_least32_t had been used instead?

10. 7 ANNEX A Unicode encoding forms: UTF-16, UTF-32

   The definitions of UTF-16 and UTF-32 provided in *The Unicode Standard,* version 3.0 are incomplete in and of themselves. Additional information available online and on the CD provided with the book is needed. The definitions provided in ISO/IEC 10646, on the other hand, are complete in and of themselves. Since the ISO/IEC 10646 standards are the normative references listed in the TR, Section 7 ANNEX A of the TR should reference the definitions given in ISO/IEC 10646.

___________________

1. Do we need to say something about uint_least16_t and uint_least32_t not being defined by ? Would it be better to define char16_t and char32_t in terms of the base types that uint_least*_t are defined in?
===================

__X__ general:

In 1, the statement that "the character data type in the C language has remained 8 bit based" is misleading at best. It would be more correct to say that the character data type *in most C implementations* has remained 8 bit based. Likewise, the statement that "wchar_t does not offer platform portability for C applications" is also misleading; it may not offer the same kind of platform portability that Unicode does, but it does offer platform portability of a different kind; that is its sole reason for existing.

===================

__X__ editorial:

When types are mentioned in the running text, they generally appear in italics as opposed to appearing in courier bold as in the C standard. Since this TR is in some sense an addendum to the C standard, it may be better to use the same style.

In the table of contents, the function names should be bold.

In 1, the initial sentence of the final paragraph would read better as:

   It is generally desirable that C applications process entire strings at once rather than processing individual characters in isolation.

In 2.2, the correct title of ISO/IEC 9899:1999 is "Programming languages -- C".

In 3, the header "uchar.h" should be referred to as "<uchar.h>" for consistency with the C standard. Also, the final reference is in the wrong font.

In 4, it is not clear whether the user is to define the macros to influence the implementation or whether the implementation is to define the macros to document its behavior. Also, there should be restrictions on the names of any other macros the implementation may define to prevent encroaching on the user's namespace. The typography makes it difficult to see that the underscores at the beginning and end of the macro names are doubled; a thin space should be added between them. In the first paragraph, the form "yyyymmL" and the example "199712L" should be set in courier bold, as should the name of the "mbstowcs" function. Also, "Analogue" should be "Analogous".
In the second paragraph, "__STDC_UTF_16__" is missing its trailing underscores and the type "char16_t" appears in the wrong font, as does "char32_t" in the third paragraph. In the fourth paragraph, "Analogically" should be "Analogously" (or, better still, "Similarly"). In the fifth paragraph, "macro's" should be "macros".

In 5.1, "analogue" should be "analogous". Also, the type "char16_t" appears in the wrong font (twice), the initial "U" and "u" in the string literal formats are in the wrong font, as are the closing quote marks (the leading quote marks may also be in the wrong font, it's hard to tell for sure). The same problem appears in 5.2 along with the initial "L" for wide string literals. And all of the examples are in the wrong font.

I am unable to make sense of the statement in 5.2:

   ... when adjacent string literals of the same format are concatenated the result is widened to the representation of the other string literal also if one of the adjacent literals is a "narrow" string.

Surely when string literals of the same format are concatenated, there is no need to widen the result, only when a "narrow" string is involved.

In 6, "the C applications" should be "C applications" and "the future enhancements" should be "future enhancements". Also, normal publishing style is to use words for small numbers like four and three rather than numerals.

In the various subclauses of 6, the parameter "s" frequently appears in the running text (and footnotes) in the wrong font. No semantics are given for the "ps" parameters. The "state" is sometimes referred to as just "state", sometimes as "shift state", and sometimes as "conversion state"; the terminology should be consistent. Also, in the "Returns" sections, it would be clearer to say "the resulting conversion state" rather than just "the conversion state". In the "Returns" sections of 6.1 and 6.3, "(size_t)(-2)" and "(size_t)(-1)" appear in the wrong font.
In the "Returns" section of 6.1 in the text for "(size_t)(-1)", the parameter "n" appears in the wrong font.

___________________

On page 4, the first usage of __STDC_UTF_16__ is missing the trailing __.

___________________

0. I did not understand until I read and reread the proposed draft TR several times that the types char16_t and char32_t are used solely to represent the encodings UTF-16 and UTF-32. My initial reading was that they represented Unicode characters or the characters in ISO/IEC 10646. With that understanding came the realization that the conversion functions provided are not conversions between wide characters and encodings of those wide characters, as are the similarly named functions in the C standard. Rather, they are conversions between encodings of characters. The nature of the extensions presented in the TR should be clearly stated in the introduction.

4. 2.2 References

   Since all of the references given are dated, the reference to undated references should be elided.

_________________

The first paragraph of the TR's section 1, Introduction, makes some misstatements about the previous state of the "character concept" and platform portability in the C language. It should instead make the following argument:

   (1) The C standard does not *require* the "char" type to have an 8-bit width; however, because this type is identified with the minimum addressable storage unit, most implementations have chosen to use 8 bits for "char", making it unsuitable for encoding large character sets.

   (2) The type intended by the C standard for encoding large character sets is "wchar_t", but this is not *required* to have a width more than 8 bits, nor to use Unicode encoding.

   (3) wchar_t does provide platform portability where details of character encoding are not essential to the function of the program.

   (4) Because Unicode needs multiple encoding forms, the existing single form of "wide character" specified in the C standard is insufficient.