JTC1/SC22/WG14 N1026

WG14 Document Number: N1026
Document Date:        15 Sep 2003

ISO/IEC JTC 1/SC22 
Programming languages, their environments and system software interfaces 
Secretariat:  U.S.A.  (ANSI) 
  
ISO/IEC JTC 1/SC22 N3649
  
TITLE: 
Summary of Voting on SC 22 N 3579 - Concurrent Registration and Approval
Ballot for PDTR 19769, Specification for additional character data types to
the programming language C (Type 2 TR)
DATE ASSIGNED: 
2003-09-15
  
SOURCE: 
SC 22 Secretariat 
BACKWARD POINTER: 
N/A 
  
DOCUMENT TYPE: 
Summary of Voting 
PROJECT NUMBER: 
22.20.01 
  
STATUS: 
The results of this ballot are forwarded to SC 22/WG 14 for review,
production of a disposition of comments report, and preparation of the DTR
text.
  
ACTION IDENTIFIER: 
ACT
  
DUE DATE: 
  
DISTRIBUTION: 
Text
CROSS REFERENCE: 
N/A 
  
DISTRIBUTION FORM: 
Def
  
Address reply to: 
ISO/IEC JTC 1/SC22 Secretariat 
Matt Deane 
ANSI 
25 West 43rd Street 
New York, NY  10036 
Telephone:  (212) 642-4992 
Fax:             (212) 840-2298 
Email:  mdeane@ansi.org 

_____ end of cover page, beginning of registration summary_________ 

SUMMARY OF VOTING ON 
Letter Ballot Reference No: SC22 N3579
Circulated by: JTC 1/SC22
Circulation Date: 2002-05-22
Closing Date: 2002-08-22

SUBJECT: Summary of Voting on SC 22 N3579, Concurrent Registration and
Approval Ballot for PDTR 19769, Specification for additional character data
types to the programming language C (Type 2 TR)
---------------------------------------------------------------------- 
The following responses have been received on the subject of registration: 

"P" Members supporting registration without comments
11 (Canada, China, Czech Republic, Denmark, Italy, Japan, Republic of Korea,
Netherlands, Norway, Russian Federation, USA) 
"P" Members supporting registration with comments 
- 
"P" Members not supporting registration
- 
"P" Members abstaining 
2 (Switzerland, UK) 
"P" Members not voting 
12 (Austria, Belgium, Brazil, Egypt, Finland, France, Germany, Ireland, DPR
of Korea, Romania, Slovenia, Ukraine) 

__end of registration summary, beginning of approval summary____


SUMMARY OF VOTING ON 
Letter Ballot Reference No: SC22 N3579
Circulated by: JTC 1/SC22
Circulation Date: 2002-05-22
Closing Date: 2002-08-22

SUBJECT: Summary of Voting on SC 22 N3579, Concurrent Registration and
Approval Ballot for PDTR 19769, Specification for additional character data
types to the programming language C (Type 2 TR)
----------------------------------------------------------------------

The following responses have been received on the subject of approval:


"P" Members supporting approval without comment                   

8 (Canada, China, Czech Republic, Denmark, Italy, Republic of Korea, Norway,
Russian Federation)

"P" Members supporting approval with comments              

3 (Japan, Netherlands, USA)        

"P" Members not supporting approval       

-

"P" Members abstaining                    

2 (Switzerland, UK)

"P" Members not voting                    

12 (Austria, Belgium, Brazil, Egypt, Finland, France, Germany, Ireland, DPR
of Korea, Romania, Slovenia, Ukraine)


___________ end of summary, beginning of NB comments _____________


Japan

The name of Macro "__STDC_UTF_16" at the second paragraph of chapter 4
Encoding should be changed to "__STDC_UTF_16__".
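
As a rough sketch of how a program might consult the encoding macros once
they carry the trailing underscores requested above (the spelling later
adopted by C11), the fragment below reports which encodings an
implementation advertises.  The helper names c16_is_utf16 and c32_is_utf32
are hypothetical and do not appear in the PDTR.

```c
#include <uchar.h>

/* Hypothetical helper: 1 if the implementation advertises UTF-16
   for char16_t through __STDC_UTF_16__, 0 otherwise. */
int c16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 if the implementation advertises UTF-32
   for char32_t through __STDC_UTF_32__, 0 otherwise. */
int c32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}
```

A misspelled name such as __STDC_UTF_16 would silently take the #else
branch, which is one practical reason to settle the spelling.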


Netherlands

Care should be taken to ensure that the emphasis of the proposed
extension is on the support of ISO/IEC 10646, rather than on the
support of the Unicode standards.  The 1st paragraph of the
introduction should be modified to reflect this.


United States

1.
The proposal should also include adding a -3 return code from mbrtowc,
as proposed by Clive D.W. Feather.  This permits general N-to-M
mappings, not just 1-to-N and N-to-1.
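
For illustration only, the loop below sketches how a caller might handle
such a return code.  It assumes the mbrtoc16 function from <uchar.h> as
later standardized in C11, where (size_t)(-3) means that a further
char16_t was produced without consuming any input; whether this matches
the exact proposal referred to above is an assumption.  The helper name
mb_to_c16 is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: converts the null-terminated multibyte string s
   into at most cap char16_t units stored in out.  Returns the number of
   units stored, or (size_t)-1 on an encoding error, an incomplete
   sequence, or insufficient room in out. */
size_t mb_to_c16(const char *s, char16_t *out, size_t cap)
{
    mbstate_t st;
    size_t n = strlen(s) + 1;   /* include the terminating null */
    size_t used = 0;

    memset(&st, 0, sizeof st);  /* initial conversion state */
    for (;;) {
        char16_t c;
        size_t r = mbrtoc16(&c, s, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;  /* invalid or incomplete sequence */
        if (r == 0)
            return used;        /* reached the terminating null */
        if (used == cap)
            return (size_t)-1;  /* no room left in out */
        out[used++] = c;
        if (r != (size_t)-3) {  /* -3: unit produced, no input consumed */
            s += r;
            n -= r;
        }
    }
}
```

With a return value of this kind, one multibyte character can yield two
output units (for example a UTF-16 surrogate pair) across two calls.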

___________________

1.  1 Introduction
The sentence
The C language has matured over the last decades,
yet the character concept has remained stable.

is troubling.  Surely one would expect that as a
language matures, its fundamental concepts would
remain stable.

2.  1 Introduction
The sentence
Various code pages and multibyte libraries have
been introduced in the past; however, the
character data type in the C language has
remained 8 bit based.

presents several problems.  First, there are today
C implementations for machines with 36-bit words.
The C standard requires characters for such machines
to be at least 9 bits (see Sections 5.2.4.2.1 and
6.2.6.1).  Perhaps the text should say they are
"byte based" instead of "8 bit based."  Even then,
there is the problem that the concrete definition of
"character" given in the C standard (see Section 3.7.1)
specifies single-byte characters.  I fear that the
different meanings of the word "character" in the
standard and this TR are liable to lead to confusion
and defect reports.  The TR might be able to avoid
this confusion by pointing out which uses of the word
"character" in the C standard retain the old meaning,
and which take on the new meaning.
    
The mention of code pages seems irrelevant to the point
being made.  All uses of code pages I have seen have
been 8 bit based.
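
The point about byte width can be checked on any given implementation
with a small sketch like the following.  It relies only on CHAR_BIT from
<limits.h>, which the C standard requires to be at least 8; the helper
names char_bits and wchar_bits are hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: width of a byte (and thus of char) in bits.
   The C standard guarantees only that this is at least 8. */
int char_bits(void)
{
    return CHAR_BIT;
}

/* Hypothetical helper: storage width of wchar_t in bits, padding
   bits included.  The standard does not require more than CHAR_BIT. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}
```

On a machine with 36-bit words, char_bits() could plausibly return 9,
which is why "byte based" is the safer wording.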

3.  2.1 Scope
The statement that the TR "specifies two character data
types" conflicts with the existing C standard.  The TR
should either include edits to Sections 3.7 and 6.2.5 to
resolve the conflicts, or it should use new terminology
to refer to these new types.

5.  3 The new typedefs
I assume there is nothing special about the typedefs given
beyond providing more convenient names for referring to the
integer types uint_least16_t and uint_least32_t.  If I am
wrong and special properties are tied to the typedefs, are
those properties propagated through further typedefs?  For
example, if a user provides a typedef

typedef char16_t c16_t;

will c16_t have all the properties of char16_t?

6.  4 Encoding
What does the statement
If the macro __STDC_UTF_16, the type char16_t shall
have the UTF-16 encoding.

mean?  According to Section C.1 of ISO/IEC 10646-1:2000,
there are 16-bit values in UTF-16 that must be paired with
other 16-bit values to be valid in UTF-16.  Can a scalar
of type char16_t take on one of those values when the
macro __STDC_UTF_16 is defined?  Can a value that does not
correspond to a Unicode or ISO/IEC 10646 character be
assigned to a variable of type char16_t?
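
To make the pairing concrete, here is a minimal sketch of the UTF-16
encoding arithmetic, assuming char16_t and char32_t are plain unsigned
integer typedefs from <uchar.h>.  The helper names is_surrogate and
utf16_encode are hypothetical.  Nothing in the type itself prevents a
char16_t object from holding an unpaired value in the range D800 to
DFFF, which is the situation the questions above ask about.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: nonzero if u is a UTF-16 surrogate code unit,
   i.e. a value that is only valid as half of a pair. */
int is_surrogate(char16_t u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}

/* Hypothetical helper: encodes code point cp as UTF-16 into out.
   Returns the number of units written (1 or 2), or 0 if cp is not
   a valid scalar value (a surrogate or above 10FFFF). */
size_t utf16_encode(char32_t cp, char16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (char16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (char16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (char16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

For example, utf16_encode(0x10000, out) writes the pair D800 DC00.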

7.  4 Encoding
The statement
In the absence of the mentioned macros, an implementation
may define other macro's (sic) to indicate a different
encoding; ...

implies but does not state that in the presence of those macros
an implementation shall not define other macros to indicate a
different encoding.  Isn't such a statement void?  As long as
the macro chosen is a name reserved to the implementation, no
strictly conforming program could tell such a macro had been
provided.

8.  6 Library functions
Use of the functions specified in the TR could be facilitated
by providing a feature test macro for the presence of those
functions.

9.  6 Library functions
Does the use of char16_t and char32_t place any restriction on
the integer values passed to and returned from the functions
beyond the restrictions that would apply if uint_least16_t and
uint_least32_t had been used instead?
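
One way to probe this question, and the related typedef question in
comment 5, is a type-identity check.  The sketch below assumes a C11
compiler (for _Generic) and the C11 definition of char16_t as a typedef
for uint_least16_t; it reflects the later standard, not necessarily the
PDTR text under ballot.  The function names are hypothetical.

```c
#include <stdint.h>
#include <uchar.h>

typedef char16_t c16_t;  /* the user typedef from comment 5 */

/* Returns 1 if char16_t is the same type as uint_least16_t,
   0 if the implementation makes it some other type. */
int c16_is_uint_least16(void)
{
    return _Generic((char16_t)0, uint_least16_t: 1, default: 0);
}

/* Returns 1 if the further typedef c16_t is the same type as
   char16_t; a typedef names an existing type, so this holds. */
int c16_typedef_is_same(void)
{
    return _Generic((c16_t)0, char16_t: 1, default: 0);
}
```

If both functions return 1, the names carry no properties beyond those
of the underlying integer type, at least as far as the type system can
tell.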

10. 7 ANNEX A Unicode encoding forms: UTF-16, UTF-32
The definitions of UTF-16 and UTF-32 provided in *The
Unicode Standard,* version 3.0 are incomplete in and of
themselves.  Additional information available online and on the
CD provided with the book are needed.  The definitions provided
in ISO/IEC 10646, on the other hand, are complete in and of
themselves.  Since ISO/IEC 10646 are the normative references
listed in the TR, Section 7 ANNEX A of the TR should reference
the definitions given in ISO/IEC 10646.

___________________

1. Do we need to say something about uint_least16_t and uint_least32_t not
being defined by <uchar.h>?  Would it be better to define char16_t and
char32_t in terms of the base types that uint_least*_t are defined in?

===================

__X__  general:
In 1, the statement that "the character data type in the C language has
remained 8 bit based" is misleading at best.  It would be more correct
to say that the character data type *in most C implementations* has
remained 8 bit based.  Likewise, the statement that "wchar_t does not
offer platform portability for C applications" is also misleading; it
may not offer the same kind of platform portability that Unicode does,
but it does offer platform portability of a different kind; that is
its sole reason for existing.

===================

__X__  editorial:
When types are mentioned in the running text, they generally appear in
italics as opposed to appearing in courier bold as in the C standard. 
Since this TR is in some sense an addendum to the C standard, it may be
better to use the same style.

In the table of contents, the function names should be bold.

In 1, the initial sentence of the final paragraph would read better as:
It is generally desirable that C applications process entire
strings at once rather than processing individual characters in
isolation.

In 2.2, the correct title of ISO/IEC 9899:1999 is "Programming languages
- C".

In 3, the header "uchar.h" should be referred to as "<uchar.h>" for
consistency with the C standard.  Also, the final reference is in the
wrong font.

In 4, it is not clear whether the user is to define the macros to
influence the implementation or whether the implementation is to define
the macros to document its behavior.  Also, there should be restrictions
on the names of any other macros the implementation may define to
prevent encroaching on the user's namespace.  The typography makes it
difficult to see that the underscores at the beginning and end of the
macro names are doubled; a thin space should be added between them. 

In the first paragraph, the form "yyyymmL" and the example "199712L"
should be set in courier bold, as should the name of the "mbstowcs"
function.  Also, "Analogue" should be "Analogous".

In the second paragraph, "__STDC_UTF_16__" is missing its trailing
underscores and the type "char16_t" appears in the wrong font, as does
"char32_t" in the third paragraph.

In the fourth paragraph, "Analogically" should be "Analogously" (or,
better still, "Similarly").

In the fifth paragraph, "macro's" should be "macros".

In 5.1, "analogue" should be "analogous".  Also, the type "char16_t"
appears in the wrong font (twice), the initial "U" and "u" in the
string literal formats are in the wrong font, as are the closing quote
marks (the leading quote marks may also be in the wrong font, it's hard
to tell for sure).  The same problem appears in 5.2 along with the
initial "L" for wide string literals.  And all of the examples are in
the wrong font.

I am unable to make sense of the statement in 5.2:
... when adjacent string literals of the same format are
concatenated the result is widened to the representation of the
other string literal also if one of the adjacent literals is a
"narrow" string.

Surely when string literals of the same format are concatenated, there
is no need to widen the result, only when a "narrow" string is involved.
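
As a point of comparison, the sketch below shows the behavior that C11
later specified for this case: when a prefixed literal is adjacent to an
unprefixed ("narrow") one, the concatenated result takes the prefixed
form.  This assumes a C11 compiler and may not match the wording the
PDTR intended; the names greeting and greeting_units are hypothetical.

```c
#include <stddef.h>
#include <uchar.h>

/* A u-prefixed literal followed by a narrow literal: the whole
   concatenation is treated as a char16_t string literal. */
static const char16_t greeting[] = u"ab" "cd";

/* Number of char16_t elements, terminating null included. */
size_t greeting_units(void)
{
    return sizeof greeting / sizeof greeting[0];
}
```

Here greeting holds the four units a, b, c, d followed by a terminating
null, so no separate widening step is visible to the program.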

In 6, "the C applications" should be "C applications" and "the future
enhancements" should be "future enhancements".  Also, normal publishing
style is to use words for small numbers like four and three rather than
numerals.

In the various subclauses of 6, the parameter "s" frequently appears in
the running text (and footnotes) in the wrong font.  No semantics are
given for the "ps" parameters.  The "state" is sometimes referred to as
just "state", sometimes as "shift state", and sometimes as "conversion
state"; the terminology should be consistent.  Also, in the "Returns"
sections, it would be clearer to say "the resulting conversion state"
rather than just "the conversion state".

In the "Returns" sections of 6.1 and 6.3, "(size_t)(-2)" and
"(size_t)(-1)" appear in the wrong font.

In the "Returns" section of 6.1 in the text for "(size_t)(-1)", the
parameter "n" appears in the wrong font.

___________________

On page 4, the first usage of __STDC_UTF_16__ is missing the trailing
__.

___________________

0.  I did not understand until I read and reread the
proposed draft TR several times that the types
char16_t and char32_t are used solely to represent
the encodings UTF-16 and UTF-32.  My initial reading
was that they represented Unicode characters or
the characters in ISO/IEC 10646.  With that
understanding came the realization that the
conversion functions provided are not conversions
between wide characters and encodings of those
wide characters, as are the similarly named
functions in the C standard.  Rather, they are
conversions between encodings of characters.

The nature of the extensions presented in the TR
should be clearly stated in the introduction.

4.  2.2 References
Since all of the references given are dated, the reference
to undated references should be elided.
    
_________________

The first paragraph of the TR's section 1, Introduction,
makes some misstatements about the previous state of the
"character concept" and platform portability in the C
language.  It should instead make the following
argument:
(1) The C standard does not *require* the "char" type
to have an 8-bit width; however, because this type is
identified with the minimum addressable storage unit,
most implementations have chosen to use 8 bits for
"char", making it unsuitable for encoding large
character sets.
(2) The type intended by the C standard for encoding
large character sets is "wchar_t", but this is not
*required* to have a width more than 8 bits, nor to
use Unicode encoding.
(3) wchar_t does provide platform portability where
details of character encoding are not essential to
the function of the program.
(4) Because Unicode needs multiple encoding forms,
the existing single form of "wide character"
specified in the C standard is insufficient.