Accredited Standards Committee X3       Doc No: X3J16/94-0181 WG21/N0568
Information Processing Systems          Date:   Sept 27, 1994
                                        Page 1 of 13
Operating under the procedures of       Project: Programming Language C++
American National Standards Institute   Ref Doc:
                                        Reply to: Josee Lajoie
                                                  (josee@vnet.ibm.com)

+------------------+
| C++ Memory model |
+------------------+

1) C's Memory Model
===================

This section uses the information provided in the ISO C standard as well
as the information provided by Tom MacDonald in core message 4156
describing the content of C's Defect Report 69 and the proposed
resolution for this defect presented by Tom Plum in core message 4229.

1.1 unsigned char is a "byte"
-----------------------------

The proposed resolution for defect report #69 presented in core message
4229 indicates that the type 'unsigned char' is the C type that
represents a 'byte' of memory:

    For any object type T, the underlying bytes of the object can be
    copied into an array of unsigned char:

        #define N sizeof(T)
        union aligned_buf {
            T t;
            unsigned char s[N];
        } buf;
        T object;
        memcpy(buf.s, (const void *)&object, N);

Even though core message 4229 doesn't explicitly say so, I will also
assume that:

    After this memcpy operation, 't' has the same value as 'object'.
    The memcpy operation is guaranteed to be well-defined, even if
    'object' does not hold a valid value of type T.

1.2 terminology
---------------

Core message 4229 defines some terminology:

        #define N sizeof(T)
        union aligned_buf {
            T t;
            unsigned char s[N];
        } buf;

    The _object representation_ of an object consists of the resulting
    sequence of N unsigned char objects in the buffer.

The object representation is the storage occupied by the object of
type T, described as an array of unsigned char.

    The _value representation_ of an object is the sequence of bits in
    the array of unsigned char that holds the value of type T.
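
The copy described in section 1.1 can be sketched in C++ as follows.
This is only an illustration, not text from core message 4229: the
helper names copy_object_representation and from_object_representation
are hypothetical, and the sketch assumes T is a plain C-like type for
which a byte-wise copy is meaningful.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: copies the object representation of `object`
// (its sizeof(T) underlying bytes) into an array of unsigned char.
template <typename T>
void copy_object_representation(const T& object,
                                unsigned char (&bytes)[sizeof(T)]) {
    std::memcpy(bytes, &object, sizeof(T));
}

// Hypothetical helper: rebuilds a T from bytes previously copied out
// of a valid T. Assumes T is a plain C-like type.
template <typename T>
T from_object_representation(const unsigned char (&bytes)[sizeof(T)]) {
    T t;
    std::memcpy(&t, bytes, sizeof(T));
    return t;
}
```

Copying the bytes out and back in is expected to reproduce the original
value, which is the assumption stated in section 1.1 for 't' and
'object'.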
    The bits of the value representation determine a _value_, which is
    one discrete element of an implementation-defined set of values.

Example:

    Here is an example. Consider a (possibly hypothetical)
    implementation whose int value representation provides one sign bit
    and 40 integer bits.

        +-+---------------------+
        | |                     |
        +-+---------------------+
         1          40

    Its object representation provides one sign bit, a hole containing
    seven non-participating bits, and 40 integer bits:

        +-+------+---------------------+
        | |      |                     |
        +-+------+---------------------+
         1   7             40

1.3 representation of signed and unsigned integer types
-------------------------------------------------------

The ISO C standard already specifies many requirements regarding the
representation of signed and unsigned integer types.

From ISO C Standard, sub-clause 6.1.2.5:

    For each of the signed integer types, there is a corresponding (but
    different) "unsigned integer type" (designated with the keyword
    unsigned) that uses the same amount of storage (including sign
    information) and has the same alignment requirements.

[Note: using the terminology defined in section 1.2 above, this
paragraph can be interpreted to say that the object representation of a
signed integer type must be the same as the object representation of
its corresponding unsigned integer type.]

Sub-clause 6.1.2.5 continues:

    The range of nonnegative values of a signed integer type is a
    subrange of the corresponding unsigned integer type, and the
    representation of the same value in each type is the same.(16)

    (16) The same representation and alignment requirements are meant
         to imply interchangeability as arguments to functions, return
         values from functions, and members of unions.
[Note: using the terminology defined in section 1.2 above, this
paragraph can be interpreted to say that the value representation of a
nonnegative value of a signed integer type must be the same as the
value representation of the same value of the corresponding unsigned
integer type.]

Sub-clause 6.1.2.5 continues:

    The [value] representations of integral types shall define values
    by use of a pure binary numeration system (18).

    (18) A positional representation for integers that uses the binary
         digits 0 and 1, in which the values represented by successive
         bits are additive, begin with 1, and are multiplied by
         successive integral powers of 2, except perhaps the bit with
         the highest position.

With regard to the value representation of integral types, core message
4156 also indicates that:

    The C standard Committee intended to permit 1's complement, 2's
    complement and signed magnitude implementations.

1.4 object representation vs value representation of integral types
-------------------------------------------------------------------

Core message 4229 therefore concludes:

    For character types, all bits of the object representation
    participate in the value representation. This requirement does not
    hold for other types.

    For the type unsigned char, all possible bit patterns of the value
    representation represent numbers. If all values of type char are
    nonnegative, then this is also true of type char. This requirement
    does not hold for other types.

1.5 value representation of scalar types
----------------------------------------

Core message 4229 indicates that:

    The value representation of floating-point and pointer types is
    implementation-defined.

1.6 Examples of implementations
-------------------------------

I took these examples from core-4156.
However, I believe some of the answers provided in core-4156 need to be
changed in the light of the proposed resolution for defect 69 provided
in core-4229 by Tom Plum. After further discussions on this topic with
Tom Plum and Bill Plauger, here are the answers I believe are accurate
in the light of the definitions above. The changes from core-4156 are
marked with '|' in the left margin.

Q: In particular, are the following five implementations allowed?

    h) Unsigned values are pure binary. Signed values are represented
       using ones complement (in other words, positive and negative
       values with the same absolute value differ in all bits, and zero
       has two representations). Positive numbers have a sign of 0, and
       negative numbers a sign of 1. In both cases, all bits are
       significant.

    h) Yes, provided there is no other violation of the Standard.

    i) Unsigned values are pure binary. Signed values are represented
       using sign-and-magnitude with a pure binary magnitude (note that
       the top bit is not "additive"). Positive numbers have a sign bit
       of 0, and negative numbers a sign bit of 1. In both cases, all
       bits are significant.

    i) Yes, provided there is no other violation of the Standard.

    j) Unsigned values are pure binary, with all bits significant.
       Signed values with an MSB (sign bit) of 0 are positive, and the
       remainder of the bits are evaluated in pure binary. Signed
       values with an MSB of 1 are negative, and the remainder of the
       bits are evaluated in BCD. If ints are 20 bits, then INT_MAX is
       524,287 and INT_MIN is -79,999.

    j) No, it is not a pure binary system.

    k) Signed values are twos-complement using all bits. Unsigned
       values are pure binary, but ignoring the MSB (so each number has
       two representations). In this implementation,
       SCHAR_MAX==UCHAR_MAX, SHRT_MAX==USHRT_MAX, INT_MAX==UINT_MAX,
       and LONG_MAX==ULONG_MAX.
|   k) No,
|      contradicts the resolution listed in section 1.4 above:
|      that is, for character types, _all bits_ of the object
|      representation must contribute to the value representation. In
|      particular, the condition SCHAR_MAX==UCHAR_MAX doesn't respect
|      this resolution.

    l) Signed values are twos-complement. Unsigned values are pure
       binary. In both cases, the top three bits of the value are
       ignored (and each number has eight representations). For signed
       values, the sign bit is the fourth from the top.

|   l) No,
|      contradicts the resolution listed in section 1.4 above:
|      that is, for character types, _all bits_ of the object
|      representation must contribute to the value representation.

1.7 Aliasing
------------

1.7.1 Reinterpret cast

What should the behavior of the following example be?

    extern int *pi;
    extern unsigned int *pui;
    *pi = 1;
    pui = (unsigned int *) pi;
    *pui == 1; //1

Does ISO C guarantee that line //1 always yields true?
I believe it does. C indicates that:

Sub-clause 6.1.2.5:
  o ints and unsigned ints have the same object representation.
  o for the range of nonnegative values that can be represented by both
    a signed int and an unsigned int, the value representation of the
    value as a signed int must be the same as the value representation
    of the value as an unsigned int.

Sub-clause 6.3 (expressions):
  o a stored value of type signed int can be accessed by an lvalue of
    type unsigned int.

Sub-clause 6.3.4 (cast operator):
  o the resulting pointer may not be valid if it is improperly aligned
    for the type pointed to [which is not the case we have here].

1.7.2 Unions

What should the behavior of the following program be?

    union X {
        int x;
        unsigned int y;
        unsigned char buffer[sizeof(int)];
    } u;
    u.x = 1;
    u.y == 1; //1

Does ISO C guarantee that line //1 always yields true?
No, it doesn't.
The ISO C Standard, Section 6.1.2.5, indicates that:

    "The value of at most one of the members can be stored in a union
    object at one time."

The proposed resolution for defect #69 [ listed in section 1.1 above ]
also clearly indicates that:

    For any object of type T, the underlying bytes of the object can
    _be copied_ into an array of unsigned char.

As Tom Plum emphasizes in core-4642:

    It was for very deliberate reasons that WG14 defined "object
    representation" in terms of an array of unsigned char which was
    _copied_ via memcpy, not aliased over the same storage.

    The ISO C rule quoted above grants license for super-checking
    environments to diagnose programs which fetch out of a different
    union member; it's an undefined behavior, strictly speaking.

    The "same representation" rules do imply that, if your
    implementation isn't so pedantic as to diagnose this
    union-overlaying, then certain unsurprising behaviors must result.

2) What should C++ say?
=======================

2.1 As close to C as possible...
--------------------------------

I believe C++ has to allow what ISO C currently allows.
And I believe the WP is fairly close.

Sub-clause 3.7.1 [_basic.fundamental_] indicates that:
  o unsigned types occupy the same storage and have the same alignment
    requirements as their corresponding signed types.

Sub-clause 9.2 [ _class.mem_ ] indicates that:
  o The range of nonnegative values of a signed integral type is a
    subrange of the corresponding unsigned integral type and the
    representation of the same value in each type is the same.
  o A program can access the stored value of an object only through an
    lvalue of one of the following types:
    [ ... ]
    . a type that is the signed or unsigned type corresponding to the
      declared type of the object,
    ...

Sub-clause 9.6 [ _class.union_ ] indicates that:
    At most one of the member objects can be stored in a union at any
    time.
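
To illustrate the distinction drawn in section 1.7.2 between copying and
overlaying, the following sketch obtains the unsigned int that shares
the bytes of an int by copying them with memcpy, rather than by reading
a different union member. The helper name as_unsigned_by_copy is
hypothetical, and the sketch relies on int and unsigned int occupying
the same amount of storage, as the rules quoted above require.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: returns the unsigned int whose object
// representation equals that of `x`. The bytes are copied, not
// aliased through a union member.
unsigned int as_unsigned_by_copy(int x) {
    unsigned int u;
    static_assert(sizeof u == sizeof x,
                  "int and unsigned int use the same amount of storage");
    std::memcpy(&u, &x, sizeof u);
    return u;
}
```

For a nonnegative argument such as 1, the "same representation" rule
means the result should compare equal to the argument; no such claim is
made here for negative values.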

Proposal
--------

Incorporate in section 3.7 [_basic.types_] and its sub-clauses the
resolutions described in sections 1.1 and 1.4:

  o For any object type T, the underlying bytes of the object can be
    copied into an array of unsigned char.
    The memcpy operation is guaranteed to be well-defined, even if the
    object does not hold a valid value of type T.

  o For character types, all bits of the object representation
    participate in the value representation. This requirement does not
    hold for other types.

  o For the type unsigned char, all possible bit patterns of the value
    representation represent numbers. If all values of type char are
    nonnegative, then this is also true of type char. This requirement
    does not hold for other types.

See Appendix A for a complete description of the proposed WP changes.

2.2 Can any character type represent raw storage?
-------------------------------------------------

This is the thorniest issue in this paper.

Should C++ allow more than what C allows and say that any character
type can be used to manipulate raw storage?

For example, should C++ allow the following:

    For any object type T, the underlying bytes of the object can be
    copied into an array of char.

Many C and C++ programmers assume that this is true.
Many C++ libraries assume that this is true.
However, the C standard does not require implementations to support it.

From Tom Plum in a private email:

    The problem with the use of char in C libraries is that "value
    collapse" can (theoretically) happen during assignment. E.g. if a
    ones-complement system distinguishes +0 and -0, where 0xFF is -0
    and 0x00 is +0, AND if -0 is converted to +0 before the assignment,
    and a char receives 0xFF

        char c = 0xFF;

    you could find 0x00 in c afterwards. Most of WG14 thought that we
    never ruled out "value collapse" so it remains a problem in C.

So what should C++ do?
From Tom Plum in a private email:

    In my opinion, it would be a good idea for C++ to specify that
    value collapse cannot happen during assignment; the bit patterns
    must be copied. But that is not a total solution, because how can a
    null termination byte (0x00) be distinguished from a 0xFF byte, in
    a ones-complement system? When the program contains

        while ( c == 0)

    and c has the representation 0xFF, how could c==0 produce anything
    other than "true"? So I'm still not sure what we can do, unless we
    prohibit ones-complement for char type. There is still a problem
    here.

Proposal
--------

Ones-complement arithmetic for character types is prohibited in C++.

Any character type can be used to manipulate raw storage; that is, for
any character type, all possible bit patterns of the value
representation represent numbers.

For any object of type T, the underlying bytes of the object can be
copied into an array of C where C is any one of the character types:

    #define N sizeof(T)
    union aligned_buf {
        T t;
        C s[N];
    } buf;
    T object;
    memcpy(buf.s, (const void *)&object, N);

After this memcpy operation, 't' has the same value as 'object'.
The memcpy operation is guaranteed to be well-defined, even if 'object'
does not hold a valid value of type T.

2.3 Can any integral type manipulate raw storage?
-------------------------------------------------

The C standard does not require implementations to support this.
I do not believe this is necessary and believe this would restrict
implementations too much.
I therefore do not propose that C++ impose the character type
restrictions on all integral types.

3. Uninitialized Objects
========================

With the exception of objects of type unsigned char (proposal 2.1
above), and possibly of objects of type char and signed char (proposal
2.2 above), objects that are not initialized may contain invalid values
for their types.

Proposal
--------

Add to section 8.5 Initializers [ _dcl.init_ ]:

    An uninitialized object has unspecified value and referring to an
    object with an unspecified value results in undefined behavior.

What does this mean for the copy constructor and assignment operator?
The C++ WP specifies that [12.8 _class.copy_ ]:

    "If not declared by the programmer, they [ the copy constructor and
    assignment operator ] will be automatically defined (synthesized)
    as memberwise initialization and memberwise assignment of the base
    classes and non-static data members of the class, respectively."

Since the synthesized copy constructor and assignment operator are
defined to be memberwise initialization and memberwise assignment, if
some of the members are uninitialized, the synthesized copy constructor
and assignment operator will refer to uninitialized members and
therefore will have undefined behavior.

Example 1:

    struct S {
        int i;
        float f;
    } s1, s2;
    ...
    s1 = s2; //1

In this example, assuming s1 and s2 have automatic storage duration,
s2.i and s2.f are uninitialized. The assignment on line //1 therefore
has undefined behavior. This behavior is the same as the behavior to be
expected from a C program.

Example 2:

    class complex {
        float f, g;
    public:
        complex() { }
    } s1, s2;
    s1 = s2; //1

Since complex's constructor leaves f and g uninitialized, the
assignment on line //1 has undefined behavior.

Appendix A - Suggested WP changes
=================================

  3.7  Types                                               [basic.types]

  Add the following text after paragraph 1:

+ 2 For any object type T, the underlying bytes of the object can be
+   copied (using the memcpy library function [ref]) into an array of
+   character type. The copy operation is guaranteed to be
+   well-defined, even if the object does not hold a valid value of type
+   T.

+ 3 The object representation is the amount of storage taken
+   up by an object of type T, amount of storage which is described as
+   an array of character type.

+ 4 The value representation of an object is the sequence of bits in the
+   array of character type that holds the value of type T.

+ 5 The bits of the value representation determine a value, which is
+   one discrete element of an implementation-defined set of values.

  3.7.1  Fundamental types                           [basic.fundamental]

  1 There are several fundamental types. The standard header <limits.h>
    specifies the largest and smallest values of each for an
    implementation.

  2 Objects declared as characters (char) are large enough to store any
    member of the implementation's basic character set. If a character
    from this set is stored in a character variable, its value is
    equivalent to the integer code of that character.
-   It is implementation-specified whether a char object can take on
-   negative values.
    Characters may be explicitly declared unsigned or signed. Plain
    char, signed char, and unsigned char are three distinct types. A
    char, a signed char, and an unsigned char occupy the same amount of
    storage
+   (including sign information) and have the same alignment
+   requirements; that is, they have the same object representation.
    In any particular implementation, a plain char object can take on
    either the same values as a signed char or an unsigned char; which
    one is implementation-defined.
+   It is implementation-specified whether a char object can take on
+   negative values. For character types, all bits of the object
+   representation participate in the value representation and all
+   possible bit patterns of the value representation represent numbers
+   (these requirements do not hold for other types).

  3 An enumeration comprises a set of named integer constant values.
    Each distinct enumeration constitutes a different enumerated type.
    Each constant has the type of its enumeration.

  4 There are four signed integer types: signed char, short int, int,
    and long int.
    In this list, each type provides at least as much storage as those
    preceding it in the list, but the implementation may otherwise make
    any of them equal in storage size. Plain ints have the natural size
    suggested by the machine architecture; the other signed integer
    types are provided to meet special needs.

  5 For each of the signed integer types, there exists a corresponding
    (but different) unsigned integer type: unsigned char, unsigned
    short int, unsigned int, and unsigned long int, each of which
    occupies the same amount of storage and has the same alignment
    requirements (1.5) as the corresponding signed integer type;
+   that is, each signed integer type has the same object
+   representation as its corresponding unsigned integer type.
    (7) An alignment requirement is an implementation-dependent
    restriction on the value of a pointer to an object of a given type
    (5.4, 1.5).
    _________FootNote_________
    7) See 7.1.5.2 regarding the correspondence between types and the
       sequences of type-specifiers that designate them.
+   The range of nonnegative values of a signed integral type is a
+   subrange of the corresponding unsigned integral type, and the
+   value representation of the same value in each type is the same.

  6 Unsigned integers, declared unsigned, obey the laws of arithmetic
    modulo 2^n where n is the number of bits in the representation of
    that particular size of integer. This implies that unsigned
    arithmetic does not overflow.

  7 Type wchar_t is a distinct type whose values can represent distinct
    codes for all members of the largest extended character set
    specified among the supported locales (17.5.9.1). Type wchar_t has
    the same size, signedness, and alignment requirements (1.5) as one
    of the other integral types, called its underlying type.

  8 Values of type bool can be either true or false. (8) There are no
    signed, unsigned, short, or long bool types or values.
    As described below, bool values behave as integral types. Thus, for
    example, they participate in integral promotions (4.1, 5.2.3).
    Although values of type bool generally behave as signed integers,
    for example by promoting (4.1) to int instead of unsigned int, a
    bool value can successfully be stored in a bit-field of any
    (nonzero) size.
    _________FootNote_________
    8) Using a bool value in ways described by this International
       Standard as ``undefined,'' such as by examining the value of an
       uninitialized automatic variable, might cause it to behave as if
       it is neither true nor false.

  9 Types bool, char, and the signed and unsigned integer types are
    collectively called integral types. A synonym for integral type is
    integer type. Enumerations (7.2) are not integral, but they can be
    promoted (4.1) to signed or unsigned int.
+   The representations of integral types shall define values by use of
+   a pure binary numeration system (FootNote).
+   _________FootNote_________
+   A positional representation for integers that uses the binary
+   digits 0 and 1, in which the values represented by successive bits
+   are additive, begin with 1, and are multiplied by successive
+   integral powers of 2, except perhaps the bit with the highest
+   position.
+   For any integral type, 2's complement and signed magnitude
+   implementations are permitted, and for integral types other than
+   character types, 1's complement implementations are also permitted.

  10 There are three floating point types: float, double, and long
    double. The type double provides at least as much precision as
    float, and the type long double provides at least as much precision
    as double. Each implementation defines the characteristics of the
    fundamental floating point types in the standard header <float.h>.
+   The value representation of floating-point types is
+   implementation-defined.
    Integral and floating types are collectively called arithmetic
    types.

  11 The void type specifies an empty set of values. It is used as the
    return type for functions that do not return a value. No object of
    type void may be declared. Any expression may be explicitly
    converted to type void (5.4); the resulting expression may be used
    only as an expression statement (6.2), as the left operand of a
    comma expression (5.18), or as a second or third operand of ?:
    (5.16).

+ 12 Even if the implementation defines two or more basic types to have
+   the same value representation, they are nevertheless different
+   types.

  3.7.2  Compound types                                 [basic.compound]

  Add the following text after paragraphs 4 and 5:

  4 A pointer to objects of a type T is referred to as a pointer to T.
    For example, a pointer to an object of type int is referred to as
    pointer to int and a pointer to an object of class X is called a
    pointer to X. Pointers to incomplete types are allowed although
    there are restrictions on what can be done with them (3.7).
+   The value representation of pointer types is
+   implementation-defined.

  5 Objects of cv-qualified (3.7.3) or unqualified type void* (pointer
    to void), can be used to point to objects of unknown type. A void*
    must have enough bits to hold any object pointer.
+   A qualified or unqualified void* shall occupy the same amount of
+   storage and have the same alignment requirements, that is, have the
+   same object representation, as a qualified or unqualified char*.

  5  Expressions                                                  [expr]

  Add the following text after paragraph 11:

+ 12 If the program attempts to access the stored value of an object
+   other than through an lvalue of one of the following types:
+
+   o the dynamic type of the object,
+
+   o a qualified version of the declared type of the object,
+
+   o a type that is the signed or unsigned type corresponding to the
+     declared type of the object,
+
+   o a type that is the signed or unsigned type corresponding to a
+     qualified version of the declared type of the object,
+
+   o an aggregate or union type that includes one of the aforementioned
+     types among its members (including, recursively, a member of a
+     subaggregate or contained union), or
+
+   o a character type. (40)
+
+   the result is undefined.
+
+   _________FootNote_________
+   40) The intent of this list is to specify those circumstances in
+       which an object may or may not be aliased.

  8.5  Initializers                                           [dcl.init]

  Add the following text after paragraph 9:

+ 10 An uninitialized object has unspecified value and referring to an
+   object with an unspecified value results in undefined behavior.

  9.2  Class Members                                         [class.mem]

  Delete paragraphs 16 to 22.
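
As an illustration of the access list proposed above for clause 5
[expr], the following sketch reads an int through two of the permitted
lvalue types: the corresponding unsigned type and a character type. The
function names read_as_unsigned and count_nonzero_bytes are
hypothetical and are not part of the proposed WP text.

```cpp
#include <cassert>
#include <cstddef>

// Reads the stored value of an int through an lvalue of the
// corresponding unsigned type (third bullet of the proposed list).
unsigned int read_as_unsigned(int* pi) {
    unsigned int* pui = reinterpret_cast<unsigned int*>(pi);
    return *pui;
}

// Examines the bytes of an int through unsigned char lvalues (last
// bullet of the proposed list) and counts those that are nonzero.
std::size_t count_nonzero_bytes(const int& i) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&i);
    std::size_t n = 0;
    for (std::size_t k = 0; k < sizeof(int); ++k) {
        if (p[k] != 0) {
            ++n;
        }
    }
    return n;
}
```

Reading an int through, say, a float lvalue is not on the list and
would fall under "the result is undefined".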