Defect Report #212
Submitter: Clive Feather
<clive@demon.net>
Submission Date: 1999-10-20
Reference Document: ISO/IEC WG14 N898
Subject: binding of multibyte conversion state objects
Summary
At present an mbstate_t object can only ever be used to make one
conversion. This is not desirable, and changes are proposed in this area.
Discussion
Clause 7.24.6 paragraph 3 reads, in part:
If an mbstate_t object has been altered by any of the functions
described in this subclause, and is then used with a different multibyte character sequence, or in
the other conversion direction, or with a different LC_CTYPE category setting than on earlier function calls, the behavior is undefined.
Put another way, each mbstate_t object
is initially "unbound" (if it is initialized to zero) and then becomes
"bound" by any call to a function such as mbrtowc or wcrtomb. When
"bound" it can only be used in the same direction with the same string as
originally bound, and only when the LC_CTYPE category is that in effect
when it was bound. With ordinary mbstate_t objects this is
an annoyance; one implication is that a new object must be created every
single time a new string is to be converted (the Standard does not
provide any way to "unbind" the object). With the mbstate_t object inside a FILE
structure it is even worse, because it makes it impossible to (for example)
write to a file, rewind it, and then read the same file. Similarly, the
internal mbstate_t objects used when the mbstate_t pointer argument is
set to NULL can be used for only one string in the entire program!
Users of mbstate_t objects (including
those in FILE structures) expect to be able to use them for more than a
single purpose.
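Under the current wording, the only clearly safe idiom is therefore one zero-valued mbstate_t object per string. A minimal sketch of that idiom follows; count_chars is a hypothetical helper name, not part of the Standard, and the sketch assumes the string is to be interpreted in whatever LC_CTYPE setting is in effect at the call.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in s, using a
   conversion state object created afresh for this one string. */
static long count_chars(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);      /* zero-valued: initial conversion state */

    size_t left = strlen(s);
    long count = 0;
    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* encoding error or incomplete character */
        if (n == 0)
            break;                  /* null character: stop defensively */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

Each call builds its own state object, so no object is ever reused with a second string.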
Proposed solution
The changes introduce the concept that
an mbstate_t object is either "unbound" or "bound".
When set to an all-zero value (which can be at initialization or explicitly later on)
it is unbound. As soon as the object is used for a conversion it
becomes bound to that string, locale, and direction. Returning to the initial
state does not unbind the object (in other words, while all
unbound objects are in the initial state the converse is not necessarily
true).
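The bound/unbound rule can be pictured as a small bookkeeping model. The sketch below is purely illustrative: struct tracked_state, tracked_unbind and tracked_use are hypothetical names, a real implementation is not required to record any of this, and the LC_CTYPE setting is left out of the model for brevity.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

enum direction { TO_WIDE, TO_MULTIBYTE };

/* Hypothetical model: an mbstate_t plus a record of what it is bound to. */
struct tracked_state {
    mbstate_t st;
    bool bound;
    const void *sequence;   /* identity of the sequence it is bound to */
    enum direction dir;     /* direction it is bound to */
};

/* Assigning a zero value makes the object unbound. */
static void tracked_unbind(struct tracked_state *t)
{
    memset(&t->st, 0, sizeof t->st);
    t->bound = false;
    t->sequence = NULL;
    t->dir = TO_WIDE;
}

/* Returns true if this use is permitted by the model. An unbound object
   binds on first use; a bound object may only be used with the same
   sequence and direction (any other use models undefined behavior). */
static bool tracked_use(struct tracked_state *t, const void *sequence,
                        enum direction dir)
{
    if (!t->bound) {
        t->bound = true;
        t->sequence = sequence;
        t->dir = dir;
        return true;
    }
    return t->sequence == sequence && t->dir == dir;
}
```

In this model, reaching the initial conversion state part-way through a string leaves bound set, which mirrors the statement that the converse does not necessarily hold.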
The special cases of mbrtowc and
wcrtomb (those in which s is a null pointer) are defined to always result
in an unbound state. This both provides
more consistent behaviour (the special case resets everything to a
known state) and also allows the internal mbstate_t objects associated
with these functions to be unbound.
The mbstate_t object hidden in a file
is returned to the unbound state whenever end of file is reached on
input, and by any call to fseek (these choices were made to correspond
with the requirements of 7.19.5.3 paragraph 6 for changing I/O
direction).
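The usage pattern this is meant to permit might look like the sketch below: write a wide character to a stream, reposition with fseek, then read it back. write_seek_read is a hypothetical helper name, and the sketch assumes tmpfile can create a file and that wc is representable in the current locale.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: write wc to a temporary wide-oriented stream,
   seek back to the start, and read it back. Returns WEOF on failure. */
static wint_t write_seek_read(wchar_t wc)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return WEOF;

    wint_t result = WEOF;
    /* fseek separates the output from the input; under the proposed
       wording it would also leave the stream's mbstate_t unbound. */
    if (fputwc(wc, f) != WEOF && fseek(f, 0L, SEEK_SET) == 0)
        result = fgetwc(f);

    fclose(f);
    return result;
}
```

The existing wording leaves this sequence undefined in principle, because the stream's hidden state object is used in both directions.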
The internal mbstate_t objects
associated with the mbrlen, mbrtowc,
wcrtomb, mbsrtowcs, and wcsrtombs
functions can only be used with the locale they initially bind to. Other
changes deal with the first three; a previously impossible case (a null
src pointer) is used for the last two to force the object to the unbound state.
Suggested Technical Corrigendum
(Changes concerning explicit mbstate_t objects.)
Change 7.24.6 paragraph 3 to:
[#3] The initial conversion state corresponds, for a conversion in either direction,
to the beginning of a new multibyte character in the initial
shift state. An mbstate_t object may be "unbound"
or "bound". A zero-valued mbstate_t object is (at least) one way to
describe an unbound object, and if an mbstate_t object is
assigned such a value it becomes unbound. All unbound
mbstate_t objects are in the initial conversion state (but the
converse is not necessarily true).
[#3a] An unbound object can be used
to initiate conversion involving any multibyte character
sequence, in any LC_CTYPE category setting, in either
direction; once used for a conversion, it becomes bound to that sequence,
category setting, and direction. If a bound mbstate_t object is used
with a different multibyte character sequence, a different LC_CTYPE
category setting, or in the other conversion direction to
that it is bound to, the behavior is undefined.290)
Append to footnote 290:
Furthermore, provided that the
object is unbound, and thus in the initial conversion state, it
can then be used in converting a new string, a new locale, or in
the other direction.
Change 7.24.6.3 paragraph 1 and 7.24.6.4 paragraph 1 from:
[...] which is initialized at
program startup to the initial conversion state. [...]
to:
[...] which is initialized at
program startup to the unbound state. [...]
Change 7.24.6.3.2 paragraph 2 to:
[#2] If s is a null pointer, the mbrtowc function is equivalent to the call:
mbrtowc(NULL, "", 1, ps)
except that the resulting state described is unbound even if an encoding error occurred.
In this case, the values of the
parameters pwc and n are
ignored.
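As an illustration, the null-s call could be wrapped as below. reset_via_mbrtowc is a hypothetical name. Whether an object is "unbound" is not observable from a program, so the sketch can only check, with mbsinit, that the object describes the initial conversion state.

```c
#include <wchar.h>

/* Hypothetical helper: use the null-s special case of mbrtowc to reset
   *ps. The pwc and n arguments are ignored in this case. Returns 1 if
   *ps afterwards describes the initial conversion state, 0 otherwise. */
static int reset_via_mbrtowc(mbstate_t *ps)
{
    (void)mbrtowc(NULL, NULL, 0, ps);
    return mbsinit(ps) != 0;
}
```

Under the proposed wording the same call would also unbind the object, even after an encoding error.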
Change 7.24.6.3.3 paragraph 2 to:
[#2] If s is a null pointer, the wcrtomb function is equivalent to the call
wcrtomb(buf, L'\0', ps)
where buf is an internal buffer,
except that the resulting state described is always unbound even
if an encoding error occurred 291a; the value of wc is ignored.
291a) The effect is reliably to make *ps unbound.
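The corresponding call in the other direction might be wrapped as follows. reset_via_wcrtomb is a hypothetical name; the return value is that of the equivalent call storing a null wide character, which is assumed here to be a single byte because the object starts in the initial shift state.

```c
#include <wchar.h>

/* Hypothetical helper: use the null-s special case of wcrtomb to reset
   *ps. The wc argument is ignored in this case. Returns the value of
   the underlying wcrtomb call. */
static size_t reset_via_wcrtomb(mbstate_t *ps)
{
    return wcrtomb(NULL, L'\0', ps);
}
```

After the call, mbsinit(ps) is expected to report the initial conversion state; the unbound property itself cannot be tested for.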
Append to 7.24.6.4 paragraph 2:
As a special case, if src is a null pointer then the normal behaviour of the function is
ignored and instead ps becomes unbound irrespective of its previous state; an
unspecified value is returned.
(Changes associated with streams.)
Append to 7.19.2 paragraph 6:
If a wide character input function encounters end-of-file, or
after a successful call to the fseek function, the mbstate_t
object associated with the stream is unbound.
Append to the last sentence of 7.19.9.2
paragraph 5:
and if the stream is wide-oriented the associated mbstate_t object shall be unbound.
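In 7.24.3.1 paragraph 2, change:
[...] If the stream is at end-of-file, the end-of-file indicator for the stream
is set and fgetwc returns WEOF. [...]
to: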
[...] If the stream is at end-of-file, the end-of-file indicator for the stream
is set, the mbstate_t object associated with the stream is unbound,
and fgetwc returns WEOF. [...]