Proposal for C2x
WG14 N2653

Title:	char8_t: A type for UTF-8 characters and strings (Revision 1)
Revises:	N2231
Author:	Tom Honermann <tom@honermann.net>
Date:	2021-06-04
Proposal category:	New features, change to existing features
Target audience:	Developers working on combined C and C++ code bases

Abstract: C++20, through the adoption of WG21 P0482R6 ^{[WG21 P0482R6]}, added a new char8_t fundamental type, changed the character type of u8 character and string literals from char to char8_t, and added the c8rtomb() and mbrtoc8() functions for conversion between multibyte characters and UTF-8. This paper proposes corresponding changes for C to add a char8_t typedef name with type unsigned char, to change the array element type of u8 string literals from char to unsigned char (u8 character literals already have type unsigned char), and to add the c8rtomb() and mbrtoc8() functions. These changes are intended to maintain compatibility between C and C++ and to improve portable support for UTF-8.

Changes since N2231
Introduction
Motivation
Design Options

The char8_t type: typedef name vs a new integer type

The underlying type of char8_t

UTF-8 string literal type

char array initialization by a UTF-8 string literal

Proposal
Backward Compatibility

Implementation Experience
Formal Wording
Acknowledgements
References

Changes since N2231

Proposal changes:
- Rebased the proposed wording on WG14 N2596 ^{[WG14 N2596]}
- Updated wording to address u8 character literals and removed references to WG14 N2198 since it has been incorporated in the working draft.
- Removed drafting notes regarding WG14 DR 488 since its resolution has now been incorporated in the working draft.
- Removed the previously proposed change to disallow initialization of an array of type char or signed char by a UTF-8 string literal.
- Removed the previously proposed __STDC_UTF_8__ macro since UTF-8 character and string literals and the char8_t type are intended only for use with UTF-8.
Other changes:
- Rewrote the abstract to reflect that WG21 P0482R6 ^{[WG21 P0482R6]} was adopted for C++20.
- Rewrote the Motivation section.
- Added the Design Options section.
- Expanded the Backward Compatibility section.
- Updated the Implementation Experience section with links to completed implementations in gcc and glibc.
- Removed use of highlight.js for code highlighting purposes.

Introduction

C11 introduced support for UTF-8, 16-bit, and 32-bit encoded string literals. New char16_t and char32_t typedef names were added to hold values of code units for the 16-bit and 32-bit variants, but a new type or typedef name was not added for the UTF-8 variant. Instead, UTF-8 string literals were specified with the same type used for ordinary string literals; array of char. UTF-8 is the only character encoding mandated to be supported by the C standard for which the standard does not provide a distinctly named code unit type.

Whether char is a signed or unsigned type is implementation defined. Implementations that use an 8-bit signed char are at a disadvantage with respect to working with UTF-8 encoded text since the value range of their implementation of char does not extend to the full range of UTF-8 code unit values; programmers working with such implementations must inject casts to unsigned char for portable code to correctly process lead and continuation code unit values.

The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t typedef and related language and library enhancements intended to better facilitate portable handling of UTF-8 encoded text and to enable working with all five of the standard mandated character encodings in a consistent manner.

Motivation

As of February 2021, UTF-8 is used by more than 96% of all websites ^{[W3Techs]}. While UTF-8 dominates the web, it has not attained similar adoption rates in the execution environments of C and C++ programs. Microsoft has introduced several ways in which a program can opt in to use of UTF-8 as the Active Code Page (ACP) starting with the April 2018 update of Windows 10, but, by default, the ACP remains dependent on region settings. Most POSIX systems, including Linux and macOS, use UTF-8 as the system encoding by default, but continue to support changing the execution environment encoding via locale-related environment variables like LC_ALL. Systems built on EBCDIC, like IBM's z/OS, remain significant players in the C and C++ ecosystems.

Programs that consume or produce UTF-8 text and text for which the encoding is dependent on the execution environment must choose one of a few approaches to manage text represented in these potentially distinct encodings:

- Use char for all text and meticulously track which encoding is to be used at all times.
- Use char for all text, but meticulously convert to or from UTF-8 when interacting with the environment so that text is always represented as UTF-8 within a component.
- Use char when working with text for the execution environment, and a different type, generally unsigned char, for UTF-8 encoded text.

The challenge with the first two approaches is ensuring that text is appropriately tagged and converted as it flows through the program. Since the same type, char, is used as the code unit type for all text, the programmer is unable to rely on the type system to help identify when text has not been appropriately converted.

The challenge with the third approach is the lack of a common type that unambiguously denotes UTF-8 text across components. Within a program, even if there is agreement on an alternate type to use, UTF-8 string literals still have type array of char, not the agreed upon type.

The adoption of a char8_t type via WG21 P0482R6 ^{[WG21 P0482R6]} for C++20 provided a common type tailored for use with UTF-8 text. Adoption of a similar type for C would facilitate source code compatibility between C and C++20, establish a standard common type for programmers who prefer the third approach above, and provide consistent behavior across implementations without the difficulties imposed by the implementation-defined signedness of char.

Consider the following function that purports to check whether a pointer points to UTF-8 text that begins with a UTF-8 leading byte. UTF-8 leading bytes have values in the range 192 (0xC0) to 255 (0xFF) though not all values in that range may appear in valid UTF-8 encoded text.

bool starts_with_utf8_leading_byte(const char *s) {
  return *s >= 0xC0;
}

For implementations that define char as either an unsigned type or with a size greater than 8 bits, this function will correctly classify its inputs (assuming no invalid values). However, for implementations that define char as a signed 8-bit type with a two's complement representation and a range of -128 (-0x80) to 127 (0x7F), the values of UTF-8 leading bytes become negative values with the result that this function always returns false. For the function to behave consistently across implementations, it must be modified to ensure the comparison is performed with an unsigned type.

bool starts_with_utf8_leading_byte(const char *s) {
  return (unsigned char)*s >= 0xC0;
}

The introduction of a char8_t type that behaves as an unsigned type would allow the function to be simply implemented as follows such that it behaves the same for all C and C++20 implementations.

bool starts_with_utf8_leading_byte(const char8_t *s) {
  return *s >= 0xC0;
}

Functions like the starts_with_utf8_leading_byte() example above are not frequently written and the problem exhibited can be easily discovered and corrected during testing. However, more insidious problems may be encountered in other cases, such as with the <ctype.h> character classification functions. Consider the following code that naively attempts to convert its input to uppercase using toupper().

void convert_to_uppercase(char *p) {
  for (; *p; ++p) {
    *p = toupper(*p);
  }
}

When called with a UTF-8 encoded string that contains non-ASCII characters, this function encounters undefined behavior for implementations with an 8-bit signed char type, even when the current locale is UTF-8-based. The problem is that lead and continuation UTF-8 code unit values are negative for such implementations and may result in a sign-extended negative value (that does not match EOF) being passed to toupper(). The result is undefined behavior according to C17 7.4, "Character handling <ctype.h>", paragraph 1:

The header <ctype.h> declares several functions useful for classifying and mapping characters.²⁰²⁾ In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

For this code to portably work as intended, the argument to toupper() must be cast to unsigned char. Alternatively, changing the type of the convert_to_uppercase() parameter to the proposed char8_t type would portably correct the code while also signifying that the intended input is UTF-8.
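The first of these corrections can be sketched as follows. This is a minimal illustration rather than code from the proposal; the name convert_to_uppercase_portable is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Uppercases the string in place using toupper() from the current locale.
   The cast to unsigned char keeps the argument within the range that the
   <ctype.h> functions accept, even where plain char is a signed type. */
void convert_to_uppercase_portable(char *p) {
  assert(p != NULL);
  for (; *p; ++p) {
    *p = (char)toupper((unsigned char)*p);
  }
}
```

The cast removes the undefined behavior, but it does not make toupper() aware of UTF-8. In the default "C" locale, code units above 0x7F are left unchanged; in other locales they may be mapped individually, which can corrupt multi-unit sequences.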

Design Options

The `char8_t` type: typedef name vs a new integer type

When the char16_t and char32_t types were introduced in C11 and C++11, a choice was faced whether to introduce them as typedef names of existing types or as new integer types. The WG14 and WG21 committees chose different directions; WG14 opted for typedef names for C and WG21 opted for new integer types for C++. This choice was consistent with prior choices regarding the wchar_t type. The same choice applies for the introduction of a char8_t type.

The char16_t and char32_t types were added to C++11 by the adoption of WG21 N2249 ^{[WG21 N2249]}. The motivation for new integer types stated in that proposal includes the ability to support function overloading and template specialization; abilities that would not be possible, at least not reliably and portably, if the new types were simply typedef names of existing types. At the time these types were adopted, C did not yet have support for generic programming; the _Generic generic selection expression had not yet been adopted. Thus, there was little to no motivation for WG14 to impose the additional effort required to support new integer types on implementors.

WG14 now has several proposals to improve support for generic programming in C:

Desire for generic programming improvements may translate to additional motivation for distinct integer types for character data. The following example illustrates a potential use case that would be enabled by distinct types.

void send_narrow(const char*);
void send_wide(const wchar_t*);
void send_utf8(const char8_t*);
void send_utf16(const char16_t*);
void send_utf32(const char32_t*);
#define send(X)                            \
        _Generic((X),                      \
                 char*:     send_narrow,   \
                 wchar_t*:  send_wide,     \
                 char8_t*:  send_utf8,     \
                 char16_t*: send_utf16,    \
                 char32_t*: send_utf32)(X)
void f() {
  send(L"text");   /* Would be ok with distinct types; calls send_wide(). */
  send(u8"text");  /* Would be ok with distinct types; calls send_utf8(). */
}

Clang supports an extension that enables overloading in C ^{[Clang overloadable]}. If adopted by WG14, the code above could be more simply written as:

void __attribute__((overloadable)) send(const char*);
void __attribute__((overloadable)) send(const wchar_t*);
void __attribute__((overloadable)) send(const char8_t*);
void __attribute__((overloadable)) send(const char16_t*);
void __attribute__((overloadable)) send(const char32_t*);
void f() {
  send(L"text");   /* Would be ok with distinct types; calls send(const wchar_t*). */
  send(u8"text");  /* Would be ok with distinct types; calls send(const char8_t*). */
}

Additional motivation for distinct integer types is the ability to specify them as non-aliasing types. A non-aliasing type is one for which objects of the type may only be accessed using a limited set of types; compatible types and specially designated types like char and unsigned char. Compilers may use type based alias analysis (TBAA) to generate more efficient code for non-aliasing types. Aliasing violations result in undefined behavior.

The following example code would be well-formed in C regardless of whether char8_t is specified as a new integer type or as a typedef name of an existing character type. If char8_t is specified as a typedef name of an existing character type, then the example also works as expected because it does not violate aliasing rules. However, if char8_t is specified as a new integer type, then the example would exhibit undefined behavior because an object of type char is accessed using the char8_t type (assuming no new special provisions added to C17 6.5, Expressions, paragraph 7). Thus, there is a trade-off between code efficiency and safety inherent in how char8_t is defined.

void do_utf8_things(const char8_t *s) {
  *s;
}
void f() {
  const char *presumably_utf8_text = "text";
  do_utf8_things(presumably_utf8_text);
}

Since char8_t is a distinct type in C++ and the C++ type system prohibits implicit access to objects with an incompatible type without use of a cast, the above example is ill-formed in C++20. However, the code may be rendered well-formed in C++20 by the addition of a cast, but will then result in undefined behavior when executed.

void do_utf8_things(const char8_t *s) {
  *s;
}
void f() {
  const char *presumably_utf8_text = "text";
  do_utf8_things((const char8_t*)presumably_utf8_text);
}

Such a cast might be added by a C programmer in order to silence warnings regarding a change of signedness that might be produced when the const char* argument to do_utf8_things() is converted to const char8_t*; assuming char8_t is a typedef name of a differently signed character type (otherwise, if char8_t were a distinct type, the code would exhibit undefined behavior whether or not the cast was present). In that case, the unfortunate result is that the code is well-formed for both C and C++, but exhibits undefined behavior only when compiled for C++.

This aliasing asymmetry between C and C++ is not a new concern; it already exists for the wchar_t, char16_t, and char32_t types. For example, char16_t and uint_least16_t are distinct integer types in C++ (and do not alias), but are the same type in C. Whether these aliasing issues are more significant for char8_t as opposed to the other character types is a subjective concern.

Introduction of a new char8_t integer type without a corresponding change to make wchar_t, char16_t, and char32_t distinct integer types would be inconsistent and surprising. While the author sees potential use for distinct types as shown above, such a change of direction should be pursued via a separate proposal. Should WG14 indicate support for such direction when reviewing this proposal, the author will submit a separate proposal. In the meantime, this proposal advocates for only a new char8_t typedef name in order to maintain consistency with the existing character types.

Proposed: a new char8_t typedef name defined in the uchar.h header.

The underlying type of `char8_t`

UTF-8 code unit values range from 0x00 to 0xF4 (the values 0xC0, 0xC1, and 0xF5 through 0xFF do not occur in well-formed UTF-8 code unit sequences) and therefore require at least an 8-bit type for storage.
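The excluded values can be expressed as a small predicate. The following is a sketch for illustration only; the function name may_appear_in_utf8 is hypothetical, and the check applies to a single code unit in isolation, not to the validity of a whole sequence.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* unsigned char is wide enough for every UTF-8 code unit value. */
static_assert(UCHAR_MAX >= 0xF4, "unsigned char holds every UTF-8 code unit");

/* Returns true if `unit` can appear somewhere in well-formed UTF-8.
   0xC0 and 0xC1 would only start overlong encodings, and 0xF5..0xFF
   would encode values beyond U+10FFFF, so none of them ever occur. */
bool may_appear_in_utf8(unsigned char unit) {
  if (unit == 0xC0 || unit == 0xC1) {
    return false;
  }
  return unit <= 0xF4;
}
```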

The existing char16_t and char32_t typedef names are defined as having the same type as uint_least16_t and uint_least32_t respectively. This suggests that the underlying type of char8_t should be the same type as uint_least8_t. However, the latitude provided for the uint_least8_t typedef name to be defined with a type other than unsigned char provides no benefit for the proposed char8_t type; unsigned char is already defined to be unsigned with a size and alignment of 1 byte. Since bytes are constrained to be at least 8-bits and no smaller types are possible, additional leniency would only serve to limit portability.

The type of character constants with a u8 encoding-prefix is already unsigned char. The underlying type for char8_t in C++20 is also unsigned char. For consistency with u8 character constants and the C++20 char8_t type, this proposal defines the underlying type of the proposed char8_t type to be unsigned char.

Proposed: The underlying type of char8_t is unsigned char.

UTF-8 string literal type

In C17, a UTF-8 string literal has type array of char. Since the size and signedness of char are implementation-defined, portable code requires casts to an unsigned type when reading UTF-8 code unit values stored in objects of type char. This is required because common implementations implement char as a signed 8-bit type for which integer promotion rules produce a negative value for leading and trailing code unit values (which all have values above 0x7F). While it is uncommon for code to directly access the elements of a string literal, such accesses may occur when macros are involved.

In the working draft, UTF-8 character constants have a type of unsigned char. That results in a surprising inconsistency with UTF-8 string literals.

#define M(X) ((X) >= 0x80)

void f() {
  M(u8"\u00E9"[0]); /* True for some implementations, false for others.
                       U+00E9 is encoded as 0xC3 0xA9 in UTF-8.
                       0xC3 will promote to a negative integer value for
                       implementations with a signed 8-bit char type. */
  M(u8'\xC3');      /* True for all implementations. */
}

Changing the type of a UTF-8 string literal to an array of type char8_t would avoid this inconsistency such that both expressions above would result in a true value for all implementations.

For consistency with u8 character constants and the type of C++20 UTF-8 string literals, this proposal changes the type of a UTF-8 string literal from array of char to array of char8_t. The Backward Compatibility section discusses the impact of this change.

Proposed: The type of UTF-8 string literals is changed from array of char to array of char8_t.

`char` array initialization by a UTF-8 string literal

In C17, arrays of type char, signed char, and unsigned char may be initialized by a UTF-8 string literal. These were all made ill-formed in C++20 where only arrays of char8_t may be initialized by a UTF-8 string literal.

const          char cu8[]  = u8"text";  /* Ok in C17 and C++17, ill-formed in C++20. */
const signed   char scu8[] = u8"text";  /* Ok in C17 and C++17, ill-formed in C++20. */
const unsigned char ucu8[] = u8"text";  /* Ok in C17 and C++17, ill-formed in C++20. */

For other character types, whether an array of the character type can be initialized by a string literal with a mismatched encoding prefix depends on the implementation. C17 6.7.9, "Initialization", paragraph 15 states:

An array with element type compatible with a qualified or unqualified version of wchar_t, char16_t, or char32_t may be initialized by a wide string literal with the corresponding encoding prefix (L, u, or U, respectively), optionally enclosed in braces. Successive wide characters of the wide string literal (including the terminating null wide character if there is room or if the array is of unknown size) initialize the elements of the array.

C++ does not allow initialization of mismatched encoding prefixes.

const wchar_t  wc16[] = u"text";  /* Ok in C17 if wchar_t and char16_t are compatible types, ill-formed in C++20. */
const wchar_t  wc32[] = U"text";  /* Ok in C17 if wchar_t and char32_t are compatible types, ill-formed in C++20. */
const char16_t c16w[] = L"text";  /* Ok in C17 if wchar_t and char16_t are compatible types, ill-formed in C++20. */
const char32_t c32w[] = L"text";  /* Ok in C17 if char32_t and wchar_t are compatible types, ill-formed in C++20. */

Prohibiting initialization of arrays of type char and signed char by UTF-8 string literals would improve consistency with C++20. However, the existing inconsistencies are fully explainable as a consequence of the choice to use existing integer types for wide character types in C vs the choice to introduce new integer types in C++. If WG14 were to decide to switch to use of distinct integer types for wide character types (and char8_t) in the future, then it would make sense to align initialization allowances with C++. Until then, this proposal preserves the existing ability to initialize an array of plain char or an array of signed char with a UTF-8 string literal.

Proposed: initialization of an array of type char or an array of type signed char by a UTF-8 string literal remains well-formed.

Proposal

The proposed changes include:

- A new char8_t typedef name with type unsigned char defined in the <uchar.h> header.
- The type of UTF-8 string literals is changed from array of char to array of char8_t.
- The type of UTF-8 character literals is changed from unsigned char to char8_t. (Since UTF-8 character literals already have type unsigned char, this is not a semantic change.)
- Initialization of an array of type char or type signed char by a UTF-8 string literal remains well-formed.
- New mbrtoc8() and c8rtomb() functions declared in <uchar.h> enable conversions between multibyte characters and UTF-8.
- A new ATOMIC_CHAR8_T_LOCK_FREE macro.
- A new atomic_char8_t typedef name.

Backward Compatibility

The proposed change to the type of UTF-8 string literals impacts backward compatibility as described in the following sections. Implementors are encouraged to offer options to disable char8_t support when necessary to preserve compatibility with C17.

Pointer conversion from a UTF-8 string literal

Initialization or assignment of char pointers (including parameters) from UTF-8 string literals remains well-formed under this proposal. However, some implementations may produce warnings about differences in signedness depending on whether char is a signed or unsigned type.

For example:

const char *p = u8"text"; // Well-formed in C17 and with this proposal, but
                          // implementations may now warn about different
                          // signedness for the pointer target type.

The value of a UTF-8 string literal element

Code that directly accesses the code unit values of UTF-8 string literals without an intervening cast to an unsigned type may observe different values under this proposal. This will occur for implementations with a signed 8-bit char type when accessing a leading or trailing UTF-8 code unit (such code units have a value in the range 0x80 through 0xFF).

For example:

if (u8"\u00E9"[0] < 0) {} // Well-formed with implementation-defined behavior
                          // in C17.  Well-formed with portable behavior with
                          // this proposal (the conditional is always false).

The author is unaware of use cases that involve directly probing the values of UTF-8 string literal elements, but such accesses may occur as a result of macro processing. Code intended to be portable will already contain an appropriate cast to an unsigned type and will therefore be unaffected by this proposal. Non-portable code that relies on leading and trailing UTF-8 code unit values having a negative value will require modification.
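The kind of cast that keeps such code independent of the literal's element type might look like the following. This is an illustrative sketch; IS_NON_ASCII_UNIT and literal_starts_non_ascii are hypothetical names, not part of the proposal.

```c
#include <stdbool.h>

/* Hypothetical helper: converts the code unit to unsigned char before
   comparing, so the result does not depend on whether the element type
   of the string literal is char or unsigned char. */
#define IS_NON_ASCII_UNIT(X) ((unsigned char)(X) >= 0x80)

/* U+00E9 is encoded as 0xC3 0xA9, so the first code unit is non-ASCII
   both in C17 and with the proposed change of type. */
bool literal_starts_non_ascii(void) {
  return IS_NON_ASCII_UNIT(u8"\u00E9"[0]);
}
```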

Type inference

Code that makes use of _Generic expressions, type inference extensions such as gcc's __typeof__ type specifier, or Clang's extension for overloading in C may become ill-formed or behave differently with this proposal.

In the following example, serialize is a type-generic macro that, based on the type of its argument, dispatches to either serialize_text(), serialize_wide_text(), serialize_int(), or serialize_double(). With this proposal, there is no longer a type match, so the code becomes ill-formed. This code can be corrected on the caller side by adding a cast to char* or on the callee side by adding a type match for unsigned char*. The latter approach has the benefit of allowing serialize to dispatch to a serialize_u8text() function that specifically handles UTF-8 encoded text.

void serialize_text(const char*);
void serialize_wide_text(const wchar_t*);
void serialize_int(int);
void serialize_double(double);
#define serialize(X) _Generic((X),                           \
                              char*:    serialize_text,      \
                              wchar_t*: serialize_wide_text, \
                              int:      serialize_int,       \
                              double:   serialize_double)(X)
void f() {
  serialize(u8"text"); // Well-formed in C17, ill-formed with this proposal.
}
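The callee-side correction described above might be sketched as follows. This is an assumption-laden illustration, not code from the proposal: the functions return a hypothetical enum serialize_kind tag instead of void so that the selected branch is observable, and const-qualified associations are added so that pointers to const also match.

```c
#include <stddef.h>

/* Hypothetical tag identifying which function was selected. */
enum serialize_kind {
  KIND_TEXT, KIND_U8TEXT, KIND_WIDE, KIND_INT, KIND_DOUBLE
};

enum serialize_kind serialize_text(const char *s) { (void)s; return KIND_TEXT; }
enum serialize_kind serialize_u8text(const unsigned char *s) { (void)s; return KIND_U8TEXT; }
enum serialize_kind serialize_wide_text(const wchar_t *s) { (void)s; return KIND_WIDE; }
enum serialize_kind serialize_int(int i) { (void)i; return KIND_INT; }
enum serialize_kind serialize_double(double d) { (void)d; return KIND_DOUBLE; }

/* The unsigned char associations give UTF-8 string literals a match
   once their element type is char8_t (unsigned char). */
#define serialize(X) _Generic((X),                                  \
                              char*:                serialize_text,      \
                              const char*:          serialize_text,      \
                              unsigned char*:       serialize_u8text,    \
                              const unsigned char*: serialize_u8text,    \
                              wchar_t*:             serialize_wide_text, \
                              const wchar_t*:       serialize_wide_text, \
                              int:                  serialize_int,       \
                              double:               serialize_double)(X)
```

With these associations, serialize(u8"text") is well-formed either way: it selects serialize_text() where UTF-8 string literals have type array of char and serialize_u8text() where they have type array of unsigned char.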

The following example reimplements the serialization example, using Clang's extension for overloading in C. In this case, the change of type for the UTF-8 string literal results in ambiguous overload resolution. Here again, the code can be corrected on the caller side by adding a cast, or can be corrected on the callee side by adding an overload for const unsigned char*. Again, the latter has the benefit of enabling UTF-8 encoded text to be handled differently than text matching the execution character set.

void serialize(const char*)    __attribute__((overloadable));
void serialize(const wchar_t*) __attribute__((overloadable));
void serialize(int)            __attribute__((overloadable));
void serialize(double)         __attribute__((overloadable));
void f() {
  serialize(u8"text"); // Well-formed in C17 with Clang's overloading extension.
                       // Ill-formed with this proposal.
}

Implementation Experience

The proposed changes have been implemented in forks of gcc and glibc and are available in the char8_t-for-c and char8_t branches respectively of the following repositories:

The changes to glibc provide declarations for the char8_t typedef name and the c8rtomb() and mbrtoc8() functions. When compiling for C, these declarations are only present when the _CHAR8_T_SOURCE feature test macro is defined.

The changes to gcc provide the atomic_char8_t typedef name, the ATOMIC_CHAR8_T_LOCK_FREE macro, and the change of type for UTF-8 literals from array of char to array of unsigned char. The existing -fchar8_t and -fno-char8_t compiler options are extended to C code to allow opting-in or opting-out of these changes. When -fchar8_t is enabled, the _CHAR8_T_SOURCE macro is defined to inform the C library that the char8_t typedef name and the c8rtomb() and mbrtoc8() declarations should be provided by the uchar.h header.

Formal Wording

These changes are relative to WG14 N2596 ^{[WG14 N2596]}

Change in 6.4.4 (Character constants) paragraph 9:

The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type:

| Prefix | Corresponding type |
| --- | --- |
| none | unsigned char |
| u8 | ~~unsigned char~~char8_t |
| L | the unsigned type corresponding to wchar_t |
| u | char16_t |
| U | char32_t |

Change in 6.4.4 (Character constants) paragraph 12:

A UTF-8 character constant has type ~~unsigned char~~char8_t. The value of a UTF-8 character constant is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF-8 code unit.

Change in 6.4.5 (String Literals) paragraph 6:

[…] For UTF-8 string literals, the array elements have type ~~char~~char8_t, and are initialized with the characters of the multibyte character sequence, as encoded in UTF–8. […]

Change in 7.17.1 (Introduction) paragraph 3:

The macros defined are the atomic lock-free macros
ATOMIC_BOOL_LOCK_FREE
ATOMIC_CHAR_LOCK_FREE
ATOMIC_CHAR8_T_LOCK_FREE
ATOMIC_CHAR16_T_LOCK_FREE
ATOMIC_CHAR32_T_LOCK_FREE
ATOMIC_WCHAR_T_LOCK_FREE
ATOMIC_SHORT_LOCK_FREE
ATOMIC_INT_LOCK_FREE
ATOMIC_LONG_LOCK_FREE
ATOMIC_LLONG_LOCK_FREE
ATOMIC_POINTER_LOCK_FREE

[…]

Change in 7.17.6 (Atomic integer types) paragraph 1:

For each line in the following table,^{[Footnote: See "future library directions" (7.31.10).]} the atomic type name is declared as a type that has the same representation and alignment requirements as the corresponding direct type.^{[Footnote: The same representation and alignment requirements are meant to imply interchangeability as arguments to functions, return values from functions, and members of unions.]}

| Atomic type name | Direct type |
| --- | --- |
| […] | […] |
| atomic_ullong | _Atomic unsigned long long |
| atomic_char8_t | _Atomic char8_t |
| atomic_char16_t | _Atomic char16_t |
| atomic_char32_t | _Atomic char32_t |
| atomic_wchar_t | _Atomic wchar_t |
| […] | […] |

Change in 7.28 (Unicode utilities <uchar.h>) paragraph 2:

The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char8_t
which is an unsigned integer type used for UTF-8 characters and is the same type as unsigned char;
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (described in 7.20.1.2).

Insert a new subclause before 7.28.1.1 (The mbrtoc16 function):

7.28.1.1 The mbrtoc8 function

Add a new paragraph 1:

Synopsis

#include <uchar.h>
size_t mbrtoc8(char8_t * restrict pc8,
      const char * restrict s, size_t n,
      mbstate_t * restrict ps);

Add a new paragraph 2:

Description
If s is a null pointer, the mbrtoc8 function is equivalent to the call:

mbrtoc8(NULL, "", 1, ps)

In this case, the values of the parameters pc8 and n are ignored.

Add a new paragraph 3:

If s is not a null pointer, the mbrtoc8 function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). If the function determines that the next multibyte character is complete and valid, it determines the values of the corresponding characters and then, if pc8 is not a null pointer, stores the value of the first (or only) such character in the object pointed to by pc8. Subsequent calls will store successive characters without consuming any additional input until all the characters have been stored. If the corresponding character is the null character, the resulting state described is the initial conversion state.

Add a new paragraph 4:

Returns
The mbrtoc8 function returns the first of the following that applies (given the current conversion state):

0 if the next n or fewer bytes complete the multibyte character that corresponds to the null character (which is the value stored).

between 1 and n inclusive if the next n or fewer bytes complete a valid multibyte character (which is the value stored); the value returned is the number of bytes that complete the multibyte character.

(size_t) (−3) if the next character resulting from a previous call has been stored (no bytes from the input have been consumed by this call).

(size_t) (−2) if the next n bytes contribute to an incomplete (but potentially valid) multibyte character, and all n bytes have been processed (no value is stored).^{[Footnote: When n has at least the value of the MB_CUR_MAX macro, this case can only occur if s points at a sequence of redundant shift sequences (for implementations with state-dependent encodings).]}

(size_t) (−1) if an encoding error occurs, in which case the next n or fewer bytes do not contribute to a complete and valid multibyte character (no value is stored); the value of the macro EILSEQ is stored in errno, and the conversion state is unspecified.
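The return-value protocol above can be modeled without library support for the proposed function. The following is a minimal sketch and not the proposed mbrtoc8: the names latin1_to_c8, struct conv_state, and convert_all are hypothetical, the input is assumed to be ISO 8859-1 rather than the locale's multibyte encoding, and a small struct stands in for mbstate_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mbstate_t: one code unit awaiting delivery. */
struct conv_state {
  unsigned char pending;
  int has_pending;
};

/* Models the mbrtoc8 return protocol for ISO 8859-1 input (assumption). */
size_t latin1_to_c8(unsigned char *pc8, const char *s, size_t n,
                    struct conv_state *ps) {
  assert(ps != NULL);
  if (ps->has_pending) {            /* deliver a stored unit, consume nothing */
    if (pc8) *pc8 = ps->pending;
    ps->has_pending = 0;
    return (size_t)-3;
  }
  if (n == 0) return (size_t)-2;    /* no complete character available */
  unsigned char b = (unsigned char)*s;
  if (b < 0x80) {                   /* ASCII maps to a single code unit */
    if (pc8) *pc8 = b;
    return b == 0 ? 0 : 1;
  }
  /* U+0080..U+00FF need two code units: store the first, keep the second. */
  if (pc8) *pc8 = (unsigned char)(0xC0 | (b >> 6));
  ps->pending = (unsigned char)(0x80 | (b & 0x3F));
  ps->has_pending = 1;
  return 1;
}

/* Caller-side loop: converts n input bytes into at most cap code units.
   Returns the number of units written, or (size_t)-1 on error or if the
   output buffer is too small. */
size_t convert_all(const char *s, size_t n, unsigned char *out, size_t cap) {
  struct conv_state st = {0, 0};
  size_t written = 0;
  for (;;) {
    unsigned char unit;
    size_t r = latin1_to_c8(&unit, s, n, &st);
    if (r == (size_t)-2) break;                 /* input exhausted */
    if (r == (size_t)-1) return (size_t)-1;     /* encoding error */
    if (written == cap) return (size_t)-1;      /* out of space */
    out[written++] = unit;
    if (r == (size_t)-3) continue;              /* no input was consumed */
    if (r == 0) break;                          /* null character stored */
    s += r;
    n -= r;
  }
  return written;
}
```

The point of the sketch is the caller's handling of (size_t)(−3): the loop stores the delivered code unit but does not advance the input pointer, since no bytes were consumed by that call.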

Insert another new subclause before 7.28.1.1 (The mbrtoc16 function):

7.28.1.2 The c8rtomb function

Add a new paragraph 1:

Synopsis

#include <uchar.h>
size_t c8rtomb(char * restrict s, char8_t c8,
      mbstate_t * restrict ps);

Add a new paragraph 2:

Description
If s is a null pointer, the c8rtomb function is equivalent to the call

c8rtomb(buf, u8'\0', ps)

where buf is an internal buffer.

Add a new paragraph 3:

If s is not a null pointer, the c8rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the character given or completed by c8 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s, or stores nothing if c8 does not represent a complete character. At most MB_CUR_MAX bytes are stored. If c8 is a null character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state; the resulting state described is the initial conversion state.

Add a new paragraph 4:

Returns
The c8rtomb function returns the number of bytes stored in the array object (including any shift sequences). When c8 is not a valid character, an encoding error occurs: the function stores the value of the macro EILSEQ in errno and returns (size_t) (−1); the conversion state is unspecified.
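
As an informal illustration (not part of the proposed wording), the code-unit-at-a-time behaviour described above can be modelled in a few lines. This is a minimal sketch under stated assumptions: the multibyte encoding is taken to be UTF-8, so the bytes stored are the code units received; model_char8_t, model_state, and model_c8rtomb are hypothetical stand-ins for char8_t, mbstate_t, and c8rtomb; and only lead and continuation byte ranges are checked (overlong and surrogate encodings are not rejected). A real implementation works through the current locale and an opaque mbstate_t.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char model_char8_t;  /* hypothetical stand-in for char8_t */

/* Hypothetical stand-in for mbstate_t; all-zero is the initial state. */
typedef struct {
    unsigned char buf[4];  /* code units collected for the current character */
    size_t have;           /* number of code units collected so far */
    size_t need;           /* total length of the sequence, 0 when idle */
} model_state;

/* Model of c8rtomb for a UTF-8 multibyte encoding (an assumption). */
size_t model_c8rtomb(char *s, model_char8_t c8, model_state *st)
{
    assert(s != NULL && st != NULL);

    if (st->need == 0) {
        /* Idle: c8 must be a lead code unit; derive the sequence length. */
        if (c8 < 0x80)                      st->need = 1;
        else if (c8 >= 0xC2 && c8 <= 0xDF)  st->need = 2;
        else if (c8 >= 0xE0 && c8 <= 0xEF)  st->need = 3;
        else if (c8 >= 0xF0 && c8 <= 0xF4)  st->need = 4;
        else { errno = EILSEQ; return (size_t)-1; }
        st->have = 0;
    } else if (c8 < 0x80 || c8 > 0xBF) {
        /* Mid-sequence: anything but a continuation unit is an error. */
        st->need = 0;
        errno = EILSEQ;
        return (size_t)-1;
    }

    st->buf[st->have++] = c8;
    if (st->have < st->need)
        return 0;  /* character not yet complete: nothing is stored */

    /* Character complete: store its bytes and return to the initial state. */
    memcpy(s, st->buf, st->need);
    size_t stored = st->need;
    st->need = 0;
    return stored;
}
```

In this model, feeding the three code units of U+20AC (0xE2, 0x82, 0xAC) one call at a time yields return values 0, 0, and 3, with the three bytes stored only on the final call.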

Change in B.16 (Atomics <stdatomic.h>)

[…]
ATOMIC_CHAR_LOCK_FREE
ATOMIC_CHAR8_T_LOCK_FREE
ATOMIC_CHAR16_T_LOCK_FREE
ATOMIC_CHAR32_T_LOCK_FREE
ATOMIC_WCHAR_T_LOCK_FREE
[…]
atomic_ullong
atomic_char8_t
atomic_char16_t
atomic_char32_t
atomic_wchar_t
[…]

Change in B.27 (Unicode utilities <uchar.h>)

mbstate_t size_t char8_t char16_t char32_t

size_t mbrtoc8(char8_t * restrict pc8,
      const char * restrict s, size_t n,
      mbstate_t * restrict ps);
size_t c8rtomb(char * restrict s, char8_t c8,
      mbstate_t * restrict ps);
size_t mbrtoc16(char16_t * restrict pc16,
      const char * restrict s, size_t n,
      mbstate_t * restrict ps);
size_t c16rtomb(char * restrict s, char16_t c16,
      mbstate_t * restrict ps);
size_t mbrtoc32(char32_t * restrict pc32,
      const char * restrict s, size_t n,
      mbstate_t * restrict ps);
size_t c32rtomb(char * restrict s, char32_t c32,
      mbstate_t * restrict ps);

Change in J.6.1 (Rule based identifiers) paragraph 2:

The following ** count ** identifiers or keywords match these patterns and have particular semantics provided by this document.

[…]
atomic_char
atomic_char8_t
ATOMIC_CHAR8_T_LOCK_FREE
atomic_char16_t
ATOMIC_CHAR16_T_LOCK_FREE
[…]

Change in J.6.2 (Particular identifiers or keywords) paragraph 1:

The following ** count ** identifiers or keywords are not covered by the above and have particular semantics provided by this document.

[…]
char
char8_t
char16_t
char32_t
[…]
BUFSIZ
c8rtomb
c16rtomb
c32rtomb
[…]
mbrlen
mbrtoc8
mbrtoc16
mbrtoc32
[…]

Acknowledgements

Thank you to Aaron Ballman for his kind assistance facilitating interaction with WG14.

Thank you to Richard Smith and Jens Maurer for review feedback and many educational and helpful conversations.

References

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2695.pdf

^{[W3Techs]}	"Usage of UTF-8 for websites", W3Techs, 2021. https://w3techs.com/technologies/details/en-utf8/all/all
^{[WG14 N2596]}	JeanHeyd Meneide, Freek Wiedijk, et al., "C2x Working Draft", WG14 N2596, 2020. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf
^{[WG14 N2620]}	JeanHeyd Meneide, "Restartable and Non-Restartable Functions for Efficient Character Conversions | r4", WG14 N2620, 2020. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2620.htm
^{[WG14 N2654]}	Jens Gustedt, "Revise spelling of keywords v5", WG14 N2654, 2021. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2654.pdf
^{[WG14 N2724]}	JeanHeyd Meneide, "Not-So-Magic - typeof(…) in C | r3", WG14 N2724, 2021. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2724.htm
^{[WG14 N2734]}	Jens Gustedt, "Improve type generic programming", WG14 N2734, 2021. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2734.pdf
^{[WG14 N2735]}	Jens Gustedt, "Type inference for variable definitions and function returns", WG14 N2735, 2021. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2735.pdf
^{[WG14 N2738]}	Jens Gustedt, "Type-generic lambdas", WG14 N2738, 2021. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2738.pdf
^{[WG21 N2249]}	Lawrence Crowl, "New Character Types in C++", WG21 N2249, 2007. https://wg21.link/n2249
^{[WG21 P0482R6]}	Tom Honermann, "char8_t: A type for UTF-8 characters and strings (Revision 6)", WG21 P0482R6, 2018. https://wg21.link/p0482r6
^{[Clang overloadable]}	The Clang Team, "Clang 11 documentation, Attributes in Clang", 2020. https://releases.llvm.org/11.0.0/tools/clang/docs/AttributeReference.html#overloadable

Prefix	Corresponding type
none	`unsigned char`
`u8`	`char8_t` (replacing `unsigned char`)
`L`	the unsigned type corresponding to `wchar_t`
`u`	`char16_t`
`U`	`char32_t`

Atomic type name	Direct type
[…]	[…]
`atomic_ullong`	`_Atomic unsigned long long`
`atomic_char8_t`	`_Atomic char8_t`
`atomic_char16_t`	`_Atomic char16_t`
`atomic_char32_t`	`_Atomic char32_t`
`atomic_wchar_t`	`_Atomic wchar_t`
[…]	[…]
