Document Number P0169R0
Date 2015-11-03
Audience Library Evolution Working Group
Reply-To
  • Nozomu Katō

regex with Unicode character types

Table of Contents

  1. Introduction and Motivation
  2. Scope and Impact on the Standard
  3. <regex> with char16_t
  4. Technical Specifications
  5. Relevant Issues
  6. References

I. Introduction and Motivation

Of the four character types that C++ has, only char and wchar_t can be used with the regular expression library in the C++ standard. Because of this, regular expression matching and searching against a Unicode string are available only in environments where a value of char or wchar_t denotes a UTF-32 character.

It is unfortunate and inconvenient that, while C++ has two character types and two string classes dedicated to Unicode (char16_t and char32_t, u16string and u32string) as well as a regular expression library (regex), they cannot be used together in all implementations.

This paper proposes that the regular expression library in the C++ standard (henceforth, <regex>) should support sequences of Unicode character types at least at the same level as sequences of char and wchar_t.

II. Scope and Impact on the Standard

Since there are different problems in using <regex> with char16_t or char32_t, different measures are required for each of them:

<regex> with char32_t

A value of char32_t is practically a Unicode code point itself, so in essence it should be usable with <regex> without special treatment. However, basic_regex<char32_t> is unavailable in most implementations based on the current standard. The core reason is that the class tries to use regex_traits<char32_t> internally, and this is not available because it depends on several classes in <locale>, namely ctype<char32_t>, collate<char32_t>, and collate_byname<char32_t>, for which specializations are not defined in the standard.

Thus, for <regex> to support char32_t, it is proposed to define specializations of these classes for char32_t in the standard.

<regex> with char16_t

Use of <regex> with char16_t has the following problems:

  • Regular expressions that represent a set of characters, such as [\u0000-\uFFFF] (character class), . (dot atom), \S (predefined character class), etc., can match one half of a surrogate pair instead of the whole pair that represents one Unicode character, since comparison is conceptually performed between a code unit in the regular expression and a code unit in the input sequence passed to an algorithm.

  • As in the case of char32_t, the specializations regex_traits<char16_t>, ctype<char16_t>, collate<char16_t>, and collate_byname<char16_t> are not available. However, unlike the case of char32_t, it is difficult to define appropriate specializations of ctype for char16_t, because ctype has member functions that take an argument of type charT (i.e., char16_t) and return a value of the same type. Such functions cannot deal with a surrogate pair, so icase matching, which depends on one of them, tolower(), is not performed correctly by the algorithms of <regex>.

    Note: UCS-2 is already obsolete in the Unicode standard and deprecated in ISO/IEC 10646. Newly added features must not support UCS-2 explicitly.

For <regex> to support char16_t, therefore, special treatment would be required. This is discussed in the next section, but in any case the existing libraries other than <regex> would not be affected at all.
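
To make the first problem above concrete, the following minimal sketch shows that a single supplementary character occupies two char16_t code units, so that a matcher advancing one code unit at a time sees only half of it. The names is_lead_surrogate, is_trail_surrogate, and code_point_length_at are hypothetical helpers introduced for illustration; they are not part of <regex> or of this proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

//  Hypothetical helpers for illustration only.
constexpr bool is_lead_surrogate(char16_t c)  { return (c & 0xfc00) == 0xd800; }
constexpr bool is_trail_surrogate(char16_t c) { return (c & 0xfc00) == 0xdc00; }

//  Number of code units that one Unicode character occupies at s[pos].
//  A code-unit-wise "." always consumes one unit; a code-point-wise "."
//  would have to consume this many.
inline std::size_t code_point_length_at(const std::u16string &s, std::size_t pos)
{
    if (is_lead_surrogate(s[pos]) && pos + 1 < s.size() && is_trail_surrogate(s[pos + 1]))
        return 2;
    return 1;
}

inline void surrogate_pair_example()
{
    const std::u16string s = u"\U00010000";      //  one character, U+10000
    assert(s.size() == 2);                        //  but two code units
    assert(s[0] == 0xd800 && s[1] == 0xdc00);
    assert(code_point_length_at(s, 0) == 2);
}
```

A dot atom that compares code unit by code unit would stop after s[0] in this example, leaving the trail surrogate 0xdc00 unmatched.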

There might be demand for more full-featured Unicode regular expression support, like that described in UTS #18, to get into the C++ standard. But I propose, as a first step, that <regex> support sequences of Unicode character types at the same level as sequences of char and wchar_t, for the following reasons:

Note: As of October 2015, among the six regular expression grammars referred to by the C++ standard, only RegExp of ECMAScript has explicit Unicode support, and it performs character-by-character comparison where each character is either a code point or a UTF-16 code unit, depending upon whether the /u flag is set or not.

III. <regex> with char16_t

There are two options for char16_t support:

1. Provide a UTF-16 to UTF-32 converting iterator

In this option the C++ standard does not support std::u16regex, but defines a bidirectional iterator that converts UTF-16 to UTF-32 on the fly for the algorithms of <regex>. This iterator takes pointers or iterators delimiting the UTF-16 sequence [begin, end) as input; its operator*() returns a value of char32_t, and its operator++() and operator--() move its position to the next and previous character in the sequence, respectively. A very rough sketch is as follows:

template<class BidiIterator>
struct regex_u16u32conv_iterator
{
public:
    typedef bidirectional_iterator_tag iterator_category;

    regex_u16u32conv_iterator(BidiIterator begin, BidiIterator end) : boi(begin), eoi(end)
    {
    }

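    char32_t operator*()
    {
        //  A lead surrogate (0xd800-0xdbff) followed by a trail surrogate
        //  (0xdc00-0xdfff) is combined into one code point.
        if ((*boi & 0xfc00) == 0xd800)
        {
            BidiIterator trail = boi;
            if (++trail != eoi && (*trail & 0xfc00) == 0xdc00)
                return static_cast<char32_t>((((*boi & 0x3ff) << 10) | (*trail & 0x3ff)) + 0x10000);
        }
        //  Any other code unit, including an unpaired surrogate, is returned as is.
        return static_cast<char32_t>(*boi);
    }

    regex_u16u32conv_iterator &operator++()
    {
        //  Skip the trail surrogate only when it completes a pair.
        const bool lead = (*boi & 0xfc00) == 0xd800;
        ++boi;
        if (lead && boi != eoi && (*boi & 0xfc00) == 0xdc00)
            ++boi;

        return *this;
    }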

    bool operator==(const regex_u16u32conv_iterator &right) const
    {
        return boi == right.boi && eoi == right.eoi;
    }

    operator BidiIterator() const
    {
        return boi;
    }

    //  other members...

private:
    BidiIterator boi;
    BidiIterator eoi;
};
typedef regex_u16u32conv_iterator<char16_t*> regex_u16cu32conv_iterator;
typedef regex_u16u32conv_iterator<u16string::iterator> regex_u16su32conv_iterator;

char16_t u16chars[] = u"\u3000\U00010000\u0040";  //  0x3000, 0xd800, 0xdc00, 0x0040
regex_u16cu32conv_iterator u16tou32(u16chars, u16chars + 4);
*u16tou32;                                  //  returns 0x3000 of char32_t
++u16tou32;
*u16tou32;                                  //  returns 0x10000 of char32_t
++u16tou32;
*u16tou32;                                  //  returns 0x40 of char32_t

//  A regular expression in UTF-16 needs to be converted
//  into UTF-32 before being passed to u32regex.
u32string u32restr = U"(abc|def)[ghi]";
u32regex u32re(u32restr);

u16string u16text = u" long long text encoded in UTF-16... ";
regex_u16su32conv_iterator bos(u16text.begin(), u16text.end());
regex_u16su32conv_iterator eos(u16text.end(), u16text.end());
regex_search(bos, eos, u32re);

This iterator does not need to strictly satisfy all the requirements of a bidirectional iterator; it only needs to be accepted as one by all the algorithms of <regex>.
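
The rough sketch omits operator--(). As one possible illustration of the backward step, the following free function moves a position back by one Unicode character. The name utf16_prev and its interface are assumptions made for this example only; a real operator--() would also have to know the beginning of the sequence, which the sketched class does not keep once it has been advanced.

```cpp
#include <cassert>
#include <string>

//  Hypothetical helper: returns the position of the Unicode character that
//  precedes pos in a UTF-16 sequence starting at begin.
//  Precondition: pos != begin.
template <class BidiIterator>
BidiIterator utf16_prev(BidiIterator begin, BidiIterator pos)
{
    --pos;
    //  If we landed on a trail surrogate preceded by a lead surrogate,
    //  step back once more so that pos points at the start of the pair.
    if (pos != begin && (*pos & 0xfc00) == 0xdc00)
    {
        BidiIterator lead = pos;
        --lead;
        if ((*lead & 0xfc00) == 0xd800)
            pos = lead;
    }
    return pos;
}

inline void utf16_prev_example()
{
    const std::u16string s = u"\u3000\U00010000@";  //  0x3000, 0xd800, 0xdc00, 0x0040
    std::u16string::const_iterator p = s.end();
    p = utf16_prev(s.begin(), p);  assert(p - s.begin() == 3);  //  U+0040
    p = utf16_prev(s.begin(), p);  assert(p - s.begin() == 1);  //  start of the pair
    p = utf16_prev(s.begin(), p);  assert(p - s.begin() == 0);  //  U+3000
}
```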

An advantage of this approach is that a similar iterator can be provided for UTF-8 to UTF-32 conversion, too. It is possible to support all UTFs (UTF-32, UTF-16, and UTF-8) by the combination of adding support for char32_t to <regex> and defining converting iterators.
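
As a hedged illustration of what the dereference step of such a UTF-8 iterator would have to do, the following sketch decodes one code point from a UTF-8 byte sequence. The function utf8_decode_at is hypothetical, and it handles malformed input only in the simplest way (by returning the first byte as is); it does not reject overlong forms or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

//  Hypothetical helper: decodes the UTF-8 sequence starting at s[pos] and
//  stores the number of bytes consumed in len. A truncated sequence or one
//  with a bad continuation byte is returned as a single byte.
inline char32_t utf8_decode_at(const std::string &s, std::size_t pos, std::size_t &len)
{
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t n = 1;
    char32_t cp = b0;

    //  The leading byte tells how many bytes the sequence has.
    if ((b0 & 0xe0) == 0xc0)      { n = 2; cp = b0 & 0x1f; }
    else if ((b0 & 0xf0) == 0xe0) { n = 3; cp = b0 & 0x0f; }
    else if ((b0 & 0xf8) == 0xf0) { n = 4; cp = b0 & 0x07; }

    if (pos + n > s.size())
    {
        len = 1;
        return b0;
    }
    //  Each continuation byte (10xxxxxx) contributes six bits.
    for (std::size_t i = 1; i < n; ++i)
    {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xc0) != 0x80)
        {
            len = 1;
            return b0;
        }
        cp = (cp << 6) | (b & 0x3f);
    }
    len = n;
    return cp;
}

inline void utf8_decode_example()
{
    const std::string s = "\xE3\x80\x80\xF0\x90\x80\x80@";  //  U+3000, U+10000, U+0040
    std::size_t len = 0;
    assert(utf8_decode_at(s, 0, len) == 0x3000  && len == 3);
    assert(utf8_decode_at(s, 3, len) == 0x10000 && len == 4);
    assert(utf8_decode_at(s, 7, len) == 0x40    && len == 1);
}
```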

A disadvantage is that matching operations are likely to be slow, since code units are translated into UTF-32 through this iterator every time they are accessed in the regular expression algorithms. Where it is possible, converting the UTF-16 input sequence into UTF-32 in advance and passing the result to u32regex and the algorithms would clearly be faster than this option.

2. Do nothing for <regex> with char16_t

char16_t resembles char32_t in name, but the characteristics of their values are very different. UTF-16, held in char16_t, resembles UTF-8 rather than UTF-32, held in char32_t, in that UTF-16 and UTF-8 are variable-width encoding schemes whereas UTF-32 is not. Therefore, it would be a realistic option to do nothing for the time being about char16_t, which requires special consideration, while adding char32_t to the group of char and wchar_t.

In this option, until a good treatment gets into the standard, users are encouraged to convert UTF-8 and UTF-16 strings into UTF-32 strings and then pass them to std::u32regex and the regular expression algorithms.
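
A minimal sketch of such an up-front conversion for UTF-16 is shown below. The function utf16_to_utf32 is a hypothetical helper, not a proposed interface, and it is assumed here that an unpaired surrogate is simply copied through unchanged; other error-handling policies are equally possible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

//  Hypothetical helper: converts a whole UTF-16 string into UTF-32 once,
//  so that the result can be searched without per-access conversion.
inline std::u32string utf16_to_utf32(const std::u16string &in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        //  Combine a lead surrogate with the trail surrogate that follows it.
        if ((cp & 0xfc00) == 0xd800 && i + 1 < in.size() && (in[i + 1] & 0xfc00) == 0xdc00)
        {
            cp = (((cp & 0x3ff) << 10) | (in[i + 1] & 0x3ff)) + 0x10000;
            ++i;
        }
        out.push_back(cp);
    }
    return out;
}

inline void utf16_to_utf32_example()
{
    const std::u16string u16text = u"\u3000\U00010000@";  //  four code units
    const std::u32string u32text = utf16_to_utf32(u16text);
    assert(u32text.size() == 3);                           //  three code points
    assert(u32text[1] == 0x10000);
}
```

The converted string costs one extra pass and additional memory, but each character is decoded only once regardless of how often the matcher backtracks over it.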

Either way, support for basic_regex<char32_t> is a precondition.

IV. Technical Specifications

1. <regex>

The following changes are proposed to support basic_regex<char32_t>:

2. <locale>

Relationship with <regex>:

Thus, the following changes are proposed for support of regex_traits<char32_t>:

Strict Option

For translate_nocase(charT c) in class regex_traits, the C++ specification says:

However, in terms of the Unicode standard, this way is not appropriate for making a character caseless (i.e., case-folding). Case Folding Stability of Unicode says that "Case folding is not the same as lowercasing, and a case-folded string is not necessarily lowercase. In particular, as of Unicode 8.0, ..., Cherokee text case folds to the existing uppercase letters."

If we strictly follow the Unicode standard, the specification in "28.7 Class template regex_traits [re.traits]" would be modified as follows:

charT regex_traits<char32_t>::translate_nocase(charT c);

5 Returns: use_facet<ctype<charT> >(getloc()).tolower(c), if charT is not char32_t.
When charT is char32_t, if CaseFolding.txt of the Unicode Character Database provides a simple (S) or common (C) case folding mapping for c, then returns the result of applying that mapping to c; otherwise returns c. When the current locale is such that tolower(U'I') should return an integer corresponding to U'ı' instead of U'i', the mappings with status T in CaseFolding.txt may be given priority.

In this case, regex_traits<char32_t>::translate_nocase() does not depend upon ctype<char32_t>::tolower(), and the proposed changes to do_toupper() and do_tolower() can be removed from this proposal.
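
The lookup described in the Strict Option can be sketched as follows. This is only an illustration under stated assumptions: fold_entry, fold_excerpt, and simple_case_fold are hypothetical names, the table is a tiny hand-copied excerpt of CaseFolding.txt that should be checked against the actual data file, and the turkic parameter stands in for whatever locale test an implementation would use.

```cpp
#include <cassert>

//  Hypothetical table entry: code point, status (C, S, or T), folded code point.
struct fold_entry { char32_t from; char status; char32_t to; };

//  A tiny excerpt for illustration; a real implementation needs the full table.
static const fold_entry fold_excerpt[] = {
    { 0x0041, 'C', 0x0061 },  //  LATIN CAPITAL LETTER A
    { 0x0049, 'C', 0x0069 },  //  LATIN CAPITAL LETTER I
    { 0x0049, 'T', 0x0131 },  //  LATIN CAPITAL LETTER I (Turkic mapping)
    { 0x13F8, 'C', 0x13F0 },  //  CHEROKEE SMALL LETTER YE
    { 0x1E9E, 'S', 0x00DF },  //  LATIN CAPITAL LETTER SHARP S
    { 0xAB70, 'C', 0x13A0 },  //  CHEROKEE SMALL LETTER A
};

//  Returns the simple (S) or common (C) case folding of c, or c itself when
//  no mapping exists. When turkic is true, a T mapping takes priority.
inline char32_t simple_case_fold(char32_t c, bool turkic = false)
{
    char32_t result = c;
    for (const fold_entry &e : fold_excerpt)
    {
        if (e.from != c)
            continue;
        if (e.status == 'T')
        {
            if (turkic)
                return e.to;
        }
        else
            result = e.to;
    }
    return result;
}

inline void simple_case_fold_example()
{
    assert(simple_case_fold(U'A') == U'a');
    assert(simple_case_fold(U'I') == U'i');
    assert(simple_case_fold(U'I', true) == 0x0131);  //  dotless i
    assert(simple_case_fold(0xAB70) == 0x13A0);      //  Cherokee folds to uppercase
    assert(simple_case_fold(0x13A0) == 0x13A0);      //  already folded
}
```

The Cherokee entries show why folding cannot be implemented through tolower(): the folded form of a Cherokee small letter is the corresponding uppercase letter.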

V. Relevant Issues

VI. References