Document Number P1844R0
Date 2019-08-04
Audience LEWGI, SG16
Reply-To Nozomu Katō
Supersedes P0169

Enhancement of regex

Table of Contents

  1. Introduction and Motivation
  2. Why new syntax option? Why not simply enhance the existing ECMAScript mode?
  3. Why RegExp of ECMAScript?
  4. Scope and Impact on the Standard
  5. Technical Specifications
  6. Relevant Matter
  7. References
  8. Appendix

Abstract

I. Introduction and Motivation

The default engine of C++'s regex is based on RegExp, regular expression objects of the ECMAScript specification third edition. Although its original RegExp had not been being modified for many years, in these years has been enhanced as follows:

Unfortunately, while C++'s regex supports six regular expression grammars, all are inferior to other languages' regular expression features in richness of available expressions nowadays. In addition, the problem of how to support Unicode is still unsettled. To resolve these problems, or at least to improve the current situation, this paper proposes adding the new syntax option, ECMAScript2019 to regex.

II. Why new syntax option? Why not simply enhance the existing ECMAScript engine?

Short answer: The ECMAScript engine of C++ has been modified to depend on the locale deeply. The author of this proposal wants to release regex from the locale and to revert it to being locale independent as its original RegExp, then to add new features on it. But doing so involves deprecating some features of the default engine and seems to be difficult.

Long answer: RegExp of ECMAScript is locale independent and treats an input sequence as UTF-16. For example, /[a-z]/ is always interpreted "any character in the range from U+0061 to U+007A inclusive". This has the benefit of allowing that the set of characters that some character class expression matches can be settled even at compile time. This is true even if the icase flag is set, by preparing a reversed case-folding table (while case-folding means converting each of "S", "s", and U"ſ" to "s", reversed case-folding here means returning the set of characters that are converted into the same character when case-folding is done, such as converting each of "S", "s", and U"ſ" to U"Ssſ".)

However, the ECMAScript engine of regex can be modified as being locale sensitive by setting the collate flag. If this flag is used with the icase flag, the pattern compiler is required to call std::tolower and std::toupper per character in one character class [] for gathering all characters that the character class can match. This clumsiness was filed as Defect Report 523 in 2005 and is still Status Open.

Furthermore, the ECMAScript engine was modified to support the POSIX character class and to change the expressions \d, \D, \s, \S, \w, and \W to be equivalent to some of the POSIX character class. This also made the ECMAScript engine depend on the locale.

Although this paper is not intended to propose making regex constexpr, if there is any ECMAScript based engine that inherits the nature of being locale independent from its original RegExp, it might make constexpr support easier than now.

The modifications defined in [re.grammar] also caused DR2986, DR2987 and they are still unresolved. To fix these issues originating in the modifications, I considered to propose deprecating all the modifications and reverting the ECMAScript engine to its original specification of RegExp, before adding something new on it. But the ECMAScript engine is the default one of regex. Trying to deprecate some features of it seems to increase difficulty in getting the proposal accepted.

Moreover, while RegExp supports UTF-16 only, C++'s regex needs to support various character sets. It is a difficult question how Unicode property escapes should be processed in legacy (non-Unicode based) character sets.

Thus, this paper does not touch any part of the current ECMAScript engine and leaves it as is for legacy character sets that use char and wchar_t; and instead propose introducing a new syntax option being compliant to a recent version of ECMAScript keeping the nature of being locale independent, for use with char8_t, char16_t, or char32_t through the template specialization.

III. Why RegExp of ECMAScript?

Because the ECMAScript specification explains the behavior of RegExp through defining closures in detail, there seems to be less room for any undefined or undocumented behaviour to appear. It would be an important factor to the standard.

IV. Scope and Impact on the Standard

The new syntax is implemented through the template specialization for char8_t, char16_t, and char32_t. As of C++20, <regex> supports only char and wchar_t and compiling basic_regex with the other types leads to compile time error by the reasons explained in P0169. So, the existing implementations would not be affected by this proposal except regex_iterator.

basic_regex

When the first template parameter charT of the basic_regex class is char8_t, char16_t, or char32_t, its constructors and assign functions are required to interpret an input sequence as UTF-8, UTF-16, or UTF-32 respectively. In addition, syntax_option_types other than icase, nosubs, optimize, and multiline are disabled. An input sequence is always interpreted assuming that ECMAScript2019 is set.

These three specializations are not required necessarily to be implemented separately. It is expected that typical implementations use an internal iterator class template that translates a sequence of UTF-8, UTF-16, or UTF-32 to a sequence of Unicode code points and construct a finite state machine by parsing that translated sequence in their common base class.

One obstacle to implementing in such a way is the expression \xHH where H is a hexadecimal digit. In the spec of ECMAScript this expression is defined to represent a code unit. However, appearance of an isolated code unit in a UTF-8 sequence requires special treatment, because unlike in UTF-16 and UTF-32, an isolated code unit in UTF-8 cannot be converted to any code point when its value is in 0x80-0xff inclusive. (This does not matter in ECMAScript, since it supports UTF-16 only.)

To simplify things, in the proposed ECMAScript2019 syntax, the expression \xHH is changed to represent a code point value. This is the one and only modification to the spec of ECMAScript.

(If the committee does not like the inconsistency with the meaning of C++'s own \xHH, the previous paragraph will be changed to "... use of the expression \xHH is disabled". RegExp has \uHHHH and \u{H...} for representing a code point. Removing support for \xHH is not likely to cause inconvenience.)

regex_traits

basic_regex<char8_t>, basic_regex<char16_t>, and basic_regex<char32_t> must be locale independent. It means that these specializations have to construct an internal finite state machine only based on the Unicode code point values that are translated from the input sequence and may not use regex_traits.

Thus, regex_traits does not need to be specialized for char8_t, char16_t, and char32_t.

Particularly, when basic_regex is used with char8_t, char16_t, or char32_t, case-folding may not be performed using regex_traits<charT>::translate_nocase(), but performed as defined in the ECMAScript specification.

Algorithm functions (regex_match, regex_search, and regex_replace)

As these take an instance of the basic_regex class as a parameter, they get charT as a template parameter. So, they also can be implemented in a way similar to basic_regex using the template specialization.

regex_match and regex_search

The proposed ECMAScrip2019 syntax supports lookbehind assertions, which none of the existing six engines of C++'s regex has support for. When performing a lookbehind assertion, the algorithm function reads the input sequence backwards. This raises a new issue.

When a user wants to find all the portions that some regular expression matches against some input sequence, the user will call regex_search against the subrange [the end of the previous matched portion of the entire sequence, the end of the entire sequence) while the previous call succeeds. In this case, the subrange [the beginning of the entire sequence, the end of the previous matched portion of the entire sequence) is also a valid range and it is safe for an algorithm function to read a character in that subrange.

However, there is no way at this time to inform an algorithm function about the limit until where it can read backwards for lookbehind.

For example, when a user calls regex_search with the regular expression /(?<!\d{2,})ABC123/ ("ABC123" not preceded by two or more digit characters) against "ABC123ABC1234ABC12345", only the first six characters should be matched, because the second and later "ABC123"s are all preceded by two or more digits. But if there is no way to tell regex_search about until where it can look-behind, the second and later call to regex_search against [end of previous matched subsequence, end of entire sequence) also return "abc123".

The match_prev_avail flag is not suitable for this purpose. It only indicates that the preceding one point is a valid iterator position.

As of C++20, both regex_match and regex_search take as an input sequence 1) two bidirectional iterators that specify [begin, end), 2) const reference to an instance of std::basic_string, or 3) a pointer to a null-terminated string. To fix the problem mentioned above, this paper proposes adding new overload functions that take three bidirectional iterators, which specify [begin, end) and the limit of lookbehind, to both regex_match and regex_search.

This addition is useful only when an algorithm function is used with an instance of basic_regex constructed with char8_t, char16_t, or char32_t. If any variant of algorithm functions that takes three bidirectional iterators is called when its charT is char or wchar_t, the third iterator for specifying the limit of lookbehind is simply ignored.

No specialization is proposed for regex_replace. It does not do matching by itself but uses regex_iterator that calls regex_search internally.

match_results

It is preferable that 1) the new function group() and its overload functions be added to match_results for access to captured sequences by group name, and 2) the member function format() be modified to support the replacement text symbol $<GROUPNAME> that was introduced in ES2018.

However, as match_results does not take charT as a template parameter, it is not easy to implement something specific to the proposed ECMAScript2019 syntax option through the template specialization.

Thus, in this proposal, the new member function gname_to_gnumber() and its overload functions that convert a group name to the group number assigned with it, are added instead into basic_regex specialized for char8_t, char16_t, and char32_t.

regex_iterator

regex_iterator is changed to use one variant of regex_search that takes three bidirectional iterators mentioned above, instead of the current one that takes two bidirectional iterators. In this proposal this is the only proposed change that requires modifications to the existing implementations.

Whether this change breaks ABI is likely to depend on the efficiency of the compiler. Because non-specialized regex_search that takes three bidirectional iterators only calls a variant that takes two bidirectional iterators, regex_iterator ends up calling the variant that it calls at present indirectly when the type of its second template parameter is not char8_t, char16_t, or char32_t. So, it is possible that the output of the compiler is unchanged particularly when optimization is enabled.

This change can be avoided by providing six template specializations for const char(8|16|32)_t * and std::u(8|16|32)string::const_iterator. But the author of this proposal thinks that regex_iterator and what depends on it, namely regex_token_iterator and regex_replace, are supplements to rather than the core of regex and so in this case the simplest way to implement is preferable. Thus, the "modifying" solution is proposed.

The link to a sample implementation based on the method above is shown in the Appendix section of this document.

V. Technical Specifications

1. <regex>

The following changes are proposed:

Others

The undated version of the ECMAScript Specification is added to references.

VI. Relevant Matter

VII. References

VIII. Appendix

This example can be used with char8_t, char16_t, char32_t only. char and wchar_t versions are not implemented. If basic_regex or an algorithm function is used with a type other than char8_t, char16_t, and char32_t, assert(0) is called. It demonstrates that adding a new syntax option for char8_t, char16_t, and char32_t through the template specialization is a real option.

All the classes and algorithms are declared in namespace regex_proposal, instead of std.

This proposal needs a presenter. If you like this proposal and can attend face-to-face committee meetings of C++, please contact me.