
ECMAScript 3 Regular Expressions: A specification that doesn’t make sense

November 26th, 2007 by liorean

So, what was that “A quick JS quizz for anybody who think they know regex” thing that I posted all about? It was about a flaw: a nonsensical behaviour that the ECMAScript specification makes standard. Let me expand a bit on what ECMAScript 3 does wrong:

Capturing submatches in regular expressions are used in at least five ways. First of all they are used together with backreferences in the regular expression. These backreferences match not the pattern used for the capture, but the actual matched value. Second, they are used in backreferences from a string replacement in a substitution string. Third, they are used in string replacement as arguments to a substitution function. Fourth, they are used as parts of the result array from executing the regex on a string. Fifth, they are used as inclusions when splitting a string based on a regex.
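
The five uses can be sketched in JavaScript roughly as follows. This is only an illustration: the sample strings and variable names are made up for the example, and the results in the comments are what an engine following the specification should give.

```javascript
// 1. Backreference inside the pattern: \1 matches the text that group 1
//    actually captured, not the pattern \w again.
const doubled = /(\w)\1/.exec("hello")[0];

// 2. Backreference from a substitution string: $1 and $2 name the captures.
const swapped = "John Smith".replace(/(\w+) (\w+)/, "$2, $1");

// 3. Captures passed as arguments to a substitution function.
const camel = "a-b".replace(/-(\w)/, function (whole, letter) {
  return letter.toUpperCase();
});

// 4. Captures as elements of the result array from exec.
const range = /(\d+)-(\d+)/.exec("10-20");

// 5. Captures included in the pieces returned by split.
const pieces = "a1b2c".split(/(\d)/);

console.log(doubled);            // "ll"
console.log(swapped);            // "Smith, John"
console.log(camel);              // "aB"
console.log(range[1], range[2]); // "10" "20"
console.log(pieces.join(","));   // "a,1,b,2,c"
```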

Now, I have three issues with how the ECMAScript 3 algorithms handle captures:

  • In the ECMAScript 3 regular expression algorithms, a backreference to a capture that has not yet participated in the match will match the empty string. But conceptually, a backreference is intended to match the value of a capturing submatch. If the capturing submatch has not yet participated in the regex matching, that capture cannot possibly contain a value. If the capture cannot possibly contain a value, it shouldn’t match any value either, not even the empty string, since the empty string is in fact a possible value for a capture. What the ECMAScript 3 algorithm actually does is set the capture to undefined, which is fine by me. What’s broken is that an undefined capture, when backreferenced, matches the empty string and so always matches. This really should fail.
  • Similarly to the last point, a capture that fails will, in the ECMAScript 3 algorithm, be set to undefined. That’s all right by me. However, because a backreference to undefined is treated as a match against the empty string, this behaviour is broken. A backreference to a capture that failed to match should never in turn match, but according to these algorithms it will always match.
  • Finally, in the repetition section of the ECMAScript 3 algorithms, all captures within a repetition are reset to undefined each time through the repetition. This doesn’t make sense - why does the algorithm reset a capture at all? If the capture matched, it matched. If a capture participated, it either failed or got set to a new value. That match should work in a backreference; whether or not the capturing submatch happens to be inside a repetition shouldn’t matter. Anything else defies logic. Even worse, the backreference will match something (the empty string) instead of what it should match (the last captured value for that capturing submatch).
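
The three issues can be reproduced with small patterns like the ones below. These are hypothetical test cases written for illustration, and the results in the comments assume an engine that follows the ECMAScript 3 algorithms; an implementation that deviates from the specification may answer differently.

```javascript
// Issue 1: \1 is reached before group 1 has participated. The capture is
// undefined, so \1 matches the empty string and the overall match succeeds.
const notYet = /^\1(a)$/.test("a");

// Issue 2: (a) is tried and fails, then the "b" alternative is taken, so
// group 1 stays undefined. \1 again matches the empty string instead of
// failing.
const failed = /^(?:(a)|b)\1$/.exec("b");

// Issue 3: group 1 captures "a" on the first pass through the repetition,
// but is reset to undefined at the start of the second pass. The \1 after
// "b" therefore matches the empty string rather than "a".
const reset = /^(?:(a)|b\1)+$/.exec("ab");

console.log(notYet);               // true
console.log(failed[0], failed[1]); // "b" undefined
console.log(reset[0], reset[1]);   // "ab" undefined
```

With the behaviour argued for here, the first two patterns would fail to match, and the third would fail on "ab" because \1 would still hold "a" from the first pass.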

The solution to these problems can be summed up in two changes: First, a backreference to an undefined capture shouldn’t match the empty string; it should fail. Second, captures should only be set to undefined in two cases: when regex matching starts, and inside a negative lookahead.

The reason I am talking about this? Why, because if nothing changes the ES4 group will be propagating this mistake into ECMAScript 4. And really, it’s a faulty specification - what ECMAScript 3 says simply does not make sense. It’s unintuitive and defies both logic and programmer expectations. Also, this is a safe fix - neither of the two largest ECMAScript implementations, JScript in IE and AVM2 in Flash, behaves according to the ECMAScript 3 specification. Apple’s JavaScriptCore also behaves differently from the specification. This means there cannot possibly be very much code out there that relies on this behaviour.

Another thing to note: Perl 5, PCRE (which by the way is the regex engine used by AVM2/Tamarin and JavaScriptCore), Python, Ruby, Java, .NET, JGsoft, Tcl ARE and POSIX BRE all agree with me on this point. ECMAScript 3 is simply flawed, and ECMAScript 4 should fix this.

Finally, let me end by saying this: This is not just a theoretical issue. There is real world use for these features. One example is emulating capture-based conditionals.
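
As a rough sketch of such an emulation: some regex flavours offer a conditional like (?(1)c|d), meaning “match c if group 1 participated, otherwise d”. The pattern below is a hypothetical illustration, not taken from any real code base. It tries to emulate that conditional with an empty capturing group and a backreference, and is meant to accept only "ac" and "bd". It relies on \1 failing when group 1 did not participate, so under the ECMAScript 3 rules it gives the wrong answers.

```javascript
// Intended meaning: "a" followed by "c", or "b" followed by "d".
// The empty group () records that the "a" branch was taken.
//   \1c      is meant to apply only when group 1 participated
//   (?!\1)d  is meant to apply only when group 1 did not participate
const conditional = /^(?:a()|b)(?:\1c|(?!\1)d)$/;

const results = ["ac", "ad", "bc", "bd"].map(function (s) {
  return s + ":" + conditional.test(s);
});

// Under the ECMAScript 3 rules \1 always matches the empty string, so
// "bc" is wrongly accepted and "bd" is wrongly rejected.
console.log(results.join(" ")); // ac:true ad:false bc:true bd:false
```

In a flavour where a backreference to a non-participating group fails, the same pattern would accept "ac" and "bd" and reject the other two.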

Posted in Javascript, Regex


One Response to “ECMAScript 3 Regular Expressions: A specification that doesn’t make sense”

  1. Steven Levithan Says:
    November 26th, 2007 at 12:36 pm

    Thanks for fighting the good fight, liorean. These spec bugs need to be fixed. However, I disagree that both changes are safe. The case of making backreferences to non-participating capturing groups match the empty string rather than fail to match (as ES3 prescribes) is respected by the latest versions of Firefox, IE, and Opera. Safari 3.0.3 beta and lower handle it the more sensible, Perl-compatible way, but this will change in later versions of Safari (see http://bugs.webkit.org/show_bug.cgi?id=14931 ). Unfortunately, some JavaScripters who are not deeply familiar with regexes tend to stumble around when constructing a regex until something works, without necessarily understanding why. It is these cases that have the potential to break if ES4 fixes the problem. However, it is likely only these cases, since the ES3 handling is never actually intuitive or desirable.
