Okay, so I’ve been diving into regex again for a while. And as always when diving into something way too deep, you notice that things don’t work as you thought they would. So here I’m going to give you some code samples, and simply ask: What’s the answer?
Of course, there’s a catch: before you even think of trying it out in your favourite browser, or reading up on it, try to reason about it yourself and give an answer. Browsers don’t handle these things consistently anyway…
var
str='aaaab',
re=/(a\1)+|b/;
re.exec(str); // So, what’s the answer?
var
str='aaba',
re=/(?:(a)(b)?)+/;
re.exec(str); // So, what’s the answer?
var
str='abba',
re=/(?:(\2|a)(b)?)+/;
re.exec(str); // So, what’s the answer?
So, tell me, what do you think the *correct* answer for each case should be? That is, what should it be according to you, not according to the ECMAScript 3 specification or any single browser implementation. What answer makes most sense?
11 Responses to “A quick JS quizz for anybody who think they know regex”
September 5th, 2007 at 7:25 pm
I’d be surprised if recursive backreferences or forward backreferences (as you have in examples 1 and 3) are defined or allowed by the specs.
September 6th, 2007 at 3:06 am
Example 01 should only match “b” because you are not supposed to be able to use a backreference inside itself, so the regex should just ignore (a\1) as nonsensical and only match “b”.
Example 02 is, I think, effectively the same as ((a)(b)?)+, which will match the whole string. ?: just gets rid of the backreference on the outer parenthesis, and in this case that’s pointless. Since (b)? is optional, what should happen is that the engine should consume “a” then “ab” then “a” and then match the whole string.
Like j, I’d be surprised if Example 3 works but it might. The reference \2 is not self-contained like in Example 1, and you can use references during the match, but I don’t know if it will match when it comes before the thing it is referencing. The ?: should get rid of the reference to the outer group, leaving the references to the two groups within. So the expression would simplify to ((b|a)(b)?)+, which should match the whole string; first “ab” then “b” then “a”.
September 6th, 2007 at 10:38 am
josephdietrich: Well, let’s say that Example 02 works just splendidly (…it does, in all browsers). What do you expect the values to be for the first and the second captured submatch, respectively? Remember that the captured submatches are included in the return value.
September 6th, 2007 at 1:07 pm
I’d expect the submatches for example 2 to be ‘a’ and ‘b’ respectively, as that’s what the groups contain. If they return anything else, I’d suspect a broken regex engine in that browser.
September 6th, 2007 at 3:17 pm
Without testing it I really can’t answer that question, since I am hardly a regex guru. My first instinct would be to say “what j said.” I don’t think that using the ?: option on the outer parenthesis invalidates any references to the inner parenthesis. So if you were to try and do a replacement with \1\2 you’d just get “ab.”
But maybe I’m not understanding your question correctly …
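The point about ?: only removing the outer group from the numbering can be checked without touching the quiz strings at all. A minimal sketch, assuming a helper called countGroups (a made-up name, not something from the post or from any library): it adds an empty alternative to the pattern so that it always matches the empty string, then counts the slots in the result array.

```javascript
// Hypothetical helper: count the capturing groups in a regex.
// Appending "|" adds an empty alternative, so exec('') always succeeds,
// and the result array has one slot per capturing group plus slot 0.
function countGroups(re) {
  return new RegExp(re.source + '|').exec('').length - 1;
}

console.log(countGroups(/((a)(b)?)+/));   // 3: outer group, (a), (b)
console.log(countGroups(/(?:(a)(b)?)+/)); // 2: only (a) and (b) are numbered
```

So with the non-capturing outer group, (a) is group 1 and (b) is group 2. What those two groups hold after the match is a separate question, and that is the one being asked.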
September 6th, 2007 at 3:40 pm
Well then. I’m not saying you’re not allowed to try them in browsers. I’m just asking you to reason about what the result should be before you actually try it. Actually, since you’ve already done your thinking, try them out in each of iew, saf and either of op or moz (these two behave identically in this respect), and you’ll find noticeable differences.
Oh, and just to avoid ruining it for others, please refrain from posting browser results in the comments.
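For anyone who has done their thinking and wants to run all three cases in one go, here is a rough sketch of a harness. The names cases, show and runQuiz are just illustrative. It prints each result so that undefined, null and the empty string stay distinguishable, which is where the differences tend to hide. No expected output is given, on purpose.

```javascript
// The three quiz cases, exactly as given in the post.
var cases = [
  { re: /(a\1)+|b/,        str: 'aaaab' },
  { re: /(?:(a)(b)?)+/,    str: 'aaba'  },
  { re: /(?:(\2|a)(b)?)+/, str: 'abba'  }
];

// Render a match array so that undefined and '' do not look alike.
function show(match) {
  if (match === null) return 'null (no match)';
  var parts = [];
  for (var i = 0; i < match.length; i++) {
    // Strings are quoted; anything else (undefined, null) is printed bare.
    parts.push(typeof match[i] === 'string' ? JSON.stringify(match[i]) : String(match[i]));
  }
  return '[' + parts.join(', ') + ']';
}

// Run every case and return one printable line per case.
function runQuiz() {
  var lines = [];
  for (var i = 0; i < cases.length; i++) {
    var c = cases[i];
    lines.push(c.re + '.exec(' + JSON.stringify(c.str) + ') -> ' + show(c.re.exec(c.str)));
  }
  return lines;
}

console.log(runQuiz().join('\n'));
```

Slot 0 of each printed array is the overall match; the remaining slots are the captured submatches in group order.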
September 6th, 2007 at 3:45 pm
I’ll try that tomorrow morning then. Your cryptic response is intriguing. ;-)
September 7th, 2007 at 10:54 am
Well, I tried it out and I admit I’m kind of surprised. Interesting. The results of example 01 really surprised me, because I thought the rule on that was clear.
But then I ran these through a Perl script as well and didn’t get what I expected, so obviously my grasp of regular expressions needs to be improved.
October 24th, 2007 at 2:49 am
I think I know regexes pretty well. I especially know JavaScript regexes, and can tell you what ES3 and the big four browsers will return for your examples without looking it up or testing. However, since you’re asking what users *think* should be returned, here’s what I’d wish for / expect intuitively:
Example 01: /(a\1)+|b/.exec('aaaab');
This should match within the string only one time (“b”), and the result for $1 should be undefined (null would also be acceptable), rather than an empty string. Treating non-participating capturing groups as null or undefined rather than an empty string just seems more logical, is typical behavior with most other regex flavors, and allows for some fancy regex tricks which are not otherwise possible when conditionals are not available.
Example 02: /(?:(a)(b)?)+/.exec('aaba');
This should match the entire string. $1 should be the very last “a”, and $2 should be “b”. It doesn’t make sense to me to replace the value of backreferences for capturing groups which previously participated in the match with an empty string unless the group re-participates.
Example 03: /(?:(\2|a)(b)?)+/.exec('abba');
This should not match anywhere in the string (see my take on Example 01).
@liorean, so what do *you* think they should match?
October 24th, 2007 at 3:09 am
IMO, example 02 is a particularly sneaky/good example. It may not be obvious why I answered the way I did unless you consider less tricky examples like /(?:(a)|(b))+/.exec("ba"), which follows exactly the same principle.
If you changed it from /(?:(a)(b)?)+/.exec('aaba') to /(?:(a)(b?))+/.exec('aaba') then I would expect $1 to still be the last “a”, but $2 to be “” instead of “b”. It’s all about group participation. For anyone curious about why browsers handle this the way they do, have a look at this post of mine from a while ago: http://blog.stevenlevithan.com/archives/npcg-javascript .
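The difference between (b)? and (b?) is easier to see without the repetition around it. A small sketch, assuming an engine that reports a non-participating group as undefined (which is what current engines do, and not what every 2007 browser did); participated is a made-up helper name:

```javascript
// Quantifier outside the parentheses: the group itself is optional and
// takes no part in the match when there is no "b".
var outside = /a(b)?/.exec('a');

// Quantifier inside the parentheses: the group always takes part, but may
// capture the empty string.
var inside = /a(b?)/.exec('a');

// Hypothetical helper: did group n take part in the match?
// Assumes the engine reports non-participating groups as undefined.
function participated(match, n) {
  return match !== null && match[n] !== undefined;
}

console.log(participated(outside, 1)); // false: group 1 never matched
console.log(participated(inside, 1));  // true: group 1 matched ''
```

What should happen to a group that participated in an earlier pass through a repeated group but not in the last one is exactly what example 02 asks, so that case is deliberately left out here.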
January 25th, 2008 at 1:46 am
Don’t think lookbehinds (backreferences, if I’m not a complete idiot) are even allowed in JavaScript (had to do a little bit of RegEx this week, after a while of not messing with it).
Of course, I just used a RegEx tester, and examples of the string to build mine up, bit by bit, until I couldn’t break the RegEx equation when I changed my string. But when I was looking up the style of RegEx JS uses (a limited form of PERL Syntax), I remember reading about this limitation.