Regex Tutorial in 25 Minutes
Regex stands for regular expressions. They are used for searching for text patterns in a string. Implementations are available in many languages, such as Java and JavaScript (which we are interested in).
The simplest way to test and learn is to use NoteTab Light, a freeware Notepad-like utility. The search tool in the utility supports regular expression search. Let us see how the regex engine works. Consider a simple regex, q. This is a literal, and every character in a regex is called a token, so there is only one token in this regex. The engine tries to match the q against the characters in the string one position at a time; as long as it doesn't match, it keeps moving forward. This is the basics of a regex engine. Another important thing is the backtracking ability of the regex engine.
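To see the "keep moving forward" behaviour in JavaScript, here is a tiny sketch. The sample strings and variable names are my own, just for illustration.

```javascript
// The regex q has a single literal token. The engine tries it at index 0,
// fails on "t", moves one character forward, and repeats until it hits "q".
const text = "the quick fox";
const firstQ = /q/.exec(text);
console.log(firstQ.index); // 4, the position where the match was found

// If no position works, exec returns null.
const noQ = /q/.exec("no match here");
console.log(noQ); // null
```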
Learning regex starts with learning about 11 special characters. I mean characters in the real sense (pun intended). These 11 special characters, which help in matching the string pattern, fall under 5 categories: character matchers, alternator, position matchers, repeaters and groupers.
Character matchers…
1. Character class/set []. The regex will try to match any one character within the square brackets. For example, [a-zA-Z_0-9] will match any letter, digit or underscore. There is always a short form for everything in this world, and for this character set you can write \w –> stands for [a-zA-Z_0-9] (it is called the word matcher, and it is weird to see that a word can contain digits as well), \d –> stands for [0-9], and \s –> whitespace (which includes the space, \t, \r and \n). The capitals of all these shortcuts stand for their opposites… very cute. \W matches any character other than a word character, and \S matches anything other than whitespace. So think… a class that combines a shortcut with its opposite, such as [\s\S] or [\w\W], matches any character at all.
The characters in the set can be negated, which means you want to match anything other than those characters. This is achieved with a caret symbol “^” placed right after the opening bracket, e.g. [^0-9]. There are always exceptions to anything. The exception here relates to the characters you put within the [] square brackets: anything you put within them is treated as a normal character except the following 4 characters: ], ^, -, \. To search for them you have to escape them with a backslash. The other metacharacters you learn after this lose their special meaning inside the square brackets and are treated as normal characters there.
2. The dot is a metacharacter and it matches any character other than a newline. It is effectively a negation of the newline character: if you want to check whether something is not a newline character, the simplest way is to match it with “.”; if the match fails it is a newline character, otherwise it is some other character.
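A quick sketch of the character matchers in JavaScript. The sample strings are made up for illustration; the patterns are the ones discussed above.

```javascript
const sample = "Room_42 is\tfree!";

// \w+ grabs runs of word characters (letters, digits, underscore).
const words = sample.match(/\w+/g);
console.log(words); // ["Room_42", "is", "free"]

// \d matches single digits.
const digits = sample.match(/\d/g);
console.log(digits); // ["4", "2"]

// A caret right after "[" negates the set: everything except vowels.
const nonVowels = "regex".match(/[^aeiou]/g);
console.log(nonVowels); // ["r", "g", "x"]

// [\s\S] matches every character, while the dot skips the newline.
const anyCount = "a\nb".match(/[\s\S]/g).length;
const dotCount = "a\nb".match(/./g).length;
console.log(anyCount, dotCount); // 3 2
```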
Grouper.
By placing a part of a regex inside parentheses we can apply a regex operator to the whole group. This also creates a backreference, which means that when a match is found the text matched by the first group is stored in \1, the second group in \2, and so on till \9. So if the later part of your regex depends on what the first part found, this helps. For example, if you want to search for a repeated word like “the the”, the regex (\w+)\s+\1 matches any repeated word with whitespace in between. If you don’t want a backreference, use the syntax ?: inside the parentheses before the expression, as in (?:abc). This can improve performance, since the engine does not have to store the group.
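Here is a small JavaScript sketch of a backreference and a non-capturing group. I have added \b on both sides so that only whole words count; that is my own addition, and the sample sentence is made up.

```javascript
// (\w+) captures a word, \s+ skips the whitespace, \1 demands the same word again.
const doubledWord = /\b(\w+)\s+\1\b/;
const found = doubledWord.exec("Paris in the the spring");
console.log(found[0]); // "the the"
console.log(found[1]); // "the" (what the first group captured)

// (?:...) groups without capturing, so the first numbered group is (c).
const grouped = /(?:ab)+(c)/.exec("ababc");
console.log(grouped[0]); // "ababc"
console.log(grouped[1]); // "c"
```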
Alternator.
1. It is the boolean “or” operator, written with the pipe “|” symbol. The string is matched against one of the patterns separated by the |. For example, license|lisense|lisence will match license, lisense or lisence.
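The alternation example can be tried out like this. The test sentence is my own.

```javascript
// Each alternative is tried in turn at every position.
const spelling = /license|lisense|lisence/g;
const hits = "license, lisense and lisence; not licence".match(spelling);
console.log(hits); // ["license", "lisense", "lisence"]
// "licence" is not one of the alternatives, so it is not reported.
```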
Position matcher.
^ and $. They match the beginning position and the end position of the string. In multi-line mode, ^ also matches the position right after each line break, i.e. the start of each line, and $ also matches the position right before each line break, i.e. the end of each line. \b is a position matcher as well. It matches between a word character and a non-word character (in either order), and at the start or end of the string when the first or last character is a word character. A position matcher does not move the engine forward. It just tries to match the position.
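A short sketch of the position matchers in JavaScript, where multi-line mode is switched on with the m flag. The sample text is invented for the demo.

```javascript
const lines = "cat\ncatalog\nbobcat";

// Without the m flag, ^ only matches at the very start of the string.
const startOnly = lines.match(/^cat/g);
console.log(startOnly); // ["cat"]

// With the m flag, ^ and $ also match at every line start and line end.
const lineStarts = lines.match(/^cat/gm);
const lineEnds = lines.match(/cat$/gm);
console.log(lineStarts.length); // 2 (lines 1 and 2 start with cat)
console.log(lineEnds.length);   // 2 (lines 1 and 3 end with cat)

// \b only checks the position, so this finds cat as a whole word.
const wholeWord = "cat catalog bobcat".match(/\bcat\b/g);
console.log(wholeWord); // ["cat"]
```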
Repeater: This is a very interesting set of metacharacter behaviours. Some of them are greedy, some lazy, some possessive. The metacharacters are *, + and ?.
Greediness: The *, + and ? are greedy. They match as many characters as they can and stop only at the end of the string or line, or whatever the maximum is that they can match. They basically overeat / overmatch, but they are willing to give some of it up by backtracking if that is required for the overall match. For example, the regex a will not match bc, but a* will match because it only has to match 0 or more times (it matches the empty string). a? will also match because it has to match 0 or 1 times, but a+ will not match because it has to match 1 or more times.
Lazy: The ?. When you add ? after a token, the token becomes optional. If you add it after a repeat operator, the operator becomes lazy. Take the previous scenario: a*? will try to match the least number of times, and the least number of times for * is 0. So it first matches nothing and moves ahead; only if the rest of the regex fails does it come back and match one more a, i.e. it backtracks. Similarly, a+? tries to match the minimum number of times, that is 1 time, and moves ahead.
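The difference between greedy and lazy is easiest to see side by side. This is just a sketch; the HTML-like sample string is my own.

```javascript
const html = "<b>bold</b>";

// Greedy: .+ eats to the end, then backtracks to the LAST ">".
const greedy = html.match(/<.+>/)[0];
console.log(greedy); // "<b>bold</b>"

// Lazy: .+? takes one character at a time and stops at the FIRST ">".
const lazy = html.match(/<.+?>/)[0];
console.log(lazy); // "<b>"

// The a, a*, a? and a+ cases against "bc" from the greediness paragraph.
const starMatch = /a*/.exec("bc");
console.log(starMatch[0] === ""); // true, zero repetitions is a valid match
const optionalOk = /a?/.test("bc");
const plusOk = /a+/.test("bc");
console.log(optionalOk, plusOk); // true false
```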
Possessive: The + added to an operator. Normally even the greediest people are willing to backtrack in case of failure. But if you are possessive and greedy there is no backtracking. The way to make an operator possessive is to add a plus sign, e.g. a*+. This matches as much as possible and will not give up / backtrack any of its match to make the search successful. This can make the engine perform better, since it doesn’t have to remember any backtracking positions. Possessive quantifiers are supported in Java, but not in JavaScript.
Eager: Normally all regex engines are eager, which means they return the first match they find and do not try to better it. Though I say the first match, the engine will always backtrack if that is needed to find a proper match. For example, take the case of a(bc|b)c matched against the string abc. It first matches a, then matches bc with the bc alternative. When it comes out of the group it remembers that there is a path yet to be tried. Now it comes to c. That fails since there are no more characters in the string to match, so it goes back, gives up the match on bc and takes the other alternative, b. It then comes out of the group and matches c. This is how the engine works.
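The walk-through above can be checked in JavaScript by looking at what the group ends up capturing. The second example (cat|category) is my own addition to show eagerness.

```javascript
// The group first tries bc, the final c then fails, so the engine
// backtracks into the group and settles for the b alternative.
const walk = /a(bc|b)c/.exec("abc");
console.log(walk[0]); // "abc"
console.log(walk[1]); // "b", not "bc"

// Eager: the first alternative that works wins, even if a later one is longer.
const eager = /cat|category/.exec("category");
console.log(eager[0]); // "cat"
```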
If you don’t want to use *, + and ?, you can use {} with numbers inside. ? is the same as {0,1}, * is {0,} (0 to any number) and + is {1,} (1 to any number). If you want a defined range, use e.g. {1,5}, which means a minimum of 1 and a maximum of 5 repetitions. If you specify only one number, as in {3}, it means exactly that many repetitions.
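A few brace quantifiers in action. The ^ and $ anchors are there so the whole string has to fit; the sample inputs are invented.

```javascript
// {1,5}: between 1 and 5 digits.
const oneToFive = /^\d{1,5}$/;
console.log(oneToFive.test("12345"));  // true
console.log(oneToFive.test("123456")); // false, one digit too many
console.log(oneToFive.test(""));       // false, needs at least 1

// {3}: exactly 3 digits.
const exactlyThree = /^\d{3}$/;
console.log(exactlyThree.test("123"));  // true
console.log(exactlyThree.test("1234")); // false

// {0,1} behaves like ?, so the u is optional.
const optionalU = /^colou{0,1}r$/;
console.log(optionalU.test("color"), optionalU.test("colour")); // true true
```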
Atomic group: The syntax for this is (?>group). This is related to the backtracking of the engine and its performance. As soon as the engine exits the group, it gives up all the backtracking options inside it. For example, a(?>bc|b)c would not match abc, since the engine cannot backtrack after matching bc in the group. Atomic groups are supported in Java, but not in JavaScript regexes.
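JavaScript has no (?>group) syntax, so the example cannot be run there as written. The sketch below is an emulation, not the real thing: it relies on the fact that a lookahead is never re-entered once it has succeeded, captures the alternative inside a lookahead, and then consumes it with \1. Treat the pattern as my assumption of an equivalent.

```javascript
// Ordinary group: backtracks from bc to b, so abc matches.
const plainGroup = /a(bc|b)c/;
console.log(plainGroup.test("abc")); // true

// Emulated atomic group: (?=(bc|b)) picks bc and locks it in,
// \1 consumes it, and the final c finds nothing left to match.
const atomicLike = /a(?=(bc|b))\1c/;
console.log(atomicLike.test("abc"));  // false, no backtracking into the group
console.log(atomicLike.test("abcc")); // true, bc then c
```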
Look ahead / look behind / look around: These are also called assertions, which means the engine tries to match the expression but does not consume the characters. There are 2 types of lookahead, positive and negative. (?!u) is negative: the next character must not be u, and it is also fine if there is no next character at all. (?=u) is positive: the next character has to be u, so it fails if there is no next character. The match from the lookahead is discarded, so the engine steps back to the position where the lookahead started and tries to match the next token in the regex from there.
The syntax for lookbehind is (?<!a)b for negative, which means a ‘b’ not preceded by a, and (?<=a)b for positive, which means a b preceded by a.
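Here is a sketch of all four assertions in JavaScript. Lookbehind needs a reasonably modern engine (ES2018 or later). The positionsOf helper and the sample strings are my own, hypothetical names for the demo.

```javascript
// Hypothetical helper: the start index of every match of a global regex.
function positionsOf(re, str) {
  return [...str.matchAll(re)].map((m) => m.index);
}

const qText = "quit qatar iraq";
// Positive lookahead: q only when the next character is u.
const qThenU = positionsOf(/q(?=u)/g, qText);
console.log(qThenU); // [0]
// Negative lookahead: q not followed by u (end of string counts too).
const qNotU = positionsOf(/q(?!u)/g, qText);
console.log(qNotU); // [5, 14]

const bText = "ab cb b";
// Positive lookbehind: b preceded by a.
const bAfterA = positionsOf(/(?<=a)b/g, bText);
console.log(bAfterA); // [1]
// Negative lookbehind: b not preceded by a.
const bNotAfterA = positionsOf(/(?<!a)b/g, bText);
console.log(bNotAfterA); // [4, 6]
```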
Summary of regex actors/characters.
1. The dot “.” is the humble guy, which matches exactly one character: any character except a newline.
2. The plus ‘+’ is the greedy guy. It matches a minimum of 1 token to a maximum of any number of tokens.
3. The star ‘*’, the cousin of ‘+’, is also greedy but differs in matching a minimum of 0 tokens and a maximum of any number of tokens.
Some single character matchers..
\s – matches whitespace
\S – matches everything other than whitespace, so a combination of \s and \S in one class is everything —> [\s\S]
\d – matches a digit 0-9
\b – position matcher. Matches between word & nonword and nonword & word, and at the start & end of the string.
\B – opposite of \b
\w – [a-zA-Z_0-9] –> word characters
\W – non-word characters –> opposite of word characters
\D – opposite of digit
Just a quick example to check if we understood… make a regex for a date with the pattern mm/dd/yyyy.
It would be [01][0-9]/[0-3]\d/[0-2]\d\d\d. This is quite close (it still accepts impossible dates such as 19/39/2999)… I guess it is OK, you understood the concept.
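To try the date pattern in JavaScript, the slashes have to be escaped inside a regex literal. In this sketch the month class is written as [01], and I have added ^ and $ so the whole string must be a date. The stricter variant is only my suggestion for tightening the month and day ranges; it still knows nothing about month lengths.

```javascript
// The loose pattern from the text, anchored (the anchors are my addition).
const looseDate = /^[01][0-9]\/[0-3]\d\/[0-2]\d\d\d$/;
console.log(looseDate.test("12/25/2023")); // true
console.log(looseDate.test("19/39/2999")); // true, this is the "quite close" part

// A tighter sketch: months 01-12, days 01-31, any 4-digit year.
const stricterDate = /^(0[1-9]|1[0-2])\/(0[1-9]|[12]\d|3[01])\/\d{4}$/;
console.log(stricterDate.test("12/25/2023")); // true
console.log(stricterDate.test("19/39/2999")); // false
console.log(stricterDate.test("02/31/2023")); // true, month lengths are not checked
```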
If you don’t have a regex search tool, use this HTML/JavaScript page. The pattern you type is passed straight to the RegExp constructor, so you don’t have to escape every backslash with another backslash.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Regex tester</title>
<script type='text/javascript'>
function find(){
// Read the pattern and the text to search from the form fields.
var reg = document.getElementById('regex').value;
var str = document.getElementById('string').value;
var rex;
try {
// Build the regex directly from the typed pattern (no eval needed).
rex = new RegExp(reg, 'gim');
} catch (e) {
alert('invalid regex: ' + e.message);
return;
}
var search = rex.exec(str);
if(search){
// search.index is where the match starts, search[0] is the matched text.
alert('success, the position is ' + search.index + ', matched: ' + search[0]);
}else{
alert('failure');
}
}
</script>
</head>
<body class='yui-skin-sam'>
string <textarea name='string' id='string'></textarea><br>
regex:<input type='text' name='find' id='regex'><br>
<button type='button' value='findword' title='findword' onclick='find()'>findword</button>
</body>
</html>
Let us check for email..
\w+?@\w+?\.\w{2,} This is a decent one (note the escaped dot, so that only a literal “.” is accepted before the ending)…
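A quick check of the email pattern in JavaScript, with the dot escaped so it only matches a literal “.”. The addresses are made up, and this is a rough sketch rather than a full email validator.

```javascript
const email = /\w+?@\w+?\.\w{2,}/;

// A plain address inside a sentence is found.
const hit = email.exec("mail joe@example.com now");
console.log(hit[0]); // "joe@example.com"

// No dot after the domain name, or too short an ending: no match.
const noDot = email.test("joe@example");
const shortEnding = email.test("joe@example.c");
console.log(noDot, shortEnding); // false false
```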
HAPPY REGEXing..