Regular Expressions - A Crash Course

Jul 02, 2011

Regular expressions give you a set of tools to match strings of text, which is often useful for validating user data. Novice programmers often have trouble with them. In this article we are going to take a look at the fundamentals. In later articles we will look at more advanced regular expressions, learn about the theory behind regular expression engines, and see how to apply regular expressions in PHP.

Characters

Characters fall into two types: literal characters and special characters. A literal character matches itself, while a special character has a meaning of its own within the regular expression.
Special Characters
The following characters have special meanings, and we will explore their meanings in more depth soon.
[ ] \ ^ $ . | ? * + ( ) { }
Literal Characters
Any character that you can type on your keyboard, that is not a special character, is a literal character.
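As a small illustration of the difference, here is a sketch using Python's built-in re module (the choice of Python and the sample strings are assumptions made for this example, not something the article prescribes). It shows a literal match, an unescaped dot acting as a special character, and the same dot made literal with a backslash.

```python
import re

# Literal characters match themselves: "cat" is found inside "concatenate".
literal_found = re.search(r"cat", "concatenate") is not None

# An unescaped dot is special: it matches any character except a newline.
unescaped = re.findall(r"1.5", "1.5 1x5")   # ['1.5', '1x5']

# Escaping the dot with a backslash turns it into a literal period.
escaped = re.findall(r"1\.5", "1.5 1x5")    # ['1.5']

# re.escape() adds the backslashes for you when the text comes from elsewhere.
price_pattern = re.escape("$5.00")
price_found = re.search(price_pattern, "cost: $5.00") is not None
```

Most engines provide something similar to re.escape, which is handy when a pattern is built from user input.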

Character Classes

A character class matches exactly one character out of the set you list. Character classes are defined by placing the list of characters between square brackets [ ]. For example, [aeiou] will match a single vowel. Character classes also support the range operator (hyphen), so instead of writing [0123456789] to match a single digit, you can simply use [0-9]. In most engines only a few characters remain special inside a class (the closing square bracket, the backslash, the caret, and the hyphen), and you escape them with a backslash when you want them matched literally. Escaping other special characters inside a class is harmless. For example, to match an opening square bracket you could use [\[]. You may also use non-printable characters and, if your engine supports them, Unicode characters. Some examples: [\t] will match the tab character, and [\x09] will also match the tab character using hex notation. [\u0394] will match the Unicode Greek capital letter delta.
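The classes mentioned above can be tried out with a short sketch in Python's re module (Python is an assumption here; the sample strings are made up for illustration). Note that re.findall returns every non-overlapping match, one character at a time for a single class.

```python
import re

# [aeiou] matches one vowel at a time.
vowels = re.findall(r"[aeiou]", "regular")     # ['e', 'u', 'a']

# [0-9] uses the range operator to match a single digit.
digits = re.findall(r"[0-9]", "room 42b")      # ['4', '2']

# An escaped opening square bracket inside a class.
brackets = re.findall(r"[\[]", "a[0]")         # ['[']

# Tab written as \t and as the hex escape \x09.
tab_plain = re.search(r"[\t]", "a\tb") is not None
tab_hex = re.search(r"[\x09]", "a\tb") is not None

# \u0394 is the Greek capital letter delta (supported by Python 3 string patterns).
delta = re.search(r"[\u0394]", "\u0394x") is not None
```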

Negated character classes
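By placing the caret after the opening square bracket, you can negate a character class. For example, negating the vowel class defined above gives [^aeiou], which matches any single character that is not a lowercase vowel. Keep in mind that this covers more than the 21 consonants: digits, spaces, punctuation, and uppercase letters match as well. When that is acceptable, it is a lot easier than writing out a character class with the 21 consonants.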
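A quick sketch in Python's re module (an assumed language for illustration) shows what the negated vowel class really matches, next to a hypothetical consonant-only class built from ranges for comparison.

```python
import re

sample = "code 42"

# The negated class matches everything that is not a lowercase vowel,
# including the space and the digits.
non_vowels = re.findall(r"[^aeiou]", sample)   # ['c', 'd', ' ', '4', '2']

# A hypothetical explicit class listing only the 21 lowercase consonants.
consonants = re.findall(r"[b-df-hj-np-tv-z]", sample)   # ['c', 'd']
```

Which of the two you want depends on whether digits and punctuation should count as a match.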

Shorthand Characters

The following shorthand characters are provided by many regular expression engines. However, if your engine does not support the shorthand characters, the equivalent character class is also provided.
\d matches a digit. The equivalent character class would be [0-9]
\w matches a word character (a letter, digit, or underscore). The equivalent character class would be [a-zA-Z0-9_]
\s matches a whitespace character. The equivalent character class would be [ \t\r\n] (space, tab, or line break); some engines also include form feed and vertical tab.
\D matches a non-digit. This is the same as [^\d] or [^0-9]
\W matches a non-word character. This is the same as [^\w] or [^a-zA-Z0-9_]
\S matches a non-whitespace character. This is the same as [^\s] or [^ \t\r\n]
. The dot (period) will match any character except a newline, meaning the dot is effectively the negated newline class [^\n]. It is extremely powerful, but it also lets you be lazy, which is often a bad thing when you are trying to validate data.
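Here is a sketch of the shorthands in Python's re module (Python and the sample string are assumptions for illustration). One engine-specific detail worth knowing: in Python 3, \d, \w, and \s are Unicode-aware by default, so they only reduce to the ASCII classes listed above when the re.ASCII flag is passed.

```python
import re

sample = "a1 _b\n"

digits = re.findall(r"\d", sample)         # ['1']
word_chars = re.findall(r"\w", sample)     # ['a', '1', '_', 'b']
spaces = re.findall(r"\s", sample)         # [' ', '\n']
non_digits = re.findall(r"\D", sample)     # ['a', ' ', '_', 'b', '\n']
non_word = re.findall(r"\W", sample)       # [' ', '\n']
non_space = re.findall(r"\S", sample)      # ['a', '1', '_', 'b']

# The dot skips the newline between the two letters.
dot = re.findall(r".", "a\nb")             # ['a', 'b']

# \u00e9 is an accented letter: a word character in Unicode mode,
# but not when \w is restricted to [a-zA-Z0-9_] with re.ASCII.
unicode_word = re.fullmatch(r"\w", "\u00e9") is not None            # True
ascii_word = re.fullmatch(r"\w", "\u00e9", re.ASCII) is not None    # False
```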

Repeating Character Classes

Matching a single character is often necessary. However, in many circumstances you will want to match more than one character. For this you can use quantifiers. A quantifier is placed immediately after the character or character class you want to repeat.
? The question mark will match the preceding character zero or one time.
* The star, formally known as the Kleene star, will match a character zero or more times.
+ The plus will match a character one or more times.
Using curly brackets, you can specify quantities as well.
{5} This matches a character exactly five times.
{1,3} This matches a character one, two, or three times.
{5,} This matches a character five or more times.
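The quantifiers listed above can be checked with a minimal sketch in Python's re module (an assumed language; the helper name matches and the sample patterns are hypothetical). The helper uses re.fullmatch, which only succeeds when the whole string fits the pattern, so the counts are easy to see.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : zero or one "u"
optional_u = [matches(r"colou?r", s) for s in ("color", "colour", "colouur")]
# [True, True, False]

# * : zero or more "b"
star = [matches(r"ab*", s) for s in ("a", "ab", "abbb")]
# [True, True, True]

# + : one or more "b"
plus = [matches(r"ab+", s) for s in ("a", "ab", "abbb")]
# [False, True, True]

# {5} : exactly five digits
exactly_five = [matches(r"\d{5}", s) for s in ("1234", "12345", "123456")]
# [False, True, False]

# {1,3} : one to three digits
one_to_three = [matches(r"\d{1,3}", s) for s in ("", "1", "123", "1234")]
# [False, True, True, False]

# {5,} : five or more digits
five_or_more = [matches(r"\d{5,}", s) for s in ("1234", "12345", "1234567")]
# [False, True, True]
```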

Let's build a regular expression to match a phone number of the form 111-222-3333:

[\d]{3}-[\d]{3}-[\d]{4}

We can also make the hyphens optional:
[\d]{3}-?[\d]{3}-?[\d]{4}
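Both phone number patterns can be tried out with a short sketch in Python's re module (Python and the sample inputs are assumptions for illustration). It uses fullmatch so that the whole input has to be a phone number; with search, the pattern would also be found inside a longer string.

```python
import re

strict = re.compile(r"[\d]{3}-[\d]{3}-[\d]{4}")
optional_hyphens = re.compile(r"[\d]{3}-?[\d]{3}-?[\d]{4}")

# The strict pattern requires both hyphens.
strict_with = strict.fullmatch("111-222-3333") is not None        # True
strict_without = strict.fullmatch("1112223333") is not None       # False

# With -? each hyphen may be present or absent independently.
optional_with = optional_hyphens.fullmatch("111-222-3333") is not None    # True
optional_without = optional_hyphens.fullmatch("1112223333") is not None   # True
optional_mixed = optional_hyphens.fullmatch("111-2223333") is not None    # True

# search() finds the pattern anywhere, even inside a longer run of digits.
embedded = strict.search("call 111-222-33334444").group()   # '111-222-3333'
```

The last line is a reminder that an unanchored pattern is not enough for validation on its own.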

Anchors

Literal characters and character classes allow you to match characters, whereas anchors allow you to match positions.
^ The caret matches the position at the start of the string.
$ The dollar sign matches the position at the end of the string.
It is important to note that when the caret is used inside a character class, it negates the class. When the caret is used outside of a character class, it acts as an anchor. If we want to match the word "the", we would use the literal characters the. However, the pattern the would also match "there" and "them". Using the dollar sign, we tell the regular expression to match only strings that end in "the". Thus the$ will not match "there" or "them", since those words do not end in "the". However, this regular expression will still match words like "bathe" and "clothe". To fix this, we can add the caret so that the match must also begin at the start of the string: ^the$ matches only the string "the" itself.
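The walk-through above can be reproduced with a brief sketch in Python's re module (an assumed language; the word list is taken from the examples in the text). Each list keeps the words in which re.search finds the pattern.

```python
import re

words = ["the", "there", "them", "bathe", "clothe"]

# No anchors: "the" is found anywhere inside the word.
unanchored = [w for w in words if re.search(r"the", w)]
# ['the', 'there', 'them', 'bathe', 'clothe']

# $ requires the match to finish at the end of the string.
ends_with_the = [w for w in words if re.search(r"the$", w)]
# ['the', 'bathe', 'clothe']

# ^ and $ together leave room for nothing but "the".
exactly_the = [w for w in words if re.search(r"^the$", w)]
# ['the']
```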

Modifiers

Modifiers allow you to change the behavior of some characters.
/i will make the expression case insensitive. By default, regular expressions are case sensitive. If you noticed above, when we specified the character class to match a word, we specified lowercase as well as uppercase characters. In some instances, you may want to turn off the case sensitivity, and for this you use the /i modifier.
/s enables single-line mode. In this mode, the dot character will also match newlines.
/m enables multi-line mode. In this mode, the ^ and $ anchors match at the start and end of every line, rather than only at the start and end of the whole string.
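How modifiers are written depends on the engine. As an illustration, Python's re module (an assumption here, since the article does not name an engine) has no /i suffix; it takes flags instead, with re.IGNORECASE, re.DOTALL, and re.MULTILINE corresponding to /i, /s, and /m. A minimal sketch with made-up sample strings:

```python
import re

# /i : case-insensitive matching
case_sensitive = re.search(r"the", "THE END") is not None                   # False
case_insensitive = re.search(r"the", "THE END", re.IGNORECASE) is not None  # True

# /s : the dot also matches a newline
dot_default = re.search(r"a.b", "a\nb") is not None               # False
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None        # True

# /m : ^ and $ match at every line boundary
text = "one\ntwo"
starts_default = re.findall(r"^\w", text)                  # ['o']
starts_multiline = re.findall(r"^\w", text, re.MULTILINE)  # ['o', 't']
ends_default = re.findall(r"\w$", text)                    # ['o']
ends_multiline = re.findall(r"\w$", text, re.MULTILINE)    # ['e', 'o']
```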

That is it for this tutorial. Next, we will take a look at grouping, back-references, look-aheads, and look-behinds, and extend some of the concepts we have already covered. Do you have any handcrafted regular expressions you would like to share? I would love to see them in the comments below!