Some powerful strengths of Regular Expressions

Regular expressions provide a powerful, concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.
Here are some of their most useful features:

Word Boundaries

The metacharacter \b is an anchor, like the caret and the dollar sign. It matches at a position called a “word boundary”, that is, between a word character and a non-word character (or the start or end of the string). The match is zero-length.
Example: to find the name ‘myname’ as a whole word in a text, you can use:

\bmyname\b
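
As a rough illustration, here is a minimal Python sketch of the pattern above. The helper name has_whole_word and the sample strings are made up for this example; only the pattern itself comes from the text.

```python
import re

# \b matches only at the edge between a word character and a non-word character.
pattern = re.compile(r"\bmyname\b")

def has_whole_word(text):
    """Return True if 'myname' occurs in text as a standalone word."""
    return pattern.search(text) is not None

print(has_whole_word("hello myname, welcome"))  # standalone word
print(has_whole_word("this is notmyname2"))     # embedded in a longer word
```

The second call finds nothing because the letters and digits around ‘myname’ are word characters, so there is no boundary on either side.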

Word Characters

The characters [a-zA-Z0-9_] are word characters. They are also matched by the shorthand character class \w in most regex flavors. Note that in Java, \b is Unicode-aware by default, while \w matches only ASCII word characters unless the UNICODE_CHARACTER_CLASS flag is set.
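
What \w covers depends on the flavor and its flags. The sketch below uses Python rather than Java, purely as an illustration: in Python 3, \w on text strings is Unicode-aware by default, and the re.ASCII flag narrows it to [a-zA-Z0-9_]. The sample string is made up.

```python
import re

sample = "na\u00efve caf\u00e9_1"  # "naïve café_1", written with escapes

# Default for str patterns in Python 3: \w also matches accented letters.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letters split the words.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```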

Lookahead and Lookbehind

Collectively these are called “lookaround” or “zero-width assertions”.
Lookarounds actually match characters, but then give up the match and return only the result: match or no match. They do not consume characters in the string; they only assert whether a match is possible at that position.
Lookahead: lets you match something that is followed (positive) or not followed (negative) by something else.

  • Negative lookahead lets you match something that is not followed by something else. For example, to match an “x” that is not followed by a “y”, you can use:
    x(?!y)
  • Positive lookahead lets you match something that is followed by something else, without consuming the characters of that something else. For example, to match an “x” that is followed by a “y”, you can use:
    x(?=y)
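
The two lookahead patterns above can be tried out with a short Python sketch. The sample text and variable names are made up for illustration.

```python
import re

text = "xy xz x"

# x(?!y): an x that is NOT followed by y. Record where each match starts.
negative_starts = [m.start() for m in re.finditer(r"x(?!y)", text)]

# x(?=y): an x that IS followed by y. The y is checked but not consumed,
# so the matched text is just "x".
positive_matches = [m.group() for m in re.finditer(r"x(?=y)", text)]

print(negative_starts)   # start offsets of the x in "xz" and of the final x
print(positive_matches)  # only the x of "xy"
```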

Lookbehind: it has the same effect as Lookahead, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.

  • Negative lookbehind lets you match something that is not preceded by something else. For example, to match an “x” that is not preceded by a “y”, you can use:
    (?<!y)x

    (It will not match zyx, but will match the x (and only the x) in xml or alexander.)

  • Positive lookbehind lets you match something that is preceded by something else. For example, to match an “x” that is preceded by a “y”, you can use:
    (?<=y)x

    (It will not match axb, but will match the x (and only the x) in zyx.)
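
The lookbehind cases mentioned above (zyx, xml, alexander, axb) can be checked with a small Python sketch. The variable names are made up; the patterns and test words are the ones from the text.

```python
import re

not_after_y = re.compile(r"(?<!y)x")  # negative lookbehind
after_y = re.compile(r"(?<=y)x")      # positive lookbehind

# Negative lookbehind: no match in "zyx", but the x in "xml" and "alexander" matches.
print(not_after_y.search("zyx"))               # None
print(not_after_y.search("xml").span())        # the x at the very start
print(not_after_y.search("alexander").span())  # the x after "ale"

# Positive lookbehind: no match in "axb", but the x in "zyx" matches.
print(after_y.search("axb"))                   # None
print(after_y.search("zyx").span())            # only the x, not the y
```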

Repetition

The question mark tells the engine to attempt to match the preceding token zero times or once.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times.

The plus sign tells the engine to attempt to match the preceding token one or more times.

An additional repetition operator lets you specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is unlimited. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
Example:
To match a sequence of 3 to 5 uppercase letters (add \b on both sides to require a whole word), use:

[A-Z]{3,5}
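
Here is a minimal Python sketch of the repetition operators described above. The helper fully_matches and the sample strings are made up for illustration; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

def fully_matches(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fully_matches(r"colou?r", "color"))      # ? allows zero occurrences of u
print(fully_matches(r"ab*", "a"))              # * allows zero occurrences of b
print(fully_matches(r"ab+", "a"))              # + needs at least one b
print(fully_matches(r"[A-Z]{3,5}", "ABCD"))    # 4 letters is within 3..5
print(fully_matches(r"[A-Z]{3,5}", "ABCDEF"))  # 6 letters is too many

# Without anchoring, the same pattern simply takes the first 5 letters.
first_five = re.search(r"[A-Z]{3,5}", "ABCDEF").group()
print(first_five)
```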

Greediness:

Remember that the repetition operators are greedy: they cause the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail will the engine backtrack.
For example:
The regex <.+> applied on “This is a <em>first</em> test” will match <em>first</em> and not <em>.
The quick fix for this problem is to make the repetition operator lazy instead of greedy (lazy quantifiers are sometimes also called “ungreedy” or “reluctant”). You can do that by putting a question mark after the repetition operator in the regex.
For example:
To solve the previous problem use:

<.+?>

The regex <.+?> applied on “This is a <em>first</em> test” will match <em> and not <em>first</em>.
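
The greedy and lazy behaviour described above can be reproduced with a short Python sketch. The variable names are made up; the text and patterns are the ones from the example.

```python
import re

text = "This is a <em>first</em> test"

# Greedy: .+ runs to the end of the string, then backtracks to the LAST '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? expands one character at a time and stops at the FIRST '>'.
lazy = re.search(r"<.+?>", text).group()

# With the lazy version, findall picks up each tag separately.
tags = re.findall(r"<.+?>", text)

print(greedy)
print(lazy)
print(tags)
```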

Unicode Character categories

In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. In regex flavors with Unicode support, you can match a single character belonging to a particular category with \p{} (lowercase p), and a single character not belonging to that category with \P{} (uppercase P).
For example:

  • \p{L} or \p{Letter}: any kind of letter from any language
  • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant
  • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant
  • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized
  • \p{Sm} or \p{Math_Symbol}: any mathematical symbol
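
Python's built-in re module does not accept \p{...}, so the sketch below does not use a regex at all. As a stand-in, it looks up the same category codes (L, Ll, Lu, Lt, Sm) with the standard unicodedata module, just to show which characters fall into which category. The helper in_category and the sample characters are made up for illustration.

```python
import unicodedata

def in_category(ch, code):
    """Return True if the character's general category starts with code.

    A one-letter code such as "L" covers all of its subcategories
    (Ll, Lu, Lt, ...), similar in spirit to \\p{L} in a regex.
    """
    return unicodedata.category(ch).startswith(code)

print(in_category("a", "Ll"))       # lowercase letter
print(in_category("A", "Lu"))       # uppercase letter
print(in_category("\u01c5", "Lt"))  # U+01C5, a titlecase letter
print(in_category("+", "Sm"))       # math symbol
print(in_category("\u00e9", "L"))   # e with acute accent is a letter
print(in_category("1", "L"))        # a digit is not a letter
```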

Recommended resource

A highly recommended resource for regular expressions is:
http://www.regular-expressions.info/tutorial.html
