The Mathematics Way: REGULAR EXPRESSIONS

REGULAR EXPRESSIONS

The language-defining symbols we are about to create are called regular expressions. The languages that are associated with these regular expressions are called regular languages and are also said to be defined by finite representation.

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

The following examples illustrate a few specifications that could be expressed in a regular expression:

the sequence of characters "car" in any context, such as "car", "cartoon", or "bicarbonate"
the word "car" when it appears as an isolated word
the word "car" when preceded by the word "blue" or "red"

Regular expressions can be much more complex than these examples.

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For example, Perl, Ruby and Tcl have a powerful regular expression engine built directly into their syntax. Several utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.

History

The origins of regular expressions lie in automata theory and formal language theory, both of which are part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions ("grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p where re stands for regular expression [citation needed^]).

Perl and Tcl regular expressions were derived from a regex library written by Henry Spencer, though Perl later expanded on Spencer's library to add many new features.^[1] Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regular expression functionality, and is used by many modern tools including PHP and Apache HTTP Server.

Formal language theory

Regular expressions can be expressed in terms of formal language theory. Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ the following constants are defined:

(empty set) ∅ denoting the set ∅
(empty string) denoting a string with no characters.
(literal character) a in Σ denoting a character in the language.

The following operations are defined:

(concatenation) RS denoting the set { αβ | α in R and β in S }. For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.

(alternation) R|S denoting the set union of R and S. (Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R.

For example, {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ... }.

The above constants and operators form a Kleene algebra.

To avoid brackets it is assumed that the Kleene star has the highest priority, then concatenation and then set union. If there is no ambiguity then brackets may be omitted. For example, (ab)c can be written as abc, and a|(b(c*)) can be written as a|bc*.

In Chapter 2, we have defined the language L₄ as :

L₄ = {

x xx xxx xxxx . . . } and we presented one method for indicating this set as a closure of a smaller set.

Let S = { x } . Then L4 = S* As a shorthand we have L4 = { x }* .

We now introduce the use of Kleene star not to the set but directly to the letter x as a superscript as if it were an exponent; x* .

The simple expression, x* will be used to indicate some sequence of x’s ( maybe none at all.)

x* =

or x or x² or x³ or x⁴ . . .

for some n = 0 1 2 3 4 5 . . .

We can think of the star as an unknown power or undetermined power. That is x* stands for a string of x’s but we do not specify how many. The star operator applied to a letter is analogous to the star operator applied to the set. It represents an arbitrary concatenation of copies of that letter. This notation can help us define languages by writing L4 = language ( x* ). Since x* is any string of x’s, L4 is then the set of all possible strings of x’s of any length ( including L ).

L = { a ab abb abbb abbbb . . . } can be summarize as “all words of the form one a, followed by some number of b’s. By the star notation.

L = language ( a b* ) or without the space L = language ( ab* )

The meaning is clear: This is a language in which the words are the concatenation of an initial a with some or no b’s ( that is b* ). No string can contain a blank unless a blank is a character in the alphabet S. If we want blanks to be in the alphabet, we normally introduce some special symbol to stand for them, as blanks themselves are invisible to the naked eye. The reason for putting a blank between a and b* is the product above is to emphasize the point that the star operator is applied to the b only. We have now used a bold face letter without a star as well as with a star.

We can apply the Kleene star to the string ab if we want, as follows:

( ab )* =

or ab or abab or ababab . . .

Parenthesis are not letters in the alphabet of this language, so they can be used to indicate factoring without accidentally changing the words. Since star represents some kind of exponentiation, we use it as powers as in algebra, where by universal understanding the expression xy² means x( y² ), not ( xy )².

If we want to define the language L1 this way, we may write L1 = language ( xx* )

This means that we start each word of L1, by writing down an x and then follow it with some string of x’s; or may use the ⁺ notation from chapter 2 and write L1 = language ( x⁺ ) meaning all words of the form x to some positive power ( that is, not x⁰ =

). The notation ⁺ is a convenience but is not essential since we can say the same thing with *’s alone.

Example 1 : The language L1 can be defined by any of the expressions below :

xx* x+ xx*x* x*xx* x⁺x* x*x⁺ x*x*x*xx*

Example 2 : The language defined by the expression ab*a is the set of all strings of a’s and b’s that have at least two letters, that begin and end with a’s and that have nothing but b’s inside.

language ( ab*a ) = { aa aba abba abbba abbbba . . . }

Example 3 :

The language of the expression a*b* contains all the strings of a’s and b’s in which all the a’s come before all the b’s.

language ( a*b* ) = {

a b aa ab aaa aab abb bbb aaaa . . . }

Note that ba and aba are not in this language. Notice also that there need not be the same number of a’s and b’s. a*b* ≠ ( ab )*

Example : The following expressions both define the language L₂ = { x^odd } x ( xx )* or ( xx )* x but the expression x*xx* does not since it contains the word ( xx ) x ( x ) .

We introduce the use of the + sign. By the expression x + y , where x and y are strings of characters from an alphabet, we mean “either x or y”. This means that x + y offers a choice, much the same way that x* does.

Example : Let T be the language defined over the alphabet S = { a, b, c }.

T = { a c ab cb abb cbb abbb cbbb abbbb cbbbb . . . } All words in T begin with an a or a c and then followed by some number of b’s. Symbolically, we may write this as

T = language (( a + c) b*) = language ( either a or c then some b’s )

We say that the expression ( a + c ) b* define a language in the following sense. Every + and * ask us to make a choice. For each * or + used as a superscript we must select some number of factors for which it stands. For each other + we must decide whether to choose the right side expression or the left side expression. For every set of choices, we have generated a particular string. The set of all strings produceable by this method is the language of the expression. In ( a + c )b* , we must choose either the a or the c for the first letter and then we choose how many b’s the b* stand for. Each set of choices is a word. If from ( a + c ) we choose c and we choose b* to mean bbb, we have the word cbbb formed.

Consider the finite language L that contains all the strings of a’s and b’s of length exactly three.

L = { aaa aab aba abb baa bab bba bbb }

The first letter of the word in L is either a or b. The second letter of each word in L is either an a or a b. The third letter of each word in L is either an a or a b. So we may write

L = language ( ( a + b ) ( a + b ) ( a + b ) ( a + b )) or in short we may write

L = language ( ( a + b ) ³ )

If we want to define the set of all seven letter strings of a’s and b’s, we could write ( a + b )⁷. In general, if we want to refer to the set of all possible strings of a’s and b’s of any length, we could write :

( a + b )*

This set of all possible strings of letters from the alphabet Σ = { a, b } including the null string.

This expression represents a language and if we decide that * stands for 5, then ( a + b )* = ( a + b )⁵ and is equal to ( a + b ) ( a + b ) ( a + b ) ( a + b ) ( a + b ). Now we have to make more five choices : either a or b for the firs letter, either a or b for the second letter . . . . .

All words that begins with letter a is given by a ( a + b )*

All words that begin with an a and end with a b is given by a ( a + b ) * b

EXAMPLE:

Consider the language defined by the given expression a( a + b )* a ( a + b )*

At the beginning, we have ( a + b )*, which stands for anything, that is any string of a’s and b’s, then comes an a, then another anything.

All told, the language is the set of all words over the alphabet S = { a, b } that have a in them somewhere. The only words left out are those that have only b’s and the word L.

The language of all words that have at least two a’s can be described by the expression :

( a + b )* a ( a + b )* a ( a + b )*

= ( some beginning ) ( the first important a )( some middle )( the second important a )( some end )

where the arbitrary parts can have as many a’s ( or b’s ) as they want.

EXAMPLE : Another expression that denotes all words with at least two a’s is : b*ab*a ( a + b )*.

We can scan through jungle of b’s ( or no b’s ) until we find the first a, then more b’s ( or no b’s, then the second a, then finish up with anything. In this set are abbbabb and aaaaa.

We can write (a + b)*a( a + b) * a (a + b)* = b* ab* a ( a + b )* where the equal ( = ) sign do not mean that these expressions are equal algebraically in the same way as x + x = 2x but that they are equal because they describe the same item, as with ruby jubilee = 40 years or 16^th president = Abraham Lincoln.

Thus we could write language ((a + b )* a( a + b )* a ( a + b)*)

= language ( b*ab *a ( a + b )*

= all words with at least two a’s.

To be careful about this point, we say that two regular expressions are equivalent if they describe the same language. The expressions below also describe the language of words with at least two a’s :

EXAMPLE :

If we wanted all the words with exactly two a’s, we use the expression b* ab* ab* which describes such words as aab, baba, and bbbabbbab. To make the word aab, let the first and second b become L and the last b become b. n

EXAMPLE : The language of all words that have at least one a and at least one b is somewhat trickier. If we write : ( a + b )*a ( a + b )* b ( a + b )*

= ( arbitrary ) a ( arbitrary ) b ( arbitrary ),

we are then requiring that an a precede a b in the word. Such words as ba and bbaaaa are not included in this set. Since we know that either the a comes before the b or the b comes before the a, we could define this set by the expression

( a + b )* a ( a + b )* b ( a + b )* + ( a + b )* b ( a + b )* a ( a + b )*

Here we are still using the plus sign in the general sense of disjunction ( or ).

There is a simpler expression that defines the same language. If we are confident that the only words that are omitted by the first term ( a + b )* a ( a + b )* b ( a + b )* are the words of the form some b’s followed by some a’s, then it would be sufficient to add these specific exceptions into the set. These exceptions are all defined by the regular expression : bb* aa*

The language of all words over the alphabet Σ = { a , b } that contain both an a and a b is therefore also defined by the expression : ( a + b )* a ( a + b )* b ( a + b )* + bb* aa*

NOTE that it is necessary to write bb* aa* because b* a* will allow words we do not want, such as aaa.

These LANGUAGE-DEFINING expressions cannot be treated like algebraic symbols. We have shown that in the following expression

( a + b )* a ( a + b ) * b ( a + b )* + ( a + b )* b ( a + b )* a ( a + b )* = ( a + b )*a ( a + b )* b ( a + b )* + bb* aa*

The first terms on both sides of this equation are the same, but if we cancel them, we get

( a + b )* b ( a + b )* a ( a + b )* = bb* aa* which is false, since the left side includes

the word aba, which the expression on the right side does not. The only words that do not contain both an a and a b in them somewhere are the words of all a’s, all b’s or

. When these are added into the language, we get everything. Therefore the regular expression :

( a + b )* a ( a + b )* b ( a + b )* + bb* aa* + a* + b* defines all possible strings of a’s and b’s.

The word

is included in both a* and b*.

( a + b )* = ( a + b )* a ( a + b )* b ( a + b )* + bb* aa* + a* + b*

is not a very obvious equivalence.

LANGUAGE-DEFINING EQUIVALENCES :

( a + b )* = ( a + b )* + ( a + b )*

( a + b )* = ( a + b )* ( a + b )*

( a + b )* = a( a + b )* + b( a + b )* +

( a + b )* = ( a + b )* ab ( a + b )* + b* a*

NOTE : These equivalences should not be treated like algebraic polynomials.

The last of these equivalences means that all the words that do not contain the substring ab are

all a’s , all b’s,

, or some b’s followed by the plus ( union sign ) alone.

EXAMPLE: Let V be the language of all strings of a’s and b’s in which the strings are either all b’s or else

there is an a followed by some b’s. Let V also contain the word

V = {

a b ab bb abb bbb abbb bbbb . . . } V can be defined by the expression b* + ab* where the

word

is included in the term b*. Alternately, we could define V by the expression : (

+ a ) b*

This implies that in front of the string of b’s we have the option of either adding an a or nothing. Since we could always write b* =

b*, we have what appears to be some sort of distributive law at work.

b* + ab* = (

+ a )b* . b* is factored out just as in algebra. It is also because of this analogy in

algebra that we denoted our disjunction by the plus sign instead of the union sign U or the symbolic sign V.

WE HAVE A HYBRID SYSTEM: the * is somewhat like an exponent and the + sign is somewhat like addition. But the analogies to algebra should be approached very suspiciously, since addition in algebra never means choice and algebraic multiplication has properties different from concatenation ( even though we sometimes conveniently refer to it as product ).

ab = ba in algebra

ab ≠ ba in formal languages

Consider the language T = { a c ab cb abb cbb . . . }.

T can also be defined as in the above example as ( a + c )b*

but it can also be defined by ab* + cb*.

REGULAR EXPRESSIONS :

The symbols that appear in regular expressions are :

1. the letters of the alphabet Σ,

2. the symbol for the null string

3. the parentheses ( )

4. the star operator * and

5. the plus sign.

Definition :

Rule 1: Every letter of Σ can be made into a regular expression by writing it in boldface;

is a regular expression.

Rule 2 : If r₁ and r₂ are regular expressions, then so are ( r₁ ) r₁r₂ r₁ + r₂ r₁*.

Rule 3 : Nothing else is a regular expression.

We have not included the plus sign as a superscript r₁⁺ as part of the definition because r₁⁺ = r₁r₁*

Definition :

If S and T are sets of strings of letters ( whether they are finite or infinite sets ), then the product set of strings of letters is : ST = { all concatenation of a string from S concatenated with a string from T }

EXAMPLE :

If S = { a aa aaa } and T = { bb bbb } then

ST = { abb abbb aabb aabbb aaabb aaabbb }

NOTE : These words are not in proper order.

EXAMPLE 1 : If S = { a bb bab } and T = { a ab } then

ST = { aa aab bba bbab baba babab }

EXAMPLE 2 : If P = { a bb bab } and Q = {

bbbb } then

PQ = { a bb bab abbbb bbbbbb babbbbb }

EXAMPLE 3: If M = {

x xx xxx } and N = {

y yy yyy yyyy yyyyy . . . }

then MN = {

y yy yyy yyyy yyyyy . . .

x xy xyy xyyy xyyyy xyyyyy . . .

xxy xxyy xxyyy xxyyyy xxyyyyy . . . }

xxxy xxxyy xxxyyy xxxyyyy xxxyyyyy . . . }

Using regular expressions, these four examples can be written as :

( a + aa + aaa )( bb + bbb ) = abb + abbb + aabb + aabbb + aaabb + aaabbb

( a + bb + bab )( a + ab ) = aa + aab + bba + bbab + baba + babab

( a + bb + bab )(

+ bbbb ) = a + bb + bab + ab⁴ + b⁶ + bab⁵

(

+ x + xx )( y* ) = y* + xy* + xxy*

EXAMPLE: In SPANISH and JAPANESE are their usual languages, then the product SPANISHJAPANESE is the language of all strings that start with a SPANISH word and finish with a JAPANESE word. Some words in this language are mesatokei, bonitooishi and pasearnomimasu.

It might not be clear why we can not just leave the rules for associating a language with a regular expression on the informal level, with the expression “ make choices for + and * ”. The reason is that the informal phrase “make choices” is much harder to explain precisely than the formal mathematical presentation below. We are now ready to give rules for associating a language with every regular expression. The method for doing this is given

recursively.

Definition : The following rules define the language associated with any regular expression.

Rule 1 : The language associated with the regular expression that is just a single letter is that one-letter

word alone and the language associated with L is just {

} , a one-word language.

Rule 2 : If r is a regular expression associated with the language L₁ and r₂ is a regular expression

Associated with the language L₂ then,

( i ) The regular expression ( r₁ ) ( r₂ ) is associated with the language L₁ times L₂

language ( r₁ r₂ ) = L₁L₂

( ii ) The regular expression r₁ + r₂ is associated with the language formed by the union

of the sets L₁ and L_2.

language ( r₁ + r₂ ) = L₁ + L₂

( iii ) The language associated with the regular expression (r₁ )* is L₁*, the Kleene

closure of the set L₁ as a set of words.

language ( r₁* ) = L₁*

Once again this collection of rules proves recursively that there is some language associated with every regular expression. The rules seem to show us how we can interpret the regular expression as a language. But they do not really tell us how to understand the language. By this we mean that if we apply the rules above to the regular expression

( a + b )* a( a + b )* b( a + b )* + bb* aa*

we can develop a description of some language, but can we understand that this is the language of all strings that have an a and a b in them ? This is question of meaning. This correspondence between regular expressions and languages leaves open two other questions. (1 ) Is there some way of telling of telling when this happens ? “Way” means an algorithm which can be taken in the later chapter to determine whether or not two regular expressions define the same language.

We have seen that every regular expression is associated with some language. (2 ) Is it true that every language can be describe by a regular expression ? In the next theorem we show that every finite language can be defined by a regular expression.

Theorem 5 :

If L is a finite language (a language with only finitely many words), then L can be defined by a regular expression.

Proof : To make one regular expression that defines the language L, turn all the words in L into boldface type and stick pluses between them.

For example, the regular expression that defines the language

L = { baa abbba bababa } is baa + abbba + bababa

If L = { aa ab ba bb }, the algorithm described above gives the regular expression aa + ab + ba + bb

Another regular expression that defines this language is ( a + b ) ( a + b ) so the regular expression need not be unique. The reason this trick only works for finite languages is that an infinite language would become a regular expression that is infinitely long, which is forbidden.

EXAMPLE :

Let L = {

x xx xxx xxxx xxxxx }

The regular expression we get from the theorem is :

+ x + xx + xxx + xxxx + xxxxx

A more elegant regular expression for these language is (

+ x )⁵ Of course 5 is, strictly speaking, not a legal symbol for a regular expression although it is understood that it means

(

+ x ) (

+ x )

Consider the expression : ( a + b )*( a a + b b ) ( a + b )*

This is the set of strings of a’s and b’s that at some point contain a double letter. We can think of it as

( arbitrary ) ( double letter ) ( arbitrary )

Let us now ask, “What strings do not contain a double letter ?” Some examples are :

a b ab ba aba bab abab baba . . . The expression ( ab )* covers all of these except those that begin with b or end in a. Adding these choices gives us the regular expression (

+ b ) ( ab )* (

+ a )

Consider the regular expression below :

E = ( a + b )* a ( a + b )* ( a +

) ( a + b )* a ( a + b )*

= ( arbitrary ) a ( arbitrary ) [ a or nothing ] ( arbitrary ) a ( arbitrary )

One obvious fact is that is that all the words in the language of E must have at least two a’s in them. Let us break up the middle plus sign into its two cases : either the middle factor contributes an a or else contributes a

. Therefore :

E = ( a + b )* a ( a + b )* a ( a + b )* a ( a + b )* + ( a + b )* a ( a + b )*

( a + b )* a ( a + b )*

This is a more detailed use of the distributive law. The first term above clearly represents all words that have at least three a’s in them. Before analyzing the second term observe that

( a + b )*

which occurs in the middle of the second term is only another way of saying “any string whatsoever” and could be replaced with the more direct expression

( a + b )* .

This would reduce the second term of the expression to

( a + b )* a ( a + b )* a ( a + b )*

Which we have already seen in a regular expression representing all words that have at least two a’s in them. Therefore the language associated with E is the union of all strings that have three or more a’s with all strings that have two or more a’s . But since all strings with three or more a’s are themselves already strings with two or more a’s , this whole language is just the second set alone. The language associated with E is no different from the language associated with

( a + b )* a ( a + b )* a ( a + b )*

which has been examined before.

It is possible by repeated application of the rules for forming regular expressions to produce an expression in which the star operator is applied to a subexpression that already has a star in it.

( a + b* )* ( a a + a b* )* (( a + b b b a* ) + b a* b )*

In the first of these expressions, the internal * adds nothing to the language.

( a + b* )* = ( a + b )*

since all possible strings of a’s and b’s are described by both expressions. Also in accordance with theorem 1,

( a* )* = a*

However, ( a a + a b* )* ≠ ( a a + a b )*

since the language for the expression on the left includes the word abbabb, which the language on the right does not. ( The language defined by the regular expression on the right cannot contain any word with double b ).

Example : Consider the regular expression

( a* b* )*

The language defined by this expression is all strings that can be made up of factors of the form a* b* , but since both the single letter a and the single letter b are words of the form a* b* , this language contains all strings of a’s and b’s. It cannot contain more than everything, so

( a* b* )* = ( a + b )*

The Mathematics Way

.well-known

Followers

Saturday, January 7, 2012

REGULAR EXPRESSIONS

No comments:

Post a Comment