.well-known

This is a Brave Rewards publisher verification file. Domain: mathematicsway.blogspot.com Token: 61623a436e7c44ba1310544150d3db370336fd3a7242d9db2869ad9e33470742

Followers

Saturday, January 7, 2012

REGULAR EXPRESSIONS

REGULAR  EXPRESSIONS

            The language-defining symbols we are about to create are called regular expressions. The languages that are associated with these regular expressions are called regular languages and are also said to be defined by finite representation. 
             In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

The following examples illustrate a few specifications that could be expressed in a regular expression:
  • the sequence of characters "car" in any context, such as "car", "cartoon", or "bicarbonate"
  • the word "car" when it appears as an isolated word
  • the word "car" when preceded by the word "blue" or "red"
          Regular expressions can be much more complex than these examples.
Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For example, Perl, Ruby and Tcl have a powerful regular expression engine built directly into their syntax. Several utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.

History
           The origins of regular expressions lie in automata theory and formal language theory, both of which are part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions ("grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p where re stands for regular expression [citation needed]).
           Perl and Tcl regular expressions were derived from a regex library written by Henry Spencer, though Perl later expanded on Spencer's library to add many new features.[1] Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regular expression functionality, and is used by many modern tools including PHP and Apache HTTP Server.

Formal language theory
           Regular expressions can be expressed in terms of formal language theory. Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ the following constants are defined:
  • (empty set) denoting the set
  • (empty string) denoting a string with no characters.
  • (literal character) a in Σ denoting a character in the language.

The following operations are defined:
            (concatenation) RS denoting the set { αβ | α in R and β in S }. For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.

(alternation) R|S denoting the set union of R and S. (Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R.
                For example, {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ... }.
The above constants and operators form a Kleene algebra.
To avoid brackets it is assumed that the Kleene star has the highest priority, then concatenation and then set union. If there is no ambiguity then brackets may be omitted. For example, (ab)c can be written as abc, and a|(b(c*)) can be written as a|bc*.

              In Chapter 2, we have defined the language L4 as :  
                          L4  =  {   x   xx   xxx   xxxx . . . }  and  we presented one method for indicating this set as a closure of a smaller set.
                        Let S = { x } . Then  L4 = S*  As a shorthand we have  L4 = { x }* .
            We now introduce the use of Kleene star not to the set but directly to the letter x as a superscript as if it were an exponent;  x* .
            The simple expression, x*  will be used to indicate some sequence of x’s ( maybe none at all.)

                        x*  =    or  x  or  x2  or  x3   or  x4 . . .
                                 for some  n  =  0   1   2   3   4   5 . . .

            We can think of the star as an unknown power or undetermined power.  That is  x* stands for a string of x’s but we do not specify how many.  The star operator applied to a letter is analogous to the star operator applied  to the set. It represents an arbitrary concatenation of copies of that letter. This notation can help us define languages by writing L4 = language ( x* ). Since x* is any string of x’s, L4 is then the set of all possible strings of x’s of any length ( including L ).
            L = { a   ab   abb   abbb abbbb . . . }  can be summarize as “all words of the form  one a, followed by some number of b’s. By the star notation.
                                                L = language ( a b* ) or without the space L = language ( ab* )
            The meaning is clear: This is a language in which the words are the concatenation of an initial a with some or no b’s ( that is b* ). No string can contain a blank unless a blank is a character in the alphabet S. If we want blanks to be in the alphabet, we normally introduce some special symbol to stand for them, as blanks themselves are invisible to the naked eye. The reason for putting a blank  between a and b* is the product above is to emphasize the point that the star operator is applied to the b only. We have now used a bold face letter without a star as well as with a star. 
            We can apply the Kleene star to the string ab if we want, as follows:
                        ( ab )* =    or  ab   or  abab  or  ababab . . .
       Parenthesis are not letters in the alphabet of this language, so they can be used to indicate factoring without accidentally changing the words. Since star represents some kind of exponentiation, we use it as powers  as in algebra, where by universal understanding the expression xy2 means x( y2 ), not ( xy )2.   
            If we want to define the language L1 this way, we may write  L1 = language ( xx* )
This means that we start each word of L1, by writing down an x and then follow it with some string of x’s; or may use the  + notation from chapter 2 and write  L1 = language ( x+ ) meaning all words of the form x to some positive power ( that is, not  x0 = ).  The notation  +  is a convenience but is not essential since we can say the same thing with  *’s  alone.
   Example 1 :   The language  L1 can be defined by any of the expressions below :
                                    xx*    x+    xx*x*    x*xx*    x+x*   x*x+    x*x*x*xx*    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
   Example 2 :  The language defined by the expression   ab*a   is the set of all strings of a’s  and  b’s  that have at least two letters, that begin and end with a’s and that have nothing but b’s inside.
                        language ( ab*a ) = { aa   aba   abba   abbba   abbbba  . . . }
   Example 3 :
     The language of the expression  a*b*  contains all the strings of a’s and b’s  in which all the  a’s come before all the b’s.
                        language ( a*b* ) = {    a   b   aa   ab   aaa   aab   abb   bbb   aaaa . . . }
     Note that ba  and  aba  are not in this language. Notice also that there need not be the same number of a’s  and b’s.   a*b*  ≠  ( ab )*
                                               
Example :  The following expressions both define the language L2  =  { xodd }  x ( xx )*  or  ( xx )* x  but the expression  x*xx* does not  since it contains the word ( xx ) x ( x ) .


         We introduce the use of the  +  sign. By the expression  x +  y , where  x and y are strings of characters from  an alphabet, we mean “either x or y”. This means that  x + y  offers a choice, much the same way  that x*  does.
     
Example :   Let T be the language defined over the alphabet  S = { a, b, c }.
      T = { a   c   ab   cb   abb   cbb   abbb   cbbb   abbbb   cbbbb . . . }  All words in T begin with an a or a c and then followed by some number of b’s. Symbolically, we may write this as
                        T = language (( a + c) b*) = language ( either a or c then some b’s )
         We say that the expression ( a + c ) b*  define a language in the following sense.  Every +  and *  ask us to make a choice.  For each  *  or + used as a superscript  we must select some number of factors for which it stands.  For each other + we must decide whether to choose the right side expression  or the left side expression. For every set of choices, we have generated a particular string. The set of all strings produceable by this method is the language of the expression.  In  ( a + c )b* , we must choose either the a  or the c  for the first letter and then we choose how many b’s the b* stand for. Each set of choices is a word. If from ( a + c ) we choose  c  and we choose b* to mean bbb, we have the word cbbb formed.   

Consider the finite language L that contains all the strings of a’s  and  b’s  of length exactly three.
                                            L  =  { aaa   aab   aba   abb   baa   bab   bba   bbb }  
The first letter of the word in  L  is either  a  or  b. The second letter  of each word in L is either an a  or  a  b.  The third letter of each word in L is  either  an a  or  a   b. So we may write      
                                    L =  language ( ( a + b ) ( a + b ) ( a + b ) ( a + b ))  or in short we may write 
                                                            L = language (  (  a + b ) 3  )   
    If we want to define the set of all seven letter strings of a’s and b’s, we could write ( a + b )7. In general, if we want to refer to the set of all possible strings of  a’s and b’s  of any length, we could write :
                                                            ( a + b )*
            This set of all possible strings of letters from the alphabet  Σ = { a, b } including the null string.
 This expression represents a language and if we decide that * stands for 5, then ( a + b )*  =  ( a + b )5 and  is equal to (  a + b ) ( a + b ) ( a + b ) ( a + b ) ( a + b ).  Now we have to make more five choices : either a or b for the firs letter, either a or b for the second letter . . . . .
            All words that begins with letter a is given by   a ( a + b )*
            All words that begin with an  a  and end with a b is given by   a ( a + b ) * b

EXAMPLE:
            Consider the language defined by the given expression   a( a + b )* a ( a + b )*
       At the beginning, we have ( a + b )*, which stands for anything, that is any string of a’s and b’s, then comes an a, then another anything.  
            All told, the language is the set of all words over the alphabet S = { a, b }  that have a in them somewhere.  The only words left out are those that have only b’s and the word L.

      The language of all words that have at least two  a’s  can be described by the expression :
            ( a + b )* a ( a + b )* a ( a + b )*
            = ( some beginning ) (  the first important a )( some middle )( the second important a )( some end )
                 where the arbitrary parts can have as many  a’s  ( or b’s ) as they want.    

EXAMPLE : Another expression that denotes all words with at least two  a’s  is :   b*ab*a ( a + b )*. 
We can scan through jungle of b’s ( or no b’s ) until we find the first  a, then more b’s ( or no b’s, then the second a, then finish up with anything. In this set are  abbbabb  and aaaaa.   
We can write  (a + b)*a( a + b) * a (a + b)* =  b* ab* a ( a + b )*  where the equal ( = ) sign do not mean that these expressions are equal algebraically in the same way as  x + x = 2x  but that they are equal because they describe the same item,  as with  ruby jubilee  =  40 years   or  16th president = Abraham Lincoln.
  Thus we could write    language ((a + b )* a( a + b )* a ( a + b)*)
                                    = language ( b*ab *a ( a + b )*
                                    =  all words with at least two  a’s.

         To be careful about this point, we say that two regular expressions are equivalent if they describe the same language. The expressions below also describe the language of words with at least two a’s :



EXAMPLE :
            If we wanted all the words with exactly two a’s, we use the expression b* ab* ab*  which describes such words as  aab, baba, and bbbabbbab. To make the word aab, let the first and second b become L and the last b become b.   n

EXAMPLE :  The language of all words that have at least one a and at least one b is somewhat trickier. If we  write :   ( a + b )*a ( a + b )* b ( a + b )*
             = ( arbitrary ) a ( arbitrary ) b ( arbitrary ),  
we are then requiring that an a precede a b in the word. Such words  as ba and bbaaaa are not included in this set. Since we know that either the a comes before the b or the b comes before the a, we could define this set by the expression 
                                    ( a + b )* a ( a + b )* b ( a + b )*  +  ( a + b )* b ( a + b )* a ( a + b )*
Here we are still using the plus sign in the general sense of disjunction ( or ).

            There is a simpler expression that defines the same language. If we are confident that the only words that are omitted by the first term    ( a + b )* a ( a + b )* b ( a + b )*   are the words of the form some b’s  followed by some a’s, then it would be sufficient to add these specific exceptions into the set. These exceptions are all defined by the regular expression :   bb* aa*

            The language of all words over the alphabet   Σ = { a , b } that contain both an a  and  a b is therefore also defined by the expression :    ( a + b )* a ( a + b )* b ( a + b )*  +  bb* aa*
     NOTE that it is necessary to write  bb* aa* because b* a* will allow words we do not want, such as  aaa.  

These LANGUAGE-DEFINING  expressions cannot be treated like algebraic symbols.  We have shown that in the following expression

  ( a + b )* a ( a + b ) * b ( a + b )* + ( a + b )* b ( a + b )* a ( a + b )* = ( a + b )*a ( a + b )* b ( a + b )* + bb* aa*     

 The first terms on both sides of this equation are the same, but if we cancel them, we get

                        ( a + b )* b ( a + b )* a ( a + b )*  =  bb* aa*    which is false, since the left side includes

the word aba, which the expression on the right side does not.  The only words that do not contain both an a and a b in them somewhere are the words of all  a’s,  all  b’s  or  .  When these are added into the language, we get everything. Therefore the regular expression :
                  ( a + b )* a ( a + b )* b ( a + b )*  +  bb* aa* + a* + b* defines all possible strings of  a’s  and  b’s.

The word is included in both a*  and  b*.

            ( a + b )* = ( a + b )* a ( a + b )* b ( a + b )* + bb* aa* + a* + b*  

         is not a very obvious equivalence.   
  
LANGUAGE-DEFINING EQUIVALENCES :

                        ( a + b )*  = ( a + b )*  +  ( a + b )*
                        ( a + b )*  = ( a + b )* ( a + b )*
                        ( a + b )*  = a( a + b )*  +  b( a + b )*  +
                        ( a + b )*  = ( a + b )* ab ( a + b )*  +  b* a* 


      NOTE : These equivalences should not be treated like algebraic polynomials.

            The last of  these equivalences means that all the words that do not contain the substring ab are
all  a’s ,  all b’s,  ,  or some b’s  followed by the plus ( union sign ) alone. 

EXAMPLE:  Let V be the language of all strings of a’s and b’s in which the strings are either all b’s or else 
        there is an a followed by some b’s. Let V also contain the word  .
        V = {   a  b  ab  bb  abb  bbb  abbb  bbbb . . . } V can be defined by the expression b* + ab* where the
            word    is included in the term b*.  Alternately, we could define V  by the expression :  ( + a ) b* 
           
 This implies that in front of the string of b’s  we have the option of either adding an  a  or nothing. Since we could always write  b* = b*, we have what appears to be some sort of distributive law at work.

                        b* + ab* = ( + a )b* .  b*  is factored out just as in algebra. It is also because of this analogy in
algebra that we denoted our disjunction by the plus sign instead of the union sign U or the symbolic sign V.   
WE HAVE A HYBRID SYSTEM: the * is somewhat like an exponent and the + sign is somewhat like addition. But the analogies to algebra should be approached very suspiciously, since addition in algebra never means choice and algebraic multiplication has properties different from concatenation ( even though we sometimes conveniently refer to it as product ).
                                                ab  =  ba    in algebra   
                                                ab    ba    in formal languages





            Consider the language    T = { a   c   ab   cb   abb   cbb . . . }. 

                                    T can also be defined as in the above example as  ( a + c )b* 
                                                but it can also be defined by  ab* + cb*. 

REGULAR EXPRESSIONS : 
   The symbols that appear in regular expressions are : 
                        1.  the letters of the alphabet  Σ,
                        2.  the symbol for the null string ,
                        3.  the parentheses (  )                         
                        4.  the star operator *  and
                        5.  the plus sign.
  Definition :
            Rule  1:  Every letter of  Σ  can be made into a regular expression by writing it in boldface; 
                         is a regular expression.
            Rule  2 :  If  r1  and  r2  are regular expressions, then so are  ( r1 )   r1r2    r1 + r2    r1*.
            Rule  3 :  Nothing else is a regular expression.  

  We have not included the plus sign as a superscript  r1+  as part of the definition because  r1+  =  r1 r1*

  Definition :
            If S and T are sets of strings of letters ( whether they are finite or infinite sets ), then the product set of strings of letters is :   ST = { all concatenation of a string from S concatenated with a string from T }
            EXAMPLE :
                                    If  S =  { a   aa   aaa }   and   T =  { bb   bbb }  then 
                                      ST =  { abb    abbb    aabb    aabbb    aaabb    aaabbb }    
                 NOTE :  These words are not in proper order.        

            EXAMPLE 1 :   If  S = { a   bb   bab }    and  T =  { a   ab }   then  
                                             ST =  { aa   aab   bba   bbab   baba   babab  }    

            EXAMPLE 2 :   If   P = { a   bb   bab }     and     Q = {    bbbb }   then
                                            PQ = { a   bb   bab   abbbb   bbbbbb   babbbbb }     

            EXAMPLE 3:   If  M = {    x   xx   xxx }  and  N = {    y   yy   yyy   yyyy  yyyyy . . .  }
                                     then  MN = {    y   yy   yyy   yyyy  yyyyy . . .
                                                x   xy   xyy   xyyy   xyyyy   xyyyyy . . .
                                                xxy   xxyy   xxyyy   xxyyyy  xxyyyyy . . . }
                                                xxxy   xxxyy   xxxyyy   xxxyyyy   xxxyyyyy . . . }

            Using regular expressions, these four examples can be written as :
                                    ( a + aa + aaa )( bb + bbb ) = abb + abbb + aabb + aabbb + aaabb + aaabbb
                                    ( a + bb + bab )( a + ab )   = aa + aab + bba + bbab + baba + babab
                                    ( a + bb + bab )( + bbbb ) = a + bb + bab + ab4 + b6 + bab5
                                    ( + x + xx )( y* ) =  y* + xy* + xxy*

EXAMPLE: In SPANISH  and JAPANESE are their usual languages, then the product SPANISHJAPANESE  is the language of all strings that start with a SPANISH word and finish with a JAPANESE word. Some words in this language are mesatokei, bonitooishi and pasearnomimasu.

It might not be clear why we can not just leave the rules for associating a language with a regular expression on the informal level, with the expression “ make choices for + and * ”. The reason is that the informal phrase “make choices” is much harder to explain precisely than the formal mathematical presentation below.  We are now ready to give rules for associating a  language with  every  regular  expression.  The  method  for  doing  this  is  given
recursively.

Definition :  The following rules define the language associated with any regular expression.
      Rule 1  :  The language associated with the regular expression that is just a single letter is that one-letter  
                     word alone and the language associated with L is just { } ,  a one-word language.
     Rule  2  :   If r is a regular expression associated with the language L1  and r2 is a regular expression
                      Associated with the language L2 then, 
                        ( i ) The regular expression ( r1 ) ( r2 ) is associated with the language L1 times L2
                                                language ( r1 r2 ) = L1L2
                        ( ii ) The regular expression  r1 + r2 is associated with the language formed by the union
                                of the sets L1 and L2.
                                                            language ( r1 + r2 ) = L1 +  L2

                        ( iii ) The language associated with the regular expression (r1 )* is  L1*, the Kleene
                                  closure of the set L1 as a set of words.
                                                            language ( r1* ) = L1* 

Once again this collection of rules proves recursively that there is some language associated with every regular expression.  The rules seem to show us how we can interpret the regular expression as a language. But they do not really tell us how to understand the language. By this we mean that if we apply the rules above to the regular expression
                                                ( a + b )* a( a + b )* b( a + b )* + bb* aa*

we can develop a description of  some language, but can we understand that this is the language of all strings that have  an a  and a  b in them ? This is question of meaning. This correspondence between regular expressions and languages leaves open two other questions. (1 ) Is there some way of telling of telling when this happens ? “Way” means an algorithm which can be taken in the later chapter to determine whether or not two regular expressions define the same language.
        We have seen that every regular expression is associated with some language. (2 ) Is it true that every language can be describe by a regular expression ?  In the next theorem we show that every finite language can be defined by a regular expression.

Theorem 5 : 
            If L is a finite language (a language with only finitely many words), then L can be defined by a regular expression.
      Proof : To make one regular expression that defines the language L, turn all the words in L into boldface type and stick pluses between them. 
      For example, the regular expression that defines the language
                        L = { baa   abbba   bababa }   is   baa  +  abbba  +  bababa     

  If  L = { aa  ab ba bb }, the algorithm described above gives the regular expression  aa + ab  +  ba  +  bb                     
  Another  regular expression that defines this language is  ( a + b ) ( a + b )  so the regular expression need not be unique. The reason this trick only works for finite languages is that an infinite language would become a  regular  expression that is infinitely long, which is forbidden.   



EXAMPLE :
                        Let      L  =  {   x   xx   xxx   xxxx   xxxxx } 
The regular expression we get from the theorem is :  + x  +  xx  +  xxx  +  xxxx  +  xxxxx   
A more elegant regular expression for these language is   ( + x ) 5  Of course  5 is, strictly speaking, not a legal symbol for a regular expression although it is understood that it means
                                    ( + x ) ( + x ) ( + x ) ( + x ) ( + x )                            

            Consider the expression :   ( a  +  b )*( a a   +  b b ) ( a  +  b )*
This is the set of strings of  a’s  and  b’s  that at some point contain a double letter. We can think of it as 
                                                ( arbitrary ) ( double letter ) ( arbitrary )

Let us now ask, “What strings do not contain a double letter ?” Some examples are  :    a   b   ab   ba   aba   bab   abab baba . . . The expression  ( ab )* covers all of these except those that begin with  b  or  end in a. Adding these choices gives us the regular expression ( +  b ) ( ab )* ( +  a )                                      

       Consider the regular expression below :
               E  =   ( a  +  b )* a ( a  +  b )* (  a  +  ) ( a  +  b )* a (  a +  b )*     
                   =   ( arbitrary ) a ( arbitrary ) [  a or nothing ] ( arbitrary ) a ( arbitrary )

One obvious fact is that is that all the words in the language of  E  must have  at least two  a’s in them.  Let us break up the middle plus sign into its two cases :  either the middle factor contributes an  a  or else contributes a  . Therefore :
      E  =  (  a  +  b )* a ( a  +  b )* a ( a + b )* a ( a + b )* + ( a + b )* a ( a + b )* ( a + b )* a ( a  + b )*

This is a more detailed use of the distributive law. The first term above clearly represents all words that have at least three  a’s  in them. Before analyzing the second term observe that   
                                                            (  a  +  b )* (  a  + b )*  
which occurs in the middle of the second term is only another way of saying “any string whatsoever” and could be replaced with the more direct expression  
                                                            (  a  +  b )* . 
This would reduce the second term of the expression  to 
                                                            (  a  +  b )* a (  a  + b )* a (  a  +  b )*
Which we have already seen in a regular expression representing all words that have at least  two  a’s  in them. Therefore the language  associated with  E  is the union of all strings that have three or more  a’s  with all strings that have two or more  a’s . But since all strings with three or more a’s  are themselves already strings with two or more a’s ,  this whole language is just the second set alone. The language associated with  E  is no different from the language associated with 
                                                            ( a  +  b )* a (  a  +  b )* a (  a  +  b )*    
which has been examined before.   

It is possible by repeated application of the rules for forming regular expressions to produce an expression in which the star operator  is applied to a subexpression  that already has a star in it.
                                    ( a   +   b* )*   ( a a  +  a b* )*      (( a   +  b b b a* )  +    b a* b )*  
 In the first of these expressions, the internal  *  adds nothing to the language.
                                                            ( a  +  b* )*   =  ( a  +  b )* 
 since all possible strings of  a’s  and  b’s  are described by both expressions.  Also in accordance  with theorem 1,
                                                                        ( a* )*  =  a*

             However,         ( a a  +  a b* )*    ( a a  +  a b )* 
since the language for the expression on the left includes the word abbabb, which the language on the right does not. ( The language defined by the regular expression on the right cannot contain any word with double b ).



Example :  Consider the regular expression 
                                                            ( a* b* )*         
The language defined by this expression is all strings that can be made up of factors of the form  a* b* , but since both the single letter  a  and the single letter  b  are  words of the form  a* b* ,  this language contains all strings of  a’s  and b’s.  It cannot contain more than everything,  so
                                                             ( a* b* )* =  ( a  +  b )*                                   




 

No comments:

Post a Comment