20070611

The impossible is only possible sometimes.

I've been working at creating a regular expression engine in Haskell again and have come up with what feels like a fairly elegant solution. Through a series of functions the following conversions happen :


print $ test -- "a|b?c|d"
print $ lfn test -- ["a","|","b","?","c","|","d"]
print $ p2p $ lfn test -- ["a","|","?b","c","|","d"]
print $ gs Top [] $ p2p $ lfn test -- [["a"],["?b","c"],["d"]]


Now I will convert this last output into a series of rules for what to do at certain points of evaluation, using the information at http://swtch.com/~rsc/regexp/ as a guide to how good regular expressions work.

I want to perform the following conversion :


[["a"],["?b","c"],["d"]] -> ( [ [ (Just 'a', 4 ) , (Nothing , 2) , (Just 'd' , 4 ) ] ,
[ (Just 'b', 3 ) , ( Nothing , 3 ) ] ,
[ (Just 'c', 4 ) ]
]
,
4 )


A parser for such a rule set should be simple enough, and once created I will use the Array module to make it go at a decent speed.

I've coded up through the transformation into a ruleset, and have hit an interesting boundary. When designing my solution I remembered an interesting article I had seen when initially starting to look at Haskell wherein the author was coding an assembler and passed the output in as one of the initial arguments. The idea intrigued me.

The algorithm I derived to make my transformation used this concept judiciously. Unfortunately it does not work, though perhaps not for the reasons one might guess.

Here is the code :



-- rx : a regular expression engine for haskell
-- : (c) 2007 by michael speer , released GPL v2.0 license or later

import Maybe

-- get-function : gather a nodes worth of pattern from a pattern-string
-- returns the segment and the remaining pattern in a tuple
gf :: [Char] -> ( [Char] , [Char] )
gf ('(':ps) = let ( segment , remaining ) = gpf [] ps
in ( '(':segment , remaining )
where
-- get-parenthesis'd fragment
gpf :: [Char] -> [Char] -> ( [Char] , [Char] )
gpf segment [] = ( segment , [] )
gpf segment (')':ps) = ( segment++")" , ps )
gpf segment ('(':ps) = let ( subsegment , remaining ) = gpf [] ps
in gpf ( segment ++ ('(':subsegment) ) remaining
gpf segment (p:ps) = gpf (segment ++ p:[]) ps
gf ('\\':p:ps) = ( '\\':p:[] , ps )
gf ('?':'?':ps) = ( "??" , ps )
gf ('*':'?':ps) = ( "*?" , ps )
gf (p:ps) = ( p:[] , ps )

-- list-functions : creates a list of function-pattern-segments from a regular expression
lfn :: [Char] -> [[Char]]
lfn [] = []
lfn pps = let ( fn , remaining ) = gf pps
in fn : lfn remaining

-- postfix-to-prefix : moves all repitition characters to before their targets
-- in addition running the pattern through this function
-- ensures that the pattern is legal.
p2p :: [[Char]] -> [[Char]]
p2p [] = []
p2p (('?':_):_) = error "Misplaced repitition operator `?' "
p2p (('*':_):_) = error "Misplaced repitition operator `*' "
p2p ("\\":_) = error "Escape sequence targetting nothing"
p2p ((')':_):[]) = ")" : [] -- allows for a close-parum at end of pattern
-- later processing must test for this and throw error if it is unexpected ( eg top-level
p2p ((')':_):_) = error "Unmatched terminating group marker , `)' " -- catches mid pattern erroneous close-para
p2p ("|":('?':_):_) = error "Repitition operator `?' applied to procedural alternation operator `|' "
p2p (s1:s2@("??"):ps) = ('?':'?': s1) : p2p ps
p2p (s1:s2@("?"):ps) = ('?': s1) : p2p ps
p2p (s1:s2@("*?"):ps) = ('*':'?': s1) : p2p ps
p2p (s1:s2@("*"):ps) = ('*': s1) : p2p ps
p2p (s1:ps) = s1 : p2p ps

-- group-split : break the contents of a parenthesized group into peices
-- build a current list of segments until | is encountered
-- then append the result of calling this again on the
-- remainder
data PatternDepth = Top | Sub -- Pattern depth
gs :: PatternDepth -> [ [ Char ] ] -> [ [ Char ] ] -> [ [ [ Char ] ] ]
gs Top segments (")":[]) = error "Unmatched terminating group marker `)' at end of pattern"
gs Sub segments (")":[]) = gs Sub segments []
gs _ segments [] = if segments /= [] then
segments : []
else
[""] : []
gs pd segments ("|":rest) = segments : gs pd [] rest
gs pd segments (s:rest) = gs pd (segments++[s]) rest

-- many to single nodes
m2s :: [ [ ( Maybe Char , Integer ) ] ] -> Integer -> ( ( Maybe Char , Integer ) , [ [ ( Maybe Char , Integer ) ] ] , Integer )
m2s (g@(ni:[]):gs) n = ( ni , gs , n ) -- if first group is one long : decrease the count by one so that popping the top item off mess up the counting
m2s ggs n = ( ( Nothing , (n+1) ) , ggs , (n+1) ) -- otherwise add a new member at the beginning to act as a pointer to it

-- or-extracted-nodes
oexn :: [ [ [ Char ] ] ] -> Integer -> Integer -> ( [ ( Maybe Char , Integer ) ] , [ [ ( Maybe Char , Integer ) ] ] , Integer )
oexn (g:[]) n l = let ( ns , x ) = ( aexn g x l )
in ( [] , ns , x )

oexn (g:g':[]) n l = let ( ( ns , x ) ,
( ni' , ns' , x' ) ,
( ns'' , x'' ) ,
( ni''' , ns''' , x''' ) ) = ( aexn g x' l ,
m2s ns n ,
aexn g' x''' l ,
m2s ns'' x )
in
( ni' : ni''' : [] , ns' ++ ns''' , x'' )

oexn (g:gs) n l = let ( ( ns , x ) ,
( ni' , ns' , x' ) ,
( ni'' , ns'' , x'' ) ) = ( aexn g x' l ,
m2s ns n ,
oexn gs x l )
in
( ni' : ni'' , ns' ++ ns'' , x'' )

-- and-extracted-nodes
aexn :: [ [ Char ] ] -> Integer -> Integer -> ( [ [ ( Maybe Char , Integer ) ] ] , Integer )
aexn (b:[]) n l = exn b n l
aexn (b:bs) n l = let ( ( ns , x ) ,
( ns' , x' ) ) = ( exn b n x ,
aexn bs x l )
in
( ns ++ ns' , x' )

-- extract-node
exn :: [ Char ] -> Integer -> Integer -> ( [ [ ( Maybe Char , Integer ) ] ] , Integer )
exn ('(':cs) n l = let ( ni , ns , x ) = oexn (gs Sub [] $ p2p $ lfn cs ) n x
in
if ni == [] then
( ns , x )
else
( [ ni ] ++ ns , x )
exn (c:_) n l = ( [ [ ( Just c , (n+1) ) ] ] , (n+1) )


test = "a|b?c|d"

main = do
print $ test
print $ lfn test
print $ p2p $ lfn test
print $ gs Top [] $ p2p $ lfn test
--print $ let ( ni , ns , x ) = oexn (gs Top [] $ p2p $ lfn test ) 1 (x+1)
-- in ( ni : ns , x )
print $ let ( ns , x ) = aexn [ "a" , "b" , "c" , "d" ] 1 x
in ( ns , x )
--
--print $ let ( ( ns , x ) ,
-- ( ni' , ns' , x' ) ) = ( aexn [ "abc", "d" ] x' x ,
-- let ( (ggs@((g@(ni:ns)::[(Maybe Char,Integer)]):gs)) , a ) = ( [ns] , 1 )
-- in if ns == [] then
-- ( ni , gs , a )
-- else
-- ( ( Nothing , a+1 ) , ggs , a+1 ) )
-- in ( ns , x )



Following the `print $ aexn ...' function above, you can see how the concept works. The oexn portion fails however.

I believe the reason is the way that the compiler handles the mutual dependence of the arguments and outputs of the two functions. It appears that instead of passing a marker that a value is "in progress" when dealing with two separate functions, each one simply calls the other anew to try to generate it. Since the functions rely on each others output to work, they loop trying to instantiate one another.

A single function is allowed to accept as input its own output ( as is demonstrated in the above working code ) but trying to do this with mutually dependent functions causes them to enter an endless loop that smashes the stack.

For any curious I compile the above with `The Glorious Glasgow Haskell Compilation System, version 6.6' on `Linux version 2.6.20-16-generic (root@terranova) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #2 SMP Thu Jun 7 20:19:32 UTC 2007'.

I looked to see if there were options that might allow me to make the system realize that the calls need to be lazy to a single call of the function, but I do not see any. Instead the system continues being confused into stack overflow by the dependence.

The last function call ( commented out ) had me trying to inline the m2s function, but it caused the same error ( presumably because the compiler just converted the sample into a lambda that caused the same dissonance in gathering the data.

Ah well. Maybe in 6.7 ...

About Me

(2) the subculture of the compulsive programmer, whose ethics prescribe that one silly idea and a month of frantic coding should suffice to make him a life-long millionaire. --ewd1036