20071129

A context manager for temporary memoization.

This is very much a fragile hack at the moment. It's an interesting idea I think. I was disappointed when I initially found that the with_statement syntax does not restore the value of the `as var` upon completion.

This made doing something along the lines of


with temporarily_memoized( func ) :
for datum in data :
func( datum )

unattainable.

Well, just a lot hackier actually.

Thus temporarily_memoized( ... ) was born :

#!/usr/bin/python

# a context manager for temporarily turning any function into
# a memoized version of itself.

from __future__ import with_statement
import contextlib , sys , types

def memoize ( func ) :
""" In haskell mutability must be handled explicitly.
Only fair that Python do the same to transparent functionality
"""
remembered = {}
def memoized ( *args ) :
""" memoized version of function """
if args in remembered :
return remembered[ args ]
else :
new = func( *args )
remembered[ args ] = new
return new
return memoized

@contextlib.contextmanager
def temporarily_memoized ( func ) :
""" memoize the given function for the duration of this block
save anything in the local variables that gets in the way of
using this so that it can be restored afterward , the memoized
version is found in the locals. use on actual functions only.
no members. """

# this is being called, there has to be a frame above it
frame = sys._getframe().f_back.f_back

if func.func_name in frame.f_locals :
f = frame.f_locals[ func.func_name ]
frame.f_locals[ func.func_name ] = memoize( func )
try :
# this hack replaces whatever in the local scope
# has the name of the variable. if you try to use
# the 'as whatever' portion of the syntax , you
# are doing it wrong
yield None
finally :
frame.f_locals[ f.func_name ] = f
else :
frame.f_locals[ func.func_name ] = memoize( func )
try :
yield None
finally :
del frame.f_locals[ func.func_name ]

def fib(n):
""" biggus fibbus """
if n == 0 or n == 1:
return n
else:
return fib(n-1) + fib(n-2)

if __name__ == '__main__' :
print fib.__doc__
with temporarily_memoized( fib ) :
print fib.__doc__
for i in xrange( 36 ) :
print "n=%d => %d" % (i, fib(i))
print fib.__doc__
print fib.__doc__


I've played around with replacing the func_code and related variables in order to create a true proxy, but the requirement that the func_code and func_closure agree exactly on the number of local variables and that they refuse to set independently or be tricked into simultaneous setting by calling types.FunctionType.__init__( target , proxy-code , ... ) has thwarted my attempts at this. I considered creating a variety of proxy functions each requiring a different number of free variables and then selecting the appropriate one, which would probably work, but it seems too much the wrong way to do things.

20071128

Don's benchmark.


It appears the mob has spoken and that I am a wrong headed twit.

My apologies to Don on the misunderstanding. When I first saw the numbers I assumed that the speed had to be due to memoization. It's not.


I saw a link earlier to a post by Don Stewart in that he asserts that Haskell is 40 times faster than python for a given algorithm.

He then goes on to point out how easy it is to drop hints to the code to parallelize the code.

While I won't go into any threading issues, the core point that it takes python almost thirty seconds to generate each Fibonacci number up to 36 is misplaced.

Haskell is a language built around the idea of functions that can be assumed to be functionally transparent. If you give them the same set of arguments they'll give the same answer.

The upside to this is that the compiler writers can automatically memoize every function if they want to unless its been explicitly marked as being otherwise.

Python, being based upon the idea of dynamic objects rather than that of transparent functions has no need for such optimizing. The developers time is better spent ensuring rapid dictionary insertion and deletion over transparent function memoization.

But, as he doesn't seem to mind having to annotate concurrancy, I'm sure that annotating transparency is just as fair.

#!/usr/bin/python

# Don's original python
# def fib(n):
# if n == 0 or n == 1:
# return n
# else:
# return fib(n-1) + fib(n-2)
#
# for i in range(36):
# print "n=%d => %d" % (i, fib(i))

def memoize ( func ) :
""" In haskell mutability must be handled explicitly.
Only fair that Python do the same to transparent functionality
"""
remembered = {}
def memoized ( *args ) :
if args in remembered :
return remembered[ args ]
else :
new = func( *args )
remembered[ args ] = new
return new
return memoized

@memoize
def fib(n):
if n == 0 or n == 1:
return n
else:
return fib(n-1) + fib(n-2)

for i in range(36):
print "n=%d => %d" % (i, fib(i))


The above code runs at a comfortable
  real    0m0.020s
user 0m0.012s
sys 0m0.008s


which compared to his times

  Ruby (1.8.5)   64.26s
Python (2.4) 25.16s
Haskell (GHC 6.8) 0.78s
Parallel Haskell (GHC 6.8) 0.67s


fairs reasonably.

Since Python allows mutable variables a looping implementation would have been more idiomatic python. Even better to use a generator. Then we can start output before the sequence is complete.

All his benchmark really showed was that memoization on transparent functions is really great. Thanks guy.

20071113

Random notes on query languages.

I started to include this in my post concerning my new language, but it relly isn't very helpful to the article. I want to keep the information around for myself later. Not all of these are generic information query languages, but all are examples of languages designed to ask for something.

QL :
Compiles to SQL. A combination of Datalog and SQL. Looks like it has a form of list comprehension. That could be useful.

Common Query Language :
This seems to be an attempt to standardize what thed user has to type into the innumerably implemented searches online and off. Simplicity is always nice to consider.

D :
Apparently this was a truly relational language. Google tells me concerning the site pointed at for it from wikipedia, "This site may harm your computer". I think this one best left alone.

Data Definition Language :
This appears to be a fuzzy logic belief system buried in a syntax molded to look like SQL.

Datalog :
Logic programming. ancestor(X,Y) := Parent(X,Y). The ancestor of X is Y if the Parent of X is Y. Could be useful. Looks rather constraining in what it could do easily.

ERROL :
This just led me to read about entity-relationship models, finally giving name to a project I created out of frustration with data manipulation about a year ago. This is a great system for data, this particular language is a natural language version interface. Better for people that machines to query with.

ISBL :
I found references to this but could not locate an example of its syntax.

LDAP :
This isn't a query language I hear you saying. Well, you're right. But I'm looking at data query, and this does query data, albeit not relational. The manner LDAP uses to filter datasets is nice. Very combinable, readable and easily created programatically. On being nice, forcing everything into a strict hierarchy is not.

MQL :
Not so much a language as a search API for finding structures and substructures in molecular simulations.

MDX : MultiDimensional eXpressions
This appears to have syntax borrowed from SQL with the addition of prepartitioned data and some niceties in reaching it. There are some interesting ideas at work here, I will have to look into them in the future.

OQL :
The binding of SQL tables and their associated columns to devices and their exposed attributes. The interesting difference is that instead of just returning scalars it can also return objects that can be traversed using dot-member syntax. The aggregation method differs from standard SQL in dividing items into groups that can then be referenced by values and subqueries using the keyword partition.

OPath :
A combination of API and string segment arguments that allow for locating items in the WinFS store. Apparently there will be a natural language to OPath translator. It looks like you acquire a query-object into a potion of the WinFS store programatically and then pass it a string specifying the boolean operations to indicate which items from the current partition of the store are desired.

QUEL :
SQL with a more complex from statement involving manipulating ranges and dropping the select keyword.

SPARQL :
This is a language designed for querying information found in RDF files. I applaud them for not merely repurposing the syntax of SQL to the task. It seems that many think this is what creating a query language means. The interesting part of the query consists of the where clause. Apparently the items returned are generated by specifying a series of interconnected assertions that whenever met return a clump of information consisting of all the specified variables that where intermixed with the assertions being bound to the values that allowed the assertions to match. It looks neat, but it bothered me that just glancing at it the query did not produce a mental model of how it operated. Maybe it was just me.

SQL :
The current cream of the crop in relational data manipulation, SQL has been allowing us to keep an external dataset and query and alter it in standard ways for more than thirty years. Of course the primary part that seems to be standard is mostly in the fashion that queries are marked up and some general agreement on what terms to use to indicate the various ways of stitching together tables of data. In general anything so clever as to be particularly interesting or useful is custom to only a given database and the few that might choose to copy them. For a standard language, the implementation specific libraries do much to unstandardize it. The primary limitation of the language in my experience is that anything you want to do must be expressed through the metaphor of stitching together tables of information and of course returning a single table of information resulting from this. Many clever hacks are used to bypass these limitations in practice of coarse, but trying to do anything other than simple column binds can become very confusing to look at very quickly.

WQL :
Microsofts query only SQL for WMI. A hidden SQL server on every server and workstation of a domain that only understands select queries. Useful.

20071111

Restructuring query languages.

SQL is the de facto language for the manipulation of relational data. Designed in the early 70s and standardized in the mid-80s, SQL has dozens of commercial and open-source servers written for it and enjoys being a language with almost no major competitors in its field.

It seems that all major relational databases are basically interpreters for various flavors of SQL. I've looked around from time to time for a decent alternative to SQL.

I haven't found anything interesting that was particularly useful out there.

I'm going to present for comments some of the ideas I have come up with to fix this. As you read below, know there is not yet an actual interpreter for my language. There will be, but I am not yet finished with it. I am writing an interpreter based on sqlite3 using lemon to create the parser and targeting the sqlite3 virtual machine which I will extend very minimally to handle some functionality I want that doesn't quite fit into their model.

My language is very much based on the entity-relationship model. SQL is often used to model this approach. It can be clumsy to fit into the tables of data paradigm, but it can.

The easiest way to explain my language of course will be to show a brief interpreter session. So here it goes.


Interpreter for AsYetUnnamedProgrammingLanguage by MichaelSpeer v0.-1

>

Let's begin by creating an entity to manipulate. We'll use products, something nicely generic that are probably manipulated by everyone reading this at some point.

> declare
. Products
. :name PHRASE
. :price MONEY'USD
. ;

There you are. We have a database containing products, each of which has a member specifying the name ( PHRASE : a string with no newlines ) and a price ( MONEY : a numeric datatype insured to two decimal places without fear of truncation errors, rounding errors, poor approximations or anything else silly like that. The tick after the MONEY datatype is something for which I do not yet have a permanent terminology, though I like it. I'll call it a shadow type. It is a small comment that must be present on anything that interacts with the particular datatype. So 3'USD + :price is good, whereas 3 + :price is an error. This is just an enforced helper to prevent 3'YEN + 3'USD from not being an error and giving a six.

The syntax for creating some instances of the Products entity is as follows :

> create Products [ :name , :price ] =
. [ "DVD Player" , 20'USD ] ,
. [ "VCR" , 15.5'USD ] ,
. [ "DVD" , 7'USD ] ,
. [ "VCR Tape" , 4'USD ]
. ;

So now we have Products in the system.

> Products ;
[ 1 , 2 , 3 , 4 ]

> Products:id ;
[ 1 , 2 , 3 , 4 ]

These are selectors. Selectors select what information from where will be available for call-back to the application.

As you can see, if you do not give an expressions_list or end the selector with a valuating member, then it will default to the id of whatever the current object is.

> Products[ :id , :name ] ;
[ 1 , "DVD Player" ]
[ 2 , "VCR" ]
[ 3 , "DVD" ]
[ 4 , "VCR Tape" ]

By giving the expressions_list, the values in the selected attributes are returned to the caller.

So this doesn't allow anything interesting yet. The path to usefulness starts with filters.

> Products( :price < 5'USD )[ :name ];
[ "VCR Tape" ]

I mentioned earlier the difference between putting an expressions_list at the end of the selector vs placing an object_member there.

> Products( :price > 10'USD ):name ;
[ "DVD Player" , "VCR" ]

> Products( :price > 10'USD )[ :name ] ;
[ "DVD Player" ]
[ "VCR" ]

Of course all of this would be a huge pain if there was no way to sort them. Sorting expressions can be interpolated directly into the filters.

> Products( asc :name ):name ;
[ "DVD" , "DVD Player" , "VCR" , "VCR Tape" ]

So far we've looked at how to manipulate a single object type. Now let's add a new one to put these products into some categories.

> declare
. Categories
. :name PHRASE
. :sub-products -> Products:category single
. ;

The declaration of the Categories object-type and the :name member looks the same as last time. :sub-products is the new item. This line is declaring a relationship. The `Products:category single` section indicates that the relationship will interconnect Categories ( the current item ) and Products. The relationship can be referred to from a set of Categories using the :sub-products member and from a set of Products using the :category member. Furthermore, the caveat `single` means that any instance of a Products object may only relate to a single Categories set object. The caveat can appear on either or both sides.

Setting relationships is accomplished by placing a selector in the respective slot of the expressions_list being used to set the members. id est

> create Categories [ :name , :sub-products ] =
. [ "Video Players" , Products( :name in [ "DVD Player" , "VCR" ] ) ]
. ;

Here is part of what I like about this syntax. The `:name` term above is a relative selector. When a filter has a selector that isn't grounded it is assumed to be relative to the current object in the filters. So the `:name` above is relative to Products. A filter of that nature can be read as objects from Products where Products:name is "DVD Player" or Products:name is "VCR". There isn't a limit to how many filters are applied either.

If you specify a relationship member as part of the filters then at that point a unique set of items available at that point are selected.

> Products:category:name ;
[ "Video Players" ]

> Products[ :category:name ] ;
[]
[ "Video Players" ]
[]
[ "Video Players" ]

Using the expression_list causes a separate return for each item whereas selecting the attribute directly takes the unique set of values from that attribute.

The reason that this happens is because in the first item the :category filter is selecting the set of all unique categories referred to by the set of current Products. The second selector is selecting each of the current set of Products ( all of them since no filter is applied ) and then for each of them returning a list of the :name attributes on the Categories in the set returned by the :category entry.

One of the things that will be immediately noticed is that I intend for my language to be able to represent and send out complex structures instead of just flat tables. The instruction to process a sub-list is the alteration to the virtual machine that will need to be made. I hope to link the language to python first planning to take advantage of the bindings that already exist in the builtin sqlite3 module.

Here are some more useful examples.

> Products( :category ):name ; -- returns products that have a category
[ "DVD Player" , "VCR" ]

> Products( not :category ):name ; -- returns products lacking a category
[ "DVD" , "VCR Tape" ]

> Products( :id != 3 && :id != 4 )[ :name ] ;
[ "DVD Player" ]
[ "VCR" ]

I will add the other two products to a category to demonstrate some more complex queries.

> create Categories [ :name , :sub-products ] =
. [ "Blank Media" , Products( not :category ) ]
. ;

> Products( :category( :sub-products( :name == "DVD" ) ) ):name ;
[ "DVD" , "VCR Tape" ]

That second query is returning the names of Products that have Categories that have products in them named "DVD". The same query is entirely possible in SQL. Just far more complex.

This language doesn't do a whole lot that cannot be accomplished with SQL or other information engines. It does it easier. This, to me, is better.

Before I conclude we'll raise the price of media by a dollar per item and add another entry to Categories.

> alter Products( :category( :name == "Media" ) )[ :price ] = [ :price += 1'USD ] ;

> declare Categories:recommended -> :sub-products:recommended-by ;

This declaration creates an attribute that will be allowed to point at any number of items that are found in the same items :sub-products list.

> alter Categories( :name == "Media" )[ :recommended ] = [ :sub-products( :name == "DVD" ) ] ;

There will be a caveat called `automatic` that will allow adding items not already in :sub-products to :recommended wherein the added item would also be added to :sub-products and removing the recommended item from :sub-products would remove it from :recommended. Otherwise trying these things will result in an error.

The + sign will be a union operator against sets and the - sign will give all of the first set except those in the second set. As well union( ... ) and except( ... ) functions will likely exist.

This should be enough information to gather thoughts from others on. So, please leave any thoughts you have so I can roll them into the work I am doing.

The current Categories entity type could have been declared initially as :

> declare
. Categories
. :name PHRASE
. :sub-products -> Products:category single
. :recommended -> :sub-products:recommend-by
. ;


That should be it for now.

20071101

Multi-Argument map For perl

I've been playing with perl recently. I avoiding the language for a long time because I hit python first and foolishly believed the hoards when they decried everything about perl. It can be cryptic, but perl is far from being without merit.

One of the things that kept bugging me when programming in perl was that there was the lack of a multi-argument grabbing map. I'd want to chain against the values of a hash and found annoyingly that the syntax was unhelpful.

A hash in list context dumps its keys and values into a zipped list.

{ 'k1' : 'v1' , 'k2' : 'v2 , 'k3' : 'v3' } -> [ 'k1' , 'v1' , 'k2' , 'v2' , 'k3' , 'v3' ]

This pretty much screws any sane sort of action that can be taken against the data. So instead you have to use `keys' to get a list of just keys in a `for' statement and then use the keys to access members of the hash.

I didn't really appreciate this restriction, as it meant I couldn't easily massage data with an anonymous function by mapping a sub block across it.

Thus was born mapper.

mapper takes three arguments. well, this being perl it takes one huge list of arguments. But it treats them like (defun mapper ( n f &rest L ) ... ) where


  • n : the number of elements to peel off of the list and hand to the sub function

  • f : a sub function

  • L : the list of items



It isn't long.


sub mapper
{
# n - number of items to pass at a time to the subroutine
# s - the subroutine
# r - the list of things to chunk and pass to the subroutine
# x - number of items to chunk
# c - index
# l - mapped values

local @_ = @_ ;

my ( $n , $s , @r ) = @_ ;
my $x = @r ;
my $c = 0 ;
my @l ;

while( $x > $c )
{
push @l , &$s( @r[ $c .. ( ( $c + $n ) - 1 >= $x ? $x - 1 : ( $c + $n ) - 1 ) ] ) ;
$c += $n ;
}

return @l ;

}


From this I derived a quick transposition function just to play with it ( demonstrating its use )


sub transpose
{
local @_ = @_ ;
mapper 2 , sub { ( $_[1] , $_[0] ) } , @_ ;
}


The function is rather obfuscatory when nested ...


print join ' ' , mapper 3 , sub { '<' . ( join '' , mapper 1 , sub { '[' . $_[0] . ']' } , @_ ) . '>' } , qw( the quick brown fox jumped over the lazy brown dog ) ;
print "\n" ;


You'll see from the last that mapper when presented with fewer than the expected number of items at the end of a list just passes along what it has, which is likely unhelpful for debugging, but helpful for applying n-arary functions across lists.

20070726

The impossible is possible : Irrefutable patterns.

Earlier this year I wrote about a regular expression code engine I was writing in Haskell that didn't work because of a recursive call problem. I was fairly sure the code could work if I had a way to inform the compiler of the needed laziness, a way to allow a number of different functions that each depended on the output of the each other as their inputs to execute, but believed that the way Haskell worked would permanently prevent this. Earlier today, linuxer on Reddit linked to an article concerning Irrefutable patterns, and I had what felt like an epiphany.

Remembering the code, I went back and corrected it only by adding the needed tilde to specify that the patterns should be irrefutable in the oexn function definitions. A few bugs fixes later, and the code works.


-- rx : a regular expression engine for haskell
-- : (c) 2007 by michael speer , released GPL v2.0 license or later

import Maybe

-- get-function : gather a nodes worth of pattern from a pattern-string
-- returns the segment and the remaining pattern in a tuple
gf :: [Char] -> ( [Char] , [Char] )
gf ('(':ps) = let ( segment , remaining ) = gpf [] ps
in ( '(':segment , remaining )
where
-- get-parenthesis'd fragment
gpf :: [Char] -> [Char] -> ( [Char] , [Char] )
gpf segment [] = ( segment , [] )
gpf segment (')':ps) = ( segment++")" , ps )
gpf segment ('(':ps) = let ( subsegment , remaining ) = gpf [] ps
in gpf ( segment ++ ('(':subsegment) ) remaining
gpf segment (p:ps) = gpf (segment ++ p:[]) ps
gf ('\\':p:ps) = ( '\\':p:[] , ps )
gf ('?':'?':ps) = ( "??" , ps )
gf ('*':'?':ps) = ( "*?" , ps )
gf (p:ps) = ( p:[] , ps )

-- list-functions : creates a list of function-pattern-segments from a regular expression
lfn :: [Char] -> [[Char]]
lfn [] = []
lfn pps = let ( fn , remaining ) = gf pps
in fn : lfn remaining

-- postfix-to-prefix : moves all repitition characters to before their targets
-- in addition running the pattern through this function
-- ensures that the pattern is legal.
p2p :: [[Char]] -> [[Char]]
p2p [] = []
p2p (('?':_):_) = error "Misplaced repitition operator `?' "
p2p (('*':_):_) = error "Misplaced repitition operator `*' "
p2p ("\\":_) = error "Escape sequence targetting nothing"
p2p ((')':_):[]) = ")" : [] -- allows for a close-parum at end of pattern
-- later processing must test for this and throw error if it is unexpected ( eg top-level
p2p ((')':_):_) = error "Unmatched terminating group marker , `)' " -- catches mid pattern erroneous close-para
p2p ("|":('?':_):_) = error "Repitition operator `?' applied to procedural alternation operator `|' "
p2p (s1:s2@("??"):ps) = ('?':'?': s1) : p2p ps
p2p (s1:s2@("?"):ps) = ('?': s1) : p2p ps
p2p (s1:s2@("*?"):ps) = ('*':'?': s1) : p2p ps
p2p (s1:s2@("*"):ps) = ('*': s1) : p2p ps
p2p (s1:ps) = s1 : p2p ps

-- group-split : break the contents of a parenthesized group into peices
-- build a current list of segments until | is encountered
-- then append the result of calling this again on the
-- remainder
data PatternDepth = Top | Sub -- Pattern depth
gs :: PatternDepth -> [ [ Char ] ] -> [ [ Char ] ] -> [ [ [ Char ] ] ]
gs Top segments (")":[]) = error "Unmatched terminating group marker `)' at end of pattern"
gs Sub segments (")":[]) = gs Sub segments []
gs _ segments [] = if segments /= [] then
segments : []
else
[""] : []
gs pd segments ("|":rest) = segments : gs pd [] rest
gs pd segments (s:rest) = gs pd (segments++[s]) rest

-- many to single nodes
m2s :: [ [ ( Maybe Char , Integer ) ] ] -> Integer -> ( ( Maybe Char , Integer ) , [ [ ( Maybe Char , Integer ) ] ] , Integer )
m2s (g@(ni:[]):gs) n = ( ni , gs , n ) -- if the first list is one long, pop it off
m2s ggs n = ( ( Nothing , (n+1) ) , ggs , n+1 ) -- otherwise add a new member at the beginning to act as a pointer to it

-- or-extracted-nodes
oexn :: [ [ [ Char ] ] ] -> Integer -> Integer -> ( [ ( Maybe Char , Integer ) ] , [ [ ( Maybe Char , Integer ) ] ] , Integer )
oexn (g:[]) n l = let ( ~( ns , x ) ,
~( ni' , ns' , x' ) ) = ( aexn g x' l ,
m2s ns n )
in ( [ ni' ] , ns' , x )

oexn (g:g':[]) n l = let ( ~( ns , x ) ,
~( ni' , ns' , x' ) ,
~( ns'' , x'' ) ,
~( ni''' , ns''' , x''' ) ) = ( aexn g x' l ,
m2s ns n ,
aexn g' x''' l ,
m2s ns'' x )
in
( ni' : ni''' : [] , ns' ++ ns''' , (x''-1) )

oexn (g:gs) n l = let ( ~( ns , x ) ,
~( ni' , ns' , x' ) ,
~( ni'' , ns'' , x'' ) ) = ( aexn g x' l ,
m2s ns n ,
oexn gs x l )
in
( ni' : ni'' , ns' ++ ns'' , (x''-1) )

-- and-extracted-nodes
aexn :: [ [ Char ] ] -> Integer -> Integer -> ( [ [ ( Maybe Char , Integer ) ] ] , Integer )
aexn (b:[]) n l = exn b n l
aexn (b:bs) n l = let ( ~( ns , x ) ,
~( ns' , x' ) ) = ( exn b n x ,
aexn bs x l )
in
( ns ++ ns' , x' )

-- subgroup
exn :: [ Char ] -> Integer -> Integer -> ( [ [ ( Maybe Char , Integer ) ] ] , Integer )
exn ('(':cs) n l = let ~( ni , ns , x ) = oexn (gs Sub [] $ p2p $ lfn cs ) n l
in
( [ ni ] ++ ns , x )

-- non-greedy
exn ('?':'?':cs) n l = let ~( ni , ns , x ) = oexn (gs Top [] $ p2p $ lfn cs ) (n+1) x
in
( [ [ ( Nothing , l ) , ( Nothing , (n+1) ) ] ] ++ [ ni ] ++ ns , x )

-- greedy
exn ('?':cs) n l = let ~( ni , ns , x ) = oexn (gs Top [] $ p2p $ lfn cs ) (n+1) x
in
( [ [ ( Nothing , (n+1) ) , ( Nothing , l ) ] ] ++ [ ni ] ++ ns , x )

-- non-greedy
exn ('*':'?':cs) n l = let ~( ni , ns , x ) = oexn (gs Top [] $ p2p $ lfn cs ) (n+1) n
in
( [ [ ( Nothing , l ) , ( Nothing , (n+1) ) ] ] ++ [ ni ] ++ ns , x )

-- greedy
exn ('*':cs) n l = let ~( ni , ns , x ) = oexn (gs Top [] $ p2p $ lfn cs ) (n+1) n
in
( [ [ ( Nothing , (n+1) ) , ( Nothing , l ) ] ] ++ [ ni ] ++ ns , x )

exn (c:_) n l = ( [ [ ( Just c , l ) ] ] , (n+1) )
exn [] n l = ( [ [ ( Nothing , l ) ] ] , (n+1) )

-- compile a regular expression into a DFA array
rx_compile pps = let ~( ni , ns , x ) = oexn (gs Top [] $ p2p $ lfn pps ) 1 x
in
if ni == []
then
( ns , x )
else
( ni : ns , x )

main = do
-- print $ test
-- print $ lfn test
-- print $ p2p $ lfn test
print $ rx_compile "ab?c" -- arrayed !
print $ rx_compile "()*" -- if you do something stupid, the system will not protect you.


What this will currently do is to convert a string like "ab?c" into a list containing lists each item of which specify where to go next in matching. For example, the given pattern would end up as


( [ [ (Just 'a',2) ] ,
[ (Nothing ,3) , (Nothing,4) ] ,
[ (Just 'b',4) ] ,
[ (Just 'c',5) ]
] ,
5
)


This list is returned in a tuple with a number representing one over the length of the list. This is were the next item would go if the list were still being built.

A parser that accepts a string and uses this structure as a ruleset in matching it should be trivial to make.

Thanks for the link linuxer.

20070611

The impossible is only possible sometimes.

I've been working at creating a regular expression engine in Haskell again and have come up with what feels like a fairly elegant solution. Through a series of functions the following conversions happen :


print $ test -- "a|b?c|d"
print $ lfn test -- ["a","|","b","?","c","|","d"]
print $ p2p $ lfn test -- ["a","|","?b","c","|","d"]
print $ gs Top [] $ p2p $ lfn test -- [["a"],["?b","c"],["d"]]


Now I will convert this last output into a series of rules for what to do at certain points of evaluation, using the information at http://swtch.com/~rsc/regexp/ as a guide to how good regular expressions work.

I want to perform the following conversion :


[["a"],["?b","c"],["d"]] -> ( [ [ (Just 'a', 4 ) , (Nothing , 2) , (Just 'd' , 4 ) ] ,
[ (Just 'b', 3 ) , ( Nothing , 3 ) ] ,
[ (Just 'c', 4 ) ]
]
,
4 )


A parser for such a rule set should be simple enough, and once created I will use the Array module to make it go at a decent speed.

I've coded up through the transformation into a ruleset, and have hit an interesting boundary. When designing my solution I remembered an interesting article I had seen when initially starting to look at Haskell wherein the author was coding an assembler and passed the output in as one of the initial arguments. The idea intrigued me.

The algorithm I derived to make my transformation used this concept judiciously. Unfortunately it does not work, though perhaps not for the reasons one might guess.

Here is the code :



-- rx : a regular expression engine for haskell
-- : (c) 2007 by michael speer , released GPL v2.0 license or later

import Maybe

-- get-function : gather a nodes worth of pattern from a pattern-string
-- returns the segment and the remaining pattern in a tuple
gf :: [Char] -> ( [Char] , [Char] )
gf ('(':ps) = let ( segment , remaining ) = gpf [] ps
in ( '(':segment , remaining )
where
-- get-parenthesis'd fragment
gpf :: [Char] -> [Char] -> ( [Char] , [Char] )
gpf segment [] = ( segment , [] )
gpf segment (')':ps) = ( segment++")" , ps )
gpf segment ('(':ps) = let ( subsegment , remaining ) = gpf [] ps
in gpf ( segment ++ ('(':subsegment) ) remaining
gpf segment (p:ps) = gpf (segment ++ p:[]) ps
gf ('\\':p:ps) = ( '\\':p:[] , ps )
gf ('?':'?':ps) = ( "??" , ps )
gf ('*':'?':ps) = ( "*?" , ps )
gf (p:ps) = ( p:[] , ps )

-- list-functions : creates a list of function-pattern-segments from a regular expression
lfn :: [Char] -> [[Char]]
lfn [] = []
lfn pps = let ( fn , remaining ) = gf pps
in fn : lfn remaining

-- postfix-to-prefix : moves all repitition characters to before their targets
-- in addition running the pattern through this function
-- ensures that the pattern is legal.
p2p :: [[Char]] -> [[Char]]
p2p [] = []
p2p (('?':_):_) = error "Misplaced repitition operator `?' "
p2p (('*':_):_) = error "Misplaced repitition operator `*' "
p2p ("\\":_) = error "Escape sequence targetting nothing"
p2p ((')':_):[]) = ")" : [] -- allows for a close-parum at end of pattern
-- later processing must test for this and throw error if it is unexpected ( eg top-level
p2p ((')':_):_) = error "Unmatched terminating group marker , `)' " -- catches mid pattern erroneous close-para
p2p ("|":('?':_):_) = error "Repitition operator `?' applied to procedural alternation operator `|' "
p2p (s1:s2@("??"):ps) = ('?':'?': s1) : p2p ps
p2p (s1:s2@("?"):ps) = ('?': s1) : p2p ps
p2p (s1:s2@("*?"):ps) = ('*':'?': s1) : p2p ps
p2p (s1:s2@("*"):ps) = ('*': s1) : p2p ps
p2p (s1:ps) = s1 : p2p ps

-- group-split : break the contents of a parenthesized group into peices
-- build a current list of segments until | is encountered
-- then append the result of calling this again on the
-- remainder
data PatternDepth = Top | Sub -- Pattern depth
gs :: PatternDepth -> [ [ Char ] ] -> [ [ Char ] ] -> [ [ [ Char ] ] ]
gs Top segments (")":[]) = error "Unmatched terminating group marker `)' at end of pattern"
gs Sub segments (")":[]) = gs Sub segments []
gs _ segments [] = if segments /= [] then
segments : []
else
[""] : []
gs pd segments ("|":rest) = segments : gs pd [] rest
gs pd segments (s:rest) = gs pd (segments++[s]) rest

-- many to single nodes
m2s :: [ [ ( Maybe Char , Integer ) ] ] -> Integer -> ( ( Maybe Char , Integer ) , [ [ ( Maybe Char , Integer ) ] ] , Integer )
m2s (g@(ni:[]):gs) n = ( ni , gs , n ) -- if first group is one long : decrease the count by one so that popping the top item off mess up the counting
m2s ggs n = ( ( Nothing , (n+1) ) , ggs , (n+1) ) -- otherwise add a new member at the beginning to act as a pointer to it

-- or-extracted-nodes
oexn :: [ [ [ Char ] ] ] -> Integer -> Integer -> ( [ ( Maybe Char , Integer ) ] , [ [ ( Maybe Char , Integer ) ] ] , Integer )
oexn (g:[]) n l = let ( ns , x ) = ( aexn g x l )
in ( [] , ns , x )

oexn (g:g':[]) n l = let ( ( ns , x ) ,
( ni' , ns' , x' ) ,
( ns'' , x'' ) ,
( ni''' , ns''' , x''' ) ) = ( aexn g x' l ,
m2s ns n ,
aexn g' x''' l ,
m2s ns'' x )
in
( ni' : ni''' : [] , ns' ++ ns''' , x'' )

oexn (g:gs) n l = let ( ( ns , x ) ,
( ni' , ns' , x' ) ,
( ni'' , ns'' , x'' ) ) = ( aexn g x' l ,
m2s ns n ,
oexn gs x l )
in
( ni' : ni'' , ns' ++ ns'' , x'' )

-- and-extracted-nodes
aexn :: [ [ Char ] ] -> Integer -> Integer -> ( [ [ ( Maybe Char , Integer ) ] ] , Integer )
aexn (b:[]) n l = exn b n l
aexn (b:bs) n l = let ( ( ns , x ) ,
( ns' , x' ) ) = ( exn b n x ,
aexn bs x l )
in
( ns ++ ns' , x' )

-- extract-node
exn :: [ Char ] -> Integer -> Integer -> ( [ [ ( Maybe Char , Integer ) ] ] , Integer )
exn ('(':cs) n l = let ( ni , ns , x ) = oexn (gs Sub [] $ p2p $ lfn cs ) n x
in
if ni == [] then
( ns , x )
else
( [ ni ] ++ ns , x )
exn (c:_) n l = ( [ [ ( Just c , (n+1) ) ] ] , (n+1) )


test = "a|b?c|d"

main = do
print $ test
print $ lfn test
print $ p2p $ lfn test
print $ gs Top [] $ p2p $ lfn test
--print $ let ( ni , ns , x ) = oexn (gs Top [] $ p2p $ lfn test ) 1 (x+1)
-- in ( ni : ns , x )
print $ let ( ns , x ) = aexn [ "a" , "b" , "c" , "d" ] 1 x
in ( ns , x )
--
--print $ let ( ( ns , x ) ,
-- ( ni' , ns' , x' ) ) = ( aexn [ "abc", "d" ] x' x ,
-- let ( (ggs@((g@(ni:ns)::[(Maybe Char,Integer)]):gs)) , a ) = ( [ns] , 1 )
-- in if ns == [] then
-- ( ni , gs , a )
-- else
-- ( ( Nothing , a+1 ) , ggs , a+1 ) )
-- in ( ns , x )



Following the `print $ aexn ...' function above, you can see how the concept works. The oexn portion fails however.

I believe the reason is the way that the compiler handles the mutual dependence of the arguments and outputs of the two functions. It appears that instead of passing a marker that a value is "in progress" when dealing with two separate functions, each one simply calls the other anew to try to generate it. Since the functions rely on each others output to work, they loop trying to instantiate one another.

A single function is allowed to accept as input its own output ( as is demonstrated in the above working code ) but trying to do this with mutually dependent functions causes them to enter an endless loop that smashes the stack.

For any curious I compile the above with `The Glorious Glasgow Haskell Compilation System, version 6.6' on `Linux version 2.6.20-16-generic (root@terranova) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #2 SMP Thu Jun 7 20:19:32 UTC 2007'.

I looked to see if there were options that might allow me to make the system realize that the calls need to be lazy to a single call of the function, but I do not see any. Instead the system continues being confused into stack overflow by the dependence.

The last function call ( commented out ) had me trying to inline the m2s function, but it caused the same error ( presumably because the compiler just converted the sample into a lambda that caused the same dissonance in gathering the data.

Ah well. Maybe in 6.7 ...

20070509

Initial code at regular expressions in Haskell

Haskell is an interesting language. It is, perhaps, the most intelligent system that I have programmed in. The type system is beautiful, the pattern matching makes function definition in other languages seem laborious, the high precedence for functions applying to their arguments just feels right, and the high barrier of correctness needed for compilation along with the errors given is great.

I am posting in order to share the work I have done thus far in learning Haskell. I decided I wanted to do something at least a little interesting, so I decided to begin implementing some rudimentary regular expression functions.

Currently I only have the framework for what I imagined. As written it operates as follows :


  • the characters from left to right are converted into a list of functions and functions that combine functions

  • the combinating functions are used to foldr together the matching functions in from right to left

  • a string can be handed into this function chain for matching.



-- takes a pattern string, returns two items
-- 1) a function that takes a target string and returns if it matches and how many characters
-- 2) how much of the pattern was consumed to create this match function
literal :: [Char] -> ( [Char] -> ( Bool , Int ) , Int )
literal xxs = ( \yys -> case ( xxs , yys ) of
( [] , _ ) -> ( True , 0 )
( _ , [] ) -> ( False, 0 )
((x:_),(y:_)) -> if x == y then
( True , 1 )
else
( False , 0 )
, 1 )

-- execute the match function to the left first, then the match function to the right second
leftCombine :: ( (([Char]->(Bool,Int)),Int) -> (([Char]->(Bool,Int)),Int) -> (([Char]->(Bool,Int)),Int) )
leftCombine xf yf = let ( xmf , pc ) = xf -- xfunction yfunction xmatchfunction patterncount yougettheidea
in let ( ymf , pc' ) = yf
in ( ( \ccs -> let ( b , i ) = xmf ccs -- ccs is the eventual target string, note that \ :: ([Char]->(Bool,Int))
in if b == False then
( False , 0 )
else
let ( b' , i' ) = ymf ( drop i ccs )
in if b' == False then
( False , 0 )
else
( True , i + i' ) )
, pc + pc' )

-- for each letter gather a matching function, and a function detailing how to attach the preceding match to it.
-- right now it only has a single letter match with a straight left combine function
-- future support for repitions will be handled by combining a 0matching true stub and a combinative
-- function that sets up the preceding match function to match as many elements to a limit as it can,
-- backtracking and trying fewer matches on failures
rxizer :: [Char] -> [ ( (([Char]->(Bool,Int)),Int)->(([Char]->(Bool,Int)),Int)->(([Char]->(Bool,Int)),Int) , ([Char]->(Bool,Int),Int) ) ]
rxizer [] = []
rxizer pps@(p:_) = let ( cfn , ( fn , pc ) ) = case p of
_ -> ( leftCombine , literal pps )
in ( cfn , ( fn , pc ) ) : rxizer ( drop pc pps )

-- meshes together the series of generated match-functions by foldr'ing them together by their combinative-functions
rx :: [Char] -> [Char] -> ( Bool , Int )
rx [] = \ccs -> (True,0)
rx mms@(m:_) = fst $ snd $ foldr fnmerge truestub ( rxizer mms )
where
truestub = ( leftCombine , ( \c -> ( True , 0 ) , 0 ) )
fnmerge x y = let ( cfn , fn ) = x
( cfn' , fn' ) = y
in ( cfn , cfn' fn fn' )

main = do
-- generate some matches
-- pattern target -- output
print $ "HelloWorld" `rx` "HelloWorld" -- (True,10)
print $ "Hello" `rx` "HelloWorld" -- (True,5)
print $ "HelloWorld" `rx` "Hello" -- (False,0)
print $ "" `rx` "" -- (True,0)

About Me

(2) the subculture of the compulsive programmer, whose ethics prescribe that one silly idea and a month of frantic coding should suffice to make him a life-long millionaire. --ewd1036