The programmable programmer: regular expression

I had a problem last year with a legacy database filled with a mix of utf-8, windows cp-1252, extended and regular ascii. I needed a way to clean up the information without losing any of the information contained in it.

Being familiar with regular expressions, I looked up how UTF-8 was formatted, made a couple of assumptions about the malformations I would find therein, figured out which code points were the ones I should replace and the following function was born.

Or something like that. Anyway here is the code :

function garbage_to_utf8_character_replacement_function( $matches ) { // converts binary 10000000 -> 11111111 that do not appear // as part of a unicode character into a unicode character // under the assumption that a portion of them are windows // cp-1252 characters, and the rest are exteneded ascii // characters $o = ord( $matches[ 0 ] ) ; switch( $o ) { // check for windows code page 1252 characters case 130 : return "\xe2\x80\x94" ; // Single Low-9 Quotation Mark case 131 : return "\xc6\x92" ; // Latin Small Letter F With Hook case 132 : return "\xe2\x80\x9e" ; // Double Low-9 Quotation Mark case 133 : return "\xe2\x80\xa6" ; // Horizontal Ellipsis case 134 : return "\xe2\x80\xa0" ; // Dagger case 135 : return "\xe2\x80\xa1" ; // Double Dagger case 136 : return "\xcb\x86" ; // Modifier Letter Circumflex Accent case 137 : return "\xe2\x80\xb0" ; // Per Mille Sign case 138 : return "\xc5\xa0" ; // Latin Capital Letter S With Caron case 139 : return "\xe2\x80\xb9" ; // Single Left-Pointing Angle Quotation Mark case 140 : return "\xc5\x92" ; // Latin Capital Ligature OE //gap case 145 : return "\xe2\x80\x98" ; // Left Single Quotation Mark case 146 : return "\xe2\x80\x99" ; // Right Single Quotation Mark case 147 : return "\xe2\x80\x9c" ; // Left Double Quotation Mark case 148 : return "\xe2\x80\x9d" ; // Right Double Quotation Mark case 149 : return "\xe2\x80\xa2" ; // Bullet case 150 : return "\xe2\x80\x93" ; // En Dash case 151 : return "\xe2\x80\x94" ; // Em Dash case 152 : return "\xcb\x9c" ; // Small Tilde case 153 : return "\xe2\x84\xa2" ; // Trade Mark Sign case 154 : return "\xc5\xa1" ; // Latin Small Letter S With Caron case 155 : return "\xe2\x80\xba" ; // Single Right-Pointing Angle Quotation Mark case 156 : return "\xc5\x93" ; // Latin Small Ligature OE //gap case 159 : return "\xc5\xb8" ; // Latin Capital Letter Y With Diaeresis default : return chr( 192 | ( 3 & ( $o >> 6 ) ) ) . chr( $o & 191 ) ; } } function garbage_to_utf8( $text ) { // locate all bytes with 0x80 set that are not a proper // component of a unicode character. pass them to // garbage_to_utf8_character_replacement_function // to convert them to unicode under the assumptions they // are either windows characters or extended ascii $bad_replace = '' . '/(' . '(' // find 1111110x not followed by 5 10xxxxxx . '[\\xFC-\\xFD](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])' . '|' // find 111110xx not followed by 4 10xxxxxx . '[\\xF8-\\xFB](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])' . '|' // find 11110xxx not followed by 3 10xxxxxx . '[\\xF0-\\xF7](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])' . '|' // find 1110xxxx not followed by 2 10xxxxxx . '[\\xE0-\\xEF](?![\\x80-\\xBF][\\x80-\\xBF])' . '|' // find 110xxxxx not followed by 1 10xxxxxx . '[\\xC0-\\xDF](?![\\x80-\\xBF])' . '|' // find 10xxxxxx not part of code point . '(?<!' . '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]' . '|' . '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]' . '|' . '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]' . '|' . '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF]' . '|' . '[\\xFC-\\xFD][\\x80-\\xBF]' . '|' . '[\\xFC-\\xFD]' . '|' . '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]' . '|' . '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF]' . '|' . '[\\xF8-\\xFB][\\x80-\\xBF]' . '|' . '[\\xF8-\\xFB]' . '|' . '[\\xF0-\\xF7][\\x80-\\xBF][\\x80-\\xBF]' . '|' . '[\\xF0-\\xF7][\\x80-\\xBF]' . '|' . '[\\xF0-\\xF7]' . '|' . '[\\xE0-\\xEF][\\x80-\\xBF]' . '|' . '[\\xE0-\\xEF]' . '|' . '[\\xC0-\\xDF]' . ')' . '[\\x80-\\xBF]' . ')' . ')/' ; return preg_replace_callback( $bad_replace , 'garbage_to_utf8_character_replacement_function' , $text ) ; }

A quick search shows I'm not the only one to have solved this using regular expressions.

FixLatin

Too bad he didn't post sooner, it would've saved me having to figure out the encoding transformation on my own. Ah well, at least I'm not the only one.

Haskell is an interesting language. It is, perhaps, the most intelligent system that I have programmed in. The type system is beautiful, the pattern matching makes function definition in other languages seem laborious, the high precedence for functions applying to their arguments just feels right, and the high barrier of correctness needed for compilation along with the errors given is great.

I am posting in order to share the work I have done thus far in learning Haskell. I decided I wanted to do something at least a little interesting, so I decided to begin implementing some rudimentary regular expression functions.

Currently I only have the framework for what I imagined. As written it operates as follows :

the characters from left to right are converted into a list of functions and functions that combine functions

the combinating functions are used to foldr together the matching functions in from right to left

a string can be handed into this function chain for matching.


-- takes a pattern string, returns two items                                                                                                                           
--  1) a function that takes a target string and returns if it matches and how many characters                                                                         
--  2) how much of the pattern was consumed to create this match function                                                                                              
literal :: [Char] -> ( [Char] -> ( Bool , Int ) , Int )
literal xxs = ( \yys -> case ( xxs , yys ) of
                          ( [] , _ ) -> ( True , 0 )
                          ( _ , [] ) -> ( False, 0 )
                          ((x:_),(y:_)) -> if x == y then
                                                ( True , 1 )
                                            else
                                                ( False , 0 )
              , 1 )

-- execute the match function to the left first, then the match function to the right second                                                                           
leftCombine :: ( (([Char]->(Bool,Int)),Int) -> (([Char]->(Bool,Int)),Int) -> (([Char]->(Bool,Int)),Int) )
leftCombine xf yf = let ( xmf , pc ) = xf  -- xfunction yfunction xmatchfunction patterncount yougettheidea                                                            
                    in let ( ymf , pc' ) = yf
                       in ( ( \ccs -> let ( b , i ) = xmf ccs -- ccs is the eventual target string, note that \ :: ([Char]->(Bool,Int))                                
                                      in if b == False then
                                             ( False , 0 )
                                         else
                                             let ( b' , i' ) = ymf ( drop i ccs )
                                             in if b' == False then
                                                    ( False , 0 )
                                                else
                                                    ( True , i + i' ) )
                          , pc + pc' )

-- for each letter gather a matching function, and a function detailing how to attach the preceding match to it.                                                       
-- right now it only has a single letter match with a straight left combine function                                                                                   
-- future support for repitions will be handled by combining a 0matching true stub and a combinative                                                                   
--   function that sets up the preceding match function to match as many elements to a limit as it can,                                                                
--   backtracking and trying fewer matches on failures                                                                                                                 
rxizer :: [Char] -> [ ( (([Char]->(Bool,Int)),Int)->(([Char]->(Bool,Int)),Int)->(([Char]->(Bool,Int)),Int) , ([Char]->(Bool,Int),Int) ) ]
rxizer [] = []
rxizer pps@(p:_) = let ( cfn , ( fn , pc ) ) = case p of
                                                 _ -> ( leftCombine , literal pps )
                   in ( cfn , ( fn , pc ) ) : rxizer ( drop pc pps )

-- meshes together the series of generated match-functions by foldr'ing them together by their combinative-functions                                                   
rx :: [Char] -> [Char] -> ( Bool , Int )
rx [] = \ccs -> (True,0)
rx mms@(m:_) = fst $ snd $ foldr fnmerge truestub ( rxizer mms )
    where
      truestub = ( leftCombine , ( \c -> ( True , 0 ) , 0 ) )
      fnmerge x y = let ( cfn , fn ) = x
                        ( cfn' , fn' ) = y
                    in ( cfn , cfn' fn fn' )

main = do
  -- generate some matches                                                                                                                                             
  --      pattern           target       -- output  
  print $ "HelloWorld" `rx` "HelloWorld" -- (True,10)                                                                                                                  
  print $ "Hello"      `rx` "HelloWorld" -- (True,5)                                                                                                                   
  print $ "HelloWorld" `rx` "Hello"      -- (False,0)                                                                                                                  
  print $ ""           `rx` ""           -- (True,0)

The programmable programmer

20090713

Garbage to UTF-8

20070509

Initial code at regular expressions in Haskell

Blog Archive

About Me