Showing posts with label regular expression. Show all posts
Showing posts with label regular expression. Show all posts

20090713

Garbage to UTF-8

I had a problem last year with a legacy database filled with a mix of utf-8, windows cp-1252, extended and regular ascii. I needed a way to clean up the information without losing any of the information contained in it.

Being familiar with regular expressions, I looked up how UTF-8 was formatted, made a couple of assumptions about the malformations I would find therein, figured out which code points were the ones I should replace and the following function was born.

Or something like that. Anyway here is the code :


function garbage_to_utf8_character_replacement_function( $matches )
{

// converts binary 10000000 -> 11111111 that do not appear
// as part of a unicode character into a unicode character
// under the assumption that a portion of them are windows
// cp-1252 characters, and the rest are exteneded ascii
// characters

$o = ord( $matches[ 0 ] ) ;

switch( $o )
{
// check for windows code page 1252 characters
case 130 : return "\xe2\x80\x94" ; // Single Low-9 Quotation Mark
case 131 : return "\xc6\x92" ; // Latin Small Letter F With Hook
case 132 : return "\xe2\x80\x9e" ; // Double Low-9 Quotation Mark
case 133 : return "\xe2\x80\xa6" ; // Horizontal Ellipsis
case 134 : return "\xe2\x80\xa0" ; // Dagger
case 135 : return "\xe2\x80\xa1" ; // Double Dagger
case 136 : return "\xcb\x86" ; // Modifier Letter Circumflex Accent
case 137 : return "\xe2\x80\xb0" ; // Per Mille Sign
case 138 : return "\xc5\xa0" ; // Latin Capital Letter S With Caron
case 139 : return "\xe2\x80\xb9" ; // Single Left-Pointing Angle Quotation Mark
case 140 : return "\xc5\x92" ; // Latin Capital Ligature OE
//gap
case 145 : return "\xe2\x80\x98" ; // Left Single Quotation Mark
case 146 : return "\xe2\x80\x99" ; // Right Single Quotation Mark
case 147 : return "\xe2\x80\x9c" ; // Left Double Quotation Mark
case 148 : return "\xe2\x80\x9d" ; // Right Double Quotation Mark
case 149 : return "\xe2\x80\xa2" ; // Bullet
case 150 : return "\xe2\x80\x93" ; // En Dash
case 151 : return "\xe2\x80\x94" ; // Em Dash
case 152 : return "\xcb\x9c" ; // Small Tilde
case 153 : return "\xe2\x84\xa2" ; // Trade Mark Sign
case 154 : return "\xc5\xa1" ; // Latin Small Letter S With Caron
case 155 : return "\xe2\x80\xba" ; // Single Right-Pointing Angle Quotation Mark
case 156 : return "\xc5\x93" ; // Latin Small Ligature OE
//gap
case 159 : return "\xc5\xb8" ; // Latin Capital Letter Y With Diaeresis

default :
return chr( 192 | ( 3 & ( $o >> 6 ) ) ) . chr( $o & 191 ) ;

}

}

function garbage_to_utf8( $text )
{

// locate all bytes with 0x80 set that are not a proper
// component of a unicode character. pass them to
// garbage_to_utf8_character_replacement_function
// to convert them to unicode under the assumptions they
// are either windows characters or extended ascii

$bad_replace = ''
. '/('
. '('
// find 1111110x not followed by 5 10xxxxxx
. '[\\xFC-\\xFD](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 111110xx not followed by 4 10xxxxxx
. '[\\xF8-\\xFB](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 11110xxx not followed by 3 10xxxxxx
. '[\\xF0-\\xF7](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 1110xxxx not followed by 2 10xxxxxx
. '[\\xE0-\\xEF](?![\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 110xxxxx not followed by 1 10xxxxxx
. '[\\xC0-\\xDF](?![\\x80-\\xBF])'
. '|'
// find 10xxxxxx not part of code point
. '(?<!'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD]'
. '|'
. '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xF8-\\xFB][\\x80-\\xBF]'
. '|'
. '[\\xF8-\\xFB]'
. '|'
. '[\\xF0-\\xF7][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xF0-\\xF7][\\x80-\\xBF]'
. '|'
. '[\\xF0-\\xF7]'
. '|'
. '[\\xE0-\\xEF][\\x80-\\xBF]'
. '|'
. '[\\xE0-\\xEF]'
. '|'
. '[\\xC0-\\xDF]'
. ')'
. '[\\x80-\\xBF]'
. ')'
. ')/'
;

return preg_replace_callback( $bad_replace ,
'garbage_to_utf8_character_replacement_function' ,
$text ) ;

}


A quick search shows I'm not the only one to have solved this using regular expressions.

FixLatin

Too bad he didn't post sooner, it would've saved me having to figure out the encoding transformation on my own. Ah well, at least I'm not the only one.

20070509

Initial code at regular expressions in Haskell

Haskell is an interesting language. It is, perhaps, the most intelligent system that I have programmed in. The type system is beautiful, the pattern matching makes function definition in other languages seem laborious, the high precedence for functions applying to their arguments just feels right, and the high barrier of correctness needed for compilation along with the errors given is great.

I am posting in order to share the work I have done thus far in learning Haskell. I decided I wanted to do something at least a little interesting, so I decided to begin implementing some rudimentary regular expression functions.

Currently I only have the framework for what I imagined. As written it operates as follows :


  • the characters from left to right are converted into a list of functions and functions that combine functions

  • the combinating functions are used to foldr together the matching functions in from right to left

  • a string can be handed into this function chain for matching.



-- takes a pattern string, returns two items
-- 1) a function that takes a target string and returns if it matches and how many characters
-- 2) how much of the pattern was consumed to create this match function
literal :: [Char] -> ( [Char] -> ( Bool , Int ) , Int )
literal xxs = ( \yys -> case ( xxs , yys ) of
( [] , _ ) -> ( True , 0 )
( _ , [] ) -> ( False, 0 )
((x:_),(y:_)) -> if x == y then
( True , 1 )
else
( False , 0 )
, 1 )

-- execute the match function to the left first, then the match function to the right second
leftCombine :: ( (([Char]->(Bool,Int)),Int) -> (([Char]->(Bool,Int)),Int) -> (([Char]->(Bool,Int)),Int) )
leftCombine xf yf = let ( xmf , pc ) = xf -- xfunction yfunction xmatchfunction patterncount yougettheidea
in let ( ymf , pc' ) = yf
in ( ( \ccs -> let ( b , i ) = xmf ccs -- ccs is the eventual target string, note that \ :: ([Char]->(Bool,Int))
in if b == False then
( False , 0 )
else
let ( b' , i' ) = ymf ( drop i ccs )
in if b' == False then
( False , 0 )
else
( True , i + i' ) )
, pc + pc' )

-- for each letter gather a matching function, and a function detailing how to attach the preceding match to it.
-- right now it only has a single letter match with a straight left combine function
-- future support for repitions will be handled by combining a 0matching true stub and a combinative
-- function that sets up the preceding match function to match as many elements to a limit as it can,
-- backtracking and trying fewer matches on failures
rxizer :: [Char] -> [ ( (([Char]->(Bool,Int)),Int)->(([Char]->(Bool,Int)),Int)->(([Char]->(Bool,Int)),Int) , ([Char]->(Bool,Int),Int) ) ]
rxizer [] = []
rxizer pps@(p:_) = let ( cfn , ( fn , pc ) ) = case p of
_ -> ( leftCombine , literal pps )
in ( cfn , ( fn , pc ) ) : rxizer ( drop pc pps )

-- meshes together the series of generated match-functions by foldr'ing them together by their combinative-functions
rx :: [Char] -> [Char] -> ( Bool , Int )
rx [] = \ccs -> (True,0)
rx mms@(m:_) = fst $ snd $ foldr fnmerge truestub ( rxizer mms )
where
truestub = ( leftCombine , ( \c -> ( True , 0 ) , 0 ) )
fnmerge x y = let ( cfn , fn ) = x
( cfn' , fn' ) = y
in ( cfn , cfn' fn fn' )

main = do
-- generate some matches
-- pattern target -- output
print $ "HelloWorld" `rx` "HelloWorld" -- (True,10)
print $ "Hello" `rx` "HelloWorld" -- (True,5)
print $ "HelloWorld" `rx` "Hello" -- (False,0)
print $ "" `rx` "" -- (True,0)

About Me

(2) the subculture of the compulsive programmer, whose ethics prescribe that one silly idea and a month of frantic coding should suffice to make him a life-long millionaire. --ewd1036