20090713

Garbage to UTF-8

I had a problem last year with a legacy database filled with a mix of utf-8, windows cp-1252, extended and regular ascii. I needed a way to clean up the information without losing any of the information contained in it.

Being familiar with regular expressions, I looked up how UTF-8 was formatted, made a couple of assumptions about the malformations I would find therein, figured out which code points were the ones I should replace and the following function was born.

Or something like that. Anyway here is the code :


function garbage_to_utf8_character_replacement_function( $matches )
{

// converts binary 10000000 -> 11111111 that do not appear
// as part of a unicode character into a unicode character
// under the assumption that a portion of them are windows
// cp-1252 characters, and the rest are exteneded ascii
// characters

$o = ord( $matches[ 0 ] ) ;

switch( $o )
{
// check for windows code page 1252 characters
case 130 : return "\xe2\x80\x94" ; // Single Low-9 Quotation Mark
case 131 : return "\xc6\x92" ; // Latin Small Letter F With Hook
case 132 : return "\xe2\x80\x9e" ; // Double Low-9 Quotation Mark
case 133 : return "\xe2\x80\xa6" ; // Horizontal Ellipsis
case 134 : return "\xe2\x80\xa0" ; // Dagger
case 135 : return "\xe2\x80\xa1" ; // Double Dagger
case 136 : return "\xcb\x86" ; // Modifier Letter Circumflex Accent
case 137 : return "\xe2\x80\xb0" ; // Per Mille Sign
case 138 : return "\xc5\xa0" ; // Latin Capital Letter S With Caron
case 139 : return "\xe2\x80\xb9" ; // Single Left-Pointing Angle Quotation Mark
case 140 : return "\xc5\x92" ; // Latin Capital Ligature OE
//gap
case 145 : return "\xe2\x80\x98" ; // Left Single Quotation Mark
case 146 : return "\xe2\x80\x99" ; // Right Single Quotation Mark
case 147 : return "\xe2\x80\x9c" ; // Left Double Quotation Mark
case 148 : return "\xe2\x80\x9d" ; // Right Double Quotation Mark
case 149 : return "\xe2\x80\xa2" ; // Bullet
case 150 : return "\xe2\x80\x93" ; // En Dash
case 151 : return "\xe2\x80\x94" ; // Em Dash
case 152 : return "\xcb\x9c" ; // Small Tilde
case 153 : return "\xe2\x84\xa2" ; // Trade Mark Sign
case 154 : return "\xc5\xa1" ; // Latin Small Letter S With Caron
case 155 : return "\xe2\x80\xba" ; // Single Right-Pointing Angle Quotation Mark
case 156 : return "\xc5\x93" ; // Latin Small Ligature OE
//gap
case 159 : return "\xc5\xb8" ; // Latin Capital Letter Y With Diaeresis

default :
return chr( 192 | ( 3 & ( $o >> 6 ) ) ) . chr( $o & 191 ) ;

}

}

function garbage_to_utf8( $text )
{

// locate all bytes with 0x80 set that are not a proper
// component of a unicode character. pass them to
// garbage_to_utf8_character_replacement_function
// to convert them to unicode under the assumptions they
// are either windows characters or extended ascii

$bad_replace = ''
. '/('
. '('
// find 1111110x not followed by 5 10xxxxxx
. '[\\xFC-\\xFD](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 111110xx not followed by 4 10xxxxxx
. '[\\xF8-\\xFB](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 11110xxx not followed by 3 10xxxxxx
. '[\\xF0-\\xF7](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 1110xxxx not followed by 2 10xxxxxx
. '[\\xE0-\\xEF](?![\\x80-\\xBF][\\x80-\\xBF])'
. '|'
// find 110xxxxx not followed by 1 10xxxxxx
. '[\\xC0-\\xDF](?![\\x80-\\xBF])'
. '|'
// find 10xxxxxx not part of code point
. '(?<!'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD][\\x80-\\xBF]'
. '|'
. '[\\xFC-\\xFD]'
. '|'
. '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xF8-\\xFB][\\x80-\\xBF]'
. '|'
. '[\\xF8-\\xFB]'
. '|'
. '[\\xF0-\\xF7][\\x80-\\xBF][\\x80-\\xBF]'
. '|'
. '[\\xF0-\\xF7][\\x80-\\xBF]'
. '|'
. '[\\xF0-\\xF7]'
. '|'
. '[\\xE0-\\xEF][\\x80-\\xBF]'
. '|'
. '[\\xE0-\\xEF]'
. '|'
. '[\\xC0-\\xDF]'
. ')'
. '[\\x80-\\xBF]'
. ')'
. ')/'
;

return preg_replace_callback( $bad_replace ,
'garbage_to_utf8_character_replacement_function' ,
$text ) ;

}


A quick search shows I'm not the only one to have solved this using regular expressions.

FixLatin

Too bad he didn't post sooner, it would've saved me having to figure out the encoding transformation on my own. Ah well, at least I'm not the only one.

No comments:

About Me

(2) the subculture of the compulsive programmer, whose ethics prescribe that one silly idea and a month of frantic coding should suffice to make him a life-long millionaire. --ewd1036