The programmable programmer: Garbage to UTF-8

Last year I had a problem with a legacy database filled with a mix of UTF-8, Windows CP-1252, extended ASCII, and plain ASCII. I needed a way to clean up the data without losing any of the information it contained.

Being familiar with regular expressions, I looked up how UTF-8 is formatted, made a couple of assumptions about the malformations I would find, figured out which code points I should replace, and the following function was born.

Or something like that. Anyway, here is the code:

function garbage_to_utf8_character_replacement_function( $matches )
{
    // converts bytes in the range 10000000 -> 11111111 that do not appear
    // as part of a unicode character into a unicode character
    // under the assumption that a portion of them are windows
    // cp-1252 characters, and the rest are extended ascii
    // characters (treated here as latin-1)
    $o = ord( $matches[ 0 ] ) ;
    switch( $o )
    {
        // check for windows code page 1252 characters
        case 128 : return "\xe2\x82\xac" ; // Euro Sign
        //gap
        case 130 : return "\xe2\x80\x9a" ; // Single Low-9 Quotation Mark
        case 131 : return "\xc6\x92" ;     // Latin Small Letter F With Hook
        case 132 : return "\xe2\x80\x9e" ; // Double Low-9 Quotation Mark
        case 133 : return "\xe2\x80\xa6" ; // Horizontal Ellipsis
        case 134 : return "\xe2\x80\xa0" ; // Dagger
        case 135 : return "\xe2\x80\xa1" ; // Double Dagger
        case 136 : return "\xcb\x86" ;     // Modifier Letter Circumflex Accent
        case 137 : return "\xe2\x80\xb0" ; // Per Mille Sign
        case 138 : return "\xc5\xa0" ;     // Latin Capital Letter S With Caron
        case 139 : return "\xe2\x80\xb9" ; // Single Left-Pointing Angle Quotation Mark
        case 140 : return "\xc5\x92" ;     // Latin Capital Ligature OE
        //gap
        case 142 : return "\xc5\xbd" ;     // Latin Capital Letter Z With Caron
        //gap
        case 145 : return "\xe2\x80\x98" ; // Left Single Quotation Mark
        case 146 : return "\xe2\x80\x99" ; // Right Single Quotation Mark
        case 147 : return "\xe2\x80\x9c" ; // Left Double Quotation Mark
        case 148 : return "\xe2\x80\x9d" ; // Right Double Quotation Mark
        case 149 : return "\xe2\x80\xa2" ; // Bullet
        case 150 : return "\xe2\x80\x93" ; // En Dash
        case 151 : return "\xe2\x80\x94" ; // Em Dash
        case 152 : return "\xcb\x9c" ;     // Small Tilde
        case 153 : return "\xe2\x84\xa2" ; // Trade Mark Sign
        case 154 : return "\xc5\xa1" ;     // Latin Small Letter S With Caron
        case 155 : return "\xe2\x80\xba" ; // Single Right-Pointing Angle Quotation Mark
        case 156 : return "\xc5\x93" ;     // Latin Small Ligature OE
        //gap
        case 158 : return "\xc5\xbe" ;     // Latin Small Letter Z With Caron
        case 159 : return "\xc5\xb8" ;     // Latin Capital Letter Y With Diaeresis
        // anything else becomes the two-byte form of U+0080 -> U+00FF
        default  : return chr( 192 | ( 3 & ( $o >> 6 ) ) ) . chr( $o & 191 ) ;
    }
}

function garbage_to_utf8( $text )
{
    // locate all bytes with 0x80 set that are not a proper
    // component of a unicode character. pass them to
    // garbage_to_utf8_character_replacement_function
    // to convert them to unicode under the assumptions they
    // are either windows characters or extended ascii
    $bad_replace = ''
        . '/('
        . '('
        // find 1111110x not followed by 5 10xxxxxx
        . '[\\xFC-\\xFD](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
        . '|'
        // find 111110xx not followed by 4 10xxxxxx
        . '[\\xF8-\\xFB](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
        . '|'
        // find 11110xxx not followed by 3 10xxxxxx
        . '[\\xF0-\\xF7](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])'
        . '|'
        // find 1110xxxx not followed by 2 10xxxxxx
        . '[\\xE0-\\xEF](?![\\x80-\\xBF][\\x80-\\xBF])'
        . '|'
        // find 110xxxxx not followed by 1 10xxxxxx
        . '[\\xC0-\\xDF](?![\\x80-\\xBF])'
        . '|'
        // find 10xxxxxx not part of code point: each alternative is a
        // lead byte plus fewer continuation bytes than it requires, so
        // the byte being tested could still belong to that sequence
        . '(?<!'
        . '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
        . '|' . '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
        . '|' . '[\\xFC-\\xFD][\\x80-\\xBF][\\x80-\\xBF]'
        . '|' . '[\\xFC-\\xFD][\\x80-\\xBF]'
        . '|' . '[\\xFC-\\xFD]'
        . '|' . '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF]'
        . '|' . '[\\xF8-\\xFB][\\x80-\\xBF][\\x80-\\xBF]'
        . '|' . '[\\xF8-\\xFB][\\x80-\\xBF]'
        . '|' . '[\\xF8-\\xFB]'
        . '|' . '[\\xF0-\\xF7][\\x80-\\xBF][\\x80-\\xBF]'
        . '|' . '[\\xF0-\\xF7][\\x80-\\xBF]'
        . '|' . '[\\xF0-\\xF7]'
        . '|' . '[\\xE0-\\xEF][\\x80-\\xBF]'
        . '|' . '[\\xE0-\\xEF]'
        . '|' . '[\\xC0-\\xDF]'
        . ')'
        . '[\\x80-\\xBF]'
        . ')'
        . ')/' ;
    return preg_replace_callback(
        $bad_replace ,
        'garbage_to_utf8_character_replacement_function' ,
        $text
    ) ;
}
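
The default branch of the replacement function is easier to follow with one byte worked through by hand. Here is a minimal sketch, assuming the stray byte should be read as Latin-1. The helper names latin1_byte_to_utf8 and is_valid_utf8 are made up for illustration and are not part of the code above; the second one relies on PCRE refusing invalid UTF-8 subjects when the u modifier is set.

```php
<?php
// Hypothetical helper (not from the original post): encode one byte in the
// range 0x80-0xFF as the two-byte UTF-8 form of code point U+0080-U+00FF.
function latin1_byte_to_utf8( $byte )
{
    $o = ord( $byte ) ;
    // lead byte 110000xx carries the top two bits of the input byte
    $lead = 0xC0 | ( $o >> 6 ) ;
    // continuation byte 10xxxxxx carries the low six bits
    $tail = 0x80 | ( $o & 0x3F ) ;
    return chr( $lead ) . chr( $tail ) ;
}

// Hypothetical helper: with the u modifier, preg_match returns false
// when the subject is not valid UTF-8, and 1 for this empty pattern otherwise.
function is_valid_utf8( $text )
{
    return preg_match( '//u' , $text ) === 1 ;
}

$e_acute = latin1_byte_to_utf8( "\xE9" ) ;        // 0xE9 is e-acute in Latin-1
echo bin2hex( $e_acute ) , "\n" ;                 // prints c3a9
var_dump( is_valid_utf8( "caf\xE9" ) ) ;          // bool(false)
var_dump( is_valid_utf8( "caf" . $e_acute ) ) ;   // bool(true)
```

For a byte that already has its high bit set, 0x80 | ( $o & 0x3F ) gives the same result as the $o & 191 used in the function above; the sketch just spells out the 10xxxxxx shape. A check along the lines of is_valid_utf8 is one way to confirm that a cleaned string no longer contains stray bytes.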

A quick search shows I'm not the only one to have solved this using regular expressions.

FixLatin

Too bad he didn't post sooner; it would've saved me having to figure out the encoding transformation on my own. Ah well, at least I'm not the only one.


20090713
