Re: checking to see if a character is UTF8

Posted by Malcolm Dew-Jones on 11/18/05 04:13

lkrubner@geocities.com wrote:

: this is a function that someone has up on www.php.net:


: function seemsUTF8($Str) {
:   // bmorel at ssi dot fr
:   // 17-Feb-2004 01:22
:   // Here is an improved version of that function, compatible with the
:   // 31-bit encoding scheme of Unicode 3.x:
:   for ($i=0; $i < strlen($Str); $i++) {
:     if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
:     elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
:     elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
:     elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
:     elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
:     elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
:     else return false; # Does not match any model
:     for ($j=0; $j < $n; $j++) {
:       # n bytes matching 10bbbbbb follow ?
:       if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
:         return false;
:     }
:   }
:   return true;
: }



: What is achieved by the variable $n? I don't know enough about
: character codes to understand what that final inner for loop is trying
: to do.

A UTF-8 character can take more than one byte. Any character whose code
point is above 127 (that is, anything outside ASCII) needs two or more
bytes. The first byte of a multibyte character indicates how many bytes
the character has: the number of leading 1 bits in that byte is the total
length of the sequence.

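In the encoding this function checks, a multibyte character is two to six
bytes long in total (the first byte followed by 1 to 5 more bytes). The
current UTF-8 definition (RFC 3629) stops at four bytes, so the 5- and
6-byte forms are legacy patterns from the older 31-bit scheme.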
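
To make the length rule concrete, here is a minimal sketch of a helper
that reads only the first byte and reports how long the whole sequence
should be. The name utf8SequenceLength is made up for illustration and is
not part of the quoted function; it just reuses the same bit masks.

```php
<?php
// Hypothetical helper, not part of the quoted function.
// Returns the total number of bytes in the sequence that starts with the
// given byte, or 0 if that byte cannot start a sequence.
// Assumes $byte is a non-empty string; only its first byte is examined.
function utf8SequenceLength($byte)
{
    $b = ord($byte[0]);
    if ($b < 0x80) return 1;            // 0bbbbbbb: single-byte (ASCII)
    if (($b & 0xE0) === 0xC0) return 2; // 110bbbbb
    if (($b & 0xF0) === 0xE0) return 3; // 1110bbbb
    if (($b & 0xF8) === 0xF0) return 4; // 11110bbb
    if (($b & 0xFC) === 0xF8) return 5; // 111110bb (legacy form)
    if (($b & 0xFE) === 0xFC) return 6; // 1111110b (legacy form)
    return 0; // 10bbbbbb (a continuation byte) or 0xFE/0xFF
}
```

For example, utf8SequenceLength("\xC3") gives 2, so one more byte is
expected. That is the case where the quoted function sets $n=1.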

The outer loop is looking for the first byte of a multibyte character.
When it finds one, it examines the bit pattern to see how many more bytes
should follow, and stores that count in $n. So $n is the number of
continuation bytes still expected (1 to 5).
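
If the bit patterns are hard to picture, it can help to print them. The
sketch below dumps each byte of a string as eight binary digits. The
helper name byteBits is an assumption of mine, chosen for illustration.

```php
<?php
// Hypothetical helper for illustration: render each byte of a string as
// eight binary digits so the first-byte and continuation patterns show.
function byteBits($str)
{
    $bits = array();
    for ($i = 0; $i < strlen($str); $i++) {
        $bits[] = sprintf('%08b', ord($str[$i]));
    }
    return implode(' ', $bits);
}

// "A" is plain ASCII: one byte, top bit clear.
echo byteBits("A"), "\n";            // 01000001
// "\xC3\xA9" is the UTF-8 encoding of U+00E9 (e with acute accent).
echo byteBits("\xC3\xA9"), "\n";     // 11000011 10101001
// "\xE2\x82\xAC" is the UTF-8 encoding of U+20AC (euro sign).
echo byteBits("\xE2\x82\xAC"), "\n"; // 11100010 10000010 10101100
```

The first byte starts with 110 or 1110, and every byte after it starts
with 10, which is what the masks in the quoted function test for.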

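The inner loop examines those following bytes (the "more" in the sentence
above). It checks that the right number of continuation bytes, each of
the form 10bbbbbb, actually follows the first byte. It returns false if
the string ends too early or if one of those bytes does not match.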
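
Pulled out on its own, the inner-loop check might look like the sketch
below. The function hasContinuationBytes and its parameters are
hypothetical names; the quoted function does the same test inline while
advancing $i.

```php
<?php
// Hypothetical helper, not part of the quoted function.
// Checks that $n bytes of the form 10bbbbbb follow the byte at $pos.
function hasContinuationBytes($str, $pos, $n)
{
    $len = strlen($str);
    for ($j = 1; $j <= $n; $j++) {
        // The string ended before all expected bytes were seen.
        if ($pos + $j >= $len) return false;
        // The top two bits must be exactly "10".
        if ((ord($str[$pos + $j]) & 0xC0) !== 0x80) return false;
    }
    return true;
}

var_dump(hasContinuationBytes("\xC3\xA9", 0, 1)); // bool(true)
var_dump(hasContinuationBytes("\xC3", 0, 1));     // bool(false): truncated
var_dump(hasContinuationBytes("\xC3A", 0, 1));    // bool(false): "A" is 01000001
```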

The outer loop skips over bytes that represent single-byte characters
(values below 0x80); that is what the "continue" on the first test does.


 
