Posted by Samuel on 10/07/05 00:11
Hello,
I am looking for a way to check whether a string contains only word
characters and plain spaces (no other whitespace), *regardless of
the current locale*. In other words, any character that is a word
character in any locale should be allowed. This check:
preg_match("/^[\w ]*$/", $_GET['whatever']);
where the $_GET variable contains a UTF-8 encoded string, seems to
work only with whatever locale is currently set. Of course, I
could change the locale using setlocale(), but that would still limit
the check to a subset of all possible input values.
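A locale-independent variant of the check above could be sketched with PCRE's UTF-8 mode and Unicode property classes. This assumes the PCRE library behind PHP was built with Unicode property support; the function name is_word_string is hypothetical, and including \p{M} (combining marks) in the word class is a design choice, not a requirement.

```php
<?php
// Hypothetical helper: true if $s holds only letters, marks, digits,
// underscores and the plain space character (U+0020), in any script.
function is_word_string($s) {
    // \p{L} = any letter, \p{M} = combining marks, \p{N} = any number.
    // The u modifier makes PCRE treat pattern and subject as UTF-8;
    // the D modifier stops $ from matching before a trailing newline.
    // preg_match() does not return 1 when the subject is malformed UTF-8.
    return preg_match('/^[\p{L}\p{M}\p{N}_ ]*$/uD', $s) === 1;
}
```

Because the character classes come from Unicode properties rather than from the C locale, the result should not depend on setlocale(). Tabs, newlines, and punctuation are rejected.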
I also created this function from information that I found on the web:
--------------------------------
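function is_utf8($_string) {
    // Accept only well-formed UTF-8 byte sequences.
    return preg_match('/^([\x00-\x7f]|'          // 1 byte (ASCII)
        . '[\xc2-\xdf][\x80-\xbf]|'              // 2 bytes
        . '\xe0[\xa0-\xbf][\x80-\xbf]|'          // 3 bytes, no overlongs
        . '[\xe1-\xec][\x80-\xbf]{2}|'           // 3 bytes
        . '\xed[\x80-\x9f][\x80-\xbf]|'          // 3 bytes, no surrogates
        . '[\xee-\xef][\x80-\xbf]{2}|'           // 3 bytes
        . '\xf0[\x90-\xbf][\x80-\xbf]{2}|'       // 4 bytes, no overlongs
        . '[\xf1-\xf3][\x80-\xbf]{3}|'           // 4 bytes
        . '\xf4[\x80-\x8f][\x80-\xbf]{2})*$/',   // 4 bytes, up to U+10FFFF
        $_string) > 0;
}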
--------------------------------
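As a cross-check for a hand-written pattern like the one above, PCRE's own UTF-8 validation can be used: with the u modifier, preg_match() refuses a malformed subject. This is a minimal sketch; the function name is_utf8_pcre is hypothetical.

```php
<?php
// Hypothetical helper: validity check via PCRE's built-in UTF-8 validation.
function is_utf8_pcre($s) {
    // An empty pattern with the u modifier matches any well-formed UTF-8
    // subject; for a malformed subject preg_match() does not return 1.
    return preg_match('//u', $s) === 1;
}
```

Note that both this check and is_utf8() accept control characters such as \x01, because those are well-formed UTF-8. A markup validator may still reject them, so byte-level validity and "allowed in the document" are separate questions.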
However, this does not seem to be completely accurate, as it still
allows characters such as the ones shown here:
http://debain.org/software/tefinch/demo/?read=1&msg_id=214&forum_id=1
(sorry for the external link, I just don't know how to create such
characters here.)
According to the W3C Validator, those characters are still invalid.
http://validator.w3.org/check?uri=http%3A%2F%2Fdebain.org%2Fsoftware%2Ftefinch%2Fdemo%2F%3Fread%3D1%26msg_id%3D214%26forum_id%3D1&charset=%28detect+automatically%29&doctype=%28detect+automatically%29
I know there must be an answer somewhere on the web already, but I have
not found any reference on Google or in the archives of this
newsgroup.
Any help appreciated.
-Samuel