Posted by Dylan Sung on 07/17/06 16:58
"Nikita the Spider" <NikitaTheSpider@gmail.com> wrote in message
news:NikitaTheSpider-8A9133.10534117072006@news-rdr-02-ge0-1.southeast.rr.com...
> In article <1153147164.996213.29840@p79g2000cwp.googlegroups.com>,
> "Pat" <GTQWVOPHMEAQ@spammotel.com> wrote:
>
>> Would google and other search engines support the indexing of
>> non-English UTF-8 encoded websites?
>
> Yes.
>
>
>> Most chinese website indexed on google appears to be
>> - for Traditional Chinese, charset=big5" encoding=ANSI
>> - For Simplified Chinese, charset=gb2312 encoding=ANSI
>
> I don't have any experience with Asian encodings but my guess is that
> big5 is preferable to UTF8 because it is more efficient (i.e. takes up
> less space) when most of the characters are Asian. If you don't mind
> fatter pages, UTF8 should be fine.
Encodings like GB and Big5 are double-byte encodings: each Chinese
character takes two bytes. UTF-8, however, uses three bytes for most East
Asian characters (amongst others), and four for those outside the Basic
Multilingual Plane. So yes, in terms of economy, GB and Big5 yield text
files that have fewer bytes.
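As a rough illustration of the size difference, here is a minimal Python sketch. The sample string and the helper name encoded_sizes are my own choices for the example, and it assumes the codecs that ship with Python's standard library.

```python
# Compare how many bytes the same Chinese text needs in each encoding.
# encoded_sizes is a hypothetical helper written for this example.
def encoded_sizes(text, encodings=("utf-8", "gb2312", "big5")):
    """Return a dict mapping each encoding name to the encoded byte count."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Two characters that exist in both GB2312 and Big5.
sample = "中文"
sizes = encoded_sizes(sample)
print(sizes)
```

Plain ASCII text takes one byte per character in all three encodings, so the gap only matters for pages that are mostly Chinese.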
You can view the repertoire of characters in Unicode as having GB and Big5
as subsets, and thus you can do direct conversions from GB to Unicode, and
from Big5 to Unicode. However, there are characters in GB which do not
occur in Big5 and vice versa, so conversion between the two is lossy. My
guess is that Google employs search algorithms which convert characters to
UTF-8 and then search for web pages in both simplified (GB) and traditional
(Big5) characters at the same time. At least, this is what I get when I
enter characters from one character set or the other into their search
field.
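To make the lossy part concrete, here is a small Python sketch, again assuming the standard library codecs. The helpers can_encode and convert, and the choice of 语 / 語 as the simplified and traditional pair, are mine for illustration; this says nothing about how Google actually does it.

```python
# Conversion via Unicode: GB -> Unicode always works, GB -> Big5 may not.
def can_encode(ch, encoding):
    """Return True if ch has a mapping in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def convert(data, src, dst):
    """Convert bytes from src to dst by way of Unicode.

    Characters with no mapping in dst are replaced with '?'.
    """
    return data.decode(src).encode(dst, errors="replace")


simplified = "语"   # simplified form, chosen for the example
traditional = "語"  # traditional form of the same word

gb_bytes = simplified.encode("gb2312")

# GB -> UTF-8 round-trips without loss.
print(convert(gb_bytes, "gb2312", "utf-8").decode("utf-8") == simplified)

# The simplified form has no Big5 mapping, and the traditional form has
# no GB2312 mapping, so a direct GB <-> Big5 conversion loses information.
print(can_encode(simplified, "big5"), can_encode(traditional, "gb2312"))
print(convert(gb_bytes, "gb2312", "big5"))
```

A real converter would need a simplified/traditional mapping table on top of this; the codecs alone only tell you which characters have no counterpart.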
Dyl.