Posted by Dylan Sung on 07/17/06 17:00
"Dylan Sung" <dylanwhs.tsktsktsk@pacific.net.hk> wrote in message
news:e9gfj6$rli$1@nntp.aioe.org...
>
> "Nikita the Spider" <NikitaTheSpider@gmail.com> wrote in message
> news:NikitaTheSpider-8A9133.10534117072006@news-rdr-02-ge0-1.southeast.rr.com...
>> In article <1153147164.996213.29840@p79g2000cwp.googlegroups.com>,
>> "Pat" <GTQWVOPHMEAQ@spammotel.com> wrote:
>>
>>> Would Google and other search engines support the indexing of
>>> non-English UTF-8 encoded websites?
>>
>> Yes.
>>
>>
>>> Most Chinese websites indexed on Google appear to be
>>> - for Traditional Chinese, charset=big5 encoding=ANSI
>>> - for Simplified Chinese, charset=gb2312 encoding=ANSI
>>
>> I don't have any experience with Asian encodings, but my guess is that
>> Big5 is preferable to UTF-8 because it is more efficient (i.e. takes up
>> less space) when most of the characters are Asian. If you don't mind
>> fatter pages, UTF-8 should be fine.
>
> Encodings like GB and Big5 are double-byte encodings. However, Unicode
> (UTF-8 at least) uses three or more bytes for Far East Asian characters
> (amongst others). So yes, in terms of economy, GB and Big5 yield text
> files that have fewer bytes.
>
> You can view the repertoire of characters in Unicode as having subsets of
> GB and Big5 within it, and thus you can do direct conversions from GB
> to Unicode, and Big5 to Unicode. However, there are characters in GB which
> do not occur in Big5 and vice versa, so conversion between the two is
> lossy. My guess is that Google employs search algorithms which convert
> characters to UTF-8 and then search for webpages which contain both
> simplified characters in GB and traditional characters in Big5 at the
> same time. At least, this is what I get when I enter characters from one
> or the other character set into their search field.
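To make the two points quoted above concrete, here is a rough Python sketch. It assumes Python's built-in "gb2312" and "big5" codecs are a fair stand-in for what a page author would use; the helper names (can_encode, convert) and the sample characters are my own choices for illustration, not anything Google is known to do.

```python
# Sketch: compare encoded sizes, then show that GB <-> Big5 conversion
# via Unicode can lose characters. Helper names are hypothetical.

text = "中文"  # both characters exist in GB2312 and in Big5

# Two bytes per character in GB2312/Big5, three per character in UTF-8.
sizes = {codec: len(text.encode(codec)) for codec in ("gb2312", "big5", "utf-8")}
print(sizes)  # {'gb2312': 4, 'big5': 4, 'utf-8': 6}


def can_encode(ch: str, codec: str) -> bool:
    """Return True if the character exists in the given codec's repertoire."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from src to dst by way of Unicode.

    Characters missing from dst are replaced with '?', which is
    exactly where the loss happens.
    """
    return data.decode(src).encode(dst, errors="replace")


# Characters in both sets survive the round trip through Unicode.
shared = convert(text.encode("gb2312"), "gb2312", "big5")
print(shared.decode("big5"))

# The simplified form exists only in GB2312, the traditional form only in Big5.
print(can_encode("们", "gb2312"), can_encode("们", "big5"))
print(can_encode("們", "gb2312"), can_encode("們", "big5"))

# Converting the simplified character to Big5 therefore loses it.
lossy = convert("们".encode("gb2312"), "gb2312", "big5")
print(lossy)
```

Going to Unicode in either direction is safe; it is only the final step into the other legacy encoding that drops characters.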
Sorry, I didn't answer the original question. I think that web pages should
declare the encoding they actually use: GB when GB is used, and so forth.
Search engines can do the rest.
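As a minimal sketch of what "declare the encoding" means in practice, the Python below writes a page whose meta declaration matches the bytes, then reads it back the way a crawler might. The function names (render_page, decode_page) and the UTF-8 fallback are assumptions for illustration only; real search engines certainly do more than this.

```python
import re


def render_page(body: str, charset: str) -> bytes:
    """Build a tiny HTML page whose declared charset matches its bytes."""
    html = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        f"</head><body>{body}</body></html>"
    )
    return html.encode(charset)


def decode_page(raw: bytes) -> str:
    """Decode a page using the charset it declares.

    The declaration is plain ASCII, so it can be located in the raw
    bytes before the encoding is known. Falling back to UTF-8 is an
    assumption made for this sketch.
    """
    match = re.search(rb"charset=([A-Za-z0-9_-]+)", raw)
    charset = match.group(1).decode("ascii") if match else "utf-8"
    return raw.decode(charset)


gb_page = render_page("中文", "gb2312")
big5_page = render_page("中文", "big5")

# Different bytes on the wire, same text once the declaration is honoured.
print(gb_page != big5_page)
print("中文" in decode_page(gb_page), "中文" in decode_page(big5_page))
```

If the declaration says one thing and the bytes are another, the reader gets garbage or an error, which is the reason to label the page honestly.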
Dyl.