Posted by Gordon Burditt on 10/26/05 21:15
>I am working on a spider script but I only want to parse English pages.
>Is there a way I can check to see what language the content is in? I
>suppose I could restrict my spider to just .com, .org, etc. so foreign
>countries would not get parsed.
Lots of websites in any domain are multi-lingual. Lots of websites
in non-English-speaking countries are in English (at least partly).
Your spider might manage to use content negotiation to try to select
the English content over other versions of it, but I suspect most
websites aren't really set up to use content negotiation.
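
For what it's worth, here is a rough Python sketch of the content negotiation idea: send an Accept-Language request header asking for English, then look at the Content-Language response header, if the server bothers to send one. The function names (build_request, declares_english, fetch_if_english) and the header value are my own assumptions for illustration, and a missing header tells you nothing either way.

```python
import urllib.request

def build_request(url):
    # Ask for English first; the q-values here are an arbitrary choice.
    return urllib.request.Request(
        url, headers={"Accept-Language": "en, en-US;q=0.9, *;q=0.1"}
    )

def declares_english(content_language):
    # Content-Language may hold several comma-separated tags, e.g. "en, fr".
    if not content_language:
        return False
    tags = [t.strip().lower() for t in content_language.split(",")]
    return any(t == "en" or t.startswith("en-") for t in tags)

def fetch_if_english(url):
    # Returns the body only when the server explicitly labels it English.
    with urllib.request.urlopen(build_request(url)) as resp:
        if declares_english(resp.headers.get("Content-Language")):
            return resp.read()
    return None
```

Since many servers ignore Accept-Language or omit Content-Language, this is at best a first filter, not a reliable test.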
There are probably some word frequency tests you can use to guess
what language a web page is in. sci.crypt often uses such info to
try to crack ciphers if they think they know what language the
message is in. This might fall flat on its face if the web site
is discussing another language (e.g. computer programming languages,
or something laced heavily with technical jargon).
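
A crude version of such a word frequency test, sketched in Python: count what fraction of the words on the page are very common English function words. The word list and the 0.2 threshold are guesses of mine, not tuned values, and the names english_score and looks_english are made up for this example.

```python
import re

# A small, hand-picked set of frequent English function words (an assumption).
ENGLISH_COMMON = frozenset(
    "the of and to a in is that it for was on with as be at by "
    "this have from or are not but".split()
)

def english_score(text):
    # Fraction of word tokens that appear in the common-word set.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in ENGLISH_COMMON)
    return hits / len(words)

def looks_english(text, threshold=0.2):
    # The threshold is arbitrary; adjust it against real pages.
    return english_score(text) >= threshold
```

You would strip the HTML markup before scoring. Pages that are mostly source code or jargon contain few function words, so they can score low even when they are written in English.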
Gordon L. Burditt