Re: Spidering only english webpages

Posted by Gordon Burditt on 10/26/05 21:15

>I am working on a spider script but I only want to parse English pages.
>Is there a way I can check to see what language the content is in? I
>suppose I could restrict my spider to just .com , .org, etc so foreign
>countries would not get parsed.

Lots of websites in any domain are multi-lingual. Lots of websites
in non-English-speaking countries are in English (at least partly).

Your spider could use HTTP content negotiation (sending an
Accept-Language header) to ask for the English version of a page,
but I suspect most websites aren't really set up to honor it.
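
As a rough sketch of the content-negotiation idea, here it is in Python
using only the standard library. The function names build_english_request
and declares_english are made up for illustration, and the q-values in
the header are just one plausible choice. A server is free to ignore
Accept-Language, and many pages send no Content-Language header at all,
so treat the result as a hint rather than proof.

```python
from urllib.request import Request


def build_english_request(url):
    """Build a request that asks the server to prefer English content."""
    # Accept-Language is only a preference; the server may ignore it.
    return Request(url, headers={"Accept-Language": "en, en-US;q=0.9, *;q=0.1"})


def declares_english(content_language):
    """Return True if a Content-Language header value lists an English tag."""
    if not content_language:
        # Header missing or empty: the server told us nothing.
        return False
    tags = [tag.strip().lower() for tag in content_language.split(",")]
    return any(tag == "en" or tag.startswith("en-") for tag in tags)
```

In a spider you would pass the request to urllib.request.urlopen() and
feed response.headers.get("Content-Language") to declares_england's
counterpart above, declares_english(), falling back to some other test
when the header is absent.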

There are probably some word frequency tests you can use to guess
what language a web page is in. Posters on sci.crypt often use such
statistics to try to crack ciphers when they think they know what
language the message is in. This might fall flat on its face if the
web site is discussing another language (e.g. computer programming
languages, or something laced heavily with technical jargon).
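
One simple form of a word frequency test is to count how many words on
the page are very common English function words. The sketch below is
in Python; the word list is a small hand-picked sample and the 0.2
threshold is a guess, so both are assumptions you would want to tune
against real pages. The names english_score and looks_english are
hypothetical.

```python
import re

# Hand-picked sample of very common English function words.
# This is an assumption for illustration, not a standard list.
ENGLISH_STOPWORDS = frozenset(
    "the of and to a in is that it for was on with as be at by "
    "this have from or are not but".split()
)


def english_score(text):
    """Return the fraction of words in text that are common English words."""
    # Only ASCII letters are matched, so text in other scripts yields no words.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words)


def looks_english(text, threshold=0.2):
    """Guess whether text is English; threshold is an untuned assumption."""
    return english_score(text) >= threshold
```

Run it on the visible text of the page after stripping the markup. As
noted above, a page full of source code or jargon will contain few
function words and can score below the threshold even though it is
written in English.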

Gordon L. Burditt

 
