Re: Spidering only english webpages

Posted by Philip Ronan on 10/08/76 11:30

"Paul Herber" wrote:

> On 26 Oct 2005 10:06:02 -0700, el_roachmeister@yahoo.com wrote:
>
>> I am working on a spider script but I only want to parse English pages.
>> Is there a way I can check to see what language the content is in? I
>> suppose I could restrict my spider to just .com, .org, etc. so foreign
>> countries would not get parsed.
>
> http://www.deutsch-online.com/
>

Here are some more for you:

http://www.clemi.org/
http://www.tottori.co.uk/

Not even Google can work out a web page's language with 100% reliability
(see <http://www.google.com/help/faq_translation.html#link>).

As Gordon suggests, you might achieve *some* success by checking things like
word frequency, but this is computationally expensive, and you still have to
consider things like speling mistaiks and typign errors.
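
For what it's worth, a crude version of the word-frequency idea might look
something like the sketch below. It is written in Python rather than PHP just
to keep it short, and the word list, the 0.2 threshold, and the function names
are all my own assumptions for illustration, not anything tested on real
pages. It only counts how many words on the page are very common English
function words, so misspelled content words don't hurt it much, but short or
mixed-language pages will still fool it.

```python
import re

# Hypothetical, hand-picked list of very common English function words.
# A real spider would probably want a longer, properly sourced list.
ENGLISH_STOPWORDS = frozenset({
    "the", "and", "of", "to", "a", "in", "is", "that", "it", "for",
    "with", "as", "on", "was", "are", "this", "be", "by", "or", "not",
})


def english_stopword_ratio(text):
    """Return the fraction of words in `text` that are in ENGLISH_STOPWORDS."""
    # Only ASCII letters and apostrophes count as word characters here, so
    # accented words get split into fragments; good enough for a rough check.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words)


def looks_english(text, threshold=0.2):
    """Guess whether `text` is English; `threshold` is an assumed cut-off."""
    return english_stopword_ratio(text) >= threshold
```

You would call looks_english() on the page text after stripping the markup,
and treat the answer as a hint rather than a verdict.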

Some pages have a lang attribute in the HTML tag (e.g., <HTML lang="en">),
but most don't.
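
If you want to use the lang attribute where it does exist, a minimal sketch
(again in Python, using only the standard library's html.parser; the class and
function names are made up for this example) could be as follows. It returns
None when the page declares nothing, which is the common case, so you would
still need a fallback.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Records the lang attribute of the first <html> tag, if there is one."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <HTML LANG="en">
        # arrives here as tag "html" with attribute "lang".
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_language(html):
    """Return the declared language code in lowercase, or None if absent."""
    finder = LangFinder()
    finder.feed(html)
    return finder.lang


def declares_english(html):
    """Return True/False if a language is declared, None if the page is silent."""
    lang = declared_language(html)
    if lang is None:
        return None
    return lang == "en" or lang.startswith("en-")
```

Bear in mind the attribute only tells you what the author claims, and it can
be wrong or copied from a template.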

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

 
