Posted by Philip Ronan on 10/08/76 11:30
"Paul Herber" wrote:
> On 26 Oct 2005 10:06:02 -0700, el_roachmeister@yahoo.com wrote:
>
>> I am working on a spider script but I only want to parse english pages.
>> Is there a way I can check to see what language the content is in? I
>> suppose I could restrict my spider to just .com , .org, etc so foreign
>> countries would not get parsed.
>
> http://www.deutsch-online.com/
>
Here are some more for you:
http://www.clemi.org/
http://www.tottori.co.uk/
Not even Google can work out a web page's language with 100% reliability
(see <http://www.google.com/help/faq_translation.html#link>).
As Gordon suggests, you might achieve *some* success by checking things like
word frequency, but this is computationally expensive, and you still have to
consider things like speling mistaiks and typign errors.
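
To give a rough idea of what a word-frequency check could look like, here is a minimal sketch in Python (not the language of your spider, just for illustration). The word list, the function names and the 0.2 threshold are all my own assumptions, not anything standard; a real check would need a much larger list and some tuning.

```python
import re

# Hypothetical, deliberately tiny list of very common English words.
# This is an illustrative assumption, not a standard stopword list.
ENGLISH_STOPWORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that", "it", "for",
    "was", "on", "with", "as", "are", "this", "be", "at", "by", "not",
    "or", "have", "from", "you",
}

def english_score(text):
    """Return the fraction of words in text that are common English words."""
    # Only ASCII letters and apostrophes count as word characters here,
    # so accented words get split into fragments and rarely match.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words)

def looks_english(text, threshold=0.2):
    """Guess whether text is English; the threshold is an untuned assumption."""
    return english_score(text) >= threshold
```

Misspelled words simply fail to match, so a page full of typos scores lower than it should, and very short pages give unreliable scores either way.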
Some pages have a lang attribute in the HTML tag (e.g., <HTML lang="en">),
but most don't.
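
Where the attribute is present, reading it is cheap. The following is only a sketch using Python's standard html.parser module; LangFinder and declared_lang are hypothetical names of mine, and the code looks at nothing except the lang attribute on the HTML tag.

```python
from html.parser import HTMLParser

class LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag, if there is one."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()

def declared_lang(html_text):
    """Return the declared language code (e.g. 'en'), or None if absent."""
    finder = LangFinder()
    finder.feed(html_text)
    return finder.lang
```

A spider could call declared_lang(page) first and fall back to a slower content check only when it returns None, which will be the case for most pages.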
--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/