Reply to Re: Find urls in plain text files — PHP Programming Language

Posted by Christoph Burschka on 11/02/07 15:08

Salve Håkedal wrote:
> What is the best regular expression for finding urls in plain text
> files?
> (By urls I mean http://www.something.com, but also www.something.com,
> or salve@somewhere.com)
>
> Salve

I've used this before, but you're probably better off making your own
expression. Note that it's really loose and will get a lot of false positives -
especially file names - and it will cause havoc if you use it on HTML source. I
deliberately did not enter any Top-Level-Domain filtering, because there are so
many of them. You can replace the [a-z]{2,5} with something like (com|net|org)
if you don't need to worry about country codes.
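
As a rough sketch of what that substitution does, the two fragments below compare the loose suffix with a restricted one. Only the tail of the hostname is shown, and the `$` anchor and the variable names are assumptions added for this illustration, not part of the expression itself.

```php
<?php
// Illustrative fragments only: they test the end of a hostname, nothing else.
// The trailing "$" anchor is an assumption added so the fragments can be
// tried on their own.
$anyTld    = '~\.[a-z]{2,5}$~i';      // loose: any suffix of 2 to 5 letters
$commonTld = '~\.(com|net|org)$~i';   // strict: three common TLDs only

foreach (['www.something.com', 'readme.txt', 'www.something.de'] as $name) {
    // preg_match() returns 1 on a match and 0 otherwise.
    printf("%-18s loose=%d strict=%d\n",
        $name, preg_match($anyTld, $name), preg_match($commonTld, $name));
}
```

The loose version accepts all three names, including the file name. The strict version accepts only www.something.com, which also means it drops the country-code host.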

The following expression should find strings that satisfy these conditions:

- optionally an http protocol identifier
- optionally a username(:password)@ string, which allows pretty much any
characters except for spaces and colons. This isn't RFC-standard, by the way.
- a hostname consisting of at least two and at most 34 labels, the last of which
has 2 to 5 letters (for weird new ones like aero; museum has six letters and
would need {2,6}. You can shorten the upper limit to 3 and still get the most
common ones).
- optionally a path containing any characters apart from spaces, and /ending in
a non-punctuation character/. This last bit is vital because it avoids messing
up URLs at the end of a sentence.

(http:\/\/)?([^ :]+(:[^ ]+)?@)?[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?){0,32}\.[a-z]{2,5}(\/[^ ]*[^" \.,;\)])?

(All on one line; each [^ ] class contains a single space.)

This is a case-insensitive pattern, so you'll need the i modifier.
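
For what it's worth, here is a minimal sketch of how the expression might be used with preg_match_all(). The pattern is written on a single line with a space inside each [^ ] class. The `~` delimiter, the constant name URL_PATTERN, the helper name find_urls() and the sample text are all assumptions made for this example.

```php
<?php
// The expression from this post on one line, with the "i" modifier appended.
// "~" is used as the delimiter (an assumption; any unused delimiter works).
const URL_PATTERN = '~(http:\/\/)?([^ :]+(:[^ ]+)?@)?[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?){0,32}\.[a-z]{2,5}(\/[^ ]*[^" \.,;\)])?~i';

// find_urls() is a hypothetical helper name, not something from the post.
function find_urls(string $text): array
{
    // $matches[0] holds the full text of every match, in order of appearance.
    preg_match_all(URL_PATTERN, $text, $matches);
    return $matches[0];
}

// Sample text made up for this example.
$text = 'See http://www.example.com/page.html, or www.example.org. '
      . 'Mail salve@somewhere.com about notes.txt today.';

print_r(find_urls($text));
```

On this sample the helper should return http://www.example.com/page.html, www.example.org, salve@somewhere.com and notes.txt. The comma and the full stop that follow the first two are left out, which is the point of the non-punctuation ending, and notes.txt is exactly the kind of file-name false positive mentioned above.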

--
Christoph Burschka
