Posted by Christoph Burschka on 11/02/07 15:08
Salve Håkedal wrote:
> What is the best regular expression for finding urls in plain text
> files?
> (By urls I mean http://www.something.com, but also www.something.com,
> or salve@somewhere.com)
>
> Salve
I've used this before, but you're probably better off writing your own
expression. Note that it's really loose and will produce a lot of false
positives, especially file names, and it will wreak havoc if you use it on HTML
source. I deliberately left out any top-level-domain filtering, because there
are so many of them. You can replace the [a-z]{2,5} with something like
(com|net|org) if you don't need to worry about country codes.
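A rough illustration of that substitution, assuming Python's re module and a much simplified hostname pattern (not the full expression from this post). The names LOOSE_TLD, STRICT_TLD and hosts are made up for this sketch:

```python
import re

# Simplified stand-ins for illustration only: one label followed by a TLD.
LOOSE_TLD = re.compile(r'[a-z0-9\-]+\.[a-z]{2,5}', re.IGNORECASE)
STRICT_TLD = re.compile(r'[a-z0-9\-]+\.(com|net|org)', re.IGNORECASE)


def hosts(pattern, text):
    """Return every substring of text that the given pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "Read notes.txt or visit example.org"
print(hosts(LOOSE_TLD, sample))   # the file name is picked up as well
print(hosts(STRICT_TLD, sample))  # only the .org host remains
```

The fixed list removes file-name false positives like notes.txt, at the price of missing every TLD that is not in the list.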
The following expression should find strings that satisfy these conditions:
- optionally an http:// protocol identifier (http:// only, not https://)
- optionally a username(:password)@ string. The username allows pretty much any
characters except for spaces and colons, the password anything except spaces.
This isn't RFC-standard, by the way.
- a hostname consisting of at least two and at most 34 labels of up to 63
characters each, the last of which has 2 to 5 letters (for newer ones like
aero; note that museum has six letters and would need {2,6}). You can lower the
upper limit to 3 and still get the most common ones.
- optionally a path containing any characters apart from spaces, and /ending in
a non-punctuation character/ (anything but a double quote, period, comma,
semicolon or closing parenthesis). This last bit is vital because it keeps the
punctuation after a URL at the end of a sentence out of the match.
(http:\/\/)?([^ :]+(:[^ ]+)?@)?[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?){0,32}\.[a-z]{2,5}(\/[^ ]*[^" \.,;\)])?
(This is a single line. Each [^ ] contains one space, which email clients tend to turn into a line break.)
This pattern is case-insensitive, so you'll need the i modifier.
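A minimal sketch of applying the expression, assuming Python's re module, where re.IGNORECASE plays the role of the i modifier. The names URL_PATTERN and find_urls and the sample text are made up for illustration:

```python
import re

# The expression from this post, split into its four parts for readability.
URL_PATTERN = re.compile(
    r'(http:\/\/)?'                                   # optional protocol
    r'([^ :]+(:[^ ]+)?@)?'                            # optional user(:password)@
    r'[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?'            # first hostname label
    r'(\.[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?){0,32}'  # further labels
    r'\.[a-z]{2,5}'                                   # top-level domain
    r'(\/[^ ]*[^" \.,;\)])?',                         # optional path
    re.IGNORECASE,
)


def find_urls(text):
    """Return all substrings of text matched by the loose URL pattern."""
    # finditer + group(0) yields whole matches; findall would return
    # tuples of the capture groups instead.
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


sample = ("Mail salve@somewhere.com or visit www.something.com. "
          "See also http://www.something.com/docs/index.html, and notes.txt.")
for url in find_urls(sample):
    print(url)
```

In this sample the trailing period and comma stay out of the matches, while notes.txt is picked up as one of the false positives mentioned above. Note that [^ ] also matches newlines, so on multi-line input it may be safer to split the text on whitespace first.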
--
Christoph Burschka