Posted by Salve Håkedal on 11/02/07 16:22
On 2007-11-02, Christoph Burschka <christoph.burschka@rwth-aachen.de> wrote:
> Salve Håkedal wrote:
>> What is the best regular expression for finding urls in plain text
>> files?
>> (By urls I mean http://www.something.com, but also www.something.com,
>> or salve@somewhere.com)
>>
>> Salve
>
> I've used this before, but you're probably better off making your own
> expression. Note that it's really loose and will get a lot of false positives -
> especially file names - and it will cause havoc if you use it on HTML source. I
> deliberately did not enter any Top-Level-Domain filtering, because there are so
> many of them. You can replace the [a-z]{2,5} with something like (com|net|org)
> if you don't need to worry about country codes.
>
> The following expression should find strings that satisfy these conditions:
>
> - optionally an http:// protocol identifier
> - optionally a username(:password)@ string, in which the username can contain
> pretty much any characters except for spaces and colons. This isn't
> RFC-standard, by the way.
> - a hostname consisting of at least two and at most 34 labels, the last of which
> has 2 to 5 letters (to cover newer ones like aero; museum has six letters, so
> it would need {2,6}. You can shorten it to {2,3} and still get the most
> common ones).
> - optionally a path containing any characters apart from spaces, and /ending in
> a non-punctuation character/. This last bit is vital because it keeps the
> punctuation that follows a URL at the end of a sentence out of the match.
>
> (http:\/\/)?([^ :]+(:[^ ]+)?@)?[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?){0,32}\.[a-z]{2,5}(\/[^ ]*[^" \.,;\)])?
>
> (The pattern is a single line; any line breaks in it were added by the email
> client.)
>
> This is a case-insensitive pattern, so you'll need the i modifier.
>
> --
> Christoph Burschka
Thanks a lot! I'll study this closely.
--
Salve
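
For anyone who wants to try the expression quoted above, here is a minimal sketch in Python. It is only one way to wire it up, not necessarily how Christoph uses it. The pattern is rejoined onto one line, a redundant "i" inside the first character class is dropped (it does not change what matches), and re.IGNORECASE stands in for the i modifier. The names build_pattern and find_urls, and the tld parameter, are made up for this example.

```python
import re


def build_pattern(tld=r'[a-z]{2,5}'):
    """Compile the pattern from the thread.

    `tld` is a hypothetical knob for this sketch: pass e.g. '(com|net|org)'
    to restrict the last label, as the post suggests.
    """
    return re.compile(
        r'(http:\/\/)?'                                    # optional protocol
        r'([^ :]+(:[^ ]+)?@)?'                             # optional user(:password)@
        r'[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?'             # first hostname label
        r'(\.[a-z0-9]([a-z0-9\-]{0,61}[a-z0-9])?){0,32}'   # up to 32 further labels
        r'\.' + tld +                                      # last label (TLD)
        r'(\/[^ ]*[^" \.,;\)])?',                          # optional path, no trailing punctuation
        re.IGNORECASE,                                     # the "i modifier"
    )


def find_urls(text, pattern=None):
    """Return every substring of `text` that the pattern matches."""
    pattern = pattern or build_pattern()
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == '__main__':
    sample = ('Visit http://www.something.com/path/page.html, then '
              'www.something.com. Mail salve@somewhere.com today.')
    for url in find_urls(sample):
        print(url)

    # The loose default also picks up file names, as warned in the post;
    # restricting the TLD list avoids that particular false positive.
    print(find_urls('See notes.txt'))
    print(find_urls('See notes.txt', build_pattern('(com|net|org)')))
```

On the sample sentence this should pick out the full URL without the trailing comma, the bare www.something.com without the full stop, and the email address. Note that the character classes only exclude spaces, not tabs or newlines, so on multi-line input it may be safer to split the text into lines or words first.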