Posted by data64 on 06/17/05 16:24
"Travis Newbury" <TravisNewbury@hotmail.com> wrote in
news:1118946530.850531.150290@g14g2000cwa.googlegroups.com:
> Does anyone know of a program that can crawl a website and tell what
> files are not used any more?
>
> The servers are running on IIS
>
> Thanks
>
We did something similar using Perl, essentially comparing the files indexed
by our search engine with the files in the webserver directory. Since they
were static files, this was fairly simple.
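I can't post our script, but the general shape is simple: build a hash of
everything the index (or crawl) knows about, walk the document root, and
print whatever isn't in the hash. A rough sketch, assuming a file called
indexed.txt with one docroot-relative path per line and a made-up docroot
path (adjust both for your setup):

#!/usr/bin/perl
# Sketch only: compare the files the crawler/index knows about against
# everything under the document root, and print the leftovers.
use strict;
use warnings;
use File::Find;

my $docroot = 'C:/inetpub/wwwroot';   # placeholder -- your IIS site root
my %indexed;

# indexed.txt: one path per line, relative to the docroot (assumption)
open my $fh, '<', 'indexed.txt' or die "indexed.txt: $!";
while (my $path = <$fh>) {
    chomp $path;
    $indexed{lc $path} = 1;           # IIS paths are case-insensitive
}
close $fh;

# walk the docroot and report anything the index never saw
find(sub {
    return unless -f;
    (my $rel = $File::Find::name) =~ s/^\Q$docroot\E\/?//;
    print "$rel\n" unless $indexed{lc $rel};
}, $docroot);

Whatever it prints is only a candidate for deletion: watch out for files
that are reachable in ways the crawler won't see (javascript, includes,
pages excluded by robots.txt).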
If you are looking for a spider to crawl things and don't mind using Perl,
there's Merlyn's article on a simple spider:
http://www.stonehenge.com/merlyn/WebTechniques/col07.html
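I won't reproduce his code, but the bare-bones shape of such a spider is
just LWP plus HTML::LinkExtor: fetch a page, extract its links, and queue
the same-host ones you haven't seen yet. A minimal sketch, not taken from
the article, with a placeholder start URL:

#!/usr/bin/perl
# Bare-bones same-site spider: fetch a page, pull out its links,
# queue the on-site ones we haven't visited yet, print every URL seen.
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;
use URI;

my $start = URI->new('http://www.example.com/');   # placeholder start URL
my %seen  = ($start => 1);
my @queue = ($start);

while (my $url = shift @queue) {
    print "$url\n";
    my $html = get($url) or next;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && $attr{href};
        my $link = URI->new_abs($attr{href}, $url);
        $link->fragment(undef);                      # drop #anchors
        return unless $link->scheme eq 'http';       # skip mailto:, ftp:, etc.
        return unless $link->host eq $start->host;   # stay on this site
        push @queue, $link unless $seen{$link}++;
    });
    $extor->parse($html);
}

The list it prints is what the site actually links to; diff that against
the directory listing as in the sketch above.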
The swish-e open source search engine ships with a spider that you could use
to get a list of files for your site, which you could then compare against a
listing of your filesystem. In your case you would have to modify it to
return only the names rather than the entire documents.
http://swish-e.org/docs/spider.html
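If I remember right, spider.pl writes each document to stdout in swish-e's
"prog" input format, a short header block (Path-Name:, Content-Length:, and
so on) followed by the body, so you may not even need to modify it; filtering
the headers might be enough. The "default" config name, the header spelling,
and the script/file names below are from memory or made up, so check them
against your version:

#!/usr/bin/perl
# urls_only.pl -- read spider.pl's output on stdin, keep just the URLs
use strict;
use warnings;

while (<STDIN>) {
    print "$1\n" if /^Path-Name:\s*(.+)/;
}

Run it as something like

  perl spider.pl default http://yoursite.example/ | perl urls_only.pl > site_urls.txt

and feed site_urls.txt to the comparison script above.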
data64