Reply to Re: Efficient way to rip html — HTML

Posted by Nikita the Spider on 10/03/06 19:15

In article <slrnei5alo.kp2.spamspam@bowser.marioworld>,
Ben C <spamspam@spam.eggs> wrote:

> On 2006-10-03, Arthur Rhodes <rhodesr@no.spam.com> wrote:
> > I'm building a web store and I have to create a large number of
> > product descriptions. The distributors do not provide spec sheets
> > or marketing materials to me in html format. Instead, they advise
> > me to simply copy the descriptions from their web sites.
> >
> > The problem is that the descriptions I need to copy are embedded
> > in complex pages, with nested tables, etc. Simply copying the
> > page source doesn't seem to be that useful. I end up having to
> > cut out lots of table code, etc., and usually make mistakes that
> > are time consuming to figure out and fix.
> >
> > The other alternative is to copy the text and then recreate the html
> > formatting from scratch.
> >
> > Is there an easier way?
>
> Python, and Beautiful Soup.
>
> http://www.crummy.com/software/BeautifulSoup/

Seconded. If you're willing to go the Python programming route, Connelly
Barnes' htmldata might also prove helpful:
http://oregonstate.edu/~barnesc/htmldata/
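
To make the idea concrete, here is a rough sketch of the kind of script I mean. It uses only the standard library's html.parser rather than Beautiful Soup or htmldata, so treat it as a stand-in for the approach and not as either library's API. It pulls out the contents of one element, keeps a short whitelist of formatting tags, and throws away the table and font markup. The id "desc", the sample page, and the names KEEP, DescriptionExtractor and extract_description are all made up for illustration; a real distributor page will need its own way of locating the description.

```python
import re
from html import escape
from html.parser import HTMLParser

# Formatting tags worth keeping in a product description (an assumption;
# adjust to taste). Anything else, e.g. table/tr/td/font, is dropped,
# but the text inside it is kept.
KEEP = {"p", "ul", "ol", "li", "b", "i", "em", "strong", "br", "h2", "h3"}
# Tags that never have a closing tag, so they must not affect nesting depth.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class DescriptionExtractor(HTMLParser):
    """Collect the contents of the element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0   # 0 = outside the target, >0 = nesting depth inside it
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # Still searching: enter the target element when we see its id.
            if dict(attrs).get("id") == self.target_id:
                self.depth = 1
            return
        if tag not in VOID:
            self.depth += 1
        if tag in KEEP:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID:
            return
        self.depth -= 1
        # depth == 0 here means we just closed the target element itself.
        if self.depth > 0 and tag in KEEP:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth > 0:
            # Collapse runs of whitespace and re-escape &, < and >.
            self.parts.append(escape(re.sub(r"\s+", " ", data), quote=False))


def extract_description(page_html, target_id):
    parser = DescriptionExtractor(target_id)
    parser.feed(page_html)
    parser.close()
    return "".join(parser.parts).strip()


# Hypothetical page: the description sits inside nested tables.
SAMPLE = (
    '<html><body><table><tr><td>Site menu</td>'
    '<td id="desc"><table><tr><td><font size="2">'
    '<p>A <b>sturdy</b> widget.</p>'
    '<ul><li>Red</li><li>Blue</li></ul>'
    '</font></td></tr></table></td></tr></table></body></html>'
)

if __name__ == "__main__":
    print(extract_description(SAMPLE, "desc"))
```

The depth counter assumes the tags inside the description are properly closed. Real-world pages often are not, which is exactly where a forgiving parser like Beautiful Soup earns its keep.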

Last but not least, you could use command-line Spyce (HTML templates with
the dynamic bits written in Python) to build your Web pages:
http://spyce.sourceforge.net/

Good luck

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
