Posted by Steve on 11/09/07 02:36
"Animesh K" <animesh1978@gmail.com> wrote in message
news:fh0c8g$1u1s$1@agate.berkeley.edu...
> Jerry Stuckle wrote:
>> Animesh K wrote:
>>> Jerry Stuckle wrote:
>>>
>>>>>
>>>>
>>>> That's not easy. Are you keeping the articles in a database or text
>>>> files? If the former, you can search the database.
>>>>
>>>
>>> To keep the problem simpler, let's assume that each article has
>>> tags/authors/topic and it is stored in database.
>>>
>>> One can view this as a graph-theoretic problem (with the graph being
>>> computed by php and stored in a database). But doing it in an efficient
>>> way would be interesting.
>>>
>>> Scanning the article for probable keywords is the next (and much harder)
>>> step :)
Why not approach it as a weighted ranking (grading pages on a curve) rather than as an explicit, hand-built graph, without manually defining specific words that should identify each page? That may be what you're talking about anyway; in that case, it should be less of a big deal than you think.
If you parse the text of the pages, exclude common words (adjectives, articles, verbs, and the like), and essentially reduce the page content to nouns, you can then rank each remaining word by its occurrence count. You could also help yourself along by creating a mapping table in which you define jargon that is found in, or unique to, your site; that would sharpen the ranking I just described. You could also weight the rank in other ways, such as by the 'common-ness' of the words left in the reduced content: 'theory' is not a very common term in most settings, so it may deserve to be treated as a more predominant descriptor of what the page is about. Make sense?
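Here's a very rough sketch of that ranking idea in PHP. The stop word list and the jargon "mapping table" below are made-up placeholders, not anything definitive, so adjust them to your own site:

<?php
// Rough sketch of the keyword ranking described above.
// $stopWords stands in for whatever list of common words you exclude,
// and $jargonWeights is the optional mapping table of site-specific
// terms that should count extra. Both are hypothetical examples.
function rankKeywords($text, array $stopWords, array $jargonWeights = array())
{
    // lower-case the text and split it into words, dropping punctuation
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $scores = array();
    foreach ($words as $word) {
        if (strlen($word) < 3 || in_array($word, $stopWords)) {
            continue; // skip articles and other short/common words
        }
        // base rank is the occurrence count; jargon terms get a boost
        $boost = isset($jargonWeights[$word]) ? $jargonWeights[$word] : 1;
        $scores[$word] = (isset($scores[$word]) ? $scores[$word] : 0) + $boost;
    }

    arsort($scores);   // highest-ranked keywords first
    return $scores;
}

// example usage with toy data
$stopWords = array('the', 'and', 'that', 'this', 'with', 'for', 'are', 'was');
$jargon    = array('theory' => 3, 'graph' => 2);
$text      = 'Graph theory says the graph of the site is a graph of pages.';
print_r(rankKeywords($text, $stopWords, $jargon));
?>

With that toy input, 'graph' comes out on top (three occurrences times its boost of 2) and 'theory' next, which is roughly the behavior you want.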
That's a content-based way to rank similarities between pages. As for tags, authors, and topics? Well, those are pretty specific, so far less guessing has to be done.
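Once each page has its reduced keyword list (or its tag list from the database), comparing two pages is just set overlap. A quick sketch, again with made-up data, scoring by the fraction of terms the two pages share:

<?php
// Similarity of two pages as the share of terms they have in common.
// The keyword arrays here are illustrative only.
function similarity(array $a, array $b)
{
    $shared = count(array_intersect($a, $b));
    $total  = count(array_unique(array_merge($a, $b)));
    return $total > 0 ? $shared / $total : 0; // 0 = nothing shared, 1 = identical
}

echo similarity(array('graph', 'theory', 'php'),
                array('graph', 'database', 'php')); // prints 0.5
?>

Tags, authors, and topics plug into the same comparison; you just get to skip the guessing step.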
Anyway, that's just an initial, theoretical approach to staying abstract without having to know what any one page is about - that is, without requiring you to read each page and manually create the relationships.
What would also be helpful is to look at case studies of web crawlers and search engines. A tremendous amount has been written about what Google does that makes it so successful compared to the others - I mean the specific tactics and algorithms they use, not just conceptual stuff. Ironically, you can find these by googling Google. :)
hth,
me