Posted by CAH on 04/12/06 12:41
Thanks a lot for all the comments and code, it is great that you take
an interest. I believe the session ID is also a problem with regard to
validation of the site, so there are good reasons to find a solid and
simple solution.
> > > But, if your site depends on session (for non-members and hence
> > > crawler)
> >
> > it does depend on sessions for non-members
>
> I doubt that. Anyway, it's better you check if it depends on
> session again.
Well, there is no login on my site and there are no members. The session ID
is used to transfer information from one page to the next in an
electronic guide that is made with forms.
> Actually you're turning off trans sid (see my link above) and
> thereby you're turning off the session for crawler.
I still do not understand the difference between turning trans sid off
and turning the session ID off.
What is the difference between these two?
php_value session.use_only_cookies
php_value session.use_trans_sid
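As far as the PHP manual describes them, the two directives control different things: session.use_only_cookies decides whether PHP accepts a session ID only from a cookie, while session.use_trans_sid decides whether PHP rewrites links and forms to carry the session ID when no cookie is available. A sketch of the two combinations in .htaccess, assuming PHP runs as an Apache module; the values shown are illustrative and are not taken from this thread:

```
# Variant A: sessions travel in cookies only, no session ID in URLs
php_flag session.use_only_cookies on
php_flag session.use_trans_sid off

# Variant B: session ID may travel in the URL for visitors without cookies
# (uncomment these two lines and remove Variant A to use it)
# php_flag session.use_only_cookies off
# php_flag session.use_trans_sid on
```

With Variant B the session ID can end up in indexed URLs, which is the problem discussed in this thread.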
>But, you said your site
> needs session for crawler.
I am sorry that I have given that impression. I do not need sessions
for the crawler. But I do not like to turn off session IDs in the URL,
because there are still users who do not like to get cookies on their
computers.
> And here goes my untested--to be improved--a
> quick dirty hack:
>
> <?php
> /* Crawler SID removal hack: begin--------*/
> /* Hack code should be placed on the top of every accessible script.
> * or place it in a global common file say header.php or so.
> * Important Assumption: Crawler indexes the final redirected URI */
>
> /**
> * Test if the request is from the crawler
> *
> * @return boolean
> * @todo implement it or google for hundreds of codes
> **/
> function IsCrawler()
> {
> return true;
> }
Does the code above test whether it is a crawler?
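As quoted, the function looks like a placeholder: it always returns true, so every visitor would be treated as a crawler until the @todo is filled in. A minimal sketch of what a real check might look like, assuming the crawler identifies itself in the User-Agent header. The optional $user_agent parameter and the list of signatures are hypothetical additions for illustration, and the list is far from complete:

```php
<?php
/**
 * Hypothetical crawler check: looks for well-known bot names in the
 * User-Agent header. The signature list is an illustrative assumption.
 *
 * @param string|null $user_agent defaults to the current request's header
 * @return boolean
 */
function IsCrawler($user_agent = null)
{
    if ($user_agent === null) {
        // The header may be missing entirely, so fall back to an empty string
        $user_agent = isset($_SERVER['HTTP_USER_AGENT'])
            ? $_SERVER['HTTP_USER_AGENT'] : '';
    }
    // Assumed, incomplete list of substrings used by common crawlers
    $signatures = array('Googlebot', 'Slurp', 'msnbot');
    foreach ($signatures as $signature) {
        // Case-insensitive substring match
        if (stripos($user_agent, $signature) !== false) {
            return true;
        }
    }
    return false;
}
```

Since the User-Agent header can be set to anything by the client, a check like this is only a hint and should not be relied on for anything security-related.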
> if (IsCrawler())
> {
> define('CRAWLER_SID_FILE', 'crawler_sid.txt');
>
> if (isset($_GET[session_name()])) // Is session id found in query string?
> {
> $tmp_get = $_GET;
> unset($tmp_get[session_name()]); //remove session id
> // now rebuild query string...
> $new_get = http_build_query($tmp_get);
> // HTTP_HOST already carries any non-default port, so none is appended
> $prefix = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
> $current_url = $prefix . '://'
> . $_SERVER['HTTP_HOST']
> . $_SERVER['PHP_SELF'];
> // redirect to self, but with SID removed (no trailing '?' if the query is empty)
> header('Location: ' . $current_url . ($new_get !== '' ? '?' . $new_get : ''));
> exit;
> }
> else // SID is not found (page got redirected), so we need to
> // set/load the crawler's session id
> {
> // the crawler's session id is read from CRAWLER_SID_FILE, if the file exists
> if (($crawler_sid=@file_get_contents(CRAWLER_SID_FILE))!==false)
> session_id($crawler_sid);
> }
> }
> /*----break: Crawler SID removal hack*/
> // normal code...
> session_start();
> /* Crawler SID removal hack: continue ----*/
> if (IsCrawler())
> {
> // store the crawler's session id in CRAWLER_SID_FILE
> file_put_contents(CRAWLER_SID_FILE, session_id());
> }
> /*----end: Crawler SID removal hack*/
>
> // now, again normal code...
> // ...
> // testing a link...
> echo '<p><a href="' . $_SERVER['PHP_SELF'] . '?a=100&b=50&c=5">Test link</a></p>';
> $_SESSION['test'] = !isset($_SESSION['test']) ? 0 :
> ($_SESSION['test']+1);
> echo '<p>'.$_SESSION['test'].'</p>';
> ?>
>
>
> --
> <?php echo 'Just another PHP saint'; ?>
> Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/
I am going to have to read this a few times to get it; I am in the
beginner class. But I will look more closely into it.
But how about the robots.txt solution? It seems simple and I think it
works.
User-agent: Googlebot
Disallow: /*PHPSESSID
But I cannot say for sure; I have to wait and see if Google removes
the old listings.
Once again, thanks for the help and code.
Best regards
Mads Larsen