Posted by R. Rajesh Jeba Anbiah on 04/11/06 22:44
CAH wrote:
> R. Rajesh Jeba Anbiah skrev:
<snip>
> > It doesn't seem to strip the session id as I thought. If your site
> > content doesn't rely on sessions (for non-members), you may safely turn
> > off trans sid <news:1111603962.594721.154710@l41g2000cwc.googlegroups.com> (
> > http://groups.google.com/group/comp.lang.php/msg/ce24f27f2b7ac610 )
> > --you can even turn it off selectively, only for the crawler, by sniffing
> > the user agent string and/or IP.
> >
> > But, if your site depends on session (for non-members and hence
> > crawler)
>
> it does depend on sessions for non-members
I doubt that. Anyway, you'd better check again whether it really
depends on sessions.
> > and you'd like to enable sessions for the crawler, but don't want
> > the trans sid, you need to go for some other hack. If that is your
> > situation, I may help you with the hack.
<snip>
> I found this at another site
>
> if(strpos($_SERVER['HTTP_USER_AGENT'],"google")!==false or
> strpos($_SERVER['HTTP_USER_AGENT'],"MSIECrawler")!==false)
> {
> ini_set("url_rewriter.tags","");
> }
>
> http://www.mtdev.com/2002/06/why-you-should-disable-phps-session-use_trans_sid/
Actually, you're turning off trans sid (see my link above) and thereby
turning off the session for the crawler. But you said your site needs
sessions for the crawler. So here is my untested, quick and dirty hack
(to be improved):
<?php
/* Crawler SID removal hack: begin--------*/
/* Hack code should be placed on the top of every accessible script.
* or place it in a global common file say header.php or so.
* Important Assumption: Crawler indexes the final redirected URI */
/**
* Test if the request is from the crawler
*
* @return boolean
* @todo implement it or google for hundreds of codes
**/
function IsCrawler()
{
return true;
}
if (IsCrawler())
{
define('CRAWLER_SID_FILE', 'crawler_sid.txt');
if (isset($_GET[session_name()])) // Is the session id found in the query string?
{
$tmp_get = $_GET;
unset($tmp_get[session_name()]); //remove session id
// now rebuild query string...
$new_get = http_build_query($tmp_get);
// The Host header (HTTP_HOST) normally already carries a non-default port,
// so the port is not appended again here.
$prefix = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
$current_url = $prefix . '://'
. $_SERVER['HTTP_HOST']
. $_SERVER['PHP_SELF'];
// redirect to self, but with SID removed (no trailing '?' if nothing else is left)
header('Location: ' . $current_url . ($new_get !== '' ? '?' . $new_get : ''));
exit;
}
else // SID is not found (page got redirected), so we need to set/load the crawler's session id
{
// the crawler's session id is read back from CRAWLER_SID_FILE
if (($crawler_sid=@file_get_contents(CRAWLER_SID_FILE))!==false)
session_id($crawler_sid);
}
}
/*----break: Crawler SID removal hack*/
// normal code...
session_start();
/* Crawler SID removal hack: continue ----*/
if (IsCrawler())
{
// store the crawler's session id in CRAWLER_SID_FILE
file_put_contents(CRAWLER_SID_FILE, session_id());
}
/*----end: Crawler SID removal hack*/
// now, again normal code...
// ...
// testing a link...
echo '<p><a href="' . $_SERVER['PHP_SELF'] . '?a=100&b=50&c=5">Test link</a></p>';
$_SESSION['test'] = !isset($_SESSION['test']) ? 0 :
($_SESSION['test']+1);
echo '<p>'.$_SESSION['test'].'</p>';
?>
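The IsCrawler() stub above is left as a @todo. Here is a minimal sketch of one way to fill it in, assuming plain user-agent sniffing is good enough for your case. The helper name isCrawlerUserAgent() and the signature list are only illustrative (they are not from any library, and the list is far from complete); user agents can also be spoofed, so treat this as a hint rather than a reliable check.

```php
<?php
/**
 * Hypothetical helper: guess whether a User-Agent string belongs to a
 * crawler by case-insensitive substring matching.
 *
 * @param string $user_agent  the raw User-Agent header value
 * @param array  $signatures  illustrative, incomplete list of substrings
 * @return boolean
 */
function isCrawlerUserAgent($user_agent, array $signatures = array('googlebot', 'msiecrawler', 'slurp', 'bingbot'))
{
    foreach ($signatures as $signature) {
        // stripos() ignores case, so "Googlebot" and "googlebot" both match
        if (stripos($user_agent, $signature) !== false) {
            return true;
        }
    }
    return false;
}
```

Inside IsCrawler() you could then return isCrawlerUserAgent(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : ''); so that a missing header simply counts as "not a crawler".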
--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/