Posted by Chuck Renner on 10/28/06 15:25
Thanks Rik for pointing out that the HTTP headers on that redirected
page were setting and using cookies and for pointing me in the right
direction with cURL.
I was able to get a correctly working solution to my HTML downloading
problem in less than an hour, using cURL with PHP.
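As a rough illustration of the cookie issue (a sketch, not something taken from Rik's reply): as far as I know, PHP's plain HTTP wrapper follows the 302 but keeps no cookie jar, so a cookie set by the redirecting page is not sent with the follow-up request. The hypothetical helper below, list_set_cookies(), pulls the Set-Cookie values out of a list of raw header lines, such as the numerically indexed array that get_headers() returns. The sample header lines and the cookie name are made up for illustration and were not captured from smugmug.com.

```php
<?php
// Hypothetical helper: list the "name=value" part of every Set-Cookie
// header found in a flat list of raw header lines.
function list_set_cookies(array $headerLines) {
    $cookies = array();
    foreach ($headerLines as $line) {
        // header names are case-insensitive, so match without regard to case
        if (stripos($line, 'Set-Cookie:') === 0) {
            $pair = trim(substr($line, strlen('Set-Cookie:')));
            // drop attributes such as "path=/" that follow the first ";"
            $semi = strpos($pair, ';');
            if ($semi !== false) {
                $pair = substr($pair, 0, $semi);
            }
            $cookies[] = $pair;
        }
    }
    return $cookies;
}

// Made-up header lines for a redirect followed by the final page.
$sample = array(
    'HTTP/1.1 302 Found',
    'Set-Cookie: Template=7; path=/',
    'Location: /gallery/1960121',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
);
print_r(list_set_cookies($sample));
?>
```

Passing the result of get_headers() for the real URI to the same helper should show whether the redirecting page sets a cookie before sending the 302.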
With the function I have below, I just call tempnam() to give me a
temporary filename, call my function with the URI and the result from
tempnam(), and then read the file with file_get_contents(). I can then
delete the file with unlink().
Here is the function I wrote to download a uri into a file (following
all redirects, ignoring old cookies, and passing set cookies to redirects):
<?php
function uri_download($uri, $fileName) {
    // use cURL to download uri
    // make a curl resource, setting the uri as its target to open
    $curl = curl_init($uri);
    if ($curl === false) {
        return false;
    }
    // make a file resource and create/empty the file for writing
    $hFile = fopen($fileName, "w+");
    if ($hFile === false) {
        curl_close($curl);
        return false;
    }
    // set curl options
    // set the file resource that curl will write to
    curl_setopt($curl, CURLOPT_FILE, $hFile);
    // do not let curl output the HTTP headers
    curl_setopt($curl, CURLOPT_HEADER, false);
    // let curl follow redirects
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    // enable curl's in-memory cookie handling so that cookies set by one
    // response are sent with the redirects that follow; the empty name
    // loads no old cookies (the original "/tmp" jar path was a directory,
    // which curl cannot write a cookie file to)
    curl_setopt($curl, CURLOPT_COOKIEFILE, "");
    // tell curl to mark this as a new cookie session
    curl_setopt($curl, CURLOPT_COOKIESESSION, true);
    // execute curl (download the uri to the temp file)
    $ok = curl_exec($curl);
    // close the curl resource
    curl_close($curl);
    // close the temp file and file resource
    fclose($hFile);
    // report whether the download succeeded
    return $ok !== false;
}
?>
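The tempnam() / file_get_contents() / unlink() steps described above could be wrapped up roughly like this. This is only a sketch: the name uri_get_contents() and the optional $download parameter are my own assumptions, not part of the function above. $download is any callable taking ($uri, $fileName), and it defaults to uri_download().

```php
<?php
// Hypothetical wrapper: download a URI into a temporary file, read the file
// back into a string, and remove the file again.
function uri_get_contents($uri, $download = 'uri_download') {
    // ask for a unique temporary filename in the system temp directory
    $tmp = tempnam(sys_get_temp_dir(), 'uri');
    if ($tmp === false) {
        return false;
    }
    // treat an explicit false from the downloader as a failed download
    if (call_user_func($download, $uri, $tmp) === false) {
        unlink($tmp);
        return false;
    }
    // read the downloaded page, then delete the temporary file
    $html = file_get_contents($tmp);
    unlink($tmp);
    return $html;
}
?>
```

With uri_download() defined, a call such as $html = uri_get_contents($uri); would then give the page as a string, or false if something went wrong.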
Chuck Renner wrote:
> Please help!
>
> This MIGHT even be a bug in PHP!
>
> I'll provide version numbers and site specific information (browser, OS,
> and kernel versions) if others cannot reproduce this problem.
>
> I'm running into some PHP behavior that I do not understand in PHP 5.1.2.
>
> I need to parse the HTML from the following carefully constructed URI:
> http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121
>
> The problem is that when PHP downloads the HTML using file_get_contents,
> or any other method of opening a remote file in PHP that I have tried,
> it gives me the wrong page!
>
> This URI is supposed to yield the HTML from the page at
> http://crenner.smugmug.com/gallery/1960121 , but with the "allthumbs"
> version of the page, selectable from the dropdown box at the top of the
> page.
>
> The correct page is downloaded in IE, SeaMonkey, and in wget!
>
> But when downloading in PHP, I get the HTML from the page at
> http://crenner.smugmug.com/gallery/1960121 , but with the "smugmug
> small" version of the page, selectable from the dropdown box at the top
> of the page.
>
> Please note that the templatechange.mg page is merely a server-side
> script that takes the arguments passed to it (TemplateID and origin),
> and redirects the browser to the correct version of the page at
> "origin", based on the "TemplateID".
>
> Here is how to reproduce the problem:
> * Download the page with wget so that you have a copy of the correct
> results:
>
> --commandline start here--
> wget "http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121" -O correct.html
> --commandline end here--
>
> * Download the same page with PHP 5.1.2:
>
> --file incorrect.php start here--
> <?php
> print(file_get_contents("http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121"));
> ?>
> --file incorrect.php end here--
>
> --commandline start here--
> php incorrect.php > incorrect.html
> --commandline end here--
>
> * You should now have two very different HTML files (correct.html and
> incorrect.html), even though both were downloaded using the same URI!
>
> * Open correct.html in a web browser. You will see a thumbnails
> ("allthumbs") only version of a smugmug.com picture gallery.
>
> * Open incorrect.html in a web browser. You will see a paginated
> version of the same smugmug.com picture gallery ("smugmug small"), with
> a larger image on the right.
>
> I know that I could make a workaround by having my PHP scripts call wget
> instead of using intrinsic functions to download the HTML. This is not
> practical for me for a number of reasons, including code portability and
> streamlining.
>
> Can anyone help me with this? I know that the templatechange.mg uses a
> 302 to redirect the browser, based on the output I get from wget. I
> also know that the redirect is happening in PHP (even if it is happening
> incorrectly), because I'm not getting the contents of the
> templatechange.mg file, but a different version of the gallery itself.
>
> This is driving me crazy. I can find no logical reason why PHP would
> yield different results for the same URI than I get in 3 other browsers
> (SeaMonkey, IE, and wget).
>
> I have also attached the results pages and the php script (correct.html,
> incorrect.html, and incorrect.php) in php_download_strangeness.tar.bz2
> (a bzip2 compressed tar archive)
>
> - Chuck Renner
>
>