Posted by John Nichel on 08/15/73 11:36
Jay Paulson (CE CEN) wrote:
> Jay Paulson (CE CEN) wrote:
>>Hello everyone! I've been given the responsibility of coding an Apache access_log parser. My task is to return the number of hits for certain file extensions on certain dates from specific IP addresses.
>>As of now I'm only going back 7 days in the log, and I'm only looking for 5 file types (.doc, .pdf, .html, .php, and .flv). I'm using the fgets() function so I can read the file line by line, do the matches I need, and increment the counters. Right now I have 3 loops looking for everything, which doesn't seem like the best way of doing this. I've also found that a line may contain the file extension I want only because that file is the source (referer) of a request for another file. (See below for an example.)
>>Log file example:
>>I want the first line but not the second. The second line is a request for a .css file that was used by the .html file, so I don't want it. I do want the first line, where .html is the only file mentioned.
>>10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] "GET /home.html HTTP/1.1" 200 8220 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
>>10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] "GET /styles/redesign.css HTTP/1.1" 200 2381 "http://wfmu.wfm.pvt/home.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
>>At any rate, here's some of my pseudocode/code for what I'm trying to accomplish. I know there has to be a better way to do this, and I'm looking for suggestions!
> Save yourself a ton of work. Dump the raw logs into a db, and you can
> do all the queries on the db. Something like this...
> CREATE TABLE `rawLogs` (
> `ipAddress` int(15) NOT NULL default '0',
> `rfcIdentity` varchar(32) NOT NULL default '',
> `apacheUser` varchar(32) NOT NULL default '',
> `date` int(15) NOT NULL default '0',
> `request` longtext NOT NULL,
> `statusCode` varchar(32) NOT NULL default '',
> `sizeBytes` int(11) NOT NULL default '0',
> `referer` longtext NOT NULL,
> `userAgent` longtext NOT NULL,
> KEY `ipAddress` (`ipAddress`),
> FULLTEXT KEY `search` (`request`,`referer`,`userAgent`)
> ) ENGINE=MyISAM;
> A few questions with this train of thought. I can see the advantages of putting the raw log file into a database but I would still need to parse the file and get the information out of it for each column.
Correct, but putting it into a db, you only have to parse the file once
instead of every time you want to sort your data.
> I'm also not quite sure what some of your fields are for. 'rfcIdentity'? What is that? And why would I need an 'apacheUser'?
In the output example of your logs, it looks as if you're using the
Apache log format that contains this data (the two dashes after the IP).
Most of the time, that's what they will be; dashes, no data. Look here:
> Anyway, not too sure how I would get this information in an easy way for the massive amounts of inserts I would have to do on a 10 meg log file.
Script it. Just like you're parsing each line right now, but split the
line on the tab (I assume that's your separator), and you'll have an
array of the values in that line. Use that array to insert your values.
I do this with daily logs on our sites (some of the files are over
100 MB). I also convert the IP and date into integers for easier
searching before inserting them into the db. YMMV.
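
For what it's worth, here is a rough sketch of that parse step. It is written in Python rather than PHP purely for illustration, and it swaps the tab split for a regular expression, because the sample lines quoted above are space-separated with quoted fields. The names LINE_RE and parse_line are made up for this example; the dict keys just mirror the rawLogs columns. It assumes IPv4 addresses and no escaped quotes inside the quoted fields.

```python
import ipaddress
import re
from datetime import datetime

# Assumed layout (Apache "combined" format, as in the sample lines):
# ip identity user [date] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)


def parse_line(line):
    """Return one log line as a dict keyed by the rawLogs columns, or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # line does not look like the assumed format
    ip, identity, user, date, request, status, size, referer, agent = m.groups()
    when = datetime.strptime(date, "%d/%b/%Y:%H:%M:%S %z")
    return {
        "ipAddress": int(ipaddress.IPv4Address(ip)),  # IP as an integer
        "rfcIdentity": identity,
        "apacheUser": user,
        "date": int(when.timestamp()),  # date as a Unix timestamp
        "request": request,
        "statusCode": status,
        "sizeBytes": 0 if size == "-" else int(size),
        "referer": referer,
        "userAgent": agent,
    }


sample = (
    '10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] "GET /home.html HTTP/1.1" '
    '200 8220 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
row = parse_line(sample)
print(row)
```

Each dict can then be fed straight into a parameterized INSERT, one per line read from the log.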
Once you have them in the db, it's easy to run your queries on that
table (or break the data up into other tables for different search
criteria). On our system, I dump the raw log table every month (because
it's already been broken down to other tables and better normalized), as
trying to put two months of data into it would put it beyond the 4gb
limit on our system.
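
As a rough illustration of the kind of query this makes possible, the sketch below counts hits per extension for one IP over a 7-day window. It is Python with an in-memory SQLite table standing in for the MySQL table, cut down to four columns; requested_extension and hits_by_extension are hypothetical helper names, and the two rows are the sample lines from the original message. The extension is taken from the requested path only, so a .css request whose referer happens to be home.html is not counted as an .html hit.

```python
import ipaddress
import os
import sqlite3
from collections import Counter
from urllib.parse import urlsplit

WANTED = {".doc", ".pdf", ".html", ".php", ".flv"}


def requested_extension(request):
    """Lower-cased extension of the requested path; the referer is ignored."""
    parts = request.split()
    if len(parts) < 2:
        return ""
    path = urlsplit(parts[1]).path  # drops any query string
    return os.path.splitext(path)[1].lower()


def hits_by_extension(conn, ip, start, end):
    """Count wanted extensions for one IP with start <= date < end."""
    counts = Counter()
    rows = conn.execute(
        "SELECT request FROM rawLogs "
        "WHERE ipAddress = ? AND date >= ? AND date < ?",
        (ip, start, end),
    )
    for (request,) in rows:
        ext = requested_extension(request)
        if ext in WANTED:
            counts[ext] += 1
    return dict(counts)


# SQLite stand-in for the MySQL table, reduced to the columns used here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rawLogs (ipAddress INTEGER, date INTEGER, request TEXT, referer TEXT)"
)
ip = int(ipaddress.IPv4Address("10.25.40.64"))
stamp = 1136122398  # Unix time for 01/Jan/2006:07:33:18 -0600
conn.executemany(
    "INSERT INTO rawLogs VALUES (?, ?, ?, ?)",
    [
        (ip, stamp, "GET /home.html HTTP/1.1", "-"),
        (ip, stamp, "GET /styles/redesign.css HTTP/1.1",
         "http://wfmu.wfm.pvt/home.html"),
    ],
)

week = 7 * 24 * 60 * 60
result = hits_by_extension(conn, ip, stamp - week, stamp + 1)
print(result)
```

If the extension were stored in its own column at insert time, the whole count could be done with a single GROUP BY instead of a loop.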
If this is just a one time thing you're looking to do, all of this may
be over the top. However, if the bosses are going to want to review
this data month in and month out, I think the time spent doing something
like this will be worth it.
John C. Nichel IV
Programmer/System Admin (ÜberGeek)
Dot Com Holdings of Buffalo