
CSC 121: Computer Science I, Spring 2005

Project 5

Mining Data from Web Logs
Deadline: 5:00pm, Thursday, May 12

Description

After many years of working with computers, Professor X, a computer scientist at a small midwestern college, recently went off the deep end and decided to give it all up and go on the road playing banjo in a bluegrass band. Expecting little income from his new venture, he has an unusual and innovative idea for a business on the Web, a new dot-com that will essentially run itself, allowing him ample time for his new pursuit.

He is keeping his idea a secret, but figures that if he can easily monitor web activity, he will be able to make minor changes to the web site to keep the business going. Web activity data are stored in several places, one of them being the web log. He has a list of reports that he would like to generate and, because of your experience analyzing web logs, thinks you are the person to write the software application that will allow him to do so.

Since it has been a while since you actually worked with a web log, here is a sample of the log entries you will be examining:

207.46.98.33 - - [01/Nov/2004:04:28:32 -0500] "GET /~runner/csc121/resources.htm HTTP/1.0" 200 3973

Notice that the items above are separated from one another by spaces, and that some of the items themselves contain spaces. If you read a line from the log file, one way to split it into pieces is to use the split method, as in line.split(" "). This gives you an array of 10 Strings: element 0 is "207.46.98.33", elements 1 and 2 are both "-", element 3 is "[01/Nov/2004:04:28:32", and so on.
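To make the split concrete, here is a minimal sketch that splits the sample entry above and prints each element with its index. The class and method names (LogSplitDemo, fields) are made up for illustration and are not part of the project.

```java
class LogSplitDemo {
    // Splits one log record on single spaces, as described in the text.
    static String[] fields(String line) {
        return line.split(" ");
    }

    public static void main(String[] args) {
        // The sample entry from this page.
        String line = "207.46.98.33 - - [01/Nov/2004:04:28:32 -0500] "
                + "\"GET /~runner/csc121/resources.htm HTTP/1.0\" 200 3973";
        String[] parts = fields(line);
        for (int i = 0; i < parts.length; i++) {
            System.out.println(i + ": " + parts[i]);
        }
    }
}
```

In this sample, element 0 holds the client address and element 6 holds the requested page, which are the pieces most of the statistics depend on.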

Items to do and point distribution:

Professor X would like the statistics named in items 4 through 8 of the numbered list below, computed for the time period spanned by the log, included in a report written to a text file.

The format of the report is up to you, but it should have a title and each of the requested statistics should be appropriately labeled and easily readable.

Copy the project I:\CSC121\public\webanalyzer to your own folder. The project has several classes, including one called Analyzer. This is an empty class which you will need to write from scratch for this assignment. Analyzer will have three main methods: input(Infile inf), process(), and output(Outfile outf). Each of these is described below. You are likely to define "helper" methods in addition to these three. A driver class will be calling your methods, so they need to have the prescribed interface and functionality.

The project folder also contains a web log, access_logcps.txt. This log has nine days' worth of data, November 1 - November 9, 2004.

public boolean input(Infile inf) - This method will use the read method of the Infile class (see that class interface) to add web log records to an ArrayList. The ArrayList will serve as a basis for analysis to be done by the process method. Each line of the input is a String and contains a full log record as described above. input() should return true or false according to whether an ArrayList of length greater than zero has been successfully constructed. After reading the last line from the input, you should close() the input file.
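The control flow described for input() might look roughly like the sketch below. The Infile interface is not shown on this page, so the sketch uses a hypothetical stand-in, LineSource, and assumes that read() returns the next line or null at the end of the input; check the real class interface before borrowing any of this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Infile; the real class interface may differ.
interface LineSource {
    String read();   // assumed: returns the next line, or null at end of input
    void close();
}

class InputSketch {
    private final ArrayList<String> records = new ArrayList<String>();

    // Mirrors the described contract: fill the list, close the source,
    // and report whether at least one record was read.
    boolean input(LineSource inf) {
        String line = inf.read();
        while (line != null) {
            records.add(line);
            line = inf.read();
        }
        inf.close();
        return records.size() > 0;
    }

    int recordCount() {
        return records.size();
    }

    // In-memory source so the sketch runs without the course's Infile class.
    static LineSource fromLines(List<String> lines) {
        final Iterator<String> it = lines.iterator();
        return new LineSource() {
            public String read() { return it.hasNext() ? it.next() : null; }
            public void close() { }
        };
    }

    public static void main(String[] args) {
        InputSketch a = new InputSketch();
        boolean ok = a.input(fromLines(Arrays.asList("record one", "record two")));
        System.out.println(ok + " " + a.recordCount());
    }
}
```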

public void process() - This method will construct, populate, and prepare the structures required for analysis. For example, to compute the peak day and peak hour statistics, you will likely construct a 2-D array of visit frequency counts, day x hour (9 x 24). You should extract the day and hour from the time field of the web log entry; the parseInt method of the Integer class will convert each String to an int. The syntax for using the method is:

int myInt = Integer.parseInt( <The String to convert> );
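As one possible reading of the day-by-hour idea, the sketch below pulls the day and hour out of the time field with Integer.parseInt and tallies requests into a 9 x 24 array. The names DayHourTally, dayAndHour, tally, and peakHour are invented for illustration, and the indexing assumes the days in the log are numbered 1 through 9, as in the November 1 - November 9 log supplied with the project.

```java
class DayHourTally {
    static final int DAYS = 9;    // the supplied log covers nine days
    static final int HOURS = 24;

    // Returns {day, hour} parsed from a time field such as "[01/Nov/2004:04:28:32".
    static int[] dayAndHour(String timeField) {
        // Drop the leading '[' and break the rest on '/' and ':'.
        String[] t = timeField.substring(1).split("[/:]");
        int day = Integer.parseInt(t[0]);   // "01" becomes 1
        int hour = Integer.parseInt(t[3]);  // "04" becomes 4
        return new int[] { day, hour };
    }

    // counts[d][h] = number of requests on day d + 1 during hour h.
    static int[][] tally(String[] logLines) {
        int[][] counts = new int[DAYS][HOURS];
        for (String line : logLines) {
            String timeField = line.split(" ")[3];
            int[] dh = dayAndHour(timeField);
            counts[dh[0] - 1][dh[1]]++;  // assumes day numbers 1 through 9
        }
        return counts;
    }

    // Index of the largest entry in one day's row; the earliest hour wins ties.
    static int peakHour(int[] hourCounts) {
        int best = 0;
        for (int h = 1; h < hourCounts.length; h++) {
            if (hourCounts[h] > hourCounts[best]) {
                best = h;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String sample = "207.46.98.33 - - [01/Nov/2004:04:28:32 -0500] "
                + "\"GET /~runner/csc121/resources.htm HTTP/1.0\" 200 3973";
        int[][] counts = tally(new String[] { sample });
        System.out.println("Day 1 peak hour: " + peakHour(counts[0]));
    }
}
```

Summing each row of the same array gives the per-day totals needed for the peak day, so one structure can serve both statistics.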
       

public void output(Outfile outf) - This method will use the write method of the Outfile class (see that class interface) to create a text file containing the results of the analysis. You should use the structures created by the process method to produce your report. You might want to produce the report in a terminal window before writing it to a text file. You should close the output file after writing the last line of output.
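One way to preview the report in a terminal window first is to build it as a list of lines and decide separately where the lines go. The sketch below is only an illustration: ReportSketch and buildReport are invented names, only two of the requested statistics are shown, and the values passed in main are placeholders rather than results from the real log. The Outfile write call is not shown because its signature does not appear on this page.

```java
import java.util.ArrayList;
import java.util.List;

class ReportSketch {
    // Builds titled, labeled report lines from already-computed statistics.
    static List<String> buildReport(int distinctVisitors, String topPage, int topPageCount) {
        List<String> lines = new ArrayList<String>();
        lines.add("Web Log Analysis Report");
        lines.add("-----------------------");
        lines.add("Distinct visitors:   " + distinctVisitors);
        lines.add("Most requested page: " + topPage + " (" + topPageCount + " requests)");
        return lines;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only, not results from the real log.
        for (String line : buildReport(3, "/index.htm", 7)) {
            // Terminal preview; replace with the Outfile write call once the layout looks right.
            System.out.println(line);
        }
    }
}
```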

  1. (15 points) Correct implementation of the input() method.
  2. (15 points) Correct implementation of the output() method.
  3. (15 points) Correct implementation of the process() method and the associated structures.
  4. (10 points) The correct number of distinct visitors (client IP addresses) to the web site.
  5. (10 points) The correct visitor who made the largest number of requests and the visit count for this visitor.
  6. (10 points) The correct most requested page and its request count.
  7. (10 points) The correct peak hour report and visit count for each of the nine days in the log.
  8. (10 points) The correct peak day and visit count.
  9. (5 points) Programming style, formatting (readability), and comments, especially describing methods.
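
For the visitor and page statistics in items 4 through 6, one common approach is a map from each key to its request count. The sketch below is an assumption about how that could be done, not a required design; VisitorCounts, countByVisitor, and busiest are invented names, and the log lines in main are made-up records using reserved documentation addresses.

```java
import java.util.HashMap;
import java.util.Map;

class VisitorCounts {
    // Maps each client address (element 0 of a split record) to its request count.
    static Map<String, Integer> countByVisitor(String[] logLines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : logLines) {
            String ip = line.split(" ")[0];
            Integer seen = counts.get(ip);
            counts.put(ip, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    // Returns the key with the largest count, or null for an empty map.
    // If several keys tie for the largest count, any one of them may be returned.
    static String busiest(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up records for illustration; they do not come from the real log.
        String[] lines = {
            "192.0.2.1 - - [01/Nov/2004:04:28:32 -0500] \"GET /a.htm HTTP/1.0\" 200 100",
            "192.0.2.2 - - [01/Nov/2004:05:10:00 -0500] \"GET /b.htm HTTP/1.0\" 200 100",
            "192.0.2.1 - - [02/Nov/2004:09:00:00 -0500] \"GET /b.htm HTTP/1.0\" 200 100"
        };
        Map<String, Integer> counts = countByVisitor(lines);
        System.out.println("Distinct visitors: " + counts.size());
        System.out.println("Busiest visitor: " + busiest(counts));
    }
}
```

The size of the map is the number of distinct visitors, and keying the map on the requested page (element 6 of a split record) instead of the address would give the most requested page the same way.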

Turning in your project:

When you are finished, copy your project to the appropriate folder under I:\CSC121\Project5 and send an email message to your instructor naming the members (1 or 2) of your team.

DePauw University, Computer Science Department, Spring 2005
Maintained by Brian Howard (bhoward@depauw.edu).