
Homework 9 - crawling the web

In our last homework, we return to the application layer. You will write a simple web-crawling application that can be used to inventory a website and to check the validity of the links on that site.

Given the following command on the command line:

./hw9 http://www.thesite.com/thepath

your program will do the following:

  • for each "a href" link that leads to a document with the same path prefix as the URL given on the command line, follow the link and repeat this process for that document.
  • for each "a href" and "img src" URL, visit the URL. If the server responds with an error status code (in the 400 or 500 range), report the error.
  • also report the following statistics:
    • documents visited
    • links checked (including visited documents)
    • broken links found
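
The crawl rules above can be sketched as three small helpers. This is a minimal sketch under stated assumptions, not a reference solution: the names is_error_status, has_path_prefix and mark_visited are hypothetical, "same path prefix" is read as a plain string-prefix comparison against the start URL, and the fixed-size visited table is an illustrative choice. The visited table is there because pages commonly link back to each other, so repeating the process without it could loop forever.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_URLS 4096   /* assumed upper bound, chosen for illustration */

static char *visited[MAX_URLS];
static size_t visited_count;

/* A status code in the 400 or 500 range counts as a broken link. */
bool is_error_status(long status)
{
    return status >= 400 && status <= 599;
}

/* True when url begins with the start URL from the command line.
 * Assumption: a plain string comparison, so ".../thepath" also
 * matches ".../thepathology". */
bool has_path_prefix(const char *url, const char *start_url)
{
    return strncmp(url, start_url, strlen(start_url)) == 0;
}

/* Returns true the first time a URL is seen. Returns false for a
 * repeat, when the table is full, or when memory runs out. */
bool mark_visited(const char *url)
{
    for (size_t i = 0; i < visited_count; i++)
        if (strcmp(visited[i], url) == 0)
            return false;
    if (visited_count == MAX_URLS)
        return false;

    char *copy = malloc(strlen(url) + 1);
    if (copy == NULL)
        return false;
    strcpy(copy, url);
    visited[visited_count++] = copy;
    return true;
}
```

A crawler built on these would follow a link only when has_path_prefix and mark_visited both return true, and would add a URL to the broken-link report when is_error_status is true for its response code.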

Take care to make your program work with relative as well as absolute links, and to properly ignore any non-HTTP links. The following is an example output. You may report more statistics, as long as the statistics above are easy to identify in the output.

./hw9 http://www.thesite.com/thepath
Visited pages: 14
Links checked: 124
Broken links (1):
http://www.doesnt.exist.com/thepath
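
One way to handle relative, absolute and non-HTTP links is to turn every link into an absolute URL before checking it. The sketch below is a simplified illustration, not a complete URL resolver: the function name resolve_link is hypothetical, the base URL is assumed to be an absolute http or https URL without a query string or fragment, and "." and ".." path segments are left as they are rather than collapsed.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive test that s begins with prefix. */
static bool starts_with_nocase(const char *s, const char *prefix)
{
    for (; *prefix; s++, prefix++)
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return false;
    return true;
}

/* True if the link begins with a scheme such as "mailto:" or "ftp:". */
static bool has_scheme(const char *link)
{
    const char *p = link;
    if (!isalpha((unsigned char)*p))
        return false;
    while (isalnum((unsigned char)*p) || *p == '+' || *p == '-' || *p == '.')
        p++;
    return *p == ':';
}

/* Resolve link against base, the URL of the document that contains it.
 * Writes an absolute URL to out and returns true, or returns false when
 * the link should be ignored (non-HTTP, empty, fragment-only) or does
 * not fit in out. */
bool resolve_link(const char *base, const char *link, char *out, size_t outsz)
{
    int linklen = (int)strcspn(link, "#");   /* drop any #fragment */
    int n;

    if (linklen == 0)
        return false;

    if (has_scheme(link)) {
        if (!starts_with_nocase(link, "http://") &&
            !starts_with_nocase(link, "https://"))
            return false;                    /* e.g. mailto:, ftp: */
        n = snprintf(out, outsz, "%.*s", linklen, link);
    } else {
        const char *host = strstr(base, "://");
        if (host == NULL)
            return false;
        host += 3;
        const char *path = strchr(host, '/'); /* NULL if base has no path */
        const char *sep = "";
        int keep;                             /* bytes of base to keep */

        if (strncmp(link, "//", 2) == 0) {
            keep = (int)(host - base) - 2;    /* scheme and ':' only */
        } else if (link[0] == '/') {
            keep = path ? (int)(path - base) : (int)strlen(base);
        } else if (path == NULL) {
            keep = (int)strlen(base);
            sep = "/";
        } else {
            keep = (int)(strrchr(path, '/') - base) + 1;
        }
        n = snprintf(out, outsz, "%.*s%s%.*s", keep, base, sep, linklen, link);
    }
    return n >= 0 && (size_t)n < outsz;
}
```

For instance, with the base http://www.thesite.com/thepath/index.html, the link pics/a.png would resolve to http://www.thesite.com/thepath/pics/a.png, /other/b.html to http://www.thesite.com/other/b.html, and a mailto: link would be rejected.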

The rules

For this assignment, you may use either C/C++ or shell scripting (bash, awk/gawk, sed, tr, cat, etc.). In addition, you may use curl for fetching documents from the shell, or libcurl for fetching documents from C. In C, you may use regular expressions for parsing documents; see regex.h or 'man regex'.
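
As an illustration of the regex.h route, the following sketch pulls "a href" and "img src" values out of an HTML string with a POSIX extended regular expression. It is a rough approximation rather than a real HTML parser: the function name extract_links, the MAX_URL_LEN limit and the pattern itself are assumptions made for this example, and the pattern only recognizes quoted attribute values.

```c
#include <regex.h>
#include <string.h>

#define MAX_URL_LEN 2048   /* assumed limit, chosen for illustration */

/* Matches <a ... href="URL"> and <img ... src="URL">, with single or
 * double quotes. Subexpression 2 captures the URL. */
static const char *LINK_PATTERN =
    "<(a[[:space:]][^>]*href|img[[:space:]][^>]*src)"
    "[[:space:]]*=[[:space:]]*[\"']([^\"']*)[\"']";

/* Copies up to max_urls link targets from html into urls.
 * Returns the number stored, or -1 if the pattern fails to compile.
 * URLs too long for a slot are skipped. */
int extract_links(const char *html, char urls[][MAX_URL_LEN], int max_urls)
{
    regex_t re;
    regmatch_t m[3];
    int count = 0;

    if (regcomp(&re, LINK_PATTERN, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    const char *p = html;
    while (count < max_urls && regexec(&re, p, 3, m, 0) == 0) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
        if (len < MAX_URL_LEN) {
            memcpy(urls[count], p + m[2].rm_so, len);
            urls[count][len] = '\0';
            count++;
        }
        p += m[0].rm_eo;   /* continue after this match */
    }

    regfree(&re);
    return count;
}
```

The caller would fetch a page, pass its body to extract_links, and then resolve and check each returned string. Unquoted attribute values and links inside HTML comments are cases this pattern gets wrong.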

You may not use wget instead of curl, and if you find a version of curl that does recursive document retrieval, you may not use that either. Languages other than the ones mentioned above are allowed by request only. In general, Python, Perl and Ruby are not allowed, nor are most other general-purpose languages.

Topic revision: r1 - 2009-11-15 - 03:48:29 - Main.jakob
 