Homework 9 - crawling the web

In our last homework, we return to the application layer. You will write a simple web-crawling application that can be used to inventory a website and to check the validity of the links on that site.

Given the following command on the command-line:


your program will do the following:

  • for each "a href" link that leads to a document with the same path prefix, follow the link and repeat this process for that document.
  • for each "a href" and "img src" URL, visit the URL. If the response carries an error status code (in the 400 or 500 range), report the error.
  • also report the following statistics:
    • documents visited
    • links checked (including visited documents)
    • broken links found
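
The link handling in the list above can be sketched in C roughly as follows. This is only one possible approach under stated assumptions, not a required design: the helper names (should_skip, resolve_link, has_prefix) are made up for illustration, example.com is a placeholder host, and the resolver is deliberately simplified (it does not normalise "." or ".." segments and assumes the base URL has no query string or fragment).

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* 1 if s begins with prefix, ignoring ASCII case. */
static int starts_with_nocase(const char *s, const char *prefix) {
    while (*prefix) {
        if (tolower((unsigned char)*s++) != tolower((unsigned char)*prefix++))
            return 0;
    }
    return 1;
}

/* 1 if url begins with http:// or https://. */
int is_http_url(const char *url) {
    return starts_with_nocase(url, "http://") ||
           starts_with_nocase(url, "https://");
}

/* 1 if link begins with some URI scheme, e.g. "mailto:" or "ftp:". */
int has_scheme(const char *link) {
    const char *p = link;
    if (!isalpha((unsigned char)*p))
        return 0;
    while (isalnum((unsigned char)*p) || *p == '+' || *p == '-' || *p == '.')
        p++;
    return *p == ':';
}

/* 1 if the link should be ignored: empty, fragment-only, or non-http scheme. */
int should_skip(const char *link) {
    if (link[0] == '\0' || link[0] == '#')
        return 1;
    return has_scheme(link) && !is_http_url(link);
}

/* Resolve link against base (an absolute http/https URL) into out.
   Returns 0 on success, -1 if base is malformed or out is too small.
   Simplified sketch: no "." / ".." normalisation. */
int resolve_link(const char *base, const char *link, char *out, size_t outsz) {
    const char *sep = strstr(base, "://");
    if (sep == NULL)
        return -1;
    const char *host = sep + 3;             /* first byte of the host name */
    const char *path = strchr(host, '/');   /* first byte of the path, if any */
    int n;

    if (is_http_url(link)) {                /* already absolute */
        n = snprintf(out, outsz, "%s", link);
    } else if (link[0] == '/' && link[1] == '/') {   /* keep only "http:" */
        n = snprintf(out, outsz, "%.*s%s", (int)(sep + 1 - base), base, link);
    } else if (link[0] == '/') {            /* keep scheme and host */
        size_t keep = path ? (size_t)(path - base) : strlen(base);
        n = snprintf(out, outsz, "%.*s%s", (int)keep, base, link);
    } else if (path == NULL) {              /* base has no path at all */
        n = snprintf(out, outsz, "%s/%s", base, link);
    } else {                                /* keep base up to its last '/' */
        const char *slash = strrchr(path, '/');
        n = snprintf(out, outsz, "%.*s%s", (int)(slash + 1 - base), base, link);
    }
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/* 1 if url starts with prefix, i.e. it is a candidate for following. */
int has_prefix(const char *url, const char *prefix) {
    return strncmp(url, prefix, strlen(prefix)) == 0;
}
```

With a base of http://example.com/dir/index.html, a link of pic.png would resolve to http://example.com/dir/pic.png, while mailto: links and bare "#fragment" links would be skipped. What exactly counts as the "path prefix" is an assumption here; the sketch simply compares against whatever prefix string the caller passes in.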

Take care to make your program work on relative as well as absolute links, and to properly ignore any non-http links. The following is an example output. You may report more statistics, as long as the statistics below are easy to identify in the output.

Visited pages:        1
Links checked:       11
Broken links (       1):

Visited pages: 9
Links checked: 22
Broken links (1):

Visited pages:       13
Links checked:       85
Broken links (       4):
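
One way to keep the three counters and print them in a layout like the padded examples above is sketched below. The struct and function names are hypothetical, the 8-character field width is only inferred from the sample output, and the rule that a document counts as "visited" only when it was followed and did not return an error is an assumption rather than part of the assignment.

```c
#include <stdio.h>

struct crawl_stats {
    int visited;   /* documents fetched and scanned for further links */
    int checked;   /* every URL requested, visited documents included */
    int broken;    /* URLs that answered with a 4xx or 5xx status */
};

/* 1 if the HTTP status code is in the 400 or 500 range. */
int is_error_status(long status) {
    return status >= 400 && status <= 599;
}

/* Record the result of one request. followed is nonzero when the URL
   was a same-prefix document that the crawler went on to scan. */
void record_check(struct crawl_stats *s, long status, int followed) {
    s->checked++;
    if (is_error_status(status))
        s->broken++;
    else if (followed)
        s->visited++;   /* assumption: failed fetches are not "visited" */
}

/* Write the summary into buf; returns what snprintf returns.
   The broken URLs themselves would presumably be listed afterwards. */
int format_report(const struct crawl_stats *s, char *buf, size_t n) {
    return snprintf(buf, n,
                    "Visited pages: %8d\n"
                    "Links checked: %8d\n"
                    "Broken links (%8d):\n",
                    s->visited, s->checked, s->broken);
}
```

A caller would invoke record_check once per request with the status code obtained from the fetch, then print the buffer filled in by format_report at the end of the crawl.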

Other good sites to try your stuff on:

The rules

For this assignment, you may use either C/C++ or shell scripting (bash, awk/gawk, sed, tr, cat, etc). In addition, you may use curl for fetching documents from the shell, or libcurl for fetching documents from C. In C, you may use regular expressions for parsing documents; see regex.h or 'man regex'.
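
As an illustration of the regex.h route, here is a minimal sketch that pulls "a href" and "img src" values out of a document that is already in memory. The function name extract_links, the MAX_URL limit, and the pattern itself are assumptions made for this example; the pattern only handles quoted attribute values and will miss or misread unusual markup, so treat it as a starting point rather than a complete HTML parser.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_URL 512

/* Copy up to max_links URLs found in "a href" and "img src" attributes
   into links[]. Returns the number stored, or -1 if the pattern does
   not compile. URLs of MAX_URL bytes or more are silently dropped. */
int extract_links(const char *html, char links[][MAX_URL], int max_links) {
    /* Group 1: tag name plus attribute name; group 2: the quoted value. */
    static const char *pattern =
        "<(a[[:space:]][^>]*href|img[[:space:]][^>]*src)"
        "[[:space:]]*=[[:space:]]*[\"']([^\"'>]*)[\"']";
    regex_t re;
    regmatch_t m[3];
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    const char *cursor = html;
    while (count < max_links && regexec(&re, cursor, 3, m, 0) == 0) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
        if (len < MAX_URL) {
            memcpy(links[count], cursor + m[2].rm_so, len);
            links[count][len] = '\0';
            count++;
        }
        cursor += m[0].rm_eo;   /* resume the search after this match */
    }

    regfree(&re);
    return count;
}
```

The match offsets returned by regexec are relative to the string it was given, which is why the sketch adds them to cursor rather than to the start of the document.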

You may not use wget instead of curl, and if you find a version of curl that does recursive document retrieval, you may not use that either. Languages other than the ones mentioned above are by request only. In general, Python, Perl and Ruby are not allowed, nor are most other general-purpose languages.

