Homework 9 - crawling the web (12 points!) due Mon Nov 30 at 2 pm

In our last homework, we return to the application layer. You will write a simple web-crawling application, which may be used to inventory a website, and to check the validity of the various links in said site.

Given the following command on the command-line:

./hw9 http://www.thesite.com/thepath

your program will do the following:

  • for each "a href" link that leads to a document with the same path prefix, follow the link and repeat this process for that document.
  • for each "a href" and "img src" URL, visit the URL. If the URL returns an error code (400 or 500 range), report this error
  • also report the following statistics:
    • documents visited
    • links checked (including visited documents)
    • broken links found

Take care to make your program work on relative as well as absolute links, and to properly ignore any non-http links. The following is example output. You may report more statistics, as long as the statistics below are easy to identify in the output.
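As a rough illustration of the two decisions described above (skip non-http links, and only follow links under the start URL), here is a minimal shell sketch. The function names `is_http_url` and `in_scope` are made up for this example. It assumes the URL has already been made absolute, it accepts https as well as http because the example output below lists https URLs among those checked, and it reads "same path prefix" as a plain string-prefix match on the start URL, which is only one possible interpretation.

```shell
#!/bin/sh
# Hypothetical helpers: the names are illustrative, not part of the assignment.

# Succeeds for http and https URLs; mailto:, ftp:, javascript: and so on fail.
is_http_url() {
  case "$1" in
    http://*|https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if absolute URL $2 equals, or lies below, start URL $1.
# Assumption: "same path prefix" is read as a plain string-prefix match.
in_scope() {
  case "$2" in
    "${1%/}"|"${1%/}"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With the start URL http://www.cs.uic.edu/~llyons, a link to http://www.cs.uic.edu/~llyons/resume.htm would be followed under this reading, while http://www-personal.umich.edu/~ltoth/ would only be checked.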

./hw9 http://amita.cs.uic.edu/
Visited pages:        1                                                                     
URLs checked:       11
Broken URLs (       1):
http://amita.cs.uic.edu/www.openwrt.org linked from:
  http://amita.cs.uic.edu


./hw9 http://www.cs.uic.edu/~llyons
Visited pages:       10                                                                     
URLs checked:       34
Broken URLs (       2):
http://www-personal.umich.edu/~ltoth/ linked from:
  http://www.cs.uic.edu/~llyons/resume.htm
http://www.cs.uic.edu/~llyons/www.mcmaster.com linked from:
  http://www.cs.uic.edu/~llyons/skills.html

./hw9  http://rites.uic.edu
Visited pages:       15                                                    
URLs checked:       90
Broken URLs (       5):
http://www.ssn.uillinois.edu/html/ssn_forms_docs.html linked from:
  http://rites.uic.edu/iaPolicies.html
http://www.uic.edu/depts/las linked from:
  http://rites.uic.edu
  http://rites.uic.edu/courses.html
  http://rites.uic.edu/faculty.html
  http://rites.uic.edu/graduates.html
  http://rites.uic.edu/index.html
  http://rites.uic.edu/industry.html
  http://rites.uic.edu/labs.html
  http://rites.uic.edu/misc.html
  http://rites.uic.edu/projects.html
  http://rites.uic.edu/students.html
http://www.uillinois.edu/about/policies.html linked from:
  http://rites.uic.edu/iaPolicies.html
http://www.vpaa.uillinois.edu/policies/internet.asp?bhcp=1 linked from:
  http://rites.uic.edu/iaPolicies.html
https://login.techweb.com/cas/login?service=http%3A//www.darkreading.com/showArticle.jhtml%3Bjsessionid%3DQW4ZLPYPDUB11QE1GHRSKH4ATMY32JVN&gateway=true linked from:
  http://www.darkreading.com/showArticle.jhtml?doc_id=102209&WT.svl=news1_1
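
The report layout in the examples above could be produced along the following lines. This is only a sketch: `print_report` is a hypothetical function, the input file of tab-separated "broken URL, referring page" pairs is an invented intermediate format, and the `%8d` field width is inferred from the sample output rather than specified anywhere.

```shell
#!/bin/sh
# Hypothetical reporting helper; the input format is an assumption.
# $1 = pages visited, $2 = URLs checked,
# $3 = file with one "broken_url<TAB>referring_page" pair per line.
print_report() {
  printf 'Visited pages: %8d\n' "$1"
  printf 'URLs checked: %8d\n' "$2"
  # sort -u drops duplicate pairs and groups referrers under each broken URL.
  sort -u "$3" | awk -F '\t' '
    {
      if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
      refs[$1] = refs[$1] "  " $2 "\n"
    }
    END {
      printf "Broken URLs (%8d):\n", n
      for (i = 1; i <= n; i++)
        printf "%s linked from:\n%s", order[i], refs[order[i]]
    }'
}
```

Collecting the pairs in a file while crawling and formatting them once at the end keeps the crawl loop itself free of output logic.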


Other good sites to try your stuff on: http://logos.cs.uic.edu/reed/

http://www.uic.edu

We'll be using this one as one of the evaluation examples: http://www1.cs.uic.edu

The rules

For this assignment, you may use either C/C++ or shell scripting (bash, awk/gawk, sed, tr, cat, etc.). In addition, you may use curl for fetching documents from the shell, or libcurl for fetching documents from C. In C, you may use regular expressions for parsing documents; see regex.h, or 'man regex'. Using regular expressions is not mandatory, but it is recommended (it's an awesome tool to know).

You may not use wget instead of curl, and if you find a version of curl that does recursive document retrieval, you may not use that either. Languages other than the ones mentioned above are by request only. In general, Python, Perl and Ruby are not allowed, nor are most other general-purpose languages.

You do not need to handle sites that have an infinite or impractically large (say a google search for "email") number of valid URLs.

curl gets unhappy with certain https links. You will not be penalized for reporting these as broken links, even if they work in your browser.

Why/how is this homework 12 points?

For grading purposes, this homework will be considered as the normal 6-point homework, plus a second 6-point bonus homework. If you get 6 points total, this will result in one homework with a 100% score, and one with a 0% score. If you get 11 points total, you'll have one homework with 6 points, and one with 5 points.

Why? Just because I'm such a nice guy. ;)

Hints

With regard to parsing HTML with regular expressions, I found this answer to be particularly enlightening. Be sure to read the entire answer.

The following command-line snippet demonstrates extracting "img src" and "a href" URLs with regular expressions on Linux. On OS X, replace "sed -r" with "sed -E".

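# Print one "a href" / "img src" URL per line. '\'' embeds a literal single quote in the sed script.
curl http://www.cs.uic.edu/ | sed 's/>/>\
/g' | sed -r -n 's/.*<(a|img)[^>]*(href|src)=["'\'']?([^"'\'' >]*)["'\'']?.*$/\3/p'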

Here the 'sed' command with a newline actually adds a newline after each '>' character, to make sure there is at most one URL per line. With "curl", if a page download fails it'll give you a non-zero return code. You can check it like this:

curl http://some.strange.url
if (( $? != 0 )); then
 echo "curl returned error";
fi

In other cases, the download succeeds, but you get an HTTP error code (404 not found, for example). To see the error codes, use

curl -v http://theurl 2> /tmp/stderr.log > /tmp/thefile

This will output the HTTP header on stderr, which is then redirected to /tmp/stderr.log. In this example, stdout is also redirected, but to a different file.
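
To turn that log into a broken/not-broken decision, something like the following sketch might help. The function names are invented for illustration. `status_from_log` relies on curl's verbose output prefixing response header lines with "< ". As an alternative to parsing the log, `check_url` uses curl's `-w '%{http_code}'` option, which prints just the numeric status code; whether that fits your design is up to you.

```shell
#!/bin/sh
# Hypothetical helpers; names and structure are illustrative only.

# Print the status code of the last response in a `curl -v` stderr log.
# Response header lines in that log start with "< ".
status_from_log() {
  sed -n 's/^< HTTP\/[0-9.]* \([0-9][0-9][0-9]\).*/\1/p' "$1" | tail -n 1
}

# Succeeds if the code is in the 400 or 500 range.
is_broken_status() {
  case "$1" in
    [45][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if the URL could be fetched and did not return a 4xx/5xx code.
# -s silences progress output, -o /dev/null discards the body, and
# -w '%{http_code}' prints the numeric status code on stdout.
# Side effect: leaves the code in the global variable http_code.
check_url() {
  http_code=$(curl -s -o /dev/null -w '%{http_code}' "$1") || return 1
  ! is_broken_status "$http_code"
}
```

For example, `check_url http://some.strange.url || echo "broken"` would report both failed downloads (non-zero curl exit code) and HTTP error responses.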

Watch out for relative URLs, both in HTTP redirects and in HTML pages. They can be either host-relative (starting with /) or path-relative (starting without /), and may contain both '.' and '..'. They may also start with "http:" but without the "//", as illustrated on http://rites.uic.edu.
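
The cases listed above can be handled with plain shell parameter expansion, roughly as sketched below. `resolve_url` and `normalize_path` are hypothetical names, and this is a simplified sketch rather than a complete resolver: it assumes the base URL is an absolute http or https URL with a path, that non-http schemes such as mailto: were filtered out beforehand, and it simply drops any #fragment.

```shell
#!/bin/sh
# Hypothetical helpers; a sketch, not a full RFC 3986 resolver.

# Collapse "." and ".." segments in an absolute path such as /a/./b/../c.
# The body runs in a subshell "( ... )" so IFS and "set -f" stay local.
normalize_path() (
  path=$1
  out=
  trailing=
  case "$path" in */|*/.|*/..) trailing=/ ;; esac
  IFS=/
  set -f                       # no globbing while splitting on "/"
  for seg in $path; do
    case "$seg" in
      ''|.) ;;                 # skip empty and "." segments
      ..)   out=${out%/*} ;;   # drop the last kept segment
      *)    out=$out/$seg ;;
    esac
  done
  result=$out$trailing
  printf '%s\n' "${result:-/}"
)

# resolve_url BASE REF: print REF as an absolute URL relative to BASE.
resolve_url() (
  base=$1
  ref=${2%%\#*}                # drop any #fragment
  scheme=${base%%://*}
  case "$ref" in
    http://*|https://*) printf '%s\n' "$ref"; exit 0 ;;
    //*)    printf '%s:%s\n' "$scheme" "$ref"; exit 0 ;;
    http:*) ref=${ref#http:} ;;  # "http:" without "//": rest is relative
  esac
  rest=${base#*://}
  host=${rest%%/*}
  case "$rest" in
    */*) path=/${rest#*/} ;;
    *)   path=/ ;;
  esac
  path=${path%%[?#]*}          # base path without query or fragment
  case "$ref" in
    '') merged=$path ;;
    /*) merged=$ref ;;                 # host-relative
    *)  merged=${path%/*}/$ref ;;      # path-relative
  esac
  query=
  case "$merged" in
    *[?]*) query="?${merged#*[?]}"; merged=${merged%%[?]*} ;;
  esac
  printf '%s://%s%s%s\n' "$scheme" "$host" "$(normalize_path "$merged")" "$query"
)
```

For instance, `resolve_url http://amita.cs.uic.edu/ www.openwrt.org` prints http://amita.cs.uic.edu/www.openwrt.org, the broken URL from the first example output. When a fetch was redirected, the base passed in should be the final URL of the page, not the one originally requested.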

Big dump for http://www1.cs.uic.edu


./hw9 http://www1.cs.uic.edu
Visited pages: 92                                                
URLs checked: 1125
Broken URLs (51):
http://www1.cs.uic.edu/~webmaster/dls/distlect05.html linked from:
  http://www1.cs.uic.edu/www/home.php?audience=public
  http://www1.cs.uic.edu/www/home.php?audience=public&label=
  http://www1.cs.uic.edu/www/newsArchive.php?audience=public&label=News
http:// linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://liu.ece.uic.edu linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://vienna.che.uic.edu/personalpage/ linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://acm.cs.uic.edu/~cpw/stump/rules.html linked from:
  http://www1.cs.uic.edu/www/calendar.php?audience=public&label=Calendar
http://acm.eecs.uic.edu/~cpw/stump/ linked from:
  http://www1.cs.uic.edu/www/calendar.php?audience=public&label=Calendar
http://linux.pharm.uic.edu/ linked from:
  http://www1.cs.uic.edu/www/calendar.php?audience=public&label=Calendar
http://multimedia.ece.uic.edu/~ashfaq/ linked from:
  http://www1.cs.uic.edu/www/faculty.php?audience=public
  http://www1.cs.uic.edu/www/faculty.php?audience=public&label=Faculty
http://www1.cs.uic.edu/CSweb/documents/gradmanual2002.pdf linked from:
  http://www1.cs.uic.edu/www/gradadmit.php?audience=public&label=Graduate
  http://www1.cs.uic.edu/www/gradadmit.php?audience=public&label=Graduate%20Admissions
http://acm.cs.uic.edu/ linked from:
  http://www1.cs.uic.edu/www/links.php?audience=public&label=Links
http://www1.cs.uic.edu/CSweb/speakers/andrewYao.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/aravind.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/derHorngLee.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/FransKaashoek.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/janetKoledner.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/lanceFortnow.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/langford.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/leslieLamport.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/linCai.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/moshe.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/rajJain.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/rayDeCarlo.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/riccardoPucella.php linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/www/<li><a linked from:
  http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/cristea.htm linked from:
  http://www1.cs.uic.edu/www/news.php?audience=public&label=&ind=93
http://www1.cs.uic.edu/CSweb/speakers/chrisDing.htm linked from:
  http://www1.cs.uic.edu/www/news.php?audience=public&label=&ind=96
https://bannerweb.apps.uillinois.edu/uic/prospect linked from:
  http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&amp;label=Undergraduate%20Admissions
  http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&label=Undergraduate
http://www2new.cs.uic.edu/www/home.php?audience=public linked from:
  http://www1.cs.uic.edu/www/
http://www.cs.uic.edu/~abicknel/ linked from:
  http://cs.uic.edu/~abicknel/
http://www.cs.uic.edu/~aganti/ linked from:
  http://cs.uic.edu/~aganti/
http://www.cs.uic.edu/~ashoukry/ linked from:
  http://cs.uic.edu/~ashoukry/
http://www.cs.uic.edu/~awalters/ linked from:
  http://cs.uic.edu/~awalters/
http://www.cs.uic.edu/~ekhokhlo/ linked from:
  http://cs.uic.edu/~ekhokhlo/
http://www.cs.uic.edu/~kapichon/ linked from:
  http://cs.uic.edu/~kapichon/
http://www.cs.uic.edu/~pgoripar/ linked from:
  http://cs.uic.edu/~pgoripar/
http://www.cs.uic.edu/~rlamoren/ linked from:
  http://cs.uic.edu/~rlamoren/
http://www.cs.uic.edu/~sfaci/ linked from:
  http://cs.uic.edu/~sfaci/
http://www.cs.uic.edu/~smorris/ linked from:
  http://cs.uic.edu/~smorris/
http://www.cs.uic.edu/~vpritik1/ linked from:
  http://cs.uic.edu/~vpritik1/
http://www.ego.net/us/il/chicago/ttd/default.asp linked from:
  http://www1.cs.uic.edu/www/contact.php?audience=public&label=Contact
http://www.evl.uic.edu/EVL/EVLERS/dana.html linked from:
  http://www1.cs.uic.edu/www/staff.php?audience=public&label=Staff
http://www.me.uic.edu/faculty/cetinkunt.htm linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.me.uic.edu/faculty/darabi.htm linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.ohare.com/midway/home.asp linked from:
  http://www1.cs.uic.edu/www/contact.php?audience=public&label=Contact
http://www.uic.edu/cba/cba-depts/ids/facultyprofiles/aris.htm linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/cba/cba-depts/ids/facultyprofiles/wxding.htm linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/depts/bioe/faculty/u_diwekar/index.htm linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/depts/bioe/faculty/y_dai/index.htm linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/depts/enga/currstud/studentactivities.htm linked from:
  http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&amp;label=Undergraduate%20Admissions
  http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&label=Undergraduate
http://www.uic.edu/depts/oae/campus_accessibility_map.html linked from:
  http://www1.cs.uic.edu/www/contact.php?audience=public&label=Contact
http://www.uic.edu/depts/psch/ohlson-1.html linked from:
  http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct

Topic revision: r6 - 2009-11-30 - 20:58:55 - Main.jakob
 
Copyright 2016 The Board of Trustees of the University of Illinois. webmaster@cs.uic.edu