---+ Homework 9 - crawling the web (12 points!)

Due Mon Nov 30 at 2 pm.

In our last homework, we return to the application layer. You will write a simple web-crawling application, which may be used to inventory a website and to check the validity of the links on that site. Given the following command on the command line:

<verbatim>
./hw9 http://www.thesite.com/thepath
</verbatim>

your program will do the following:

   * For each "a href" link that leads to a document with the same path prefix, follow the link and repeat this process for that document.
   * For each "a href" and "img src" URL, visit the URL. If the URL returns an error code (400 or 500 range), report the error.
   * Also report the following statistics:
      * documents visited
      * links checked (including visited documents)
      * broken links found

Take care to make your program work on relative as well as absolute links, and to properly ignore any non-http links.

The following is an example output. You may report more statistics, as long as the statistics below are easy to identify in the output.
<verbatim>
./hw9 http://amita.cs.uic.edu
Visited pages: 1
URLs checked: 11
Broken URLs ( 1):
http://amita.cs.uic.edu/www.openwrt.org linked from:
   http://amita.cs.uic.edu

./hw9 http://www.cs.uic.edu/~llyons/
Visited pages: 9
URLs checked: 22
Broken URLs (1):
http://www.cs.uic.edu/~llyons/www.mcmaster.com linked from:
   http://www.cs.uic.edu/~llyons/

./hw9 http://rites.uic.edu
Visited pages: 15
URLs checked: 94
Broken URLs (4):
http://www.ssn.uillinois.edu/html/ssn_forms_docs.html linked from:
   http://rites.uic.edu/iaPolicies.html
http://www.uic.edu/depts/las linked from:
   http://rites.uic.edu
   http://rites.uic.edu/courses.html
   http://rites.uic.edu/faculty.html
   http://rites.uic.edu/graduates.html
   http://rites.uic.edu/index.html
   http://rites.uic.edu/industry.html
   http://rites.uic.edu/labs.html
   http://rites.uic.edu/misc.html
   http://rites.uic.edu/projects.html
   http://rites.uic.edu/students.html
http://www.uillinois.edu/about/policies.html linked from:
   http://rites.uic.edu/iaPolicies.html
http://www.vpaa.uillinois.edu/policies/internet.asp?bhcp=1 linked from:
   http://rites.uic.edu/iaPolicies.html
</verbatim>

Other good sites to try your stuff on:

   * http://logos.cs.uic.edu/reed/
   * http://www.uic.edu

We'll be using this one as one of the evaluation examples: http://www1.cs.uic.edu

---++ The rules

For this assignment, you may use either C/C++ or shell scripting (bash, awk/gawk, sed, tr, cat, etc.). In addition, you may use curl for fetching documents from the shell, or libcurl for fetching documents from C. In C, you may use regular expressions for parsing documents; see regex.h, or 'man regex'. Using regular expressions is not mandatory, but recommended (it's an awesome tool to know).

You *may not* use wget instead of curl, and if you find a version of curl that does recursive document retrieval, you may not use that either.

Languages other than the ones mentioned above are by request only.
In general, Python, Perl and Ruby are not allowed, nor are most other general-purpose languages.

You do not need to handle sites that have an infinite or impractically large number of valid URLs (say, a Google search for "email").

curl gets unhappy with certain https links. You will not be penalized for reporting these as broken links, even if they work in your browser.

---++ Why/how is this homework 12 points?

For grading purposes, this homework will be treated as a normal 6-point homework, plus a second 6-point bonus homework. If you get 6 points total, this will result in one homework with a 100% score, and one with a 0% score. If you get 11 points total, you'll have one homework with 6 points, and one with 5 points.

Why? Just because I'm such a nice guy. ;-)

---++ Hints

With regard to parsing HTML with regular expressions, I found [[http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454][this answer]] to be particularly enlightening. Be sure to read the entire answer.

The following command-line snippet demonstrates extracting "img src" and "a href" URLs with regular expressions on Linux. On OS X, replace "sed -r" with "sed -E".

<verbatim>
curl http://www.cs.uic.edu/ | sed 's/>/>\
/g' | sed -r -n 's/.*<(a|img)[^>]*(href|src)=["'\'']?([^"'\'' >]*)["'\'']?.*$/\3/p'
</verbatim>

Here the first 'sed' command, the one containing a literal newline, adds a newline after each '>' character, to make sure there is at most one URL per line. Each '\'' sequence places a literal single quote inside the single-quoted sed script, so that attribute values in either single or double quotes are matched.

With "curl", if a page download fails it'll give you a non-zero return code. You can check it like this:

<verbatim>
curl http://some.strange.url
if (( $? != 0 )); then
  echo "curl returned error"
fi
</verbatim>

In other cases, the download succeeds, but you get an HTTP error code (404 not found, for example). To see the error codes, use

<verbatim>
curl -v http://theurl 2> /tmp/stderr.log > /tmp/thefile
</verbatim>

This prints the HTTP headers on stderr, which is then redirected to /tmp/stderr.log.
In this example, stdout is also redirected, but to a different file.

Watch out for relative URLs, both in HTTP redirects and in HTML pages. They can be either host-relative (starting with /) or path-relative (starting without /), and may contain both '.' and '..'. They may also start with "http:" but without the "//", as illustrated on http://rites.uic.edu

---++ Big dump for http://www1.cs.uic.edu

<verbatim>
./hw9 http://www1.cs.uic.edu
Visited pages: 92
URLs checked: 1125
Broken URLs (51):
http://www1.cs.uic.edu/~webmaster/dls/distlect05.html linked from:
   http://www1.cs.uic.edu/www/home.php?audience=public
   http://www1.cs.uic.edu/www/home.php?audience=public&label=
   http://www1.cs.uic.edu/www/newsArchive.php?audience=public&label=News
http:// linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://liu.ece.uic.edu linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://vienna.che.uic.edu/personalpage/ linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://acm.cs.uic.edu/~cpw/stump/rules.html linked from:
   http://www1.cs.uic.edu/www/calendar.php?audience=public&label=Calendar
http://acm.eecs.uic.edu/~cpw/stump/ linked from:
   http://www1.cs.uic.edu/www/calendar.php?audience=public&label=Calendar
http://linux.pharm.uic.edu/ linked from:
   http://www1.cs.uic.edu/www/calendar.php?audience=public&label=Calendar
http://multimedia.ece.uic.edu/~ashfaq/ linked from:
   http://www1.cs.uic.edu/www/faculty.php?audience=public
   http://www1.cs.uic.edu/www/faculty.php?audience=public&label=Faculty
http://www1.cs.uic.edu/CSweb/documents/gradmanual2002.pdf linked from:
   http://www1.cs.uic.edu/www/gradadmit.php?audience=public&label=Graduate
   http://www1.cs.uic.edu/www/gradadmit.php?audience=public&label=Graduate%20Admissions
http://acm.cs.uic.edu/ linked from:
   http://www1.cs.uic.edu/www/links.php?audience=public&label=Links
http://www1.cs.uic.edu/CSweb/speakers/andrewYao.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/aravind.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/derHorngLee.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/FransKaashoek.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/janetKoledner.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/lanceFortnow.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/langford.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/leslieLamport.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/linCai.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/moshe.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/rajJain.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/rayDeCarlo.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/riccardoPucella.php linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/www/<li><a linked from:
   http://www1.cs.uic.edu/www/seminars.php?audience=public&label=Seminars
http://www1.cs.uic.edu/CSweb/speakers/cristea.htm linked from:
   http://www1.cs.uic.edu/www/news.php?audience=public&label=&ind=93
http://www1.cs.uic.edu/CSweb/speakers/chrisDing.htm linked from:
   http://www1.cs.uic.edu/www/news.php?audience=public&label=&ind=96
https://bannerweb.apps.uillinois.edu/uic/prospect linked from:
   http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&label=Undergraduate%20Admissions
   http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&label=Undergraduate
http://www2new.cs.uic.edu/www/home.php?audience=public linked from:
   http://www1.cs.uic.edu/www/
http://www.cs.uic.edu/~abicknel/ linked from:
   http://cs.uic.edu/~abicknel/
http://www.cs.uic.edu/~aganti/ linked from:
   http://cs.uic.edu/~aganti/
http://www.cs.uic.edu/~ashoukry/ linked from:
   http://cs.uic.edu/~ashoukry/
http://www.cs.uic.edu/~awalters/ linked from:
   http://cs.uic.edu/~awalters/
http://www.cs.uic.edu/~ekhokhlo/ linked from:
   http://cs.uic.edu/~ekhokhlo/
http://www.cs.uic.edu/~kapichon/ linked from:
   http://cs.uic.edu/~kapichon/
http://www.cs.uic.edu/~pgoripar/ linked from:
   http://cs.uic.edu/~pgoripar/
http://www.cs.uic.edu/~rlamoren/ linked from:
   http://cs.uic.edu/~rlamoren/
http://www.cs.uic.edu/~sfaci/ linked from:
   http://cs.uic.edu/~sfaci/
http://www.cs.uic.edu/~smorris/ linked from:
   http://cs.uic.edu/~smorris/
http://www.cs.uic.edu/~vpritik1/ linked from:
   http://cs.uic.edu/~vpritik1/
http://www.ego.net/us/il/chicago/ttd/default.asp linked from:
   http://www1.cs.uic.edu/www/contact.php?audience=public&label=Contact
http://www.evl.uic.edu/EVL/EVLERS/dana.html linked from:
   http://www1.cs.uic.edu/www/staff.php?audience=public&label=Staff
http://www.me.uic.edu/faculty/cetinkunt.htm linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.me.uic.edu/faculty/darabi.htm linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.ohare.com/midway/home.asp linked from:
   http://www1.cs.uic.edu/www/contact.php?audience=public&label=Contact
http://www.uic.edu/cba/cba-depts/ids/facultyprofiles/aris.htm linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/cba/cba-depts/ids/facultyprofiles/wxding.htm linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/depts/bioe/faculty/u_diwekar/index.htm linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/depts/bioe/faculty/y_dai/index.htm linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
http://www.uic.edu/depts/enga/currstud/studentactivities.htm linked from:
   http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&label=Undergraduate%20Admissions
   http://www1.cs.uic.edu/www/ugradadmit.php?audience=public&label=Undergraduate
http://www.uic.edu/depts/oae/campus_accessibility_map.html linked from:
   http://www1.cs.uic.edu/www/contact.php?audience=public&label=Contact
http://www.uic.edu/depts/psch/ohlson-1.html linked from:
   http://www1.cs.uic.edu/www/adjunct.php?audience=public&label=Adjunct
</verbatim>
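---++ Sketch: resolving relative URLs

The hint about relative URLs covers several cases at once (host-relative, path-relative, '.' and '..' segments, and "http:" without "//"), so a small sketch may help. The bash function below is one possible way to turn a link into an absolute URL, not a reference solution. The name =resolve_url= and its two arguments (the URL of the page the link was found on, and the link itself) are made up for this illustration. It also assumes that fragments (#...) can be dropped and that links with any scheme other than http or https are to be ignored.

```shell
#!/bin/bash
# resolve_url BASE REF  (hypothetical helper, not part of the assignment)
# Prints the absolute URL that REF refers to when it appears in document BASE.
# Returns 1 and prints nothing for non-http links (mailto:, ftp:, ...).
resolve_url() {
    local base="${1%%#*}" ref="${2%%#*}"      # drop #fragments (assumption)
    local abs_re='^(https?://[^/?]+)([^?]*)(\?.*)?$'
    local scheme_re='^[a-zA-Z][a-zA-Z0-9+.-]*:'
    local origin basepath basequery dir path query="" seg out=""
    local -a segs=() stack=()

    # Split BASE into "http://host", its path, and its query string.
    [[ $base =~ $abs_re ]] || return 1
    origin=${BASH_REMATCH[1]}
    basepath=${BASH_REMATCH[2]:-/}
    basequery=${BASH_REMATCH[3]}
    dir=${basepath%/*}/                       # "/a/b.html" -> "/a/"

    case $ref in
        http://*|https://*) printf '%s\n' "$ref"; return 0 ;;   # already absolute
        //*)    printf '%s\n' "${origin%%//*}$ref"; return 0 ;; # reuse base scheme
        http:*) ref=${ref#http:} ;;           # "http:" without "//": relative
    esac
    if [[ $ref =~ $scheme_re ]]; then return 1; fi   # any other scheme: ignore

    # Set the query string aside so "." and ".." inside it are left alone.
    if [[ $ref == *\?* ]]; then
        query="?${ref#*\?}"
        ref=${ref%%\?*}
    fi

    case $ref in
        "") path=$basepath                    # link to the same document
            if [[ -z $query ]]; then query=$basequery; fi ;;
        /*) path=$ref ;;                      # host-relative
        *)  path=$dir$ref ;;                  # path-relative
    esac

    # Walk the path segments: skip "." and empty ones, pop one level on "..".
    IFS=/ read -ra segs <<< "$path"
    for seg in "${segs[@]}"; do
        case $seg in
            ""|.) ;;
            ..)   if (( ${#stack[@]} > 0 )); then
                      stack=("${stack[@]:0:${#stack[@]}-1}")
                  fi ;;
            *)    stack+=("$seg") ;;
        esac
    done
    for seg in "${stack[@]}"; do out+="/$seg"; done
    case $path in */|*/.|*/..) out+="/" ;; esac   # keep a trailing slash
    printf '%s\n' "$origin${out:-/}$query"
}
```

For example, =resolve_url http://www.cs.uic.edu/~llyons/ www.mcmaster.com= prints =http://www.cs.uic.edu/~llyons/www.mcmaster.com=, which is the broken link reported in the example output near the top of this page. A made-up link such as =http:faculty.html= found on =http://rites.uic.edu= resolves to =http://rites.uic.edu/faculty.html=, and a =mailto:= link makes the function return 1 so the caller can skip it.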
Topic revision: r5 - 2009-11-27 - 16:34:54 - Main.jakob