---+ Homework 9 - crawling the web

In our last homework, we return to the application layer. You will write a simple web-crawling application, which can be used to inventory a website and to check the validity of the links on that site. Given the following command on the command line:

<verbatim>
./hw9 http://www.thesite.com/thepath
</verbatim>

your program will do the following:

   * For each "a href" link that leads to a document with the same path prefix, follow the link and repeat this process for that document.
   * For each "a href" and "img src" URL, visit the URL. If the URL returns an error code (400 or 500 range), report this error.
   * Also report the following statistics:
      * documents visited
      * links checked (including visited documents)
      * broken links found

Take care to make your program work on relative as well as absolute links, and to properly ignore any non-http links.

The following is an example output. You may report more statistics, as long as the statistics below are easy to identify in the output.

<verbatim>
./hw9 http://amita.cs.uic.edu
Visited pages: 1
Links checked: 11
Broken links ( 1):
http://amita.cs.uic.edu/www.openwrt.org

./hw9 http://www.cs.uic.edu/~llyons/
Visited pages: 9
Links checked: 22
Broken links (1):
http://www.cs.uic.edu/~llyons/www.mcmaster.com

./hw9 http://rites.uic.edu
Visited pages: 13
Links checked: 85
Broken links ( 4):
http://www.ssn.uillinois.edu/html/ssn_forms_docs.html
http://www.uic.edu/depts/las
http://www.uillinois.edu/about/policies.html
http://www.vpaa.uillinois.edu/policies/internet.asp?bhcp=1
</verbatim>

Other good sites to try your stuff on:

   * http://logos.cs.uic.edu/reed/
   * http://www.uic.edu

---++ The rules

For this assignment, you may use either C/C++ or shell scripting (bash, awk/gawk, sed, tr, cat, etc). In addition, you may use curl for fetching documents from the shell, or libcurl for fetching documents from C. In C, you may use regular expressions for parsing documents; see regex.h or 'man regex'.
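
As an illustration of the regex.h route mentioned above, here is a minimal sketch of pulling "a href" and "img src" values out of a document that is already in memory. The function name =extract_links=, the =URL_MAX= limit, and the pattern itself are assumptions made for this example, not part of the assignment. The pattern only handles quoted attribute values and is not a full HTML parser.

```c
#include <regex.h>
#include <string.h>

#define URL_MAX 512   /* assumed upper bound on a stored link, for this sketch */

/* Hypothetical helper: scans html for <a ... href="..."> and
 * <img ... src="..."> and copies each quoted attribute value into out[].
 * Returns the number of links stored, or -1 if the pattern does not compile.
 * Unquoted attribute values are not matched. */
int extract_links(const char *html, char out[][URL_MAX], int max_links)
{
    /* Group 1 is the tag/attribute part, group 2 is the URL text. */
    const char *pattern =
        "<(a[[:space:]][^>]*href|img[[:space:]][^>]*src)"
        "[[:space:]]*=[[:space:]]*[\"']([^\"'>]*)[\"']";
    regex_t re;
    regmatch_t m[3];
    const char *p = html;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    /* Repeatedly search the remainder of the document. */
    while (count < max_links && regexec(&re, p, 3, m, 0) == 0) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
        if (len >= URL_MAX)
            len = URL_MAX - 1;          /* truncate over-long values */
        memcpy(out[count], p + m[2].rm_so, len);
        out[count][len] = '\0';
        count++;
        p += m[0].rm_eo;                /* continue after this match */
    }

    regfree(&re);
    return count;
}
```

A caller would pass the fetched document text and a =char links[N][URL_MAX]= array, then loop over the returned count. The extracted values are still raw attribute text, so relative links need to be resolved before they can be visited.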
You *may not* use wget instead of curl, and if you find a version of curl that does recursive document retrieval, you may not use that either. Languages other than the ones mentioned above are by request only. In general, Python, Perl, and Ruby are not allowed, nor are most other general-purpose languages.
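
The requirement to handle relative as well as absolute links, and to ignore non-http links, can be sketched in C as follows. The names =resolve_link= and =same_prefix= are hypothetical, and the rules are a simplification: the sketch assumes the base URL has no query string or fragment, does not normalize "./" or "../" segments, and does not handle query-only references. Treating "same path prefix" as a plain string-prefix test against the starting URL is one possible reading of the assignment, not the only one.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive prefix test; prefix must be given in lower case. */
static int starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s++) != *prefix++)
            return 0;
    }
    return 1;
}

static int is_http(const char *s)
{
    return starts_with_nocase(s, "http://") || starts_with_nocase(s, "https://");
}

/* Returns 1 if s begins with a URL scheme such as "mailto:" or "ftp:". */
static int has_scheme(const char *s)
{
    if (!isalpha((unsigned char)*s))
        return 0;
    while (isalnum((unsigned char)*s) || *s == '+' || *s == '-' || *s == '.')
        s++;
    return *s == ':';
}

/* Hypothetical helper: resolves href against base and writes the result to
 * out. Returns 1 on success, 0 if the link should be ignored (empty,
 * fragment-only, non-http scheme) or does not fit in out. */
int resolve_link(const char *base, const char *href, char *out, size_t outsz)
{
    int n = 0;

    if (!is_http(base) || href[0] == '\0' || href[0] == '#')
        return 0;

    if (is_http(href)) {
        n = snprintf(out, outsz, "%s", href);            /* already absolute */
    } else if (has_scheme(href)) {
        return 0;                                        /* mailto:, ftp:, ... */
    } else {
        const char *host = strstr(base, "://") + 3;
        const char *path = strchr(host, '/');            /* NULL: no path */

        if (href[0] == '/' && href[1] == '/') {
            /* scheme-relative: reuse "http:" or "https:" from the base */
            n = snprintf(out, outsz, "%.*s%s", (int)(host - 2 - base), base, href);
        } else if (href[0] == '/') {
            /* host-relative: keep scheme and host only */
            size_t root = path ? (size_t)(path - base) : strlen(base);
            n = snprintf(out, outsz, "%.*s%s", (int)root, base, href);
        } else if (path == NULL) {
            n = snprintf(out, outsz, "%s/%s", base, href);
        } else {
            /* document-relative: keep the base up to its last '/' */
            size_t dir = (size_t)(strrchr(path, '/') + 1 - base);
            n = snprintf(out, outsz, "%.*s%s", (int)dir, base, href);
        }
    }
    return n > 0 && (size_t)n < outsz;
}

/* One reading of "same path prefix": url starts with the starting URL. */
int same_prefix(const char *start_url, const char *url)
{
    return strncmp(start_url, url, strlen(start_url)) == 0;
}
```

Under these assumptions, resolving the relative link =www.openwrt.org= against =http://amita.cs.uic.edu= yields =http://amita.cs.uic.edu/www.openwrt.org=, which matches the broken link reported in the example output.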
Topic revision: r2 - 2009-11-15 - 07:13:05 - Main.jakob