java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.fetcher.FetchHTTP
public class FetchHTTP
HTTP fetcher that uses the Apache Jakarta Commons HttpClient library.
| Nested Class Summary | |
|---|---|
(package private) class |
FetchHTTP.PostRestore
|
| Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
|---|
ComplexType.MBeanAttributeInfoIterator |
| Field Summary | |
|---|---|
static java.lang.String |
ATTR_ACCEPT_HEADERS
|
static java.lang.String |
ATTR_BDB_COOKIES
|
static java.lang.String |
ATTR_DEFAULT_ENCODING
|
static java.lang.String |
ATTR_DIGEST_ALGORITHM
|
static java.lang.String |
ATTR_DIGEST_CONTENT
|
static java.lang.String |
ATTR_FETCH_BANDWIDTH_MAX
|
static java.lang.String |
ATTR_HTTP_BIND_ADDRESS
|
static java.lang.String |
ATTR_HTTP_PROXY_HOST
|
static java.lang.String |
ATTR_HTTP_PROXY_PORT
|
static java.lang.String |
ATTR_IGNORE_COOKIES
|
static java.lang.String |
ATTR_LOAD_COOKIES
|
static java.lang.String |
ATTR_MAX_LENGTH_BYTES
|
static java.lang.String |
ATTR_MIDFETCH_DECIDE_RULES
Rules to apply mid-fetch, just after the response headers are received and before the body download starts. |
static java.lang.String |
ATTR_SAVE_COOKIES
|
static java.lang.String |
ATTR_SEND_CONNECTION_CLOSE
|
static java.lang.String |
ATTR_SEND_IF_MODIFIED_SINCE
|
static java.lang.String |
ATTR_SEND_IF_NONE_MATCH
|
static java.lang.String |
ATTR_SEND_RANGE
|
static java.lang.String |
ATTR_SEND_REFERER
|
static java.lang.String |
ATTR_SOTIMEOUT_MS
|
static java.lang.String |
ATTR_TIMEOUT_SECONDS
|
static java.lang.String |
ATTR_TRUST
SSL trust level setting attribute name. |
protected com.sleepycat.je.Database |
cookieDb
Database backing cookie map, if using BDB |
static java.lang.String |
COOKIEDB_NAME
Name of cookie BDB Database |
static java.lang.String |
DEFAULT_DIGEST_ALGORITHM
Default algorithm to use for message digesting. |
(package private) static java.lang.Boolean |
DEFAULT_DIGEST_CONTENT
Default for whether to perform on-the-fly digest hashing of content bodies. |
static java.lang.String[] |
DIGEST_ALGORITHMS
|
static java.lang.String |
HTTP_SCHEME
|
static java.lang.String |
HTTPS_SCHEME
|
static java.lang.String |
MD5
|
static java.lang.String |
RANGE
|
static java.lang.String |
RANGE_PREFIX
|
static java.lang.String |
REFERER
|
(package private) static java.lang.String |
SERVER_CACHE_KEY
|
static java.lang.String |
SHA1
The digest algorithms to choose between: SHA-1 or MD5 at the moment. |
(package private) static java.lang.String |
SSL_FACTORY_KEY
|
| Fields inherited from class org.archive.crawler.framework.Processor |
|---|
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules |
| Fields inherited from class org.archive.crawler.settings.ComplexType |
|---|
definition, definitionMap |
| Constructor Summary | |
|---|---|
FetchHTTP(java.lang.String name)
Constructor. |
|
| Method Summary | |
|---|---|
protected void |
addResponseContent(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
This method populates curi with response status and
content type. |
protected boolean |
checkMidfetchAbort(CrawlURI curi,
HttpRecorderMethod method,
org.apache.commons.httpclient.HttpConnection conn)
|
protected void |
cleanupHttp()
Perform any final cleanup related to the HttpClient instance. |
protected void |
configureHttp()
|
protected org.apache.commons.httpclient.HostConfiguration |
configureMethod(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method)
Configure the HttpMethod, setting options and headers. |
void |
crawlCheckpoint(java.io.File checkpointDir)
Called by CrawlController when checkpointing. |
void |
crawlEnded(java.lang.String sExitMessage)
Called when a CrawlController has ended a crawl and is about to exit. |
void |
crawlEnding(java.lang.String sExitMessage)
Called when a CrawlController is ending a crawl (for any reason) |
void |
crawlPaused(java.lang.String statusMessage)
Called when a CrawlController is actually paused (all threads are idle). |
void |
crawlPausing(java.lang.String statusMessage)
Called when a CrawlController is going to be paused. |
void |
crawlResuming(java.lang.String statusMessage)
Called when a CrawlController is resuming a crawl that had been paused. |
void |
crawlStarted(java.lang.String message)
Called on crawl start. |
protected void |
doAbort(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
java.lang.String annotation)
|
void |
finalTasks()
Classes subclassing this one should override this method to perform processor specific actions. |
protected java.lang.Object |
getAttributeEither(CrawlURI curi,
java.lang.String key)
Get a value either from inside the CrawlURI instance, or from settings (module attributes). |
protected org.apache.commons.httpclient.auth.AuthScheme |
getAuthScheme(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
|
protected org.apache.commons.httpclient.HttpClient |
getHttp()
|
protected DecideRule |
getMidfetchRule(java.lang.Object o)
|
protected void |
handle401(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
Server is looking for basic/digest auth credentials (RFC2617). |
void |
initialTasks()
Classes subclassing this one should override this method to perform processor specific actions. |
protected void |
innerProcess(CrawlURI curi)
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI. |
protected void |
listUsedFiles(java.util.List<java.lang.String> list)
Those Modules that use files on disk should list them all when this method is called. |
void |
loadCookies()
Load cookies from the file specified in the order file. |
void |
loadCookies(java.lang.String cookiesFile)
Load cookies from a file before the first fetch. |
java.lang.String |
report()
Compiles and returns a report (in human readable form) about the status of the processor. |
void |
saveCookies()
Saves cookies to the file specified in the order file. |
void |
saveCookies(java.lang.String saveCookiesFile)
Saves cookies to a file. |
protected void |
setConditionalGetHeader(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
java.lang.String setting,
java.lang.String sourceHeader,
java.lang.String targetHeader)
Set the given conditional-GET header, if the setting is enabled and a suitable value is available in the URI history. |
protected void |
setSizes(CrawlURI curi,
HttpRecorder rec)
Update CrawlURI internal sizes based on current transaction (and in the case of 304s, history) |
| Methods inherited from class org.archive.crawler.framework.Processor |
|---|
checkForInterrupt, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
| Methods inherited from class org.archive.crawler.settings.ModuleType |
|---|
addElement |
| Methods inherited from class org.archive.crawler.settings.Type |
|---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
| Methods inherited from class javax.management.Attribute |
|---|
getName |
| Methods inherited from class java.lang.Object |
|---|
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
public static final java.lang.String ATTR_HTTP_PROXY_HOST
public static final java.lang.String ATTR_HTTP_PROXY_PORT
public static final java.lang.String ATTR_TIMEOUT_SECONDS
public static final java.lang.String ATTR_SOTIMEOUT_MS
public static final java.lang.String ATTR_MAX_LENGTH_BYTES
public static final java.lang.String ATTR_LOAD_COOKIES
public static final java.lang.String ATTR_SAVE_COOKIES
public static final java.lang.String ATTR_ACCEPT_HEADERS
public static final java.lang.String ATTR_DEFAULT_ENCODING
public static final java.lang.String ATTR_DIGEST_CONTENT
public static final java.lang.String ATTR_DIGEST_ALGORITHM
public static final java.lang.String ATTR_FETCH_BANDWIDTH_MAX
public static final java.lang.String ATTR_TRUST
static java.lang.Boolean DEFAULT_DIGEST_CONTENT
public static final java.lang.String SHA1
public static final java.lang.String MD5
public static java.lang.String[] DIGEST_ALGORITHMS
public static final java.lang.String DEFAULT_DIGEST_ALGORITHM
public static final java.lang.String ATTR_MIDFETCH_DECIDE_RULES
public static final java.lang.String ATTR_SEND_CONNECTION_CLOSE
public static final java.lang.String ATTR_SEND_REFERER
public static final java.lang.String ATTR_SEND_RANGE
public static final java.lang.String ATTR_SEND_IF_MODIFIED_SINCE
public static final java.lang.String ATTR_SEND_IF_NONE_MATCH
public static final java.lang.String REFERER
public static final java.lang.String RANGE
public static final java.lang.String RANGE_PREFIX
public static final java.lang.String HTTP_SCHEME
public static final java.lang.String HTTPS_SCHEME
public static final java.lang.String ATTR_IGNORE_COOKIES
public static final java.lang.String ATTR_BDB_COOKIES
public static final java.lang.String ATTR_HTTP_BIND_ADDRESS
protected com.sleepycat.je.Database cookieDb
public static final java.lang.String COOKIEDB_NAME
static final java.lang.String SERVER_CACHE_KEY
static final java.lang.String SSL_FACTORY_KEY
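The digest fields above (ATTR_DIGEST_CONTENT, ATTR_DIGEST_ALGORITHM, SHA1, MD5) describe on-the-fly digest hashing of content bodies. The following is a minimal sketch of that general technique using only the JDK, not the code FetchHTTP actually runs. The class name OnTheFlyDigestSketch and its method are hypothetical, and the algorithm names passed in are the standard JCA names "SHA-1" and "MD5", which are not necessarily the values of the SHA1 and MD5 constants.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes a body while it is being read, so no second pass is needed.
final class OnTheFlyDigestSketch {

    /** Reads the stream to its end and returns the lowercase hex digest of the bytes read. */
    static String digestWhileReading(InputStream body, String algorithm) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
        try (DigestInputStream in = new DigestInputStream(body, digest)) {
            byte[] buffer = new byte[8192];
            // Each read updates the digest as a side effect; a real fetcher
            // would also record the bytes here.
            while (in.read(buffer) != -1) {
                // nothing else to do in this sketch
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Wrapping the stream means the hash is complete as soon as the body has been consumed, which is why a setting like this can be toggled without changing how the body is read.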
| Constructor Detail |
|---|
public FetchHTTP(java.lang.String name)
Parameters:
name - Name of this processor.
| Method Detail |
|---|
protected void innerProcess(CrawlURI curi)
throws java.lang.InterruptedException
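Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.
Throws:
java.lang.InterruptedException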
protected void setSizes(CrawlURI curi,
HttpRecorder rec)
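Update CrawlURI internal sizes based on the current transaction (and, in the case of 304s, history).
Parameters:
curi - CrawlURI
rec - HttpRecorder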
protected void doAbort(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
java.lang.String annotation)
protected boolean checkMidfetchAbort(CrawlURI curi,
HttpRecorderMethod method,
org.apache.commons.httpclient.HttpConnection conn)
protected DecideRule getMidfetchRule(java.lang.Object o)
protected void addResponseContent(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
This method populates curi with response status and content type.
Parameters:
curi - CrawlURI to populate.
method - Method to get response status and headers from.
protected org.apache.commons.httpclient.HostConfiguration configureMethod(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method)
Configure the HttpMethod, setting options and headers.
Parameters:
curi - CrawlURI from which we pull configuration.
method - The Method to configure.
protected void setConditionalGetHeader(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
java.lang.String setting,
java.lang.String sourceHeader,
java.lang.String targetHeader)
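Set the given conditional-GET header, if the setting is enabled and a suitable value is available in the URI history.
Parameters:
curi - source CrawlURI
method - HTTP operation pending
setting - true/false enablement setting name to consult
sourceHeader - header to consult in URI history
targetHeader - header to set if possible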
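The parameter roles of setConditionalGetHeader can be illustrated with a small stand-alone sketch. It is not the Heritrix implementation: plain maps stand in for the URI history and for the pending HttpMethod, a boolean stands in for the looked-up setting, and the class name ConditionalGetSketch is hypothetical. In HTTP, a previous Last-Modified value is typically echoed back as If-Modified-Since, and a previous ETag as If-None-Match.

```java
import java.util.Map;

// Hypothetical stand-in: maps replace CrawlURI history and HttpMethod request headers.
final class ConditionalGetSketch {

    /**
     * Copies sourceHeader from the previous response into targetHeader of the
     * pending request, but only if the setting is enabled and a value exists.
     * Returns true when a header was set.
     */
    static boolean setConditionalGetHeader(boolean enabled,
                                           Map<String, String> previousResponseHeaders,
                                           Map<String, String> requestHeaders,
                                           String sourceHeader,
                                           String targetHeader) {
        if (!enabled) {
            return false; // setting switched off: send an unconditional request
        }
        String value = previousResponseHeaders.get(sourceHeader);
        if (value == null || value.isEmpty()) {
            return false; // nothing suitable in the history
        }
        requestHeaders.put(targetHeader, value);
        return true;
    }
}
```

Under these assumptions, a call with ("Last-Modified", "If-Modified-Since") or ("ETag", "If-None-Match") covers the two conditional-GET settings named in the field list, ATTR_SEND_IF_MODIFIED_SINCE and ATTR_SEND_IF_NONE_MATCH.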
protected java.lang.Object getAttributeEither(CrawlURI curi,
java.lang.String key)
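Get a value either from inside the CrawlURI instance, or from settings (module attributes).
Parameters:
curi - CrawlURI to consult
key - key to look up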
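The lookup that getAttributeEither describes is a two-level fallback. The sketch below shows that pattern with plain maps; it is an illustration rather than the actual method. The precedence (per-URI value first, settings second) is an assumption read from the wording of the description, and the class name AttributeEitherSketch is hypothetical.

```java
import java.util.Map;

// Hypothetical stand-in: maps replace the CrawlURI data and the module settings.
final class AttributeEitherSketch {

    /** Returns the per-URI value for key if present, otherwise the settings value (may be null). */
    static Object getAttributeEither(Map<String, Object> uriData,
                                     Map<String, Object> settings,
                                     String key) {
        // Assumed precedence: a value carried by the URI overrides the configured one.
        Object value = (uriData != null) ? uriData.get(key) : null;
        return (value != null) ? value : settings.get(key);
    }
}
```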
protected void handle401(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
Server is looking for basic/digest auth credentials (RFC2617).
Parameters:
method - Method that got a 401.
curi - CrawlURI that got a 401.
protected org.apache.commons.httpclient.auth.AuthScheme getAuthScheme(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
Parameters:
method - Method that got a 401.
curi - CrawlURI that got a 401.
public void initialTasks()
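Description copied from class: Processor
Classes subclassing this one should override this method to perform processor-specific actions.
This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.
Overrides:
initialTasks in class Processor
public void finalTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor-specific actions.
Overrides:
finalTasks in class Processor
protected void cleanupHttp()
Perform any final cleanup related to the HttpClient instance.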
protected void configureHttp()
throws java.lang.RuntimeException
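Throws:
java.lang.RuntimeException
public void loadCookies(java.lang.String cookiesFile)
Load cookies from a file before the first fetch.
The file is a text file in the Netscape 'cookies.txt' format.
Example entry of a cookies.txt file:
www.archive.org FALSE / FALSE 1074567117 details-visit texts-cralond
Each line has 7 tab-separated fields:
DOMAIN - the domain that created and can read the cookie.
FLAG - TRUE or FALSE; whether all hosts within the domain can read the cookie.
PATH - the path within the domain for which the cookie is valid.
SECURE - TRUE or FALSE; whether a secure connection is required to send the cookie.
EXPIRATION - expiry time as a Unix timestamp (seconds).
NAME - the cookie name.
VALUE - the cookie value.
Parameters:
cookiesFile - file in the Netscape 'cookies.txt' format.
public java.lang.String report()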
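Description copied from class: Processor
Compiles and returns a report (in human readable form) about the status of the processor.
Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
Overrides:
report in class Processor
public void loadCookies()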
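Load cookies from the file specified in the order file.
The file is a text file in the Netscape 'cookies.txt' format.
Example entry of a cookies.txt file:
www.archive.org FALSE / FALSE 1074567117 details-visit texts-cralond
Each line has 7 tab-separated fields:
DOMAIN - the domain that created and can read the cookie.
FLAG - TRUE or FALSE; whether all hosts within the domain can read the cookie.
PATH - the path within the domain for which the cookie is valid.
SECURE - TRUE or FALSE; whether a secure connection is required to send the cookie.
EXPIRATION - expiry time as a Unix timestamp (seconds).
NAME - the cookie name.
VALUE - the cookie value.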
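As an illustration of the cookies.txt layout, here is a minimal stand-alone parser sketch. It is not the loadCookies implementation and does not use HttpClient's Cookie class. The class CookiesTxtSketch, its Entry type and its field names are hypothetical, and the field order (domain, subdomain flag, path, secure flag, expiry, name, value) is the usual Netscape layout, assumed here to match what loadCookies expects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for Netscape-style cookies.txt lines (7 tab-separated fields).
final class CookiesTxtSketch {

    /** One parsed line; field names are illustrative only. */
    static final class Entry {
        final String domain;
        final boolean includeSubdomains;
        final String path;
        final boolean secure;
        final long expiresEpochSeconds;
        final String name;
        final String value;

        Entry(String[] f) {
            domain = f[0];
            includeSubdomains = Boolean.parseBoolean(f[1]); // accepts TRUE/FALSE in any case
            path = f[2];
            secure = Boolean.parseBoolean(f[3]);
            expiresEpochSeconds = Long.parseLong(f[4]);     // Unix time in seconds
            name = f[5];
            value = f[6];
        }
    }

    /** Parses the given lines, skipping blanks, '#' comments and malformed entries. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\t", -1); // -1 keeps an empty trailing value
            if (fields.length != 7) {
                continue;
            }
            try {
                entries.add(new Entry(fields));
            } catch (NumberFormatException e) {
                // Unparseable expiry: skip the line rather than abort the whole load.
            }
        }
        return entries;
    }
}
```

Skipping malformed lines instead of failing is a design choice made for this sketch; a stricter loader could report them.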
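public void saveCookies()
Saves cookies to the file specified in the order file.
public void saveCookies(java.lang.String saveCookiesFile)
Saves cookies to a file.
Parameters:
saveCookiesFile - output file.
protected void listUsedFiles(java.util.List<java.lang.String> list)
Description copied from class: ModuleType
Those modules that use files on disk should list them all when this method is called.
Each file (as a string name with full path) should be added to the provided list.
Modules that do not use any files can safely ignore this method.
Overrides:
listUsedFiles in class ModuleType
Parameters:
list - The list to add files to.
protected org.apache.commons.httpclient.HttpClient getHttp()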
public void crawlStarted(java.lang.String message)
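Description copied from interface: CrawlStatusListener
Called on crawl start.
Specified by:
crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.
public void crawlCheckpoint(java.io.File checkpointDir)
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.
Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
public void crawlEnding(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason).
Specified by:
crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob
public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.
Specified by:
crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob
public void crawlPausing(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.
Specified by:
crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience.
public void crawlPaused(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).
Specified by:
crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience.
public void crawlResuming(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.
Specified by:
crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience.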