java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.processor.CrawlMapper
public abstract class CrawlMapper
A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). May operate on a CrawlURI (typically early in the processing chain) or its CandidateURI outlinks (late in the processing chain, after LinksScoper), or both (if inserted and configured in both places).
Applies a map() method, supplied by a concrete subclass, to classKeys to map URIs to crawlers by name.
One crawler name is distinguished as the 'local name'; URIs mapped to this name are not diverted, but continue to be processed normally.
When using the JMX importUris operation to import URIs diverted by a CrawlMapper instance, use the recoveryLog style.
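The flow described above can be pictured with a minimal, self-contained sketch. It does not use the Heritrix classes: `MapperFlowSketch`, its `process` method, and the in-memory collections are hypothetical stand-ins, and a plain `Function<String, String>` plays the role of the subclass-supplied map() method. URIs whose classKey maps to the local name are kept; all others are recorded under their target crawler name, where the real class would write to a diversion log.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the divert-or-keep decision; not the Heritrix implementation. */
public class MapperFlowSketch {
    private final String localName;
    // Stands in for the abstract map() method: classKey -> crawler name.
    private final Function<String, String> mapFunction;
    private final List<String> keptLocally = new ArrayList<>();
    // Stands in for the diversion logs: target crawler name -> diverted URIs.
    private final Map<String, List<String>> diverted = new LinkedHashMap<>();

    public MapperFlowSketch(String localName, Function<String, String> mapFunction) {
        this.localName = localName;
        this.mapFunction = mapFunction;
    }

    /** Returns true if the URI stays with the local crawler, false if it was diverted. */
    public boolean process(String uri, String classKey) {
        String target = mapFunction.apply(classKey);
        if (localName.equals(target)) {
            keptLocally.add(uri);
            return true;
        }
        diverted.computeIfAbsent(target, t -> new ArrayList<>()).add(uri);
        return false;
    }

    public List<String> getKeptLocally() {
        return keptLocally;
    }

    public Map<String, List<String>> getDiverted() {
        return diverted;
    }

    public static void main(String[] args) {
        // Assumed toy rule: keys ending in ".org" belong to the local crawler.
        MapperFlowSketch sketch = new MapperFlowSketch(
                "crawler-a",
                key -> key.endsWith(".org") ? "crawler-a" : "crawler-b");
        sketch.process("http://archive.org/", "archive.org");
        sketch.process("http://example.com/", "example.com");
        System.out.println("kept locally: " + sketch.getKeptLocally());
        System.out.println("diverted: " + sketch.getDiverted());
    }
}
```

Passing the mapping rule in as a function mirrors the split on this page, where CrawlMapper handles diversion and a concrete subclass only decides the target name.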
| Nested Class Summary |
|---|
| Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
| ComplexType.MBeanAttributeInfoIterator |

| Field Summary | |
|---|---|
| static java.lang.String | ATTR_CHECK_OUTLINKS: whether to map CrawlURI's outlinks (if CandidateURIs) |
| static java.lang.String | ATTR_CHECK_URI: whether to map CrawlURI itself (if status nonpositive) |
| static java.lang.String | ATTR_DIVERSION_DIR: where to log diversions |
| static java.lang.String | ATTR_LOCAL_NAME: name of local crawler (URIs mapped to here are not diverted) |
| static java.lang.String | ATTR_MAP_OUTLINK_DECIDE_RULES: decide rules to determine if an outlink is subject to mapping |
| static java.lang.String | ATTR_ROTATION_DIGITS: rotate logs when change occurs within this # of digits of timestamp |
| protected ArrayLongFPCache | cache |
| static java.lang.Boolean | DEFAULT_CHECK_OUTLINKS |
| static java.lang.Boolean | DEFAULT_CHECK_URI |
| static java.lang.String | DEFAULT_DIVERSION_DIR |
| static java.lang.String | DEFAULT_LOCAL_NAME |
| static java.lang.Integer | DEFAULT_ROTATION_DIGITS |
| (package private) java.util.HashMap<java.lang.String,java.io.PrintWriter> | diversionLogs: Mapping of target crawlers to logs (PrintWriters) |
| protected java.lang.String | localName: name of the enclosing crawler (URIs mapped here stay put) |
| (package private) java.lang.String | logGeneration: Truncated timestamp prefix for diversion logs; when current time doesn't match, it's time to close all current logs. |

| Fields inherited from class org.archive.crawler.framework.Processor |
|---|
| ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules |

| Fields inherited from class org.archive.crawler.settings.ComplexType |
|---|
| definition, definitionMap |

| Constructor Summary |
|---|
| CrawlMapper(java.lang.String name, java.lang.String description): Constructor. |

| Method Summary | |
|---|---|
| protected boolean | decideToMapOutlink(CandidateURI cauri) |
| protected void | divertLog(CandidateURI cauri, java.lang.String target): Note the given CandidateURI in the appropriate diversion log. |
| protected java.io.PrintWriter | getDiversionLog(java.lang.String target): Get the diversion log for a given target crawler node. |
| protected DecideRule | getMapOutlinkDecideRule(java.lang.Object o) |
| protected void | initialTasks(): Classes subclassing this one should override this method to perform processor-specific actions. |
| protected void | innerProcess(CrawlURI curi): Classes subclassing this one should override this method to perform their custom actions on the CrawlURI. |
| protected abstract java.lang.String | map(CandidateURI cauri): Look up the crawler node name to which the given CandidateURI should be mapped. |
| protected void | updateGeneration(java.lang.String nowGeneration): Close and mark as finished all existing diversion logs, and arrange for new logs to use the new generation prefix. |

| Methods inherited from class org.archive.crawler.framework.Processor |
|---|
| checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |

| Methods inherited from class org.archive.crawler.settings.ModuleType |
|---|
| addElement, listUsedFiles |

| Methods inherited from class org.archive.crawler.settings.Type |
|---|
| addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |

| Methods inherited from class javax.management.Attribute |
|---|
| getName |

| Methods inherited from class java.lang.Object |
|---|
| clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |

| Field Detail |
|---|
public static final java.lang.String ATTR_CHECK_URI
public static final java.lang.Boolean DEFAULT_CHECK_URI
public static final java.lang.String ATTR_CHECK_OUTLINKS
public static final java.lang.Boolean DEFAULT_CHECK_OUTLINKS
public static final java.lang.String ATTR_MAP_OUTLINK_DECIDE_RULES
public static final java.lang.String ATTR_LOCAL_NAME
public static final java.lang.String DEFAULT_LOCAL_NAME
public static final java.lang.String ATTR_DIVERSION_DIR
public static final java.lang.String DEFAULT_DIVERSION_DIR
public static final java.lang.String ATTR_ROTATION_DIGITS
public static final java.lang.Integer DEFAULT_ROTATION_DIGITS
java.util.HashMap<java.lang.String,java.io.PrintWriter> diversionLogs
java.lang.String logGeneration
protected java.lang.String localName
protected ArrayLongFPCache cache
| Constructor Detail |
|---|
public CrawlMapper(java.lang.String name,
java.lang.String description)
Parameters:
name - Name of this processor.

| Method Detail |
|---|
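protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.

protected boolean decideToMapOutlink(CandidateURI cauri)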
protected DecideRule getMapOutlinkDecideRule(java.lang.Object o)
protected void updateGeneration(java.lang.String nowGeneration)
Parameters:
nowGeneration - new generation (timestamp prefix) to use

protected abstract java.lang.String map(CandidateURI cauri)
Parameters:
cauri - CandidateURI to consider
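
As an illustration of what a concrete map() might compute, the following sketch assigns a classKey to one of a fixed list of crawler names by hashing. This is an assumed strategy for illustration only, not necessarily how any shipped subclass works. `HashBucketMapper` and its String-based signature are hypothetical; the real method takes a CandidateURI.

```java
import java.util.List;

/** Hypothetical hash-based mapping from a classKey to a crawler name. */
public class HashBucketMapper {
    private final List<String> crawlerNames;

    public HashBucketMapper(List<String> crawlerNames) {
        if (crawlerNames.isEmpty()) {
            throw new IllegalArgumentException("at least one crawler name is required");
        }
        this.crawlerNames = List.copyOf(crawlerNames);
    }

    /** Maps a classKey to a crawler name; the same key always yields the same name. */
    public String map(String classKey) {
        // floorMod keeps the bucket index non-negative even for negative hash codes.
        int bucket = Math.floorMod(classKey.hashCode(), crawlerNames.size());
        return crawlerNames.get(bucket);
    }

    public static void main(String[] args) {
        HashBucketMapper mapper =
                new HashBucketMapper(List.of("crawler-a", "crawler-b", "crawler-c"));
        for (String key : List.of("archive.org", "example.com", "example.net")) {
            System.out.println(key + " -> " + mapper.map(key));
        }
    }
}
```

Because `String.hashCode()` is specified by the Java API, every crawler running this sketch with the same name list would agree on the target for a given key, which is the property a shared mapping needs.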
protected void divertLog(CandidateURI cauri,
java.lang.String target)
Parameters:
cauri - CandidateURI to append to a diversion log
target - String node name (log name) to receive URI

protected java.io.PrintWriter getDiversionLog(java.lang.String target)
Parameters:
target - crawler node name of requested log
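
The interplay of divertLog, getDiversionLog, updateGeneration, and the rotation-digits setting can be sketched as follows. This is a guess at the mechanics based only on the field and method descriptions on this page: the generation is assumed to be the first N digits of a 14-digit timestamp, logs are opened lazily per target, and all open logs are closed when the generation changes. `DiversionLogSketch`, the `generation-target` log naming, and the in-memory buffers (used here in place of files under the diversion directory) are assumptions.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of generation-based diversion log rotation. */
public class DiversionLogSketch {
    private final int rotationDigits;
    // Open logs for the current generation only: target name -> writer.
    private final Map<String, PrintWriter> diversionLogs = new HashMap<>();
    // Every log ever opened, by assumed log name; stands in for files on disk.
    private final Map<String, StringWriter> buffers = new LinkedHashMap<>();
    private String logGeneration = "";

    public DiversionLogSketch(int rotationDigits) {
        if (rotationDigits < 1 || rotationDigits > 14) {
            throw new IllegalArgumentException("rotationDigits must be between 1 and 14");
        }
        this.rotationDigits = rotationDigits;
    }

    /** Appends the URI to the target's log, rotating first if the generation changed. */
    public void divertLog(String timestamp14, String uri, String target) {
        if (timestamp14.length() < rotationDigits) {
            throw new IllegalArgumentException("timestamp is shorter than rotationDigits");
        }
        String nowGeneration = timestamp14.substring(0, rotationDigits);
        if (!nowGeneration.equals(logGeneration)) {
            updateGeneration(nowGeneration);
        }
        getDiversionLog(target).println(uri);
    }

    /** Closes all open logs so that new ones use the new generation prefix. */
    void updateGeneration(String nowGeneration) {
        for (PrintWriter log : diversionLogs.values()) {
            log.close();
        }
        diversionLogs.clear();
        logGeneration = nowGeneration;
    }

    /** Returns the open log for the target, creating it on first use. */
    PrintWriter getDiversionLog(String target) {
        return diversionLogs.computeIfAbsent(target, t -> {
            StringWriter buffer = new StringWriter();
            buffers.put(logGeneration + "-" + t, buffer);
            return new PrintWriter(buffer, true);
        });
    }

    /** Snapshot of all logs written so far: log name -> contents. */
    public Map<String, String> contents() {
        Map<String, String> snapshot = new LinkedHashMap<>();
        buffers.forEach((name, buffer) -> snapshot.put(name, buffer.toString()));
        return snapshot;
    }

    public static void main(String[] args) {
        DiversionLogSketch logs = new DiversionLogSketch(10); // 10 digits: yyyyMMddHH
        logs.divertLog("20240101120000", "http://example.com/a", "crawler-b");
        logs.divertLog("20240101130000", "http://example.com/b", "crawler-b");
        System.out.println(logs.contents().keySet());
    }
}
```

Under these assumptions, fewer rotation digits mean longer-lived logs: ten digits of a `yyyyMMddHHmmss` timestamp change hourly, while eight change daily.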

protected void initialTasks()
Description copied from class: Processor
This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.
Overrides:
initialTasks in class Processor