java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Filter
org.archive.crawler.framework.CrawlScope
public class CrawlScope
A CrawlScope instance defines which URIs are "in" a particular crawl. It is essentially a Filter which determines, looking at the totality of information available about a CandidateURI/CrawlURI instance, whether that URI should be scheduled for crawling. Dynamic information inherent in the discovery of the URI -- such as the path by which it was discovered -- may be considered. Dynamic information which requires the consultation of external and potentially volatile information -- such as current robots.txt requests and the history of attempts to crawl the same URI -- should NOT be considered. Those potentially high-latency decisions should be made at another step.
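As a rough illustration of the kind of decision described above, the following self-contained sketch uses hypothetical stand-in types (`Candidate`, `SimpleScope`) rather than the real Heritrix classes. It assumes a scope rule based on an allowed-host set and a hop limit, both invented for this example. The point is only that the decision uses data carried with the candidate (its URI and how it was discovered) and consults no external state such as robots.txt or crawl history.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a candidate URI plus the length of its discovery path. */
final class Candidate {
    final URI uri;
    final int hopsFromSeed;

    Candidate(URI uri, int hopsFromSeed) {
        this.uri = uri;
        this.hopsFromSeed = hopsFromSeed;
    }
}

/** Hypothetical scope: in scope if on an allowed host and within a hop limit. */
final class SimpleScope {
    private final Set<String> allowedHosts;
    private final int maxHops;

    SimpleScope(Set<String> allowedHosts, int maxHops) {
        this.allowedHosts = allowedHosts;
        this.maxHops = maxHops;
    }

    /** Uses only data carried by the candidate; no network or history lookups. */
    boolean accepts(Candidate c) {
        String host = c.uri.getHost();
        return host != null
                && allowedHosts.contains(host.toLowerCase(Locale.ROOT))
                && c.hopsFromSeed <= maxHops;
    }
}

final class SimpleScopeDemo {
    public static void main(String[] args) {
        SimpleScope scope = new SimpleScope(Set.of("example.org"), 2);
        // Allowed host, one hop from a seed: prints true
        System.out.println(scope.accepts(new Candidate(URI.create("http://example.org/a"), 1)));
        // Host not in the allowed set: prints false
        System.out.println(scope.accepts(new Candidate(URI.create("http://other.example/a"), 1)));
        // Allowed host but beyond the hop limit: prints false
        System.out.println(scope.accepts(new Candidate(URI.create("http://example.org/deep"), 3)));
    }
}
```

Checks that need volatile external data would, following the description above, belong in a later processing step rather than in `accepts`.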
| Nested Class Summary |
|---|
| Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
|---|
ComplexType.MBeanAttributeInfoIterator |
| Field Summary | |
|---|---|
| static java.lang.String | ATTR_NAME |
| static java.lang.String | ATTR_REREAD_SEEDS_ON_CONFIG: Whether every configuration change should trigger a rereading of the original seeds spec/file. |
| static java.lang.String | ATTR_SEEDS |
| static java.lang.Boolean | DEFAULT_REREAD_SEEDS_ON_CONFIG |
| protected java.util.Set<SeedListener> | seedListeners |
| Fields inherited from class org.archive.crawler.framework.Filter |
|---|
ATTR_ENABLED |
| Fields inherited from class org.archive.crawler.settings.ComplexType |
|---|
definition, definitionMap |
| Constructor Summary | |
|---|---|
| CrawlScope() | Default constructor. |
| CrawlScope(java.lang.String name) | Constructs a new CrawlScope. |
| Method Summary | |
|---|---|
| boolean | addSeed(CandidateURI curi): Add a new seed to scope. |
| void | addSeedListener(SeedListener sl) |
| protected void | checkClose(java.util.Iterator iter): Convenience method to close SeedFileIterator, if appropriate. |
| java.io.File | getSeedfile() |
| void | initialize(CrawlController controller): Initialize is called just before the crawler starts to run. |
| protected boolean | isSameHost(UURI a, UURI b) |
| protected boolean | isSeed(java.lang.Object o): Check if a URI is in the seeds. |
| void | kickUpdate(): Take note of a situation (such as settings edit) where involved reconfiguration (such as reading from external files) may be necessary. |
| void | listUsedFiles(java.util.List<java.lang.String> list): Those Modules that use files on disk should list them all when this method is called. |
| void | refreshSeeds(): Refresh seeds. |
| java.util.Iterator<UURI> | seedsIterator(): Gets an iterator over all configured seeds. |
| java.util.Iterator<UURI> | seedsIterator(java.io.Writer ignoredItemWriter): Gets an iterator over all configured seeds. |
| java.lang.String | toString() |
| Methods inherited from class org.archive.crawler.framework.Filter |
|---|
accepts, getFilterOffPosition, innerAccepts, returnTrueIfMatches |
| Methods inherited from class org.archive.crawler.settings.ModuleType |
|---|
addElement |
| Methods inherited from class org.archive.crawler.settings.Type |
|---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
| Methods inherited from class javax.management.Attribute |
|---|
getName |
| Methods inherited from class java.lang.Object |
|---|
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
public static final java.lang.String ATTR_NAME
public static final java.lang.String ATTR_SEEDS
public static final java.lang.String ATTR_REREAD_SEEDS_ON_CONFIG
public static final java.lang.Boolean DEFAULT_REREAD_SEEDS_ON_CONFIG
protected java.util.Set<SeedListener> seedListeners
| Constructor Detail |
|---|
public CrawlScope(java.lang.String name)
name - the name is ignored since it always has to be the value of
the constant ATTR_NAME.
public CrawlScope()
| Method Detail |
|---|
public void initialize(CrawlController controller)
ComplexType.earlyInitialize(CrawlerSettings).
controller - Controller object.
public java.lang.String toString()
toString in class Filter
public void refreshSeeds()
public java.io.File getSeedfile()
protected boolean isSeed(java.lang.Object o)
o - the URI to check.
protected boolean isSameHost(UURI a,
UURI b)
a - First UURI to compare.
b - Second UURI to compare.
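The page gives no description for isSameHost, so the following is only a sketch of what a null-safe host comparison could look like. It uses `java.net.URI` as a stand-in for `UURI` and assumes a case-insensitive match on the host component. The class name `HostCompare` is hypothetical, and none of this is taken from the CrawlScope implementation.

```java
import java.net.URI;

final class HostCompare {
    /** Null-safe host comparison; java.net.URI stands in for UURI here. */
    static boolean isSameHost(URI a, URI b) {
        if (a == null || b == null) {
            return false;
        }
        String hostA = a.getHost();
        String hostB = b.getHost();
        // getHost() returns null for URIs without a host, such as mailto: URIs.
        return hostA != null && hostB != null && hostA.equalsIgnoreCase(hostB);
    }

    public static void main(String[] args) {
        URI a = URI.create("http://example.org/index.html");
        URI b = URI.create("http://EXAMPLE.org/about");
        URI c = URI.create("mailto:someone@example.org");
        System.out.println(isSameHost(a, b)); // prints true (hosts differ only in case)
        System.out.println(isSameHost(a, c)); // prints false (c has no host component)
    }
}
```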
public void listUsedFiles(java.util.List<java.lang.String> list)
Description copied from class: ModuleType
Each file (as a string name with full path) should be added to the provided list.
Modules that do not use any files can safely ignore this method.
listUsedFiles in class ModuleType
list - The list to add files to.
public void kickUpdate()
kickUpdate in class Filter
public java.util.Iterator<UURI> seedsIterator()
public java.util.Iterator<UURI> seedsIterator(java.io.Writer ignoredItemWriter)
ignoredItemWriter - optional writer to get ignored seed items report
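To make the role of the optional writer concrete, here is a minimal sketch of a seed iterator that reports skipped items. Everything in it is an assumption made for illustration: the class name `SeedLines`, the in-memory list of lines, the rule that blank lines and `#` comments are skipped silently, and the use of `java.net.URI` in place of `UURI`. It also builds the seed list eagerly, which a file-backed iterator might not do.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SeedLines {
    /**
     * Returns an iterator over the lines that parse as absolute URIs with a host.
     * Other non-comment lines are written to ignoredItemWriter when it is non-null.
     */
    static Iterator<URI> seedsIterator(List<String> lines, Writer ignoredItemWriter) {
        List<URI> seeds = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // assumption: blanks and comments are skipped silently
            }
            URI parsed = parse(line);
            if (parsed != null) {
                seeds.add(parsed);
            } else if (ignoredItemWriter != null) {
                try {
                    ignoredItemWriter.write(line + "\n");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return seeds.iterator();
    }

    /** Returns the parsed URI, or null if the line is not a usable seed. */
    private static URI parse(String line) {
        try {
            URI u = new URI(line);
            return (u.isAbsolute() && u.getHost() != null) ? u : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        StringWriter ignored = new StringWriter();
        Iterator<URI> it = seedsIterator(
                List.of("http://example.org/", "# comment", "not a uri", "http://example.net/x"),
                ignored);
        while (it.hasNext()) {
            System.out.println("seed: " + it.next());
        }
        System.out.print("ignored: " + ignored);
    }
}
```

Passing `null` for the writer simply drops the report, which mirrors the "optional" wording of the parameter description.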
protected void checkClose(java.util.Iterator iter)
iter - Iterator to check; if it is a SeedFileIterator that needs closing, it is closed.
public boolean addSeed(CandidateURI curi)
This method is *not* sufficient to get the new seed scheduled in the Frontier for crawling -- it only affects the Scope's seed record (and decisions which flow from seeds).
curi - CandidateURI to add
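The distinction drawn for addSeed, between updating the scope's seed record and getting a URI scheduled, can be sketched with two toy classes. `ToyScope` and `ToyFrontier` are hypothetical stand-ins invented for this example, with plain strings in place of CandidateURI. The meaning given to the boolean return value here (true if the seed was newly recorded) is likewise an assumption, not a statement about the real method.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical scope that only keeps a record of seeds. */
final class ToyScope {
    private final Set<String> seeds = new LinkedHashSet<>();

    /** Records the seed; returns false if it was already present (assumed semantics). */
    boolean addSeed(String uri) {
        return seeds.add(uri);
    }

    boolean isSeed(String uri) {
        return seeds.contains(uri);
    }
}

/** Hypothetical frontier holding URIs that are actually queued for fetching. */
final class ToyFrontier {
    private final Queue<String> pending = new ArrayDeque<>();

    void schedule(String uri) {
        pending.add(uri);
    }

    int pendingCount() {
        return pending.size();
    }
}

final class AddSeedDemo {
    public static void main(String[] args) {
        ToyScope scope = new ToyScope();
        ToyFrontier frontier = new ToyFrontier();
        String seed = "http://example.org/";

        // Step 1: the scope learns about the seed, but nothing is queued yet.
        scope.addSeed(seed);
        System.out.println(scope.isSeed(seed));      // prints true
        System.out.println(frontier.pendingCount()); // prints 0

        // Step 2: scheduling is a separate, explicit action on the frontier.
        frontier.schedule(seed);
        System.out.println(frontier.pendingCount()); // prints 1
    }
}
```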
public void addSeedListener(SeedListener sl)