java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.frontier.AbstractFrontier
            org.archive.crawler.frontier.WorkQueueFrontier
              org.archive.crawler.frontier.BdbFrontier
public class BdbFrontier
extends WorkQueueFrontier
A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.
| Nested Class Summary |
|---|
| Nested classes/interfaces inherited from class org.archive.crawler.frontier.WorkQueueFrontier |
|---|
| WorkQueueFrontier.WakeTask |

| Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
|---|
| ComplexType.MBeanAttributeInfoIterator |

| Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier |
|---|
| Frontier.FrontierGroup |
| Field Summary | | |
|---|---|---|
| static java.lang.String | ATTR_DUMP_PENDING_AT_CLOSE | Name of the setting that controls whether still-pending URIs are dumped to the log when the frontier closes |
| static java.lang.String | ATTR_INCLUDED | Name of the setting that selects the URI-already-included implementation to use (by class name) |
| protected BdbMultipleWorkQueues | pendingUris | All URIs scheduled to be crawled |
| Fields inherited from class org.archive.crawler.settings.ComplexType |
|---|
| definition, definitionMap |

| Fields inherited from interface org.archive.crawler.framework.Frontier |
|---|
| ATTR_NAME |
| Constructor Summary | |
|---|---|
| BdbFrontier(java.lang.String name) | Constructor. |
| BdbFrontier(java.lang.String name, java.lang.String description) | Create the BdbFrontier. |
| Method Summary | | |
|---|---|---|
| protected void | closeQueue() | |
| void | crawlCheckpoint(java.io.File checkpointDir) | Called by CrawlController when checkpointing. |
| void | crawlEnded(java.lang.String sExitMessage) | Called when a CrawlController has ended a crawl and is about to exit. |
| protected UriUniqFilter | createAlreadyIncluded() | Create a UriUniqFilter that will serve as record of already seen URIs. |
| protected UriUniqFilter | deserializeAlreadySeen(java.lang.Class<? extends UriUniqFilter> cls, java.io.File dir) | |
| void | dumpAllPendingToLog() | Dump all still-enqueued URIs to the crawl.log -- without actually dequeuing. |
| FrontierMarker | getInitialMarker(java.lang.String regexpr, boolean inCacheOnly) | Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier. |
| protected WorkQueue | getQueueFor(CrawlURI curi) | Return the work queue for the given CrawlURI's classKey. |
| protected WorkQueue | getQueueFor(java.lang.String classKey) | Return the work queue for the given classKey, or null if no such queue exists. |
| java.util.ArrayList<java.lang.String> | getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) | Return list of URLs. |
| protected BdbMultipleWorkQueues | getWorkQueues() | |
| void | initialize(CrawlController c) | Initializes the Frontier, given the supplied CrawlController. |
| protected void | initQueue() | |
| protected void | initQueuesOfQueues() | Set up the various queues-of-queues used by the frontier. |
| protected java.util.Queue<java.lang.String> | reinit(java.util.Queue<java.lang.String> q, java.lang.String name) | |
| protected boolean | workQueueDataOnDisk() | Returns true if the WorkQueue implementation of this Frontier stores its workload on disk instead of relying on serialization mechanisms. |
| Methods inherited from class org.archive.crawler.frontier.WorkQueueFrontier |
|---|
| appendQueueReports, asCrawlUri, averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, deleteURIs, discoveredUriCount, finished, forget, getGroup, getReports, isEmpty, kickUpdate, next, receive, reportTo, schedule, sendToQueue, singleLineLegend, singleLineReportTo, wakeQueues |

| Methods inherited from class org.archive.crawler.settings.ModuleType |
|---|
| addElement, listUsedFiles |

| Methods inherited from class org.archive.crawler.settings.Type |
|---|
| addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |

| Methods inherited from class javax.management.Attribute |
|---|
| getName |

| Methods inherited from class java.lang.Object |
|---|
| clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
protected transient BdbMultipleWorkQueues pendingUris
public static final java.lang.String ATTR_INCLUDED
public static final java.lang.String ATTR_DUMP_PENDING_AT_CLOSE
| Constructor Detail |
|---|
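public BdbFrontier(java.lang.String name)
Constructor.
Parameters:
name - Name of this Frontier.

public BdbFrontier(java.lang.String name,
java.lang.String description)
Create the BdbFrontier.
Parameters:
name - Name of this Frontier.
description - Description of this Frontier.

| Method Detail |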
|---|
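protected void initQueuesOfQueues()
Description copied from class: WorkQueueFrontier
Set up the various queues-of-queues used by the frontier.
Overrides:
initQueuesOfQueues in class WorkQueueFrontier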
protected java.util.Queue<java.lang.String> reinit(java.util.Queue<java.lang.String> q,
java.lang.String name)
protected UriUniqFilter createAlreadyIncluded()
throws java.io.IOException
Create a UriUniqFilter that will serve as record of already seen URIs.
Overrides:
createAlreadyIncluded in class WorkQueueFrontier
Throws:
java.io.IOException
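
The already-included structure is chosen by class name (see ATTR_INCLUDED). As a rough illustration of that idea only, the sketch below instantiates an already-seen filter from a class name via reflection. `SeenFilter`, `HashSetSeenFilter`, and `create` are hypothetical stand-ins and are not part of Heritrix; the real UriUniqFilter interface and the real selection logic may differ.

```java
import java.util.HashSet;
import java.util.Set;

class IncludedByClassNameSketch {
    /** Hypothetical stand-in for an already-seen filter; not the Heritrix interface. */
    interface SeenFilter {
        /** Returns true only the first time a given URI is offered. */
        boolean addIfNew(String uri);
    }

    /** Hypothetical in-memory implementation backed by a HashSet. */
    static class HashSetSeenFilter implements SeenFilter {
        private final Set<String> seen = new HashSet<>();

        @Override
        public boolean addIfNew(String uri) {
            return seen.add(uri);
        }
    }

    /** Instantiates the filter named by className through its no-argument constructor. */
    static SeenFilter create(String className) {
        try {
            Class<? extends SeenFilter> cls =
                    Class.forName(className).asSubclass(SeenFilter.class);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create filter: " + className, e);
        }
    }

    public static void main(String[] args) {
        SeenFilter filter = create(HashSetSeenFilter.class.getName());
        System.out.println(filter.addIfNew("http://example.org/")); // first offer
        System.out.println(filter.addIfNew("http://example.org/")); // repeat offer
    }
}
```

Selecting the implementation by name keeps the choice in configuration rather than in code, at the cost of failing only at run time when the name is wrong.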

protected UriUniqFilter deserializeAlreadySeen(java.lang.Class<? extends UriUniqFilter> cls,
java.io.File dir)
throws java.io.FileNotFoundException,
java.io.IOException
Throws:
java.io.FileNotFoundException
java.io.IOException

protected WorkQueue getQueueFor(CrawlURI curi)
Return the work queue for the given CrawlURI's classKey.
Overrides:
getQueueFor in class WorkQueueFrontier
Parameters:
curi - CrawlURI to base queue on
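
protected WorkQueue getQueueFor(java.lang.String classKey)
Return the work queue for the given classKey, or null if no such queue exists.
Overrides:
getQueueFor in class WorkQueueFrontier
Parameters:
classKey - key to look for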
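
The lookup-by-key contract (return the queue, or null if none exists) can be pictured with a plain map from class key to queue. This is a minimal sketch under stated assumptions: `QueueLookupSketch` and `getOrCreateQueueFor` are hypothetical names, a `Deque<String>` stands in for WorkQueue, and the create-on-demand variant is an assumption about how a scheduling path might behave, not something this page documents.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class QueueLookupSketch {
    // One queue of pending URIs per class key (hypothetical stand-in for WorkQueue).
    private final Map<String, Deque<String>> queues = new HashMap<>();

    /** Pure lookup: returns null when no queue exists for the key. */
    public Deque<String> getQueueFor(String classKey) {
        return queues.get(classKey);
    }

    /** Assumed variant: creates the queue on first use so a URI can be scheduled. */
    public Deque<String> getOrCreateQueueFor(String classKey) {
        return queues.computeIfAbsent(classKey, k -> new ArrayDeque<>());
    }

    public static void main(String[] args) {
        QueueLookupSketch frontier = new QueueLookupSketch();
        System.out.println(frontier.getQueueFor("example.org") == null);
        frontier.getOrCreateQueueFor("example.org").add("http://example.org/");
        System.out.println(frontier.getQueueFor("example.org").size());
    }
}
```

Callers of the lookup-only form need a null check before touching the queue.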

public FrontierMarker getInitialMarker(java.lang.String regexpr,
boolean inCacheOnly)
Description copied from interface: Frontier
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
Specified by:
getInitialMarker in interface Frontier
Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.

public java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker,
int numberOfMatches,
boolean verbose)
Return list of URLs.
Specified by:
getURIsList in interface Frontier
Parameters:
marker -
numberOfMatches -
verbose -
See Also:
FrontierMarker, Frontier.getInitialMarker(String, boolean)
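
The marker pattern described above (a regular expression plus a position that starts at the beginning of the frontier, consumed in batches of at most numberOfMatches) can be sketched without any crawler classes. Everything here is a hypothetical stand-in: `Marker` is not FrontierMarker, the pending URIs are a plain in-memory list, and the `inCacheOnly` and `verbose` flags are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class MarkerPagingSketch {
    /** Hypothetical marker: a compiled pattern plus the position to resume from. */
    static final class Marker {
        final Pattern pattern;
        int nextIndex = 0; // starts at the 'start' of the pending list

        Marker(String regexpr) {
            this.pattern = Pattern.compile(regexpr);
        }
    }

    static Marker getInitialMarker(String regexpr) {
        return new Marker(regexpr);
    }

    /** Returns up to numberOfMatches matching URIs and advances the marker. */
    static ArrayList<String> getURIsList(List<String> pending, Marker marker,
                                         int numberOfMatches) {
        ArrayList<String> matches = new ArrayList<>();
        while (marker.nextIndex < pending.size() && matches.size() < numberOfMatches) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                matches.add(uri);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> pending = Arrays.asList(
                "http://a.example/1", "http://b.example/1",
                "http://a.example/2", "http://a.example/3");
        Marker marker = getInitialMarker("http://a\\.example/.*");
        System.out.println(getURIsList(pending, marker, 2)); // first batch
        System.out.println(getURIsList(pending, marker, 2)); // resumes after the first
    }
}
```

Because the marker carries its own position, repeated calls page through the frontier without rescanning earlier entries.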

protected void initQueue()
throws java.io.IOException
Overrides:
initQueue in class WorkQueueFrontier
Throws:
java.io.IOException

protected void closeQueue()
Overrides:
closeQueue in class WorkQueueFrontier

protected BdbMultipleWorkQueues getWorkQueues()

protected boolean workQueueDataOnDisk()
Description copied from class: WorkQueueFrontier
Returns true if the WorkQueue implementation of this Frontier stores its workload on disk instead of relying on serialization mechanisms.
TODO: rename! (this is a very misleading name) or kill (don't see any implementations that return false)
Overrides:
workQueueDataOnDisk in class WorkQueueFrontier

public void initialize(CrawlController c)
throws FatalConfigurationException,
java.io.IOException
Description copied from class: WorkQueueFrontier
Initializes the Frontier, given the supplied CrawlController.
Specified by:
initialize in interface Frontier
Overrides:
initialize in class WorkQueueFrontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
See Also:
Frontier.initialize(org.archive.crawler.framework.CrawlController)

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.
Specified by:
crawlEnded in interface CrawlStatusListener
Overrides:
crawlEnded in class WorkQueueFrontier
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

public void crawlCheckpoint(java.io.File checkpointDir)
throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.
Specified by:
crawlCheckpoint in interface CrawlStatusListener
Overrides:
crawlCheckpoint in class AbstractFrontier
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.
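
The exception contract above means a listener that throws stops the checkpoint for every listener after it. The sketch below shows one way a caller could honor that contract. It is an assumption for illustration, not CrawlController's actual code: `Listener` and `checkpointAll` are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CheckpointSketch {
    /** Hypothetical stand-in for the checkpoint callback of CrawlStatusListener. */
    interface Listener {
        void crawlCheckpoint(File checkpointDir) throws Exception;
    }

    /**
     * Calls each listener in order and returns how many completed.
     * The first exception is treated as fatal, so later listeners are skipped.
     */
    static int checkpointAll(List<Listener> listeners, File checkpointDir) {
        int completed = 0;
        try {
            for (Listener listener : listeners) {
                listener.crawlCheckpoint(checkpointDir);
                completed++;
            }
        } catch (Exception fatal) {
            System.err.println("Checkpoint aborted: " + fatal.getMessage());
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Listener> listeners = new ArrayList<>();
        listeners.add(dir -> { /* writes its state successfully */ });
        listeners.add(dir -> { throw new IOException("disk full"); });
        listeners.add(dir -> { /* never reached */ });
        System.out.println(checkpointAll(listeners, new File("checkpoint-dir")));
    }
}
```

An implementation of crawlCheckpoint should therefore let an exception escape only when the checkpoint genuinely cannot be completed.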

public void dumpAllPendingToLog()
throws com.sleepycat.je.DatabaseException
Dump all still-enqueued URIs to the crawl.log -- without actually dequeuing.
Throws:
com.sleepycat.je.DatabaseException
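
"Without actually dequeuing" means the dump reads every pending entry but leaves the queues unchanged. The sketch below shows that distinction with in-memory queues: it iterates instead of polling. The names `DumpPendingSketch` and `dumpAllPending` are hypothetical, the output is a list of strings rather than the crawl.log, and the line format is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DumpPendingSketch {
    /** Builds one line per pending URI by iterating; nothing is removed from the queues. */
    static List<String> dumpAllPending(Map<String, Deque<String>> queues) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            for (String uri : entry.getValue()) { // iteration, not poll()
                lines.add(entry.getKey() + " " + uri);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Deque<String>> queues = new LinkedHashMap<>();
        queues.put("a.example", new ArrayDeque<>(
                Arrays.asList("http://a.example/1", "http://a.example/2")));
        queues.put("b.example", new ArrayDeque<>(
                Arrays.asList("http://b.example/1")));

        for (String line : dumpAllPending(queues)) {
            System.out.println(line);
        }
        // The queues still hold every URI after the dump.
        System.out.println(queues.get("a.example").size());
    }
}
```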