java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.AbstractTracker
            org.archive.crawler.admin.StatisticsTracker
public class StatisticsTracker
This is an implementation of the AbstractTracker. It is designed to work with the WUI as well as to perform various logging activities.
At the end of each snapshot a line is written to the 'progress-statistics.log' file.
The header of that file is as follows:
[timestamp] [discovered] [queued] [downloaded] [doc/s(avg)] [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]

First there is a timestamp, accurate down to 1 second.
discovered, queued, downloaded and dl-failures are (respectively) the discovered URI count, pending URI count, successfully fetched count and failed fetch count from the frontier at the time of the snapshot.
KB/s(avg) is the bandwidth usage. We use the total bytes downloaded to calculate average bandwidth usage (KB/sec). Since we also note the value each time a snapshot is made, we can calculate the average bandwidth usage during the last snapshot period to gain a "current" rate. The first number is the current rate and the average is in parentheses.
doc/s(avg) works the same way as KB/s(avg), except it shows the number of documents (URIs) rather than KB downloaded.
busy-threads is the total number of ToeThreads that are not available (and thus presumably busy processing a URI). This information is extracted from the crawl controller.
Finally mem-use-KB is extracted from the run time environment
(Runtime.getRuntime().totalMemory()).
In addition to the data collected for the above logs, various other data is gathered and stored by this tracker.
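The current-versus-average rate arithmetic described above can be sketched as follows. This is an illustrative stand-alone example, not the actual StatisticsTracker code; the class and method names are hypothetical.

```java
public class SnapshotRates {
    // Average KB/s over the whole crawl: total KB divided by elapsed seconds.
    public static long averageKBPerSec(long totalBytes, long elapsedMs) {
        if (elapsedMs <= 0) return 0;
        return (totalBytes / 1024) / Math.max(1, elapsedMs / 1000);
    }

    // "Current" KB/s: only the bytes downloaded since the last snapshot,
    // divided by the length of the last snapshot period.
    public static long currentKBPerSec(long totalBytes, long lastSnapshotBytes,
                                       long periodMs) {
        if (periodMs <= 0) return 0;
        return ((totalBytes - lastSnapshotBytes) / 1024)
                / Math.max(1, periodMs / 1000);
    }

    public static void main(String[] args) {
        // 10 MB over 100 s overall, but 5 MB of that in the last 20 s period.
        System.out.println(averageKBPerSec(10L * 1024 * 1024, 100_000));  // 102
        System.out.println(currentKBPerSec(10L * 1024 * 1024,
                5L * 1024 * 1024, 20_000));                               // 256
    }
}
```

Integer division mirrors the whole-number KB/s columns in the log; a burst in the last period shows up in the current rate long before it moves the lifetime average.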
See Also:
StatisticsTracking, AbstractTracker, Serialized Form

| Nested Class Summary |
|---|
| Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
|---|
ComplexType.MBeanAttributeInfoIterator |
| Field Summary | |
|---|---|
protected long |
averageDepth
|
protected int |
busyThreads
|
protected float |
congestionRatio
|
protected CrawledBytesHistotable |
crawledBytes
Tally of crawled byte sizes: novel, verified (same content hash), and vouched (not-modified). |
protected double |
currentDocsPerSecond
|
protected int |
currentKBPerSec
|
protected long |
deepestUri
|
protected long |
discoveredUriCount
|
protected double |
docsPerSecond
|
protected long |
downloadDisregards
|
protected long |
downloadedUriCount
|
protected long |
downloadFailures
|
protected long |
finishedUriCount
|
protected java.util.Map<java.lang.String,LongWrapper> |
hostsBytes
|
protected java.util.Map<java.lang.String,LongWrapper> |
hostsDistribution
Keep track of hosts. |
protected java.util.Map<java.lang.String,java.lang.Long> |
hostsLastFinished
|
protected long |
lastPagesFetchedCount
|
protected long |
lastProcessedBytesCount
|
protected java.util.Hashtable<java.lang.String,LongWrapper> |
mimeTypeBytes
|
protected java.util.Hashtable<java.lang.String,LongWrapper> |
mimeTypeDistribution
Keep track of the file types we see (mime type -> count) |
protected java.util.Map<java.lang.String,SeedRecord> |
processedSeedsRecords
Record of seeds' latest actions. |
protected long |
queuedUriCount
|
protected java.util.Map<java.lang.String,java.util.HashMap<java.lang.String,LongWrapper>> |
sourceHostDistribution
Keep track of URL counts per host per seed |
protected java.util.Hashtable<java.lang.String,LongWrapper> |
statusCodeDistribution
Keep track of fetch status codes |
protected long |
totalKBPerSec
|
protected long |
totalProcessedBytes
|
| Fields inherited from class org.archive.crawler.framework.AbstractTracker |
|---|
ATTR_STATS_INTERVAL, controller, crawlerEndTime, crawlerPauseStarted, crawlerStartTime, crawlerTotalPausedTime, DEFAULT_STATISTICS_REPORT_INTERVAL, lastLogPointTime, shouldrun |
| Fields inherited from class org.archive.crawler.settings.ComplexType |
|---|
definition, definitionMap |
| Fields inherited from interface org.archive.crawler.framework.StatisticsTracking |
|---|
SEED_DISPOSITION_DISREGARD, SEED_DISPOSITION_FAILURE, SEED_DISPOSITION_NOT_PROCESSED, SEED_DISPOSITION_RETRY, SEED_DISPOSITION_SUCCESS |
| Constructor Summary | |
|---|---|
StatisticsTracker(java.lang.String name)
|
|
| Method Summary | |
|---|---|
int |
activeThreadCount()
Get the number of active (non-paused) threads. |
long |
averageDepth()
Average depth of the last URI in all eligible queues. |
float |
congestionRatio()
Ratio of number of threads that would theoretically allow maximum crawl progress (if each was as productive as current threads), to current number of threads. |
void |
crawlCheckpoint(java.io.File cpDir)
Called by CrawlController when checkpointing. |
java.lang.String |
crawledBytesSummary()
|
void |
crawledURIDisregard(CrawlURI curi)
Notification of a crawled URI that is to be disregarded. |
void |
crawledURIFailure(CrawlURI curi)
Notification of a failed crawling of a URI. |
void |
crawledURINeedRetry(CrawlURI curi)
Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems). |
void |
crawledURISuccessful(CrawlURI curi)
Notification of a successfully crawled URI |
void |
crawlEnded(java.lang.String message)
Called when a CrawlController has ended a crawl and is about to exit. |
double |
currentProcessedDocsPerSec()
Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot). |
int |
currentProcessedKBPerSec()
Calculates an estimate of the rate, in kb, at which documents are currently being processed by the crawler. |
long |
deepestUri()
Ordinal position of the 'deepest' URI eligible for crawling. |
long |
discoveredUriCount()
Number of discovered URIs. |
long |
disregardedFetchAttempts()
Get the total number of disregarded fetch attempts. |
void |
dumpReports()
Run the reports. |
long |
failedFetchAttempts()
Get the total number of failed fetch attempts (connection failures -> give up, etc) |
protected void |
finalCleanup()
Cleanup resources used, at crawl end. |
long |
finishedUriCount()
Number of URIs that have finished processing. |
long |
getBytesPerFileType(java.lang.String filetype)
Returns the accumulated number of bytes from files of a given file type. |
long |
getBytesPerHost(java.lang.String host)
Returns the accumulated number of bytes downloaded from a given host. |
java.util.Hashtable<java.lang.String,LongWrapper> |
getFileDistribution()
Returns a HashMap that contains information about distributions of encountered mime types. |
long |
getHostLastFinished(java.lang.String host)
Returns the time (in millisec) when a URI belonging to a given host was last finished processing. |
java.util.Map<java.lang.String,java.lang.Number> |
getProgressStatistics()
|
java.lang.String |
getProgressStatisticsLine()
Return one line of current progress-statistics |
java.lang.String |
getProgressStatisticsLine(java.util.Date now)
Return one line of current progress-statistics |
java.util.TreeMap<java.lang.String,LongWrapper> |
getReverseSortedCopy(java.util.Map<java.lang.String,LongWrapper> mapOfLongWrapperValues)
Sort the entries of the given HashMap in descending order by their values, which must be longs wrapped with LongWrapper. |
java.util.SortedMap |
getReverseSortedHostCounts(java.util.Map<java.lang.String,LongWrapper> hostCounts)
Return a copy of the hosts distribution in reverse-sorted (largest first) order. |
java.util.SortedMap |
getReverseSortedHostsDistribution()
Return a copy of the hosts distribution in reverse-sorted (largest first) order. |
java.util.Iterator |
getSeedRecordsSortedByStatusCode()
Get a SeedRecord iterator for the job being monitored. |
protected java.util.Iterator<SeedRecord> |
getSeedRecordsSortedByStatusCode(java.util.Iterator<java.lang.String> i)
|
java.util.Iterator<java.lang.String> |
getSeeds()
Get a seed iterator for the job being monitored. |
java.util.Hashtable<java.lang.String,LongWrapper> |
getStatusCodeDistribution()
Return a HashMap representing the distribution of status codes for successfully fetched curis, as represented by a hashmap where key -> val represents (string)code -> (integer)count. |
protected static void |
incrementMapCount(java.util.Map<java.lang.String,LongWrapper> map,
java.lang.String key)
Increment a counter for a key in a given HashMap. |
protected static void |
incrementMapCount(java.util.Map<java.lang.String,LongWrapper> map,
java.lang.String key,
long increment)
Increment a counter for a key in a given HashMap by an arbitrary amount. |
void |
initialize(CrawlController c)
Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events. |
int |
percentOfDiscoveredUrisCompleted()
This returns the number of completed URIs as a percentage of the total number of URIs encountered (should be inverse to the discovery curve) |
double |
processedDocsPerSec()
Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot) |
long |
processedKBPerSec()
Calculates the rate that data, in kb, has been processed over the life of the crawl (as of last snapshot.) |
protected void |
progressStatisticsEvent(java.util.EventObject e)
A method for logging current crawler state. |
long |
queuedUriCount()
Number of URIs queued up and waiting for processing. |
protected void |
saveHostStats(java.lang.String hostname,
long size)
|
protected void |
saveSourceStats(java.lang.String source,
java.lang.String hostname)
|
long |
successfullyFetchedCount()
Number of successfully processed URIs. |
int |
threadCount()
Get the total number of ToeThreads (sleeping and active) |
long |
totalBytesCrawled()
Returns the total number of uncompressed bytes crawled. |
long |
totalBytesWritten()
Deprecated. use totalBytesCrawled |
long |
totalCount()
|
protected void |
writeCrawlReportTo(java.io.PrintWriter writer)
|
protected void |
writeFrontierReportTo(java.io.PrintWriter writer)
Write the Frontier's 'nonempty' report (if available) |
protected void |
writeHostsReportTo(java.io.PrintWriter writer)
|
protected void |
writeManifestReportTo(java.io.PrintWriter writer)
|
protected void |
writeMimetypesReportTo(java.io.PrintWriter writer)
|
protected void |
writeProcessorsReportTo(java.io.PrintWriter writer)
|
protected void |
writeReportFile(java.lang.String reportName,
java.lang.String filename)
|
protected void |
writeReportLine(java.io.PrintWriter writer,
java.lang.Object... fields)
|
protected void |
writeResponseCodeReportTo(java.io.PrintWriter writer)
|
protected void |
writeSeedsReportTo(java.io.PrintWriter writer)
|
protected void |
writeSourceReportTo(java.io.PrintWriter writer)
|
| Methods inherited from class org.archive.crawler.framework.AbstractTracker |
|---|
crawlDuration, crawlEnding, crawlPaused, crawlPausing, crawlResuming, crawlStarted, getCrawlEndTime, getCrawlerTotalElapsedTime, getCrawlPauseStartedTime, getCrawlStartTime, getCrawlTotalPauseTime, getLogWriteInterval, logNote, noteStart, progressStatisticsLegend, run, tallyCurrentPause |
| Methods inherited from class org.archive.crawler.settings.ModuleType |
|---|
addElement, listUsedFiles |
| Methods inherited from class org.archive.crawler.settings.Type |
|---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
| Methods inherited from class javax.management.Attribute |
|---|
getName |
| Methods inherited from class java.lang.Object |
|---|
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
protected long lastPagesFetchedCount
protected long lastProcessedBytesCount
protected long discoveredUriCount
protected long queuedUriCount
protected long finishedUriCount
protected long downloadedUriCount
protected long downloadFailures
protected long downloadDisregards
protected double docsPerSecond
protected double currentDocsPerSecond
protected int currentKBPerSec
protected long totalKBPerSec
protected int busyThreads
protected long totalProcessedBytes
protected float congestionRatio
protected long deepestUri
protected long averageDepth
protected CrawledBytesHistotable crawledBytes
protected java.util.Hashtable<java.lang.String,LongWrapper> mimeTypeDistribution
protected java.util.Hashtable<java.lang.String,LongWrapper> mimeTypeBytes
protected java.util.Hashtable<java.lang.String,LongWrapper> statusCodeDistribution
protected transient java.util.Map<java.lang.String,LongWrapper> hostsDistribution
These maps are transient because they are usually BigMaps that get reconstituted on recovery from a checkpoint.
protected transient java.util.Map<java.lang.String,LongWrapper> hostsBytes
protected transient java.util.Map<java.lang.String,java.lang.Long> hostsLastFinished
protected transient java.util.Map<java.lang.String,java.util.HashMap<java.lang.String,LongWrapper>> sourceHostDistribution
protected transient java.util.Map<java.lang.String,SeedRecord> processedSeedsRecords
| Constructor Detail |
|---|
public StatisticsTracker(java.lang.String name)
| Method Detail |
|---|
public void initialize(CrawlController c)
throws FatalConfigurationException
Description copied from class: AbstractTracker
Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events.
Specified by: initialize in interface StatisticsTracking
Overrides: initialize in class AbstractTracker
Parameters: c - A crawl controller instance.
Throws: FatalConfigurationException - Not thrown here. For overrides that go to the settings system for configuration.
See Also: CrawlStatusListener, CrawlURIDispositionListener

protected void finalCleanup()
Description copied from class: AbstractTracker
Cleanup resources used, at crawl end.
Overrides: finalCleanup in class AbstractTracker

protected void progressStatisticsEvent(java.util.EventObject e)
Description copied from class: AbstractTracker
A method for logging current crawler state. Calls CrawlController.logProgressStatistics(java.lang.String) so CrawlController can act on the progress statistics event.
Implementations of this method should carefully consider whether it needs to be synchronized, in whole or in part.
Overrides: progressStatisticsEvent in class AbstractTracker
Parameters: e - Progress statistics event.

public java.lang.String getProgressStatisticsLine(java.util.Date now)
Return one line of current progress-statistics.
Parameters: now -
public java.util.Map<java.lang.String,java.lang.Number> getProgressStatistics()
Specified by: getProgressStatistics in interface StatisticsTracking

public java.lang.String getProgressStatisticsLine()
Return one line of current progress-statistics.
Specified by: getProgressStatisticsLine in interface StatisticsTracking

public double processedDocsPerSec()
Description copied from interface: StatisticsTracking
Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot).
Specified by: processedDocsPerSec in interface StatisticsTracking

public double currentProcessedDocsPerSec()
Description copied from interface: StatisticsTracking
Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot).
Specified by: currentProcessedDocsPerSec in interface StatisticsTracking

public long processedKBPerSec()
Description copied from interface: StatisticsTracking
Calculates the rate that data, in kb, has been processed over the life of the crawl (as of last snapshot).
Specified by: processedKBPerSec in interface StatisticsTracking

public int currentProcessedKBPerSec()
Description copied from interface: StatisticsTracking
Calculates an estimate of the rate, in kb, at which documents are currently being processed by the crawler.
Specified by: currentProcessedKBPerSec in interface StatisticsTracking

public java.util.Hashtable<java.lang.String,LongWrapper> getFileDistribution()
Returns a HashMap that contains information about distributions of encountered mime types.
Note: All the values are wrapped with a LongWrapper
protected static void incrementMapCount(java.util.Map<java.lang.String,LongWrapper> map,
java.lang.String key)
Parameters:
map - The HashMap
key - The key for the counter to be incremented. If it does not exist it will be added (set to 1). If null, the counter "unknown" is incremented.
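The contract described above (a missing key starts a new counter, a null key tallies under "unknown") can be sketched with a minimal stand-in for Heritrix's LongWrapper. The class below is hypothetical, for illustration only, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class MapCounter {
    // Minimal stand-in for Heritrix's LongWrapper: a mutable long holder,
    // used so counts can be bumped in place without re-boxing.
    static class LongWrapper {
        long longValue;
        LongWrapper(long v) { longValue = v; }
    }

    // Mirrors the incrementMapCount contract: null keys are tallied under
    // "unknown"; absent keys start at the increment value.
    static void incrementMapCount(Map<String, LongWrapper> map,
                                  String key, long increment) {
        if (key == null) {
            key = "unknown";
        }
        LongWrapper entry = map.get(key);
        if (entry == null) {
            map.put(key, new LongWrapper(increment));
        } else {
            entry.longValue += increment;
        }
    }

    public static void main(String[] args) {
        Map<String, LongWrapper> counts = new HashMap<>();
        incrementMapCount(counts, "text/html", 1);
        incrementMapCount(counts, "text/html", 1);
        incrementMapCount(counts, null, 1);  // tallied as "unknown"
        System.out.println(counts.get("text/html").longValue); // 2
        System.out.println(counts.get("unknown").longValue);   // 1
    }
}
```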
protected static void incrementMapCount(java.util.Map<java.lang.String,LongWrapper> map,
java.lang.String key,
long increment)
Parameters:
map - The HashMap
key - The key for the counter to be incremented. If it does not exist it will be added (set equal to increment). If null, the counter "unknown" is incremented.
increment - The amount by which to increment the counter for the key.

public java.util.TreeMap<java.lang.String,LongWrapper> getReverseSortedCopy(java.util.Map<java.lang.String,LongWrapper> mapOfLongWrapperValues)
Sort the entries of the given HashMap in descending order by their values, which must be longs wrapped with LongWrapper.
Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary, but consistent manner by their keys. Only items with identical value and key are considered equal. If the passed-in map requires access to be synchronized, the caller should ensure this synchronization.
Parameters: mapOfLongWrapperValues - Assumes values are wrapped with LongWrapper.
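A minimal sketch of this value-descending sort using a TreeMap with a custom comparator. For simplicity it uses plain Long values instead of LongWrapper, and the class name is illustrative; only the sorting technique matches the contract above.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ReverseSortByValue {
    // Returns a copy of the map ordered largest-value-first. Ties are broken
    // by key: without the tie-break, a TreeMap would treat comparator-equal
    // keys as duplicates and silently drop entries.
    static SortedMap<String, Long> reverseSortedCopy(Map<String, Long> src) {
        Comparator<String> byValueDesc = (a, b) -> {
            int cmp = Long.compare(src.get(b), src.get(a)); // descending value
            return cmp != 0 ? cmp : a.compareTo(b);         // tie -> key order
        };
        TreeMap<String, Long> sorted = new TreeMap<>(byValueDesc);
        sorted.putAll(src);
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Long> hits = Map.of("a.com", 5L, "b.com", 12L, "c.com", 5L);
        System.out.println(reverseSortedCopy(hits).keySet()); // [b.com, a.com, c.com]
    }
}
```

This matches the documented behavior that equal values are sorted consistently by key and only identical value-and-key pairs compare as equal.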
public java.util.Hashtable<java.lang.String,LongWrapper> getStatusCodeDistribution()
Note: All the values are wrapped with a LongWrapper.
public long getHostLastFinished(java.lang.String host)
host - The host to look up time of last completed URI.
public long getBytesPerHost(java.lang.String host)
host - name of the host
public long getBytesPerFileType(java.lang.String filetype)
filetype - Filetype to check.
public int threadCount()
public int activeThreadCount()
Description copied from interface: StatisticsTracking
Get the number of active (non-paused) threads.
Specified by: activeThreadCount in interface StatisticsTracking

public int percentOfDiscoveredUrisCompleted()
public long discoveredUriCount()
If crawl not running (paused or stopped) this will return the value of the last snapshot.
See Also: Frontier.discoveredUriCount()

public long finishedUriCount()
See Also: Frontier.finishedUriCount()

public long failedFetchAttempts()

public long disregardedFetchAttempts()

public long successfullyFetchedCount()
Description copied from interface: StatisticsTracking
If crawl not running (paused or stopped) this will return the value of the last snapshot.
Specified by: successfullyFetchedCount in interface StatisticsTracking
See Also: Frontier.succeededFetchCount()

public long totalCount()
Specified by: totalCount in interface StatisticsTracking

public float congestionRatio()
Specified by: congestionRatio in interface StatisticsTracking

public long deepestUri()
Specified by: deepestUri in interface StatisticsTracking

public long averageDepth()
Specified by: averageDepth in interface StatisticsTracking

public long queuedUriCount()
If crawl not running (paused or stopped) this will return the value of the last snapshot.
See Also: Frontier.queuedUriCount()

public long totalBytesWritten()
Description copied from interface: StatisticsTracking
Specified by: totalBytesWritten in interface StatisticsTracking

public long totalBytesCrawled()
Description copied from interface: StatisticsTracking
Returns the total number of uncompressed bytes crawled.
Specified by: totalBytesCrawled in interface StatisticsTracking

public java.lang.String crawledBytesSummary()
public void crawledURISuccessful(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Specified by: crawledURISuccessful in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI
protected void saveSourceStats(java.lang.String source,
java.lang.String hostname)
protected void saveHostStats(java.lang.String hostname,
long size)
public void crawledURINeedRetry(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Specified by: crawledURINeedRetry in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI

public void crawledURIDisregard(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Specified by: crawledURIDisregard in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI

public void crawledURIFailure(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Specified by: crawledURIFailure in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI

public java.util.Iterator<java.lang.String> getSeeds()
public java.util.Iterator getSeedRecordsSortedByStatusCode()
Description copied from interface: StatisticsTracking
Get a SeedRecord iterator for the job being monitored.
Sort order is:
No status code (not processed)
Status codes smaller than 0 (largest to smallest)
Status codes larger than 0 (largest to smallest)
Note: This iterator will iterate over a list of SeedRecords.
Specified by: getSeedRecordsSortedByStatusCode in interface StatisticsTracking

protected java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode(java.util.Iterator<java.lang.String> i)
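The three-bucket sort order above can be sketched as a comparator over bare status codes. This is an illustration, not the real method (which sorts SeedRecords); in particular, using code 0 to stand for "no status code (not processed)" is an assumption made here for simplicity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SeedSortOrder {
    // Buckets a status code per the documented order: "not processed"
    // (modeled here as code 0) first, then negative codes, then positive.
    static int bucket(int code) {
        if (code == 0) return 0;       // no status code (not processed)
        return code < 0 ? 1 : 2;
    }

    static Comparator<Integer> byStatusCode() {
        return (a, b) -> {
            int cmp = Integer.compare(bucket(a), bucket(b));
            // Within a bucket, largest code first.
            return cmp != 0 ? cmp : Integer.compare(b, a);
        };
    }

    public static void main(String[] args) {
        List<Integer> codes = new ArrayList<>(List.of(200, -1, 404, 0, -6, 301));
        codes.sort(byStatusCode());
        System.out.println(codes); // [0, -1, -6, 404, 301, 200]
    }
}
```

Negative codes in Heritrix denote fetch errors, so this ordering surfaces unprocessed and failed seeds before successful ones.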
public void crawlEnded(java.lang.String message)
Description copied from interface: CrawlStatusListener
Specified by: crawlEnded in interface CrawlStatusListener
Overrides: crawlEnded in class AbstractTracker
Parameters: message - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also: CrawlStatusListener.crawlEnded(java.lang.String)

protected void writeSeedsReportTo(java.io.PrintWriter writer)
Parameters: writer - Where to write.

protected void writeSourceReportTo(java.io.PrintWriter writer)
public java.util.SortedMap getReverseSortedHostCounts(java.util.Map<java.lang.String,LongWrapper> hostCounts)
protected void writeHostsReportTo(java.io.PrintWriter writer)
protected void writeReportLine(java.io.PrintWriter writer,
java.lang.Object... fields)
public java.util.SortedMap getReverseSortedHostsDistribution()
protected void writeMimetypesReportTo(java.io.PrintWriter writer)
protected void writeResponseCodeReportTo(java.io.PrintWriter writer)
protected void writeCrawlReportTo(java.io.PrintWriter writer)
protected void writeProcessorsReportTo(java.io.PrintWriter writer)
protected void writeReportFile(java.lang.String reportName,
java.lang.String filename)
protected void writeManifestReportTo(java.io.PrintWriter writer)
Parameters: writer - Where to write.

protected void writeFrontierReportTo(java.io.PrintWriter writer)
Write the Frontier's 'nonempty' report (if available).
Parameters: writer - Where to report to.

public void dumpReports()
Run the reports.
Overrides: dumpReports in class AbstractTracker
public void crawlCheckpoint(java.io.File cpDir)
throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.
Specified by: crawlCheckpoint in interface CrawlStatusListener
Parameters: cpDir - Checkpoint dir. Write checkpoint state here.
Throws: java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.