A B C D E F G H I L M N P R S T U V W 

A

AbstractSiteMap - Class in crawlercommons.sitemaps
SiteMap or SiteMapIndex
AbstractSiteMap() - Constructor for class crawlercommons.sitemaps.AbstractSiteMap
 
AbstractSiteMap.SitemapType - Enum in crawlercommons.sitemaps
Various Sitemap types
addRule(String, boolean) - Method in class crawlercommons.robots.SimpleRobotRules
 
addSitemap(String) - Method in class crawlercommons.robots.BaseRobotRules
 
addSitemap(AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapIndex
Add this Sitemap to the list of Sitemaps.
addSiteMapUrl(SiteMapURL) - Method in class crawlercommons.sitemaps.SiteMap
 
addUrlIntoSitemap(String, SiteMap, String, String, String, int) - Method in class crawlercommons.sitemaps.SiteMapParser
Adds the given URL to the given sitemap while showing the relevant logs
addUrlIntoSitemap(String, SiteMap, String, String, String, int) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Adds the given URL to the given sitemap while showing the relevant logs

B

BaseRobotRules - Class in crawlercommons.robots
Result from parsing a single robots.txt file - which means we get a set of rules, and a crawl-delay.
BaseRobotRules() - Constructor for class crawlercommons.robots.BaseRobotRules
 
BaseRobotsParser - Class in crawlercommons.robots
 
BaseRobotsParser() - Constructor for class crawlercommons.robots.BaseRobotsParser
 
BasicURLNormalizer - Class in crawlercommons.filters.basic
Code borrowed from Apache Nutch.
BasicURLNormalizer() - Constructor for class crawlercommons.filters.basic.BasicURLNormalizer
 

C

characters(char[], int, int) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
clearRules() - Method in class crawlercommons.robots.SimpleRobotRules
 
COMMENT - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
compareTo(SimpleRobotRules.RobotRule) - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
convertToDate(String) - Static method in class crawlercommons.sitemaps.AbstractSiteMap
Converts the given date string (given in an acceptable DateFormat) to a date; returns null if the date is not in the correct format.
crawlercommons - package crawlercommons
 
CrawlerCommons - Class in crawlercommons
 
CrawlerCommons() - Constructor for class crawlercommons.CrawlerCommons
 
crawlercommons.domains - package crawlercommons.domains
Classes contained within the domains package relate to the definition of Top Level Domains, various domain registrars and the effective handling of such domains.
crawlercommons.filters - package crawlercommons.filters
The filters package contains code and resources for URL filtering.
crawlercommons.filters.basic - package crawlercommons.filters.basic
 
crawlercommons.robots - package crawlercommons.robots
The robots package contains all of the robots.txt rule inference, parsing and utilities contained within Crawler Commons.
crawlercommons.sitemaps - package crawlercommons.sitemaps
Sitemaps package provides all classes relevant to focused sitemap parsing, url definition and processing.
crawlercommons.sitemaps.sax - package crawlercommons.sitemaps.sax
 
currentElement() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
currentElementParent() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
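
The convertToDate(String) entry above describes a conversion that yields null for input in the wrong format. A minimal sketch of that pattern follows. It is not the AbstractSiteMap implementation: the class and method names are hypothetical, and the two accepted formats (an ISO-8601 date-time with offset, and a plain yyyy-MM-dd date taken as midnight UTC) are assumptions chosen for illustration.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Date;

// Hypothetical helper, not part of crawler-commons.
final class LenientDateSketch {

    // Returns the parsed date, or null when the text matches neither assumed format.
    static Date toDateOrNull(String text) {
        if (text == null) {
            return null;
        }
        String s = text.trim();
        try {
            // Assumed format 1: full date-time with offset, e.g. 2017-01-02T03:04:05Z
            return Date.from(OffsetDateTime.parse(s).toInstant());
        } catch (DateTimeParseException ignored) {
            // fall through to the date-only format
        }
        try {
            // Assumed format 2: date only, e.g. 2017-01-02, interpreted as midnight UTC
            return Date.from(LocalDate.parse(s).atStartOfDay(ZoneOffset.UTC).toInstant());
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toDateOrNull("2017-01-02T03:04:05Z"));
        System.out.println(toDateOrNull("not a date")); // prints null
    }
}
```

Returning null instead of throwing keeps one malformed lastmod value from aborting the processing of an otherwise usable sitemap; callers must check for null.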

D

DEFAULT_PRIORITY - Static variable in class crawlercommons.sitemaps.SiteMapURL
 
DelegatorHandler - Class in crawlercommons.sitemaps.sax
Provides a base SAX handler for parsing of XML documents representing sub-classes of AbstractSiteMap.
DelegatorHandler(LinkedList<String>, boolean) - Constructor for class crawlercommons.sitemaps.sax.DelegatorHandler
 
DelegatorHandler(URL, boolean) - Constructor for class crawlercommons.sitemaps.sax.DelegatorHandler
 
DOT - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
DOT_REGEX - Static variable in class crawlercommons.domains.EffectiveTldFinder
 

E

EffectiveTLD(String) - Constructor for class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
EffectiveTldFinder - Class in crawlercommons.domains
Given a URL's hostname, determining the actual domain requires knowledge of the various domain registrars and their assignment policies.
EffectiveTldFinder.EffectiveTLD - Class in crawlercommons.domains
 
endElement(String, String, String) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
equals(Object) - Method in class crawlercommons.robots.BaseRobotRules
 
equals(Object) - Method in class crawlercommons.robots.SimpleRobotRules
 
equals(Object) - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
equals(Object) - Method in class crawlercommons.sitemaps.SiteMapURL
 
error(SAXParseException) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
ETLD_DATA - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
EXCEPTION - Static variable in class crawlercommons.domains.EffectiveTldFinder
 

F

failedFetch(int) - Method in class crawlercommons.robots.BaseRobotsParser
The fetch of robots.txt failed, so return rules appropriate given the HTTP status code.
failedFetch(int) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
fatalError(SAXParseException) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
filter(String) - Method in class crawlercommons.filters.basic.BasicURLNormalizer
 
filter(String) - Method in class crawlercommons.filters.URLFilter
Returns a modified version of the input URL or null if the URL should be removed

G

getAssignedDomain(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
This method uses the effective TLD to determine which component of a FQDN is the NIC-assigned domain name.
getBaseUrl() - Method in class crawlercommons.sitemaps.SiteMap
 
getChangeFrequency() - Method in class crawlercommons.sitemaps.SiteMapURL
Return the URL's change frequency
getCrawlDelay() - Method in class crawlercommons.robots.BaseRobotRules
 
getDomain() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
getEffectiveTLD(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
 
getEffectiveTLDs() - Static method in class crawlercommons.domains.EffectiveTldFinder
 
getElementAttributeValue(Element, String, String) - Method in class crawlercommons.sitemaps.SiteMapParser
Get the element's attribute value.
getElementValue(Element, String) - Method in class crawlercommons.sitemaps.SiteMapParser
Get the element's textual content.
getError() - Method in exception crawlercommons.sitemaps.UnknownFormatException
public method, callable by exception catcher.
getException() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
getFullDateFormat() - Static method in class crawlercommons.sitemaps.AbstractSiteMap
 
getInstance() - Static method in class crawlercommons.domains.EffectiveTldFinder
 
getLastModified() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getLastModified() - Method in class crawlercommons.sitemaps.SiteMapURL
Return when this URL was last modified.
getNumWarnings() - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
getPLD(String) - Static method in class crawlercommons.domains.PaidLevelDomain
Extract the PLD (paid-level domain) from the hostname.
getPLD(URL) - Static method in class crawlercommons.domains.PaidLevelDomain
Extract the PLD (paid-level domain) from the URL.
getPriority() - Method in class crawlercommons.sitemaps.SiteMapURL
Return this URL's priority (a value between [0.0 - 1.0]).
getSiteMap() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
getSitemap(URL) - Method in class crawlercommons.sitemaps.SiteMapIndex
Returns the Sitemap that has the given URL.
getSitemaps() - Method in class crawlercommons.robots.BaseRobotRules
 
getSitemaps() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
getSiteMapUrls() - Method in class crawlercommons.sitemaps.SiteMap
 
getType() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getUrl() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getUrl() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
getUrl() - Method in class crawlercommons.sitemaps.SiteMapURL
Return the URL.
getVersion() - Static method in class crawlercommons.CrawlerCommons
 
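
The getAssignedDomain(String) and getPLD(String) entries above describe picking the registered domain out of a hostname by means of its effective TLD. A minimal sketch of that lookup follows. It is not the EffectiveTldFinder implementation: the class name, the method name and the four-entry suffix set are hypothetical, and wildcard and exception rules are deliberately ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of crawler-commons.
final class AssignedDomainSketch {

    // Assumption: a tiny hand-written suffix set standing in for real effective-TLD data.
    static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "uk", "co.uk"));

    // Returns the suffix plus one extra label, or null if no suffix matches
    // or the hostname is itself a suffix.
    static String assignedDomain(String hostname) {
        String[] labels = hostname.toLowerCase(Locale.ROOT).split("\\.");
        // Scanning from the left tries the longest candidate suffix first.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                if (i == 0) {
                    return null; // the whole hostname is a suffix, nothing was registered under it
                }
                return String.join(".", Arrays.copyOfRange(labels, i - 1, labels.length));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(assignedDomain("www.example.co.uk")); // prints example.co.uk
        System.out.println(assignedDomain("news.example.com"));  // prints example.com
    }
}
```

Trying the longest suffix first is what makes "co.uk" win over "uk", so that www.example.co.uk maps to example.co.uk rather than co.uk.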

H

hashCode() - Method in class crawlercommons.robots.BaseRobotRules
 
hashCode() - Method in class crawlercommons.robots.SimpleRobotRules
 
hashCode() - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
hashCode() - Method in class crawlercommons.sitemaps.SiteMapURL
 
hasUnprocessedSitemap() - Method in class crawlercommons.sitemaps.SiteMapIndex
 

I

initialize(InputStream) - Method in class crawlercommons.domains.EffectiveTldFinder
 
isAllowAll() - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowAll() - Method in class crawlercommons.robots.SimpleRobotRules
Is our ruleset set up to allow all access?
isAllowed(String) - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowed(String) - Method in class crawlercommons.robots.SimpleRobotRules
 
isAllowNone() - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowNone() - Method in class crawlercommons.robots.SimpleRobotRules
Is our ruleset set up to disallow all access?
isConfigured() - Method in class crawlercommons.domains.EffectiveTldFinder
 
isDeferVisits() - Method in class crawlercommons.robots.BaseRobotRules
 
isException() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
isIndex() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
isIndex() - Method in class crawlercommons.sitemaps.SiteMap
 
isIndex() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
isProcessed() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
isStrict() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
isStrict() - Method in class crawlercommons.sitemaps.SiteMapParser
 
isStrict() - Method in class crawlercommons.sitemaps.SiteMapParserSAX
 
isValid() - Method in class crawlercommons.sitemaps.SiteMapURL
Is the siteMapURL under the base url?
isWild() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 

L

LOG - Static variable in class crawlercommons.filters.basic.BasicURLNormalizer
 
LOG - Static variable in class crawlercommons.sitemaps.SiteMapParser
 
LOG - Static variable in class crawlercommons.sitemaps.SiteMapParserSAX
 

M

main(String[]) - Static method in class crawlercommons.filters.basic.BasicURLNormalizer
 
main(String[]) - Static method in class crawlercommons.sitemaps.SiteMapTester
 
MAX_BYTES_ALLOWED - Static variable in class crawlercommons.sitemaps.SiteMapParser
Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov.
MAX_BYTES_ALLOWED - Static variable in class crawlercommons.sitemaps.SiteMapParserSAX
Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov.
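
The figure quoted in the MAX_BYTES_ALLOWED entries above is 50 MB counted in units of 1024 × 1024 bytes. A short sketch of the arithmetic and of a size guard follows. The class, constant and method names here are hypothetical and are not the fields or methods of SiteMapParser.

```java
// Hypothetical helper, not part of crawler-commons.
final class SitemapSizeLimitSketch {

    // 50 MB in bytes: 50 * 1024 * 1024 = 52,428,800
    static final int MAX_BYTES = 50 * 1024 * 1024;

    // "No larger than" the limit, so content of exactly MAX_BYTES still passes.
    static boolean withinLimit(byte[] content) {
        return content.length <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(MAX_BYTES); // prints 52428800
    }
}
```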

N

nextUnprocessedSitemap() - Method in class crawlercommons.sitemaps.SiteMapIndex
 

P

PaidLevelDomain - Class in crawlercommons.domains
Routines to extract the PLD (paid-level domain, as per the IRLbot paper) from a hostname or URL.
PaidLevelDomain() - Constructor for class crawlercommons.domains.PaidLevelDomain
 
parseAtom(SiteMap, Element, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the XML document which is assumed to be in Atom format.
parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.BaseRobotsParser
Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent.
parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
parseRSS(SiteMap, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse XML document which is assumed to be in RSS format.
parseSiteMap(URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Returns a SiteMap or SiteMapIndex given an online sitemap URL. Please note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "Keep it simple" way to parse it.
parseSiteMap(String, byte[], AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapParser
Returns a processed copy of an unprocessed sitemap object, i.e.
parseSiteMap(byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse a sitemap, given the content bytes and the URL.
parseSiteMap(String, byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse a sitemap, given the MIME type, the content bytes, and the URL.
parseSiteMap(URL) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Returns a SiteMap or SiteMapIndex given an online sitemap URL. Please note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "Keep it simple" way to parse it.
parseSiteMap(String, byte[], AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Returns a processed copy of an unprocessed sitemap object, i.e.
parseSiteMap(byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Parse a sitemap, given the content bytes and the URL.
parseSiteMap(String, byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Parse a sitemap, given the MIME type, the content bytes, and the URL.
parseSitemapIndex(URL, NodeList) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse XML that contains a Sitemap Index.
parseSyndicationFormat(URL, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the XML document, looking for a feed element to determine if it's an Atom doc, or an rss element to determine if it's an RSS doc.
parseXmlSitemap(URL, Document) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse XML that contains a valid Sitemap.
processGzippedXML(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Decompress the gzipped content and process the resulting XML Sitemap.
processGzippedXML(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Decompress the gzipped content and process the resulting XML Sitemap.
processText(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Process a text-based Sitemap.
processText(URL, InputStream) - Method in class crawlercommons.sitemaps.SiteMapParser
Process a text-based Sitemap.
processText(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Process a text-based Sitemap.
processText(URL, InputStream) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Process a text-based Sitemap.
processXml(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the given XML content.
processXml(URL, InputSource) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the given XML content.
processXml(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Parse the given XML content.
processXml(URL, InputSource) - Method in class crawlercommons.sitemaps.SiteMapParserSAX
Parse the given XML content.
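
The processGzippedXML(URL, byte[]) entries above describe decompressing gzipped content before the XML is processed. A minimal sketch of the decompression step using java.util.zip follows. It is not the parser's own code: the class and method names are hypothetical, and the gzip helper exists only so the example can produce its own input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of crawler-commons.
final class GzipSitemapSketch {

    // Inflates gzip-compressed bytes into the original bytes.
    static byte[] gunzip(byte[] gzipped) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses bytes with gzip; used here only to build sample input.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<urlset></urlset>".getBytes(StandardCharsets.UTF_8);
        String roundTrip = new String(gunzip(gzip(xml)), StandardCharsets.UTF_8);
        System.out.println(roundTrip); // prints <urlset></urlset>
    }
}
```

A real crawler would probably also cap the number of decompressed bytes it accepts, since a small gzipped payload can inflate to a very large document.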

R

RobotRule(String, boolean) - Constructor for class crawlercommons.robots.SimpleRobotRules.RobotRule
 

S

setChangeFrequency(SiteMapURL.ChangeFrequency) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's change frequency
setChangeFrequency(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's change frequency. In case of a bad ChangeFrequency, the current frequency in this instance will be set to NULL.
setCrawlDelay(long) - Method in class crawlercommons.robots.BaseRobotRules
 
setDeferVisits(boolean) - Method in class crawlercommons.robots.BaseRobotRules
 
setException(UnknownFormatException) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
setLastModified(Date) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setLastModified(String) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setLastModified(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set when this URL was last modified.
setLastModified(Date) - Method in class crawlercommons.sitemaps.SiteMapURL
Set when this URL was last modified.
setPriority(double) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's priority to a value between [0.0 - 1.0] (Default Priority is used if the given priority is out of range).
setPriority(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's priority to a value between [0.0 - 1.0] (Default Priority is used if the given priority is missing or out of range).
setProcessed(boolean) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setType(AbstractSiteMap.SitemapType) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setUrl(URL) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL.
setUrl(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL.
setValid(boolean) - Method in class crawlercommons.sitemaps.SiteMapURL
Valid means that it follows the official guidelines that the siteMapURL must be under the base url
SimpleRobotRules - Class in crawlercommons.robots
Result from parsing a single robots.txt file - which means we get a set of rules, and a crawl-delay.
SimpleRobotRules() - Constructor for class crawlercommons.robots.SimpleRobotRules
 
SimpleRobotRules(SimpleRobotRules.RobotRulesMode) - Constructor for class crawlercommons.robots.SimpleRobotRules
 
SimpleRobotRules.RobotRule - Class in crawlercommons.robots
Single rule that maps from a path prefix to an allow flag.
SimpleRobotRules.RobotRulesMode - Enum in crawlercommons.robots
 
SimpleRobotRulesParser - Class in crawlercommons.robots
This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain.
SimpleRobotRulesParser() - Constructor for class crawlercommons.robots.SimpleRobotRulesParser
 
SiteMap - Class in crawlercommons.sitemaps
 
SiteMap() - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(URL) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(String) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(URL, Date) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(String, String) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMapIndex - Class in crawlercommons.sitemaps
 
SiteMapIndex() - Constructor for class crawlercommons.sitemaps.SiteMapIndex
 
SiteMapIndex(URL) - Constructor for class crawlercommons.sitemaps.SiteMapIndex
 
SiteMapParser - Class in crawlercommons.sitemaps
 
SiteMapParser() - Constructor for class crawlercommons.sitemaps.SiteMapParser
 
SiteMapParser(boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParser
 
SiteMapParserSAX - Class in crawlercommons.sitemaps
 
SiteMapParserSAX() - Constructor for class crawlercommons.sitemaps.SiteMapParserSAX
 
SiteMapParserSAX(boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParserSAX
 
SiteMapParserSAX(boolean, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParserSAX
 
SiteMapTester - Class in crawlercommons.sitemaps
Sitemap Tool for recursively fetching all URLs from a sitemap (and all of its children)
SiteMapTester() - Constructor for class crawlercommons.sitemaps.SiteMapTester
 
SiteMapURL - Class in crawlercommons.sitemaps
The SiteMapURL class represents a URL found in a Sitemap.
SiteMapURL(String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(URL, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(String, String, String, String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(URL, Date, SiteMapURL.ChangeFrequency, double, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL.ChangeFrequency - Enum in crawlercommons.sitemaps
Allowed change frequencies
sortRules() - Method in class crawlercommons.robots.SimpleRobotRules
In order to match up with Google's convention, we want to match rules from longest to shortest.
startElement(String, String, String, Attributes) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
strict - Variable in class crawlercommons.sitemaps.SiteMapParser
True (by default) meaning that invalid URLs should be rejected, as the official docs allow the siteMapURLs to be only under the base url: http://www.sitemaps.org/protocol.html#location
strict - Variable in class crawlercommons.sitemaps.SiteMapParserSAX
True (by default) meaning that invalid URLs should be rejected, as the official docs allow the siteMapURLs to be only under the base url: http://www.sitemaps.org/protocol.html#location
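
The sortRules() entry above says that rules are matched from longest to shortest. A minimal sketch of that ordering follows. It is not the SimpleRobotRules implementation: the Rule class and both method names are hypothetical, matching is a plain prefix test with no wildcard handling, and treating a path with no matching rule as allowed is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of crawler-commons.
final class LongestPrefixRulesSketch {

    // A path prefix paired with an allow flag.
    static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    // Returns a copy of the rules ordered from longest prefix to shortest.
    static List<Rule> sortLongestFirst(List<Rule> rules) {
        List<Rule> sorted = new ArrayList<>(rules);
        sorted.sort(Comparator.comparingInt((Rule r) -> r.prefix.length()).reversed());
        return sorted;
    }

    // The first matching rule wins; with longest-first order that is the most specific rule.
    static boolean isAllowed(List<Rule> sortedRules, String path) {
        for (Rule rule : sortedRules) {
            if (path.startsWith(rule.prefix)) {
                return rule.allow;
            }
        }
        return true; // assumption: no matching rule means the path is allowed
    }

    public static void main(String[] args) {
        List<Rule> rules = sortLongestFirst(Arrays.asList(
                new Rule("/private", false),
                new Rule("/private/public", true)));
        System.out.println(isAllowed(rules, "/private/public/page.html")); // prints true
        System.out.println(isAllowed(rules, "/private/secret.html"));      // prints false
    }
}
```

Sorting once up front lets each lookup stop at the first match, because a longer and therefore more specific prefix is always tested before a shorter one that also matches.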

T

toString() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
toString() - Method in class crawlercommons.sitemaps.SiteMap
 
toString() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
toString() - Method in class crawlercommons.sitemaps.SiteMapURL
 

U

UnknownFormatException - Exception in crawlercommons.sitemaps
 
UnknownFormatException() - Constructor for exception crawlercommons.sitemaps.UnknownFormatException
Default constructor - initializes instance variable to unknown
UnknownFormatException(String) - Constructor for exception crawlercommons.sitemaps.UnknownFormatException
Constructor receives some kind of message that is saved in an instance variable.
UNSET_CRAWL_DELAY - Static variable in class crawlercommons.robots.BaseRobotRules
 
url - Variable in class crawlercommons.sitemaps.AbstractSiteMap
 
URLFilter - Class in crawlercommons.filters
 
URLFilter() - Constructor for class crawlercommons.filters.URLFilter
 
urlIsValid(String, String) - Static method in class crawlercommons.sitemaps.SiteMapParser
See if testUrl is under sitemapBaseUrl.
urlIsValid(String, String) - Static method in class crawlercommons.sitemaps.SiteMapParserSAX
See if testUrl is under sitemapBaseUrl.
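
The urlIsValid(String, String) entries above describe checking whether a URL lies under the sitemap's base URL. A minimal sketch of such a check follows. It is not the SiteMapParser implementation: the class and method names are hypothetical, the base is assumed to be the sitemap URL up to and including its last slash, and the sitemap URL is assumed to contain a path.

```java
// Hypothetical helper, not part of crawler-commons.
final class UrlUnderBaseSketch {

    // Assumption: the base is everything up to and including the last '/'.
    // Expects a sitemap URL with a path, e.g. http://example.com/catalog/sitemap.xml
    static String baseOf(String sitemapUrl) {
        int lastSlash = sitemapUrl.lastIndexOf('/');
        return sitemapUrl.substring(0, lastSlash + 1);
    }

    // A URL counts as "under" the base when it starts with the base string.
    static boolean isUnder(String baseUrl, String testUrl) {
        return testUrl.startsWith(baseUrl);
    }

    public static void main(String[] args) {
        String base = baseOf("http://example.com/catalog/sitemap.xml");
        System.out.println(base); // prints http://example.com/catalog/
        System.out.println(isUnder(base, "http://example.com/catalog/item1.html")); // prints true
        System.out.println(isUnder(base, "http://example.com/images/logo.png"));    // prints false
    }
}
```

The comparison is a case-sensitive string prefix test; a production check would likely normalize scheme and host case before comparing.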

V

valueOf(String) - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
Returns the enum constant of this type with the specified name.
values() - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
Returns an array containing the constants of this enum type, in the order they are declared.

W

WILD_CARD - Static variable in class crawlercommons.domains.EffectiveTldFinder
 

Copyright © 2009–2017 Crawler-Commons. All rights reserved.