A B C D E F G H I L M N P R S T U V W _ 

A

AbstractSiteMap - Class in crawlercommons.sitemaps
SiteMap or SiteMapIndex
AbstractSiteMap() - Constructor for class crawlercommons.sitemaps.AbstractSiteMap
 
AbstractSiteMap.SitemapType - Enum in crawlercommons.sitemaps
Various Sitemap types
acceptedNamespaces - Variable in class crawlercommons.sitemaps.SiteMapParser
Set of namespaces accepted by the parser (used if SiteMapParser.strictNamespace is set).
addAcceptedNamespace(String) - Method in class crawlercommons.sitemaps.SiteMapParser
Add namespace URI to set of accepted namespaces.
addAcceptedNamespace(String[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Add namespace URIs to set of accepted namespaces.
addChild(char, V) - Method in class crawlercommons.domains.SuffixTrie.Node
 
addRule(String, boolean) - Method in class crawlercommons.robots.SimpleRobotRules
 
addSitemap(String) - Method in class crawlercommons.robots.BaseRobotRules
 
addSitemap(AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapIndex
Add this Sitemap to the list of Sitemaps.
addSiteMapUrl(SiteMapURL) - Method in class crawlercommons.sitemaps.SiteMap
 
ATOM_0_3 - Static variable in class crawlercommons.sitemaps.Namespace
 
ATOM_1_0 - Static variable in class crawlercommons.sitemaps.Namespace
 

B

BaseRobotRules - Class in crawlercommons.robots
Result from parsing a single robots.txt file - which means we get a set of rules, and a crawl-delay.
BaseRobotRules() - Constructor for class crawlercommons.robots.BaseRobotRules
 
BaseRobotsParser - Class in crawlercommons.robots
 
BaseRobotsParser() - Constructor for class crawlercommons.robots.BaseRobotsParser
 
BasicURLNormalizer - Class in crawlercommons.filters.basic
Code borrowed from Apache Nutch.
BasicURLNormalizer() - Constructor for class crawlercommons.filters.basic.BasicURLNormalizer
 

C

characters(char[], int, int) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
clearRules() - Method in class crawlercommons.robots.SimpleRobotRules
 
COMMENT - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
compareTo(SimpleRobotRules.RobotRule) - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
contains(String) - Method in class crawlercommons.domains.SuffixTrie
Checks whether trie contains a suffix string.
convertToDate(String) - Static method in class crawlercommons.sitemaps.AbstractSiteMap
Convert the given date string (in an accepted DateFormat) to a Date; returns null if the date is not in a correct format.
crawlercommons - package crawlercommons
 
CrawlerCommons - Class in crawlercommons
 
CrawlerCommons() - Constructor for class crawlercommons.CrawlerCommons
 
crawlercommons.domains - package crawlercommons.domains
Classes contained within the domains package relate to the definition of Top Level Domains, various domain registrars and the effective handling of such domains.
crawlercommons.filters - package crawlercommons.filters
The filters package contains code and resources for URL filtering.
crawlercommons.filters.basic - package crawlercommons.filters.basic
 
crawlercommons.mimetypes - package crawlercommons.mimetypes
 
crawlercommons.robots - package crawlercommons.robots
The robots package contains all of the robots.txt rule inference, parsing and utilities contained within Crawler Commons.
crawlercommons.sitemaps - package crawlercommons.sitemaps
The sitemaps package provides all classes relevant to focused sitemap parsing, URL definition and processing.
crawlercommons.sitemaps.sax - package crawlercommons.sitemaps.sax
 
currentElement() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
currentElementParent() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 

D

DEFAULT_PRIORITY - Static variable in class crawlercommons.sitemaps.SiteMapURL
 
DelegatorHandler - Class in crawlercommons.sitemaps.sax
Provides a base SAX handler for parsing of XML documents representing sub-classes of AbstractSiteMap.
DelegatorHandler(LinkedList<String>, boolean) - Constructor for class crawlercommons.sitemaps.sax.DelegatorHandler
 
DelegatorHandler(URL, boolean) - Constructor for class crawlercommons.sitemaps.sax.DelegatorHandler
 
detect(byte[]) - Method in class crawlercommons.mimetypes.MimeTypeDetector
 
detect(byte[], int) - Method in class crawlercommons.mimetypes.MimeTypeDetector
 
detect(InputStream) - Method in class crawlercommons.mimetypes.MimeTypeDetector
 
DOT - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
DOT_REGEX - Static variable in class crawlercommons.domains.EffectiveTldFinder
 

E

EffectiveTLD(String, boolean) - Constructor for class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
EffectiveTldFinder - Class in crawlercommons.domains
To determine the actual domain name of a host name or URL requires knowledge of the various domain registrars and their assignment policies.
EffectiveTldFinder.EffectiveTLD - Class in crawlercommons.domains
 
EMPTY - Static variable in class crawlercommons.sitemaps.Namespace
In contradiction to the protocol specification ("The Sitemap must ...
endElement(String, String, String) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
equals(Object) - Method in class crawlercommons.robots.BaseRobotRules
 
equals(Object) - Method in class crawlercommons.robots.SimpleRobotRules
 
equals(Object) - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
equals(Object) - Method in class crawlercommons.sitemaps.SiteMapURL
 
error(SAXParseException) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
ETLD_DATA - Static variable in class crawlercommons.domains.EffectiveTldFinder
 
EXCEPTION - Static variable in class crawlercommons.domains.EffectiveTldFinder
 

F

failedFetch(int) - Method in class crawlercommons.robots.BaseRobotsParser
The fetch of robots.txt failed, so return rules appropriate given the HTTP status code.
failedFetch(int) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
fatalError(SAXParseException) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
filter(String) - Method in class crawlercommons.filters.basic.BasicURLNormalizer
 
filter(String) - Method in class crawlercommons.filters.URLFilter
Returns a modified version of the input URL or null if the URL should be removed
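
The filter(String) contract described above (return a modified URL, or null when the URL should be removed) can be illustrated with a small stand-alone sketch. The class UrlFilterSketch and its rules (lowercase scheme and host, drop the fragment, reject non-HTTP(S) URLs) are hypothetical and use only the JDK; this is not how BasicURLNormalizer or URLFilter is implemented in Crawler-Commons.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical filter, for illustration only (not part of Crawler-Commons). */
class UrlFilterSketch {

    /** Returns a modified version of the input URL, or null if it should be removed. */
    static String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        URI uri;
        try {
            uri = new URI(urlString.trim());
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are removed
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return null; // relative or opaque URIs (e.g. mailto:) are removed
        }
        scheme = scheme.toLowerCase(Locale.ROOT);
        if (!scheme.equals("http") && !scheme.equals("https")) {
            return null;
        }
        try {
            // Rebuild the URL with lowercased scheme and host and without the fragment.
            return new URI(scheme, uri.getUserInfo(), host.toLowerCase(Locale.ROOT),
                    uri.getPort(), uri.getPath(), uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(filter("HTTP://Example.COM/a?b=1#frag"));
        System.out.println(filter("mailto:someone@example.com"));
    }
}
```

A caller would typically skip any URL for which the filter returns null and continue with the returned string otherwise.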

G

get(String) - Method in class crawlercommons.domains.SuffixTrie
Get value associated with suffix string in trie.
getAssignedDomain(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
This method uses the effective TLD to determine which component of a FQDN is the NIC-assigned domain name (aka "Paid Level Domain").
getAssignedDomain(String, boolean) - Static method in class crawlercommons.domains.EffectiveTldFinder
This method uses the effective TLD to determine which component of a FQDN is the NIC-assigned domain name (aka "Paid Level Domain").
getAssignedDomain(String, boolean, boolean) - Static method in class crawlercommons.domains.EffectiveTldFinder
This method uses the effective TLD to determine which component of a FQDN is the NIC-assigned domain name.
getBaseUrl() - Method in class crawlercommons.sitemaps.SiteMap
 
getChangeFrequency() - Method in class crawlercommons.sitemaps.SiteMapURL
Return the URL's change frequency
getChild(char) - Method in class crawlercommons.domains.SuffixTrie.Node
 
getCrawlDelay() - Method in class crawlercommons.robots.BaseRobotRules
 
getDomain() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
getEffectiveTLD(String) - Static method in class crawlercommons.domains.EffectiveTldFinder
Get EffectiveTLD for host name using the singleton instance of EffectiveTldFinder.
getEffectiveTLD(String, boolean) - Static method in class crawlercommons.domains.EffectiveTldFinder
Get EffectiveTLD for host name using the singleton instance of EffectiveTldFinder.
getEffectiveTLDs() - Static method in class crawlercommons.domains.EffectiveTldFinder
 
getException() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
getFullDateFormat() - Static method in class crawlercommons.sitemaps.AbstractSiteMap
 
getInstance() - Static method in class crawlercommons.domains.EffectiveTldFinder
Get singleton instance of EffectiveTldFinder with default configuration.
getLastModified() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getLastModified() - Method in class crawlercommons.sitemaps.SiteMapURL
Return when this URL was last modified.
getLongestSuffix(String) - Method in class crawlercommons.domains.SuffixTrie
Match the longest suffix of a string contained in trie.
getMaxCrawlDelay() - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
getMaxWarnings() - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
getNameVariants() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
Generate name variants caused by Internationalized Domain Names: every IDN part of a eTLD can be replaced by its punycoded ASCII variant.
getNumWarnings() - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
getPLD(String) - Static method in class crawlercommons.domains.PaidLevelDomain
Extract the PLD (paid-level domain) from the hostname.
getPLD(URL) - Static method in class crawlercommons.domains.PaidLevelDomain
Extract the PLD (paid-level domain) from the URL.
getPrefix() - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
getPriority() - Method in class crawlercommons.sitemaps.SiteMapURL
Return this URL's priority (a value between [0.0 - 1.0]).
getRobotRules() - Method in class crawlercommons.robots.SimpleRobotRules
 
getSiteMap() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
getSitemap(URL) - Method in class crawlercommons.sitemaps.SiteMapIndex
Returns the Sitemap that has the given URL.
getSitemaps() - Method in class crawlercommons.robots.BaseRobotRules
 
getSitemaps() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
getSiteMapUrls() - Method in class crawlercommons.sitemaps.SiteMap
 
getSuffixes(String) - Method in class crawlercommons.domains.SuffixTrie
Match all suffixes of a string contained in trie.
getType() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getUrl() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
getUrl() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
getUrl() - Method in class crawlercommons.sitemaps.SiteMapURL
Return the URL.
getVersion() - Static method in class crawlercommons.CrawlerCommons
 

H

hashCode() - Method in class crawlercommons.robots.BaseRobotRules
 
hashCode() - Method in class crawlercommons.robots.SimpleRobotRules
 
hashCode() - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
hashCode() - Method in class crawlercommons.sitemaps.SiteMapURL
 
hasUnprocessedSitemap() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
help() - Static method in class crawlercommons.domains.EffectiveTldFinder
 

I

IMAGE - Static variable in class crawlercommons.sitemaps.Namespace
 
initialize(InputStream) - Method in class crawlercommons.domains.EffectiveTldFinder
(Re)initialize EffectiveTldFinder with custom public suffix list.
isAcceptedNamespace(String) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
isAllow() - Method in class crawlercommons.robots.SimpleRobotRules.RobotRule
 
isAllowAll() - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowAll() - Method in class crawlercommons.robots.SimpleRobotRules
Is our ruleset set up to allow all access?
isAllowed(String) - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowed(String) - Method in class crawlercommons.robots.SimpleRobotRules
 
isAllowNone() - Method in class crawlercommons.robots.BaseRobotRules
 
isAllowNone() - Method in class crawlercommons.robots.SimpleRobotRules
Is our ruleset set up to disallow all access?
isConfigured() - Method in class crawlercommons.domains.EffectiveTldFinder
 
isDeferVisits() - Method in class crawlercommons.robots.BaseRobotRules
 
isException() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
isGzip(String) - Method in class crawlercommons.mimetypes.MimeTypeDetector
 
isIndex() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
isIndex() - Method in class crawlercommons.sitemaps.SiteMap
 
isIndex() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
isProcessed() - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
isStrict() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
isStrict() - Method in class crawlercommons.sitemaps.SiteMapParser
 
isStrictNamespace() - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
isStrictNamespace() - Method in class crawlercommons.sitemaps.SiteMapParser
 
isSupported(String) - Static method in class crawlercommons.sitemaps.Namespace
 
isText(String) - Method in class crawlercommons.mimetypes.MimeTypeDetector
 
isValid() - Method in class crawlercommons.sitemaps.SiteMapURL
Is the SiteMapURL under the base URL?
isWild() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
isXml(String) - Method in class crawlercommons.mimetypes.MimeTypeDetector
 

L

LINKS - Static variable in class crawlercommons.sitemaps.Namespace
 
LOG - Static variable in class crawlercommons.filters.basic.BasicURLNormalizer
 
LOG - Static variable in class crawlercommons.sitemaps.SiteMapParser
 
LookupResult(int, V) - Constructor for class crawlercommons.domains.SuffixTrie.LookupResult
 

M

main(String[]) - Static method in class crawlercommons.domains.EffectiveTldFinder
 
main(String[]) - Static method in class crawlercommons.filters.basic.BasicURLNormalizer
 
main(String[]) - Static method in class crawlercommons.robots.SimpleRobotRulesParser
 
main(String[]) - Static method in class crawlercommons.sitemaps.SiteMapTester
 
MAX_BYTES_ALLOWED - Static variable in class crawlercommons.sitemaps.SiteMapParser
Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov.
MimeTypeDetector - Class in crawlercommons.mimetypes
 
MimeTypeDetector() - Constructor for class crawlercommons.mimetypes.MimeTypeDetector
 

N

Namespace - Class in crawlercommons.sitemaps
Supported sitemap formats: https://www.sitemaps.org/protocol.html#otherformats
Namespace() - Constructor for class crawlercommons.sitemaps.Namespace
 
NEWS - Static variable in class crawlercommons.sitemaps.Namespace
 
nextUnprocessedSitemap() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
Node() - Constructor for class crawlercommons.domains.SuffixTrie.Node
 
normalize(String, byte[]) - Method in class crawlercommons.mimetypes.MimeTypeDetector
 
normalizeRSSTimestamp(String) - Static method in class crawlercommons.sitemaps.AbstractSiteMap
Converts an RSS pubDate to a string representation that can be parsed by the AbstractSiteMap.convertToDate(String) method.

P

PaidLevelDomain - Class in crawlercommons.domains
Routines to extract the PLD (paid-level domain, as per the IRLbot paper) from a hostname or URL.
PaidLevelDomain() - Constructor for class crawlercommons.domains.PaidLevelDomain
 
parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.BaseRobotsParser
Parse the robots.txt file in content, and return rules appropriate for processing paths by userAgent.
parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
parseSiteMap(URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Returns a SiteMap or SiteMapIndex given an online sitemap URL. Please note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it.
parseSiteMap(String, byte[], AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapParser
Returns a processed copy of an unprocessed sitemap object, i.e.
parseSiteMap(byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse a sitemap, given the content bytes and the URL.
parseSiteMap(String, byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse a sitemap, given the MIME type, the content bytes, and the URL.
processGzippedXML(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Decompress the gzipped content and process the resulting XML Sitemap.
processText(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Process a text-based Sitemap.
processText(URL, InputStream) - Method in class crawlercommons.sitemaps.SiteMapParser
Process a text-based Sitemap.
processXml(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the given XML content.
processXml(URL, InputSource) - Method in class crawlercommons.sitemaps.SiteMapParser
Parse the given XML content.
put(String, V) - Method in class crawlercommons.domains.SuffixTrie
Insert a string and an associated value into the trie.
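
The SuffixTrie entries in this index (put, get, contains, getLongestSuffix, getSuffixes) describe suffix lookups in a trie. As a rough illustration of the idea, the sketch below stores strings character by character starting from the end, so that a lookup walks the query string backwards and remembers the longest stored suffix seen. The class SuffixTrieSketch and its method longestSuffix are hypothetical names; this is not the Crawler-Commons implementation and it ignores details such as domain label boundaries.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal suffix trie, for illustration only. */
class SuffixTrieSketch<V> {

    private static final class Node<V> {
        final Map<Character, Node<V>> children = new HashMap<>();
        V value; // non-null only where a stored suffix ends
    }

    private final Node<V> root = new Node<>();

    /** Insert a suffix and its value, walking the string from last to first character. */
    void put(String suffix, V value) {
        Node<V> node = root;
        for (int i = suffix.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(suffix.charAt(i), c -> new Node<>());
        }
        node.value = value;
    }

    /** Returns the longest stored suffix of s, or null if no stored suffix matches. */
    String longestSuffix(String s) {
        Node<V> node = root;
        int bestStart = -1;
        for (int i = s.length() - 1; i >= 0; i--) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                break; // no stored suffix continues past this character
            }
            if (node.value != null) {
                bestStart = i; // a stored suffix ends here; keep the longest so far
            }
        }
        return bestStart < 0 ? null : s.substring(bestStart);
    }

    public static void main(String[] args) {
        SuffixTrieSketch<Boolean> trie = new SuffixTrieSketch<>();
        trie.put(".uk", true);
        trie.put(".co.uk", true);
        System.out.println(trie.longestSuffix("www.example.co.uk"));
        System.out.println(trie.longestSuffix("example.org"));
    }
}
```

Storing characters in reverse order means that all strings sharing a suffix share a path from the root, which is what makes the longest-suffix lookup a single backward pass over the input.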

R

RobotRule(String, boolean) - Constructor for class crawlercommons.robots.SimpleRobotRules.RobotRule
 
root - Variable in class crawlercommons.domains.SuffixTrie
 
RSS_2_0 - Static variable in class crawlercommons.sitemaps.Namespace
RSS and Atom sitemap formats do not have strict definition.

S

setAcceptedNamespaces(Set<String>) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
setChangeFrequency(SiteMapURL.ChangeFrequency) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's change frequency
setChangeFrequency(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's change frequency. In case of a bad ChangeFrequency, the current frequency in this instance will be set to null.
setCrawlDelay(long) - Method in class crawlercommons.robots.BaseRobotRules
 
setDeferVisits(boolean) - Method in class crawlercommons.robots.BaseRobotRules
 
setException(UnknownFormatException) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
setLastModified(Date) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setLastModified(String) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setLastModified(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set when this URL was last modified.
setLastModified(Date) - Method in class crawlercommons.sitemaps.SiteMapURL
Set when this URL was last modified.
setMaxCrawlDelay(long) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
setMaxWarnings(int) - Method in class crawlercommons.robots.SimpleRobotRulesParser
 
setPriority(double) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's priority to a value between [0.0 - 1.0] (Default Priority is used if the given priority is out of range).
setPriority(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL's priority to a value between [0.0 - 1.0] (Default Priority is used if the given priority is missing or out of range).
setProcessed(boolean) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setStrictNamespace(boolean) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
setStrictNamespace(boolean) - Method in class crawlercommons.sitemaps.SiteMapParser
Sets the parser to allow any namespace or just the one from the specification, or any accepted namespace (see SiteMapParser.addAcceptedNamespace(String)).
setType(AbstractSiteMap.SitemapType) - Method in class crawlercommons.sitemaps.AbstractSiteMap
 
setUrl(URL) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL.
setUrl(String) - Method in class crawlercommons.sitemaps.SiteMapURL
Set the URL.
setValid(boolean) - Method in class crawlercommons.sitemaps.SiteMapURL
Valid means that it follows the official guidelines that the SiteMapURL must be under the base URL.
SimpleRobotRules - Class in crawlercommons.robots
Result from parsing a single robots.txt file - which means we get a set of rules, and a crawl-delay.
SimpleRobotRules() - Constructor for class crawlercommons.robots.SimpleRobotRules
 
SimpleRobotRules(SimpleRobotRules.RobotRulesMode) - Constructor for class crawlercommons.robots.SimpleRobotRules
 
SimpleRobotRules.RobotRule - Class in crawlercommons.robots
Single rule that maps from a path prefix to an allow flag.
SimpleRobotRules.RobotRulesMode - Enum in crawlercommons.robots
 
SimpleRobotRulesParser - Class in crawlercommons.robots
This implementation of BaseRobotsParser retrieves a set of rules for an agent with the given name from the robots.txt file of a given domain.
SimpleRobotRulesParser() - Constructor for class crawlercommons.robots.SimpleRobotRulesParser
 
SimpleRobotRulesParser(long, int) - Constructor for class crawlercommons.robots.SimpleRobotRulesParser
 
SITEMAP - Static variable in class crawlercommons.sitemaps.Namespace
 
SiteMap - Class in crawlercommons.sitemaps
 
SiteMap() - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(URL) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(String) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(URL, Date) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SiteMap(String, String) - Constructor for class crawlercommons.sitemaps.SiteMap
 
SITEMAP_LEGACY - Static variable in class crawlercommons.sitemaps.Namespace
Legacy schema URIs from prior sitemap protocol versions and frequent variants.
SITEMAP_SUPPORTED_NAMESPACES - Static variable in class crawlercommons.sitemaps.Namespace
 
SiteMapIndex - Class in crawlercommons.sitemaps
 
SiteMapIndex() - Constructor for class crawlercommons.sitemaps.SiteMapIndex
 
SiteMapIndex(URL) - Constructor for class crawlercommons.sitemaps.SiteMapIndex
 
SiteMapParser - Class in crawlercommons.sitemaps
 
SiteMapParser() - Constructor for class crawlercommons.sitemaps.SiteMapParser
 
SiteMapParser(boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParser
 
SiteMapParser(boolean, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParser
 
SiteMapTester - Class in crawlercommons.sitemaps
Sitemap tool for recursively fetching all URLs from a sitemap (and all of its children).
SiteMapTester() - Constructor for class crawlercommons.sitemaps.SiteMapTester
 
SiteMapURL - Class in crawlercommons.sitemaps
The SiteMapURL class represents a URL found in a Sitemap.
SiteMapURL(String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(URL, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(String, String, String, String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL(URL, Date, SiteMapURL.ChangeFrequency, double, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
 
SiteMapURL.ChangeFrequency - Enum in crawlercommons.sitemaps
Allowed change frequencies
sortRules() - Method in class crawlercommons.robots.SimpleRobotRules
In order to match up with Google's convention, we want to match rules from longest to shortest.
startElement(String, String, String, Attributes) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
 
strict - Variable in class crawlercommons.sitemaps.SiteMapParser
True (the default) means that invalid URLs are rejected, since the official specification only allows sitemap URLs located under the base URL: http://www.sitemaps.org/protocol.html#location
strictNamespace - Variable in class crawlercommons.sitemaps.SiteMapParser
Indicates whether the parser should work with the namespace from the specifications or any namespace.
SuffixTrie<V> - Class in crawlercommons.domains
 
SuffixTrie() - Constructor for class crawlercommons.domains.SuffixTrie
 
SuffixTrie.LookupResult<V> - Class in crawlercommons.domains
Wrapper for results when a string is checked for suffixes contained in the suffix trie.
SuffixTrie.Node<V> - Class in crawlercommons.domains
 
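
The sortRules() entry in this section says that rules are matched from longest to shortest. A minimal sketch of that idea follows, assuming plain path-prefix rules without wildcards: rules are sorted by descending prefix length and the first rule whose prefix matches decides the outcome. RuleMatchSketch and its nested Rule class are hypothetical and are not the SimpleRobotRules implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical longest-prefix-first rule matcher, for illustration only. */
class RuleMatchSketch {

    static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String prefix, boolean allow) {
        rules.add(new Rule(prefix, allow));
    }

    /** Order rules so that longer (more specific) prefixes are tested first. */
    void sortRules() {
        rules.sort(Comparator.comparingInt((Rule r) -> r.prefix.length()).reversed());
    }

    /** The first matching rule wins; a path that matches no rule is allowed. */
    boolean isAllowed(String path) {
        for (Rule r : rules) {
            if (path.startsWith(r.prefix)) {
                return r.allow;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RuleMatchSketch rules = new RuleMatchSketch();
        rules.addRule("/private", false);
        rules.addRule("/private/public", true);
        rules.sortRules();
        System.out.println(rules.isAllowed("/private/public/page.html"));
        System.out.println(rules.isAllowed("/private/secret.html"));
    }
}
```

Sorting once after all rules are added keeps each lookup a simple linear scan in which the most specific rule is always reached before a shorter, more general one.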

T

toString() - Method in class crawlercommons.domains.EffectiveTldFinder.EffectiveTLD
 
toString() - Method in class crawlercommons.robots.BaseRobotRules
 
toString() - Method in class crawlercommons.robots.SimpleRobotRules
 
toString() - Method in class crawlercommons.sitemaps.SiteMap
 
toString() - Method in class crawlercommons.sitemaps.SiteMapIndex
 
toString() - Method in class crawlercommons.sitemaps.SiteMapURL
 

U

UnknownFormatException - Exception in crawlercommons.sitemaps
 
UnknownFormatException() - Constructor for exception crawlercommons.sitemaps.UnknownFormatException
 
UnknownFormatException(String) - Constructor for exception crawlercommons.sitemaps.UnknownFormatException
 
UnknownFormatException(String, Throwable) - Constructor for exception crawlercommons.sitemaps.UnknownFormatException
 
UNSET_CRAWL_DELAY - Static variable in class crawlercommons.robots.BaseRobotRules
 
url - Variable in class crawlercommons.sitemaps.AbstractSiteMap
 
URLFilter - Class in crawlercommons.filters
 
URLFilter() - Constructor for class crawlercommons.filters.URLFilter
 
urlIsValid(String, String) - Static method in class crawlercommons.sitemaps.SiteMapParser
See if testUrl is under sitemapBaseUrl.
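
The check described by urlIsValid (whether testUrl is under sitemapBaseUrl) can be sketched as a prefix comparison against the directory that contains the sitemap. The class UrlUnderBaseSketch and the helpers baseOf and isUnderBase are hypothetical, the inputs are assumed to be absolute URLs, and the actual SiteMapParser logic may differ (for example in how it normalizes URLs).

```java
/** Hypothetical "is this URL under the sitemap's base URL" check, for illustration only. */
class UrlUnderBaseSketch {

    /** Base location of a sitemap: its URL up to and including the last '/'. */
    static String baseOf(String sitemapUrl) {
        int schemeEnd = sitemapUrl.indexOf("://");
        int lastSlash = sitemapUrl.lastIndexOf('/');
        // No path component (e.g. "http://example.com"): the base is the host root.
        if (schemeEnd >= 0 && lastSlash <= schemeEnd + 2) {
            return sitemapUrl + "/";
        }
        return sitemapUrl.substring(0, lastSlash + 1);
    }

    /** True if testUrl starts with the given base URL. */
    static boolean isUnderBase(String sitemapBaseUrl, String testUrl) {
        return sitemapBaseUrl != null && testUrl != null && testUrl.startsWith(sitemapBaseUrl);
    }

    public static void main(String[] args) {
        String base = baseOf("http://example.com/catalog/sitemap.xml");
        System.out.println(base);
        System.out.println(isUnderBase(base, "http://example.com/catalog/show?item=1"));
        System.out.println(isUnderBase(base, "http://example.com/images/logo.png"));
    }
}
```

This mirrors the location rule of the sitemaps protocol, where a sitemap at http://example.com/catalog/sitemap.xml may list URLs starting with http://example.com/catalog/ but not URLs elsewhere on the host.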

V

valueOf(String) - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
Returns the enum constant of this type with the specified name.
values() - Static method in enum crawlercommons.robots.SimpleRobotRules.RobotRulesMode
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.sitemaps.AbstractSiteMap.SitemapType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum crawlercommons.sitemaps.SiteMapURL.ChangeFrequency
Returns an array containing the constants of this enum type, in the order they are declared.
VIDEO - Static variable in class crawlercommons.sitemaps.Namespace
 

W

walkSiteMap(URL, Consumer<SiteMapURL>) - Method in class crawlercommons.sitemaps.SiteMapParser
Fetch a sitemap from the specified URL, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
walkSiteMap(AbstractSiteMap, Consumer<SiteMapURL>) - Method in class crawlercommons.sitemaps.SiteMapParser
Traverse a sitemap, recursively fetching and traversing the content of any enclosed sitemap index, and performing the specified action for each sitemap URL until all URLs have been processed or the action throws an exception.
WILD_CARD - Static variable in class crawlercommons.domains.EffectiveTldFinder
 

_

_mode - Variable in class crawlercommons.robots.SimpleRobotRules
 
_rules - Variable in class crawlercommons.robots.SimpleRobotRules
 

Copyright © 2009–2018 Crawler-Commons. All rights reserved.