- PaidLevelDomain - Class in crawlercommons.domains
-
Routines to extract the PLD (paid-level domain, as per the IRLbot paper) from
a hostname or URL.
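As a rough illustration of the intended use (a minimal sketch; the static getPLD(String) accessor is assumed here and should be checked against the actual class):

import crawlercommons.domains.PaidLevelDomain;

public class PldExample {
    public static void main(String[] args) {
        // Extract the paid-level domain from a hostname
        // (getPLD(String) is assumed to be the accessor).
        String pld = PaidLevelDomain.getPLD("news.blogs.example.co.uk");
        System.out.println(pld); // expected: example.co.uk
    }
}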
- PaidLevelDomain() - Constructor for class crawlercommons.domains.PaidLevelDomain
-
- parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.BaseRobotsParser
-
Parse the robots.txt file in content, and return rules appropriate
for processing paths by userAgent.
- parseContent(String, byte[], String, String) - Method in class crawlercommons.robots.SimpleRobotRulesParser
-
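A minimal sketch of how these parseContent methods are typically called; the BaseRobotRules return type and its isAllowed/getCrawlDelay accessors are assumed from the wider crawler-commons API rather than taken from this index:

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsExample {
    public static void main(String[] args) {
        byte[] content = ("User-agent: *\n"
                        + "Disallow: /private/\n"
                        + "Crawl-delay: 5\n").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // parseContent(url, content, contentType, robotNames)
        BaseRobotRules rules = parser.parseContent("http://example.com/robots.txt",
                        content, "text/plain", "mycrawler");

        System.out.println(rules.isAllowed("http://example.com/private/page.html")); // false
        System.out.println(rules.getCrawlDelay()); // crawl delay, assumed to be reported in milliseconds
    }
}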
- parseSiteMap(URL) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Returns a SiteMap or SiteMapIndex given an online sitemap URL.
Please note that this method goes online, fetches the sitemap, and then
parses it.
This is a convenience method for a user who has a sitemap URL and wants a
"keep it simple" way to parse it.
- parseSiteMap(String, byte[], AbstractSiteMap) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Returns a processed copy of an unprocessed sitemap object.
- parseSiteMap(byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse a sitemap, given the content bytes and the URL.
- parseSiteMap(String, byte[], URL) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse a sitemap, given the MIME type, the content bytes, and the URL.
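For content that has already been fetched, a sketch along the same lines (again assuming an AbstractSiteMap return type; the MIME type, bytes, and URL are placeholders):

import java.net.URL;
import java.nio.charset.StandardCharsets;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMapParser;

public class OfflineSiteMapExample {
    public static void main(String[] args) throws Exception {
        // A sitemap body already downloaded by the crawler.
        byte[] content = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                        + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                        + "<url><loc>http://example.com/</loc></url>"
                        + "</urlset>").getBytes(StandardCharsets.UTF_8);

        SiteMapParser parser = new SiteMapParser();
        // Passing the MIME type lets the parser choose between XML, text and gzip handling.
        AbstractSiteMap sm = parser.parseSiteMap("text/xml", content,
                        new URL("http://example.com/sitemap.xml"));
        System.out.println(sm.getType());
    }
}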
- processGzippedXML(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Decompress the gzipped content and process the resulting XML Sitemap.
- processText(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Process a text-based Sitemap.
- processText(URL, InputStream) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Process a text-based Sitemap.
- processXml(URL, byte[]) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse the given XML content.
- processXml(URL, InputSource) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Parse the given XML content.
- setChangeFrequency(SiteMapURL.ChangeFrequency) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's change frequency.
- setChangeFrequency(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's change frequency.
If the given value is not a valid ChangeFrequency, the frequency in this
instance will be set to null.
- setCrawlDelay(long) - Method in class crawlercommons.robots.BaseRobotRules
-
- setDeferVisits(boolean) - Method in class crawlercommons.robots.BaseRobotRules
-
- setException(UnknownFormatException) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
-
- setLastModified(Date) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setLastModified(String) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setLastModified(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set when this URL was last modified.
- setLastModified(Date) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set when this URL was last modified.
- setPriority(double) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's priority to a value in the range [0.0, 1.0] (the default
priority is used if the given priority is out of range).
- setPriority(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL's priority to a value in the range [0.0, 1.0] (the default
priority is used if the given priority is missing or out of range).
- setProcessed(boolean) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setStrictNamespace(boolean) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
-
Sets whether the parser allows any namespace or only the one from the
specification.
- setStrictNamespace(boolean) - Method in class crawlercommons.sitemaps.SiteMapParser
-
Sets whether the parser allows any namespace or only the one from the
specification.
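For example (a sketch; the flag's default value is not stated in this index):

import crawlercommons.sitemaps.SiteMapParser;

public class NamespaceConfigExample {
    public static void main(String[] args) {
        SiteMapParser parser = new SiteMapParser();
        // Accept only the namespace defined by the sitemaps.org specification,
        // instead of allowing any namespace.
        parser.setStrictNamespace(true);
    }
}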
- setType(AbstractSiteMap.SitemapType) - Method in class crawlercommons.sitemaps.AbstractSiteMap
-
- setUrl(URL) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL.
- setUrl(String) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Set the URL.
- setValid(boolean) - Method in class crawlercommons.sitemaps.SiteMapURL
-
Valid means that the URL follows the official guideline that a sitemap URL
must be located under the base URL.
- SimpleRobotRules - Class in crawlercommons.robots
-
Result from parsing a single robots.txt file - which means we get a set of
rules and a crawl-delay.
- SimpleRobotRules() - Constructor for class crawlercommons.robots.SimpleRobotRules
-
- SimpleRobotRules(SimpleRobotRules.RobotRulesMode) - Constructor for class crawlercommons.robots.SimpleRobotRules
-
- SimpleRobotRules.RobotRule - Class in crawlercommons.robots
-
Single rule that maps from a path prefix to an allow flag.
- SimpleRobotRules.RobotRulesMode - Enum in crawlercommons.robots
-
- SimpleRobotRulesParser - Class in crawlercommons.robots
-
This implementation of BaseRobotsParser retrieves a set of rules for an
agent with the given name from the robots.txt file of a given domain.
- SimpleRobotRulesParser() - Constructor for class crawlercommons.robots.SimpleRobotRulesParser
-
- SITEMAP - Static variable in class crawlercommons.sitemaps.Namespace
-
- SiteMap - Class in crawlercommons.sitemaps
-
- SiteMap() - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(URL) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(String) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(URL, Date) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMap(String, String) - Constructor for class crawlercommons.sitemaps.SiteMap
-
- SiteMapIndex - Class in crawlercommons.sitemaps
-
- SiteMapIndex() - Constructor for class crawlercommons.sitemaps.SiteMapIndex
-
- SiteMapIndex(URL) - Constructor for class crawlercommons.sitemaps.SiteMapIndex
-
- SiteMapParser - Class in crawlercommons.sitemaps
-
- SiteMapParser() - Constructor for class crawlercommons.sitemaps.SiteMapParser
-
- SiteMapParser(boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParser
-
- SiteMapParser(boolean, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapParser
-
- SiteMapTester - Class in crawlercommons.sitemaps
-
Sitemap tool for recursively fetching all URLs from a sitemap (and all of
its children).
- SiteMapTester() - Constructor for class crawlercommons.sitemaps.SiteMapTester
-
- SiteMapURL - Class in crawlercommons.sitemaps
-
The SiteMapURL class represents a URL found in a Sitemap.
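A rough usage sketch built from the constructors and setters listed in this index (the getUrl/getPriority accessors are assumed):

import crawlercommons.sitemaps.SiteMapURL;

public class SiteMapUrlExample {
    public static void main(String[] args) {
        // The second argument marks the URL as valid, i.e. located under the base URL.
        SiteMapURL u = new SiteMapURL("http://example.com/page.html", true);
        u.setPriority(0.8);                              // out-of-range values fall back to the default priority
        u.setChangeFrequency("daily");                   // an unrecognized value resets the frequency to null
        u.setLastModified("2004-10-01T18:23:17+00:00");  // W3C datetime string, as used in sitemaps

        System.out.println(u.getUrl() + " " + u.getPriority());
    }
}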
- SiteMapURL(String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL(URL, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL(String, String, String, String, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL(URL, Date, SiteMapURL.ChangeFrequency, double, boolean) - Constructor for class crawlercommons.sitemaps.SiteMapURL
-
- SiteMapURL.ChangeFrequency - Enum in crawlercommons.sitemaps
-
Allowed change frequencies
- sortRules() - Method in class crawlercommons.robots.SimpleRobotRules
-
In order to match up with Google's convention, we want to match rules
from longest to shortest.
- startElement(String, String, String, Attributes) - Method in class crawlercommons.sitemaps.sax.DelegatorHandler
-
- strict - Variable in class crawlercommons.sitemaps.SiteMapParser
-
True (by default), meaning that invalid URLs should be rejected, as the
official docs allow sitemap URLs only under the base URL:
http://www.sitemaps.org/protocol.html#location
- strictNamespace - Variable in class crawlercommons.sitemaps.SiteMapParser
-
Indicates whether the parser should work with the namespace from the
specifications or any namespace.