public class SiteMapParser extends Object
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
static int | MAX_BYTES_ALLOWED - Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. |
protected boolean | strict - True by default, meaning that invalid URLs are rejected, because the official protocol only allows sitemap URLs located under the base URL: http://www.sitemaps.org/protocol.html#location |
protected boolean | strictNamespace - Indicates whether the parser accepts only the namespace from the specification or any namespace. |
Constructor and Description |
---|
SiteMapParser() |
SiteMapParser(boolean strict) |
SiteMapParser(boolean strict, boolean allowPartial) |
Modifier and Type | Method and Description |
---|---|
protected void | addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex) - Adds the given URL to the given sitemap and logs the relevant messages. |
boolean | isStrict() |
boolean | isStrictNamespace() |
AbstractSiteMap | parseSiteMap(byte[] content, URL url) - Parse a sitemap, given the content bytes and the URL. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) - Returns a processed copy of an unprocessed sitemap object, i.e. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, URL url) - Parse a sitemap, given the MIME type, the content bytes, and the URL. |
AbstractSiteMap | parseSiteMap(URL onlineSitemapUrl) - Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it. |
protected AbstractSiteMap | processGzippedXML(URL url, byte[] response) - Decompress the gzipped content and process the resulting XML Sitemap. |
protected SiteMap | processText(URL sitemapUrl, byte[] content) - Process a text-based Sitemap. |
protected SiteMap | processText(URL sitemapUrl, InputStream stream) - Process a text-based Sitemap. |
protected AbstractSiteMap | processXml(URL sitemapUrl, byte[] xmlContent) - Parse the given XML content. |
protected AbstractSiteMap | processXml(URL sitemapUrl, InputSource is) - Parse the given XML content. |
void | setStrictNamespace(boolean s) - Sets the parser to allow any namespace or just the one from the specification. |
static boolean | urlIsValid(String sitemapBaseUrl, String testUrl) - See if testUrl is under sitemapBaseUrl. |
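The check that urlIsValid describes (whether testUrl is under sitemapBaseUrl) can be illustrated with plain string handling. The following is only a minimal sketch, assuming a prefix comparison against the directory part of the base URL. The class and method names are hypothetical, and the library's actual rules may differ.

```java
final class UrlScopeSketch {

    /**
     * Hypothetical helper, not the library's code: returns true when
     * testUrl starts with the directory portion of sitemapBaseUrl.
     */
    static boolean isUnderBase(String sitemapBaseUrl, String testUrl) {
        if (sitemapBaseUrl == null || testUrl == null) {
            return false;
        }
        // Locate where the host/path part begins so the "//" of the scheme is not mistaken for a path slash.
        int schemeEnd = sitemapBaseUrl.indexOf("://");
        int pathStart = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int lastSlash = sitemapBaseUrl.lastIndexOf('/');
        // Drop the file name: ".../catalog/sitemap.xml" becomes ".../catalog/".
        String baseDir = lastSlash >= pathStart
                ? sitemapBaseUrl.substring(0, lastSlash + 1)
                : sitemapBaseUrl + "/";
        return testUrl.startsWith(baseDir);
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/catalog/sitemap.xml";
        System.out.println(isUnderBase(base, "http://www.example.com/catalog/item?id=1"));
        System.out.println(isUnderBase(base, "http://www.example.com/images/logo.png"));
    }
}
```

With strict mode in mind, a URL that fails a check of this kind would be treated as invalid for the sitemap at that location.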
public static final org.slf4j.Logger LOG
public static final int MAX_BYTES_ALLOWED
protected boolean strict
protected boolean strictNamespace
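As a small illustration of the limit that MAX_BYTES_ALLOWED refers to, the sketch below computes 50MB in bytes and checks a content length against it. The class, constant, and method names are hypothetical, and how the parser itself enforces the limit is not shown here.

```java
final class SizeLimitSketch {

    // 50 MB expressed in bytes: 50 * 1024 * 1024 = 52,428,800.
    static final int MAX_BYTES = 50 * 1024 * 1024;

    /** Hypothetical check: true when the content length fits within the 50MB limit. */
    static boolean withinLimit(long contentLength) {
        return contentLength >= 0 && contentLength <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(MAX_BYTES);
        System.out.println(withinLimit(1_000_000L));
    }
}
```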
public SiteMapParser()
public SiteMapParser(boolean strict)
public SiteMapParser(boolean strict, boolean allowPartial)
public boolean isStrict()
public boolean isStrictNamespace()
public void setStrictNamespace(boolean s)
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
Parameters:
onlineSitemapUrl - URL of the online sitemap
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
Parameters:
contentType - MIME type of content
content - raw bytes of sitemap file
sitemap - an AbstractSiteMap implementation
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
Parameters:
content - raw bytes of sitemap file
url - URL to sitemap file
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
Parameters:
contentType - MIME type of content
content - raw bytes of sitemap file
url - URL to sitemap file
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
Parameters:
sitemapUrl - URL to sitemap file
xmlContent - the byte[] backing the sitemapUrl
Throws:
UnknownFormatException - if there is an error parsing the sitemap
protected SiteMap processText(URL sitemapUrl, byte[] content) throws IOException
Parameters:
sitemapUrl - URL to sitemap file
content - the byte[] backing the sitemapUrl
Throws:
IOException - if there is an error reading in the site map content
protected SiteMap processText(URL sitemapUrl, InputStream stream) throws IOException
Parameters:
sitemapUrl - URL to sitemap file
stream - content stream
Throws:
IOException - if there is an error reading in the site map content
protected AbstractSiteMap processGzippedXML(URL url, byte[] response) throws IOException, UnknownFormatException
Parameters:
url - URL of the gzipped content
response - Gzipped content
Throws:
UnknownFormatException - if there is an error parsing the gzip
IOException - if there is an error reading in the gzip URL
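The decompression step that processGzippedXML describes can be sketched with the JDK's java.util.zip classes. This is an illustrative sketch only, not the library's implementation. The class GunzipSketch and its methods are hypothetical, and the gzip helper exists only to produce sample input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GunzipSketch {

    /** Hypothetical helper: inflates gzipped bytes into the raw XML bytes. */
    static byte[] gunzip(byte[] gzipped) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            // Copy until the gzip stream is exhausted.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Hypothetical helper used only to build sample input for the demo. */
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<urlset></urlset>".getBytes(StandardCharsets.UTF_8);
        byte[] inflated = gunzip(gzip(xml));
        System.out.println(new String(inflated, StandardCharsets.UTF_8));
    }
}
```

The inflated bytes would then go through the same path as uncompressed XML content, for example a call such as processXml(URL, byte[]).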
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
Parameters:
sitemapUrl - a sitemap URL
is - an InputSource backing the sitemap
Throws:
UnknownFormatException - if there is an error parsing the InputSource
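To make the idea of parsing XML from an InputSource more concrete, and to show what a strict versus lenient namespace setting could mean in practice, here is a minimal sketch built on the JDK's DOM parser. It is not the library's implementation. XmlSitemapSketch and extractLocs are hypothetical names, and the sketch only collects loc values instead of building SiteMap objects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XmlSitemapSketch {

    // Namespace defined by the sitemaps.org protocol.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Hypothetical helper: returns the text of every loc element in the document. */
    static List<String> extractLocs(InputSource is, boolean strictNamespace)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations to avoid external-entity (XXE) processing.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(is);

        // Strict: only elements in the protocol namespace. Lenient: "*" matches any namespace.
        String ns = strictNamespace ? SITEMAP_NS : "*";
        NodeList nodes = doc.getElementsByTagNameNS(ns, "loc");
        List<String> locs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            locs.add(nodes.item(i).getTextContent().trim());
        }
        return locs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"" + SITEMAP_NS + "\">"
                + "<url><loc>http://www.example.com/</loc></url></urlset>";
        System.out.println(extractLocs(new InputSource(new StringReader(xml)), true));
    }
}
```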
protected void addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex)
Parameters:
urlStr - a URL string to add to the SiteMap
siteMap - the sitemap to add URL(s) to
lastMod - last time the SiteMapURL was modified
changeFreq - the SiteMapURL change frequency
priority - priority of this SiteMapURL
urlIndex - index position to which this entry has been added
Copyright © 2009–2017 Crawler-Commons. All rights reserved.