public class SiteMapParserSAX extends SiteMapParser
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
static int | MAX_BYTES_ALLOWED: Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. |
protected boolean | strict: True by default, meaning that invalid URLs are rejected, because the official docs allow the siteMapURLs to be only under the base URL: http://www.sitemaps.org/protocol.html#location |
Constructor and Description |
---|
SiteMapParserSAX() |
SiteMapParserSAX(boolean strict) |
SiteMapParserSAX(boolean strict, boolean allowPartial) |
Modifier and Type | Method and Description |
---|---|
protected void | addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex): Adds the given URL to the given sitemap while showing the relevant logs |
boolean | isStrict() |
AbstractSiteMap | parseSiteMap(byte[] content, URL url): Parse a sitemap, given the content bytes and the URL. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap): Returns a processed copy of an unprocessed sitemap object, i.e. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, URL url): Parse a sitemap, given the MIME type, the content bytes, and the URL. |
AbstractSiteMap | parseSiteMap(URL onlineSitemapUrl): Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it. |
protected AbstractSiteMap | processGzippedXML(URL url, byte[] response): Decompress the gzipped content and process the resulting XML Sitemap. |
protected SiteMap | processText(URL sitemapUrl, byte[] content): Process a text-based Sitemap. |
protected SiteMap | processText(URL sitemapUrl, InputStream stream): Process a text-based Sitemap. |
protected AbstractSiteMap | processXml(URL sitemapUrl, byte[] xmlContent): Parse the given XML content. |
protected AbstractSiteMap | processXml(URL sitemapUrl, InputSource is): Parse the given XML content. |
static boolean | urlIsValid(String sitemapBaseUrl, String testUrl): See if testUrl is under sitemapBaseUrl. |
getElementAttributeValue, getElementValue, parseAtom, parseRSS, parseSitemapIndex, parseSyndicationFormat, parseXmlSitemap
public static final org.slf4j.Logger LOG
public static final int MAX_BYTES_ALLOWED
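As a rough illustration of how a size cap like this might be applied before parsing, the sketch below rejects content larger than the 50MB figure quoted in the field summary. The names `SitemapSizeCheck`, `LIMIT_BYTES`, and `withinLimit` are hypothetical and not part of crawler-commons, and the actual value of MAX_BYTES_ALLOWED is not shown in this section.

```java
class SitemapSizeCheck {
    // Assumption: 50 MB expressed in bytes, the figure quoted in the field summary.
    static final int LIMIT_BYTES = 50 * 1024 * 1024;

    // Returns true when content of the given size fits within the cap.
    static boolean withinLimit(long sizeInBytes) {
        return sizeInBytes >= 0 && sizeInBytes <= LIMIT_BYTES;
    }

    public static void main(String[] args) {
        byte[] content = new byte[1024]; // stand-in for fetched sitemap bytes
        System.out.println(LIMIT_BYTES); // prints 52428800
        System.out.println(withinLimit(content.length)); // prints true
    }
}
```

A caller would typically run a check of this kind on the raw byte count before handing the content to any of the parseSiteMap overloads.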
protected boolean strict
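The rule that strict mode enforces, and that urlIsValid(String sitemapBaseUrl, String testUrl) is described as checking, can be pictured as a prefix test. The sketch below is a minimal stand-in, not the library's implementation. It assumes the base is the sitemap URL with its file name removed, as the location rule in the sitemaps protocol describes. The names `UnderBaseSketch`, `baseOf`, and `isUnder` are hypothetical.

```java
import java.net.URI;

class UnderBaseSketch {
    // Assumption: the base URL is scheme + authority + directory part of the sitemap URL.
    static String baseOf(String sitemapUrl) {
        URI uri = URI.create(sitemapUrl);
        String path = uri.getPath() == null ? "" : uri.getPath();
        int lastSlash = path.lastIndexOf('/');
        String dir = lastSlash >= 0 ? path.substring(0, lastSlash + 1) : "/";
        return uri.getScheme() + "://" + uri.getAuthority() + dir;
    }

    // A test URL counts as "under" the base when it starts with the base string.
    static boolean isUnder(String sitemapBaseUrl, String testUrl) {
        return testUrl.startsWith(sitemapBaseUrl);
    }

    public static void main(String[] args) {
        String base = baseOf("http://example.com/catalog/sitemap.xml");
        System.out.println(base);
        System.out.println(isUnder(base, "http://example.com/catalog/show?item=1"));
        System.out.println(isUnder(base, "http://example.com/images/logo.png"));
    }
}
```

With a check like this, a non-strict parser would simply skip the test and accept URLs from anywhere.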
public SiteMapParserSAX()
public SiteMapParserSAX(boolean strict)
public SiteMapParserSAX(boolean strict, boolean allowPartial)
public boolean isStrict()
Overrides: isStrict in class SiteMapParser
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
Overrides: parseSiteMap in class SiteMapParser
Parameters:
onlineSitemapUrl - URL of the online sitemap
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
Overrides: parseSiteMap in class SiteMapParser
Parameters:
contentType - MIME type of content
content - raw bytes of sitemap file
sitemap - an AbstractSiteMap implementation
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
Overrides: parseSiteMap in class SiteMapParser
Parameters:
content - raw bytes of sitemap file
url - URL to sitemap file
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
Overrides: parseSiteMap in class SiteMapParser
Parameters:
contentType - MIME type of content
content - raw bytes of sitemap file
url - URL to sitemap file
Throws:
UnknownFormatException - if there is an error parsing the sitemap
IOException - if there is an error reading in the site map URL
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
Overrides: processXml in class SiteMapParser
Parameters:
sitemapUrl - URL to sitemap file
xmlContent - the byte[] backing the sitemapUrl
Throws:
UnknownFormatException - if there is an error parsing the sitemap
protected SiteMap processText(URL sitemapUrl, byte[] content) throws IOException
Overrides: processText in class SiteMapParser
Parameters:
sitemapUrl - URL to sitemap file
content - the byte[] backing the sitemapUrl
Throws:
IOException - if there is an error reading in the site map content
protected SiteMap processText(URL sitemapUrl, InputStream stream) throws IOException
Overrides: processText in class SiteMapParser
Parameters:
sitemapUrl - URL to sitemap file
stream - content stream
Throws:
IOException - if there is an error reading in the site map content
protected AbstractSiteMap processGzippedXML(URL url, byte[] response) throws IOException, UnknownFormatException
Overrides: processGzippedXML in class SiteMapParser
Parameters:
url - URL of the gzipped content
response - Gzipped content
Throws:
UnknownFormatException - if there is an error parsing the gzip
IOException - if there is an error reading in the gzip URL
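The decompression step that processGzippedXML is described as performing can be sketched with the JDK's `java.util.zip` classes. This is an illustration under stated assumptions, not the library's code. The names `GunzipSketch`, `gunzip`, and `gzip` are hypothetical, and I/O failures are rethrown as `UncheckedIOException` only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GunzipSketch {
    // Decompress gzipped bytes fully into memory.
    static byte[] gunzip(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to build sample input for the demo below.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<urlset/>".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = gunzip(gzip(xml));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```

In a real crawler the decompressed size would also need a cap, since a small gzip file can expand to far more than the sitemap size limit.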
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
Overrides: processXml in class SiteMapParser
Parameters:
sitemapUrl - a sitemap URL
is - an InputSource backing the sitemap
Throws:
UnknownFormatException - if there is an error parsing the InputSource
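To give a feel for SAX-style processing of an InputSource, the sketch below streams a sitemap and collects the text of each `<loc>` element using the JDK's SAX parser. It is a simplified illustration, not the handler this class uses. The names `SaxLocSketch` and `extractLocs` are hypothetical, and `IllegalArgumentException` stands in for UnknownFormatException.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxLocSketch {
    // Stream the XML once and return the trimmed text of every <loc> element.
    static List<String> extractLocs(InputSource is) {
        List<String> locs = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is "loc" regardless of prefix
            SAXParser parser = factory.newSAXParser();
            parser.parse(is, new DefaultHandler() {
                private StringBuilder text; // non-null only while inside <loc>

                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if ("loc".equals(localName)) {
                        text = new StringBuilder();
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (text != null) {
                        text.append(ch, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("loc".equals(localName) && text != null) {
                        locs.add(text.toString().trim());
                        text = null;
                    }
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("content is not parseable as XML", e);
        }
        return locs;
    }

    public static void main(String[] args) {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://www.example.com/</loc></url>"
                + "<url><loc>http://www.example.com/catalog</loc></url>"
                + "</urlset>";
        System.out.println(extractLocs(new InputSource(new StringReader(xml))));
    }
}
```

Because SAX delivers events as the document is read, a handler like this never holds the whole tree in memory, which matters for sitemaps near the size limit.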
protected void addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex)
Overrides: addUrlIntoSitemap in class SiteMapParser
Parameters:
urlStr - a URL string to add to the SiteMap
siteMap - the sitemap to add URL(s) to
lastMod - last time the SiteMapURL was modified
changeFreq - the SiteMapURL change frequency
priority - priority of this SiteMapURL
urlIndex - index position to which this entry has been added
Copyright © 2009–2017 Crawler-Commons. All rights reserved.