public class SiteMapParser extends Object
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
static int | MAX_BYTES_ALLOWED: Sitemaps (including sitemap index files) "must be no larger than 50MB (52,428,800 bytes)" as specified in the Sitemaps XML format (before Nov. |
protected boolean | strict: True by default, meaning that invalid URLs are rejected, because the official docs only allow sitemap URLs under the base URL: http://www.sitemaps.org/protocol.html#location |
Constructor and Description |
---|
SiteMapParser() |
SiteMapParser(boolean strict) |
Modifier and Type | Method and Description |
---|---|
protected void | addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex): Adds the given URL to the given sitemap while showing the relevant logs. |
protected String | getElementAttributeValue(Element elem, String elementName, String attributeName): Get the element's attribute value. |
protected String | getElementValue(Element elem, String elementName): Get the element's textual content. |
boolean | isStrict() |
protected void | parseAtom(SiteMap sitemap, Element elem, Document doc): Parse the XML document which is assumed to be in Atom format. |
protected void | parseRSS(SiteMap sitemap, Document doc): Parse the XML document which is assumed to be in RSS format. |
AbstractSiteMap | parseSiteMap(byte[] content, URL url): Parse a sitemap, given the content bytes and the URL. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap): Returns a processed copy of an unprocessed sitemap object, i.e. |
AbstractSiteMap | parseSiteMap(String contentType, byte[] content, URL url): Parse a sitemap, given the MIME type, the content bytes, and the URL. |
AbstractSiteMap | parseSiteMap(URL onlineSitemapUrl): Returns a SiteMap or SiteMapIndex given an online sitemap URL. Note that this method goes online, fetches the sitemap, and then parses it. It is a convenience method for a user who has a sitemap URL and wants a "keep it simple" way to parse it. |
protected SiteMapIndex | parseSitemapIndex(URL url, NodeList nodeList): Parse XML that contains a Sitemap Index. |
protected SiteMap | parseSyndicationFormat(URL sitemapUrl, Document doc): Parse the XML document, looking for a feed element to determine if it's an Atom doc, or an rss element to determine if it's an RSS doc. |
protected SiteMap | parseXmlSitemap(URL sitemapUrl, Document doc): Parse XML that contains a valid Sitemap. |
protected AbstractSiteMap | processGzippedXML(URL url, byte[] response): Decompress the gzipped content and process the resulting XML Sitemap. |
protected SiteMap | processText(URL sitemapUrl, byte[] content): Process a text-based Sitemap. |
protected SiteMap | processText(URL sitemapUrl, InputStream stream): Process a text-based Sitemap. |
protected AbstractSiteMap | processXml(URL sitemapUrl, byte[] xmlContent): Parse the given XML content. |
protected AbstractSiteMap | processXml(URL sitemapUrl, InputSource is): Parse the given XML content. |
static boolean | urlIsValid(String sitemapBaseUrl, String testUrl): See if testUrl is under sitemapBaseUrl. |
public static final org.slf4j.Logger LOG
public static final int MAX_BYTES_ALLOWED
protected boolean strict
public SiteMapParser()
public SiteMapParser(boolean strict)
public boolean isStrict()
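Strict mode rejects URLs that are not under the sitemap's base URL, which is the check urlIsValid is described as performing. The following is a minimal JDK-only sketch of such a check, not the library's implementation: the class UrlScopeSketch and the method isUnderBase are hypothetical names, and the rule it applies (same scheme, host and port, path under the sitemap's directory) is an assumption based on the location rule in the sitemaps.org protocol.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of crawler-commons.
final class UrlScopeSketch {

    /**
     * Returns true when testUrl has the same scheme, host and port as
     * baseUrl and its path starts with the directory that holds baseUrl.
     */
    static boolean isUnderBase(String baseUrl, String testUrl) {
        try {
            URI base = new URI(baseUrl);
            URI test = new URI(testUrl);
            if (base.getScheme() == null || base.getHost() == null
                    || base.getPath() == null || test.getPath() == null) {
                return false; // relative or opaque URIs cannot be compared
            }
            if (!base.getScheme().equalsIgnoreCase(test.getScheme())
                    || !base.getHost().equalsIgnoreCase(test.getHost())
                    || base.getPort() != test.getPort()) {
                return false;
            }
            // "/catalog/sitemap.xml" -> "/catalog/"
            String basePath = base.getPath();
            String baseDir = basePath.substring(0, basePath.lastIndexOf('/') + 1);
            return test.getPath().startsWith(baseDir);
        } catch (URISyntaxException e) {
            return false; // malformed input is treated as "not under the base"
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/catalog/sitemap.xml";
        System.out.println(isUnderBase(base, "http://www.example.com/catalog/show?item=1"));
        System.out.println(isUnderBase(base, "http://www.example.com/images/logo.png"));
    }
}
```

The sketch compares ports literally, so an explicit ":80" and an omitted port count as different; a production check would probably normalise default ports first.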
public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl) throws UnknownFormatException, IOException
onlineSitemapUrl
- URL of the online sitemap
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
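Because this overload fetches the sitemap itself, the 50MB limit quoted for MAX_BYTES_ALLOWED matters while reading the response. The sketch below shows one way to read a stream with a byte cap using only the JDK. It is an illustration under assumptions, not the library's fetch logic: CappedReadSketch and readAtMost are hypothetical names, and failing with an IOException on overflow is a design choice of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of crawler-commons.
final class CappedReadSketch {

    /** 50MB expressed in bytes, as quoted in the field summary. */
    static final int MAX_BYTES = 52_428_800;

    /** Reads the whole stream, failing as soon as more than maxBytes arrive. */
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("content exceeds " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readAtMost(new ByteArrayInputStream(new byte[1024]), MAX_BYTES);
        System.out.println(data.length + " bytes read");
    }
}
```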
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, AbstractSiteMap sitemap) throws UnknownFormatException, IOException
contentType
- MIME type of content
content
- raw bytes of sitemap file
sitemap
- an AbstractSiteMap implementation
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(byte[] content, URL url) throws UnknownFormatException, IOException
content
- raw bytes of sitemap file
url
- URL to sitemap file
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
public AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url) throws UnknownFormatException, IOException
contentType
- MIME type of content
content
- raw bytes of sitemap file
url
- URL to sitemap file
UnknownFormatException
- if there is an error parsing the sitemap
IOException
- if there is an error reading in the site map URL
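The contentType argument decides which of the XML, text or gzip paths handles the bytes. The sketch below illustrates that kind of routing with a plain switch on the MIME type. It is not the library's dispatch code: MimeDispatchSketch, Kind and classify are hypothetical names, and the list of MIME types is an assumption chosen for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not part of crawler-commons.
final class MimeDispatchSketch {

    enum Kind { XML, TEXT, GZIP, UNKNOWN }

    /** Maps a Content-Type value to the kind of processing it would need. */
    static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.UNKNOWN;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (mime) {
            case "text/xml":
            case "application/xml":
            case "application/rss+xml":
            case "application/atom+xml":
                return Kind.XML;
            case "text/plain":
                return Kind.TEXT;
            case "application/gzip":
            case "application/x-gzip":
                return Kind.GZIP;
            default:
                return Kind.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("text/xml; charset=UTF-8"));
        System.out.println(classify("application/x-gzip"));
    }
}
```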
protected AbstractSiteMap processXml(URL sitemapUrl, byte[] xmlContent) throws UnknownFormatException
sitemapUrl
- URL to sitemap file
xmlContent
- the byte[] backing the sitemapUrl
UnknownFormatException
- if there is an error parsing the sitemap
protected SiteMap processText(URL sitemapUrl, byte[] content) throws IOException
sitemapUrl
- URL to sitemap file
content
- the byte[] backing the sitemapUrl
IOException
- if there is an error reading in the site map content
protected SiteMap processText(URL sitemapUrl, InputStream stream) throws IOException
sitemapUrl
- URL to sitemap file
stream
- content stream
IOException
- if there is an error reading in the site map content
protected AbstractSiteMap processGzippedXML(URL url, byte[] response) throws IOException, UnknownFormatException
url
- URL of the gzipped content
response
- Gzipped content
UnknownFormatException
- if there is an error parsing the gzip
IOException
- if there is an error reading in the gzip URL
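The decompression step that precedes XML processing can be sketched with java.util.zip alone. This is a minimal illustration, not the library's code: GunzipSketch and its methods are hypothetical names, and gzip exists only to produce sample input for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of crawler-commons.
final class GunzipSketch {

    /** Decompresses gzip bytes; throws an IOException subclass if the input is not gzip. */
    static byte[] gunzip(byte[] gzipped) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Compresses bytes with gzip; used here only to build sample input. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<urlset/>".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = gunzip(gzip(xml));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```

The decompressed bytes would then go through the same XML path as uncompressed content.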
protected AbstractSiteMap processXml(URL sitemapUrl, InputSource is) throws UnknownFormatException
sitemapUrl
- a sitemap URL
is
- an InputSource backing the sitemap
UnknownFormatException
- if there is an error parsing the InputSource
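Once the InputSource has been parsed into a DOM, the root element indicates which of the formats on this page applies: urlset for an XML sitemap, sitemapindex for a sitemap index, feed for Atom and rss for RSS. The sketch below shows that detection step with the JDK's DOM parser. It is an illustration only: RootElementSketch, Format and detect are hypothetical names, and the library's own dispatch may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of crawler-commons.
final class RootElementSketch {

    enum Format { XML_SITEMAP, SITEMAP_INDEX, ATOM, RSS, UNKNOWN }

    /** Parses the source and classifies the document by its root element name. */
    static Format detect(InputSource is)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Element root = factory.newDocumentBuilder().parse(is).getDocumentElement();
        // Fall back to the qualified name if the parser reports no local name.
        String name = root.getLocalName() != null ? root.getLocalName() : root.getNodeName();
        switch (name) {
            case "urlset":
                return Format.XML_SITEMAP;
            case "sitemapindex":
                return Format.SITEMAP_INDEX;
            case "feed":
                return Format.ATOM;
            case "rss":
                return Format.RSS;
            default:
                return Format.UNKNOWN;
        }
    }

    /** Convenience overload for in-memory XML. */
    static Format detect(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        return detect(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detect("<rss version=\"2.0\"><channel/></rss>"));
    }
}
```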
protected SiteMap parseXmlSitemap(URL sitemapUrl, Document doc)
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
<changefreq>weekly</changefreq>
</url>
</urlset>
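To show how a document like the one above maps to entries, here is a minimal DOM-based sketch that collects loc, lastmod, changefreq and priority from each url element. It is not the library's parser: UrlsetSketch, Entry, childText and parse are hypothetical names, values are kept as raw strings, and no strict-mode validation is applied.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of crawler-commons.
final class UrlsetSketch {

    /** One url entry; optional fields are null when the element is absent. */
    static final class Entry {
        final String loc;
        final String lastMod;
        final String changeFreq;
        final String priority;

        Entry(String loc, String lastMod, String changeFreq, String priority) {
            this.loc = loc;
            this.lastMod = lastMod;
            this.changeFreq = changeFreq;
            this.priority = priority;
        }
    }

    /** Text of the first descendant named elementName, or null if there is none. */
    static String childText(Element parent, String elementName) {
        NodeList nodes = parent.getElementsByTagName(elementName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    static List<Entry> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList urls = doc.getElementsByTagName("url");
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            String loc = childText(url, "loc");
            if (loc == null || loc.isEmpty()) {
                continue; // skip entries without a location
            }
            entries.add(new Entry(loc, childText(url, "lastmod"),
                    childText(url, "changefreq"), childText(url, "priority")));
        }
        return entries;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://www.example.com/</loc><priority>0.8</priority></url>"
                + "</urlset>";
        for (Entry e : parse(xml)) {
            System.out.println(e.loc + " priority=" + e.priority);
        }
    }
}
```

Note that the ampersand in a loc value must be written as an entity in the XML; the DOM parser returns it unescaped.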
protected SiteMapIndex parseSitemapIndex(URL url, NodeList nodeList)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/sitemap1.xml.gz</loc>
<lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>http://www.example.com/sitemap2.xml.gz</loc>
<lastmod>2005-01-01</lastmod>
</sitemap>
</sitemapindex>
url
- URL of Sitemap Index
nodeList
- a NodeList backing the sitemap
protected SiteMap parseSyndicationFormat(URL sitemapUrl, Document doc) throws UnknownFormatException
sitemapUrl
- the URL location of the Sitemap
doc
- XML document to parse
UnknownFormatException
- if XML does not appear to be Atom or RSS
protected void parseAtom(SiteMap sitemap, Element elem, Document doc)
Parse the XML document which is assumed to be in Atom format. Atom 1.0 example:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Example Feed</title>
<subtitle>A subtitle.</subtitle>
<link href="http://example.org/feed/" rel="self"/>
<link href="http://example.org/"/>
<modified>2003-12-13T18:30:02Z</modified>
<author>
<name>John Doe</name>
<email>johndoe@example.com</email>
</author>
<id>urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</id>
<entry>
<title>Atom-Powered Robots Run Amok</title>
<link href="http://example.org/2003/12/13/atom03"/>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2003-12-13T18:30:02Z</updated>
<summary>Some text.</summary>
</entry>
...
</feed>
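For a feed like the one above, the URLs of interest are the href attributes of the link elements inside each entry. The sketch below extracts each entry's first link together with its updated timestamp. It is an illustration under assumptions, not the library's parseAtom: AtomSketch and entryLinks are hypothetical names, and which date element the real parser reads is not stated on this page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of crawler-commons.
final class AtomSketch {

    /** Returns {href, updated} pairs, one per entry; updated is null when absent. */
    static List<String[]> entryLinks(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList entries = doc.getElementsByTagName("entry");
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            // Only links inside the entry count; feed-level links are ignored.
            NodeList links = entry.getElementsByTagName("link");
            if (links.getLength() == 0) {
                continue;
            }
            String href = ((Element) links.item(0)).getAttribute("href");
            if (href.isEmpty()) {
                continue; // getAttribute returns "" when the attribute is missing
            }
            NodeList updated = entry.getElementsByTagName("updated");
            String when = updated.getLength() == 0
                    ? null : updated.item(0).getTextContent().trim();
            result.add(new String[] {href, when});
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<link href=\"http://example.org/\"/>"
                + "<entry><link href=\"http://example.org/2003/12/13/atom03\"/>"
                + "<updated>2003-12-13T18:30:02Z</updated></entry></feed>";
        for (String[] pair : entryLinks(xml)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```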
protected void parseRSS(SiteMap sitemap, Document doc)
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Lift Off News</title>
<link>http://liftoff.msfc.nasa.gov/</link>
<description>Liftoff to Space Exploration.</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<lastBuildDate>Tue, 10 Jun 2003 09:41:01 GMT</lastBuildDate>
<docs>http://blogs.law.harvard.edu/tech/rss</docs>
<generator>Weblog Editor 2.0</generator>
<managingEditor>editor@example.com</managingEditor>
<webMaster>webmaster@example.com</webMaster>
<ttl>5</ttl>
<item>
<title>Star City</title>
<link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link>
<description>How do Americans get ready to work with Russians aboard the
International Space Station? They take a crash course in culture,
language and protocol at Russia's Star City.
</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573</guid>
</item>
<item>
<title>Space Exploration</title>
<link>http://liftoff.msfc.nasa.gov/</link>
<description>Sky watchers in Europe, Asia, and parts of Alaska and Canada
will experience a partial eclipse of the Sun on Saturday, May 31.
</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://liftoff.msfc.nasa.gov/2003/05/30.html#item572</guid>
</item>
</channel>
</rss>
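In an RSS document like the one above, each item carries its URL in link and its date in pubDate, which uses the RFC 1123 style. The sketch below collects both per item and converts the date with the JDK's built-in formatter. It is an illustration, not the library's parseRSS: RssSketch, text and itemLinks are hypothetical names, and treating an unparseable date as null is a choice made for this example.

```java
import java.io.StringReader;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of crawler-commons.
final class RssSketch {

    /** Text of the first descendant named elementName, or null if there is none. */
    static String text(Element parent, String elementName) {
        NodeList nodes = parent.getElementsByTagName(elementName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Maps each item's link to its pubDate (null when missing or unparseable). */
    static Map<String, Instant> itemLinks(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList items = doc.getElementsByTagName("item");
        Map<String, Instant> result = new LinkedHashMap<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = text(item, "link");
            if (link == null || link.isEmpty()) {
                continue; // channel-level links are not inside an item
            }
            Instant when = null;
            String pubDate = text(item, "pubDate");
            if (pubDate != null) {
                try {
                    when = OffsetDateTime
                            .parse(pubDate, DateTimeFormatter.RFC_1123_DATE_TIME)
                            .toInstant();
                } catch (DateTimeParseException e) {
                    when = null; // keep the link even if the date is malformed
                }
            }
            result.put(link, when);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss version=\"2.0\"><channel><link>http://liftoff.msfc.nasa.gov/</link>"
                + "<item><link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link>"
                + "<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate></item></channel></rss>";
        itemLinks(xml).forEach((link, when) -> System.out.println(link + " " + when));
    }
}
```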
protected String getElementValue(Element elem, String elementName)
elem
-
elementName
-
protected String getElementAttributeValue(Element elem, String elementName, String attributeName)
elem
-
elementName
-
attributeName
-
protected void addUrlIntoSitemap(String urlStr, SiteMap siteMap, String lastMod, String changeFreq, String priority, int urlIndex)
urlStr
- a URL string to add to the SiteMap
siteMap
- the sitemap to add URL(s) to
lastMod
- last time the SiteMapURL was modified
changeFreq
- the SiteMapURL change frequency
priority
- priority of this SiteMapURL
urlIndex
- index position to which this entry has been added
Copyright © 2009–2017 Crawler-Commons. All rights reserved.