* Allow for legacy URIs when checking sitemap namespaces
- e.g., allow legacy namespace URI but ignore URLs
from image and video sitemap extensions
- resolve relative namespace URIs
- add namespace URIs of sitemap extensions (news, images, videos)
* Address kkrugler's review comments:
- document addition of sitemap namespace required by sitemap
protocol specification when calling setStrictNamespace(true)
- remove early return on <rss> root element
* Add main to SimpleRobotRulesParser for testing
- implement toString() for robot rules
- fix line breaks in comments
* Do not detect MIME type as Tika dependency has been removed
- make SimpleRobotRulesParser._rules property protected
and add getters for SimpleRobotRulesParser._rules and
RobotRules's properties
- change SimpleRobotRulesParser return type from BaseRobotRules
to SimpleRobotRules to allow access to the concrete class without
nasty type casts while still obeying the superclass contract
- ignore query part of URL to determine sitemap location prefix
for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-contiguous content (containing XML entities or CDATA)
when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
fixes #179
- allow eTLDs to appear as IDN variants
* one or more dot-separated parts as IDN,
* optionally others as punycoded IDNA
- speed up lookup of public suffixes / eTLDs using a suffix trie
- update public suffix list to recent version
- add unit tests to test for wildcard eTLDs
for PaidLevelDomain detection
- adapt unit tests: 'de.com' is a private suffix,
consequently 'xxx.de.com' isn't a PaidLevelDomain
- rebase on master, apply code format
- update public suffix list to recent version of
https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid
effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)
- improve parsing of public suffix list
- correctly recognize exceptions to wildcard rules as domains
- do not reject eTLDs whose last element is not itself listed as a TLD (e.g., .ac.za)
- partially fix IDNs: punycoded IDNs are now recognized
- add unit test for uppercase / mixed case host names
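
A minimal sketch of the suffix-trie lookup and host name normalization described in the items above, not the project's actual implementation. The class SuffixTrieSketch and its methods are hypothetical. It assumes plain rules only (no wildcard or exception rules), and rules are added by hand rather than parsed from public_suffix_list.dat.

```java
import java.net.IDN;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: longest-match lookup of plain public suffix rules. */
final class SuffixTrieSketch {

    /** Trie node; children are keyed by one host name label. */
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean isRule; // true if the path from the root to this node is a listed suffix
    }

    private final Node root = new Node();

    /** Lowercase the name and convert Unicode labels to their punycoded form. */
    static String normalize(String name) {
        return IDN.toASCII(name.toLowerCase(Locale.ROOT));
    }

    /** Insert a suffix rule, walking its labels from right to left. */
    void addRule(String suffix) {
        String[] labels = normalize(suffix).split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(labels[i], k -> new Node());
        }
        node.isRule = true;
    }

    /** Return the longest listed suffix of the host name, or null if none matches. */
    String longestSuffix(String host) {
        String[] labels = normalize(host).split("\\.");
        Node node = root;
        int matchStart = -1;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.get(labels[i]);
            if (node == null) {
                break; // no rule continues along this path
            }
            if (node.isRule) {
                matchStart = i; // remember the longest match seen so far
            }
        }
        if (matchStart < 0) {
            return null;
        }
        return String.join(".", Arrays.copyOfRange(labels, matchStart, labels.length));
    }
}
```

Because the trie is keyed by reversed labels, a lookup visits at most one node per label of the host name, regardless of how many rules are loaded. Normalizing both rules and host names the same way lets a mixed-case or Unicode host name match a rule stored in punycoded form.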
* Add namespace aware DOM/SAX parsing for XML Sitemaps. RSS and Atom parsing is also namespace aware, but finding elements is left "relaxed" by only matching on the element "localName".
* Lenient namespace handling in non-strict mode; apply formatting
* Introduce separate field strictNamespace in sitemap parsers; add test for SAX parser
* Fixes Javadoc
* Fixes the fix for the Javadoc
* Allow setting strictNamespace in SiteMapTester
- Fix strict namespace handling in SitemapParserSAX:
- pass strictNamespace from DelegatorHandler to delegates
- ignore text if inside an element of invalid namespace
- use SAX parser in unit test
- set exception and pass it to calling DelegatorHandler if namespace
does not match
- call delegated method startElement(...) from DelegatorHandler
- add heuristics to select link URL
- add Atom feed unit test
- add static method to encapsulate reading of test resource files
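
A minimal sketch of the difference between strict and relaxed namespace matching in a namespace-aware DOM parse, as described in the namespace items above. It is not the project's parser code: the class NamespaceMatchSketch, the method locations and its strictNamespace parameter are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: collect <loc> values with strict or relaxed namespace matching. */
final class NamespaceMatchSketch {

    /** Namespace URI of the sitemap protocol. */
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    static List<String> locations(String xml, boolean strictNamespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace URIs and local names
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Strict: the element must be in the sitemap namespace.
            // Relaxed: "*" matches any namespace, so only the local name counts.
            String namespace = strictNamespace ? SITEMAP_NS : "*";
            NodeList nodes = doc.getElementsByTagNameNS(namespace, "loc");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }
}
```

In this sketch a document that declares a legacy namespace URI yields its URLs in relaxed mode and nothing in strict mode, which mirrors the trade-off the strictNamespace setting is meant to expose.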