- extend unit tests to host names including IDNs
- add unit test to verify that runtime exception caused by invalid
percent-encoded sequences are properly caught
fixes #276
- change visibility (private -> public) of DEFAULT_MAX_CRAWL_DELAY and DEFAULT_MAX_WARNINGS
- document methods to set/get max crawl delay and max warnings
- make all sitemap classes including extensions to implement the
Serializable interface
- extend sitemap parser unit tests to check object serialization
on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
etc.)
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
* added abstract method to extension metadata
* implemented asmap in image/link/mobile/news attributes
* implemented asmap in videoattributes
* adding video attributes unit test
* added news attributes unit tests
* unit test for link attributes
* unit tests for image and mobile attributes
* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString
* using constants instead of strings in more attributes
* cleaned up the imports
* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining it's usage
* added related issue to the changelog
* reverting change to NewsAttributes.equal, that causes a unit test failure
- log the location of the public suffix list (effective_tld_names.dat)
during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
- move the counter _numWarnings into the ParseState object local to
parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
log messages beforehand
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
add note that UTC is assumed if no time zone is contained in
date string
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests
* Fix Javadoc error and warnings, update change log
* Remove obsolete dependency to jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
by updating to Java 8 date and time API
- Makes SimpleRobotRulesParser._rules property protected
and adds getters for SimpleRobotRulesParser._rules and
RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
to SimpleRobotRules to allow access to concrete class without
nasty type casts while still obeying super class contract