* This normalization adds the [Unit_Name] prefix to each issue entry where the prefix is missing and the unit is obvious
Added the [Domains] unit name (as in the java package name)
Didn't touch the issues changelog prior to v0.7
This resolves #270
* Updated according to Sebastian's code review
- extend SiteMapParser by methods to register a URLFilter (function)
used to normalize or filter (if null is returned) URLs found in
sitemaps
- implement URL filtering in sitemap parsers / XML handlers
- add unit tests to verify URL filtering for text and XML sitemaps
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exceptions caused by invalid
percent-encoded sequences are properly caught
fixes #276
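
The filter contract described above (a function that returns a normalized URL, or null to drop the URL) could look roughly like the following. This is only a sketch: the class UrlFilterSketch and the methods applyFilter and httpOnly are hypothetical names chosen for illustration and are not the SiteMapParser API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not the SiteMapParser API: a filter maps a URL to its
// normalized form, or to null when the URL should be dropped.
class UrlFilterSketch {

    /** Applies the filter to every URL and keeps only the non-null results. */
    static List<String> applyFilter(List<String> urls, Function<String, String> filter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String result = filter.apply(url);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    /** Example filter: rejects non-HTTP(S) URLs and strips a trailing fragment. */
    static String httpOnly(String url) {
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            return null; // null means "filter this URL out"
        }
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }
}
```

A caller would pass the filter as a method reference, e.g. UrlFilterSketch.applyFilter(urls, UrlFilterSketch::httpOnly). Using one function for both normalization and rejection keeps the registration point to a single method.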
- change visibility (private -> public) of DEFAULT_MAX_CRAWL_DELAY and DEFAULT_MAX_WARNINGS
- document methods to set/get max crawl delay and max warnings
- make all sitemap classes, including extensions, implement the
Serializable interface
- extend sitemap parser unit tests to check object serialization
on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
etc.)
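
A serialization check of the kind mentioned above can be sketched as a round trip through the standard Java object streams. The class Entry below is a hypothetical stand-in for a sitemap class, and roundTrip is an illustrative helper, not code from the actual unit tests.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Objects;

class SerializationSketch {

    /** Hypothetical stand-in for a sitemap class implementing Serializable. */
    static class Entry implements Serializable {
        private static final long serialVersionUID = 1L;
        final String url;
        final double priority;

        Entry(String url, double priority) {
            this.url = url;
            this.priority = priority;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Entry)) {
                return false;
            }
            Entry other = (Entry) o;
            return url.equals(other.url) && Double.compare(priority, other.priority) == 0;
        }

        @Override
        public int hashCode() {
            return Objects.hash(url, priority);
        }
    }

    /** Serializes the object to a byte array and reads a copy back. */
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```

A test would then assert that the copy equals the original, which fails if any field of the class (or of a nested extension object) is not serializable.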
* [sitemaps] Allow deduplicating sitemap links in sitemap indexes, fixes #262
- add unit test
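
One simple way to deduplicate links while keeping their original order is a LinkedHashSet. The sketch below assumes links are compared as plain strings; DedupSketch and deduplicate are hypothetical names, and the actual change may compare URLs differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch: order-preserving deduplication of sitemap links.
class DedupSketch {

    /** Returns the links with later duplicates removed; first occurrences keep their order. */
    static List<String> deduplicate(List<String> links) {
        return new ArrayList<>(new LinkedHashSet<>(links));
    }
}
```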
* added abstract method to extension metadata
* implemented asmap in image/link/mobile/news attributes
* implemented asmap in videoattributes
* adding video attributes unit test
* added news attributes unit tests
* unit test for link attributes
* unit tests for image and mobile attributes
* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString
* using constants instead of strings in more attributes
* cleaned up the imports
* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage
* added related issue to the changelog
* reverting the change to NewsAttributes.equal that caused a unit test failure
- log the location of the public suffix list (effective_tld_names.dat)
during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
- move the counter _numWarnings into the ParseState object local to
parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing, the warning count is stored in a ThreadLocal member variable
of SimpleRobotRulesParser; the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
log messages beforehand
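
The pattern described in the notes above (a per-call state object for counting, a ThreadLocal for the result, and parameterized logging) can be sketched as follows. This is a toy illustration, not the SimpleRobotRulesParser code: WarningCountSketch and its line check are hypothetical, and java.util.logging is used here as an assumption in place of whatever logging framework the project uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of thread-safe warning counting; not the real parser.
class WarningCountSketch {
    private static final Logger LOG = Logger.getLogger(WarningCountSketch.class.getName());

    /** State local to one parseContent() call, so concurrent calls never share a counter. */
    private static class ParseState {
        int numWarnings;
    }

    /** Holds the warning count of the most recent parse on the current thread. */
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);
    private final int maxWarnings;

    WarningCountSketch(int maxWarnings) {
        this.maxWarnings = maxWarnings;
    }

    /** Toy parser: every non-empty line without a ':' counts as a warning. */
    void parseContent(String content) {
        ParseState state = new ParseState();
        for (String line : content.split("\n")) {
            if (!line.isEmpty() && !line.contains(":")) {
                reportWarning(state, "Malformed line: {0}", line);
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    private void reportWarning(ParseState state, String pattern, Object arg) {
        state.numWarnings++;
        // Suppress log output beyond the limit, but keep counting.
        if (state.numWarnings <= maxWarnings) {
            // Parameterized call: the message is only formatted if it is actually logged.
            LOG.log(Level.WARNING, pattern, arg);
        }
    }

    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

Because the counter lives in ParseState, one parser instance can be shared between threads, and each thread reads back only the count from its own last call.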