- make all sitemap classes, including extensions, implement the
Serializable interface
- extend sitemap parser unit tests to check object serialization
on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
etc.)
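
A serialization check of this kind can be sketched as a round trip through Java object streams. This is a minimal sketch, not the project's test code: the class `SerializationRoundTrip` and its method `roundTrip` are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper, not part of crawler-commons.
final class SerializationRoundTrip {

    /** Serializes the object into a byte array and deserializes it again. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            // Any failure means the object graph is not fully serializable.
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }
}
```

A unit test would then parse a sitemap, pass the result through `roundTrip`, and compare the copy with the original object.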
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
* added abstract method to extension metadata
* implemented asmap in image/link/mobile/news attributes
* implemented asmap in videoattributes
* adding video attributes unit test
* added news attributes unit tests
* unit test for link attributes
* unit tests for image and mobile attributes
* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString
* using constants instead of strings in more attributes
* cleaned up the imports
* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage
* added related issue to the changelog
* reverting change to NewsAttributes.equal that caused a unit test failure
- log the location of the public suffix list (effective_tld_names.dat)
during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
- move the counter _numWarnings into the ParseState object local to
parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
log messages beforehand
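
The thread-safe warning counter described above could look roughly like the following. This is a simplified sketch under stated assumptions: the class `WarningCountingParser`, the signature of `parseContent`, and the rule for what counts as a warning are all hypothetical. Only the idea of a per-call state object plus a ThreadLocal holding the final count is taken from the notes above.

```java
import java.util.List;

// Hypothetical stand-in for a robots.txt parser; the line check is illustrative only.
final class WarningCountingParser {

    /** Per-call state: lives on the stack of one parseContent() invocation. */
    private static final class ParseState {
        int numWarnings;
    }

    /** Warning count of the last parse performed by the current thread. */
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    void parseContent(List<String> lines) {
        ParseState state = new ParseState();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption for this sketch: any non-empty, non-comment line
            // without a "field: value" separator counts as a warning.
            if (!trimmed.isEmpty() && !trimmed.startsWith("#") && !trimmed.contains(":")) {
                state.numWarnings++;
            }
        }
        // Publish the result for this thread only.
        lastNumWarnings.set(state.numWarnings);
    }

    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

Because the counter is never shared between concurrent calls, two threads using the same parser instance cannot corrupt each other's counts.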
IMHO it also makes the tests more readable but that is in the eyes of the beholder
This file is 40 lines shorter and has 172 unit tests instead of 46
* Upgraded to JUnit v5.5
Updated the annotations and assertions accordingly
* Removed unneeded before and after
* This is a technical debt
Fixing a styling issue I caused about 4 years ago
Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82
* Fixed according to @sebastian-nagel's code review
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
add note that UTC is assumed if no time zone is contained in
date string
fixes #144
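
The documented UTC assumption can be illustrated with java.time. This is a sketch only, not the actual `convertToZonedDateTime` implementation: the class `SitemapDates` and the method `toZonedDateTime` are hypothetical, and the real method may support more date formats.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper illustrating "UTC is assumed if no time zone is given".
final class SitemapDates {

    /** Returns the parsed date, or null if the string matches neither format. */
    static ZonedDateTime toZonedDateTime(String date) {
        try {
            // Date strings carrying an explicit offset keep that offset.
            return ZonedDateTime.parse(date, DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException e) {
            // No offset present: try the zone-less format next.
        }
        try {
            // Without a time zone in the string, UTC is assumed.
            return LocalDateTime.parse(date, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
                    .atZone(ZoneOffset.UTC);
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

Formatting expected values in UTC inside the tests keeps the assertions independent of the time zone of the machine running them.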
- implement InputStream skipping over white space at beginning of file
- use for XML sitemaps in combination with BOMInputStream,
so that white space or empty lines before <?xml ...> do not
cause the parser to fail
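
One way to skip leading white space is sketched below. Note the swap: instead of a dedicated InputStream subclass as described above, this sketch wraps the stream in the JDK's PushbackInputStream. The class `LeadingWhitespace` and its method are hypothetical, only ASCII white space is handled, and BOM handling (done by BOMInputStream in the change above) is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not the class added in this change.
final class LeadingWhitespace {

    /** Returns a stream positioned at the first non-white-space byte. */
    static InputStream skipLeadingWhitespace(InputStream in) {
        PushbackInputStream pushback = new PushbackInputStream(in, 1);
        try {
            int b;
            while ((b = pushback.read()) != -1) {
                if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                    // First content byte: put it back so the XML parser sees it.
                    pushback.unread(b);
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pushback;
    }
}
```

The returned stream can be handed to the XML parser, which then sees `<?xml ...>` as the very first content.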
the current character chunk)
- use `localName` instead of calling `currentElement()` where applicable
- remove unnecessary null checks of character buffer
- fix errors when character chunks are interrupted by CDATA sections or character entities
- fixes #225 XMLIndexHandler needs to accumulate the lastmod date string before parsing
- fixes #226 XMLHandler needs to append text in characters() vs. immediately processing
- provide character buffer in DelegatorHandler, so that derived classes
can append characters to it and finally get the buffered content
- code cleanup in all handler classes:
- add @Override annotations
- remove stub method implementations
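
The buffering fix for #225 and #226 follows a common SAX pattern: append in characters() and process the text only in endElement(). The sketch below shows that pattern with plain JDK classes. `LocCollector` is a hypothetical handler and not one of the project's handler classes, and it uses `qName` because the default parser factory is not namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler collecting the text of all <loc> elements.
final class LocCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> locations = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0); // start with an empty buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element (entities, CDATA, chunking),
        // so only append here and never process the chunk directly.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("loc".equals(qName)) {
            locations.add(buffer.toString().trim()); // text is complete now
        }
        buffer.setLength(0);
    }

    static List<String> parse(String xml) {
        try {
            LocCollector handler = new LocCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.locations;
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse XML", e);
        }
    }
}
```

A URL containing `&amp;` or a CDATA section is assembled from all of its chunks before it is stored.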
An NPE is thrown because parseFloat returns a Float object, which is null in case of a NumberFormatException, but VideoPrice accepts only a primitive float.
To work around this issue and avoid recurring errors, I've changed the VideoPrice price field to a Float object, which accepts null in that case.
This is far from ideal, and parseFloat would benefit from being able to parse locale-specific number formats. Still, as a first quick fix, this allows the rest of the file to be parsed,
whereas previously the error caused parsing of the whole file to fail.
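
The failure mode and the workaround can be reproduced in isolation. In this sketch `PriceParsing`, `parseFloatOrNull` and `Price` are hypothetical stand-ins for the project's parseFloat helper and the VideoPrice class. The sketch only shows why a nullable Float field avoids the NPE caused by auto-unboxing.

```java
// Hypothetical stand-ins, not the project's actual classes.
final class PriceParsing {

    /** Returns the parsed value, or null if the string is not a valid float. */
    static Float parseFloatOrNull(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Float.valueOf(value.trim());
        } catch (NumberFormatException e) {
            // E.g. "1,99": a decimal comma is not accepted by Float.valueOf.
            return null;
        }
    }

    /** Holder with a nullable price: storing null does not trigger unboxing. */
    static final class Price {
        final String currency;
        final Float price; // a primitive float here would throw an NPE on null

        Price(String currency, Float price) {
            this.currency = currency;
            this.price = price;
        }
    }
}
```

With a primitive `float` field, passing the null result into the constructor would unbox it and throw the NullPointerException described above.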
- optionally parse elements in the namespace of sitemap extensions:
- Google video sitemaps (resolves #35)
- Google image sitemaps (resolves #36)
- Google news sitemaps
- alternate links in sitemaps (resolves #149)
- the code is taken from Tanguy Moal's (@tuxnco) PR #162
with the following modifications:
- port from DOM to SAX parser
- keep specific extensions separate from the "core" sitemap classes
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests
* Fix Javadoc error and warnings, update change log
* Remove obsolete dependency on jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
by updating to Java 8 date and time API
* Allow for legacy URIs when checking sitemap namespaces
- e.g., allow legacy namespace URI but ignore URLs
from image and video sitemap extensions
- resolve relative namespace URIs
- add namespace URIs of sitemap extensions (news, images, videos)
* Address kkrugler's review comments:
- document addition of sitemap namespace required by sitemap
protocol specification when calling setStrictNamespace(true)
- remove early return on <rss> root element
* Add main to SimpleRobotRulesParser for testing
- implement toString() for robot rules
- fix line breaks in comments
* Do not detect MIME type as Tika dependency has been removed