- move the counter _numWarnings into the ParseState object local to
parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
log messages beforehand
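A minimal sketch of the warning-count change described above, not the actual SimpleRobotRulesParser code. Only `ParseState`, `parseContent` and `getNumWarnings()` are names taken from the notes; the class name, the field name and the "line without a colon" rule are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical, simplified stand-in for the parser; not the real class.
class WarningCountingParser {

    /** State local to one parseContent(...) call, so concurrent calls never share a counter. */
    private static final class ParseState {
        int numWarnings;
    }

    /** Warning count of the most recent parse performed by the calling thread. */
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    /** Assumed rule for the sketch: every non-empty line without ':' counts as one warning. */
    int parseContent(List<String> lines) {
        ParseState state = new ParseState();
        for (String line : lines) {
            if (!line.isEmpty() && !line.contains(":")) {
                state.numWarnings++;
            }
        }
        // Publish the result per thread after parsing has finished.
        lastNumWarnings.set(state.numWarnings);
        return state.numWarnings;
    }

    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

With this layout, a parse running on another thread does not change the value that `getNumWarnings()` returns on the current thread.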
IMHO it also makes the tests more readable, but that is in the eye of the beholder
This file is 40 lines smaller and has 172 unit tests instead of 46
* Upgraded to JUnit 5.5
Updated the annotations and assertions accordingly
* Removed unneeded before and after
* This is technical debt
Fixing a styling issue I caused about 4 years ago
Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82
* Fixed according to @sebastian-nagel's code review
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
add note that UTC is assumed if no time zone is contained in
date string
fixes #144
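The "UTC is assumed if no time zone is contained" rule can be sketched as follows. This is an illustration under assumptions, not the body of `convertToZonedDateTime`: the class and method names are hypothetical, and only two ISO formats are handled.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, for illustration only.
class UtcDefaultDates {

    /** Parses an ISO date-time; when the string carries no offset, UTC is assumed. */
    static ZonedDateTime toZonedDateTime(String date) {
        try {
            // e.g. 2005-01-01T10:00:00+02:00
            return ZonedDateTime.parse(date, DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException e) {
            // e.g. 2005-01-01T10:00:00 (no zone information): interpret as UTC
            return LocalDateTime.parse(date, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
                    .atZone(ZoneOffset.UTC);
        }
    }
}
```

A test that formats the result should also do so in UTC (for example via `withZoneSameInstant(ZoneOffset.UTC)`), otherwise the expected string depends on the default time zone of the machine running the test.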
- implement InputStream skipping over white space at beginning of file
- use for XML sitemaps in combination with BOMInputStream,
so that white space or empty lines before <?xml ...> do not
cause the parser to fail
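A rough sketch of skipping leading white space, using the JDK's `PushbackInputStream` rather than the dedicated stream class added by this change. The class and method names are hypothetical. The sketch assumes an ASCII-compatible encoding and does not handle a byte order mark, which the change leaves to BOMInputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, for illustration only.
class LeadingWhitespace {

    /** Returns a stream positioned at the first byte that is not space, tab, CR or LF. */
    static InputStream skipLeadingWhitespace(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 1);
        int b;
        while ((b = pushback.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                // First significant byte: put it back so the XML parser sees it.
                pushback.unread(b);
                break;
            }
        }
        return pushback;
    }
}
```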
the current character chunk)
- use `localName` instead of calling `currentElement()` where applicable
- remove unnecessary null checks of character buffer
- fix errors when character chunks are interrupted by CDATA sections or character entities
- fixes #225 XMLIndexHandler needs to accumulate the lastmod date string before parsing
- fixes #226 XMLHandler needs to append text in characters() vs. immediately processing
- provide character buffer in DelegatorHandler, so that derived classes
can append characters to it and finally get the buffered content
- code cleanup in all handler classes:
- add @Override annotations
- remove stub method implementations
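Why the character buffer matters: a SAX parser may deliver the text of one element in several `characters()` calls, typically around entities and CDATA sections. The sketch below is not the DelegatorHandler code; the class name and the `parse` helper are hypothetical. It appends in `characters()` and only reads the buffer in `endElement()`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
class BufferingLocHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> locations = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The default factory is not namespace-aware, so qName holds the element name.
        if ("loc".equals(qName)) {
            locations.add(buffer.toString().trim());
        }
        buffer.setLength(0);
    }

    static List<String> parse(String xml) throws Exception {
        BufferingLocHandler handler = new BufferingLocHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.locations;
    }
}
```

Processing the text directly inside `characters()` would instead see only one chunk at a time, which appears to be the failure mode behind #225 and #226.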
An NPE is thrown because parseFloat returns a Float object, which is null after a NumberFormatException, while VideoPrice accepts only a primitive float.
To work around this and avoid recurring errors, I've changed the VideoPrice price field to a Float object, so that it can hold null in that case.
This is far from ideal, and parseFloat would benefit from handling locale-specific number formats. As a first quick fix, however, it allows the rest of the file to be parsed,
whereas previously the error caused parsing of the whole file to fail.
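The quick fix can be illustrated like this. The class below is a hypothetical stand-in, not the real VideoPrice, and `parseFloatOrNull` is an assumed name for a helper that behaves as described above.

```java
// Hypothetical stand-in for VideoPrice, for illustration only.
class VideoPriceSketch {
    // Boxed type: null represents "price could not be parsed".
    private final Float price;

    VideoPriceSketch(Float price) {
        this.price = price;
    }

    Float getPrice() {
        return price;
    }

    /** Returns null instead of throwing when the text is not a valid float. */
    static Float parseFloatOrNull(String text) {
        try {
            return Float.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

With a primitive `float` parameter, passing the null result would be auto-unboxed and throw a NullPointerException. A value such as `1,99` (comma as decimal separator) still ends up as null here, which is the locale limitation mentioned above.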
- optionally parse elements in the namespace of sitemap extensions:
- Google video sitemaps (resolves #35)
- Google image sitemaps (resolves #36)
- Google news sitemaps
- alternate links in sitemaps (resolves #149)
- the code is taken from Tanguy Moal's (@tuxnco) PR #162
with the following modifications:
- port from DOM to SAX parser
- keep specific extensions separate from the "core" sitemap classes
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests
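A short sketch of the two predefined formatters relevant here. The class and method names are hypothetical. `DateTimeFormatter` instances are immutable and thread-safe, so they can be shared without a ThreadLocal.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, for illustration only.
class SitemapDateFormats {

    /** W3C/ISO date-time with offset; the seconds field is optional in this formatter. */
    static ZonedDateTime parseW3c(String date) {
        return ZonedDateTime.parse(date, DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    /** RSS publication date in RFC 1123 format, e.g. "Tue, 10 Jun 2003 04:00:00 GMT". */
    static ZonedDateTime parseRss(String date) {
        return ZonedDateTime.parse(date, DateTimeFormatter.RFC_1123_DATE_TIME);
    }
}
```

Because `ISO_OFFSET_DATE_TIME` accepts `2004-10-01T18:23+00:00` as it stands, a separate regex for dates without seconds is no longer needed.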
* Fix Javadoc error and warnings, update change log
* Remove obsolete dependency on jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
by updating to Java 8 date and time API
* Allow for legacy URIs when checking sitemap namespaces
- e.g., allow legacy namespace URI but ignore URLs
from image and video sitemap extensions
- resolve relative namespace URIs
- add namespace URIs of sitemap extensions (news, images, videos)
* Address kkrugler's review comments:
- document addition of sitemap namespace required by sitemap
protocol specification when calling setStrictNamespace(true)
- remove early return on <rss> root element
* Add main to SimpleRobotRulesParser for testing
- implement toString() for robot rules
- fix line breaks in comments
* Do not detect MIME type as Tika dependency has been removed
- Makes SimpleRobotRulesParser._rules property protected
and adds getters for SimpleRobotRulesParser._rules and
RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
to SimpleRobotRules to allow access to concrete class without
nasty type casts while still obeying super class contract
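The return type change relies on Java's covariant return types. A minimal sketch with hypothetical class names (not the real BaseRobotRules hierarchy) and a deliberately naive "Disallow:" handling:

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical, for illustration only.
abstract class BaseRules {
    abstract boolean isAllowed(String path);
}

class SimpleRules extends BaseRules {
    private final List<String> disallowed = new ArrayList<>();

    void addDisallow(String prefix) {
        disallowed.add(prefix);
    }

    /** Getter that exists only on the concrete class. */
    List<String> getDisallowed() {
        return disallowed;
    }

    @Override
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}

abstract class BaseParser {
    abstract BaseRules parse(String content);
}

class SimpleParser extends BaseParser {
    /** Narrower return type than the superclass method: still a valid override. */
    @Override
    SimpleRules parse(String content) {
        SimpleRules rules = new SimpleRules();
        for (String line : content.split("\n")) {
            if (line.startsWith("Disallow:")) {
                rules.addDisallow(line.substring("Disallow:".length()).trim());
            }
        }
        return rules;
    }
}
```

Callers holding a `SimpleParser` get a `SimpleRules` without a cast, while code written against `BaseParser` keeps working with `BaseRules`.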
- ignore query part of URL to determine sitemap location prefix
for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
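Two of the URL fixes above (#202, #203) can be sketched with `java.net.URI`. The class and method names are hypothetical, and the exact prefix rule used by the parser is assumed to be "everything up to the last slash of the path".

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
class SitemapUrls {

    /** Location prefix taken from the path only, so a '/' inside the query is ignored. */
    static String locationPrefix(URI sitemapUri) {
        String path = sitemapUri.getPath();
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        return sitemapUri.getScheme() + "://" + sitemapUri.getAuthority() + dir;
    }

    /** Resolves a possibly relative link against the URL of the feed it was found in. */
    static URI resolveLink(URI feedUri, String link) {
        return feedUri.resolve(link.trim());
    }
}
```

For `http://example.com/a/sitemap.xml?dir=/x/y` this yields the prefix `http://example.com/a/`, whereas searching the whole URL string for the last slash would wrongly include part of the query.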