mirror of https://github.com/crawler-commons/crawler-commons synced 2024-09-20 08:42:15 +02:00
Commit Graph

189 Commits

Author SHA1 Message Date
Sebastian Nagel
6c271f5fab SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
- move the counter _numWarnings into the ParseState object local to
  parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
  of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
  log messages beforehand
2020-01-08 13:55:46 +01:00
Sebastian Nagel
f7cc8b694a Make mvn run parameterized tests, fixes #273
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel
7419e96d74 [BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
fixes #247
2019-11-12 13:32:00 +01:00
Sebastian Nagel
2979aaf4aa [BasicNormalizer] Trailing question mark in url query test should be added,
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel
7e5347eca6
Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check
EffectiveTldFinder to validate returned domain names for length restrictions
2019-10-15 16:24:47 +02:00
Avi Hayun
1c74fc6ca0 The robots.txt test now supports the junit5 parameterizedTest, gaining us the ability to have each robot rule be tested separately, so if one rule fails, the others can still show success in their own unit tests. (#256)
IMHO it also makes the tests more readable but that is in the eyes of the beholder

This file is 40 lines shorter and has 172 unit tests instead of 46
2019-07-23 13:13:45 +01:00
Avi Hayun
3597dee5a7 Updated the tests to use parameterizedTest, making the test file more readable; each normalized pair is now a test in itself, which makes it much easier to identify exact failure points 2019-07-16 21:17:25 +03:00
Avi Hayun
5bda3636d2
Upgraded to Junit v5.5 (#250)
* Upgraded to Junit v5.5
Updated the annotations and assertions accordingly

* Removed unneeded before and after

* This is a technical debt

Fixing a styling issue I caused about 4 years ago

Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82

* Fixed according to @sebastian-nagel code review
2019-07-15 21:29:03 +03:00
Sebastian Nagel
fba716c765 EffectiveTldFinder to validate returned domain names for length restrictions,
fixes #251
2019-07-15 12:39:04 +02:00
Sebastian Nagel
b449fdf024 Robots parser to always handle absolute sitemap URL even without valid base URL,
fixes #240
2019-03-26 15:05:12 +01:00
Sebastian Nagel
e8b598b2e8 [Sitemaps] Unit tests depend on system timezone, fixes #238 (#239)
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
  add note that UTC is assumed if no time zone is contained in
  date string
2019-03-19 15:00:04 +00:00
Sebastian Nagel
40531efe25 EffectiveTldFinder: upgrade public suffix list, implements #219 (#235)
- upgrade the public suffix list to the latest version retrieved right now from
     https://publicsuffix.org/list/public_suffix_list.dat
  resp.
     802c469416/public_suffix_list.dat
2019-03-14 11:22:50 +00:00
Sebastian Nagel
a69bf09cd4 Detection and parsing of XML sitemaps fails with whitespace before XML declaration,
fixes #144
- implement InputStream skipping over white space at beginning of file
- use for XML sitemaps in combination with BOMInputStream,
  so that white space or empty lines before <?xml ...> do not
  cause the parser to fail
2019-03-07 16:17:09 +01:00
Sebastian Nagel
2539b07948 Buffer characters when characters(...) is called (instead of immediately processing
the current character chunk)
- use `localName` instead of calling `currentElement()` where applicable
- remove unnecessary null checks of character buffer
2019-02-25 14:32:26 +01:00
Sebastian Nagel
82859c2b1a Buffer characters when characters(...) is called (instead of immediately processing
the current character chunk)
- fix errors when character chunks are interrupted by CDATA sections or character entities
  - fixes #225 XMLIndexHandler needs to accumulate the lastmod date string before parsing
  - fixes #226 XMLHandler needs to append text in characters() vs. immediately processing
  - provide character buffer in DelegatorHandler, so that derived classes
    can append characters to it and finally get the buffered content
- code cleanup in all handler classes:
  - add @Override annotations
  - remove stub method implementations
2019-02-21 23:01:32 +01:00
Sebastian Nagel
916415d262
Merge branch 'master' into cc-231-etld-invalid-idns 2019-02-21 22:16:48 +01:00
Sebastian Nagel
67db8bf1be [Sitemaps] Trim Unicode whitespace around URLs, fixes #224 2019-02-20 16:27:16 +01:00
Sebastian Nagel
680df29f0f EffectiveTldFinder throws IllegalArgumentException on IDN domain names
containing prohibited characters, fixes #231
- catch IllegalArgumentException when converting IDNs to ASCII
- validate domain names returned by getAssignedDomain(...)
- add unit tests, complete Javadoc
2019-02-20 14:59:58 +01:00
Sebastian Nagel
2bdc1923b1 Merge branch 'master' into cc-213-auto-close-sitemap-in-sitemapindex 2019-02-03 13:47:31 +01:00
Sebastian Nagel
5706ee1e97 Sitemap index: stop URL at closing </loc>
- bug fix: <sitemap> element is closed, not <url>
- add unit test to cover resetting of attributes
2019-02-03 13:44:14 +01:00
Sebastian Nagel
e0ef1ee29d Remove final from method parameters 2019-01-18 17:27:23 +01:00
Sebastian Nagel
13fab2e033 Sitemap index: stop URL at closing </loc>, fixes #213
- at start of a <loc> element auto-close any unclosed
  <sitemap> element and add the sitemap if there is a
  valid URL from the previous <loc> element
2018-12-13 15:01:23 +01:00
Sebastian Nagel
439c350ab3 Allow empty price in video sitemaps, fixes #221 2018-12-07 16:44:47 +01:00
Sebastian Nagel
f361597447 Inline unit test for locale-specific number formats in video:price tags (#220)
- simplify unit tests to contain only tested or required video attributes
- fix order of attributes/elements to follow XSD schema spec
2018-12-07 16:44:16 +01:00
Aurelien Violette
3c12c715c8 Fix object comparison on testing. 2018-12-03 12:55:30 +01:00
Aurelien Violette
e802cfa8eb With a different locale, the price tag can be formatted with ',' instead of '.', leading to an NPE.
The NPE occurs because parseFloat returns a Float object that can be null after a NumberFormatException, but VideoPrice accepts only a primitive float.
To work around this issue and avoid recurring errors, I've changed the VideoPrice price field to a Float object, which accepts null in that case.

This is far from ideal, and parseFloat would benefit from being able to parse different locale formats. Still, as a first quick fix, this allows the rest of the file to be parsed,
whereas previously the error caused parsing of the whole file to fail.
2018-12-03 12:44:15 +01:00
Sebastian Nagel
862af9416f Sitemap extension support
- add extension support to SiteMapTester
- list extension attributes in SiteMapURL.toString()
- update change log
2018-09-28 12:14:02 +02:00
Sebastian Nagel
66745b29a1 Add support for the Mobile sitemap extension 2018-09-28 12:05:07 +02:00
Sebastian Nagel
b924bd0828 Sitemap extension support
- optionally parse elements in the namespace of sitemap extensions:
  - Google video sitemaps (resolves #35)
  - Google image sitemaps (resolves #36)
  - Google news sitemaps
  - alternate links in sitemaps (resolves #149)
- the code is taken from Tanguy Moal's (@tuxnco) PR #162
  with the following modifications:
  - port from DOM to SAX parser
  - keep specific extensions separate from the "core" sitemap classes
2018-09-28 12:04:39 +02:00
Sebastian Nagel
9318de951f Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (#217)
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
  but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests

* Fix Javadoc error and warnings, update change log

* Remove obsolete dependency to jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
  by updating to Java 8 date and time API
2018-09-24 10:09:58 +01:00
Ken Krugler
54d90deaa4 Fix up formatting 2018-08-01 10:15:42 -07:00
Ken Krugler
d5f614b67a Fix for URL with only a query string (no path) 2018-07-31 17:10:12 -07:00
Julien Nioche
0da1b8b8b5 Minor changes + applied formatting pre 0.10 release 2018-06-05 11:33:27 +01:00
Sebastian Nagel
d98a3f14cf Allow for legacy URIs when checking sitemap namespaces (#211)
* Allow for legacy URIs when checking sitemap namespaces
- e.g., allow legacy namespace URI but ignore URLs
  from image and video sitemap extensions
- resolve relative namespace URIs
- add namespace URIs of sitemap extensions (news, images, videos)

* Address kkrugler's review comments:
- document addition of sitemap namespace required by sitemap
  protocol specification when calling setStrictNamespace(true)
- remove early return on <rss> root element
2018-06-05 11:20:26 +01:00
Sebastian Nagel
0c75e75496 Add main to SimpleRobotRulesParser for testing (#193)
* Add main to SimpleRobotRulesParser for testing
- implement toString() for robot rules
- fix line breaks in comments

* Do not detect MIME type as Tika dependency has been removed
2018-06-04 21:26:45 +01:00
Ken Krugler
d99c034dd0
Merge branch 'master' into issue-134 2018-05-14 11:20:17 -07:00
Aecio Santos
47c2cad8b8 Add getters/setters and update CHANGES.txt 2018-05-14 12:00:02 -04:00
Aecio Santos
fd1e7fcffe SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos
7bef14d386 Make RobotRules accessible #134
- Makes SimpleRobotRulesParser._rules property protected
  and adds getters for SimpleRobotRulesParser._rules and
  RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
  to SimpleRobotRules to allow access to concrete class without
  nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Sebastian Nagel
7d3eccfa63 Add changelog entry and fix unit test 2018-04-25 14:06:33 +02:00
Luc Boruta
355684ff20 Add methods to parse and iterate sitemap
SiteMapParser#walkSiteMap(URL,Consumer)
2018-04-25 14:04:58 +02:00
Sebastian Nagel
0ef7cf87fa Improve sitemap parsing
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel
a9277acde2
Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel
a6b3178fc7 Simplify MIME detection:
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel
6714ea57d6 Fix error message format string 2018-04-18 14:53:21 +02:00
Sebastian Nagel
72aa177738 RDF-based RSS feeds: map MIME type, detect from content 2018-04-18 14:44:10 +02:00
Sebastian Nagel
4780678af0 Improve logging of content type detection for gzip-compressed sitemaps 2018-04-18 14:43:52 +02:00
Sebastian Nagel
907be2343f Format fix: add braces, complete CHANGES.txt 2018-04-16 13:36:06 +02:00
Sebastian Nagel
af084f7ee4 Sitemaps limit on "bad url" log messages, fixes #145
- degrade log level to debug for lines which are not valid
- only log first 1024 characters of line
2018-04-13 11:16:47 +02:00
Sebastian Nagel
34c19d811a Fix MIME detection for sitemaps:
- avoid NPE if no MIME type has been detected
- allow optional leading white space before MIME patterns
  (after optional BOM)
2018-04-10 19:17:16 +02:00