mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-28 20:36:05 +02:00
Commit Graph

94 Commits

Author SHA1 Message Date
Julien Nioche 0da1b8b8b5 Minor changes + applied formatting pre 0.10 release 2018-06-05 11:33:27 +01:00
Julien Nioche 8195140e21 Update CHANGES.txt
added #211
2018-06-05 11:23:00 +01:00
Julien Nioche a8b474551a Update CHANGES.txt
Add main to SimpleRobotRulesParser for testing (#193)
2018-06-04 21:28:09 +01:00
Ken Krugler d99c034dd0 Merge branch 'master' into issue-134 2018-05-14 11:20:17 -07:00
Aecio Santos 47c2cad8b8 Add getters/setters and update CHANGES.txt 2018-05-14 12:00:02 -04:00
Aecio Santos aaa3113e55 Update CHANGES.txt 2018-05-14 11:51:43 -04:00
Aecio Santos fd1e7fcffe SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos 7bef14d386 Make RobotRules accessible #134
- Makes SimpleRobotRulesParser._rules property protected
  and adds getters for SimpleRobotRulesParser._rules and
  RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
  to SimpleRobotRules to allow access to concrete class without
  nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Julien Nioche e25309d26c Add JAX-B dependencies to POM (#207)
* Add JAX-B dependencies to POM, fixes #196

* mentioned in CHANGES.txt
2018-05-03 11:04:03 +01:00
Sebastian Nagel 7d3eccfa63 Add changelog entry and fix unit test 2018-04-25 14:06:33 +02:00
Sebastian Nagel 0ef7cf87fa Improve sitemap parsing
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel a9277acde2 Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel a6b3178fc7 Simplify MIME detection:
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel 907be2343f Format fix: add braces, complete CHANGES.txt 2018-04-16 13:36:06 +02:00
Ken Krugler 12155888bc Add reference to issue #199 2018-04-02 12:59:17 -07:00
Sebastian Nagel 49bf37c6d9 Update CHANGES.txt 2017-12-08 09:42:52 +01:00
Ken Krugler 2b58c5050c merge with master 2017-11-05 14:55:30 -08:00
Ken Krugler aeb0cb91a2 Update CHANGES.txt 2017-11-05 14:53:58 -08:00
Julien Nioche af0a013776 Released 0.9 2017-10-31 09:42:57 +00:00
Julien Nioche f3e37f37da Updated change log prior to 0.9 release 2017-10-27 11:08:47 +01:00
Sebastian Nagel 2afdf5b04d Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169
- completely add sitemap URLs from sitemap index if URL contains
  XML entities or CDATA
2017-08-12 17:28:08 +02:00
Ken Krugler 7e08c1da49 Update CHANGES.txt 2017-06-20 15:18:47 -07:00
Julien Nioche 694e74207b release notes for 0.8 2017-06-09 10:15:44 +01:00
Julien Nioche 2c72ba8708 Update CHANGES.txt
added tika 1.15 to changes
2017-06-02 15:02:20 +01:00
Sebastian Nagel 02e62c12cb Disable XML resolvers: update changelog 2017-05-04 22:36:40 +02:00
Julien Nioche 4ba1295c17 Update forbiddenapis to v2.0. Fixes #99 2017-03-20 15:58:55 +00:00
Sebastian Nagel 772f02fcb0 Fix parsing of gzipped text sitemaps, fixes #143
- detect gzip embedded media type to decide
  whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Sebastian Nagel f7c7cab7a8 Merge branch 'matt-deboer-master'
- provide SAX parser optionally to DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Sebastian Nagel 61a500ad21 Use constants from StandardCharsets where applicable, fixes #141 2017-02-02 14:59:14 +01:00
Sebastian Nagel 49b3097083 Increase size limit of sitemaps (10MB -> 50MB), fixes #132 2017-02-02 12:00:47 +01:00
Julien Nioche eefeda558c Update CHANGES.txt
#137
2017-01-13 17:31:10 +00:00
Sebastian Nagel cb38a5fc8f BasicURLNormalizer: NPE for URLs without authority
- check whether URL.getAuthority() returns null
- recompose URLs without authority with empty authority/host
2017-01-11 17:05:53 +01:00
Sebastian Nagel e39aa60373 BasicURLNormalizer to remove empty port 2016-12-09 14:54:00 +01:00
Ken Krugler 5783046f4e Issue #96
Remove fetcher support
2016-11-27 09:18:21 -08:00
Julien Nioche 1a6c1b0dce Released 0.7 2016-11-24 10:07:27 +00:00
Julien Nioche 9ea4f1b514 added ref to #126
and changed presentation of issue number
2016-09-30 12:33:06 +01:00
Julien Nioche 36a4bd420e Updated CHANGES with 124 2016-09-21 14:59:34 +01:00
Julien Nioche 4625a358f2 Update CHANGES.txt
added #117 and #113
2016-09-20 10:30:31 +01:00
Julien Nioche 0775bb216e Fix license headers + applied formatting. Fixes #108 2016-06-30 11:45:08 +01:00
Julien Nioche be52b770ff Rename package crawlercommons.url Fixes #107 2016-06-30 11:11:49 +01:00
Ken Krugler 31a6c80ea7 Fix sitemap extraction from robots.txt 2015-12-30 22:03:49 -08:00
Julien Nioche f3f34844d4 Deprecate fetcher classes #97 2015-12-02 10:30:54 +00:00
Julien Nioche c1b3f4b086 Added URLFilter interface + BasicURLNormalizer borrowed from Nutch #106 2015-11-13 10:58:48 +00:00
Ken Krugler 4c43c48ef7 Merged conflict with CHANGES.txt 2015-10-20 07:50:58 -07:00
Ken Krugler 940cbfd0e8 Merged with aecio 2015-10-20 07:48:51 -07:00
Aecio Santos f2bf9300e6 Upgrades httpclient to version 4.5.1 (fixes #84)
and do not ignore test failures during maven build
2015-10-09 14:08:39 -04:00
Julien Nioche 98316a51fc issue #100 in CHANGES.txt 2015-10-06 18:48:58 +01:00
Ken Krugler 9e9f5df884 Fixed up CHANGES.txt file 2015-09-15 07:57:13 -07:00
Julien Nioche f0d71b4729 mentioned issue 89 in CHANGES 2015-09-15 11:38:40 +01:00
Lewis John McGibbney f2e41af53c Trivial commit to update CHANGES.txt for recent commits. 2015-09-14 22:40:04 -07:00