Julien Nioche
0da1b8b8b5
Minor changes + applied formatting pre 0.10 release
2018-06-05 11:33:27 +01:00
Julien Nioche
8195140e21
Update CHANGES.txt
...
added #211
2018-06-05 11:23:00 +01:00
Julien Nioche
a8b474551a
Update CHANGES.txt
...
Add main to SimpleRobotRulesParser for testing (#193)
2018-06-04 21:28:09 +01:00
Ken Krugler
d99c034dd0
Merge branch 'master' into issue-134
2018-05-14 11:20:17 -07:00
Aecio Santos
47c2cad8b8
Add getters/setters and update CHANGES.txt
2018-05-14 12:00:02 -04:00
Aecio Santos
aaa3113e55
Update CHANGES.txt
2018-05-14 11:51:43 -04:00
Aecio Santos
fd1e7fcffe
SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
...
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos
7bef14d386
Make RobotRules accessible #134
...
- Makes SimpleRobotRulesParser._rules property protected
and adds getters for SimpleRobotRulesParser._rules and
RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
to SimpleRobotRules to allow access to concrete class without
nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Julien Nioche
e25309d26c
Add JAX-B dependencies to POM (#207)
...
* Add JAX-B dependencies to POM, fixes #196
* mentioned in CHANGES.txt
2018-05-03 11:04:03 +01:00
Sebastian Nagel
7d3eccfa63
Add changelog entry and fix unit test
2018-04-25 14:06:33 +02:00
Sebastian Nagel
0ef7cf87fa
Improve sitemap parsing
...
- ignore query part of URL to determine sitemap location prefix
for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel
a9277acde2
Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
...
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel
a6b3178fc7
Simplify MIME detection:
...
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel
907be2343f
Format fix: add braces, complete CHANGES.txt
2018-04-16 13:36:06 +02:00
Ken Krugler
12155888bc
Add reference to issue #199
2018-04-02 12:59:17 -07:00
Sebastian Nagel
49bf37c6d9
Update CHANGES.txt
2017-12-08 09:42:52 +01:00
Ken Krugler
2b58c5050c
merge with master
2017-11-05 14:55:30 -08:00
Ken Krugler
aeb0cb91a2
Update CHANGES.txt
2017-11-05 14:53:58 -08:00
Julien Nioche
af0a013776
Released 0.9
2017-10-31 09:42:57 +00:00
Julien Nioche
f3e37f37da
Updated change log prior to 0.9 release
2017-10-27 11:08:47 +01:00
Sebastian Nagel
2afdf5b04d
Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169
...
- add sitemap URLs from the sitemap index in full when the URL contains
XML entities or CDATA
2017-08-12 17:28:08 +02:00
Ken Krugler
7e08c1da49
Update CHANGES.txt
2017-06-20 15:18:47 -07:00
Julien Nioche
694e74207b
release notes for 0.8
2017-06-09 10:15:44 +01:00
Julien Nioche
2c72ba8708
Update CHANGES.txt
...
added tika 1.15 to changes
2017-06-02 15:02:20 +01:00
Sebastian Nagel
02e62c12cb
Disable XML resolvers: update changelog
2017-05-04 22:36:40 +02:00
Julien Nioche
4ba1295c17
Update forbiddenapis to v2.0. Fixes #99
2017-03-20 15:58:55 +00:00
Sebastian Nagel
772f02fcb0
Fix parsing of gzipped text sitemaps, fixes #143
...
- detect gzip embedded media type to decide
whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Sebastian Nagel
f7c7cab7a8
Merge branch 'matt-deboer-master'
...
- provide SAX parser optionally to DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Sebastian Nagel
61a500ad21
Use constants from StandardCharsets where applicable, fixes #141
2017-02-02 14:59:14 +01:00
Sebastian Nagel
49b3097083
Increase size limit of sitemaps (10MB -> 50MB), fixes #132
2017-02-02 12:00:47 +01:00
Julien Nioche
eefeda558c
Update CHANGES.txt
...
#137
2017-01-13 17:31:10 +00:00
Sebastian Nagel
cb38a5fc8f
BasicURLNormalizer: NPE for URLs without authority
...
- check whether URL.getAuthority() returns null
- recompose URLs without authority with empty authority/host
2017-01-11 17:05:53 +01:00
Sebastian Nagel
e39aa60373
BasicURLNormalizer to remove empty port
2016-12-09 14:54:00 +01:00
Ken Krugler
5783046f4e
Issue #96
...
Remove fetcher support
2016-11-27 09:18:21 -08:00
Julien Nioche
1a6c1b0dce
Released 0.7
2016-11-24 10:07:27 +00:00
Julien Nioche
9ea4f1b514
added ref to #126
...
and changed presentation of issue number
2016-09-30 12:33:06 +01:00
Julien Nioche
36a4bd420e
Updated CHANGES with 124
2016-09-21 14:59:34 +01:00
Julien Nioche
4625a358f2
Update CHANGES.txt
...
added #117 and #113
2016-09-20 10:30:31 +01:00
Julien Nioche
0775bb216e
Fix license headers + applied formatting. Fixes #108
2016-06-30 11:45:08 +01:00
Julien Nioche
be52b770ff
Rename package crawlercommons.url, fixes #107
2016-06-30 11:11:49 +01:00
Ken Krugler
31a6c80ea7
Fix sitemap extraction from robots.txt
2015-12-30 22:03:49 -08:00
Julien Nioche
f3f34844d4
Deprecate fetcher classes #97
2015-12-02 10:30:54 +00:00
Julien Nioche
c1b3f4b086
Added URLFilter interface + BasicURLNormalizer borrowed from Nutch #106
2015-11-13 10:58:48 +00:00
Ken Krugler
4c43c48ef7
Merged conflict with CHANGES.txt
2015-10-20 07:50:58 -07:00
Ken Krugler
940cbfd0e8
Merged with aecio
2015-10-20 07:48:51 -07:00
Aecio Santos
f2bf9300e6
Upgrades httpclient to version 4.5.1 (fixes #84)
...
and do not ignore test failures during maven build
2015-10-09 14:08:39 -04:00
Julien Nioche
98316a51fc
issue #100 in CHANGES.txt
2015-10-06 18:48:58 +01:00
Ken Krugler
9e9f5df884
Fixed up CHANGES.txt file
2015-09-15 07:57:13 -07:00
Julien Nioche
f0d71b4729
mentioned issue 89 in CHANGES
2015-09-15 11:38:40 +01:00
Lewis John McGibbney
f2e41af53c
Trivial commit to update CHANGES.txt for recent commits.
2015-09-14 22:40:04 -07:00