Julien Nioche
58f8e22dc2
[maven-release-plugin] prepare release crawler-commons-0.8
2017-06-07 16:09:25 +01:00
Julien Nioche
79032dbaf5
Merge pull request #165 from crawler-commons/153
...
SAX parser to stop URL at closing </loc> fixes #153
2017-06-06 22:02:46 +01:00
Sebastian Nagel
4dd61ded84
SAX parser to stop URL at closing </loc> fixes #153
...
- on opening <loc> and <url> assume forgotten closing </url> tag
only if there is more than white space in the buffer
2017-06-06 21:04:28 +02:00
Ken Krugler
90c2800869
Merge pull request #164 from crawler-commons/153
...
SAX parser to stop URL at closing </loc> fixes #153
2017-06-05 12:35:22 -07:00
Julien Nioche
7bf499a98a
SAX parser to stop URL at closing </loc> fixes #153
2017-06-05 16:58:39 +01:00
Julien Nioche
2c72ba8708
Update CHANGES.txt
...
added tika 1.15 to changes
2017-06-02 15:02:20 +01:00
Julien Nioche
e2f38b3903
Merge pull request #163 from crawler-commons/tika1.15
...
Upgraded Tika 1.15
2017-06-02 14:59:05 +01:00
Julien Nioche
f8c20c8e4a
Upgraded Tika 1.15
2017-06-02 12:22:44 +01:00
Sebastian Nagel
02e62c12cb
Disable XML resolvers: update changelog
2017-05-04 22:36:40 +02:00
Sebastian Nagel
ce42b00150
Merge pull request #152 from sebastian-nagel/sitemaps-disable-xml-resolvers
...
Disable validation of XML sitemaps and install noop entity resolver, fixes #151
2017-05-04 22:33:35 +02:00
Julien Nioche
462a7f40b7
Merge pull request #160 from sebastian-nagel/cc-159-sax-atom-feed
...
SAX parser does not extract URLs from Atom feeds, fixes #159
2017-03-28 13:29:06 +01:00
Sebastian Nagel
4f1534a6f4
SAX parser does not extract URLs from Atom feeds, fixes #159
...
- call delegated method startElement(...) from DelegatorHandler
- add heuristics to select link URL
- add Atom feed unit test
- add static method to encapsulate reading of test resource files
2017-03-28 11:54:04 +02:00
Julien Nioche
4ba1295c17
Update forbiddenapis to v2.0. Fixes #99
2017-03-20 15:58:55 +00:00
Julien Nioche
9d67dabb1f
Merge pull request #157 from sebastian-nagel/cc-143-sitemap-text-gzipped
...
Fix parsing of gzipped text sitemaps, fixes #143
2017-03-20 15:48:57 +00:00
Sebastian Nagel
772f02fcb0
Fix parsing of gzipped text sitemaps, fixes #143
...
- detect gzip embedded media type to decide
whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Julien Nioche
266ac09fff
Merge pull request #155 from sebastian-nagel/cc-154-sax-parser-fail-on-unknown-xml
...
SAX parser to fail on valid but unknown XML formats, fixes #154
2017-03-08 13:58:46 +00:00
Sebastian Nagel
6759ee73e9
SAX parser to fail on valid but unknown XML formats, fixes #154
2017-03-08 14:23:55 +01:00
Sebastian Nagel
9b936ecca4
Disable validation of XML sitemaps and install noop entity resolver, fixes #151
2017-03-03 00:01:12 +01:00
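Note on 9b936ecca4: the change described above (no validation, no-op entity resolver) can be illustrated with the standard JAXP API. This is only a minimal sketch, not the code from the commit. The class name OfflineSitemapDom and the helper firstLoc are hypothetical, and the sketch uses a DOM builder for brevity. The resolver answers every external entity request with an empty document, so a DOCTYPE pointing at a remote DTD does not trigger a network fetch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only (not part of crawler-commons).
final class OfflineSitemapDom {

    // Parses the XML without validation and returns the text of the first <loc> element.
    static String firstLoc(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // no DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // No-op resolver: every external entity (e.g. a remote DTD) resolves to empty content.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("loc").item(0).getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers of this sketch stay simple.
            throw new IllegalStateException("could not parse sitemap", e);
        }
    }
}
```

With this resolver in place, a sitemap whose DOCTYPE names an unreachable DTD should still parse, because the DTD URL is never opened.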
Julien Nioche
cb8e760102
Merge pull request #150 from sebastian-nagel/cc-116-sitemap-sax-parser
...
Optional SAX Parser for Sitemaps
2017-02-28 09:20:09 +00:00
Sebastian Nagel
cc09c889fe
Merge branch 'master' into cc-116-sitemap-sax-parser
2017-02-27 23:12:55 +01:00
Sebastian Nagel
f7c7cab7a8
Merge branch 'matt-deboer-master'
...
- provide SAX parser optionally to DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Ken Krugler
d5d4e1b751
Merge pull request #147 from sebastian-nagel/cc-146-log-xml-errors
...
Looks good, thanks @sebastian-nagel
2017-02-07 13:39:54 -08:00
Sebastian Nagel
be982c1a1c
SiteMapParser: install error handlers in DOM builder to log XML errors, fixes #146
2017-02-07 22:06:03 +01:00
Ken Krugler
2d751a04b8
Merge pull request #142 from sebastian-nagel/cc-141-use-standardcharsets
...
Use constants from StandardCharsets where applicable, fixes #141.
Thanks @sebastian-nagel!
2017-02-02 09:44:00 -08:00
Sebastian Nagel
960b7a8fed
remove unused import
2017-02-02 15:18:05 +01:00
Sebastian Nagel
61a500ad21
Use constants from StandardCharsets where applicable, fixes #141
2017-02-02 14:59:14 +01:00
Ken Krugler
375aae3d99
Merge pull request #140 from sebastian-nagel/cc-132-sitemap-limit-50mb
...
Increase size limit of sitemaps (10MB -> 50MB), fixes #132
2017-02-02 03:57:45 -08:00
Sebastian Nagel
49b3097083
Increase size limit of sitemaps (10MB -> 50MB), fixes #132
2017-02-02 12:00:47 +01:00
Sebastian Nagel
9a49088ec8
Add new committer
2017-01-16 15:47:44 +01:00
Julien Nioche
eefeda558c
Update CHANGES.txt
...
#137
2017-01-13 17:31:10 +00:00
Julien Nioche
040c02dad8
Merge pull request #138 from sebastian-nagel/137-default-locale
...
Remove dependencies to system-specific locale, fixes #137
2017-01-13 17:29:16 +00:00
Sebastian Nagel
81a460de54
Remove dependencies to system-specific locale, fixes #137
...
- use Locale.ROOT instead of Locale.getDefault()
- set encoding for resources to UTF-8
(StandardCharsets.UTF_8 instead of Charset.defaultCharset())
2017-01-12 15:22:48 +01:00
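Note on 81a460de54: the two substitutions named in the commit body above can be sketched as follows. This is an illustrative example, not the code from the commit. The class LocaleIndependentText and its methods are hypothetical names. Locale.ROOT and StandardCharsets.UTF_8 are standard JDK constants.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of crawler-commons).
final class LocaleIndependentText {

    // Locale.ROOT keeps case conversion identical on every machine.
    // With a Turkish default locale, toLowerCase() without a locale maps 'I' to a dotless i.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // An explicit charset avoids depending on the platform default encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Passing the locale and charset explicitly makes the output independent of how the JVM or operating system happens to be configured.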
Ken Krugler
477bfd678f
Merge pull request #136 from sebastian-nagel/basic-url-normalizer-npe
...
BasicURLNormalizer: NPE for URLs without authority
2017-01-11 19:20:27 -08:00
Sebastian Nagel
cb38a5fc8f
BasicURLNormalizer: NPE for URLs without authority
...
- check whether URL.getAuthority() returns null
- recompose URLs without authority with empty authority/host
2017-01-11 17:05:53 +01:00
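Note on cb38a5fc8f: the null check and the recomposition with an empty authority described above might look roughly like this. It is a sketch under assumptions, not the BasicURLNormalizer code. The class AuthorityGuard and the method recompose are hypothetical, and the sketch ignores the fragment and any other normalization steps.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only (not part of crawler-commons).
final class AuthorityGuard {

    // Rebuilds "protocol://authority/file", using an empty authority when the URL has none.
    static String recompose(String spec) {
        URL url;
        try {
            url = new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a valid URL: " + spec, e);
        }
        // getAuthority() may return null (e.g. "file:/tmp/x"); guard before using it.
        String authority = url.getAuthority();
        if (authority == null) {
            authority = "";
        }
        // getFile() returns the path plus the query string, if any.
        return url.getProtocol() + "://" + authority + url.getFile();
    }
}
```

For a URL such as file:/tmp/x this yields file:///tmp/x, and a URL that already has an authority passes through unchanged apart from the dropped fragment.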
Julien Nioche
a2dd24624e
Merge pull request #133 from sebastian-nagel/basic-url-normalizer-empty-port
...
BasicURLNormalizer to strip empty port
2017-01-03 10:28:30 +00:00
Sebastian Nagel
e39aa60373
BasicURLNormalizer to remove empty port
2016-12-09 14:54:00 +01:00
Julien Nioche
eba560af1c
Merge pull request #131 from crawler-commons/Remove_Unused_HttpClient_Resources_130
...
Issue #130
2016-11-28 16:52:28 +00:00
Ken Krugler
49d71ad295
Issue #130
...
Remove unused HttpClient code & resources
2016-11-28 08:45:09 -08:00
Julien Nioche
c39f917340
Merge pull request #129 from crawler-commons/Remove_HTTP_fetcher_support_96
...
Issue #96
2016-11-28 10:38:27 +00:00
Ken Krugler
5783046f4e
Issue #96
...
Remove fetcher support
2016-11-27 09:18:21 -08:00
Julien Nioche
1a6c1b0dce
Released 0.7
2016-11-24 10:07:27 +00:00
Julien Nioche
2df1374eb4
[maven-release-plugin] prepare for next development iteration
2016-11-21 14:49:47 +00:00
Julien Nioche
7ef3599487
[maven-release-plugin] prepare release crawler-commons-0.7
2016-11-21 14:49:43 +00:00
Lewis John McGibbney
99b85b4f42
Add license and badge to codebase
2016-10-25 18:01:17 -07:00
Julien Nioche
9ea4f1b514
added ref to #126
...
and changed presentation of issue number
2016-09-30 12:33:06 +01:00
Julien Nioche
d0c1221846
Merge pull request #126 from lewismc/ISSUE-125
...
Upgrade to JDK 1.8
2016-09-30 12:28:08 +01:00
Lewis John McGibbney
18bbae908c
Upgrade to JDK 1.8
2016-09-29 21:39:24 -07:00
Lewis John McGibbney
fc3378cb95
Merge branch 'master' into ISSUE-125
2016-09-29 21:36:05 -07:00
Julien Nioche
f4b76c70c6
Merge pull request #128 from echoboxapp/further-sitemapparser-method-visibility-changes
...
Further changes to SiteMapParser and AbstractSiteMap method visibilit…
2016-09-29 11:11:02 +01:00
Michael Lavelle
58608c485d
Further changes to SiteMapParser and AbstractSiteMap method visibility. Previous pull request to change certain methods to protected did not cover all required methods. The changes in this commit allow for the addition of GoogleNews site maps
2016-09-29 09:07:45 +01:00