mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-09 07:46:03 +02:00
Commit Graph

275 Commits

Author SHA1 Message Date
Julien Nioche 58f8e22dc2 [maven-release-plugin] prepare release crawler-commons-0.8 2017-06-07 16:09:25 +01:00
Julien Nioche 79032dbaf5 Merge pull request #165 from crawler-commons/153
SAX parser to stop URL at closing </loc> fixes #153
2017-06-06 22:02:46 +01:00
Sebastian Nagel 4dd61ded84 SAX parser to stop URL at closing </loc> fixes #153
- on opening <loc> and <url> assume forgotten closing </url> tag
  only if there is more than white space in the buffer
2017-06-06 21:04:28 +02:00
Ken Krugler 90c2800869 Merge pull request #164 from crawler-commons/153
SAX parser to stop URL at closing </loc> fixes #153
2017-06-05 12:35:22 -07:00
Julien Nioche 7bf499a98a SAX parser to stop URL at closing </loc> fixes #153 2017-06-05 16:58:39 +01:00
Julien Nioche 2c72ba8708 Update CHANGES.txt
added tika 1.15 to changes
2017-06-02 15:02:20 +01:00
Julien Nioche e2f38b3903 Merge pull request #163 from crawler-commons/tika1.15
Upgraded Tika 1.15
2017-06-02 14:59:05 +01:00
Julien Nioche f8c20c8e4a Upgraded Tika 1.15 2017-06-02 12:22:44 +01:00
Sebastian Nagel 02e62c12cb Disable XML resolvers: update changelog 2017-05-04 22:36:40 +02:00
Sebastian Nagel ce42b00150 Merge pull request #152 from sebastian-nagel/sitemaps-disable-xml-resolvers
Disable validation of XML sitemaps and install noop entity resolver, fixes #151
2017-05-04 22:33:35 +02:00
Julien Nioche 462a7f40b7 Merge pull request #160 from sebastian-nagel/cc-159-sax-atom-feed
SAX parser does not extract URLs from Atom feeds, fixes #159
2017-03-28 13:29:06 +01:00
Sebastian Nagel 4f1534a6f4 SAX parser does not extract URLs from Atom feeds, fixes #159
- call delegated method startElement(...) from DelegatorHandler
- add heuristics to select link URL
- add Atom feed unit test
- add static method to encapsulate reading of test resource files
2017-03-28 11:54:04 +02:00
Julien Nioche 4ba1295c17 Update forbiddenapis to v2.0. Fixes #99 2017-03-20 15:58:55 +00:00
Julien Nioche 9d67dabb1f Merge pull request #157 from sebastian-nagel/cc-143-sitemap-text-gzipped
Fix parsing of gzipped text sitemaps, fixes #143
2017-03-20 15:48:57 +00:00
Sebastian Nagel 772f02fcb0 Fix parsing of gzipped text sitemaps, fixes #143
- detect gzip embedded media type to decide
  whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Julien Nioche 266ac09fff Merge pull request #155 from sebastian-nagel/cc-154-sax-parser-fail-on-unknown-xml
SAX parser to fail on valid but unknown XML formats, fixes #154
2017-03-08 13:58:46 +00:00
Sebastian Nagel 6759ee73e9 SAX parser to fail on valid but unknown XML formats, fixes #154 2017-03-08 14:23:55 +01:00
Sebastian Nagel 9b936ecca4 Disable validation of XML sitemaps and install noop entity resolver, fixes #151 2017-03-03 00:01:12 +01:00
Julien Nioche cb8e760102 Merge pull request #150 from sebastian-nagel/cc-116-sitemap-sax-parser
Optional SAX Parser for Sitemaps
2017-02-28 09:20:09 +00:00
Sebastian Nagel cc09c889fe Merge branch 'master' into cc-116-sitemap-sax-parser 2017-02-27 23:12:55 +01:00
Sebastian Nagel f7c7cab7a8 Merge branch 'matt-deboer-master'
- provide SAX parser optionally to DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Ken Krugler d5d4e1b751 Merge pull request #147 from sebastian-nagel/cc-146-log-xml-errors
Looks good, thanks @sebastian-nagel
2017-02-07 13:39:54 -08:00
Sebastian Nagel be982c1a1c SiteMapParser: install error handlers in DOM builder to log XML errors, fixes #146 2017-02-07 22:06:03 +01:00
Ken Krugler 2d751a04b8 Merge pull request #142 from sebastian-nagel/cc-141-use-standardcharsets
Use constants from StandardCharsets where applicable, fixes #141.

Thanks @sebastian-nagel!
2017-02-02 09:44:00 -08:00
Sebastian Nagel 960b7a8fed remove unused import 2017-02-02 15:18:05 +01:00
Sebastian Nagel 61a500ad21 Use constants from StandardCharsets where applicable, fixes #141 2017-02-02 14:59:14 +01:00
Ken Krugler 375aae3d99 Merge pull request #140 from sebastian-nagel/cc-132-sitemap-limit-50mb
Increase size limit of sitemaps (10MB -> 50MB), fixes #132
2017-02-02 03:57:45 -08:00
Sebastian Nagel 49b3097083 Increase size limit of sitemaps (10MB -> 50MB), fixes #132 2017-02-02 12:00:47 +01:00
Sebastian Nagel 9a49088ec8 Add new committer 2017-01-16 15:47:44 +01:00
Julien Nioche eefeda558c Update CHANGES.txt
#137
2017-01-13 17:31:10 +00:00
Julien Nioche 040c02dad8 Merge pull request #138 from sebastian-nagel/137-default-locale
Remove dependencies to system-specific locale, fixes #137
2017-01-13 17:29:16 +00:00
Sebastian Nagel 81a460de54 Remove dependencies to system-specific locale, fixes #137
- use Locale.ROOT instead of Locale.getDefault()
- set encoding for resources to UTF-8
  (StandardCharsets.UTF_8 instead of Charset.defaultCharset())
2017-01-12 15:22:48 +01:00
Ken Krugler 477bfd678f Merge pull request #136 from sebastian-nagel/basic-url-normalizer-npe
BasicURLNormalizer: NPE for URLs without authority
2017-01-11 19:20:27 -08:00
Sebastian Nagel cb38a5fc8f BasicURLNormalizer: NPE for URLs without authority
- check whether URL.getAuthority() returns null
- recompose URLs without authority with empty authority/host
2017-01-11 17:05:53 +01:00
Julien Nioche a2dd24624e Merge pull request #133 from sebastian-nagel/basic-url-normalizer-empty-port
BasicURLNormalizer to strip empty port
2017-01-03 10:28:30 +00:00
Sebastian Nagel e39aa60373 BasicURLNormalizer to remove empty port 2016-12-09 14:54:00 +01:00
Julien Nioche eba560af1c Merge pull request #131 from crawler-commons/Remove_Unused_HttpClient_Resources_130
Issue #130
2016-11-28 16:52:28 +00:00
Ken Krugler 49d71ad295 Issue #130
Remove unused HttpClient code & resources
2016-11-28 08:45:09 -08:00
Julien Nioche c39f917340 Merge pull request #129 from crawler-commons/Remove_HTTP_fetcher_support_96
Issue #96
2016-11-28 10:38:27 +00:00
Ken Krugler 5783046f4e Issue #96
Remove fetcher support
2016-11-27 09:18:21 -08:00
Julien Nioche 1a6c1b0dce Released 0.7 2016-11-24 10:07:27 +00:00
Julien Nioche 2df1374eb4 [maven-release-plugin] prepare for next development iteration 2016-11-21 14:49:47 +00:00
Julien Nioche 7ef3599487 [maven-release-plugin] prepare release crawler-commons-0.7 2016-11-21 14:49:43 +00:00
Lewis John McGibbney 99b85b4f42 Add license and badge to codebase 2016-10-25 18:01:17 -07:00
Julien Nioche 9ea4f1b514 added ref to #126
and changed presentation of issue number
2016-09-30 12:33:06 +01:00
Julien Nioche d0c1221846 Merge pull request #126 from lewismc/ISSUE-125
Upgrade to JDK 1.8
2016-09-30 12:28:08 +01:00
Lewis John McGibbney 18bbae908c Upgrade to JDK 1.8 2016-09-29 21:39:24 -07:00
Lewis John McGibbney fc3378cb95 Merge branch 'master' into ISSUE-125 2016-09-29 21:36:05 -07:00
Julien Nioche f4b76c70c6 Merge pull request #128 from echoboxapp/further-sitemapparser-method-visibility-changes
Further changes to SiteMapParser and AbstractSiteMap method visibilit…
2016-09-29 11:11:02 +01:00
Michael Lavelle 58608c485d Further changes to SiteMapParser and AbstractSiteMap method visibility. Previous pull request to change certain methods to protected did not cover all required methods. The changes in this commit allow for the addition of GoogleNews site maps 2016-09-29 09:07:45 +01:00