mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-27 12:16:04 +02:00
Commit Graph

158 Commits

Author SHA1 Message Date
lewis.mcgibbney@gmail.com 429db36c35 Update release plugin version 2014-10-10 04:23:06 +00:00
lewis.mcgibbney@gmail.com 7dfda7e46e Update CHANGES ready for 0.5 release 2014-10-10 04:15:23 +00:00
kkrugler_lists@transpac.com 6fe3770889 Fix for issue #53 - handle spaces in comma-separated list of agent names 2014-10-04 16:32:12 +00:00
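The fix for issue #53 above concerns spaces in a comma-separated list of agent names. A minimal sketch of that kind of handling follows; it is not the project's actual code, and the class name `AgentNamesSketch` and method `parseAgentNames` are hypothetical. It assumes names are compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AgentNamesSketch {
    /**
     * Splits a comma-separated list of agent names, tolerating spaces
     * around the commas. Lowercasing is an assumption for this sketch.
     */
    static List<String> parseAgentNames(String names) {
        List<String> result = new ArrayList<>();
        for (String name : names.split(",")) {
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                result.add(trimmed); // skip empty entries such as "a,,b"
            }
        }
        return result;
    }
}
```

Trimming each entry after the split keeps the separator logic simple and also drops empty entries left by doubled commas.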
digitalpebble@googlemail.com fef6d41ef8 Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6 2014-09-24 13:43:09 +00:00
digitalpebble@googlemail.com 64530bc52b Issue 48:Upgraded to Tika 1.6 2014-09-10 12:37:17 +00:00
avraham2@gmail.com 5823288428 Removed commented out code I wrongfully put there in the past 2014-08-19 19:14:00 +00:00
avraham2@gmail.com 983cce7c07 Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases
new Tika().detect(URL) will solve the mentioned problem,
BUT it will cause our library to fetch the sitemap twice.

A better solution should be sought.
Maybe use new Tika().detect(bytes, filename);
2014-08-19 19:08:27 +00:00
avraham2@gmail.com 19e2918aca Change the Mime type parsing to use Tika's MediaType.
I want to identify the media type:
MediaType mediaType = MediaType.parse(contentType);

and then process it as follows:
1. Recurse through the media type's supertypes until we reach the root, comparing each to the XML media type (or others).
2. If not found, check the aliases (for example, text/xml is an alias of application/xml, which is the more accurate form).
3. If still not found, it is a bad MediaType and the exception should be thrown.
2014-08-06 19:06:45 +00:00
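The three steps in the commit message above can be sketched without Tika. This is only an illustration under stated assumptions: the lookup tables `SUPERTYPES` and `ALIASES` are hypothetical stand-ins for a media type registry, and `MediaTypeSketch`, `descendsFrom`, and `requireXml` are names invented for this example, not part of crawler-commons or Tika.

```java
import java.util.Locale;
import java.util.Map;

final class MediaTypeSketch {
    static final String XML = "application/xml";

    // Hypothetical registry data for illustration only (not Tika's).
    private static final Map<String, String> SUPERTYPES = Map.of(
        "application/rss+xml", "application/xml",
        "application/atom+xml", "application/xml",
        "application/xml", "text/plain");
    private static final Map<String, String> ALIASES = Map.of(
        "text/xml", "application/xml");

    /** Walks the supertype chain from type up to the root, looking for target. */
    static boolean descendsFrom(String type, String target) {
        for (String t = type; t != null; t = SUPERTYPES.get(t)) {
            if (t.equals(target)) {
                return true;
            }
        }
        return false;
    }

    /** Throws IllegalArgumentException unless contentType resolves to XML. */
    static void requireXml(String contentType) {
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (descendsFrom(type, XML)) {
            return; // step 1: found via the supertype chain
        }
        String canonical = ALIASES.get(type);
        if (canonical != null && descendsFrom(canonical, XML)) {
            return; // step 2: found via an alias
        }
        // step 3: neither lookup matched, so reject the media type
        throw new IllegalArgumentException("Not an XML media type: " + contentType);
    }
}
```

With this data, `requireXml("text/xml; charset=UTF-8")` returns normally through the alias lookup, while `requireXml("image/png")` throws.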
avraham2@gmail.com dc98cbd57c Added myself (Avi Hayun) as a developer 2014-07-14 18:44:31 +00:00
lewis.mcgibbney@gmail.com 4793307adb Issue 39: [Sitemaps] Add a convenience method to the Parser with only a URL argument 2014-07-07 14:27:49 +00:00
lewis.mcgibbney@gmail.com 01e4feef8b Issue 42 [Sitemaps] Add more JUnit tests 2014-07-01 05:11:27 +00:00
kkrugler_lists@transpac.com 59344e878a Issue 37: Upgrade slf4j to v1.7.7 2014-06-24 02:49:01 +00:00
lewis.mcgibbney@gmail.com 94c3ed4068 Upgrade to JUnit v4 conventions in SiteMapParser 2014-05-29 21:02:16 +00:00
lewis.mcgibbney@gmail.com cb71c5502a Upgrade the Slf4j logging in SiteMaps 2014-05-29 20:28:37 +00:00
lewis.mcgibbney@gmail.com 25a317e5e6 [maven-release-plugin] prepare for next development iteration 2014-03-20 22:02:17 +00:00
lewis.mcgibbney@gmail.com 1131b36a3b [maven-release-plugin] prepare release crawler-commons-0.4 2014-03-20 22:02:04 +00:00
lewis.mcgibbney@gmail.com e1c264f1df prepare CHANGES.txt for 0.4 release 2014-03-20 21:50:05 +00:00
lewis.mcgibbney@gmail.com efaf0aec6c update CHANGES.txt 2014-03-19 19:15:05 +00:00
kkrugler_lists@transpac.com cc67a3d2c8 Merge patch for issue #13 from Lewis 2014-03-17 00:37:26 +00:00
lewis.mcgibbney@gmail.com 51b0593b75 Port all code changes to CHANGES.txt 2014-03-16 21:53:28 +00:00
kkrugler_lists@transpac.com a6ac57e354 Issue 21: allow has higher precedence than disallow, if both rules are the same length 2014-03-14 00:02:38 +00:00
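The tie-breaking rule from Issue 21 above can be illustrated with a small sketch. It assumes plain prefix matching (no wildcards), that longer rules are tried first, and that a path matching no rule is allowed. `RobotRuleOrderSketch`, `Rule`, `ORDER`, and `isAllowed` are hypothetical names for this example, not the library's classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RobotRuleOrderSketch {
    static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    // Longer prefixes sort first; on equal length, allow sorts before disallow.
    static final Comparator<Rule> ORDER = (a, b) -> {
        int byLength = Integer.compare(b.prefix.length(), a.prefix.length());
        if (byLength != 0) {
            return byLength;
        }
        return Boolean.compare(b.allow, a.allow);
    };

    /** Returns the verdict of the first matching rule in sorted order. */
    static boolean isAllowed(List<Rule> rules, String path) {
        List<Rule> sorted = new ArrayList<>(rules);
        sorted.sort(ORDER);
        for (Rule rule : sorted) {
            if (path.startsWith(rule.prefix)) {
                return rule.allow;
            }
        }
        return true; // assumption: no matching rule means the path is allowed
    }
}
```

Given `Disallow: /page` and `Allow: /page`, both rules have the same length, so the allow rule is tried first and `/page/x` is permitted regardless of the order in which the rules were listed.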
kkrugler_lists@transpac.com c1f050d33f Add missing file from previous commit 2014-03-14 00:01:44 +00:00
kkrugler_lists@transpac.com ea67b56e42 Add tests for wildcards (via alparslanavci), and sorting rules 2014-03-13 23:50:17 +00:00
kkrugler_lists@transpac.com af74ccf44d Add support for wildcards (via alparslanavci), and sorting rules 2014-03-13 23:49:49 +00:00
kkrugler_lists@transpac.com 300d6ebdb7 Roll in patch from Lewis for issue #23 (http://code.google.com/p/crawler-commons/issues/detail?id=23) 2014-01-24 21:16:38 +00:00
kkrugler_lists@transpac.com dc8f241782 Fix up tests to match latest data file 2014-01-24 21:05:46 +00:00
kkrugler_lists@transpac.com aa4d410223 Make setProcessed public, was implicitly package private 2014-01-24 20:51:33 +00:00
kkrugler_lists@transpac.com dbae7e20df Updated comments w/link to actual Mozilla data file 2014-01-24 20:44:51 +00:00
kkrugler_lists@transpac.com 16e46b0d50 Added a few more suffixes 2014-01-24 20:44:31 +00:00
kkrugler_lists@transpac.com a98bb030af Updated to latest from http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 2014-01-24 20:44:12 +00:00
digitalpebble@googlemail.com 9b6bf65b1a cleanup of ANT build remnants [lib and lib-ext] 2013-10-21 15:31:14 +00:00
digitalpebble@googlemail.com 816832b10b [maven-release-plugin] prepare for next development iteration 2013-10-11 15:21:59 +00:00
digitalpebble@googlemail.com 1389cf0066 [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-11 15:21:52 +00:00
digitalpebble@googlemail.com 2e08419852 Fixed scm info in pom 2013-10-11 15:20:53 +00:00
digitalpebble@googlemail.com ee88e20e4a [maven-release-plugin] rollback the release of crawler-commons-0.3 2013-10-11 15:18:50 +00:00
digitalpebble@googlemail.com 45975212ad [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-11 15:13:19 +00:00
digitalpebble@googlemail.com 464d5c7956 [maven-release-plugin] rollback the release of crawler-commons-0.3 2013-10-11 12:48:50 +00:00
digitalpebble@googlemail.com 6ed2b2da50 [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-11 12:40:15 +00:00
digitalpebble@googlemail.com 315a208b95 re-trying the release 2013-10-11 12:38:24 +00:00
digitalpebble@googlemail.com 92fb22c2a3 [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-11 11:42:18 +00:00
digitalpebble@googlemail.com 097a927868 [maven-release-plugin] rollback the release of crawler-commons-0.3 2013-10-11 11:35:34 +00:00
digitalpebble@googlemail.com 704bf5ba8b [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-11 11:06:20 +00:00
digitalpebble@googlemail.com add77028cc [maven-release-plugin] rollback the release of crawler-commons-0.3 2013-10-11 10:59:37 +00:00
digitalpebble@googlemail.com c7554efdcb [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-11 10:58:00 +00:00
digitalpebble@googlemail.com 644254769e [maven-release-plugin] rollback the release of crawler-commons-0.3 2013-10-11 10:48:40 +00:00
digitalpebble@googlemail.com dea86d57ea [maven-release-plugin] prepare for next development iteration 2013-10-11 10:46:23 +00:00
digitalpebble@googlemail.com 68106fd316 [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-11 10:46:11 +00:00
digitalpebble@googlemail.com baed790af1 upgraded version of Tika + reverted to 0.3-SNAPSHOT 2013-10-11 10:40:00 +00:00
digitalpebble@googlemail.com ecdf47221e [maven-release-plugin] prepare for next development iteration 2013-10-03 09:31:50 +00:00
digitalpebble@googlemail.com 4e2b0bac6f [maven-release-plugin] prepare release crawler-commons-0.3 2013-10-03 09:31:44 +00:00