mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-20 18:36:03 +02:00

59 Commits

Author SHA1 Message Date
Julien Nioche 9ea4f1b514 added ref to #126
and changed presentation of issue number
2016-09-30 12:33:06 +01:00
Julien Nioche 36a4bd420e Updated CHANGES with 124 2016-09-21 14:59:34 +01:00
Julien Nioche 4625a358f2 Update CHANGES.txt
added #117 and #113
2016-09-20 10:30:31 +01:00
Julien Nioche 0775bb216e Fix license headers + applied formatting. Fixes #108 2016-06-30 11:45:08 +01:00
Julien Nioche be52b770ff Rename package crawlercommons.url Fixes #107 2016-06-30 11:11:49 +01:00
Ken Krugler 31a6c80ea7 Fix sitemap extraction from robots.txt 2015-12-30 22:03:49 -08:00
Julien Nioche f3f34844d4 Deprecate fetcher classes #97 2015-12-02 10:30:54 +00:00
Julien Nioche c1b3f4b086 Added URLFilter interface + BasicURLNormalizer borrowed from Nutch #106 2015-11-13 10:58:48 +00:00
Ken Krugler 4c43c48ef7 Merged conflict with CHANGES.txt 2015-10-20 07:50:58 -07:00
Ken Krugler 940cbfd0e8 Merged with aecio 2015-10-20 07:48:51 -07:00
Aecio Santos f2bf9300e6 Upgrades httpclient to version 4.5.1 (fixes #84)
and do not ignore test failures during maven build
2015-10-09 14:08:39 -04:00
Julien Nioche 98316a51fc issue #100 in CHANGES.txt 2015-10-06 18:48:58 +01:00
Ken Krugler 9e9f5df884 Fixed up CHANGES.txt file 2015-09-15 07:57:13 -07:00
Julien Nioche f0d71b4729 mentioned issue 89 in CHANGES 2015-09-15 11:38:40 +01:00
Lewis John McGibbney f2e41af53c Trivial commit to update CHANGES.txt for recent commits. 2015-09-14 22:40:04 -07:00
Julien Nioche 20861baf47 Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler) 2015-05-22 11:08:21 +01:00
Julien Nioche 63a837d5d7 updated CHANGES.txt for #76 2015-05-06 10:10:25 +01:00
Ken Krugler ee23e1fb0d Update CHANGES.txt 2015-04-22 07:31:20 -07:00
Ken Krugler 798dc59839 Update CHANGES.txt 2015-04-22 07:30:57 -07:00
Ken Krugler 53375168ab Update CHANGES.txt 2015-04-22 07:30:25 -07:00
Julien Nioche 1647b90f7d Upgraded to Tika 1.8 fixes #71 2015-04-22 13:19:02 +01:00
kkrugler_lists@transpac.com 10c4dfbd99 Update with info about issues 67 & 68 2015-03-27 22:12:14 +00:00
kkrugler_lists@transpac.com b2a92ce442 Add note about issue 59 being fixed 2015-01-26 13:21:29 +00:00
avraham2@gmail.com c8ef5e1083 Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag
Adding the CHANGES file
2015-01-25 09:19:06 +00:00
digitalpebble@googlemail.com 92408e37d4 Issue 64: Upgraded to Tika 1.7 (jnioche) 2015-01-22 14:43:34 +00:00
digitalpebble@googlemail.com 4f8614c85e Issue 32:[Robots] Resolve relative URL for sitemaps 2015-01-22 10:54:14 +00:00
digitalpebble@googlemail.com 8a0034c1f1 Issue 62:[Sitemaps] Add new parseSiteMap method 2015-01-21 08:59:01 +00:00
avraham2@gmail.com 34195de153 Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them 2015-01-12 10:53:20 +00:00
avraham2@gmail.com 546b9ff60e Issue51: Upgrade httpclient to the latest version 2015-01-12 10:30:30 +00:00
avraham2@gmail.com 823ea3221c Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily 2014-11-25 12:06:53 +00:00
avraham2@gmail.com bcc2c7fe26 Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowered case which shouldn't happen 2014-11-21 12:01:55 +00:00
avraham2@gmail.com c8261cbbc4 Issue55: fix setPriority 2014-10-26 11:48:57 +00:00
lewis.mcgibbney@gmail.com fa18129bcf Issue 50: Add Fetch Report to FetchedResult 2014-10-19 18:59:15 +00:00
avraham2@gmail.com 87331c6bd6 2014-10-10 10:49:28 +00:00
lewis.mcgibbney@gmail.com 7dfda7e46e Update CHANGES ready for 0.5 release 2014-10-10 04:15:23 +00:00
digitalpebble@googlemail.com fef6d41ef8 Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6 2014-09-24 13:43:09 +00:00
digitalpebble@googlemail.com 64530bc52b Issue 48:Upgraded to Tika 1.6 2014-09-10 12:37:17 +00:00
avraham2@gmail.com 983cce7c07 Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well on some cases
new Tika().detect(URL) -- Will solve the mentioned problem.

BUT it will cause our library to fetch the sitemap twice.


A better solution should be sought.
Maybe use new Tika().detect(bytes, filename);
2014-08-19 19:08:27 +00:00
avraham2@gmail.com 19e2918aca Change the Mime type parsing to use Tika's MediaType.
I want to Identify the mediaType:
MediaType mediaType = MediaType.parse(contentType);

And then to process as follows:
1. By recursing through the mediatype supertypes till we get to the root and compare to the XML media type (or others)
2. If not found we should check the Aliases (for example text/xml is an alias of application/xml which is the more accurate form)
3. If not found then it is a bad MediaType and the exception should be thrown.
2014-08-06 19:06:45 +00:00
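
The three steps in the commit message above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: the class name XmlMediaTypeCheck, both method names, and the two lookup tables are hypothetical stand-ins for the registry that Tika provides, and the table entries are sample data only.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a media type registry. The commit relies on Tika
// for this; the tables below hold a few sample entries for illustration only.
final class XmlMediaTypeCheck {
    private static final String XML = "application/xml";

    // Sample data: type -> direct supertype. A type with no entry is a root.
    private static final Map<String, String> SUPERTYPES = Map.of(
            "application/rss+xml", XML,
            "application/atom+xml", XML);

    // Sample data: alias -> canonical type (text/xml is named in the commit message).
    private static final Map<String, String> ALIASES = Map.of("text/xml", XML);

    // Drops parameters such as "; charset=UTF-8" and lowercases the type.
    private static String normalize(String contentType) {
        return contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    }

    // Step 1: walk up the supertype chain until the root, looking for XML.
    private static boolean reachesXml(String type) {
        for (String t = type; t != null; t = SUPERTYPES.get(t)) {
            if (XML.equals(t)) {
                return true;
            }
        }
        return false;
    }

    // Steps 1 and 2: supertype walk first, then the alias table.
    static boolean matchesXml(String contentType) {
        String type = normalize(contentType);
        if (reachesXml(type)) {
            return true;
        }
        String canonical = ALIASES.get(type);
        return canonical != null && reachesXml(canonical);
    }

    // Step 3: anything that matched neither way is rejected with an exception.
    static void requireXml(String contentType) {
        if (!matchesXml(contentType)) {
            throw new IllegalArgumentException("Not an XML media type: " + contentType);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesXml("application/rss+xml"));      // found via supertype
        System.out.println(matchesXml("text/xml; charset=UTF-8"));  // found via alias
        System.out.println(matchesXml("image/png"));                // no match
    }
}
```

Checking supertypes before aliases mirrors the order given in the commit message; a real registry would supply both tables rather than hard-coding them.
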
lewis.mcgibbney@gmail.com 4793307adb Issue 39: [Sitemaps] Add the Parser a convenience method with only a URL argument 2014-07-07 14:27:49 +00:00
lewis.mcgibbney@gmail.com 01e4feef8b Issue 42 [Sitemaps] Add more JUnit tests 2014-07-01 05:11:27 +00:00
kkrugler_lists@transpac.com 59344e878a Issue 37: Upgrade slf4j to v1.7.7 2014-06-24 02:49:01 +00:00
lewis.mcgibbney@gmail.com e1c264f1df prepare CHANGES.txt for 0.4 release 2014-03-20 21:50:05 +00:00
lewis.mcgibbney@gmail.com efaf0aec6c update CHANGES.txt 2014-03-19 19:15:05 +00:00
lewis.mcgibbney@gmail.com 51b0593b75 Port all code changes to CHANGES.txt 2014-03-16 21:53:28 +00:00
digitalpebble@googlemail.com 9b6bf65b1a cleanup of ANT build remnants [lib and lib-ext] 2013-10-21 15:31:14 +00:00
digitalpebble@googlemail.com baed790af1 upgraded version of Tika + reverted to 0.3-SNAPSHOT 2013-10-11 10:40:00 +00:00
digitalpebble@googlemail.com 14919f77f0 marking version 0.3 in CHANGES 2013-10-03 09:12:38 +00:00
digitalpebble@googlemail.com 4ce4b358b6 issue 29 : more robust parsing when loc element is missing 2013-10-02 13:40:50 +00:00
digitalpebble@googlemail.com d9e3cb4cbb Issue 25:Robots.txt parser should not lowercase sitemap URLs 2013-09-06 12:33:02 +00:00