mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-23 11:16:04 +02:00
Commit Graph

345 Commits

Author SHA1 Message Date
Julien Nioche 63a837d5d7 updated CHANGES.txt for #76 2015-05-06 10:10:25 +01:00
Julien Nioche d22a9d0617 removed properties file + 1.6 compliant formatting 2015-05-06 09:33:31 +01:00
Julien Nioche 0a5d9d338a maven-java-formatter-plugin 2015-04-30 13:52:44 +01:00
Ken Krugler e42c268e03 Add news about project moving to GitHub 2015-04-22 07:59:43 -07:00
Ken Krugler ee23e1fb0d Update CHANGES.txt 2015-04-22 07:31:20 -07:00
Ken Krugler 798dc59839 Update CHANGES.txt 2015-04-22 07:30:57 -07:00
Ken Krugler 53375168ab Update CHANGES.txt 2015-04-22 07:30:25 -07:00
Julien Nioche 1647b90f7d Upgraded to Tika 1.8 fixes #71 2015-04-22 13:19:02 +01:00
Julien Nioche a28a78c942 changed groupId to crawler-commons 2015-04-22 11:02:49 +01:00
Julien Nioche 2195fb7f7e replaced references to code.google with github equivalents 2015-04-17 10:25:52 +01:00
Julien Nioche 76bc563b21 Merge pull request #72 from crawler-commons/mimetypeRegistry
Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry
2015-04-13 20:42:52 +01:00
Julien Nioche dafbbdd2bf SiteMapParser use UPPERCASE for static finals 2015-04-13 20:41:47 +01:00
Julien Nioche 152a2446c2 Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry; instantiate Tika from the start 2015-04-13 16:36:39 +01:00
Avi Hayun e1c7955389 Updated links to point to GitHub
I have changed several links to point to GitHub instead of GoogleCode.

For now, I am pointing all releases of CHANGES.TXT to the latest at Master, as we don't have the code of the last releases. The CHANGES.TXT file does contain all changes of all releases, so it seems like a good compromise for now...
2015-04-13 12:54:06 +03:00
Avi Hayun 84cc3bf2d0 First stab at recreating front page like the original 2015-04-13 12:46:57 +03:00
Ken Krugler a7ee1bfa84 Delete README 2015-04-09 08:29:18 -07:00
kkrugler_lists@transpac.com 10c4dfbd99 Update with info about issues 67 & 68 2015-03-27 22:12:14 +00:00
kkrugler_lists@transpac.com 7f7f915b0b Issue 68: Case-sensitive path matching 2015-03-27 22:09:32 +00:00
kkrugler_lists@transpac.com 6eb1459345 Issue 67: time in lastMod string not extracted during parse 2015-03-27 21:15:43 +00:00
kkrugler_lists@transpac.com b2a92ce442 Add note about issue 59 being fixed 2015-01-26 13:21:29 +00:00
kkrugler_lists@transpac.com aeafa263e5 Make SimpleRobotRules serializable (issue #59) 2015-01-26 13:18:54 +00:00
avraham2@gmail.com c8ef5e1083 Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag
Adding the CHANGES file
2015-01-25 09:19:06 +00:00
avraham2@gmail.com 19bc879d91 Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag
Fixed the NPE issue
Removed the recursive flag
Upgraded javadocs
2015-01-25 09:18:01 +00:00
digitalpebble@googlemail.com 92408e37d4 Issue 64: Upgraded to Tika 1.7 (jnioche) 2015-01-22 14:43:34 +00:00
digitalpebble@googlemail.com 4f8614c85e Issue 32:[Robots] Resolve relative URL for sitemaps 2015-01-22 10:54:14 +00:00
digitalpebble@googlemail.com 8a0034c1f1 Issue 62:[Sitemaps] Add new parseSiteMap method 2015-01-21 08:59:01 +00:00
avraham2@gmail.com 34195de153 Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them 2015-01-12 10:53:20 +00:00
avraham2@gmail.com 546b9ff60e Issue51: Upgrade httpclient to the latest version 2015-01-12 10:30:30 +00:00
avraham2@gmail.com 823ea3221c Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily 2014-11-25 12:06:53 +00:00
avraham2@gmail.com bcc2c7fe26 Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowercased, which shouldn't happen 2014-11-21 12:01:55 +00:00
avraham2@gmail.com c8261cbbc4 Issue55: fix setPriority 2014-10-26 11:48:57 +00:00
lewis.mcgibbney@gmail.com 8dda18a77c Issue 50: Add Fetch Report to FetchedResult 2014-10-20 14:07:10 +00:00
lewis.mcgibbney@gmail.com fa18129bcf Issue 50: Add Fetch Report to FetchedResult 2014-10-19 18:59:15 +00:00
avraham2@gmail.com 87331c6bd6 2014-10-10 10:49:28 +00:00
lewis.mcgibbney@gmail.com 21de0241ef Update pom.xml for dist management 2014-10-10 04:59:16 +00:00
lewis.mcgibbney@gmail.com 0e0146faf1 [maven-release-plugin] prepare for next development iteration 2014-10-10 04:40:39 +00:00
lewis.mcgibbney@gmail.com aea0015d12 [maven-release-plugin] prepare release crawler-commons-0.5 2014-10-10 04:40:34 +00:00
lewis.mcgibbney@gmail.com 429db36c35 Update release plugin version 2014-10-10 04:23:06 +00:00
lewis.mcgibbney@gmail.com 7dfda7e46e Update CHANGES ready for 0.5 release 2014-10-10 04:15:23 +00:00
kkrugler_lists@transpac.com 6fe3770889 Fix for issue #53 - handle spaces in comma-separated list of agent names 2014-10-04 16:32:12 +00:00
digitalpebble@googlemail.com fef6d41ef8 Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6 2014-09-24 13:43:09 +00:00
digitalpebble@googlemail.com 64530bc52b Issue 48:Upgraded to Tika 1.6 2014-09-10 12:37:17 +00:00
avraham2@gmail.com 5823288428 Removed commented out code I wrongfully put there in the past 2014-08-19 19:14:00 +00:00
avraham2@gmail.com 983cce7c07 Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases
new Tika().detect(URL) will solve the mentioned problem,
but it will cause our library to fetch the sitemap twice.

A better solution should be sought.
Maybe use new Tika().detect(bytes, filename);
2014-08-19 19:08:27 +00:00
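
The trade-off in the commit message above can be sketched roughly as follows: fetch the sitemap bytes once, then run detection on the buffered bytes plus the file name, instead of handing a URL to a detector that opens a second connection. This is only a toy stand-in and does not call Tika. `SniffSketch`, `sniff`, and the magic-byte and extension checks are hypothetical names and rules chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class SniffSketch {
    // Hypothetical helper: guesses a media type from bytes that were already
    // fetched, falling back to the file name. This is NOT Tika's detection logic.
    static String sniff(byte[] prefix, String fileName) {
        // gzip streams start with the magic bytes 0x1f 0x8b.
        if (prefix.length >= 2 && (prefix[0] & 0xff) == 0x1f && (prefix[1] & 0xff) == 0x8b) {
            return "application/gzip";
        }
        // Look at the first few characters for an XML declaration or sitemap root.
        String head = new String(prefix, 0, Math.min(prefix.length, 64), StandardCharsets.UTF_8).trim();
        if (head.startsWith("<?xml") || head.startsWith("<urlset") || head.startsWith("<sitemapindex")) {
            return "application/xml";
        }
        // Content was inconclusive, so fall back to the file extension.
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".xml")) return "application/xml";
        if (name.endsWith(".gz")) return "application/gzip";
        if (name.endsWith(".txt")) return "text/plain";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // The bytes would come from the single fetch the crawler already performed.
        byte[] body = "<?xml version=\"1.0\"?><urlset/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(body, "sitemap"));
    }
}
```

Because the detector only sees an in-memory buffer, the same bytes can then be passed on to the parser without touching the network again.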
avraham2@gmail.com 19e2918aca Change the MIME type parsing to use Tika's MediaType.
I want to identify the media type:
MediaType mediaType = MediaType.parse(contentType);

And then process it as follows:
1. Recurse through the media type's supertypes until we reach the root, comparing each one to the XML media type (or others).
2. If not found, check the aliases (for example, text/xml is an alias of application/xml, which is the more accurate form).
3. If still not found, it is a bad MediaType and an exception should be thrown.
2014-08-06 19:06:45 +00:00
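
The three steps in the commit message above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the project's code and not Tika's API: the `SUPERTYPE` and `ALIAS` maps are a hypothetical toy registry, and `MediaTypeMatchSketch`, `matches`, and `requireXml` are names invented for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MediaTypeMatchSketch {
    // Toy registry data (assumption): far smaller than any real registry.
    static final Map<String, String> SUPERTYPE = new HashMap<>();
    static final Map<String, String> ALIAS = new HashMap<>();
    static {
        SUPERTYPE.put("application/rss+xml", "application/xml");
        SUPERTYPE.put("application/atom+xml", "application/xml");
        SUPERTYPE.put("application/xml", "text/plain");
        SUPERTYPE.put("text/plain", "application/octet-stream");
        ALIAS.put("text/xml", "application/xml");
        ALIAS.put("application/x-xml", "application/xml");
    }

    // Step 1: climb the supertype chain until the target or the root is reached.
    static boolean isSpecializationOf(String type, String target) {
        for (String t = type; t != null; t = SUPERTYPE.get(t)) {
            if (t.equals(target)) return true;
        }
        return false;
    }

    static boolean matches(String contentType, String target) {
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (isSpecializationOf(type, target)) return true;
        // Step 2: retry with the canonical form if the type is a known alias.
        String canonical = ALIAS.get(type);
        return canonical != null && isSpecializationOf(canonical, target);
    }

    // Step 3: reject anything that matched neither directly nor via an alias.
    static void requireXml(String contentType) {
        if (!matches(contentType, "application/xml")) {
            throw new IllegalArgumentException("Unsupported media type: " + contentType);
        }
    }
}
```

With this toy data, `matches("text/xml; charset=UTF-8", "application/xml")` succeeds through the alias lookup, while `requireXml("image/png")` throws because neither the supertype chain nor the alias table leads to the XML type.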
avraham2@gmail.com dc98cbd57c Added myself (Avi Hayun) as a developer 2014-07-14 18:44:31 +00:00
lewis.mcgibbney@gmail.com 4793307adb Issue 39: [Sitemaps] Add a convenience method to the Parser with only a URL argument 2014-07-07 14:27:49 +00:00
lewis.mcgibbney@gmail.com 01e4feef8b Issue 42 [Sitemaps] Add more JUnit tests 2014-07-01 05:11:27 +00:00
kkrugler_lists@transpac.com 59344e878a Issue 37: Upgrade slf4j to v1.7.7 2014-06-24 02:49:01 +00:00
lewis.mcgibbney@gmail.com 94c3ed4068 Upgrade to JUnit v4 conventions in SiteMapParser 2014-05-29 21:02:16 +00:00