mirror of https://github.com/crawler-commons/crawler-commons synced 2024-09-28 10:41:04 +02:00
Commit Graph

32 Commits

Author SHA1 Message Date
avraham2@gmail.com
34195de153 Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them 2015-01-12 10:53:20 +00:00
avraham2@gmail.com
546b9ff60e Issue51: Upgrade httpclient to the latest version 2015-01-12 10:30:30 +00:00
avraham2@gmail.com
823ea3221c Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily 2014-11-25 12:06:53 +00:00
avraham2@gmail.com
bcc2c7fe26 Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowercased, which shouldn't happen 2014-11-21 12:01:55 +00:00
avraham2@gmail.com
c8261cbbc4 Issue55: fix setPriority 2014-10-26 11:48:57 +00:00
lewis.mcgibbney@gmail.com
fa18129bcf Issue 50: Add Fetch Report to FetchedResult 2014-10-19 18:59:15 +00:00
avraham2@gmail.com
87331c6bd6 2014-10-10 10:49:28 +00:00
lewis.mcgibbney@gmail.com
7dfda7e46e Update CHANGES ready for 0.5 release 2014-10-10 04:15:23 +00:00
digitalpebble@googlemail.com
fef6d41ef8 Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6 2014-09-24 13:43:09 +00:00
digitalpebble@googlemail.com
64530bc52b Issue 48:Upgraded to Tika 1.6 2014-09-10 12:37:17 +00:00
avraham2@gmail.com
983cce7c07 Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases
new Tika().detect(URL) will solve the mentioned problem, but it will cause our library to fetch the sitemap twice.

A better solution should be sought, perhaps new Tika().detect(bytes, filename);
2014-08-19 19:08:27 +00:00
avraham2@gmail.com
19e2918aca Change the MIME type parsing to use Tika's MediaType.
I want to identify the media type:
MediaType mediaType = MediaType.parse(contentType);

And then process it as follows:
1. Recurse through the media type's supertypes up to the root, comparing each one to the XML media type (or others).
2. If no match is found, check the aliases (for example, text/xml is an alias of application/xml, which is the more accurate form).
3. If there is still no match, it is a bad MediaType and an exception should be thrown.
2014-08-06 19:06:45 +00:00
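The three-step check described in the commit message above can be sketched as follows. This is only an illustration, not the code from the commit: to stay self-contained it avoids Tika entirely and uses two small hand-written lookup tables in place of a real media type registry. The class name MediaTypeCheck, the method names, and the table contents are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class MediaTypeCheck {
    static final String XML = "application/xml";

    // Hypothetical, hand-written tables standing in for a real registry,
    // which would hold many more entries than these.
    private static final Map<String, String> SUPERTYPES = new HashMap<>();
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        SUPERTYPES.put("application/rss+xml", XML);
        SUPERTYPES.put("application/atom+xml", XML);
        ALIASES.put("text/xml", XML);
    }

    /** Strips parameters such as "; charset=UTF-8" and lowercases the type. */
    static String baseType(String contentType) {
        if (contentType == null) {
            throw new IllegalArgumentException("Missing content type");
        }
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim().toLowerCase(Locale.ROOT);
        int slash = base.indexOf('/');
        if (slash <= 0 || slash == base.length() - 1) {
            throw new IllegalArgumentException("Malformed content type: " + contentType);
        }
        return base;
    }

    /** Step 1: walk the supertype chain until the target or the root is reached. */
    static boolean isInstanceOf(String type, String target) {
        for (String t = type; t != null; t = SUPERTYPES.get(t)) {
            if (t.equals(target)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the matched XML-compatible type, or throws if there is none. */
    static String resolveXmlType(String contentType) {
        String type = baseType(contentType);
        // Step 1: the type itself or one of its supertypes is XML.
        if (isInstanceOf(type, XML)) {
            return type;
        }
        // Step 2: the type is an alias of something that is XML.
        String canonical = ALIASES.get(type);
        if (canonical != null && isInstanceOf(canonical, XML)) {
            return canonical;
        }
        // Step 3: no match, so the media type is rejected.
        throw new IllegalArgumentException("Not an XML media type: " + contentType);
    }
}
```

With these tables, resolveXmlType("text/xml; charset=UTF-8") resolves through the alias to application/xml, application/rss+xml matches through its supertype, and image/png is rejected with an IllegalArgumentException.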
lewis.mcgibbney@gmail.com
4793307adb Issue 39: [Sitemaps] Add a convenience method to the Parser that takes only a URL argument 2014-07-07 14:27:49 +00:00
lewis.mcgibbney@gmail.com
01e4feef8b Issue 42 [Sitemaps] Add more JUnit tests 2014-07-01 05:11:27 +00:00
kkrugler_lists@transpac.com
59344e878a Issue 37: Upgrade slf4j to v1.7.7 2014-06-24 02:49:01 +00:00
lewis.mcgibbney@gmail.com
e1c264f1df prepare CHANGES.txt for 0.4 release 2014-03-20 21:50:05 +00:00
lewis.mcgibbney@gmail.com
efaf0aec6c update CHANGES.txt 2014-03-19 19:15:05 +00:00
lewis.mcgibbney@gmail.com
51b0593b75 Port all code changes to CHANGES.txt 2014-03-16 21:53:28 +00:00
digitalpebble@googlemail.com
9b6bf65b1a cleanup of ANT build remnants [lib and lib-ext] 2013-10-21 15:31:14 +00:00
digitalpebble@googlemail.com
baed790af1 upgraded version of Tika + reverted to 0.3-SNAPSHOT 2013-10-11 10:40:00 +00:00
digitalpebble@googlemail.com
14919f77f0 marking version 0.3 in CHANGES 2013-10-03 09:12:38 +00:00
digitalpebble@googlemail.com
4ce4b358b6 issue 29 : more robust parsing when loc element is missing 2013-10-02 13:40:50 +00:00
digitalpebble@googlemail.com
d9e3cb4cbb Issue 25:Robots.txt parser should not lowercase sitemap URLs 2013-09-06 12:33:02 +00:00
digitalpebble@googlemail.com
15aa39d41c Added utility class for testing sitemaps 2013-07-18 14:01:37 +00:00
lewis.mcgibbney@gmail.com
a0328358c0 Issue 16: Remove Ant scripts and configurations 2013-07-01 19:18:25 +00:00
digitalpebble
7596599e02 issue 26 : default priority correctly implemented in SiteMaps 2013-05-24 14:15:51 +00:00
digitalpebble
40ef1f5a10 issue 27 : [SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString() 2013-05-24 14:09:26 +00:00
lewis.mcgibbney@gmail.com
1fca6c714f clean up for 0.2 release 2013-01-30 04:12:34 +00:00
lewis.mcgibbney@gmail.com
2f34db8056 CC 12 Substantiate Javadoc 2013-01-28 02:45:41 +00:00
lewis.mcgibbney@gmail.com
7ec5b41e3e CC 11 Review default.properties 2013-01-28 02:43:34 +00:00
kkrugler_lists@transpac.com
23c2d1e7b9 Updated CHANGES.txt with recent work (post 0.1)
Added warning to pom.xml about not trying to use Maven to build artifacts.

Minor cleanup of comments.
2013-01-23 23:08:51 +00:00
digitalpebble
55da554846 added CHANGES.txt + refactoring of SiteMap objects (thanks to Hannes Schwarz) 2011-07-25 10:23:21 +00:00