mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-26 20:06:07 +02:00
Commit Graph

65 Commits

Author SHA1 Message Date
Julien Nioche 20861baf47 Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler) 2015-05-22 11:08:21 +01:00
Julien Nioche 40731c3304 applied formatting with mvn java-formatter:format 2015-05-15 09:03:24 +01:00
Julien Nioche dafbbdd2bf SiteMapParser use UPPERCASE for static finals 2015-04-13 20:41:47 +01:00
Julien Nioche 152a2446c2 Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry; instantiate Tika from the start 2015-04-13 16:36:39 +01:00
kkrugler_lists@transpac.com 7f7f915b0b Issue 68: Case-sensitive path matching 2015-03-27 22:09:32 +00:00
kkrugler_lists@transpac.com 6eb1459345 Issue 67: time in lastMod string not extracted during parse 2015-03-27 21:15:43 +00:00
kkrugler_lists@transpac.com aeafa263e5 Make SimpleRobotRules serializable (issue #59) 2015-01-26 13:18:54 +00:00
avraham2@gmail.com 19bc879d91 Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag
Fixed the NPE issue
Removed the recursive flag
Upgraded javadocs
2015-01-25 09:18:01 +00:00
digitalpebble@googlemail.com 4f8614c85e Issue 32:[Robots] Resolve relative URL for sitemaps 2015-01-22 10:54:14 +00:00
digitalpebble@googlemail.com 8a0034c1f1 Issue 62:[Sitemaps] Add new parseSiteMap method 2015-01-21 08:59:01 +00:00
avraham2@gmail.com 34195de153 Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them 2015-01-12 10:53:20 +00:00
avraham2@gmail.com 546b9ff60e Issue51: Upgrade httpclient to the latest version 2015-01-12 10:30:30 +00:00
avraham2@gmail.com 823ea3221c Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily 2014-11-25 12:06:53 +00:00
avraham2@gmail.com bcc2c7fe26 Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowered case which shouldn't happen 2014-11-21 12:01:55 +00:00
avraham2@gmail.com c8261cbbc4 Issue55: fix setPriority 2014-10-26 11:48:57 +00:00
lewis.mcgibbney@gmail.com 8dda18a77c Issue 50: Add Fetch Report to FetchedResult 2014-10-20 14:07:10 +00:00
lewis.mcgibbney@gmail.com fa18129bcf Issue 50: Add Fetch Report to FetchedResult 2014-10-19 18:59:15 +00:00
kkrugler_lists@transpac.com 6fe3770889 Fix for issue #53 - handle spaces in comma-separated list of agent names 2014-10-04 16:32:12 +00:00
digitalpebble@googlemail.com fef6d41ef8 Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6 2014-09-24 13:43:09 +00:00
avraham2@gmail.com 5823288428 Removed commented out code I wrongfully put there in the past 2014-08-19 19:14:00 +00:00
avraham2@gmail.com 983cce7c07 Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases
new Tika().detect(URL) will solve the mentioned problem,
BUT it will cause our library to fetch the sitemap twice.

A better solution should be sought.
Maybe use new Tika().detect(bytes, filename);
2014-08-19 19:08:27 +00:00
avraham2@gmail.com 19e2918aca Change the MIME type parsing to use Tika's MediaType.
I want to identify the mediaType:
MediaType mediaType = MediaType.parse(contentType);

And then process it as follows:
1. Recurse through the media type's supertypes until we reach the root, comparing each to the XML media type (or others).
2. If not found, check the aliases (for example, text/xml is an alias of application/xml, which is the more accurate form).
3. If still not found, it is a bad MediaType and the exception should be thrown.
2014-08-06 19:06:45 +00:00
lewis.mcgibbney@gmail.com 4793307adb Issue 39: [Sitemaps] Add a convenience method to the Parser with only a URL argument 2014-07-07 14:27:49 +00:00
lewis.mcgibbney@gmail.com 01e4feef8b Issue 42 [Sitemaps] Add more JUnit tests 2014-07-01 05:11:27 +00:00
lewis.mcgibbney@gmail.com 94c3ed4068 Upgrade to JUnit v4 conventions in SiteMapParser 2014-05-29 21:02:16 +00:00
lewis.mcgibbney@gmail.com cb71c5502a Upgrade the Slf4j logging in SiteMaps 2014-05-29 20:28:37 +00:00
kkrugler_lists@transpac.com cc67a3d2c8 Merge patch for issue #13 from Lewis 2014-03-17 00:37:26 +00:00
kkrugler_lists@transpac.com a6ac57e354 Issue 21: allow has higher precedence than disallow, if both rules are the same length 2014-03-14 00:02:38 +00:00
kkrugler_lists@transpac.com c1f050d33f Add missing file from previous commit 2014-03-14 00:01:44 +00:00
kkrugler_lists@transpac.com ea67b56e42 Add tests for wildcards (via alparslanavci), and sorting rules 2014-03-13 23:50:17 +00:00
kkrugler_lists@transpac.com af74ccf44d Add support for wildcards (via alparslanavci), and sorting rules 2014-03-13 23:49:49 +00:00
kkrugler_lists@transpac.com 300d6ebdb7 Roll in patch from Lewis for issue #23 (http://code.google.com/p/crawler-commons/issues/detail?id=23) 2014-01-24 21:16:38 +00:00
kkrugler_lists@transpac.com dc8f241782 Fix up tests to match latest data file 2014-01-24 21:05:46 +00:00
kkrugler_lists@transpac.com aa4d410223 Make setProcessed public, was implicitly package private 2014-01-24 20:51:33 +00:00
kkrugler_lists@transpac.com dbae7e20df Updated comments w/link to actual Mozilla data file 2014-01-24 20:44:51 +00:00
kkrugler_lists@transpac.com 16e46b0d50 Added a few more suffixes 2014-01-24 20:44:31 +00:00
kkrugler_lists@transpac.com a98bb030af Updated to latest from http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 2014-01-24 20:44:12 +00:00
digitalpebble@googlemail.com 5f3ab105ad SiteMap tester can take mime type as argument 2013-10-03 09:04:23 +00:00
digitalpebble@googlemail.com 4ce4b358b6 issue 29 : more robust parsing when loc element is missing 2013-10-02 13:40:50 +00:00
digitalpebble@googlemail.com d9e3cb4cbb Issue 25:Robots.txt parser should not lowercase sitemap URLs 2013-09-06 12:33:02 +00:00
digitalpebble@googlemail.com 15aa39d41c Added utility class for testing sitemaps 2013-07-18 14:01:37 +00:00
digitalpebble 7596599e02 issue 26 : default priority correctly implemented in SiteMaps 2013-05-24 14:15:51 +00:00
digitalpebble 40ef1f5a10 issue 27 : [SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString() 2013-05-24 14:09:26 +00:00
lewis.mcgibbney@gmail.com 51bb23f01c CC 12 Substantiate Javadoc 2013-01-28 02:47:01 +00:00
kkrugler_lists@transpac.com 2884151a73 Don't clear interrupted flag when we throw an
AbortedFetchException.
2013-01-24 00:10:55 +00:00
kkrugler_lists@transpac.com 77f2fea1e1 Support "lenient" sitemap parsing, where invalid URLs (because
they aren't under the path where the sitemap is located) will
not be silently rejected, but rather will be added and flagged
as invalid (in SiteMapURL)

https://code.google.com/p/crawler-commons/issues/detail?id=3
2013-01-23 23:41:43 +00:00
kkrugler_lists@transpac.com 23c2d1e7b9 Updated CHANGES.txt with recent work (post 0.1)
Added warning to pom.xml about not trying to use Maven to build artifacts.

Minor cleanup of comments.
2013-01-23 23:08:51 +00:00
kkrugler_lists@transpac.com 7300003e13 Add HTTP status code and reason to FetchedResult.
https://code.google.com/p/crawler-commons/issues/detail?id=6
2013-01-23 23:05:21 +00:00
kkrugler_lists@transpac.com 4653ced854 Minor formatting cleanup 2013-01-23 17:45:29 +00:00
kkrugler_lists@transpac.com 8cc20a48fd Roll in NUTCH-1031 patch from Tejas, to support multiple
user agent names (comma-separated).

See https://issues.apache.org/jira/browse/NUTCH-1031 for details.
2013-01-23 17:44:46 +00:00
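
Several of the commits above deal with comma-separated user agent names (the NUTCH-1031 patch, and the later fix for issue #53 that handles spaces in the list). The following is a minimal sketch of that idea, not the crawler-commons implementation: the class `AgentNames` and the method `split` are hypothetical names chosen for illustration, and lowercasing the names is an assumption made here so that later comparisons can be case-insensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the crawler-commons API.
public class AgentNames {

    // Splits a comma-separated list of agent names, trimming surrounding
    // whitespace and dropping empty entries. Names are lowercased (an
    // assumption of this sketch) so callers can compare case-insensitively.
    static List<String> split(String agentNames) {
        List<String> result = new ArrayList<>();
        for (String name : agentNames.split(",")) {
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Spaces around the commas and an empty entry are tolerated.
        System.out.println(split("mybot, otherbot ,, Third"));
    }
}
```

Trimming each entry after splitting on the bare comma keeps the sketch simple; an entry that is empty after trimming is skipped rather than treated as an error.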