crawler-commons

mirror of https://github.com/crawler-commons/crawler-commons synced 2024-09-25 01:50:39 +02:00

Author	SHA1	Message	Date
Aecio Santos	22ad611aef	Fixes bug introduced in pull request #98 and adds ability to configure a new timeout introduced in httpclient 4.5.1	2016-05-04 19:50:33 -04:00
Ken Krugler	b5704684ff	Clarify which method is preferred Generally better to call parseSiteMap w/o passing an explicit contentType, as web servers lie all the time - so let Tika figure it out.	2015-12-30 22:14:21 -08:00
Ken Krugler	31a6c80ea7	Fix sitemap extraction from robots.txt	2015-12-30 22:03:49 -08:00
Julien Nioche	f3f34844d4	Deprecate fetcher classes #97	2015-12-02 10:30:54 +00:00
Julien Nioche	c1b3f4b086	Added URLFilter interface + BasicURLNormalizer borrowed from Nutch #106	2015-11-13 10:58:48 +00:00
Ken Krugler	940cbfd0e8	Merged with aecio	2015-10-20 07:48:51 -07:00
Aecio Santos	f2bf9300e6	Upgrades httpclient to version 4.5.1 (fixes #84 ) and do not ignore test failures during maven build	2015-10-09 14:08:39 -04:00
Julien Nioche	b7ccc8d1f1	Fixed test for domains #103	2015-10-07 10:21:28 +01:00
Julien Nioche	9e93037e79	updated tld names from publicsuffix.org	2015-10-05 13:38:10 +01:00
Ken Krugler	2c687d1bba	Roll in fix for issue #87 w/RSS 1.0 site maps	2015-09-11 15:16:12 -07:00
Ken Krugler	d08f396576	Tweaked Javadoc update from Michael Roeder	2015-09-11 11:49:20 -07:00
MichaelRoeder	e8f38fda03	Added a javadoc comment to the SimpleRobotRulesParser class explaining its behavior.	2015-09-10 13:03:10 +02:00
Ken Krugler	16e13bedc4	Improve Javadoc on robot name matching And add a .gitignore	2015-09-08 16:12:31 -07:00
Lewis John McGibbney	d7ed6a742c	Upgrade to Jdk1.7 - remove license header at pom.xml and improve logging implementations.	2015-09-07 15:20:00 -04:00
Lewis John McGibbney	ba5906ec40	Upgrade to JDK 1.7 compiler version and introduce Maven forbidden API's plugin	2015-09-06 13:55:26 -04:00
matt-deboer	d203f0d4ac	Reworked sitemap parser to use SAX for optional parsing of partial docs. Traded Stack for LinkedList for performance improvement. Fix to getParentElement(); Added test for case referenced by issues #79 and #75.	2015-06-27 22:33:15 -07:00
Julien Nioche	feb40af519	Applied formatting with mvn java-formatter:format	2015-06-11 10:45:06 +01:00
Chaiavi	5cf62ab7d5	Fix for Issue 60 SitemapUrls can be invalid when they are referenced in a sitemap whose directory is on a completely different path than the referenced SitemapUrl. All as indicated here: http://www.sitemaps.org/protocol.html#location In order to clarify the validity aspect we need to upgrade the following 1. Add a little more explanation as javadocs and as logs 2. Rename "Legal" (I think only one occurrence) to "valid" (in the parser) 3. Add to the Sitemap class a new method to get all valid SitemapUrls 4. When dropping a URL due to invalidity a log should be shown, a URL shouldn't be dropped quietly.	2015-06-08 23:41:56 +03:00
Julien Nioche	20861baf47	Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler)	2015-05-22 11:08:21 +01:00
Julien Nioche	40731c3304	applied formatting with mvn java-formatter:format	2015-05-15 09:03:24 +01:00
Julien Nioche	dafbbdd2bf	SiteMapParser use UPPERCASE for static finals	2015-04-13 20:41:47 +01:00
Julien Nioche	152a2446c2	Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry; instantiate Tika from the start	2015-04-13 16:36:39 +01:00
kkrugler_lists@transpac.com	7f7f915b0b	Issue 68: Case-sensitive path matching	2015-03-27 22:09:32 +00:00
kkrugler_lists@transpac.com	6eb1459345	Issue 67: time in lastMod string not extracted during parse	2015-03-27 21:15:43 +00:00
kkrugler_lists@transpac.com	aeafa263e5	Make SimpleRobotRules serializable (issue #59 )	2015-01-26 13:18:54 +00:00
avraham2@gmail.com	19bc879d91	Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag Fixed the NPE issue Removed the recursive flag Upgraded javadocs	2015-01-25 09:18:01 +00:00
digitalpebble@googlemail.com	4f8614c85e	Issue 32:[Robots] Resolve relative URL for sitemaps	2015-01-22 10:54:14 +00:00
digitalpebble@googlemail.com	8a0034c1f1	Issue 62:[Sitemaps] Add new parseSiteMap method	2015-01-21 08:59:01 +00:00
avraham2@gmail.com	34195de153	Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them	2015-01-12 10:53:20 +00:00
avraham2@gmail.com	546b9ff60e	Issue51: Upgrade httpclient to the latest version	2015-01-12 10:30:30 +00:00
avraham2@gmail.com	823ea3221c	Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily	2014-11-25 12:06:53 +00:00
avraham2@gmail.com	bcc2c7fe26	Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowercased, which shouldn't happen	2014-11-21 12:01:55 +00:00
avraham2@gmail.com	c8261cbbc4	Issue55: fix setPriority	2014-10-26 11:48:57 +00:00
lewis.mcgibbney@gmail.com	8dda18a77c	Issue 50: Add Fetch Report to FetchedResult	2014-10-20 14:07:10 +00:00
lewis.mcgibbney@gmail.com	fa18129bcf	Issue 50: Add Fetch Report to FetchedResult	2014-10-19 18:59:15 +00:00
kkrugler_lists@transpac.com	6fe3770889	Fix for issue #53 - handle spaces in comma-separated list of agent names	2014-10-04 16:32:12 +00:00
digitalpebble@googlemail.com	fef6d41ef8	Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6	2014-09-24 13:43:09 +00:00
avraham2@gmail.com	5823288428	Removed commented out code I wrongfully put there in the past	2014-08-19 19:14:00 +00:00
avraham2@gmail.com	983cce7c07	Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases new Tika().detect(URL) -- Will solve the mentioned problem. BUT it will cause our library to fetch the sitemap twice. A better solution should be sought. Maybe use new Tika().detect(bytes, filename);	2014-08-19 19:08:27 +00:00
avraham2@gmail.com	19e2918aca	Change the Mime type parsing to use Tika's MediaType. I want to Identify the mediaType: MediaType mediaType = MediaType.parse(contentType); And then to process as follows: 1. By recursing through the mediatype supertypes till we get to the root and compare to the XML media type (or others) 2. If not found we should check the Aliases (for example text/xml is an alias of application/xml which is the more accurate form) 3. If not found then it is a bad MediaType and the exception should be thrown.	2014-08-06 19:06:45 +00:00
lewis.mcgibbney@gmail.com	4793307adb	Issue 39: [Sitemaps] Add a convenience method to the Parser with only a URL argument	2014-07-07 14:27:49 +00:00
lewis.mcgibbney@gmail.com	01e4feef8b	Issue 42 [Sitemaps] Add more JUnit tests	2014-07-01 05:11:27 +00:00
lewis.mcgibbney@gmail.com	94c3ed4068	Upgrade to JUnit v4 conventions in SiteMapParser	2014-05-29 21:02:16 +00:00
lewis.mcgibbney@gmail.com	cb71c5502a	Upgrade the Slf4j logging in SiteMaps	2014-05-29 20:28:37 +00:00
kkrugler_lists@transpac.com	cc67a3d2c8	Merge patch for issue #13 from Lewis	2014-03-17 00:37:26 +00:00
kkrugler_lists@transpac.com	a6ac57e354	Issue 21: allow has higher precedence than disallow, if both rules are the same length	2014-03-14 00:02:38 +00:00
kkrugler_lists@transpac.com	c1f050d33f	Add missing file from previous commit	2014-03-14 00:01:44 +00:00
kkrugler_lists@transpac.com	ea67b56e42	Add tests for wildcards (via alparslanavci), and sorting rules	2014-03-13 23:50:17 +00:00
kkrugler_lists@transpac.com	af74ccf44d	Add support for wildcards (via alparslanavci), and sorting rules	2014-03-13 23:49:49 +00:00
kkrugler_lists@transpac.com	300d6ebdb7	Roll in patch from Lewis for issue #23 (http://code.google.com/p/crawler-commons/issues/detail?id=23 )	2014-01-24 21:16:38 +00:00
kkrugler_lists@transpac.com	dc8f241782	Fix up tests to match latest data file	2014-01-24 21:05:46 +00:00
kkrugler_lists@transpac.com	aa4d410223	Make setProcessed public, was implicitly package private	2014-01-24 20:51:33 +00:00
kkrugler_lists@transpac.com	dbae7e20df	Updated comments w/link to actual Mozilla data file	2014-01-24 20:44:51 +00:00
kkrugler_lists@transpac.com	16e46b0d50	Added a few more suffixes	2014-01-24 20:44:31 +00:00
kkrugler_lists@transpac.com	a98bb030af	Updated to latest from http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1	2014-01-24 20:44:12 +00:00
digitalpebble@googlemail.com	5f3ab105ad	SiteMap tester can take mime type as argument	2013-10-03 09:04:23 +00:00
digitalpebble@googlemail.com	4ce4b358b6	issue 29 : more robust parsing when loc element is missing	2013-10-02 13:40:50 +00:00
digitalpebble@googlemail.com	d9e3cb4cbb	Issue 25:Robots.txt parser should not lowercase sitemap URLs	2013-09-06 12:33:02 +00:00
digitalpebble@googlemail.com	15aa39d41c	Added utility class for testing sitemaps	2013-07-18 14:01:37 +00:00
digitalpebble	7596599e02	issue 26 : default priority correctly implemented in SiteMaps	2013-05-24 14:15:51 +00:00
digitalpebble	40ef1f5a10	issue 27 : [SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString()	2013-05-24 14:09:26 +00:00
lewis.mcgibbney@gmail.com	51bb23f01c	CC 12 Substantiate Javadoc	2013-01-28 02:47:01 +00:00
kkrugler_lists@transpac.com	2884151a73	Don't clear interrupted flag when we throw an AbortedFetchException.	2013-01-24 00:10:55 +00:00
kkrugler_lists@transpac.com	77f2fea1e1	Support "lenient" sitemap parsing, where invalid URLs (because they aren't under the path where the sitemap is located) will not be silently rejected, but rather will be added and flagged as invalid (in SiteMapURL) https://code.google.com/p/crawler-commons/issues/detail?id=3	2013-01-23 23:41:43 +00:00
kkrugler_lists@transpac.com	23c2d1e7b9	Updated CHANGES.txt with recent work (post 0.1) Added warning to pom.xml about not trying to use Maven to build artifacts. Minor cleanup of comments.	2013-01-23 23:08:51 +00:00
kkrugler_lists@transpac.com	7300003e13	Add HTTP status code and reason to FetchedResult. https://code.google.com/p/crawler-commons/issues/detail?id=6	2013-01-23 23:05:21 +00:00
kkrugler_lists@transpac.com	4653ced854	Minor formatting cleanup	2013-01-23 17:45:29 +00:00
kkrugler_lists@transpac.com	8cc20a48fd	Roll in NUTCH-1031 patch from Tejas, to support multiple user agent names (comma-separated). See https://issues.apache.org/jira/browse/NUTCH-1031 for details.	2013-01-23 17:44:46 +00:00
kkrugler_lists@transpac.com	0758bd56c8	Support robots.txt files that have UTF-16LE and UTF-16BE BOMs.	2012-08-17 03:04:40 +00:00
kkrugler_lists@transpac.com	78e4ae5e9e	Added test to validate proper handling of user agent crawler names that consist of multiple words.	2012-08-15 14:00:24 +00:00
kkrugler_lists@transpac.com	bf3666d2b2	Fixed handling of BOM in sitemaps (from Vivek Magotra)	2012-08-14 16:22:32 +00:00
kkrugler_lists@transpac.com	1b2e98afc6	Fixed up missing test dependency on slf4j-log4j Added more tests for "interesting" robots.txt files.	2012-07-04 23:56:22 +00:00
digitalpebble	55da554846	added CHANGES.txt + refactoring of SiteMap objects (thanks to Hannes Schwarz)	2011-07-25 10:23:21 +00:00
kkrugler_lists@transpac.com	5324eb415d	Added simple support for the file: protocol. Cleaned up packaging. Added "install" target.	2011-07-21 17:28:53 +00:00
digitalpebble@googlemail.com	6b977fd672	Added missing license headers	2011-06-04 09:28:57 +00:00
kkrugler_lists@transpac.com	acfaa17090	Test code for robots.txt processing code, HTTP fetcher	2011-06-03 22:14:01 +00:00
kkrugler_lists@transpac.com	1bcd2062a8	Test code for robots.txt processing code, HTTP fetcher	2011-06-03 22:12:47 +00:00
kkrugler_lists@transpac.com	cd55b6b92e	Preliminary versions of robots.txt processing code, HTTP fetcher	2011-06-03 21:30:09 +00:00
digitalpebble	ced3685969	unified logging with slf4j	2010-06-04 11:16:20 +00:00
digitalpebble	9c1bbdc544	SiteMap Parser : see discussion on [http://groups.google.com/group/crawler-commons/browse_thread/thread/9813bba4bcde8ace/]	2010-04-26 10:54:16 +00:00
digitalpebble	89d9dc7b4e	improved list of compound tlds - see NUTCH-786	2010-02-09 09:36:28 +00:00
kkrugler_lists@transpac.com	bf8ba66115	Rolled in Ian's patches to pom.xml and build.xml Rolled in Ian's EffectiveTldFinder code & test cases. Fixed "dist" target for build.	2009-12-12 00:22:44 +00:00
kkrugler_lists@transpac.com	288dca1504	Initial commit of build system, plus some paid-level domain extraction code from Bixo.	2009-12-04 04:13:38 +00:00

183 Commits