mirror of https://github.com/crawler-commons/crawler-commons synced 2024-06-24 03:37:44 +02:00
Commit Graph

65 Commits

Author SHA1 Message Date
kkrugler_lists@transpac.com 0758bd56c8 Support robots.txt files that have UTF-16LE and UTF-16BE BOMs. 2012-08-17 03:04:40 +00:00
kkrugler_lists@transpac.com 78e4ae5e9e Added test to validate proper handling of user agent crawler names that consist of multiple words. 2012-08-15 14:00:24 +00:00
kkrugler_lists@transpac.com bf3666d2b2 Fixed handling of BOM in sitemaps (from Vivek Magotra) 2012-08-14 16:22:32 +00:00
kkrugler_lists@transpac.com 1b2e98afc6 Fixed up missing test dependency on slf4j-log4j. Added more tests for "interesting" robots.txt files. 2012-07-04 23:56:22 +00:00
digitalpebble 55da554846 added CHANGES.txt + refactoring of SiteMap objects (thanks to Hannes Schwarz) 2011-07-25 10:23:21 +00:00
kkrugler_lists@transpac.com 5324eb415d Added simple support for the file: protocol. Cleaned up packaging. Added "install" target. 2011-07-21 17:28:53 +00:00
digitalpebble@googlemail.com 6b977fd672 Added missing license headers 2011-06-04 09:28:57 +00:00
kkrugler_lists@transpac.com acfaa17090 Test code for robots.txt processing code, HTTP fetcher 2011-06-03 22:14:01 +00:00
kkrugler_lists@transpac.com 1bcd2062a8 Test code for robots.txt processing code, HTTP fetcher 2011-06-03 22:12:47 +00:00
kkrugler_lists@transpac.com cd55b6b92e Preliminary versions of robots.txt processing code, HTTP fetcher 2011-06-03 21:30:09 +00:00
digitalpebble ced3685969 unified logging with slf4j 2010-06-04 11:16:20 +00:00
digitalpebble 9c1bbdc544 SiteMap Parser : see discussion on [http://groups.google.com/group/crawler-commons/browse_thread/thread/9813bba4bcde8ace/] 2010-04-26 10:54:16 +00:00
digitalpebble 89d9dc7b4e improved list of compound tlds - see NUTCH-786 2010-02-09 09:36:28 +00:00
kkrugler_lists@transpac.com bf8ba66115 Rolled in Ian's patches to pom.xml and build.xml. Rolled in Ian's EffectiveTldFinder code & test cases. Fixed "dist" target for build. 2009-12-12 00:22:44 +00:00
kkrugler_lists@transpac.com 288dca1504 Initial commit of build system, plus some paid-level domain extraction code from Bixo. 2009-12-04 04:13:38 +00:00