mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-23 11:16:04 +02:00
Commit Graph

345 Commits

Author SHA1 Message Date
lewis.mcgibbney@gmail.com a184dae67f trivial update to project pom 2013-01-30 06:03:15 +00:00
lewis.mcgibbney@gmail.com b5583b87f7 trivial update to project pom 2013-01-30 06:02:05 +00:00
lewis.mcgibbney@gmail.com 7f066c745e [maven-release-plugin] prepare release crawler-commons-0.2 2013-01-30 05:58:05 +00:00
lewis.mcgibbney@gmail.com 1fca6c714f clean up for 0.2 release 2013-01-30 04:12:34 +00:00
lewis.mcgibbney@gmail.com 6e7ee690d2 fix maven clean plugin configuration 2013-01-30 04:05:32 +00:00
lewis.mcgibbney@gmail.com 1494fe23eb substantiate project pom and move external jars out of lib 2013-01-30 03:57:47 +00:00
lewis.mcgibbney@gmail.com df993771c1 Update for 0.2 release 2013-01-28 04:14:09 +00:00
lewis.mcgibbney@gmail.com 51bb23f01c CC 12 Substantiate Javadoc 2013-01-28 02:47:01 +00:00
lewis.mcgibbney@gmail.com 2f34db8056 CC 12 Substantiate Javadoc 2013-01-28 02:45:41 +00:00
lewis.mcgibbney@gmail.com 7ec5b41e3e CC 11 Review default.properties 2013-01-28 02:43:34 +00:00
kkrugler_lists@transpac.com 2884151a73 Don't clear interrupted flag when we throw an AbortedFetchException. 2013-01-24 00:10:55 +00:00
kkrugler_lists@transpac.com 77f2fea1e1 Support "lenient" sitemap parsing, where invalid URLs (because they aren't under the path where the sitemap is located) will not be silently rejected, but rather will be added and flagged as invalid (in SiteMapURL). https://code.google.com/p/crawler-commons/issues/detail?id=3 2013-01-23 23:41:43 +00:00
kkrugler_lists@transpac.com 23c2d1e7b9 Updated CHANGES.txt with recent work (post 0.1). Added warning to pom.xml about not trying to use Maven to build artifacts. Minor cleanup of comments. 2013-01-23 23:08:51 +00:00
kkrugler_lists@transpac.com 7300003e13 Add HTTP status code and reason to FetchedResult. https://code.google.com/p/crawler-commons/issues/detail?id=6 2013-01-23 23:05:21 +00:00
kkrugler_lists@transpac.com 4653ced854 Minor formatting cleanup 2013-01-23 17:45:29 +00:00
kkrugler_lists@transpac.com 8cc20a48fd Roll in NUTCH-1031 patch from Tejas, to support multiple user agent names (comma-separated). See https://issues.apache.org/jira/browse/NUTCH-1031 for details. 2013-01-23 17:44:46 +00:00
kkrugler_lists@transpac.com f81c4c7bec Fix bug with wrong name for Eclipse formatter properties file. 2013-01-23 17:43:14 +00:00
kkrugler_lists@transpac.com 94a0cc4339 Switch default charset for Java files to UTF-8 2013-01-23 17:42:40 +00:00
kkrugler_lists@transpac.com b9e0b64d01 First attempt at getting Javadoc into project 2012-11-06 04:23:01 +00:00
kkrugler_lists@transpac.com 761c9fa648 Switch to using eclipse-formatter.properties, which can be appended (what we need) versus the XML we had. 2012-10-06 23:38:07 +00:00
kkrugler_lists@transpac.com 0758bd56c8 Support robots.txt files that have UTF-16LE and UTF-16BE BOMs. 2012-08-17 03:04:40 +00:00
kkrugler_lists@transpac.com 78e4ae5e9e Added test to validate proper handling of user agent crawler names that consist of multiple words. 2012-08-15 14:00:24 +00:00
kkrugler_lists@transpac.com bf3666d2b2 Fixed handling of BOM in sitemaps (from Vivek Magotra) 2012-08-14 16:22:32 +00:00
kkrugler_lists@transpac.com 1b2e98afc6 Fixed up missing test dependency on slf4j-log4j. Added more tests for "interesting" robots.txt files. 2012-07-04 23:56:22 +00:00
digitalpebble 55da554846 added CHANGES.txt + refactoring of SiteMap objects (thanks to Hannes Schwarz) 2011-07-25 10:23:21 +00:00
kkrugler_lists@transpac.com 5324eb415d Added simple support for the file: protocol. Cleaned up packaging. Added "install" target. 2011-07-21 17:28:53 +00:00
digitalpebble 02dff51b08 package : copy build files to dist dir 2011-07-12 12:23:26 +00:00
digitalpebble@googlemail.com 790cbf9aae changing version to 0.2-SNAPSHOT 2011-07-06 18:49:12 +00:00
digitalpebble f92364a0ce Changed year to 2011 + distribute jar containing resources + copy license to root of distributed package 2011-07-06 14:56:15 +00:00
digitalpebble 6082b42c04 Added Apache License 2.0 2011-07-06 14:52:36 +00:00
digitalpebble d4c4461a73 reformat pom.xml + added stage task to build.xml 2011-07-06 14:18:35 +00:00
digitalpebble@googlemail.com d429b67182 pre-initial release : added dev info to pom.xml + ANT tasks for deployment to Maven public repository 2011-07-06 11:02:04 +00:00
kkrugler_lists@transpac.com 5d533e7e0d Add jar that's only in (currently unavailable) 101tec Nexus repo, so at least users can manually install it 2011-07-01 17:42:11 +00:00
kkrugler_lists@transpac.com 5d6825f34e Remove unneeded dependency on 101tec and Apache snapshot repositories 2011-07-01 17:02:23 +00:00
digitalpebble@googlemail.com 6b977fd672 Added missing license headers 2011-06-04 09:28:57 +00:00
kkrugler_lists@transpac.com acfaa17090 Test code for robots.txt processing code, HTTP fetcher 2011-06-03 22:14:01 +00:00
kkrugler_lists@transpac.com 1bcd2062a8 Test code for robots.txt processing code, HTTP fetcher 2011-06-03 22:12:47 +00:00
kkrugler_lists@transpac.com cd55b6b92e Preliminary versions of robots.txt processing code, HTTP fetcher 2011-06-03 21:30:09 +00:00
kkrugler_lists@transpac.com b37cdd0baa Preliminary versions of robots.txt processing code, HTTP fetcher 2011-06-03 21:29:34 +00:00
digitalpebble ced3685969 unified logging with slf4j 2010-06-04 11:16:20 +00:00
digitalpebble 9c1bbdc544 SiteMap Parser : see discussion on [http://groups.google.com/group/crawler-commons/browse_thread/thread/9813bba4bcde8ace/] 2010-04-26 10:54:16 +00:00
digitalpebble 89d9dc7b4e improved list of compound tlds - see NUTCH-786 2010-02-09 09:36:28 +00:00
kkrugler_lists@transpac.com bf8ba66115 Rolled in Ian's patches to pom.xml and build.xml. Rolled in Ian's EffectiveTldFinder code & test cases. Fixed "dist" target for build. 2009-12-12 00:22:44 +00:00
kkrugler_lists@transpac.com ddc38ce331 Change name of format from "Bixo" to "Crawler-commons" 2009-12-04 04:19:21 +00:00
kkrugler_lists@transpac.com 288dca1504 Initial commit of build system, plus some paid-level domain extraction code from Bixo. 2009-12-04 04:13:38 +00:00