lewis.mcgibbney@gmail.com
|
df993771c1
|
Update for 0.2 release
|
2013-01-28 04:14:09 +00:00 |
|
lewis.mcgibbney@gmail.com
|
51bb23f01c
|
CC 12 Substantiate Javadoc
|
2013-01-28 02:47:01 +00:00 |
|
lewis.mcgibbney@gmail.com
|
2f34db8056
|
CC 12 Substantiate Javadoc
|
2013-01-28 02:45:41 +00:00 |
|
lewis.mcgibbney@gmail.com
|
7ec5b41e3e
|
CC 11 Review default.properties
|
2013-01-28 02:43:34 +00:00 |
|
kkrugler_lists@transpac.com
|
2884151a73
|
Don't clear interrupted flag when we throw an
AbortedFetchException.
|
2013-01-24 00:10:55 +00:00 |
|
kkrugler_lists@transpac.com
|
77f2fea1e1
|
Support "lenient" sitemap parsing, where invalid URLs (because
they aren't under the path where the sitemap is located) will
not be silently rejected, but rather will be added and flagged
as invalid (in SiteMapURL).
https://code.google.com/p/crawler-commons/issues/detail?id=3
|
2013-01-23 23:41:43 +00:00 |
|
kkrugler_lists@transpac.com
|
23c2d1e7b9
|
Updated CHANGES.txt with recent work (post 0.1)
Added warning to pom.xml about not trying to use Maven to build artifacts.
Minor cleanup of comments.
|
2013-01-23 23:08:51 +00:00 |
|
kkrugler_lists@transpac.com
|
7300003e13
|
Add HTTP status code and reason to FetchedResult.
https://code.google.com/p/crawler-commons/issues/detail?id=6
|
2013-01-23 23:05:21 +00:00 |
|
kkrugler_lists@transpac.com
|
4653ced854
|
Minor formatting cleanup
|
2013-01-23 17:45:29 +00:00 |
|
kkrugler_lists@transpac.com
|
8cc20a48fd
|
Roll in NUTCH-1031 patch from Tejas, to support multiple
user agent names (comma-separated).
See https://issues.apache.org/jira/browse/NUTCH-1031 for details.
|
2013-01-23 17:44:46 +00:00 |
|
kkrugler_lists@transpac.com
|
f81c4c7bec
|
Fix bug with wrong name for Eclipse formatter properties file.
|
2013-01-23 17:43:14 +00:00 |
|
kkrugler_lists@transpac.com
|
94a0cc4339
|
Switch default charset for Java files to UTF-8
|
2013-01-23 17:42:40 +00:00 |
|
kkrugler_lists@transpac.com
|
b9e0b64d01
|
First attempt at getting Javadoc into project
|
2012-11-06 04:23:01 +00:00 |
|
kkrugler_lists@transpac.com
|
761c9fa648
|
Switch to using eclipse-formatter.properties, which can be appended (what we need)
versus the XML we had.
|
2012-10-06 23:38:07 +00:00 |
|
kkrugler_lists@transpac.com
|
0758bd56c8
|
Support robots.txt files that have UTF-16LE and UTF-16BE BOMs.
|
2012-08-17 03:04:40 +00:00 |
|
kkrugler_lists@transpac.com
|
78e4ae5e9e
|
Added test to validate proper handling of user agent crawler
names that consist of multiple words.
|
2012-08-15 14:00:24 +00:00 |
|
kkrugler_lists@transpac.com
|
bf3666d2b2
|
Fixed handling of BOM in sitemaps (from Vivek Magotra)
|
2012-08-14 16:22:32 +00:00 |
|
kkrugler_lists@transpac.com
|
1b2e98afc6
|
Fixed up missing test dependency on slf4j-log4j.
Added more tests for "interesting" robots.txt files.
|
2012-07-04 23:56:22 +00:00 |
|
digitalpebble
|
55da554846
|
added CHANGES.txt + refactoring of SiteMap objects (thanks to Hannes Schwarz)
|
2011-07-25 10:23:21 +00:00 |
|
kkrugler_lists@transpac.com
|
5324eb415d
|
Added simple support for the file: protocol.
Cleaned up packaging.
Added "install" target.
|
2011-07-21 17:28:53 +00:00 |
|
digitalpebble
|
02dff51b08
|
package : copy build files to dist dir
|
2011-07-12 12:23:26 +00:00 |
|
digitalpebble@googlemail.com
|
790cbf9aae
|
changing version to 0.2-SNAPSHOT
|
2011-07-06 18:49:12 +00:00 |
|
digitalpebble
|
f92364a0ce
|
Changed year to 2011 + distribute jar containing resources + copy license to root of distributed package
|
2011-07-06 14:56:15 +00:00 |
|
digitalpebble
|
6082b42c04
|
Added Apache License 2.0
|
2011-07-06 14:52:36 +00:00 |
|
digitalpebble
|
d4c4461a73
|
reformat pom.xml + added stage task to build.xml
|
2011-07-06 14:18:35 +00:00 |
|
digitalpebble@googlemail.com
|
d429b67182
|
pre-initial release : added dev info to pom.xml + ANT tasks for deployment to Maven public repository
|
2011-07-06 11:02:04 +00:00 |
|
kkrugler_lists@transpac.com
|
5d533e7e0d
|
Add jar that's only in (currently unavailable) 101tec Nexus repo, so at least users can manually install it
|
2011-07-01 17:42:11 +00:00 |
|
kkrugler_lists@transpac.com
|
5d6825f34e
|
Remove unneeded dependency on 101tec and Apache snapshot repositories
|
2011-07-01 17:02:23 +00:00 |
|
digitalpebble@googlemail.com
|
6b977fd672
|
Added missing license headers
|
2011-06-04 09:28:57 +00:00 |
|
kkrugler_lists@transpac.com
|
acfaa17090
|
Test code for robots.txt processing code, HTTP fetcher
|
2011-06-03 22:14:01 +00:00 |
|
kkrugler_lists@transpac.com
|
1bcd2062a8
|
Test code for robots.txt processing code, HTTP fetcher
|
2011-06-03 22:12:47 +00:00 |
|
kkrugler_lists@transpac.com
|
cd55b6b92e
|
Preliminary versions of robots.txt processing code, HTTP fetcher
|
2011-06-03 21:30:09 +00:00 |
|
kkrugler_lists@transpac.com
|
b37cdd0baa
|
Preliminary versions of robots.txt processing code, HTTP fetcher
|
2011-06-03 21:29:34 +00:00 |
|
digitalpebble
|
ced3685969
|
unified logging with slf4j
|
2010-06-04 11:16:20 +00:00 |
|
digitalpebble
|
9c1bbdc544
|
SiteMap Parser : see discussion on [http://groups.google.com/group/crawler-commons/browse_thread/thread/9813bba4bcde8ace/]
|
2010-04-26 10:54:16 +00:00 |
|
digitalpebble
|
89d9dc7b4e
|
improved list of compound tlds - see NUTCH-786
|
2010-02-09 09:36:28 +00:00 |
|
kkrugler_lists@transpac.com
|
bf8ba66115
|
Rolled in Ian's patches to pom.xml and build.xml
Rolled in Ian's EffectiveTldFinder code & test cases.
Fixed "dist" target for build.
|
2009-12-12 00:22:44 +00:00 |
|
kkrugler_lists@transpac.com
|
ddc38ce331
|
Change name of format from "Bixo" to "Crawler-commons"
|
2009-12-04 04:19:21 +00:00 |
|
kkrugler_lists@transpac.com
|
288dca1504
|
Initial commit of build system, plus some paid-level domain extraction code from Bixo.
|
2009-12-04 04:13:38 +00:00 |
|