Aecio Santos
22ad611aef
Fixes bug introduced in pull request #98 and adds ability to configure
a new timeout introduced in httpclient 4.5.1
2016-05-04 19:50:33 -04:00
Ken Krugler
b5704684ff
Clarify which method is preferred
Generally better to call parseSiteMap w/o passing an explicit
contentType, as web servers lie all the time - so let Tika figure it
out.
2015-12-30 22:14:21 -08:00
Ken Krugler
31a6c80ea7
Fix sitemap extraction from robots.txt
2015-12-30 22:03:49 -08:00
Julien Nioche
f3f34844d4
Deprecate fetcher classes #97
2015-12-02 10:30:54 +00:00
Julien Nioche
c1b3f4b086
Added URLFilter interface + BasicURLNormalizer borrowed from Nutch #106
2015-11-13 10:58:48 +00:00
Ken Krugler
940cbfd0e8
Merged with aecio
2015-10-20 07:48:51 -07:00
Aecio Santos
f2bf9300e6
Upgrades httpclient to version 4.5.1 (fixes #84) and no longer ignores
test failures during the Maven build
2015-10-09 14:08:39 -04:00
Julien Nioche
b7ccc8d1f1
Fixed test for domains #103
2015-10-07 10:21:28 +01:00
Julien Nioche
9e93037e79
updated tld names from publicsuffix.org
2015-10-05 13:38:10 +01:00
Ken Krugler
2c687d1bba
Roll in fix for issue #87 w/RSS 1.0 site maps
2015-09-11 15:16:12 -07:00
Ken Krugler
d08f396576
Tweaked Javadoc update from Michael Roeder
2015-09-11 11:49:20 -07:00
MichaelRoeder
e8f38fda03
Added a javadoc comment to the SimpleRobotRulesParser class explaining its behavior.
2015-09-10 13:03:10 +02:00
Ken Krugler
16e13bedc4
Improve Javadoc on robot name matching
And add a .gitignore
2015-09-08 16:12:31 -07:00
Lewis John McGibbney
d7ed6a742c
Upgrade to JDK 1.7, remove license header from pom.xml, and improve logging implementations.
2015-09-07 15:20:00 -04:00
Lewis John McGibbney
ba5906ec40
Upgrade to JDK 1.7 compiler version and introduce Maven forbidden APIs plugin
2015-09-06 13:55:26 -04:00
matt-deboer
d203f0d4ac
Reworked sitemap parser to use SAX for optional parsing of partial docs.
Traded Stack for LinkedList for performance improvement.
Fix to getParentElement();
Added test for case referenced by issues #79 and #75.
2015-06-27 22:33:15 -07:00
Julien Nioche
feb40af519
Applied formatting with mvn java-formatter:format
2015-06-11 10:45:06 +01:00
Chaiavi
5cf62ab7d5
Fix for Issue 60
A SitemapUrl can be invalid when the sitemap that references it sits in a
directory on a completely different path from the referenced SitemapUrl,
as described here:
http://www.sitemaps.org/protocol.html#location
To clarify the validity aspect, the following changes are needed:
1. Add more explanation in the Javadoc and in log messages.
2. Rename "Legal" (I think only one occurrence) to "valid" (in the parser).
3. Add a new method to the Sitemap class to get all *valid* SitemapUrls.
4. Log a message when dropping a URL due to invalidity; a URL shouldn't be dropped quietly.
2015-06-08 23:41:56 +03:00
Julien Nioche
20861baf47
Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler)
2015-05-22 11:08:21 +01:00
Julien Nioche
40731c3304
applied formatting with mvn java-formatter:format
2015-05-15 09:03:24 +01:00
Julien Nioche
dafbbdd2bf
SiteMapParser use UPPERCASE for static finals
2015-04-13 20:41:47 +01:00
Julien Nioche
152a2446c2
Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry; instantiate Tika from the start
2015-04-13 16:36:39 +01:00
kkrugler_lists@transpac.com
7f7f915b0b
Issue 68: Case-sensitive path matching
2015-03-27 22:09:32 +00:00
kkrugler_lists@transpac.com
6eb1459345
Issue 67: time in lastMod string not extracted during parse
2015-03-27 21:15:43 +00:00
kkrugler_lists@transpac.com
aeafa263e5
Make SimpleRobotRules serializable (issue #59 )
2015-01-26 13:18:54 +00:00
avraham2@gmail.com
19bc879d91
Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag
Fixed the NPE issue
Removed the recursive flag
Upgraded javadocs
2015-01-25 09:18:01 +00:00
digitalpebble@googlemail.com
4f8614c85e
Issue 32:[Robots] Resolve relative URL for sitemaps
2015-01-22 10:54:14 +00:00
digitalpebble@googlemail.com
8a0034c1f1
Issue 62:[Sitemaps] Add new parseSiteMap method
2015-01-21 08:59:01 +00:00
avraham2@gmail.com
34195de153
Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them
2015-01-12 10:53:20 +00:00
avraham2@gmail.com
546b9ff60e
Issue51: Upgrade httpclient to the latest version
2015-01-12 10:30:30 +00:00
avraham2@gmail.com
823ea3221c
Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily
2014-11-25 12:06:53 +00:00
avraham2@gmail.com
bcc2c7fe26
Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) lowercases the domain name, which shouldn't happen
2014-11-21 12:01:55 +00:00
avraham2@gmail.com
c8261cbbc4
Issue55: fix setPriority
2014-10-26 11:48:57 +00:00
lewis.mcgibbney@gmail.com
8dda18a77c
Issue 50: Add Fetch Report to FetchedResult
2014-10-20 14:07:10 +00:00
lewis.mcgibbney@gmail.com
fa18129bcf
Issue 50: Add Fetch Report to FetchedResult
2014-10-19 18:59:15 +00:00
kkrugler_lists@transpac.com
6fe3770889
Fix for issue #53 - handle spaces in comma-separated list of agent names
2014-10-04 16:32:12 +00:00
digitalpebble@googlemail.com
fef6d41ef8
Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6
2014-09-24 13:43:09 +00:00
avraham2@gmail.com
5823288428
Removed commented-out code I mistakenly put there in the past
2014-08-19 19:14:00 +00:00
avraham2@gmail.com
983cce7c07
Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases
new Tika().detect(URL) will solve the mentioned problem,
BUT it will cause our library to fetch the sitemap twice.
A better solution should be sought.
Maybe use new Tika().detect(bytes, filename);
2014-08-19 19:08:27 +00:00
avraham2@gmail.com
19e2918aca
Change the Mime type parsing to use Tika's MediaType.
I want to identify the mediaType:
MediaType mediaType = MediaType.parse(contentType);
and then process it as follows:
1. Recurse through the media type's supertypes until we reach the root, comparing each to the XML media type (or others).
2. If not found, check the aliases (for example, text/xml is an alias of application/xml, which is the more accurate form).
3. If still not found, it is a bad MediaType and the exception should be thrown.
2014-08-06 19:06:45 +00:00
lewis.mcgibbney@gmail.com
4793307adb
Issue 39: [Sitemaps] Add a convenience method to the Parser that takes only a URL argument
2014-07-07 14:27:49 +00:00
lewis.mcgibbney@gmail.com
01e4feef8b
Issue 42 [Sitemaps] Add more JUnit tests
2014-07-01 05:11:27 +00:00
lewis.mcgibbney@gmail.com
94c3ed4068
Upgrade to JUnit v4 conventions in SiteMapParser
2014-05-29 21:02:16 +00:00
lewis.mcgibbney@gmail.com
cb71c5502a
Upgrade the Slf4j logging in SiteMaps
2014-05-29 20:28:37 +00:00
kkrugler_lists@transpac.com
cc67a3d2c8
Merge patch for issue #13 from Lewis
2014-03-17 00:37:26 +00:00
kkrugler_lists@transpac.com
a6ac57e354
Issue 21: allow has higher precedence than disallow, if both rules are the same length
2014-03-14 00:02:38 +00:00
kkrugler_lists@transpac.com
c1f050d33f
Add missing file from previous commit
2014-03-14 00:01:44 +00:00
kkrugler_lists@transpac.com
ea67b56e42
Add tests for wildcards (via alparslanavci), and sorting rules
2014-03-13 23:50:17 +00:00
kkrugler_lists@transpac.com
af74ccf44d
Add support for wildcards (via alparslanavci), and sorting rules
2014-03-13 23:49:49 +00:00
kkrugler_lists@transpac.com
300d6ebdb7
Roll in patch from Lewis for issue #23 ( http://code.google.com/p/crawler-commons/issues/detail?id=23 )
2014-01-24 21:16:38 +00:00
kkrugler_lists@transpac.com
dc8f241782
Fix up tests to match latest data file
2014-01-24 21:05:46 +00:00
kkrugler_lists@transpac.com
aa4d410223
Make setProcessed public, was implicitly package private
2014-01-24 20:51:33 +00:00
kkrugler_lists@transpac.com
dbae7e20df
Updated comments w/link to actual Mozilla data file
2014-01-24 20:44:51 +00:00
kkrugler_lists@transpac.com
16e46b0d50
Added a few more suffixes
2014-01-24 20:44:31 +00:00
kkrugler_lists@transpac.com
a98bb030af
Updated to latest from http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
2014-01-24 20:44:12 +00:00
digitalpebble@googlemail.com
5f3ab105ad
SiteMap tester can take mime type as argument
2013-10-03 09:04:23 +00:00
digitalpebble@googlemail.com
4ce4b358b6
issue 29 : more robust parsing when loc element is missing
2013-10-02 13:40:50 +00:00
digitalpebble@googlemail.com
d9e3cb4cbb
Issue 25:Robots.txt parser should not lowercase sitemap URLs
2013-09-06 12:33:02 +00:00
digitalpebble@googlemail.com
15aa39d41c
Added utility class for testing sitemaps
2013-07-18 14:01:37 +00:00
digitalpebble
7596599e02
issue 26 : default priority correctly implemented in SiteMaps
2013-05-24 14:15:51 +00:00
digitalpebble
40ef1f5a10
issue 27 : [SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString()
2013-05-24 14:09:26 +00:00
lewis.mcgibbney@gmail.com
51bb23f01c
CC 12 Substantiate Javadoc
2013-01-28 02:47:01 +00:00
kkrugler_lists@transpac.com
2884151a73
Don't clear interrupted flag when we throw an AbortedFetchException.
2013-01-24 00:10:55 +00:00
kkrugler_lists@transpac.com
77f2fea1e1
Support "lenient" sitemap parsing, where invalid URLs (because
they aren't under the path where the sitemap is located) will
not be silently rejected, but rather will be added and flagged
as invalid (in SiteMapURL).
https://code.google.com/p/crawler-commons/issues/detail?id=3
2013-01-23 23:41:43 +00:00
kkrugler_lists@transpac.com
23c2d1e7b9
Updated CHANGES.txt with recent work (post 0.1)
Added warning to pom.xml about not trying to use Maven to build artifacts.
Minor cleanup of comments.
2013-01-23 23:08:51 +00:00
kkrugler_lists@transpac.com
7300003e13
Add HTTP status code and reason to FetchedResult.
https://code.google.com/p/crawler-commons/issues/detail?id=6
2013-01-23 23:05:21 +00:00
kkrugler_lists@transpac.com
4653ced854
Minor formatting cleanup
2013-01-23 17:45:29 +00:00
kkrugler_lists@transpac.com
8cc20a48fd
Roll in NUTCH-1031 patch from Tejas, to support multiple
user agent names (comma-separated).
See https://issues.apache.org/jira/browse/NUTCH-1031 for details.
2013-01-23 17:44:46 +00:00
kkrugler_lists@transpac.com
0758bd56c8
Support robots.txt files that have UTF-16LE and UTF-16BE BOMs.
2012-08-17 03:04:40 +00:00
kkrugler_lists@transpac.com
78e4ae5e9e
Added test to validate proper handling of user agent crawler
names that consist of multiple words.
2012-08-15 14:00:24 +00:00
kkrugler_lists@transpac.com
bf3666d2b2
Fixed handling of BOM in sitemaps (from Vivek Magotra)
2012-08-14 16:22:32 +00:00
kkrugler_lists@transpac.com
1b2e98afc6
Fixed up missing test dependency on slf4j-log4j
Added more tests for "interesting" robots.txt files.
2012-07-04 23:56:22 +00:00
digitalpebble
55da554846
added CHANGES.txt + refactoring of SiteMap objects (thanks to Hannes Schwarz)
2011-07-25 10:23:21 +00:00
kkrugler_lists@transpac.com
5324eb415d
Added simple support for the file: protocol.
Cleaned up packaging.
Added "install" target.
2011-07-21 17:28:53 +00:00
digitalpebble@googlemail.com
6b977fd672
Added missing license headers
2011-06-04 09:28:57 +00:00
kkrugler_lists@transpac.com
acfaa17090
Test code for robots.txt processing code, HTTP fetcher
2011-06-03 22:14:01 +00:00
kkrugler_lists@transpac.com
1bcd2062a8
Test code for robots.txt processing code, HTTP fetcher
2011-06-03 22:12:47 +00:00
kkrugler_lists@transpac.com
cd55b6b92e
Preliminary versions of robots.txt processing code, HTTP fetcher
2011-06-03 21:30:09 +00:00
digitalpebble
ced3685969
unified logging with slf4j
2010-06-04 11:16:20 +00:00
digitalpebble
9c1bbdc544
SiteMap Parser : see discussion on [ http://groups.google.com/group/crawler-commons/browse_thread/thread/9813bba4bcde8ace/ ]
2010-04-26 10:54:16 +00:00
digitalpebble
89d9dc7b4e
improved list of compound tlds - see NUTCH-786
2010-02-09 09:36:28 +00:00
kkrugler_lists@transpac.com
bf8ba66115
Rolled in Ian's patches to pom.xml and build.xml
Rolled in Ian's EffectiveTldFinder code & test cases.
Fixed "dist" target for build.
2009-12-12 00:22:44 +00:00
kkrugler_lists@transpac.com
288dca1504
Initial commit of build system, plus some paid-level domain extraction code from Bixo.
2009-12-04 04:13:38 +00:00