mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-24 19:36:06 +02:00
Commit Graph

95 Commits

Author SHA1 Message Date
Lewis John McGibbney 18bbae908c Upgrade to JDK 1.8 2016-09-29 21:39:24 -07:00
Lewis John McGibbney fc3378cb95 Merge branch 'master' into ISSUE-125 2016-09-29 21:36:05 -07:00
Michael Lavelle 58608c485d Further changes to SiteMapParser and AbstractSiteMap method visibility. Previous pull request to change certain methods to protected did not cover all required methods. The changes in this commit allow for the addition of GoogleNews site maps 2016-09-29 09:07:45 +01:00
Lewis John McGibbney 8814bed160 Upgrade to JDK 1.8 2016-09-26 15:20:39 -07:00
Julien Nioche 145ff5ceaa Merge pull request #124 from echoboxapp/site-map-parser-protected-methods
Modifying parsing methods of SiteMapParser so they are protected rath…
2016-09-21 14:49:27 +01:00
Michael Lavelle b26f7fd6f9 Modifying parsing methods of SiteMapParser so they are protected rather than private 2016-09-19 10:12:51 +01:00
Julien Nioche 5f997c37e4 Faster parsing of dates. Fixes #117 2016-09-12 15:41:23 +01:00
I c24a297836 RobotRule inner class, changed from protected to public static, for easy serialization for high performance serializers like protoStuff or google protocol buffers. 2016-07-28 18:19:05 +02:00
I 6882ff4103 Cleaned unused import. 2016-07-28 18:16:41 +02:00
Julien Nioche 81aefc118e Improved sitemap tests : check that the URLs returned correspond to the input 2016-07-06 14:23:29 +01:00
Julien Nioche 96bc1bbf6b Added test class and resource for #29; See #116 2016-07-06 10:16:07 +01:00
Julien Nioche 0775bb216e Fix license headers + applied formatting. Fixes #108 2016-06-30 11:45:08 +01:00
Julien Nioche be52b770ff Rename package crawlercommons.url Fixes #107 2016-06-30 11:11:49 +01:00
Aecio Santos 22ad611aef Fixes bug introduced in pull request #98 and adds ability to configure a new timeout introduced in httpclient 4.5.1
2016-05-04 19:50:33 -04:00
Ken Krugler b5704684ff Clarify which method is preferred
Generally better to call parseSiteMap w/o passing an explicit contentType, as web servers lie all the time - so let Tika figure it out.
2015-12-30 22:14:21 -08:00
Ken Krugler 31a6c80ea7 Fix sitemap extraction from robots.txt 2015-12-30 22:03:49 -08:00
Julien Nioche f3f34844d4 Deprecate fetcher classes #97 2015-12-02 10:30:54 +00:00
Julien Nioche c1b3f4b086 Added URLFilter interface + BasicURLNormalizer borrowed from Nutch #106 2015-11-13 10:58:48 +00:00
Ken Krugler 940cbfd0e8 Merged with aecio 2015-10-20 07:48:51 -07:00
Aecio Santos f2bf9300e6 Upgrades httpclient to version 4.5.1 (fixes #84) and do not ignore test failures during maven build
2015-10-09 14:08:39 -04:00
Julien Nioche b7ccc8d1f1 Fixed test for domains #103 2015-10-07 10:21:28 +01:00
Julien Nioche 9e93037e79 updated tld names from publicsuffix.org 2015-10-05 13:38:10 +01:00
Ken Krugler 2c687d1bba Roll in fix for issue #87 w/RSS 1.0 site maps 2015-09-11 15:16:12 -07:00
Ken Krugler d08f396576 Tweaked Javadoc update from Michael Roeder 2015-09-11 11:49:20 -07:00
MichaelRoeder e8f38fda03 Added a javadoc comment to the SimpleRobotRulesParser class explaining its behavior. 2015-09-10 13:03:10 +02:00
Ken Krugler 16e13bedc4 Improve Javadoc on robot name matching
And add a .gitignore
2015-09-08 16:12:31 -07:00
Lewis John McGibbney d7ed6a742c Upgrade to Jdk1.7 - remove license header at pom.xml and improve logging implementations. 2015-09-07 15:20:00 -04:00
Lewis John McGibbney ba5906ec40 Upgrade to JDK 1.7 compiler version and introduce Maven forbidden API's plugin 2015-09-06 13:55:26 -04:00
Julien Nioche feb40af519 Applied formatting with mvn java-formatter:format 2015-06-11 10:45:06 +01:00
Chaiavi 5cf62ab7d5 Fix for Issue 60
SitemapUrls can be not valid when they are referenced in a sitemap which it's directory is on a completely different path than the referenced SitemapUrl.

All as indicated here:
http://www.sitemaps.org/protocol.html#location

In order to clarify the validity aspect we need to upgrade the following
1. Add a little more explanations as javadocs and as logs
2. Rename "Legal" (I think only one occurrence) to "valid" (in the parser)
3. Add to the Sitemap class a new method to get all *valid* SitemapUrls
4. When dropping a URL due to invalidity a log should be shown, a URL shouldn't be dropped quietly.
2015-06-08 23:41:56 +03:00
Julien Nioche 20861baf47 Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler) 2015-05-22 11:08:21 +01:00
Julien Nioche 40731c3304 applied formatting with mvn java-formatter:format 2015-05-15 09:03:24 +01:00
Julien Nioche dafbbdd2bf SiteMapParser use UPPERCASE for static finals 2015-04-13 20:41:47 +01:00
Julien Nioche 152a2446c2 Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry; instanciate Tika from the start 2015-04-13 16:36:39 +01:00
kkrugler_lists@transpac.com 7f7f915b0b Issue 68: Case-sensitive path matching 2015-03-27 22:09:32 +00:00
kkrugler_lists@transpac.com 6eb1459345 Issue 67: time in lastMod string not extracted during parse 2015-03-27 21:15:43 +00:00
kkrugler_lists@transpac.com aeafa263e5 Make SimpleRobotRules serializable (issue #59) 2015-01-26 13:18:54 +00:00
avraham2@gmail.com 19bc879d91 Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag
Fixed the NPE issue
Removed the recursive flag
Upgraded javadocs
2015-01-25 09:18:01 +00:00
digitalpebble@googlemail.com 4f8614c85e Issue 32:[Robots] Resolve relative URL for sitemaps 2015-01-22 10:54:14 +00:00
digitalpebble@googlemail.com 8a0034c1f1 Issue 62:[Sitemaps] Add new parseSiteMap method 2015-01-21 08:59:01 +00:00
avraham2@gmail.com 34195de153 Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them 2015-01-12 10:53:20 +00:00
avraham2@gmail.com 546b9ff60e Issue51: Upgrade httpclient to the latest version 2015-01-12 10:30:30 +00:00
avraham2@gmail.com 823ea3221c Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily 2014-11-25 12:06:53 +00:00
avraham2@gmail.com bcc2c7fe26 Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowered case which shouldn't happen 2014-11-21 12:01:55 +00:00
avraham2@gmail.com c8261cbbc4 Issue55: fix setPriority 2014-10-26 11:48:57 +00:00
lewis.mcgibbney@gmail.com 8dda18a77c Issue 50: Add Fetch Report to FetchedResult 2014-10-20 14:07:10 +00:00
lewis.mcgibbney@gmail.com fa18129bcf Issue 50: Add Fetch Report to FetchedResult 2014-10-19 18:59:15 +00:00
kkrugler_lists@transpac.com 6fe3770889 Fix for issue #53 - handle spaces in comma-separated list of agent names 2014-10-04 16:32:12 +00:00
digitalpebble@googlemail.com fef6d41ef8 Issue 45:[Sitemaps] Upgrade code after release of Tika v1.6 2014-09-24 13:43:09 +00:00
avraham2@gmail.com 5823288428 Removed commented out code I wrongfully put there in the past 2014-08-19 19:14:00 +00:00