mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-27 12:16:04 +02:00
Commit Graph

275 Commits

Author SHA1 Message Date
Lewis John McGibbney d7ed6a742c Upgrade to Jdk1.7 - remove license header at pom.xml and improve logging implementations. 2015-09-07 15:20:00 -04:00
Lewis John McGibbney c385883ec3 Merge branch 'master' into jdk1.7 2015-09-07 14:17:29 -04:00
Avi Hayun 478a7d7240 Merge pull request #82 from crawler-commons/validSitemaps
[Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls
2015-09-07 21:11:58 +03:00
Lewis John McGibbney ba5906ec40 Upgrade to JDK 1.7 compiler version and introduce Maven forbidden APIs plugin 2015-09-06 13:55:26 -04:00
Lewis John McGibbney 827b073d12 Merge branch 'master' into validSitemaps 2015-08-26 09:50:26 -07:00
Julien Nioche f155148216 Upgraded Tika 1.10 #89 2015-08-20 15:35:38 +01:00
matt-deboer d0e1f1f124 Added docker-rest script (helpful on OSX) to re-create boot2docker vm and update env. vars in ~/.profile 2015-07-23 14:41:21 -07:00
matt-deboer d203f0d4ac Reworked sitemap parser to use SAX for optional parsing of partial docs.
Traded Stack for LinkedList for performance improvement.

Fix to getParentElement();
Added test for case referenced by issues #79 and #75.
2015-06-27 22:33:15 -07:00
Lewis John McGibbney cd06d834a6 Update README.md 2015-06-15 12:40:53 -07:00
Julien Nioche 0f24082dc0 Applied formatting with mvn java-formatter:format 2015-06-11 10:47:19 +01:00
Julien Nioche feb40af519 Applied formatting with mvn java-formatter:format 2015-06-11 10:45:06 +01:00
Julien Nioche 37c13c8465 Update README.md
added link to javadoc
2015-06-11 10:39:45 +01:00
Julien Nioche b77fa0052a Update README.md
Announcing 0.6 release
2015-06-11 10:37:52 +01:00
Chaiavi 5cf62ab7d5 Fix for Issue 60
SitemapUrls can be invalid when they are referenced in a sitemap whose directory is on a completely different path than the referenced SitemapUrl.

All as indicated here:
http://www.sitemaps.org/protocol.html#location

In order to clarify the validity aspect we need to upgrade the following:
1. Add a little more explanation in the javadocs and in the logs
2. Rename "Legal" (I think only one occurrence) to "valid" (in the parser)
3. Add to the Sitemap class a new method to get all *valid* SitemapUrls
4. When dropping a URL due to invalidity a log should be shown; a URL shouldn't be dropped quietly.
2015-06-08 23:41:56 +03:00
Julien Nioche 504c207488 Added Julien's public key to KEYS 2015-06-04 10:50:14 +01:00
Lewis John McGibbney 9d45376336 Add KEYS file to CC 2015-06-01 21:55:49 -07:00
Julien Nioche 22206f3a43 [maven-release-plugin] prepare for next development iteration 2015-05-27 16:38:05 +01:00
Julien Nioche 39d076a13b [maven-release-plugin] prepare release crawler-commons-0.6 2015-05-27 16:38:01 +01:00
Julien Nioche 2394b6713a Removed tagBase from maven-release-plugin configuration 2015-05-27 16:36:05 +01:00
Julien Nioche ee4a936066 Revert "[maven-release-plugin] prepare release crawler-commons-0.6"
This reverts commit 3b09a9ba52.
2015-05-27 16:16:54 +01:00
Julien Nioche 3b09a9ba52 [maven-release-plugin] prepare release crawler-commons-0.6 2015-05-27 16:05:02 +01:00
Julien Nioche a41ab43c41 README Fixed URL for changes file release 0.5
was pointing to the 'live' file
2015-05-27 12:18:40 +01:00
Julien Nioche e8ec75e019 Reverted failed release + changed groupId 2015-05-27 12:16:18 +01:00
Julien Nioche d115f158b2 [maven-release-plugin] prepare for next development iteration 2015-05-26 10:58:35 +01:00
Julien Nioche 8328e554d4 [maven-release-plugin] prepare release crawler-commons-0.6 2015-05-26 10:58:31 +01:00
Julien Nioche 20861baf47 Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler) 2015-05-22 11:08:21 +01:00
Julien Nioche 40731c3304 applied formatting with mvn java-formatter:format 2015-05-15 09:03:24 +01:00
Julien Nioche 8de545ccdc Merge pull request #78 from lewismc/CC-77
simplify pom file #77
2015-05-15 08:55:45 +01:00
Lewis John McGibbney e8065d5372 simplify pom file #77 2015-05-14 12:05:37 -07:00
Julien Nioche 47e30b5c22 Merge pull request #76 from crawler-commons/formatter
maven-java-formatter-plugin
2015-05-06 10:10:56 +01:00
Julien Nioche 63a837d5d7 updated CHANGES.txt for #76 2015-05-06 10:10:25 +01:00
Julien Nioche d22a9d0617 removed properties file + 1.6 compliant formatting 2015-05-06 09:33:31 +01:00
Julien Nioche 0a5d9d338a maven-java-formatter-plugin 2015-04-30 13:52:44 +01:00
Ken Krugler e42c268e03 Add news about project moving to GitHub 2015-04-22 07:59:43 -07:00
Ken Krugler ee23e1fb0d Update CHANGES.txt 2015-04-22 07:31:20 -07:00
Ken Krugler 798dc59839 Update CHANGES.txt 2015-04-22 07:30:57 -07:00
Ken Krugler 53375168ab Update CHANGES.txt 2015-04-22 07:30:25 -07:00
Julien Nioche 1647b90f7d Upgraded to Tika 1.8 fixes #71 2015-04-22 13:19:02 +01:00
Julien Nioche a28a78c942 changed groupId to crawler-commons 2015-04-22 11:02:49 +01:00
Julien Nioche 2195fb7f7e replaced references to code.google with github equivalents 2015-04-17 10:25:52 +01:00
Julien Nioche 76bc563b21 Merge pull request #72 from crawler-commons/mimetypeRegistry
Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry
2015-04-13 20:42:52 +01:00
Julien Nioche dafbbdd2bf SiteMapParser use UPPERCASE for static finals 2015-04-13 20:41:47 +01:00
Julien Nioche 152a2446c2 Get mediaTypeRegistry with MediaTypeRegistry.getDefaultRegistry; instantiate Tika from the start 2015-04-13 16:36:39 +01:00
Avi Hayun e1c7955389 Updated links to point to GitHub
I have changed several links to point to GitHub instead of GoogleCode.

For now, I am pointing all releases of CHANGES.TXT to the latest at Master, as we don't have the code of the last releases, and actually the CHANGES.TXT file does contain all changes of all releases, so it seems like a good compromise for now...
2015-04-13 12:54:06 +03:00
Avi Hayun 84cc3bf2d0 First stab at recreating front page like the original 2015-04-13 12:46:57 +03:00
Ken Krugler a7ee1bfa84 Delete README 2015-04-09 08:29:18 -07:00
kkrugler_lists@transpac.com 10c4dfbd99 Update with info about issues 67 & 68 2015-03-27 22:12:14 +00:00
kkrugler_lists@transpac.com 7f7f915b0b Issue 68: Case-sensitive path matching 2015-03-27 22:09:32 +00:00
kkrugler_lists@transpac.com 6eb1459345 Issue 67: time in lastMod string not extracted during parse 2015-03-27 21:15:43 +00:00
kkrugler_lists@transpac.com b2a92ce442 Add note about issue 59 being fixed 2015-01-26 13:21:29 +00:00