mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-26 20:06:07 +02:00
# Overview
crawler-commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.
# Crawler-Commons News
## 11th June 2015 - crawler-commons 0.6 is released
We are glad to announce the 0.6 release of Crawler Commons. See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.6/CHANGES.txt) file included with the release for a full list of details.
We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#artifactdetails%7Ccom.github.crawler-commons%7Ccrawler-commons%7C0.6%7Cjar). Please note that the groupId has changed to `com.github.crawler-commons`.
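For Maven users, the coordinates in the Maven Central link above correspond to a dependency declaration along the following lines. This is a minimal sketch: the groupId, artifactId, and version are read from that link, and the fragment is assumed to sit inside the `<dependencies>` section of an existing `pom.xml`.

```xml
<!-- Coordinates taken from the Maven Central link for the 0.6 release. -->
<dependency>
  <!-- New groupId as of 0.6 (earlier releases used com.google.code.crawler-commons). -->
  <groupId>com.github.crawler-commons</groupId>
  <artifactId>crawler-commons</artifactId>
  <version>0.6</version>
</dependency>
```

Projects that still reference the old groupId need to update it when upgrading; changing only the version number will not resolve the 0.6 artifact.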
## 22nd April 2015 - crawler-commons has moved
The crawler-commons project is now hosted on GitHub, due to the demise of Google Code hosting.
## 15th October 2014 - crawler-commons 0.5 is released
We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves sitemap parsing and upgrades to [Apache Tika 1.6](http://tika.apache.org).
See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.5/CHANGES.txt) file included with the release for a full list of details. Additionally, the Java documentation can be found [here](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html).
We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).
## 11th April 2014 - crawler-commons 0.4 is released
We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing, and an upgrade of httpclient to v4.2.6.
See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.
We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).
## 11 Oct 2013 - crawler-commons 0.3 is released
This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.
See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.
## 24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing
Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See [Apache Nutch v1.7 Released](http://nutch.apache.org/#24th+June+2013+-+Apache+Nutch+v1.7+Released) for more details.
## 08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing
See [Apache Nutch v2.2 Released](http://nutch.apache.org/#08+June+2013+-+Apache+Nutch+v2.2+Released) for more details.
## 02 Feb 2013 - crawler-commons 0.2 is released
This release improves robots.txt and sitemap parsing support.
See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.
# User Documentation
## Javadocs
* [0.5](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html)
* [0.4](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.4/index.html)
* [0.3](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.3/index.html)
* [0.2](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html)
# Mailing List
There is a mailing list on [Google Groups](https://groups.google.com/forum/?fromgroups#!forum/crawler-commons).
# Issue Tracking
If you find an issue, please file a report [here](https://github.com/crawler-commons/crawler-commons/issues).