[![Build Status](https://api.travis-ci.org/crawler-commons/crawler-commons.svg?branch=master)](https://travis-ci.org/crawler-commons/crawler-commons)
[![license](https://img.shields.io/github/license/crawler-commons/crawler-commons.svg?maxAge=2592000?style=plastic)](http://www.apache.org/licenses/LICENSE-2.0)

# Overview

crawler-commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

# User Documentation

## Javadocs

* [0.7](http://crawler-commons.github.io/crawler-commons/0.7/)
* [0.6](http://crawler-commons.github.io/crawler-commons/0.6/apidocs/)
* [0.5](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html)

# Mailing List

There is a mailing list on [Google Groups](https://groups.google.com/forum/?fromgroups#!forum/crawler-commons).

# Issue Tracking

If you find an issue, please file a report [here](https://github.com/crawler-commons/crawler-commons/issues).

# Crawler-Commons News

## 24th November 2016 - crawler-commons 0.7 released

We are glad to announce the 0.7 release of Crawler-Commons. See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.7/CHANGES.txt) file included with the release for a full list of details. The main changes are that Crawler-Commons now requires Java 8 and that the package crawlercommons.url has been replaced with crawlercommons.domains. If your project uses crawler-commons, you might want to run the following command on it:

```
find . -type f -print0 | xargs -0 sed -i 's/import crawlercommons\.url\./import crawlercommons\.domains\./'
```

Please note also that this is the last release containing the HTTP fetcher support, which is deprecated and will be removed in the next version.

Version 0.7 contains numerous improvements and bug fixes, and all users are invited to upgrade to this version.

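
For Maven-based projects, upgrading usually comes down to bumping the dependency version. The fragment below is a minimal sketch. It assumes that the com.github.crawler-commons groupId announced with the 0.6 release, together with the crawler-commons artifactId, also applies to 0.7, so please check Maven Central for the authoritative coordinates.

```xml
<!-- Assumed coordinates: groupId as announced for 0.6, version bumped to 0.7 -->
<dependency>
  <groupId>com.github.crawler-commons</groupId>
  <artifactId>crawler-commons</artifactId>
  <version>0.7</version>
</dependency>
```
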
## 11th June 2015 - crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.6/CHANGES.txt) file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#artifactdetails%7Ccom.github.crawler-commons%7Ccrawler-commons%7C0.6%7Cjar). Please note that the groupId has changed to com.github.crawler-commons.

The Java documentation can be found [here](http://crawler-commons.github.io/crawler-commons/0.6/apidocs/).

## 22nd April 2015 - crawler-commons has moved

The crawler-commons project is now hosted on GitHub, due to the demise of Google Code hosting.

## 15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and upgrades to [Apache Tika 1.6](http://tika.apache.org).

See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.5/CHANGES.txt) file included with the release for a full list of details. Additionally, the Java documentation can be found [here](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html).

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).

## 11th April 2014 - crawler-commons 0.4 is released

We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing and an upgrade of httpclient to v4.2.6.
See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).

## 11 Oct 2013 - crawler-commons 0.3 is released

This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup. See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.

## 24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing

Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See [Apache Nutch v1.7 Released](http://nutch.apache.org/#24th+June+2013+-+Apache+Nutch+v1.7+Released) for more details.

## 08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing

See [Apache Nutch v2.2 Released](http://nutch.apache.org/#08+June+2013+-+Apache+Nutch+v2.2+Released) for more details.

## 02 Feb 2013 - crawler-commons 0.2 is released

This release improves robots.txt and sitemap parsing support. See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.