# Overview

crawler-commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

# Crawler-Commons News

## 22nd April 2015 - crawler-commons has moved

The crawler-commons project is now hosted at GitHub, due to the demise of Google Code hosting.

## 15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves sitemap parsing and upgrades to [Apache Tika 1.6](http://tika.apache.org).

See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.5/CHANGES.txt) file included with the release for a full list of details. Additionally, the Java documentation can be found [here](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html).

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).

## 11th April 2014 - crawler-commons 0.4 is released

We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing, and an upgrade of httpclient to v4.2.6.

See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).
## 11 Oct 2013 - crawler-commons 0.3 is released

This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.

See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.

## 24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing

Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See [Apache Nutch v1.7 Released](http://nutch.apache.org/#24th+June+2013+-+Apache+Nutch+v1.7+Released) for more details.

## 08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing

See [Apache Nutch v2.2 Released](http://nutch.apache.org/#08+June+2013+-+Apache+Nutch+v2.2+Released) for more details.

## 02 Feb 2013 - crawler-commons 0.2 is released

This release improves robots.txt and sitemap parsing support.

See the [CHANGES.txt](https://github.com/crawler-commons/crawler-commons/blob/master/CHANGES.txt) file included with the release for a full list of details.

# User Documentation

## Javadocs

* [0.5](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html)
* [0.4](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.4/index.html)
* [0.3](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.3/index.html)
* [0.2](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html)

# Mailing List

There is a mailing list on [Google Groups](https://groups.google.com/forum/?fromgroups#!forum/crawler-commons).

# Issue Tracking

If you find an issue, please file a report [here](https://github.com/crawler-commons/crawler-commons/issues).