
Migrating wiki contents from Google Code

Google Code Exporter 2015-04-09 09:08:40 -04:00
commit 513f3c2783
4 changed files with 287 additions and 0 deletions

ApacheCon2009Meetup.md Normal file

@@ -0,0 +1,100 @@
# Introduction #
We had a "Web Crawler Developer" MeetUp at this year's [ApacheCon US](http://www.us.apachecon.com/c/acus2009/) in Oakland.
It wound up being an UnMeetUp (MeetDown?) on Wednesday, November 4th from 11am - 1pm.
# Details #
## Attendees ##
* Andrzej Bialecki - Apache Nutch
* Thorsten Scherler (via Skype) - Apache Droids
* Michael Stack - Formerly with Heritrix, now HBase
* Ken Krugler - Bixo
## Topics ##
### Roadmaps ###
* Nutch - become more component based.
* Droids - get more people involved.
### Sharable Components ###
* robots.txt parsing
* URL normalization
* URL filtering
* Page cleansing
* General purpose
* Specialized
* Sub-page parsing (portlets)
* AJAX-ish page interactions
* Document parsing (via Tika)
* HttpClient (configuration)
* Text similarity
* Mime/charset/language detection
### Tika ###
* Needs help to become really usable
* Would benefit from large test corpus
* Could do comparison with Nutch parser
* Needs option for direct DOM querying (screen scraping tasks)
* Handles mime & charset detection now (some issues)
* Could be extended to include language detection (wrap other impl)
### URL Normalization ###
* Includes the domain (www.x.com == x.com), path, and query portions of the URL
* Often site-specific rules
* Option to derive rules automatically from URLs that point to similar documents.
### AJAX-ish Page Interaction ###
* Not applicable for broad/general crawling
* Can be very important for specific web sites
* Use Selenium or headless Mozilla
### Component API Issues ###
* Want to avoid using an API that's tied too closely to any implementation.
* One option is to have simple (e.g. URL param) API that takes meta-data.
* Similar to Tika passing in of meta-data.
### Hosting Options ###
* As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
* As part of Droids - but Droids is both a framework (queue-based) and set of components.
* New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
* Google code - seems like a good short-term solution, to judge level of interest and help shake out issues.
## Next Steps ##
* Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
* Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up.
* Make decision about build system (and then move on to code formatting debate :))
* I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
* Start contributing code
* Ken will put in robots.txt parser.
---
## Original Discussion Topic List ##
Below are some potential topics for discussion - feel free to add/comment.
* Potential synergies between crawler projects - e.g. sharing robots.txt processing code.
* How to avoid end-user abuse - webmasters sometimes block crawlers because users configure it to be impolite.
* Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly.
* robots.txt processing - current problems with existing implementations
* Avoiding crawler traps - link farms, honeypots, etc.
* Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
* Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?)
* Testing challenges - is it possible to unit test a crawler?
* _(ab) in one project I used a primitive proxy server that would serve a corpus downloaded in advance and stored in a DB (well, a Nutch "segment" ;) ) together with protocol-level headers. Such a corpus + proxy would mimic a small-world Web. The proxy could also introduce bandwidth / latency limits. I'll investigate the licensing of that code, and hopefully we could add it here._
* Fuzzy classification - mime-type, charset, language.
* The future of Nutch, Droids, Heritrix, Bixo, etc.
* Optimizing for types of crawling - intranet, focused, whole web.


@@ -0,0 +1,15 @@
# Introduction #
As Crawler-Commons relies upon Apache Maven for its build lifecycle, it is really easy to create a project which you can import into Eclipse.
# Details #
Just do the following.
```
% mvn eclipse:eclipse
```
This will generate the `.project` file and other required files.
You can then open Eclipse and go to File > Import > Existing (Maven) Projects into Workspace, then select path/to/Crawler-Commons.
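If you also want the sources and Javadoc of dependencies attached inside Eclipse, the maven-eclipse-plugin supports optional download flags; this variant is purely optional, and the plain command above is all that is required:
```
% mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
```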

ProjectHome.md Normal file

@@ -0,0 +1,60 @@
# Overview #
crawler-commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.
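To make "reusable components" concrete, here is a minimal sketch of robots.txt parsing with `SimpleRobotRulesParser`. The class and method names follow the 0.x Javadocs as we recall them; verify the exact signatures against the version you use.
```
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsExample {
    public static void main(String[] args) {
        // Raw robots.txt content, e.g. as fetched by your crawler
        byte[] content = "User-agent: *\nDisallow: /private/\n".getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt", // URL the rules were fetched from
                content,                             // raw robots.txt bytes
                "text/plain",                        // content type reported by the server
                "mycrawler");                        // our crawler's agent name

        // Check whether a URL may be fetched, and honour any Crawl-delay
        System.out.println(rules.isAllowed("http://www.example.com/private/x.html"));
        System.out.println(rules.getCrawlDelay());
    }
}
```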
# Crawler-Commons News #
## 15th October 2014 - crawler-commons 0.5 is released ##
We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and also upgrades to [Apache Tika 1.6](http://tika.apache.org).
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.5/CHANGES.txt) file included with the release for a full list of details. Additionally the Java documentation can be found [here](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html).
We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).
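Since Sitemap parsing is the focus of this release, a minimal usage sketch may help. Class and method names follow the 0.5 Javadocs as we understand them, and the example.com URL is just a placeholder:
```
import java.net.URL;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapExample {
    public static void main(String[] args) throws Exception {
        SiteMapParser parser = new SiteMapParser();
        // parseSiteMap(URL) fetches and parses the sitemap in one step
        AbstractSiteMap sm = parser.parseSiteMap(new URL("http://www.example.com/sitemap.xml"));
        if (!sm.isIndex()) {
            for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
                System.out.println(u.getUrl());
            }
        }
    }
}
```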
## 11th April 2014 - crawler-commons 0.4 is released ##
We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing, and an upgrade of httpclient to v4.2.6.
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt) file included with the release for a full list of details.
We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).
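For Maven users, upgrading amounts to a single dependency entry; the coordinates below come from the Maven Central link above (shown for 0.4, substitute the version you want):
```
<dependency>
  <groupId>com.google.code.crawler-commons</groupId>
  <artifactId>crawler-commons</artifactId>
  <version>0.4</version>
</dependency>
```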
## 11 Oct 2013 - crawler-commons 0.3 is released ##
This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.3/CHANGES.txt) file included with the release for a full list of details.
## 24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing ##
Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See [Apache Nutch v1.7 Released](http://nutch.apache.org/#24th+June+2013+-+Apache+Nutch+v1.7+Released) for more details.
## 08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing ##
See [Apache Nutch v2.2 Released](http://nutch.apache.org/#08+June+2013+-+Apache+Nutch+v2.2+Released) for more details.
## 02 Feb 2013 - crawler-commons 0.2 is released ##
This release improves robots.txt and sitemap parsing support.
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.2/CHANGES.txt) file included with the release for a full list of details.
# User Documentation #
## Javadocs ##
* [0.5](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html)
* [0.4](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.4/index.html)
* [0.3](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.3/index.html)
* [0.2](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html)
# Mailing List #
There is a mailing list on [Google Groups](https://groups.google.com/forum/?fromgroups#!forum/crawler-commons).
# Issue Tracking #
If you find an issue, please file a report [here](http://code.google.com/p/crawler-commons/issues/list).

ReleaseProcedure.md Normal file

@@ -0,0 +1,112 @@
# Introduction #
This page describes how to do a release of crawler-commons. As we are using
pure Maven, it is drop-dead easy :0). So hold on to your hats, here we go!
# Details #
First, check out a fresh copy of the **ENTIRE** Crawler-Commons SVN repository.
It is pretty small (< 1/2 GB), and this saves time later on. The following command can be used:
> % `svn checkout https://crawler-commons.googlecode.com/svn/ crawler-commons --username ${user.name}`
You may also be prompted for your password here.
**Non-core Documentation is Accurate**
* `% cd trunk`
Please ensure that the details in `CHANGES.txt` are accurate and that the
licensing information is correct.
**Generate Release Javadoc**
In all of the steps below, replace `X.Y` with the actual release number.
* `% gedit pom.xml`
* change project version from `X.Y-SNAPSHOT` to `X.Y`
* `% mvn javadoc:javadoc`
* You may need to create the new `javadoc/X.Y` directory
* `% cp -r target/site/apidocs/ ../wiki/javadoc/X.Y/`
* `% svn add ../wiki/javadoc/X.Y/`
* `% svn revert pom.xml`
* `% mvn clean`
**Prepare the Release Artifacts (Dry-Run)**
* `% rm -rf ~/.m2/repository/com/google/code/crawler-commons/`
* `% mvn release:clean release:prepare -DdryRun=true`
Executing the above makes life a breeze for us. It does the following:
* cleans out all previous release-related module directories
* prepares us for a release, stating that this is a dryRun of the actual release. This gives us the opportunity to review the artifacts.
At this stage it is imperative to check the resulting artifacts, as we wish to iron out any discrepancies before the real release.
**Commit to SVN & Tag**
Before we commit the tag and set ourselves up for the next development drive, we want to commit the new Javadocs to Subversion.
* `% cd ../`
* You may wish to execute `svn status` to ensure that all the files you intend to commit are being committed
* `% find wiki/javadoc -name "*.html" | xargs -I filename svn propset svn:mime-type text/html filename`
* `% find wiki/javadoc -name "*.gif" | xargs -I filename svn propset svn:mime-type image/gif filename`
* `% find wiki/javadoc -name "*.css" | xargs -I filename svn propset svn:mime-type text/css filename`
* `% svn ci -m "X.Y Release Javadoc"`
* `% cd trunk`
* `% mvn release:clean release:prepare`
N.B. If the final command fails, this may be due to non-interactive mode being activated in your local SVN client.
This can easily be overcome by explicitly passing the `-Dusername=${username} -Dpassword=${password}` arguments
to the command. Your username and password can be located within your GoogleCode profile.
This will create and commit the release tag and bump the development version in the pom.xml file.
**Build & Deploy to Sonatype**
This will build the pom, jar, javadoc, signatures and sources, and push them to the Sonatype staging repository
* `% mvn release:perform`
If this command fails, ensure that you have the Sonatype server configuration within your `~/.m2/settings.xml`, as follows:
```
<settings>
  <servers>
    ...
    <server>
      <id>sonatype-nexus-staging</id>
      <username>${nexus_username}</username>
      <password>${nexus_password}</password>
    </server>
    ...
  </servers>
</settings>
```
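As an optional aside: if you would rather not keep the Nexus password in plain text, Maven's built-in password encryption (the settings-security mechanism) can generate encrypted values to paste into `settings.xml`:
```
% mvn --encrypt-master-password
% mvn --encrypt-password
```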
**Close the Sonatype Staging Repository**
* Browse to https://oss.sonatype.org/index.html and log in.
* Navigate to the Staging Repositories side tab and locate the staging release.
* Close the repository so that others can view and review it.
**Hold a Community VOTE**
* Head over to the Crawler Commons mailing list and create a thread which details the tag and the staged release.
* Collect votes, and give it time to bake.
* If all is good, head back to the staging repository and Release the artifacts.
**Update the Javadoc link on the project main page**
* Click on the Administer link
* Update the **User Documentation** section to link to the new Javadoc index.html file, e.g. `* [http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html Javadoc]`
**Publicize**
Post to the crawler-commons list, as well as the Nutch and Bixo lists, LinkedIn, Twitter, etc.
**Additional Information**
It is common for developers unfamiliar with staging, snapshot and release repositories to encounter difficulties when attempting to release artifacts. Much more information on the OSS Sonatype platform can be found [here](https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide).