mirror of
https://github.com/crawler-commons/crawler-commons
synced 2024-05-06 15:06:03 +02:00
Migrating wiki contents from Google Code
commit 513f3c2783

---
# Introduction #

We had a "Web Crawler Developer" MeetUp at this year's [ApacheCon US](http://www.us.apachecon.com/c/acus2009/) in Oakland.

It wound up being an UnMeetUp (MeetDown?) on Wednesday, November 4th from 11am - 1pm.

# Details #

## Attendees ##

* Andrzej Bialecki - Apache Nutch
* Thorsten Scherler (via Skype) - Apache Droids
* Michael Stack - formerly with Heritrix, now HBase
* Ken Krugler - Bixo

## Topics ##

### Roadmaps ###

* Nutch - become more component-based.
* Droids - get more people involved.

### Sharable Components ###

* robots.txt parsing
* URL normalization
* URL filtering
* Page cleansing
  * General purpose
  * Specialized
* Sub-page parsing (portlets)
* AJAX-ish page interactions
* Document parsing (via Tika)
* HttpClient (configuration)
* Text similarity
* Mime/charset/language detection
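
To illustrate the kind of work a shared robots.txt component takes on, here is a minimal prefix-matching sketch in Java. It uses only the standard library, ignores wildcards, `Allow` precedence, and multi-agent groups, and is purely illustrative - not any project's actual parser:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt prefix matcher (illustrative only; ignores wildcards,
// Allow precedence, and multi-agent record groups).
class SimpleRobotsCheck {
    private final List<String> disallowed = new ArrayList<>();

    // Collect Disallow rules from groups matching the given agent (or "*").
    SimpleRobotsCheck(String robotsTxt, String agent) {
        boolean inMatchingGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.split("#", 2)[0].trim();  // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inMatchingGroup = value.equals("*")
                        || value.equalsIgnoreCase(agent);
            } else if (inMatchingGroup && field.equals("disallow")
                    && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    // A path is allowed unless it starts with a disallowed prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

Even this toy version shows why sharing pays off: comment stripping, agent matching, and rule precedence are easy to get subtly wrong, and every crawler needs them.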

### Tika ###

* Needs help to become really usable
* Would benefit from a large test corpus
* Could do a comparison with the Nutch parser
* Needs an option for direct DOM querying (screen scraping tasks)
* Handles mime & charset detection now (some issues)
* Could be extended to include language detection (wrap another impl)

### URL Normalization ###

* Covers the domain (www.x.com == x.com), path, and query portions of a URL
* Often involves site-specific rules
* Option to derive rules using URLs of similar documents.
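
A sketch of the generic part of this, using only `java.net.URI`. Which transformations are safe (e.g. treating `www.x.com` as `x.com`) is exactly the site-specific question noted above, so this illustrative version applies only the uncontroversial ones - lowercasing, dot-segment resolution, default-port and fragment removal:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative URL normalizer: lowercases scheme/host, resolves "." and ".."
// path segments, and drops default ports and fragments. Site-specific rules
// would be layered on top of this.
class UrlNormalizer {
    static String normalize(String url) {
        URI u;
        try {
            u = new URI(url).normalize();  // resolves dot-segments
        } catch (URISyntaxException e) {
            return url;  // leave unparseable URLs untouched
        }
        String scheme = u.getScheme() == null ? "http" : u.getScheme().toLowerCase();
        String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
        int port = u.getPort();
        // Drop the port when it is the scheme's default.
        if ((port == 80 && scheme.equals("http"))
                || (port == 443 && scheme.equals("https"))) {
            port = -1;
        }
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (port != -1) sb.append(':').append(port);
        sb.append(path);
        if (u.getQuery() != null) sb.append('?').append(u.getQuery());
        return sb.toString();  // fragment intentionally discarded
    }
}
```

For example, `UrlNormalizer.normalize("HTTP://WWW.X.COM:80/a/../b#frag")` yields `http://www.x.com/b`.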

### AJAX-ish Page Interaction ###

* Not applicable for broad/general crawling
* Can be very important for specific web sites
* Use Selenium or a headless Mozilla

### Component API Issues ###

* Want to avoid an API that's tied too closely to any one implementation.
* One option is a simple (e.g. URL param) API that takes meta-data.
* Similar to Tika's passing in of meta-data.
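
One way to picture that metadata-driven API - a hypothetical interface for illustration, not anything the projects agreed on:

```java
import java.util.Map;

// Hypothetical component contract: implementations receive the value to
// process plus a free-form metadata map, similar to how Tika passes
// metadata alongside the document stream. The interface stays decoupled
// from any one implementation's configuration objects.
interface CrawlerComponent {
    String process(String input, Map<String, String> metadata);
}

// Example implementation: a trivial transformer whose behavior is toggled
// through metadata rather than an implementation-specific config class.
class LowercaseComponent implements CrawlerComponent {
    public String process(String input, Map<String, String> metadata) {
        if ("true".equals(metadata.get("lowercase"))) {
            return input.toLowerCase();
        }
        return input;
    }
}
```

The appeal is that callers depend only on the small interface and a string map, so components can be swapped between crawlers without dragging their host project's types along.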

### Hosting Options ###

* As part of Nutch - but easy to get lost in the Nutch codebase, and could be associated too closely with Nutch.
* As part of Droids - but Droids is both a framework (queue-based) and a set of components.
* New sub-project under the Lucene TLP - but there's overhead to set it up and maintain it, and then confusion between it and Droids.
* Google Code - seems like a good short-term solution, to judge the level of interest and help shake out issues.

## Next Steps ##

* Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
* Get input from Thorsten on the Google Code option. If OK as a starting point, then Andrzej to set up.
* Make a decision about the build system (and then move on to the code formatting debate :))
  * I'm going to propose Ant + the Maven Ant Tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
* Start contributing code
  * Ken will put in the robots.txt parser.

---

## Original Discussion Topic List ##

Below are some potential topics for discussion - feel free to add/comment.

* Potential synergies between crawler projects - e.g. sharing robots.txt processing code.
* How to avoid end-user abuse - webmasters sometimes block a crawler because users configure it to be impolite.
* Politeness vs. efficiency - various options for how to be considered polite while still crawling quickly.
* robots.txt processing - current problems with existing implementations
* Avoiding crawler traps - link farms, honeypots, etc.
* Parsing content - home-grown, Neko/TagSoup, Tika, screen scraping
* Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?)
* Testing challenges - is it possible to unit test a crawler?
  * _(ab) In one project I used a primitive proxy server that would serve a corpus downloaded in advance and stored in a DB (well, a Nutch "segment" ;) ) together with protocol-level headers. Such a corpus + proxy would mimic a small-world Web. The proxy could also introduce bandwidth / latency limits. I'll investigate the licensing of that code, and hopefully we could add it here._
* Fuzzy classification - mime-type, charset, language.
* The future of Nutch, Droids, Heritrix, Bixo, etc.
* Optimizing for types of crawling - intranet, focused, whole web.
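
The politeness-vs-efficiency topic in the list above often reduces to enforcing a per-host minimum delay between fetches. A minimal sketch (class name and API are illustrative, not a recommendation for a particular design):

```java
import java.util.HashMap;
import java.util.Map;

// Per-host politeness gate: fetches to the same host must be at least
// minDelayMs apart; fetches to different hosts proceed independently,
// which is where the efficiency comes from.
class PolitenessGate {
    private final long minDelayMs;
    private final Map<String, Long> lastFetch = new HashMap<>();

    PolitenessGate(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    // Returns true (and records the fetch time) if the host may be
    // fetched now; false means the caller should requeue and try later.
    synchronized boolean tryAcquire(String host, long nowMs) {
        Long last = lastFetch.get(host);
        if (last != null && nowMs - last < minDelayMs) {
            return false;  // too soon: stay polite
        }
        lastFetch.put(host, nowMs);
        return true;
    }
}
```

A real crawler would combine this with robots.txt `Crawl-delay` hints and a per-host fetch queue, but the core trade-off is already visible here: the larger the delay, the politer (and slower) the crawl.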

---
# Introduction #

As Crawler-Commons relies on Apache Maven for its build lifecycle, it is really easy to create an Eclipse project which you can then import into Eclipse.

# Details #

Just do the following:

```
% mvn eclipse:eclipse
```

This will generate the `.project` file and the other required files.

You can then open Eclipse and go to File > Import > Existing (Maven) Projects into Workspace > path/to/Crawler-Commons

---
# Overview #

crawler-commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

# Crawler-Commons News #

## 15th October 2014 - crawler-commons 0.5 is released ##

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves sitemap parsing, and also upgrades to [Apache Tika 1.6](http://tika.apache.org).

See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.5/CHANGES.txt) file included with the release for a full list of details. Additionally, the Java documentation can be found [here](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html).

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).

## 11th April 2014 - crawler-commons 0.4 is released ##

We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing, and an upgrade of httpclient to v4.2.6.

See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt) file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).

## 11 Oct 2013 - crawler-commons 0.3 is released ##

This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.

See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.3/CHANGES.txt) file included with the release for a full list of details.

## 24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing ##

Similar to the note below about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See [Apache Nutch v1.7 Released](http://nutch.apache.org/#24th+June+2013+-+Apache+Nutch+v1.7+Released) for more details.

## 08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing ##

See [Apache Nutch v2.2 Released](http://nutch.apache.org/#08+June+2013+-+Apache+Nutch+v2.2+Released) for more details.

## 02 Feb 2013 - crawler-commons 0.2 is released ##

This release improves robots.txt and sitemap parsing support.

See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.2/CHANGES.txt) file included with the release for a full list of details.

# User Documentation #

## Javadocs ##

* [0.5](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html)
* [0.4](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.4/index.html)
* [0.3](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.3/index.html)
* [0.2](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html)

# Mailing List #

There is a mailing list on [Google Groups](https://groups.google.com/forum/?fromgroups#!forum/crawler-commons).

# Issue Tracking #

If you find an issue, please file a report [here](http://code.google.com/p/crawler-commons/issues/list).

---
# Introduction #

This page describes how to do a release of crawler-commons. As we are using pure Maven, it is drop-dead easy :o). So hold on to your hats, here we go!

# Details #

First, check out a fresh copy of the **ENTIRE** Crawler-Commons SVN repository.

It is pretty small (< 1/2 GB) and this saves time later on. The following command can be used:

> % `svn checkout https://crawler-commons.googlecode.com/svn/ crawler-commons --username ${user.name}`

You may also be prompted for your password here.

**Ensure Non-core Documentation is Accurate**

* `% cd trunk`

Please ensure that the details in `CHANGES.txt` are accurate and that the licensing information etc. is up to date.

**Generate Release Javadoc**

In all of the steps below, replace `X.Y` with the actual release number.

* `% gedit pom.xml`
  * change the project version from `X.Y-SNAPSHOT` to `X.Y`
* `% mvn javadoc:javadoc`
* You may need to create the new `javadoc/X.Y` directory
* `% cp -r target/site/apidocs/ ../wiki/javadoc/X.Y/`
* `% svn add ../wiki/javadoc/X.Y/`
* `% svn revert pom.xml`
* `% mvn clean`

**Prepare the Release Artifacts (Dry Run)**

* `% rm -rf ~/.m2/repository/com/google/code/crawler-commons/`
* `% mvn release:clean release:prepare -DdryRun=true`

Executing the above makes life a breeze for us. It does the following:

* cleans out all previous release-related module directories
* prepares us for a release, stating that this is a dry run of the actual release. This gives us the opportunity to review the artifacts.

At this stage it is imperative to check the resulting artifacts, as we wish to iron out any discrepancies now.

**Commit to SVN & Tag**

Before we commit the tag and set ourselves up for the next development drive, we want to commit the new Javadocs to Subversion.

* `% cd ../`
* You may wish to execute `svn status` to ensure that only the files you intend to commit are being committed
* `% find wiki/javadoc -name "*.html" | xargs -I filename svn propset svn:mime-type text/html filename`
* `% find wiki/javadoc -name "*.gif" | xargs -I filename svn propset svn:mime-type image/gif filename`
* `% find wiki/javadoc -name "*.css" | xargs -I filename svn propset svn:mime-type text/css filename`
* `% svn ci -m "X.Y Release Javadoc"`
* `% cd trunk`
* `% mvn release:clean release:prepare`

N.B. If the final command fails, this may be due to non-interactive mode being activated in your local SVN client. This can easily be overcome by explicitly passing the `-Dusername=${username} -Dpassword=${password}` arguments to the command. Your username and password can be located within your GoogleCode profile.

This will create and commit the release tag and bump the development version in the `pom.xml` file.

**Build & Deploy to Sonatype**

This will build the pom, jar, javadoc, signatures and sources, and push them to the Sonatype staging repository:

* `% mvn release:perform`

If this command fails, ensure that you have the Sonatype server configuration within your `~/.m2/settings.xml`, as follows:

```
<settings>
  <servers>
    ...
    <server>
      <id>sonatype-nexus-staging</id>
      <username>${nexus_username}</username>
      <password>${nexus_password}</password>
    </server>
    ...
  </servers>
</settings>
```

**Close the Sonatype Staging Repository**

* Browse to https://oss.sonatype.org/index.html and log in.
* Navigate to the Staging Repositories side tab and locate the staging release.
* Close the repository so that others can view and review it.

**Hold a Community VOTE**

* Head over to the Crawler Commons mailing list and create a thread which details the tag and the staging release.
* Collect votes; give it time to bake.
* If all is good, head back to the staging repository and Release the artifacts.

**Update the Javadoc link on the project main page**

* Click on the Administer link
* Update the *User Documentation* section to link to the new Javadoc `index.html` file, e.g. `* [http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html Javadoc]`

**Publicize**

Post to the crawler-commons list, as well as Nutch, Bixo, LinkedIn, Twitter etc.

**Additional Information**

It is common for developers unfamiliar with staging, snapshot and release repositories to encounter difficulties when attempting to release artifacts. Much more information on the OSS Sonatype platform can be found [here](https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide).