
Migrating wiki contents from Google Code

Google Code Exporter 2015-04-09 09:08:40 -04:00
commit 513f3c2783
4 changed files with 287 additions and 0 deletions

ApacheCon2009Meetup.md Normal file

@@ -0,0 +1,100 @@
# Introduction #
We had a "Web Crawler Developer" MeetUp at this year's [ApacheCon US](http://www.us.apachecon.com/c/acus2009/) in Oakland.
It wound up being an UnMeetUp (MeetDown?) on Wednesday, November 4th from 11am - 1pm.
# Details #
## Attendees ##
* Andrzej Bialecki - Apache Nutch
* Thorsten Scherler (via Skype) - Apache Droids
* Michael Stack - Formerly with Heritrix, now HBase
* Ken Krugler - Bixo
## Topics ##
### Roadmaps ###
* Nutch - become more component based.
* Droids - get more people involved.
### Sharable Components ###
* robots.txt parsing
* URL normalization
* URL filtering
* Page cleansing
* General purpose
* Specialized
* Sub-page parsing (portlets)
* AJAX-ish page interactions
* Document parsing (via Tika)
* HttpClient (configuration)
* Text similarity
* Mime/charset/language detection
### Tika ###
* Needs help to become really usable
* Would benefit from large test corpus
* Could do comparison with Nutch parser
* Needs option for direct DOM querying (screen scraping tasks)
* Handles mime & charset detection now (some issues)
* Could be extended to include language detection (wrap other impl)
### URL Normalization ###
* Includes the domain (www.x.com == x.com), path, and query portions of the URL
* Often site-specific rules
* Option to derive rules automatically from URLs that point to similar documents.
### AJAX-ish Page Interaction ###
* Not applicable for broad/general crawling
* Can be very important for specific web sites
* Use Selenium or headless Mozilla
### Component API Issues ###
* Want to avoid using an API that's tied too closely to any implementation.
* One option is to have simple (e.g. URL param) API that takes meta-data.
* Similar to Tika passing in of meta-data.
### Hosting Options ###
* As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
* As part of Droids - but Droids is both a framework (queue-based) and set of components.
* New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
* Google code - seems like a good short-term solution, to judge level of interest and help shake out issues.
## Next Steps ##
* Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
* Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up.
* Make decision about build system (and then move on to code formatting debate :))
* I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
* Start contributing code
* Ken will put in robots.txt parser.
---
## Original Discussion Topic List ##
Below are some potential topics for discussion - feel free to add/comment.
* Potential synergies between crawler projects - e.g. sharing robots.txt processing code.
* How to avoid end-user abuse - webmasters sometimes block crawlers because users configure it to be impolite.
* Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly.
* robots.txt processing - current problems with existing implementations
* Avoiding crawler traps - link farms, honeypots, etc.
* Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
* Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?)
* Testing challenges - is it possible to unit test a crawler?
* _(ab) in one project I used a primitive proxy server that would serve a corpus downloaded in advance and stored in a DB (well, a Nutch "segment" ;) ) together with protocol-level headers. Such a corpus + proxy would mimic a small-world Web. The proxy could also introduce bandwidth / latency limits. I'll investigate the licensing of that code, and hopefully we could add it here._
* Fuzzy classification - mime-type, charset, language.
* The future of Nutch, Droids, Heritrix, Bixo, etc.
* Optimizing for types of crawling - intranet, focused, whole web.


@@ -0,0 +1,15 @@
# Introduction #
As Crawler-Commons relies upon Apache Maven for its build lifecycle, it is really easy to create a project which you can import into Eclipse.
# Details #
Just do the following.
```
% mvn eclipse:eclipse
```
This will generate the `.project` file and other required files.
You can then open Eclipse and go to File > Import > Existing (Maven) Projects into Workspace, then select path/to/Crawler-Commons.
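If you also want the sources and Javadoc of dependencies attached inside Eclipse, the maven-eclipse-plugin supports optional download flags; this variant is purely optional, and the plain command above is all that is required:
```
% mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
```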

ProjectHome.md Normal file

@@ -0,0 +1,60 @@
# Overview #
crawler-commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.
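To make "reusable components" concrete, here is a minimal sketch of robots.txt parsing with `SimpleRobotRulesParser`. The class and method names follow the 0.x Javadocs as we recall them; verify the exact signatures against the version you use.
```
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsExample {
    public static void main(String[] args) {
        // Raw robots.txt content, e.g. as fetched by your crawler
        byte[] content = "User-agent: *\nDisallow: /private/\n".getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt", // URL the rules were fetched from
                content,                             // raw robots.txt bytes
                "text/plain",                        // content type reported by the server
                "mycrawler");                        // our crawler's agent name

        // Check whether a URL may be fetched, and honour any Crawl-delay
        System.out.println(rules.isAllowed("http://www.example.com/private/x.html"));
        System.out.println(rules.getCrawlDelay());
    }
}
```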
# Crawler-Commons News #
## 15th October 2014 - crawler-commons 0.5 is released ##
We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and also upgrades to [Apache Tika 1.6](http://tika.apache.org).
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.5/CHANGES.txt) file included with the release for a full list of details. Additionally the Java documentation can be found [here](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html).
We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).
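Since Sitemap parsing is the focus of this release, a minimal usage sketch may help. Class and method names follow the 0.5 Javadocs as we understand them, and the example.com URL is just a placeholder:
```
import java.net.URL;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapExample {
    public static void main(String[] args) throws Exception {
        SiteMapParser parser = new SiteMapParser();
        // parseSiteMap(URL) fetches and parses the sitemap in one step
        AbstractSiteMap sm = parser.parseSiteMap(new URL("http://www.example.com/sitemap.xml"));
        if (!sm.isIndex()) {
            for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
                System.out.println(u.getUrl());
            }
        }
    }
}
```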
## 11th April 2014 - crawler-commons 0.4 is released ##
We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing, and an upgrade of httpclient to v4.2.6.
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt) file included with the release for a full list of details.
We suggest that all users upgrade to this version. Details of how to do so can be found on [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.code.crawler-commons%22%20AND%20a%3A%22crawler-commons%22).
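For Maven users, upgrading amounts to a single dependency entry; the coordinates below come from the Maven Central link above (shown for 0.4, substitute the version you want):
```
<dependency>
  <groupId>com.google.code.crawler-commons</groupId>
  <artifactId>crawler-commons</artifactId>
  <version>0.4</version>
</dependency>
```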
## 11 Oct 2013 - crawler-commons 0.3 is released ##
This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.3/CHANGES.txt) file included with the release for a full list of details.
## 24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing ##
Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See [Apache Nutch v1.7 Released](http://nutch.apache.org/#24th+June+2013+-+Apache+Nutch+v1.7+Released) for more details.
## 08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing ##
See [Apache Nutch v2.2 Released](http://nutch.apache.org/#08+June+2013+-+Apache+Nutch+v2.2+Released) for more details.
## 02 Feb 2013 - crawler-commons 0.2 is released ##
This release improves robots.txt and sitemap parsing support.
See the [CHANGES.txt](http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.2/CHANGES.txt) file included with the release for a full list of details.
# User Documentation #
## Javadocs ##
* [0.5](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.5/index.html)
* [0.4](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.4/index.html)
* [0.3](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.3/index.html)
* [0.2](http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html)
# Mailing List #
There is a mailing list on [Google Groups](https://groups.google.com/forum/?fromgroups#!forum/crawler-commons).
# Issue Tracking #
If you find an issue, please file a report [here](http://code.google.com/p/crawler-commons/issues/list).

ReleaseProcedure.md Normal file

@@ -0,0 +1,112 @@
# Introduction #
This page describes how to do a release of crawler-commons. As we are using
pure Maven, it is drop-dead easy :0). So hold on to your hats, here we go!
# Details #
First, check out a fresh copy of the **ENTIRE** Crawler-Commons SVN repository.
It is pretty small (< 1/2 GB), and this saves time later on. The following command can be used:
> % `svn checkout https://crawler-commons.googlecode.com/svn/ crawler-commons --username ${user.name}`
You may also be prompted for your password here.
**Non-core Documentation is Accurate**
* `% cd trunk`
Please ensure that the details in `CHANGES.txt` are accurate and that the
licensing information is correct.
**Generate Release Javadoc**
In all of the steps below, replace `X.Y` with the actual release number.
* `% gedit pom.xml`
* change project version from `X.Y-SNAPSHOT` to `X.Y`
* `% mvn javadoc:javadoc`
* You may need to create the new `javadoc/X.Y` directory
* `% cp -r target/site/apidocs/ ../wiki/javadoc/X.Y/`
* `% svn add ../wiki/javadoc/X.Y/`
* `% svn revert pom.xml`
* `% mvn clean`
**Prepare the Release Artifacts (Dry-Run)**
* `% rm -rf ~/.m2/repository/com/google/code/crawler-commons/`
* `% mvn release:clean release:prepare -DdryRun=true`
Executing the above makes life a breeze for us. It does the following:
* cleans out all previous release-related module directories
* prepares us for a release, stating that this is a dryRun of the actual release. This gives us the opportunity to review the artifacts.
At this stage it is imperative to check the resulting artifacts, as we wish to iron out any discrepancies before the real release.
**Commit to SVN & Tag**
Before we commit the tag and set ourselves up for the next development drive, we want to commit the new Javadocs to Subversion.
* `% cd ../`
* You may wish to execute `svn status` to ensure that all the files you intend to commit are being committed
* `% find wiki/javadoc -name "*.html" | xargs -I filename svn propset svn:mime-type text/html filename`
* `% find wiki/javadoc -name "*.gif" | xargs -I filename svn propset svn:mime-type image/gif filename`
* `% find wiki/javadoc -name "*.css" | xargs -I filename svn propset svn:mime-type text/css filename`
* `% svn ci -m "X.Y Release Javadoc"`
* `% cd trunk`
* `% mvn release:clean release:prepare`
N.B. If the final command fails, this may be due to non-interactive mode being activated in your local SVN client.
This can easily be overcome by explicitly passing the `-Dusername=${username} -Dpassword=${password}` arguments
to the command. Your username and password can be located within your GoogleCode profile.
This will create and commit the release tag and bump the development version in the pom.xml file.
**Build & Deploy to Sonatype**
This will build the pom, jar, javadoc, signatures and sources, and push them to the Sonatype staging repository
* `% mvn release:perform`
If this command fails, ensure that you have the Sonatype server configuration within your `~/.m2/settings.xml`, as follows:
```
<settings>
  <servers>
    ...
    <server>
      <id>sonatype-nexus-staging</id>
      <username>${nexus_username}</username>
      <password>${nexus_password}</password>
    </server>
    ...
  </servers>
</settings>
```
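As an optional aside: if you would rather not keep the Nexus password in plain text, Maven's built-in password encryption (the settings-security mechanism) can generate encrypted values to paste into `settings.xml`:
```
% mvn --encrypt-master-password
% mvn --encrypt-password
```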
**Close the Sonatype Staging Repository**
* Browse to https://oss.sonatype.org/index.html and log in.
* Navigate to the Staging Repositories side tab and locate the staging release.
* Close the repository so that others can view and review it.
**Hold a Community VOTE**
* Head over to the Crawler Commons mailing list and create a thread which details the tag and the staged release.
* Collect votes, and give it time to bake.
* If all is good, head back to the staging repository and Release the artifacts.
**Update the Javadoc link on the project main page**
* Click on the Administer link
* Update the **User Documentation** section to link to the new Javadoc index.html file, e.g. `* [http://crawler-commons.googlecode.com/svn/wiki/javadoc/0.2/index.html Javadoc]`
**Publicize**
Post to the crawler-commons list, as well as the Nutch and Bixo lists, LinkedIn, Twitter, etc.
**Additional Information**
It is common for developers unfamiliar with staging, snapshot and release repositories to encounter difficulties when attempting to release artifacts. Much more information on the OSS Sonatype platform can be found [here](https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide).