mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-12 16:36:02 +02:00
Commit Graph

608 Commits

Author SHA1 Message Date
Sebastian Nagel ce9cf46020 Update CHANGES.txt for release of crawler-commons 1.4 2023-07-13 11:29:43 +02:00
Sebastian Nagel 2b8717d9e5 [maven-release-plugin] prepare release crawler-commons-1.4 2023-07-13 10:30:08 +02:00
Sebastian Nagel a62bd80140 Updates changelog for #376, #380, #401, #414, #425, #428, #422/#424, #114/#390/#430, #245/#360 2023-07-12 16:16:30 +02:00
Sebastian Nagel 6fb34cf856
Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (#360)
- port unit tests from https://github.com/google/robotstxt
- adapt "Google-only" unit tests dealing with overlong lines
  and non-standard user-agent names
- adapt unit tests dealing with overlong lines and percent-encoded
  URL paths where the behavior of SimpleRobotRulesParser is not
  wrong and could even be seen as an improvement compared to
  the restrictions put on API input params by the Google robots.txt parser
2023-07-12 15:28:59 +02:00
Sebastian Nagel 871e4e61d2
Merge pull request #430 from sebastian-nagel/cc-390-114-robots-closing-rule-group
[Robots.txt] Close groups of rules as defined in RFC 9309
2023-07-12 10:35:48 +02:00
Sebastian Nagel d685bafb2d
[Robots.txt] SimpleRobotRulesParser main() to follow five redirects (#428)
when fetching robots.txt over HTTP as required by RFC 9309
2023-07-11 14:49:00 +01:00
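The redirect limit named in the commit above can be sketched as follows. This is a minimal illustration, not the code of SimpleRobotRulesParser main(): the class `RedirectLimitSketch`, the method `follow` and the `nextLocation` lookup are hypothetical names, and the HTTP fetch is replaced by a function so that the loop can be read in isolation.

```java
import java.util.function.UnaryOperator;

public class RedirectLimitSketch {

    /** Number of consecutive redirects to follow (the value named in the commit message). */
    public static final int MAX_REDIRECTS = 5;

    /**
     * Follows redirects starting at the given URL.
     *
     * @param url          the URL to start from
     * @param nextLocation hypothetical lookup: returns the redirect target of a URL,
     *                     or null if the URL does not redirect
     * @return the final URL, or null if the chain is longer than MAX_REDIRECTS
     */
    public static String follow(String url, UnaryOperator<String> nextLocation) {
        String current = url;
        // One lookup for the start URL plus one per followed redirect.
        for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
            String target = nextLocation.apply(current);
            if (target == null) {
                return current; // no further redirect: done
            }
            current = target;
        }
        return null; // still redirecting after MAX_REDIRECTS hops: give up
    }
}
```

In a real fetcher the lookup would issue an HTTP request and read the `Location` header of a 3xx response; what to do when the limit is exceeded is left to the caller here.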
Sebastian Nagel de7221dafc
[Robots.txt] Empty disallow statement not to clear other rules, fixes #422 (#424) 2023-07-11 14:47:33 +01:00
Sebastian Nagel 7ae8617563
[Robots.txt] Add more spelling variants and typos of robots.txt directives (#425)
* [Robots.txt] Add more spelling variants and typos of robots.txt directives
- found in Google's RFC 9309 reference parser (google/robotstxt)
- and in real-world robots.txt files (Common Crawl)
- if we accept lines starting with http: as sitemap directives
  we should nowadays also accept https: as such

* Add Javadocs to some of the robots.txt extension directives
2023-07-11 14:46:07 +01:00
Sebastian Nagel e67299432c [Robots.txt] Clarify behavior when to close blocks of multiple user-agents
- must keep state whether Crawl-delay is already set for a specific agent
  as separate variable
- add unit test to ensure that no already set Crawl-delay is overridden
  by a (lower) value of another agent
2023-07-10 15:18:23 +02:00
Sebastian Nagel 17e8544980 [Robots.txt] Clarify behavior when to close blocks of multiple user-agents
- fix unit test broken by introducing compliance with RFC 9309
2023-07-10 12:59:40 +02:00
Sebastian Nagel 4524cfb5c0 [Robots.txt] Clarify behavior when to close blocks of multiple user-agents, closes #390
[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes #114
- do not close rule blocks / groups on other directives than specified
  in RFC 9309: a group is only closed on a user-agent line if at least
  one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
  or set the value for a specific agent using a value defined for the
  wildcard agent
2023-07-10 12:59:40 +02:00
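The group-closing rule described in the commit above can be illustrated with a small stand-alone parser. This is a sketch under simplifying assumptions, not the SimpleRobotRulesParser implementation: the class `GroupClosingSketch` and its `Group` type are hypothetical, rules that appear before any user-agent line are simply ignored, and Crawl-delay is not stored at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class GroupClosingSketch {

    /** One group: the user-agent names it applies to and its allow/disallow lines. */
    public static final class Group {
        public final List<String> agents = new ArrayList<>();
        public final List<String> rules = new ArrayList<>();
    }

    public static List<Group> parseGroups(String robotsTxt) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        boolean ruleSeen = false; // was an allow/disallow line read in the current group?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                // A user-agent line closes the group only if a rule was read before;
                // otherwise it adds another agent name to the open group.
                if (current == null || ruleSeen) {
                    current = new Group();
                    groups.add(current);
                    ruleSeen = false;
                }
                current.agents.add(value.toLowerCase(Locale.ROOT));
            } else if (key.equals("allow") || key.equals("disallow")) {
                if (current != null) { // simplification: ignore rules outside any group
                    current.rules.add(key + ":" + value);
                    ruleSeen = true;
                }
            }
            // Any other directive (crawl-delay, sitemap, ...) neither opens nor closes a group.
        }
        return groups;
    }
}
```

With this logic, a `Crawl-delay` line placed between two `User-agent` lines does not split them into separate groups: both agents end up sharing the rules that follow.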
Sebastian Nagel d710c85871 BaseRobotRules: Document that Crawl-delay is stored in milliseconds 2023-07-10 12:59:40 +02:00
Sebastian Nagel a3900425f3 [Robots.txt] Handle robots.txt with missing sections (and implicit master rules)
- add unit test to verify solution of #114
2023-07-10 12:59:15 +02:00
Sebastian Nagel 86109c029a Updates changelog for #423/#426, #427, #429 2023-07-10 10:24:43 +02:00
Sebastian Nagel 54498a0e5a [Robots.txt] Rename default user-agent / robot name in unit tests
- replace occurrences of the user-agent name supposed to match
  the wildcard user-agent rule group by "anybot"
2023-06-16 17:34:20 +02:00
Sebastian Nagel 99289f7835 [Robots.txt] Pass empty collection of agent names to select rules for
any robot (wildcard user-agent name)
- in SimpleRobotRulesParser main()
- add unit test to verify that wildcard user-agent rules are selected
  if empty collection of agent names is passed
2023-06-16 17:19:39 +02:00
Sebastian Nagel a5bd9645fa [Robots.txt] Update Javadoc to document changes in Robots.txt classes
related to RFC 9309 compliance
- document effect of rules merging in combination with multiple agent names,
  fixes #423
- document that rules addressed to the wildcard agent are followed
  if none of the passed agent names matches - without any need to
  pass the wildcard agent name as one of the agent names
- complete documentation
- use @inheritDoc to avoid duplicated documentation
- strip doc strings where inherited automatically by @Override
  annotations
2023-06-16 17:16:23 +02:00
Sebastian Nagel 0eb2c74294 Updates changelog for #195/#408, #409/#412, #413, #416, #420 and merged dependabot pull requests 2023-06-13 14:24:13 +02:00
Sebastian Nagel 6523fd29ed [Robots.txt] Add unit tests based on examples in RFC 9309 2023-06-13 14:01:49 +02:00
Sebastian Nagel 7a95069f0e
Merge pull request #421 from sebastian-nagel/cc-308-url-normalizer-empty-query
[BasicNormalizer] Query parameters normalization in BasicURLNormalizer
2023-06-13 13:56:52 +02:00
dependabot[bot] 0656e9c561 Bump maven-surefire-plugin from 3.1.0 to 3.1.2
Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.1.0 to 3.1.2.
- [Release notes](https://github.com/apache/maven-surefire/releases)
- [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.1.0...surefire-3.1.2)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-surefire-plugin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-06-13 10:43:59 +02:00
dependabot[bot] 6b5c0a5699 Bump commons-io from 2.12.0 to 2.13.0
Bumps commons-io from 2.12.0 to 2.13.0.

---
updated-dependencies:
- dependency-name: commons-io:commons-io
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-06-13 10:43:49 +02:00
Richard Zowalla ee620d489e
Align JDK dist with the dist we are using for actual test/build 2023-06-13 10:41:38 +02:00
Sebastian Nagel e5563c3049 [BasicNormalizer] Query parameters normalization in BasicURLNormalizer,
closes #308
- add unit test to prove that an empty query is removed
2023-06-13 09:59:07 +02:00
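The empty-query case mentioned in the commit above can be sketched as a plain string operation. This is only an illustration of the expected result, not the BasicURLNormalizer code: the class `EmptyQuerySketch` and the method `dropEmptyQuery` are hypothetical, and keeping the fragment is an assumption of this sketch.

```java
public class EmptyQuerySketch {

    /** Removes a '?' that is followed by no query at all, e.g. "http://host/path?". */
    public static String dropEmptyQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query part
        }
        int hash = url.indexOf('#');
        if (hash >= 0 && hash < q) {
            return url; // the '?' is part of the fragment, not a query delimiter
        }
        int end = hash >= 0 ? hash : url.length(); // end of the query part
        if (end == q + 1) {
            // nothing between '?' and the end of the query: drop the '?'
            return url.substring(0, q) + url.substring(end);
        }
        return url;
    }
}
```

A URL with a non-empty query, such as `http://example.com/path?a=b`, is returned unchanged.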
dependabot[bot] 9261174c6c
Bump maven-release-plugin from 3.0.0 to 3.0.1 (#415)
Bumps [maven-release-plugin](https://github.com/apache/maven-release) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/apache/maven-release/releases)
- [Commits](https://github.com/apache/maven-release/compare/maven-release-3.0.0...maven-release-3.0.1)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-release-plugin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-09 12:14:28 +01:00
Sebastian Nagel 6c0d91e40b
[Robots.txt] Deduplicate robots rules before matching (#416)
* [Robots.txt] Deduplicate robots rules before matching
- update SimpleRobotRules documentation: add references
  to RFC 9309

* [Robots.txt] Deduplicate robots rules before matching

* SimpleRobotRules: add missing Override annotation
2023-06-09 09:10:06 +01:00
Julien Nioche bfb5b9b067
Update code_coverage.yml
fixed name of secret
2023-06-08 17:26:41 +01:00
Julien Nioche e881cdba46
Update README.md
use latest version in examples of using Maven or Gradle
2023-05-24 08:00:59 +01:00
Richard Zowalla 4663ca583b
#409 - Push Code Coverage to Coveralls (#414) 2023-05-24 07:54:57 +01:00
Sebastian Nagel 7421e5edb1
[Robots.txt] SimpleRobotRulesParser main to use the new API method (#413)
without splitting the agent name into tokens
2023-05-23 14:56:08 +01:00
Julien Nioche d1211d6057
Generate JaCoCo reports when testing (#412)
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
2023-05-23 14:55:40 +01:00
dependabot[bot] 5246b69f80 Bump commons-io from 2.11.0 to 2.12.0
Bumps commons-io from 2.11.0 to 2.12.0.

---
updated-dependencies:
- dependency-name: commons-io:commons-io
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-23 15:19:25 +02:00
dependabot[bot] e2322e2804 Bump maven-source-plugin from 3.2.1 to 3.3.0
Bumps [maven-source-plugin](https://github.com/apache/maven-source-plugin) from 3.2.1 to 3.3.0.
- [Commits](https://github.com/apache/maven-source-plugin/compare/maven-source-plugin-3.2.1...maven-source-plugin-3.3.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-source-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-23 15:18:48 +02:00
Sebastian Nagel 962787f4fd
Merge pull request #408 from sebastian-nagel/cc-195-robotstxt-url-decode
[Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
2023-05-23 15:17:43 +02:00
Sebastian Nagel 5d036a1963 [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
- fix path matching for paths containing `*` or `$`
2023-05-12 14:19:35 +02:00
Sebastian Nagel 9559134438 [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
- properly percent-encode allow/disallow paths
  and URL paths during rule matching
- decode characters where necessary
- add unit tests
2023-05-12 11:42:34 +02:00
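The idea of bringing allow/disallow paths and URL paths into the same percent-encoded form before matching can be sketched as below. This is a simplified illustration, not the code of the commit above: the class `PathEscapeSketch` and the method `normalize` are hypothetical, and the sketch only upper-cases existing escapes and encodes non-ASCII bytes as UTF-8. It never decodes anything, so an escaped wildcard such as `%2A` stays distinct from a literal `*`.

```java
import java.nio.charset.StandardCharsets;

public class PathEscapeSketch {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    private static boolean isHex(int b) {
        return (b >= '0' && b <= '9') || (b >= 'a' && b <= 'f') || (b >= 'A' && b <= 'F');
    }

    /** Returns the path with upper-cased %XX escapes and non-ASCII bytes percent-encoded. */
    public static String normalize(String path) {
        byte[] bytes = path.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if (b == '%' && i + 2 < bytes.length && isHex(bytes[i + 1]) && isHex(bytes[i + 2])) {
                // existing escape: keep it, but with upper-case hex digits
                sb.append('%')
                  .append(Character.toUpperCase((char) bytes[i + 1]))
                  .append(Character.toUpperCase((char) bytes[i + 2]));
                i += 2;
            } else if (b >= 0x80) {
                // non-ASCII byte of a UTF-8 sequence: percent-encode it
                sb.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            } else {
                sb.append((char) b); // plain ASCII, including '*' and '$', is kept as is
            }
        }
        return sb.toString();
    }
}
```

Applying `normalize` to both the rule path and the URL path makes a rule written with unescaped Unicode characters comparable to a URL that carries the same characters percent-encoded.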
Sebastian Nagel 8bb1694669 Updates changelog for #192, #362, #383, #389 and merged dependabot pull requests 2023-05-11 16:52:23 +02:00
dependabot[bot] 1eefc10ce1 Bump maven-surefire-plugin from 3.0.0 to 3.1.0
Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.0.0 to 3.1.0.
- [Release notes](https://github.com/apache/maven-surefire/releases)
- [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.0.0...surefire-3.1.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-surefire-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-11 16:20:45 +02:00
dependabot[bot] 5c317d3c23 Bump maven-gpg-plugin from 3.0.1 to 3.1.0
Bumps [maven-gpg-plugin](https://github.com/apache/maven-gpg-plugin) from 3.0.1 to 3.1.0.
- [Commits](https://github.com/apache/maven-gpg-plugin/compare/maven-gpg-plugin-3.0.1...maven-gpg-plugin-3.1.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-gpg-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-11 16:20:16 +02:00
Sebastian Nagel 79bef97d40
Merge pull request #401 from sebastian-nagel/cc-389-allow-disallow-unicode-paths
[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters
2023-05-11 16:19:23 +02:00
dependabot[bot] e691cec4cf Bump junit.version from 5.9.2 to 5.9.3
Bumps `junit.version` from 5.9.2 to 5.9.3.

Updates `junit-jupiter-engine` from 5.9.2 to 5.9.3
- [Release notes](https://github.com/junit-team/junit5/releases)
- [Commits](https://github.com/junit-team/junit5/compare/r5.9.2...r5.9.3)

Updates `junit-jupiter-params` from 5.9.2 to 5.9.3
- [Release notes](https://github.com/junit-team/junit5/releases)
- [Commits](https://github.com/junit-team/junit5/compare/r5.9.2...r5.9.3)

---
updated-dependencies:
- dependency-name: org.junit.jupiter:junit-jupiter-engine
  dependency-type: direct:development
  update-type: version-update:semver-patch
- dependency-name: org.junit.jupiter:junit-jupiter-params
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-02 08:08:32 +02:00
dependabot[bot] 764ef96ea1 Bump download-maven-plugin from 1.6.8 to 1.7.0
Bumps [download-maven-plugin](https://github.com/maven-download-plugin/maven-download-plugin) from 1.6.8 to 1.7.0.
- [Release notes](https://github.com/maven-download-plugin/maven-download-plugin/releases)
- [Commits](https://github.com/maven-download-plugin/maven-download-plugin/compare/1.6.8...1.7.0)

---
updated-dependencies:
- dependency-name: com.googlecode.maven-download-plugin:download-maven-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-02 08:08:26 +02:00
dependabot[bot] 1291f5cddb Bump forbiddenapis from 3.4 to 3.5.1
Bumps [forbiddenapis](https://github.com/policeman-tools/forbidden-apis) from 3.4 to 3.5.1.
- [Release notes](https://github.com/policeman-tools/forbidden-apis/releases)
- [Commits](https://github.com/policeman-tools/forbidden-apis/compare/3.4...3.5.1)

---
updated-dependencies:
- dependency-name: de.thetaphi:forbiddenapis
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-04-25 09:51:52 +02:00
dependabot[bot] a980ae10da Bump maven-surefire-plugin from 2.22.2 to 3.0.0
Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 2.22.2 to 3.0.0.
- [Release notes](https://github.com/apache/maven-surefire/releases)
- [Commits](https://github.com/apache/maven-surefire/compare/surefire-2.22.2...surefire-3.0.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-surefire-plugin
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-04-25 09:51:42 +02:00
Sebastian Nagel a395cfee73 Add link to RFC 9309 to Javadoc class description 2023-04-24 17:36:08 +02:00
Sebastian Nagel be2d5c24d3 Fix line wrapping in comments 2023-04-24 17:27:16 +02:00
Sebastian Nagel 2c2cb3bf7a [Robots.txt] Handle allow/disallow directives containing unescaped
Unicode characters, fixes #389
- use UTF-8 as default input encoding of robots.txt files
- add unit test
  - test matching of Unicode paths in allow/disallow directives
  - test for proper matching of ASCII paths if encoding is not
    UTF-8 (and no byte order mark present)
2023-04-24 17:27:16 +02:00
Sebastian Nagel d8a6126365
[Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (#362)
* RFC compliance: matching user-agent names when selecting rule blocks
- add unit test to verify that the rule with the completely
  matched user-agent name is selected, and no partial prefix match
  is preferred (cf. also #192)

* RFC compliance: matching user-agent names when selecting rule blocks

- refactor agent name matching and move splitting robotNames string
  at comma into a separate method to be called once at the beginning
  of parsing the robots.txt file

- extend the robots parser API and add a method to pass agent names
  as a collection following the RFC 9309 with no splitting of the
  names into words/tokens.

- deprecate "old" method which splits the robot name into tokens and
  performs prefix matching

- by default user-agent names are matched literally but case-insensitively,
  following RFC 9309. Add method to "restore" the prefix matching:
  "setExactUserAgentMatching(false)"

- BaseRobotRulesParser: move the documented details about how
  user-agent names are matched into SimpleRobotRulesParser

- unit tests: add tests for issues described in #192, configure exact
  user-agent matching if required

* RFC compliance: matching user-agent names when selecting rule blocks
- match user-agent product token at beginning of user-agent
  line/statement followed by ignored non-token characters,
  e.g. "foo" is matched in "User-agent: foo/1.2"

* RFC compliance: matching user-agent names when selecting rule blocks
- match user-agent product tokens followed by ignored characters
  also in legacy prefix matching mode, e.g. match "butterfly" in
  "User-agent: Butterfly/1.0"
- refactor prefix matching: switch inner and outer loop, handle
  check for (common) wild-card user-agent outside of loop

* RFC compliance: matching user-agent names when selecting rule blocks
- make exact user-agent matching the default in unit tests,
  explicitly pass flag for legacy prefix user-agent matching
  in unit tests where needed
  - names not following the ua pattern in the specification "[a-zA-Z_-]+"
  - user-agent lines with multiple user-agent names

* RFC compliance: matching user-agent names when selecting rule blocks
- make the method to handle prefix/partial user-agent product token
  matches protected, so that it can be overridden to match non-standard
  user-agent product tokens, e.g. "Go!zilla"
2023-04-24 17:24:59 +02:00
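The exact matching mode described in the commit above can be sketched as follows: take the leading product token (pattern `[a-zA-Z_-]+`) of the user-agent line, ignore whatever follows it, and compare it case-insensitively against the robot names passed by the caller. This is a minimal sketch, not the SimpleRobotRulesParser code: the class `UserAgentMatchSketch` and its methods are hypothetical, and the wildcard agent `*` and the legacy prefix matching are left out.

```java
import java.util.Collection;
import java.util.Locale;

public class UserAgentMatchSketch {

    private static boolean isTokenChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '-';
    }

    /** Leading product token of a user-agent line value, lower-cased; may be empty. */
    public static String productToken(String userAgentValue) {
        String v = userAgentValue.trim();
        int end = 0;
        while (end < v.length() && isTokenChar(v.charAt(end))) {
            end++;
        }
        return v.substring(0, end).toLowerCase(Locale.ROOT);
    }

    /** True if one of the robot names equals the product token, ignoring case. */
    public static boolean matchesExact(Collection<String> robotNames, String userAgentValue) {
        String token = productToken(userAgentValue);
        if (token.isEmpty()) {
            return false; // e.g. the wildcard "*", which this sketch does not handle
        }
        for (String name : robotNames) {
            if (name.equalsIgnoreCase(token)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, the robot name "foo" matches the line value "foo/1.2" but not "foobot", because the whole token must be equal rather than only a prefix of it.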
dependabot[bot] f2982c5d11 Bump maven-deploy-plugin from 3.0.0 to 3.1.1
Bumps [maven-deploy-plugin](https://github.com/apache/maven-deploy-plugin) from 3.0.0 to 3.1.1.
- [Release notes](https://github.com/apache/maven-deploy-plugin/releases)
- [Commits](https://github.com/apache/maven-deploy-plugin/compare/maven-deploy-plugin-3.0.0...maven-deploy-plugin-3.1.1)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-deploy-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-04-20 17:07:56 +02:00
dependabot[bot] a0bf1c9167 Bump slf4j-api from 1.7.36 to 2.0.7
Bumps [slf4j-api](https://github.com/qos-ch/slf4j) from 1.7.36 to 2.0.7.
- [Release notes](https://github.com/qos-ch/slf4j/releases)
- [Commits](https://github.com/qos-ch/slf4j/commits)

---
updated-dependencies:
- dependency-name: org.slf4j:slf4j-api
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-04-20 17:07:39 +02:00