mirror of https://github.com/crawler-commons/crawler-commons synced 2024-04-24 08:35:04 +02:00
Commit Graph

622 Commits

Author SHA1 Message Date
dependabot[bot] 258a499330 Bump org.apache.maven.plugins:maven-surefire-plugin from 3.1.2 to 3.2.2
Bumps [org.apache.maven.plugins:maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.1.2 to 3.2.2.
- [Release notes](https://github.com/apache/maven-surefire/releases)
- [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.1.2...surefire-3.2.2)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-surefire-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-11-14 10:34:36 +01:00
Sebastian Nagel ccb218a86a Update CHANGES.txt to include recent fixes and upgrades 2023-10-29 10:49:21 +01:00
dependabot[bot] 03b5543451 Bump de.thetaphi:forbiddenapis from 3.5.1 to 3.6
Bumps [de.thetaphi:forbiddenapis](https://github.com/policeman-tools/forbidden-apis) from 3.5.1 to 3.6.
- [Commits](https://github.com/policeman-tools/forbidden-apis/compare/3.5.1...3.6)

---
updated-dependencies:
- dependency-name: de.thetaphi:forbiddenapis
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-29 10:41:46 +01:00
dependabot[bot] 54c65dae65 Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.5.0 to 3.6.0
Bumps [org.apache.maven.plugins:maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) from 3.5.0 to 3.6.0.
- [Release notes](https://github.com/apache/maven-javadoc-plugin/releases)
- [Commits](https://github.com/apache/maven-javadoc-plugin/compare/maven-javadoc-plugin-3.5.0...maven-javadoc-plugin-3.6.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-javadoc-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-29 10:41:23 +01:00
dependabot[bot] 2c64f79773 Bump commons-io:commons-io from 2.13.0 to 2.15.0
Bumps commons-io:commons-io from 2.13.0 to 2.15.0.

---
updated-dependencies:
- dependency-name: commons-io:commons-io
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-29 10:40:53 +01:00
dependabot[bot] e7421e9785 Bump org.slf4j:slf4j-api from 2.0.7 to 2.0.9
Bumps org.slf4j:slf4j-api from 2.0.7 to 2.0.9.

---
updated-dependencies:
- dependency-name: org.slf4j:slf4j-api
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-29 10:38:59 +01:00
Sebastian Nagel 54576e810d
[Sitemaps] Google Sitemap PageMap extensions, implements #388 (#442) 2023-10-28 16:09:45 +01:00
Sebastian Nagel ed1cebeff7 [Domains] Installation of a gzip-compressed public suffix list from cache breaks EffectiveTldFinder, fixes #441
- downgrade Maven download plugin (1.7.1 -> 1.6.8)
2023-10-27 07:20:58 +02:00
Ken Krugler 4192e3fab7
Fix typo in README.md 2023-08-30 12:59:02 -07:00
dependabot[bot] 4b1117943b Bump com.googlecode.maven-download-plugin:download-maven-plugin
Bumps [com.googlecode.maven-download-plugin:download-maven-plugin](https://github.com/maven-download-plugin/maven-download-plugin) from 1.7.0 to 1.7.1.
- [Release notes](https://github.com/maven-download-plugin/maven-download-plugin/releases)
- [Commits](https://github.com/maven-download-plugin/maven-download-plugin/compare/1.7.0...1.7.1)

---
updated-dependencies:
- dependency-name: com.googlecode.maven-download-plugin:download-maven-plugin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-07-25 10:56:45 +02:00
Sebastian Nagel 69c4f606f7 Release 1.4
- fix release date in news section
- add note that user-agent product tokens must be lower-case
2023-07-18 15:03:51 +02:00
Sebastian Nagel a3ff95502f Release 1.4
- update news section
- add 1.4 Javadocs to README
2023-07-18 13:44:15 +02:00
Sebastian Nagel 0e1758fcee Update CHANGES.txt for next development iteration (1.5-SNAPSHOT) 2023-07-13 11:30:09 +02:00
Sebastian Nagel 3e958801f6 [maven-release-plugin] prepare for next development iteration 2023-07-13 11:29:47 +02:00
Sebastian Nagel ce9cf46020 Update CHANGES.txt for release of crawler-commons 1.4 2023-07-13 11:29:43 +02:00
Sebastian Nagel 2b8717d9e5 [maven-release-plugin] prepare release crawler-commons-1.4 2023-07-13 10:30:08 +02:00
Sebastian Nagel a62bd80140 Updates changelog for #376, #380, #401, #414, #425, #428, #422/#424, #114/#390/#430, #245/#360 2023-07-12 16:16:30 +02:00
Sebastian Nagel 6fb34cf856
Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (#360)
- port unit tests from https://github.com/google/robotstxt
- adapt "Google-only" unit tests dealing with overlong lines
  and non-standard user-agent names
- adapt unit tests dealing with overlong lines and percent-encoded
  URL paths where the behavior of SimpleRobotRulesParser is not
  wrong and could even be seen as an improvement compared to
  the restrictions put on API input params by the Google robots.txt parser
2023-07-12 15:28:59 +02:00
Sebastian Nagel 871e4e61d2
Merge pull request #430 from sebastian-nagel/cc-390-114-robots-closing-rule-group
[Robots.txt] Close groups of rules as defined in RFC 9309
2023-07-12 10:35:48 +02:00
Sebastian Nagel d685bafb2d
[Robots.txt] SimpleRobotRulesParser main() to follow five redirects (#428)
when fetching robots.txt over HTTP as required by RFC 9309
2023-07-11 14:49:00 +01:00
Sebastian Nagel de7221dafc
[Robots.txt] Empty disallow statement not to clear other rules, fixes #422 (#424) 2023-07-11 14:47:33 +01:00
Sebastian Nagel 7ae8617563
[Robots.txt] Add more spelling variants and typos of robots.txt directives (#425)
* [Robots.txt] Add more spelling variants and typos of robots.txt directives
- found in Google's RFC 9309 reference parser (google/robotstxt)
- and in real-world robots.txt files (Common Crawl)
- if we accept lines starting with http: as sitemap directives
  we should nowadays also accept https: as such

* Add Javadocs to some of the robots.txt extension directives
2023-07-11 14:46:07 +01:00
Sebastian Nagel e67299432c [Robots.txt] Clarify behavior when to close blocks of multiple user-agents
- must keep state whether Crawl-delay is already set for a specific agent
  as separate variable
- add unit test to ensure that no already set Crawl-delay is overridden
  by a (lower) value of another agent
2023-07-10 15:18:23 +02:00
Sebastian Nagel 17e8544980 [Robots.txt] Clarify behavior when to close blocks of multiple user-agents
- fix unit test broken by introducing compliance with RFC 9309
2023-07-10 12:59:40 +02:00
Sebastian Nagel 4524cfb5c0 [Robots.txt] Clarify behavior when to close blocks of multiple user-agents, closes #390
[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes #114
- do not close rule blocks / groups on other directives than specified
  in RFC 9309: groups are only closed on a user-agent line if at least
  one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
  or set the value for a specific agent using a value defined for the
  wildcard agent
2023-07-10 12:59:40 +02:00
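
The group-closing rule described in commit 4524cfb5c0 can be illustrated with a minimal sketch. It is not the SimpleRobotRulesParser code: the class `RobotsGroupSketch`, its `Group` type and the `parse` method are hypothetical names chosen for illustration. The sketch assumes that rules appearing before any user-agent line are simply skipped, and that inline comments are not stripped. It shows only one behavior: a user-agent line starts a new group when the current group already holds at least one allow/disallow line, and other directives such as Sitemap leave the group open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of RFC 9309 group closing, not the crawler-commons implementation. */
class RobotsGroupSketch {

    /** One group: the user-agents it addresses and its allow/disallow lines. */
    static final class Group {
        final List<String> agents = new ArrayList<>();
        final List<String> rules = new ArrayList<>();
    }

    static List<Group> parse(String robotsTxt) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        for (String rawLine : robotsTxt.split("\\R")) {
            String line = rawLine.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue; // blank line, comment line, or not a directive
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            switch (key) {
            case "user-agent":
                // Close the current group only if it already contains a rule;
                // otherwise this agent joins the group that is still open.
                if (current == null || !current.rules.isEmpty()) {
                    current = new Group();
                    groups.add(current);
                }
                current.agents.add(value.toLowerCase(Locale.ROOT));
                break;
            case "allow":
            case "disallow":
                if (current != null) { // assumption: rules without a group are skipped
                    current.rules.add(key + ":" + value);
                }
                break;
            default:
                // Other directives (sitemap, crawl-delay, ...) do not close the group.
                break;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "User-agent: a",
                "User-agent: b",
                "Sitemap: https://example.com/sitemap.xml",
                "Disallow: /x",
                "User-agent: c",
                "Disallow: /y");
        for (Group g : parse(robotsTxt)) {
            System.out.println(g.agents + " -> " + g.rules);
        }
    }
}
```

In the sample input, the Sitemap line between the user-agent lines and the first Disallow line does not split agents a and b into separate groups. Only the later user-agent line for c opens a second group.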
Sebastian Nagel d710c85871 BaseRobotRules: Document that Crawl-delay is stored in milliseconds 2023-07-10 12:59:40 +02:00
Sebastian Nagel a3900425f3 [Robots.txt] Handle robots.txt with missing sections (and implicit master rules)
- add unit test to verify solution of #114
2023-07-10 12:59:15 +02:00
Sebastian Nagel 86109c029a Updates changelog for #423/#426, #427, #429 2023-07-10 10:24:43 +02:00
Sebastian Nagel 54498a0e5a [Robots.txt] Rename default user-agent / robot name in unit tests
- replace occurrences of the user-agent name supposed to match
  the wildcard user-agent rule group by "anybot"
2023-06-16 17:34:20 +02:00
Sebastian Nagel 99289f7835 [Robots.txt] Pass empty collection of agent names to select rules for
any robot (wildcard user-agent name)
- in SimpleRobotRulesParser main()
- add unit test to verify that wildcard user-agent rules are selected
  if empty collection of agent names is passed
2023-06-16 17:19:39 +02:00
Sebastian Nagel a5bd9645fa [Robots.txt] Update Javadoc to document changes in Robots.txt classes
related to RFC 9309 compliance
- document effect of rules merging in combination with multiple agent names,
  fixes #423
- document that rules addressed to the wildcard agent are followed
  if none of the passed agent names matches - without any need to
  pass the wildcard agent name as one of the agent names
- complete documentation
- use @inheritDoc to avoid duplicated documentation
- strip doc strings where inherited automatically by @Override
  annotations
2023-06-16 17:16:23 +02:00
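
The selection behavior documented in commit a5bd9645fa (merging rules for several agent names, falling back to the wildcard group when none matches) might look roughly like the sketch below. Everything in it is hypothetical: the class `AgentSelectionSketch`, the `selectRules` method, and the representation of groups as a map from lower-case agent name to a list of rule strings are assumptions for illustration, not the crawler-commons API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of agent selection with wildcard fallback. */
class AgentSelectionSketch {

    /**
     * Returns the merged rules of all groups whose agent name was passed.
     * If no passed name matches (including an empty collection), the rules
     * of the wildcard group "*" are returned.
     */
    static List<String> selectRules(Map<String, List<String>> groups,
                                    Collection<String> agentNames) {
        List<String> merged = new ArrayList<>();
        boolean matched = false;
        for (String name : agentNames) {
            // assumption: group keys are stored lower-case
            List<String> rules = groups.get(name.toLowerCase(Locale.ROOT));
            if (rules != null) {
                matched = true;
                merged.addAll(rules);
            }
        }
        if (matched) {
            return merged;
        }
        List<String> wildcard = groups.get("*");
        return wildcard != null ? new ArrayList<>(wildcard) : merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = Map.of(
                "*", List.of("disallow:/"),
                "foobot", List.of("disallow:/a"),
                "barbot", List.of("allow:/b"));
        System.out.println(selectRules(groups, List.of("foobot", "barbot")));
        System.out.println(selectRules(groups, List.of()));
    }
}
```

The caller never has to pass "*" explicitly: an empty collection, or a collection of names that match no group, ends up at the wildcard rules.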
Sebastian Nagel 0eb2c74294 Updates changelog for #195/#408, #409/#412, #413, #416, #420 and merged dependabot pull requests 2023-06-13 14:24:13 +02:00
Sebastian Nagel 6523fd29ed [Robots.txt] Add unit tests based on examples in RFC 9309 2023-06-13 14:01:49 +02:00
Sebastian Nagel 7a95069f0e
Merge pull request #421 from sebastian-nagel/cc-308-url-normalizer-empty-query
[BasicNormalizer] Query parameters normalization in BasicURLNormalizer
2023-06-13 13:56:52 +02:00
dependabot[bot] 0656e9c561 Bump maven-surefire-plugin from 3.1.0 to 3.1.2
Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.1.0 to 3.1.2.
- [Release notes](https://github.com/apache/maven-surefire/releases)
- [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.1.0...surefire-3.1.2)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-surefire-plugin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-06-13 10:43:59 +02:00
dependabot[bot] 6b5c0a5699 Bump commons-io from 2.12.0 to 2.13.0
Bumps commons-io from 2.12.0 to 2.13.0.

---
updated-dependencies:
- dependency-name: commons-io:commons-io
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-06-13 10:43:49 +02:00
Richard Zowalla ee620d489e
Align JDK dist with the dist we are using for actual test/build 2023-06-13 10:41:38 +02:00
Sebastian Nagel e5563c3049 [BasicNormalizer] Query parameters normalization in BasicURLNormalizer,
closes #308
- add unit test to prove that an empty query is removed
2023-06-13 09:59:07 +02:00
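
As a rough illustration of what "an empty query is removed" means in commit e5563c3049, the sketch below drops a `?` that is followed by nothing or directly by a fragment. It does not reproduce BasicURLNormalizer: the class `EmptyQuerySketch` and the method `removeEmptyQuery` are hypothetical, and keeping the fragment intact is an assumption of this sketch only.

```java
/** Hypothetical sketch: strip a '?' that introduces an empty query. */
class EmptyQuerySketch {

    static String removeEmptyQuery(String url) {
        int hash = url.indexOf('#');
        int question = url.indexOf('?');
        // No query at all, or the '?' is part of the fragment: nothing to do.
        if (question < 0 || (hash >= 0 && hash < question)) {
            return url;
        }
        int afterQuestion = question + 1;
        // The query is empty if the URL ends here or the fragment starts here.
        if (afterQuestion == url.length() || url.charAt(afterQuestion) == '#') {
            return url.substring(0, question) + url.substring(afterQuestion);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(removeEmptyQuery("http://example.com/path?"));
        System.out.println(removeEmptyQuery("http://example.com/path?a=1"));
    }
}
```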
dependabot[bot] 9261174c6c
Bump maven-release-plugin from 3.0.0 to 3.0.1 (#415)
Bumps [maven-release-plugin](https://github.com/apache/maven-release) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/apache/maven-release/releases)
- [Commits](https://github.com/apache/maven-release/compare/maven-release-3.0.0...maven-release-3.0.1)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-release-plugin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-09 12:14:28 +01:00
Sebastian Nagel 6c0d91e40b
[Robots.txt] Deduplicate robots rules before matching (#416)
* [Robots.txt] Deduplicate robots rules before matching
- update SimpleRobotRules documentation: add references
  to RFC 9309

* [Robots.txt] Deduplicate robots rules before matching

* SimpleRobotRules: add missing Override annotation
2023-06-09 09:10:06 +01:00
Julien Nioche bfb5b9b067
Update code_coverage.yml
fixed name of secret
2023-06-08 17:26:41 +01:00
Julien Nioche e881cdba46
Update README.md
use latest version in examples of using Maven or Gradle
2023-05-24 08:00:59 +01:00
Richard Zowalla 4663ca583b
#409 - Push Code Coverage to Coveralls (#414) 2023-05-24 07:54:57 +01:00
Sebastian Nagel 7421e5edb1
[Robots.txt] SimpleRobotRulesParser main to use the new API method (#413)
without splitting the agent name into tokens
2023-05-23 14:56:08 +01:00
Julien Nioche d1211d6057
Generate JaCoCo reports when testing (#412)
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
2023-05-23 14:55:40 +01:00
dependabot[bot] 5246b69f80 Bump commons-io from 2.11.0 to 2.12.0
Bumps commons-io from 2.11.0 to 2.12.0.

---
updated-dependencies:
- dependency-name: commons-io:commons-io
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-23 15:19:25 +02:00
dependabot[bot] e2322e2804 Bump maven-source-plugin from 3.2.1 to 3.3.0
Bumps [maven-source-plugin](https://github.com/apache/maven-source-plugin) from 3.2.1 to 3.3.0.
- [Commits](https://github.com/apache/maven-source-plugin/compare/maven-source-plugin-3.2.1...maven-source-plugin-3.3.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-source-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-05-23 15:18:48 +02:00
Sebastian Nagel 962787f4fd
Merge pull request #408 from sebastian-nagel/cc-195-robotstxt-url-decode
[Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
2023-05-23 15:17:43 +02:00
Sebastian Nagel 5d036a1963 [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
- fix path matching for paths containing `*` or `$`
2023-05-12 14:19:35 +02:00
Sebastian Nagel 9559134438 [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
- properly percent-encode allow/disallow paths
  and URL paths during rule matching
- decode characters where necessary
- add unit tests
2023-05-12 11:42:34 +02:00
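
The problem addressed by commits 5d036a1963 and 9559134438 is that `*` and `$` are wildcards in allow/disallow paths, while their escaped forms `%2A` and `%24` stand for the literal characters. One possible way to keep the two apart is sketched below. It is not the code from these commits: the class `WildcardPathSketch` and its methods are hypothetical. The sketch assumes that only `*` and `$` need special treatment and ignores the rest of percent-encoding normalization.

```java
/** Hypothetical sketch of wildcard matching with escaped '*' and '$'. */
class WildcardPathSketch {

    /** Encode literal '*' and '$' in a URL path so they cannot act as wildcards. */
    static String escapeSpecials(String urlPath) {
        return urlPath.replace("*", "%2A").replace("$", "%24");
    }

    /** Upper-case the hex digits of percent-escapes so "%2a" equals "%2A". */
    static String upperCaseEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c);
            if (c == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                sb.append(Character.toUpperCase(s.charAt(i + 1)));
                sb.append(Character.toUpperCase(s.charAt(i + 2)));
                i += 2;
            }
        }
        return sb.toString();
    }

    /** '*' in the pattern matches any character sequence, a trailing '$' anchors the end. */
    static boolean matches(String pattern, String urlPath) {
        String p = upperCaseEscapes(pattern);
        String path = upperCaseEscapes(escapeSpecials(urlPath));
        boolean anchored = p.endsWith("$");
        if (anchored) {
            p = p.substring(0, p.length() - 1);
        }
        p = p.replace("$", "%24"); // a '$' elsewhere is treated as a literal
        String[] parts = p.split("\\*", -1);
        if (!path.startsWith(parts[0])) {
            return false; // the text before the first '*' must be a prefix
        }
        int pos = parts[0].length();
        for (int i = 1; i < parts.length; i++) {
            if (anchored && i == parts.length - 1) {
                // the last part must be a suffix that starts at or after pos
                return path.length() - parts[i].length() >= pos && path.endsWith(parts[i]);
            }
            int idx = path.indexOf(parts[i], pos);
            if (idx < 0) {
                return false;
            }
            pos = idx + parts[i].length();
        }
        return !anchored || pos == path.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("/fish*.php", "/fish/index.php"));
        System.out.println(matches("/a%2ab", "/a*b"));
        System.out.println(matches("/a%2ab", "/aXb"));
    }
}
```

With this normalization, the pattern `/a%2ab` matches only a path containing a literal asterisk (written as `*` or `%2a`), whereas the pattern `/a*b` keeps its wildcard meaning.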