Sebastian Nagel
ed1cebeff7
[Domains] Installation of a gzip-compressed public suffix list from cache breaks EffectiveTldFinder, fixes #441
...
- downgrade Maven download plugin (1.7.1 -> 1.6.8)
2023-10-27 07:20:58 +02:00
Ken Krugler
4192e3fab7
Fix typo in README.md
2023-08-30 12:59:02 -07:00
dependabot[bot]
4b1117943b
Bump com.googlecode.maven-download-plugin:download-maven-plugin
...
Bumps [com.googlecode.maven-download-plugin:download-maven-plugin](https://github.com/maven-download-plugin/maven-download-plugin) from 1.7.0 to 1.7.1.
- [Release notes](https://github.com/maven-download-plugin/maven-download-plugin/releases)
- [Commits](https://github.com/maven-download-plugin/maven-download-plugin/compare/1.7.0...1.7.1)
---
updated-dependencies:
- dependency-name: com.googlecode.maven-download-plugin:download-maven-plugin
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-07-25 10:56:45 +02:00
Sebastian Nagel
69c4f606f7
Release 1.4
...
- fix release date in news section
- add note that user-agent product tokens must be lower-case
2023-07-18 15:03:51 +02:00
Sebastian Nagel
a3ff95502f
Release 1.4
...
- update news section
- add 1.4 Javadocs to README
2023-07-18 13:44:15 +02:00
Sebastian Nagel
0e1758fcee
Update CHANGES.txt for next development iteration (1.5-SNAPSHOT)
2023-07-13 11:30:09 +02:00
Sebastian Nagel
3e958801f6
[maven-release-plugin] prepare for next development iteration
2023-07-13 11:29:47 +02:00
Sebastian Nagel
ce9cf46020
Update CHANGES.txt for release of crawler-commons 1.4
2023-07-13 11:29:43 +02:00
Sebastian Nagel
2b8717d9e5
[maven-release-plugin] prepare release crawler-commons-1.4
2023-07-13 10:30:08 +02:00
Sebastian Nagel
a62bd80140
Updates changelog for #376, #380, #401, #414, #425, #428, #422/#424, #114/#390/#430, #245/#360
2023-07-12 16:16:30 +02:00
Sebastian Nagel
6fb34cf856
Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (#360)
...
- port unit tests from https://github.com/google/robotstxt
- adapt "Google-only" unit tests dealing with overlong lines
and non-standard user-agent names
- adapt unit tests dealing with overlong lines and percent-encoded
URL paths where the behavior of SimpleRobotRulesParser is not
wrong and could even be seen as an improvement compared to
the restrictions put on API input params by the Google robots.txt parser
2023-07-12 15:28:59 +02:00
Sebastian Nagel
871e4e61d2
Merge pull request #430 from sebastian-nagel/cc-390-114-robots-closing-rule-group
...
[Robots.txt] Close groups of rules as defined in RFC 9309
2023-07-12 10:35:48 +02:00
Sebastian Nagel
d685bafb2d
[Robots.txt] SimpleRobotRulesParser main() to follow five redirects (#428)
...
when fetching robots.txt over HTTP as required by RFC 9309
2023-07-11 14:49:00 +01:00
Sebastian Nagel
de7221dafc
[Robots.txt] Empty disallow statement not to clear other rules, fixes #422 (#424)
2023-07-11 14:47:33 +01:00
Sebastian Nagel
7ae8617563
[Robots.txt] Add more spelling variants and typos of robots.txt directives (#425)
...
* [Robots.txt] Add more spelling variants and typos of robots.txt directives
- found in Google's RFC 9309 reference parser (google/robotstxt)
- and in real-world robots.txt files (Common Crawl)
- if we accept lines starting with http: as sitemap directives
we should nowadays also accept https: as such
* Add Javadocs to some of the robots.txt extension directives
2023-07-11 14:46:07 +01:00
Sebastian Nagel
e67299432c
[Robots.txt] Clarify behavior when to close blocks of multiple user-agents
...
- must keep state whether Crawl-delay is already set for a specific agent
as separate variable
- add unit test to ensure that no already set Crawl-delay is overridden
by a (lower) value of another agent
2023-07-10 15:18:23 +02:00
Sebastian Nagel
17e8544980
[Robots.txt] Clarify behavior when to close blocks of multiple user-agents
...
- fix unit test broken by introducing compliance with RFC 9309
2023-07-10 12:59:40 +02:00
Sebastian Nagel
4524cfb5c0
[Robots.txt] Clarify behavior when to close blocks of multiple user-agents, closes #390
...
[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes #114
- do not close rule blocks / groups on other directives than specified
in RFC 9309: groups are only closed on a user-agent line if at least
one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
or set the value for a specific agent using a value defined for the
wildcard agent
2023-07-10 12:59:40 +02:00
Sebastian Nagel
d710c85871
BaseRobotRules: Document that Crawl-delay is stored in milliseconds
2023-07-10 12:59:40 +02:00
Sebastian Nagel
a3900425f3
[Robots.txt] Handle robots.txt with missing sections (and implicit master rules)
...
- add unit test to verify solution of #114
2023-07-10 12:59:15 +02:00
Sebastian Nagel
86109c029a
Updates changelog for #423/#426, #427, #429
2023-07-10 10:24:43 +02:00
Sebastian Nagel
54498a0e5a
[Robots.txt] Rename default user-agent / robot name in unit tests
...
- replace occurrences of the user-agent name supposed to match
the wildcard user-agent rule group by "anybot"
2023-06-16 17:34:20 +02:00
Sebastian Nagel
99289f7835
[Robots.txt] Pass empty collection of agent names to select rules for
...
any robot (wildcard user-agent name)
- in SimpleRobotRulesParser main()
- add unit test to verify that wildcard user-agent rules are selected
if empty collection of agent names is passed
2023-06-16 17:19:39 +02:00
Sebastian Nagel
a5bd9645fa
[Robots.txt] Update Javadoc to document changes in Robots.txt classes
...
related to RFC 9309 compliance
- document effect of rules merging in combination with multiple agent names,
fixes #423
- document that rules addressed to the wildcard agent are followed
if none of the passed agent names matches - without any need to
pass the wildcard agent name as one of the agent names
- complete documentation
- use @inheritDoc to avoid duplicated documentation
- strip doc strings where inherited automatically by @Override
annotations
2023-06-16 17:16:23 +02:00
Sebastian Nagel
0eb2c74294
Updates changelog for #195/#408, #409/#412, #413, #416, #420 and merged dependabot pull requests
2023-06-13 14:24:13 +02:00
Sebastian Nagel
6523fd29ed
[Robots.txt] Add unit tests based on examples in RFC 9309
2023-06-13 14:01:49 +02:00
Sebastian Nagel
7a95069f0e
Merge pull request #421 from sebastian-nagel/cc-308-url-normalizer-empty-query
...
[BasicNormalizer] Query parameters normalization in BasicURLNormalizer
2023-06-13 13:56:52 +02:00
dependabot[bot]
0656e9c561
Bump maven-surefire-plugin from 3.1.0 to 3.1.2
...
Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.1.0 to 3.1.2.
- [Release notes](https://github.com/apache/maven-surefire/releases)
- [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.1.0...surefire-3.1.2)
---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-surefire-plugin
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-06-13 10:43:59 +02:00
dependabot[bot]
6b5c0a5699
Bump commons-io from 2.12.0 to 2.13.0
...
Bumps commons-io from 2.12.0 to 2.13.0.
---
updated-dependencies:
- dependency-name: commons-io:commons-io
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-06-13 10:43:49 +02:00
Richard Zowalla
ee620d489e
Align JDK dist with the dist we are using for actual test/build
2023-06-13 10:41:38 +02:00
Sebastian Nagel
e5563c3049
[BasicNormalizer] Query parameters normalization in BasicURLNormalizer,
...
closes #308
- add unit test to prove that an empty query is removed
2023-06-13 09:59:07 +02:00
dependabot[bot]
9261174c6c
Bump maven-release-plugin from 3.0.0 to 3.0.1 (#415)
...
Bumps [maven-release-plugin](https://github.com/apache/maven-release) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/apache/maven-release/releases)
- [Commits](https://github.com/apache/maven-release/compare/maven-release-3.0.0...maven-release-3.0.1)
---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-release-plugin
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-09 12:14:28 +01:00
Sebastian Nagel
6c0d91e40b
[Robots.txt] Deduplicate robots rules before matching (#416)
...
* [Robots.txt] Deduplicate robots rules before matching
- update SimpleRobotRules documentation: add references
to RFC 9309
* [Robots.txt] Deduplicate robots rules before matching
* SimpleRobotRules: add missing Override annotation
2023-06-09 09:10:06 +01:00
Julien Nioche
bfb5b9b067
Update code_coverage.yml
...
fixed name of secret
2023-06-08 17:26:41 +01:00
Julien Nioche
e881cdba46
Update README.md
...
use latest version in examples of using Maven or Gradle
2023-05-24 08:00:59 +01:00
Richard Zowalla
4663ca583b
#409 - Push Code Coverage to Coveralls (#414)
2023-05-24 07:54:57 +01:00
Sebastian Nagel
7421e5edb1
[Robots.txt] SimpleRobotRulesParser main to use the new API method (#413)
...
without splitting the agent name into tokens
2023-05-23 14:56:08 +01:00
Julien Nioche
d1211d6057
Generate JaCoCo reports when testing (#412)
...
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
2023-05-23 14:55:40 +01:00
dependabot[bot]
5246b69f80
Bump commons-io from 2.11.0 to 2.12.0
...
Bumps commons-io from 2.11.0 to 2.12.0.
---
updated-dependencies:
- dependency-name: commons-io:commons-io
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-05-23 15:19:25 +02:00
dependabot[bot]
e2322e2804
Bump maven-source-plugin from 3.2.1 to 3.3.0
...
Bumps [maven-source-plugin](https://github.com/apache/maven-source-plugin) from 3.2.1 to 3.3.0.
- [Commits](https://github.com/apache/maven-source-plugin/compare/maven-source-plugin-3.2.1...maven-source-plugin-3.3.0)
---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-source-plugin
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-05-23 15:18:48 +02:00
Sebastian Nagel
962787f4fd
Merge pull request #408 from sebastian-nagel/cc-195-robotstxt-url-decode
...
[Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
2023-05-23 15:17:43 +02:00
Sebastian Nagel
5d036a1963
[Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
...
- fix path matching for paths containing `*` or `$`
2023-05-12 14:19:35 +02:00
Sebastian Nagel
9559134438
[Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
...
- properly percent-encode allow/disallow paths
and URL paths during rule matching
- decode characters where necessary
- add unit tests
2023-05-12 11:42:34 +02:00
Sebastian Nagel
8bb1694669
Updates changelog for #192, #362, #383, #389 and merged dependabot pull requests
2023-05-11 16:52:23 +02:00
dependabot[bot]
1eefc10ce1
Bump maven-surefire-plugin from 3.0.0 to 3.1.0
...
Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.0.0 to 3.1.0.
- [Release notes](https://github.com/apache/maven-surefire/releases)
- [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.0.0...surefire-3.1.0)
---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-surefire-plugin
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-05-11 16:20:45 +02:00
dependabot[bot]
5c317d3c23
Bump maven-gpg-plugin from 3.0.1 to 3.1.0
...
Bumps [maven-gpg-plugin](https://github.com/apache/maven-gpg-plugin) from 3.0.1 to 3.1.0.
- [Commits](https://github.com/apache/maven-gpg-plugin/compare/maven-gpg-plugin-3.0.1...maven-gpg-plugin-3.1.0)
---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-gpg-plugin
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-05-11 16:20:16 +02:00
Sebastian Nagel
79bef97d40
Merge pull request #401 from sebastian-nagel/cc-389-allow-disallow-unicode-paths
...
[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters
2023-05-11 16:19:23 +02:00
dependabot[bot]
e691cec4cf
Bump junit.version from 5.9.2 to 5.9.3
...
Bumps `junit.version` from 5.9.2 to 5.9.3.
Updates `junit-jupiter-engine` from 5.9.2 to 5.9.3
- [Release notes](https://github.com/junit-team/junit5/releases)
- [Commits](https://github.com/junit-team/junit5/compare/r5.9.2...r5.9.3)
Updates `junit-jupiter-params` from 5.9.2 to 5.9.3
- [Release notes](https://github.com/junit-team/junit5/releases)
- [Commits](https://github.com/junit-team/junit5/compare/r5.9.2...r5.9.3)
---
updated-dependencies:
- dependency-name: org.junit.jupiter:junit-jupiter-engine
dependency-type: direct:development
update-type: version-update:semver-patch
- dependency-name: org.junit.jupiter:junit-jupiter-params
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-05-02 08:08:32 +02:00
dependabot[bot]
764ef96ea1
Bump download-maven-plugin from 1.6.8 to 1.7.0
...
Bumps [download-maven-plugin](https://github.com/maven-download-plugin/maven-download-plugin) from 1.6.8 to 1.7.0.
- [Release notes](https://github.com/maven-download-plugin/maven-download-plugin/releases)
- [Commits](https://github.com/maven-download-plugin/maven-download-plugin/compare/1.6.8...1.7.0)
---
updated-dependencies:
- dependency-name: com.googlecode.maven-download-plugin:download-maven-plugin
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-05-02 08:08:26 +02:00
dependabot[bot]
1291f5cddb
Bump forbiddenapis from 3.4 to 3.5.1
...
Bumps [forbiddenapis](https://github.com/policeman-tools/forbidden-apis) from 3.4 to 3.5.1.
- [Release notes](https://github.com/policeman-tools/forbidden-apis/releases)
- [Commits](https://github.com/policeman-tools/forbidden-apis/compare/3.4...3.5.1)
---
updated-dependencies:
- dependency-name: de.thetaphi:forbiddenapis
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-04-25 09:51:52 +02:00