crawler-commons

mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-12 16:36:02 +02:00

Author	SHA1	Message	Date
Sebastian Nagel	ce9cf46020	Update CHANGES.txt for release of crawler-commons 1.4	2023-07-13 11:29:43 +02:00
Sebastian Nagel	2b8717d9e5	[maven-release-plugin] prepare release crawler-commons-1.4	2023-07-13 10:30:08 +02:00
Sebastian Nagel	a62bd80140	Updates changelog for #376, #380, #401, #414, #425, #428, #422/#424, #114/#390/#430, #245/#360	2023-07-12 16:16:30 +02:00
Sebastian Nagel	6fb34cf856	Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (#360) - port unit tests from https://github.com/google/robotstxt - adapt "Google-only" unit tests dealing with overlong lines and non-standard user-agent names - adapt unit tests dealing with overlong lines and percent-encoded URL paths where the behavior of SimpleRobotRulesParser is not wrong and could even be seen as an improvement compared to the restrictions put on API input params by the Google robots.txt parser	2023-07-12 15:28:59 +02:00
Sebastian Nagel	871e4e61d2	Merge pull request #430 from sebastian-nagel/cc-390-114-robots-closing-rule-group [Robots.txt] Close groups of rules as defined in RFC 9309	2023-07-12 10:35:48 +02:00
Sebastian Nagel	d685bafb2d	[Robots.txt] SimpleRobotRulesParser main() to follow five redirects (#428) when fetching robots.txt over HTTP as required by RFC 9309	2023-07-11 14:49:00 +01:00
Sebastian Nagel	de7221dafc	[Robots.txt] Empty disallow statement not to clear other rules, fixes #422 (#424)	2023-07-11 14:47:33 +01:00
Sebastian Nagel	7ae8617563	[Robots.txt] Add more spelling variants and typos of robots.txt directives (#425) * [Robots.txt] Add more spelling variants and typos of robots.txt directives - found in Google's RFC 9309 reference parser (google/robotstxt) - and in real-world robots.txt files (Common Crawl) - if we accept lines starting with http: as sitemap directives we should nowadays also accept https: as such * Add Javadocs to some of the robots.txt extension directives	2023-07-11 14:46:07 +01:00
Sebastian Nagel	e67299432c	[Robots.txt] Clarify behavior when to close blocks of multiple user-agents - must keep state whether Crawl-delay is already set for a specific agent as separate variable - add unit test to ensure that no already set Crawl-delay is overridden by a (lower) value of another agent	2023-07-10 15:18:23 +02:00
Sebastian Nagel	17e8544980	[Robots.txt] Clarify behavior when to close blocks of multiple user-agents - fix unit test broken by introducing compliance with RFC 9309	2023-07-10 12:59:40 +02:00
Sebastian Nagel	4524cfb5c0	[Robots.txt] Clarify behavior when to close blocks of multiple user-agents, closes #390 [Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes #114 - do not close rule blocks / groups on other directives than specified in RFC 9309: groups are only closed on a user-agent line if at least one allow/disallow line was read before - set Crawl-delay independently from grouping, but never override or set the value for a specific agent using a value defined for the wildcard agent	2023-07-10 12:59:40 +02:00
Sebastian Nagel	d710c85871	BaseRobotRules: Document that Crawl-delay is stored in milliseconds	2023-07-10 12:59:40 +02:00
Sebastian Nagel	a3900425f3	[Robots.txt] Handle robots.txt with missing sections (and implicit master rules) - add unit test to verify solution of #114	2023-07-10 12:59:15 +02:00
Sebastian Nagel	86109c029a	Updates changelog for #423/#426, #427, #429	2023-07-10 10:24:43 +02:00
Sebastian Nagel	54498a0e5a	[Robots.txt] Rename default user-agent / robot name in unit tests - replace occurrences of the user-agent name supposed to match the wildcard user-agent rule group by "anybot"	2023-06-16 17:34:20 +02:00
Sebastian Nagel	99289f7835	[Robots.txt] Pass empty collection of agent names to select rules for any robot (wildcard user-agent name) - in SimpleRobotRulesParser main() - add unit test to verify that wildcard user-agent rules are selected if empty collection of agent names is passed	2023-06-16 17:19:39 +02:00
Sebastian Nagel	a5bd9645fa	[Robots.txt] Update Javadoc to document changes in Robots.txt classes related to RFC 9309 compliance - document effect of rules merging in combination with multiple agent names, fixes #423 - document that rules addressed to the wildcard agent are followed if none of the passed agent names matches - without any need to pass the wildcard agent name as one of the agent names - complete documentation - use @inheritDoc to avoid duplicated documentation - strip doc strings where inherited automatically by @Override annotations	2023-06-16 17:16:23 +02:00
Sebastian Nagel	0eb2c74294	Updates changelog for #195/#408, #409/#412, #413, #416, #420 and merged dependabot pull requests	2023-06-13 14:24:13 +02:00
Sebastian Nagel	6523fd29ed	[Robots.txt] Add unit tests based on examples in RFC 9309	2023-06-13 14:01:49 +02:00
Sebastian Nagel	7a95069f0e	Merge pull request #421 from sebastian-nagel/cc-308-url-normalizer-empty-query [BasicNormalizer] Query parameters normalization in BasicURLNormalizer	2023-06-13 13:56:52 +02:00
dependabot[bot]	0656e9c561	Bump maven-surefire-plugin from 3.1.0 to 3.1.2 Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.1.0 to 3.1.2. - [Release notes](https://github.com/apache/maven-surefire/releases) - [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.1.0...surefire-3.1.2) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-surefire-plugin dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2023-06-13 10:43:59 +02:00
dependabot[bot]	6b5c0a5699	Bump commons-io from 2.12.0 to 2.13.0 Bumps commons-io from 2.12.0 to 2.13.0. --- updated-dependencies: - dependency-name: commons-io:commons-io dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-06-13 10:43:49 +02:00
Richard Zowalla	ee620d489e	Align JDK dist with the dist we are using for actual test/build	2023-06-13 10:41:38 +02:00
Sebastian Nagel	e5563c3049	[BasicNormalizer] Query parameters normalization in BasicURLNormalizer, closes #308 - add unit test to prove that an empty query is removed	2023-06-13 09:59:07 +02:00
dependabot[bot]	9261174c6c	Bump maven-release-plugin from 3.0.0 to 3.0.1 (#415) Bumps [maven-release-plugin](https://github.com/apache/maven-release) from 3.0.0 to 3.0.1. - [Release notes](https://github.com/apache/maven-release/releases) - [Commits](https://github.com/apache/maven-release/compare/maven-release-3.0.0...maven-release-3.0.1) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-release-plugin dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-06-09 12:14:28 +01:00
Sebastian Nagel	6c0d91e40b	[Robots.txt] Deduplicate robots rules before matching (#416) * [Robots.txt] Deduplicate robots rules before matching - update SimpleRobotRules documentation: add references to RFC 9309 * [Robots.txt] Deduplicate robots rules before matching * SimpleRobotRules: add missing Override annotation	2023-06-09 09:10:06 +01:00
Julien Nioche	bfb5b9b067	Update code_coverage.yml fixed name of secret	2023-06-08 17:26:41 +01:00
Julien Nioche	e881cdba46	Update README.md use latest version in examples of using Maven or Gradle	2023-05-24 08:00:59 +01:00
Richard Zowalla	4663ca583b	#409 - Push Code Coverage to Coveralls (#414)	2023-05-24 07:54:57 +01:00
Sebastian Nagel	7421e5edb1	[Robots.txt] SimpleRobotRulesParser main to use the new API method (#413) without splitting the agent name into tokens	2023-05-23 14:56:08 +01:00
Julien Nioche	d1211d6057	Generate JaCoCo reports when testing (#412) Signed-off-by: Julien Nioche <julien@digitalpebble.com>	2023-05-23 14:55:40 +01:00
dependabot[bot]	5246b69f80	Bump commons-io from 2.11.0 to 2.12.0 Bumps commons-io from 2.11.0 to 2.12.0. --- updated-dependencies: - dependency-name: commons-io:commons-io dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-05-23 15:19:25 +02:00
dependabot[bot]	e2322e2804	Bump maven-source-plugin from 3.2.1 to 3.3.0 Bumps [maven-source-plugin](https://github.com/apache/maven-source-plugin) from 3.2.1 to 3.3.0. - [Commits](https://github.com/apache/maven-source-plugin/compare/maven-source-plugin-3.2.1...maven-source-plugin-3.3.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-source-plugin dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-05-23 15:18:48 +02:00
Sebastian Nagel	962787f4fd	Merge pull request #408 from sebastian-nagel/cc-195-robotstxt-url-decode [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters	2023-05-23 15:17:43 +02:00
Sebastian Nagel	5d036a1963	[Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters - fix path matching for paths containing `*` or `$`	2023-05-12 14:19:35 +02:00
Sebastian Nagel	9559134438	[Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters - properly percent-encode allow/disallow paths and URL paths during rule matching - decode characters where necessary - add unit tests	2023-05-12 11:42:34 +02:00
Sebastian Nagel	8bb1694669	Updates changelog for #192, #362, #383, #389 and merged dependabot pull requests	2023-05-11 16:52:23 +02:00
dependabot[bot]	1eefc10ce1	Bump maven-surefire-plugin from 3.0.0 to 3.1.0 Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 3.0.0 to 3.1.0. - [Release notes](https://github.com/apache/maven-surefire/releases) - [Commits](https://github.com/apache/maven-surefire/compare/surefire-3.0.0...surefire-3.1.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-surefire-plugin dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-05-11 16:20:45 +02:00
dependabot[bot]	5c317d3c23	Bump maven-gpg-plugin from 3.0.1 to 3.1.0 Bumps [maven-gpg-plugin](https://github.com/apache/maven-gpg-plugin) from 3.0.1 to 3.1.0. - [Commits](https://github.com/apache/maven-gpg-plugin/compare/maven-gpg-plugin-3.0.1...maven-gpg-plugin-3.1.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-gpg-plugin dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-05-11 16:20:16 +02:00
Sebastian Nagel	79bef97d40	Merge pull request #401 from sebastian-nagel/cc-389-allow-disallow-unicode-paths [Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters	2023-05-11 16:19:23 +02:00
dependabot[bot]	e691cec4cf	Bump junit.version from 5.9.2 to 5.9.3 Bumps `junit.version` from 5.9.2 to 5.9.3. Updates `junit-jupiter-engine` from 5.9.2 to 5.9.3 - [Release notes](https://github.com/junit-team/junit5/releases) - [Commits](https://github.com/junit-team/junit5/compare/r5.9.2...r5.9.3) Updates `junit-jupiter-params` from 5.9.2 to 5.9.3 - [Release notes](https://github.com/junit-team/junit5/releases) - [Commits](https://github.com/junit-team/junit5/compare/r5.9.2...r5.9.3) --- updated-dependencies: - dependency-name: org.junit.jupiter:junit-jupiter-engine dependency-type: direct:development update-type: version-update:semver-patch - dependency-name: org.junit.jupiter:junit-jupiter-params dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2023-05-02 08:08:32 +02:00
dependabot[bot]	764ef96ea1	Bump download-maven-plugin from 1.6.8 to 1.7.0 Bumps [download-maven-plugin](https://github.com/maven-download-plugin/maven-download-plugin) from 1.6.8 to 1.7.0. - [Release notes](https://github.com/maven-download-plugin/maven-download-plugin/releases) - [Commits](https://github.com/maven-download-plugin/maven-download-plugin/compare/1.6.8...1.7.0) --- updated-dependencies: - dependency-name: com.googlecode.maven-download-plugin:download-maven-plugin dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-05-02 08:08:26 +02:00
dependabot[bot]	1291f5cddb	Bump forbiddenapis from 3.4 to 3.5.1 Bumps [forbiddenapis](https://github.com/policeman-tools/forbidden-apis) from 3.4 to 3.5.1. - [Release notes](https://github.com/policeman-tools/forbidden-apis/releases) - [Commits](https://github.com/policeman-tools/forbidden-apis/compare/3.4...3.5.1) --- updated-dependencies: - dependency-name: de.thetaphi:forbiddenapis dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-04-25 09:51:52 +02:00
dependabot[bot]	a980ae10da	Bump maven-surefire-plugin from 2.22.2 to 3.0.0 Bumps [maven-surefire-plugin](https://github.com/apache/maven-surefire) from 2.22.2 to 3.0.0. - [Release notes](https://github.com/apache/maven-surefire/releases) - [Commits](https://github.com/apache/maven-surefire/compare/surefire-2.22.2...surefire-3.0.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-surefire-plugin dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2023-04-25 09:51:42 +02:00
Sebastian Nagel	a395cfee73	Add link to RFC 9309 to Javadoc class description	2023-04-24 17:36:08 +02:00
Sebastian Nagel	be2d5c24d3	Fix line wrapping in comments	2023-04-24 17:27:16 +02:00
Sebastian Nagel	2c2cb3bf7a	[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters, fixes #389 - use UTF-8 as default input encoding of robots.txt files - add unit test - test matching of Unicode paths in allow/disallow directives - test for proper matching of ASCII paths if encoding is not UTF-8 (and no byte order mark present)	2023-04-24 17:27:16 +02:00
Sebastian Nagel	d8a6126365	[Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (#362) * RFC compliance: matching user-agent names when selecting rule blocks - add unit test to verify that the rule with the completely matched user-agent name is selected, and no partial prefix match is preferred (cf. also #192) * RFC compliance: matching user-agent names when selecting rule blocks - refactor agent name matching and move splitting robotNames string at comma into a separate method to be called once at the beginning of parsing the robots.txt file - extend the robots parser API and add a method to pass agent names as a collection following the RFC 9309 with no splitting of the names into words/tokens. - deprecate "old" method which splits the robot name into tokens and performs prefix matching - by default user agent names are matched literally but case-insensitive following RFC 9309. Add method to "restore" the prefix matching: "setExactUserAgentMatching(false)" - BaseRobotRulesParser: move the documented details about how user-agent names are matched into SimpleRobotRulesParser - unit tests: add tests for issues described in #192, configure exact user-agent matching if required * RFC compliance: matching user-agent names when selecting rule blocks - match user-agent product token at beginning of user-agent line/statement followed by ignored non-token characters, e.g. "foo" is matched in "User-agent: foo/1.2" * RFC compliance: matching user-agent names when selecting rule blocks - match user-agent product tokens followed by ignored characters also in legacy prefix matching mode, e.g. match "butterfly" in "User-agent: Butterfly/1.0" - refactor prefix matching: switch inner and outer loop, handle check for (common) wild-card user-agent outside of loop * RFC compliance: matching user-agent names when selecting rule blocks - make exact user-agent matching the default in unit tests, explicitly pass flag for legacy prefix user-agent matching in unit tests where needed - names not following the ua pattern in the specification "[a-zA-Z_-]+" - user-agent lines with multiple user-agent names * RFC compliance: matching user-agent names when selecting rule blocks - make the method to handle prefix/partial user-agent product token matches protected, so that it can be overridden to match non-standard user-agent product tokens, e.g. "Go!zilla"	2023-04-24 17:24:59 +02:00
dependabot[bot]	f2982c5d11	Bump maven-deploy-plugin from 3.0.0 to 3.1.1 Bumps [maven-deploy-plugin](https://github.com/apache/maven-deploy-plugin) from 3.0.0 to 3.1.1. - [Release notes](https://github.com/apache/maven-deploy-plugin/releases) - [Commits](https://github.com/apache/maven-deploy-plugin/compare/maven-deploy-plugin-3.0.0...maven-deploy-plugin-3.1.1) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-deploy-plugin dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-04-20 17:07:56 +02:00
dependabot[bot]	a0bf1c9167	Bump slf4j-api from 1.7.36 to 2.0.7 Bumps [slf4j-api](https://github.com/qos-ch/slf4j) from 1.7.36 to 2.0.7. - [Release notes](https://github.com/qos-ch/slf4j/releases) - [Commits](https://github.com/qos-ch/slf4j/commits) --- updated-dependencies: - dependency-name: org.slf4j:slf4j-api dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2023-04-20 17:07:39 +02:00
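
Commit d8a6126365 above describes matching user-agent names literally but case-insensitively, using the product token at the start of a User-agent value (for example, "foo" is matched in "User-agent: foo/1.2"). The following is a minimal, self-contained sketch of that idea in plain Java. It is not the crawler-commons implementation: the class name `UserAgentMatchSketch` and the methods `productToken` and `matchesExact` are hypothetical names chosen for illustration. Only the token pattern `[a-zA-Z_-]+` and the two examples are taken from the commit message. Wildcard (`*`) handling and the legacy prefix mode are left out.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the crawler-commons API.
class UserAgentMatchSketch {

    // Product token pattern quoted in the commit message: [a-zA-Z_-]+
    private static final Pattern PRODUCT_TOKEN = Pattern.compile("^[a-zA-Z_-]+");

    /**
     * Returns the lowercased product token at the start of a User-agent
     * value, ignoring any trailing non-token characters such as "/1.2".
     * Returns an empty string if the value does not start with a token.
     */
    static String productToken(String userAgentValue) {
        Matcher m = PRODUCT_TOKEN.matcher(userAgentValue.trim());
        return m.find() ? m.group().toLowerCase(Locale.ROOT) : "";
    }

    /**
     * Returns true if any of the given robot names equals the product token
     * of the User-agent value, compared case-insensitively and as a whole
     * (no prefix or partial matching).
     */
    static boolean matchesExact(Collection<String> robotNames, String userAgentValue) {
        String token = productToken(userAgentValue);
        if (token.isEmpty()) {
            return false;
        }
        for (String name : robotNames) {
            if (name.toLowerCase(Locale.ROOT).equals(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesExact(List.of("foo"), "foo/1.2"));             // true
        System.out.println(matchesExact(List.of("butterfly"), "Butterfly/1.0")); // true
        System.out.println(matchesExact(List.of("foo"), "foobot"));              // false
    }
}
```

In this sketch a name such as "Go!zilla" never matches, because "!" is outside the token pattern. The commit log notes that the library makes its prefix-matching method protected so that such non-standard names can be handled by overriding it.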

608 Commits (All Branches)