mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-26 20:06:07 +02:00
Commit Graph

251 Commits

Author SHA1 Message Date
Sebastian Nagel 6fb34cf856
Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (#360)
- port unit tests from https://github.com/google/robotstxt
- adapt "Google-only" unit tests dealing with overlong lines
  and non-standard user-agent names
- adapt unit tests dealing with overlong lines and percent-encoded
  URL paths where the behavior of SimpleRobotRulesParser is not
  wrong and could even be seen as an improvement compared to
  the restrictions put on API input params by the Google robots.txt parser
2023-07-12 15:28:59 +02:00
Sebastian Nagel 871e4e61d2
Merge pull request #430 from sebastian-nagel/cc-390-114-robots-closing-rule-group
[Robots.txt] Close groups of rules as defined in RFC 9309
2023-07-12 10:35:48 +02:00
Sebastian Nagel d685bafb2d
[Robots.txt] SimpleRobotRulesParser main() to follow five redirects (#428)
when fetching robots.txt over HTTP as required by RFC 9309
2023-07-11 14:49:00 +01:00
Sebastian Nagel de7221dafc
[Robots.txt] Empty disallow statement not to clear other rules, fixes #422 (#424) 2023-07-11 14:47:33 +01:00
Sebastian Nagel 7ae8617563
[Robots.txt] Add more spelling variants and typos of robots.txt directives (#425)
* [Robots.txt] Add more spelling variants and typos of robots.txt directives
- found in Google's RFC 9309 reference parser (google/robotstxt)
- and in real-world robots.txt files (Common Crawl)
- if we accept lines starting with http: as sitemap directives
  we should nowadays also accept https: as such

* Add Javadocs to some of the robots.txt extension directives
2023-07-11 14:46:07 +01:00
Sebastian Nagel e67299432c [Robots.txt] Clarify behavior when to close blocks of multiple user-agents
- must keep state whether Crawl-delay is already set for a specific agent
  as separate variable
- add unit test to ensure that no already set Crawl-delay is overridden
  by a (lower) value of another agent
2023-07-10 15:18:23 +02:00
Sebastian Nagel 17e8544980 [Robots.txt] Clarify behavior when to close blocks of multiple user-agents
- fix unit test broken by introducing compliance with RFC 9309
2023-07-10 12:59:40 +02:00
Sebastian Nagel 4524cfb5c0 [Robots.txt] Clarify behavior when to close blocks of multiple user-agents, closes #390
[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes #114
- do not close rule blocks / groups on other directives than specified
  in RFC 9309: groups are only closed on a user-agent line if at least
  one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
  or set the value for a specific agent using a value defined for the
  wildcard agent
2023-07-10 12:59:40 +02:00
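The group-closing rule described in this commit can be illustrated with a minimal sketch. It is not the code of SimpleRobotRulesParser: the class `GroupSketch` and its `parse` method are hypothetical names, comment stripping and Crawl-delay handling are left out, and only the condition for starting a new group is modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of closing rule groups as described in RFC 9309. */
public class GroupSketch {

    /** One group: the user-agents it addresses plus its allow/disallow lines. */
    public static final class Group {
        public final List<String> agents = new ArrayList<>();
        public final List<String> rules = new ArrayList<>();
    }

    public static List<Group> parse(String robotsTxt) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                // A user-agent line closes the open group only if that group
                // has already seen at least one allow/disallow line.
                if (current == null || !current.rules.isEmpty()) {
                    current = new Group();
                    groups.add(current);
                }
                current.agents.add(value.toLowerCase(Locale.ROOT));
            } else if (key.equals("allow") || key.equals("disallow")) {
                if (current != null) {
                    current.rules.add(key + ":" + value);
                }
            }
            // Any other directive (Crawl-delay, Sitemap, ...) leaves the group open.
        }
        return groups;
    }
}
```

With this sketch, a Crawl-delay line between two user-agent lines does not split them into separate groups: both agents share the rules that follow.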
Sebastian Nagel d710c85871 BaseRobotRules: Document that Crawl-delay is stored in milliseconds 2023-07-10 12:59:40 +02:00
Sebastian Nagel a3900425f3 [Robots.txt] Handle robots.txt with missing sections (and implicit master rules)
- add unit test to verify solution of #114
2023-07-10 12:59:15 +02:00
Sebastian Nagel 54498a0e5a [Robots.txt] Rename default user-agent / robot name in unit tests
- replace occurrences of the user-agent name supposed to match
  the wildcard user-agent rule group by "anybot"
2023-06-16 17:34:20 +02:00
Sebastian Nagel 99289f7835 [Robots.txt] Pass empty collection of agent names to select rules for
any robot (wildcard user-agent name)
- in SimpleRobotRulesParser main()
- add unit test to verify that wildcard user-agent rules are selected
  if empty collection of agent names is passed
2023-06-16 17:19:39 +02:00
Sebastian Nagel a5bd9645fa [Robots.txt] Update Javadoc to document changes in Robots.txt classes
related to RFC 9309 compliance
- document effect of rules merging in combination with multiple agent names,
  fixes #423
- document that rules addressed to the wildcard agent are followed
  if none of the passed agent names matches - without any need to
  pass the wildcard agent name as one of the agent names
- complete documentation
- use @inheritDoc to avoid duplicated documentation
- strip doc strings where inherited automatically by @Override
  annotations
2023-06-16 17:16:23 +02:00
Sebastian Nagel 6523fd29ed [Robots.txt] Add unit tests based on examples in RFC 9309 2023-06-13 14:01:49 +02:00
Sebastian Nagel e5563c3049 [BasicNormalizer] Query parameters normalization in BasicURLNormalizer,
closes #308
- add unit test to prove that an empty query is removed
2023-06-13 09:59:07 +02:00
Sebastian Nagel 6c0d91e40b
[Robots.txt] Deduplicate robots rules before matching (#416)
* [Robots.txt] Deduplicate robots rules before matching
- update SimpleRobotRules documentation: add references
  to RFC 9309

* [Robots.txt] Deduplicate robots rules before matching

* SimpleRobotRules: add missing Override annotation
2023-06-09 09:10:06 +01:00
Sebastian Nagel 7421e5edb1
[Robots.txt] SimpleRobotRulesParser main to use the new API method (#413)
without splitting the agent name into tokens
2023-05-23 14:56:08 +01:00
Sebastian Nagel 5d036a1963 [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
- fix path matching for paths containing `*` or `$`
2023-05-12 14:19:35 +02:00
Sebastian Nagel 9559134438 [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters
- properly percent-encode allow/disallow paths
  and URL paths during rule matching
- decode characters where necessary
- add unit tests
2023-05-12 11:42:34 +02:00
Sebastian Nagel a395cfee73 Add link to RFC 9309 to Javadoc class description 2023-04-24 17:36:08 +02:00
Sebastian Nagel be2d5c24d3 Fix line wrapping in comments 2023-04-24 17:27:16 +02:00
Sebastian Nagel 2c2cb3bf7a [Robots.txt] Handle allow/disallow directives containing unescaped
Unicode characters, fixes #389
- use UTF-8 as default input encoding of robots.txt files
- add unit test
  - test matching of Unicode paths in allow/disallow directives
  - test for proper matching of ASCII paths if encoding is not
    UTF-8 (and no byte order mark present)
2023-04-24 17:27:16 +02:00
Sebastian Nagel d8a6126365
[Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (#362)
* RFC compliance: matching user-agent names when selecting rule blocks
- add unit test to verify that the rule with the completely
  matched user-agent name is selected, and no partial prefix match
  is preferred (cf. also #192)

* RFC compliance: matching user-agent names when selecting rule blocks

- refactor agent name matching and move splitting robotNames string
  at comma into a separate method to be called once at the beginning
  of parsing the robots.txt file

- extend the robots parser API and add a method to pass agent names
  as a collection following the RFC 9309 with no splitting of the
  names into words/tokens.

- deprecate "old" method which splits the robot name into tokens and
  performs prefix matching

- by default user agent names are matched literally but case-insensitive
  following RFC 9309. Add method to "restore" the prefix matching:
  "setExactUserAgentMatching(false)"

- BaseRobotRulesParser: move the documented details about how
  user-agent names are matched into SimpleRobotRulesParser

- unit tests: add tests for issues described in #192, configure exact
  user-agent matching if required

* RFC compliance: matching user-agent names when selecting rule blocks
- match user-agent product token at beginning of user-agent
  line/statement followed by ignored non-token characters,
  e.g. "foo" is matched in "User-agent: foo/1.2"

* RFC compliance: matching user-agent names when selecting rule blocks
- match user-agent product tokens followed by ignored characters
  also in legacy prefix matching mode, e.g. match "butterfly" in
  "User-agent: Butterfly/1.0"
- refactor prefix matching: switch inner and outer loop, handle
  check for (common) wild-card user-agent outside of loop

* RFC compliance: matching user-agent names when selecting rule blocks
- make exact user-agent matching the default in unit tests,
  explicitly pass flag for legacy prefix user-agent matching
  in unit tests where needed
  - names not following the ua pattern in the specification "[a-zA-Z_-]+"
  - user-agent lines with multiple user-agent names

* RFC compliance: matching user-agent names when selecting rule blocks
- make the method to handle prefix/partial user-agent product token
  matches protected, so that it can be overridden to match non-standard
  user-agent product tokens, e.g. "Go!zilla"
2023-04-24 17:24:59 +02:00
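A rough sketch of the exact, case-insensitive matching described in this commit follows. The class `AgentMatchSketch` and its methods are hypothetical and do not reproduce the crawler-commons API; the sketch only assumes the product token pattern `[a-zA-Z_-]+` quoted above and ignores whatever follows the token on the user-agent line.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of literal, case-insensitive user-agent matching. */
public class AgentMatchSketch {

    // Product token at the beginning of the user-agent value.
    private static final Pattern PRODUCT_TOKEN = Pattern.compile("^[a-zA-Z_-]+");

    /** Returns the lower-cased leading product token, or "" if there is none. */
    public static String productToken(String userAgentValue) {
        Matcher m = PRODUCT_TOKEN.matcher(userAgentValue.trim());
        return m.find() ? m.group().toLowerCase(Locale.ROOT) : "";
    }

    /** True if the token of the user-agent line equals one of the agent names. */
    public static boolean matchesExactly(String userAgentValue, Collection<String> agentNames) {
        String token = productToken(userAgentValue);
        if (token.isEmpty()) {
            return false;
        }
        for (String name : agentNames) {
            if (token.equals(name.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch "foo" matches the line value "foo/1.2" because the characters after the token are ignored, while it does not match "foobar", since no prefix matching is done. A name such as "Go!zilla" falls outside the token pattern, which is the kind of case the commit leaves to an overridable method.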
Sebastian Nagel 3368cb53ef
Improve readability of robots.txt unit tests (#383)
- put lines of embedded robots.txt test files
  into separate code lines (except for empty lines)
- apply code formatting template
2022-10-06 13:26:13 +02:00
Eduardo Jimenez 4ad101cf0d Ran java formatter 2022-08-11 14:08:36 +02:00
Eduardo Jimenez 1f0e79b72a Improve robots check draft rfc compliance 2022-08-11 14:08:36 +02:00
Sebastian Nagel 23ee0634dc [Sitemaps] Disable support for DTDs in sitemaps by default
- update change log
- apply code formatting
- add support for parsing sitemaps with DTD in SiteMapTester
2022-03-02 16:03:13 +01:00
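For illustration, rejecting DTDs in a SAX parser can look like the following sketch. It uses the standard JAXP API together with the Xerces feature `disallow-doctype-decl`, which the JDK's built-in parser recognizes. The class `NoDtdSketch` and the `allowDtd` parameter are hypothetical and only mirror the idea of an option that defaults to "off"; this is not the SiteMapParser code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration of disabling DTD support in a SAX parser. */
public class NoDtdSketch {

    /** Returns true if the document parses, false if the parser rejects it. */
    public static boolean parses(String xml, boolean allowDtd) {
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // With this feature enabled, any DOCTYPE declaration is a fatal error.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", !allowDtd);
            parser = factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("XML parser cannot be configured", e);
        }
        try {
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // e.g. DOCTYPE found although DTDs are disallowed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A document carrying a `<!DOCTYPE ...>` declaration is refused when `allowDtd` is false and accepted when it is true; documents without a DOCTYPE parse in both cases.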
kennethwong-hc 273ac6ac7e Allow setting an option to allow DTDs, instead of a system setting 2022-03-02 13:15:13 +00:00
kennethwong-hc a2981962fc Add system property condition to allow DOCTYPE 2022-02-25 17:33:04 +00:00
kennethwong-hc c2a1bf3ee3 TEC-16207 Not allow doctype 2022-02-25 11:42:52 +00:00
Richard Zowalla 73552f0433 Removes provided dependency towards Servlet API (#368) 2022-02-22 19:27:20 +01:00
Richard Zowalla 5e922e4d9d Fixes two JavaDoc warnings 2021-10-19 14:09:58 +02:00
Sebastian Nagel 0493878f80
Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions (#326)
* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
(fixes #322)
- compare URL strings to avoid that java.net.URL::equals triggers unwanted and potentially slow
  DNS lookups to resolve the host part. Replace:
  - Objects::equals in equals methods of sitemap extensions
  - URL::equals and URL::hashCode in SiteMapIndex and SiteMapURL
- enable check for URL::equals and URL::hashCode in Forbidden API Checker

* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
- avoid NPEs in equals and hashCode methods

* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
- avoid NPE, return null as before if null is passed to SitemapIndex::getSitemap
2021-10-06 12:07:02 +03:00
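The idea of comparing URL strings instead of calling `java.net.URL::equals` can be sketched as below. The class `UrlEqualitySketch` and its helpers are hypothetical and are not the code of SiteMapIndex or SiteMapURL; the sketch only shows a null-safe comparison that never resolves host names.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical illustration of DNS-free equality and hashing for URLs. */
public class UrlEqualitySketch {

    /** Null-safe comparison of the string forms, no host name resolution. */
    public static boolean sameUrl(URL a, URL b) {
        if (a == null || b == null) {
            return a == b;
        }
        return a.toString().equals(b.toString());
    }

    /** Hash code consistent with sameUrl, also without DNS lookups. */
    public static int urlHash(URL u) {
        return u == null ? 0 : u.toString().hashCode();
    }

    /** Convenience helper wrapping the checked exception of the URL constructor. */
    public static URL url(String s) {
        try {
            return new URL(s);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Malformed URL: " + s, e);
        }
    }
}
```

The trade-off of this approach is that the comparison is purely textual: two URLs that differ only in the letter case of the host are treated as different, whereas `URL::equals` would try to resolve both hosts and compare the addresses.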
Sebastian Nagel 10d3021055 Add a builder API for configuring the BasicURLNormalizer
- allow to normalize host names to Unicode
2021-10-04 17:24:26 +02:00
Aécio Santos 12e2c389b2
Add a builder API for configuring the BasicURLNormalizer
Usage example:
```
normalizer = BasicURLNormalizer.newBuilder()
  .idnNormalization(IdnNormalization.PUNYCODE)
  .queryParamsToRemove(
    asList("sid", "phpsessid", "sessionid", "jsessionid")
  )
  .build();
```

Closes #321.
2021-10-04 10:15:09 -04:00
Sebastian Nagel 4841242390 Fix XXE vulnerability in Sitemap parser
- add unit test to verify that the parser is not vulnerable
  to XInclude attacks
- apply code formatter
- add changelog entry
2021-10-01 10:07:14 +02:00
Oleksiy Kovyrin 2b66ad2060 Do not use a temporary file 2021-09-30 17:38:35 -04:00
Oleksiy Kovyrin 7555bcbbbe Disable entity resolution features in Java SAX XML parser to avoid XXE vulnerabilities while parsing Sitemaps 2021-09-29 12:56:17 -04:00
Sebastian Nagel a10cf2540a Merge branch 'aecio:aecio/query-params-normalization', fixes #246, closes #309
- rebase to master and squash commits
- fix failing sitemaps unit tests with URL filtering using BasicURLNormalizer
  (sort query params in test sitemap)
- CHANGES.txt: updated to follow style, added missing entry for preceding commit
2021-09-21 12:34:39 +02:00
Aécio Santos 94bac65639 Query parameters normalization
- Sort query parameters (fix #246)
- Allows to (optionally) remove common irrelevant query parameters
2021-09-21 12:02:00 +02:00
Sebastian Nagel 386608f7e8 Allow to normalize URLs in sitemaps, resolves #305
- extend SiteMapParser by methods to register a URLFilter (function)
  used to normalize or filter (if null is returned) URLs found in
  sitemaps
- implement URL filtering in sitemap parsers / XML handlers
- add unit tests to verify URL filtering for text and XML sitemaps
2020-12-08 15:28:58 +01:00
Sebastian Nagel 295a2966fe BasicURLNormalizer: Normalize URL without a scheme, resolves #271
- prefix URL without a scheme with `http://`
- decrease log level for malformed URL (info -> debug)
2020-11-12 16:35:12 +01:00
Sebastian Nagel 643a17d533 [BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exceptions caused by invalid
  percent-encoded sequences are properly caught
2020-09-17 15:30:47 +02:00
Richard Zowalla 35f0638069 introduces download-maven-plugin to download the 'effective_tld_names.dat' during the maven build
ensures no cached files are used for the release profile
2020-08-06 14:47:58 +02:00
Sebastian Nagel e47e1536b0
[sitemaps] Replace priority "NaN" by default value, fixes #296 (#297)
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
2020-07-24 15:49:11 +01:00
Evan Halley 9affd1d992
Adding duration to the map generated by VideoAttributes.asMap (#301) 2020-07-24 15:46:59 +01:00
Sebastian Nagel e16d022dea [Robots] Maximum values (crawl-delay and warnings): document and make visible,
fixes #276
- change visibility (private -> public) of DEFAULT_MAX_CRAWL_DELAY and DEFAULT_MAX_WARNINGS
- document methods to set/get max crawl delay and max warnings
2020-07-15 17:47:32 +02:00
Sebastian Nagel 7f9fd49ba3 [SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors,
fixes #267
2020-07-15 15:49:43 +02:00
Julien Nioche 1447aa4799 JAVADOC fix in NewsAttributes 2020-06-25 09:49:08 +01:00
Sebastian Nagel 774c5c8092
Improvements to BasicURLNormalizer (#292)
- better percent-encoding of URL paths and queries, fixes #263
- hostnames:
  * convert IDNs from Unicode to Punycode, fixes #248
  * remove trailing dot
- normalize path `/..` to `/`
- also normalize path of file:/ URLs
2020-06-22 13:51:39 +01:00
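The two host name steps listed in this commit (Unicode to Punycode, trailing dot removal) can be sketched with the JDK's `java.net.IDN` class. The class `HostNormalizeSketch` and its method are hypothetical, and the sketch does not claim to reproduce how BasicURLNormalizer implements these steps.

```java
import java.net.IDN;

/** Hypothetical illustration of host name normalization. */
public class HostNormalizeSketch {

    /** Removes a trailing dot and converts an IDN from Unicode to Punycode. */
    public static String normalizeHost(String host) {
        String h = host;
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        // IDN.toASCII leaves plain ASCII host names unchanged and encodes
        // non-ASCII labels with the "xn--" prefix.
        return IDN.toASCII(h);
    }
}
```

For example, the host "b\u00fccher.de." (with u-umlaut and a trailing dot) becomes "xn--bcher-kva.de", and "example.com." becomes "example.com".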