- port unit tests from https://github.com/google/robotstxt
- adapt "Google-only" unit tests dealing with overlong lines
and non-standard user-agent names
- adapt unit tests dealing with overlong lines and percent-encoded
URL paths where the behavior of SimpleRobotRulesParser is not
wrong and could even be seen as an improvement compared to
the restrictions put on API input params by the Google robots.txt parser
* [Robots.txt] Add more spelling variants and typos of robots.txt directives
- found in Google's RFC 9309 reference parser (google/robotstxt)
- and in real-world robots.txt files (Common Crawl)
- if we accept lines starting with http: as sitemap directives
we should nowadays also accept https: as such
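
A minimal sketch of the idea behind the last item, not the parser's actual code: a line that starts directly with http: or https: is treated like a sitemap directive. The class and method names are hypothetical.

```java
// Hypothetical helper, for illustration only.
class SitemapLineSketch {

    // Case-insensitive prefix check that never changes string length.
    static boolean startsWithIgnoreCase(String s, String prefix) {
        return s.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    /**
     * Returns the sitemap URL if the line is a "Sitemap:" directive or
     * starts directly with an http:/https: URL, otherwise null.
     */
    static String sitemapUrl(String line) {
        String trimmed = line.trim();
        if (startsWithIgnoreCase(trimmed, "http:")
                || startsWithIgnoreCase(trimmed, "https:")) {
            // bare URL line: accept both schemes as a sitemap reference
            return trimmed;
        }
        if (startsWithIgnoreCase(trimmed, "sitemap:")) {
            return trimmed.substring("sitemap:".length()).trim();
        }
        return null;
    }
}
```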
* Add Javadocs to some of the robots.txt extension directives
- must keep state whether Crawl-delay is already set for a specific agent
as separate variable
- add unit test to ensure that no already set Crawl-delay is overridden
by a (lower) value of another agent
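
The Crawl-delay bookkeeping described above could look roughly like the following sketch. All names are hypothetical, and it assumes that the first value seen for a specifically matched agent wins; the real parser may resolve repeated values differently.

```java
// Hypothetical state holder, for illustration only.
class CrawlDelayStateSketch {
    static final long UNSET = Long.MIN_VALUE;

    private long crawlDelay = UNSET;
    // separate flag: was the delay set by a group naming our agent explicitly?
    private boolean setForSpecificAgent = false;

    void onCrawlDelay(long delayMillis, boolean specificAgent) {
        if (specificAgent) {
            if (!setForSpecificAgent) {
                // a specific value replaces any earlier wildcard value
                crawlDelay = delayMillis;
                setForSpecificAgent = true;
            }
        } else if (!setForSpecificAgent && crawlDelay == UNSET) {
            // wildcard value is only used while nothing specific is known
            crawlDelay = delayMillis;
        }
    }

    long getCrawlDelay() {
        return crawlDelay;
    }
}
```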
* [Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes #114
- do not close rule blocks / groups on directives other than those specified
in RFC 9309: groups are only closed on a user-agent line if at least
one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
or set the value for a specific agent using a value defined for the
wildcard agent
any robot (wildcard user-agent name)
- in SimpleRobotRulesParser main()
- add unit test to verify that wildcard user-agent rules are selected
if empty collection of agent names is passed
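
The group-closing rule in this entry can be sketched as follows. This is a simplified stand-alone illustration with hypothetical names, not the SimpleRobotRulesParser implementation; it ignores trailing comments and rules that appear before any user-agent line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical grouping sketch, for illustration only.
class GroupingSketch {

    static final class Group {
        final List<String> agents = new ArrayList<>();
        final List<String> rules = new ArrayList<>();
    }

    static List<Group> parse(String robotsTxt) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            switch (key) {
            case "user-agent":
                // a new group starts only if the current one already has rules
                if (current == null || !current.rules.isEmpty()) {
                    current = new Group();
                    groups.add(current);
                }
                current.agents.add(value.toLowerCase(Locale.ROOT));
                break;
            case "allow":
            case "disallow":
                if (current != null) {
                    current.rules.add(key + ":" + value);
                }
                break;
            default:
                // Crawl-delay, Sitemap, etc. do not close the group
                break;
            }
        }
        return groups;
    }
}
```

In the input "User-agent: a / Crawl-delay: 5 / User-agent: b / Disallow: /x", both agents end up in the same group because no allow/disallow line separates the two user-agent lines.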
related to RFC 9309 compliance
- document effect of rules merging in combination with multiple agent names,
fixes #423
- document that rules addressed to the wildcard agent are followed
if none of the passed agent names matches - without any need to
pass the wildcard agent name as one of the agent names
- complete documentation
- use @inheritDoc to avoid duplicated documentation
- strip doc strings where inherited automatically by @Override
annotations
Unicode characters, fixes #389
- use UTF-8 as default input encoding of robots.txt files
- add unit test
- test matching of Unicode paths in allow/disallow directives
- test for proper matching of ASCII paths if encoding is not
UTF-8 (and no byte order mark present)
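
A rough sketch of the decoding behavior described above, assuming UTF-8 as the default and a byte order mark as the only encoding hint. The class name is hypothetical and the real parser may consider further signals.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical decoder sketch, for illustration only.
class RobotsEncodingSketch {

    static String decode(byte[] content) {
        Charset charset = StandardCharsets.UTF_8; // default encoding
        int offset = 0;
        int b0 = content.length > 0 ? content[0] & 0xFF : -1;
        int b1 = content.length > 1 ? content[1] & 0xFF : -1;
        int b2 = content.length > 2 ? content[2] & 0xFF : -1;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            offset = 3; // UTF-8 byte order mark
        } else if (b0 == 0xFE && b1 == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            offset = 2;
        } else if (b0 == 0xFF && b1 == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            offset = 2;
        }
        // malformed bytes become U+FFFD, ASCII characters are unaffected
        return new String(content, offset, content.length - offset, charset);
    }
}
```

Because bytes that are not valid UTF-8 are replaced individually, the ASCII parts of a path in a file with a different single-byte encoding still decode to the same characters.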
* RFC compliance: matching user-agent names when selecting rule blocks
- add unit test to verify that the rule with the completely
matched user-agent name is selected, and no partial prefix match
is preferred (cf. also #192)
* RFC compliance: matching user-agent names when selecting rule blocks
- refactor agent name matching and move splitting robotNames string
at comma into a separate method to be called once at the beginning
of parsing the robots.txt file
- extend the robots parser API and add a method to pass agent names
as a collection following RFC 9309 with no splitting of the
names into words/tokens.
- deprecate "old" method which splits the robot name into tokens and
performs prefix matching
- by default user-agent names are matched literally but case-insensitively
following RFC 9309. Add method to "restore" the prefix matching:
"setExactUserAgentMatching(false)"
- BaseRobotRulesParser: move the documented details about how
user-agent names are matched into SimpleRobotRulesParser
- unit tests: add tests for issues described in #192, configure exact
user-agent matching if required
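
The difference between the two matching modes might be sketched like this. It is a stand-alone illustration with hypothetical names; in particular, it assumes that legacy "prefix matching" means the name from robots.txt is a prefix of the crawler's name.

```java
import java.util.Collection;
import java.util.Locale;

// Hypothetical matcher sketch, for illustration only.
class AgentMatchSketch {

    /**
     * @param agentInRobotsTxt name taken from a User-agent line
     * @param robotNames       names the crawler identifies itself with
     * @param exact            true: literal, case-insensitive match;
     *                         false: legacy prefix match (assumed semantics)
     */
    static boolean matches(String agentInRobotsTxt,
            Collection<String> robotNames, boolean exact) {
        String ua = agentInRobotsTxt.toLowerCase(Locale.ROOT);
        for (String name : robotNames) {
            String robot = name.toLowerCase(Locale.ROOT);
            if (exact ? robot.equals(ua) : robot.startsWith(ua)) {
                return true;
            }
        }
        return false;
    }
}
```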
* RFC compliance: matching user-agent names when selecting rule blocks
- match user-agent product token at beginning of user-agent
line/statement followed by ignored non-token characters,
e.g. "foo" is matched in "User-agent: foo/1.2"
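
Extracting the product token from a user-agent line can be sketched as below, using the token pattern "[a-zA-Z_-]+" and ignoring whatever follows it. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical token extractor, for illustration only.
class ProductTokenSketch {

    // leading run of token characters; everything after it is ignored
    private static final Pattern TOKEN = Pattern.compile("^[a-zA-Z_-]+");

    /** Returns the lower-cased product token, or "" if there is none. */
    static String productToken(String userAgentValue) {
        Matcher m = TOKEN.matcher(userAgentValue.trim());
        return m.find() ? m.group().toLowerCase(Locale.ROOT) : "";
    }
}
```

With this sketch, "foo/1.2" yields "foo", while a name such as "Go!zilla" is cut at the first non-token character.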
* RFC compliance: matching user-agent names when selecting rule blocks
- match user-agent product tokens followed by ignored characters
also in legacy prefix matching mode, e.g. match "butterfly" in
"User-agent: Butterfly/1.0"
- refactor prefix matching: switch inner and outer loop, handle
check for (common) wild-card user-agent outside of loop
* RFC compliance: matching user-agent names when selecting rule blocks
- make exact user-agent matching the default in unit tests,
explicitly pass flag for legacy prefix user-agent matching
in unit tests where needed
- names not following the ua pattern in the specification "[a-zA-Z_-]+"
- user-agent lines with multiple user-agent names
* RFC compliance: matching user-agent names when selecting rule blocks
- make the method to handle prefix/partial user-agent product token
matches protected, so that it can be overridden to match non-standard
user-agent product tokens, e.g. "Go!zilla"