mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-25 11:46:03 +02:00
Commit Graph

157 Commits

Author SHA1 Message Date
Julien Nioche 0da1b8b8b5 Minor changes + applied formatting pre 0.10 release 2018-06-05 11:33:27 +01:00
Sebastian Nagel d98a3f14cf Allow for legacy URIs when checking sitemap namespaces (#211)
* Allow for legacy URIs when checking sitemap namespaces
- e.g., allow legacy namespace URI but ignore URLs
  from image and video sitemap extensions
- resolve relative namespace URIs
- add namespace URIs of sitemap extensions (news, images, videos)

* Address kkrugler's review comments:
- document addition of sitemap namespace required by sitemap
  protocol specification when calling setStrictNamespace(true)
- remove early return on <rss> root element
2018-06-05 11:20:26 +01:00
Sebastian Nagel 0c75e75496 Add main to SimpleRobotRulesParser for testing (#193)
* Add main to SimpleRobotRulesParser for testing
- implement toString() for robot rules
- fix line breaks in comments

* Do not detect MIME type as Tika dependency has been removed
2018-06-04 21:26:45 +01:00
Ken Krugler d99c034dd0 Merge branch 'master' into issue-134 2018-05-14 11:20:17 -07:00
Aecio Santos 47c2cad8b8 Add getters/setters and update CHANGES.txt 2018-05-14 12:00:02 -04:00
Aecio Santos fd1e7fcffe SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos 7bef14d386 Make RobotRules accessible #134
- Makes SimpleRobotRulesParser._rules property protected
  and adds getters for SimpleRobotRulesParser._rules and
  RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
  to SimpleRobotRules to allow access to concrete class without
  nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Sebastian Nagel 7d3eccfa63 Add changelog entry and fix unit test 2018-04-25 14:06:33 +02:00
Luc Boruta 355684ff20 Add methods to parse and iterate sitemap
SiteMapParser#walkSiteMap(URL,Consumer)
2018-04-25 14:04:58 +02:00
Sebastian Nagel 0ef7cf87fa Improve sitemap parsing
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel a9277acde2 Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel a6b3178fc7 Simplify MIME detection:
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel 6714ea57d6 Fix error message format string 2018-04-18 14:53:21 +02:00
Sebastian Nagel 72aa177738 RDF-based RSS feeds: map MIME type, detect from content 2018-04-18 14:44:10 +02:00
Sebastian Nagel 4780678af0 Improve logging of content type detection for gzip-compressed sitemaps 2018-04-18 14:43:52 +02:00
Sebastian Nagel 907be2343f Format fix: add braces, complete CHANGES.txt 2018-04-16 13:36:06 +02:00
Sebastian Nagel af084f7ee4 Sitemaps limit on "bad url" log messages, fixes #145
- degrade log level to debug for lines which are not valid
- only log first 1024 characters of line
2018-04-13 11:16:47 +02:00
Sebastian Nagel 34c19d811a Fix MIME detection for sitemaps:
- avoid NPE if no MIME type has been detected
- allow optional leading white space before MIME patterns
  (after optional BOM)
2018-04-10 19:17:16 +02:00
Ken Krugler 165888ba7e Use locale with String.format() 2018-03-28 14:40:55 -07:00
Ken Krugler 2cff74427d Remove Tika dependency 2018-03-27 15:40:50 -07:00
Sebastian Nagel 6de3f26da8 - add missing license header
- add debug log messages
- add unit test for invalid hostname (www..example..com)
2017-12-07 21:23:26 +01:00
Sebastian Nagel 95c6e3886a EffectiveTldFinder to parse Internationalized Domain Names (IDN),
fixes #179
- allow eTLDs to appear as IDN variants
  * one or more dot-separated parts as IDN,
  * optionally others as punycoded IDNA
- speed-up lookup of public suffixes / eTLDs using a suffix trie
- update public suffix list to recent version
- add unit tests to test for wildcard eTLDs
2017-12-06 21:47:01 +01:00
Sebastian Nagel 5c1ad13cea Add main() to EffectiveTldFinder 2017-12-01 10:54:03 +01:00
Sebastian Nagel f598af8d28 - address TODO of #184: do not use private suffixes
for PaidLevelDomain detection
- adapt unit tests: 'de.com' is a private suffix,
  consequently 'xxx.de.com' isn't a PaidLevelDomain
- rebase on master, apply code format
2017-11-06 14:03:35 +01:00
Sebastian Nagel cda1bdee82 Allow to include/exclude private domains / effective TLDs without re-initialization, fixes #185
- add optional flag excludePrivate to getAssignedDomain(...) / getEffectiveTLD(...)
- remove methods which exclude private eTLD during initialization
2017-11-06 11:48:30 +01:00
Ken Krugler 03163d0b3f Add pending test for private domains 2017-11-05 14:51:14 -08:00
Ken Krugler cb701a9da9 Use slf4j formatting support
And avoid forbidden api check for calling String.format without a
locale.
2017-10-28 09:59:28 -07:00
Ken Krugler c0a158d086 Fix minor typo 2017-10-27 13:53:28 -07:00
Ken Krugler 569ec9b8cd Use TLD for PLD code 2017-10-27 13:51:29 -07:00
Julien Nioche ee69049db0 Sitemap: removed DOM parser, fixes #177 (#181) 2017-10-27 07:48:22 +01:00
Sebastian Nagel d2de87cf92 Normalize RSS pubDate, fixes #166 for SiteMapParserSAX (#180) 2017-10-18 14:07:05 +01:00
Sebastian Nagel 0b70328671 EffectiveTldFinder improvements, fixes #172
- update public suffix list to recent version of
      https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid
      effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)
- improve parsing of public suffix list
2017-10-17 17:15:16 +02:00
Sebastian Nagel 8dea17e003 Add tests from publicsuffixlist.org 2017-10-17 13:29:14 +02:00
Sebastian Nagel 5e9d70fec6 EffectiveTldFinder: optionally exclude private domains (opposed to ICANN domains) 2017-10-17 13:29:14 +02:00
Sebastian Nagel bba44925a5 Update public suffix list to recent version of
https://publicsuffix.org/list/public_suffix_list.dat
2017-10-17 13:23:15 +02:00
Sebastian Nagel 01d675fc37 Incorrect domains returned by EffectiveTldFinder (#172)
- correctly recognize exceptions to wildcard rules as domains
- do not disallow TLDs with last element not being a TLD (e.g., .ac.za)
- partially fix IDNs: punycoded IDNs are now recognized
- add unit test for uppercase / mixed case host names
2017-10-17 13:23:15 +02:00
Julien Nioche 6adb771b72 Add namespace aware DOM/SAX parsing for XML Sitemaps (#176)
* Add namespace aware DOM/SAX parsing for XML Sitemaps.  RSS and Atom parsing is also namespace aware, but finding elements is left "relaxed" by only matching on the element "localName".

* Lenient namespacing in non strict mode + applied formatting

* Introduced separate field strictNamespace to sitemapparsers + added test to saxparser

* Fixes Javadoc

* Fixes the fix for the Javadoc

* Allow to set strictNamespace in SiteMapTester

- Fix strict namespace handling in SitemapParserSAX:
- pass strictNamespace from DelegatorHandler to delegates
- ignore text if inside an element of invalid namespace
- use SAX parser in unit test
- set exception and pass it to calling DelegatorHandler if namespace
  does not match
2017-10-17 10:47:17 +01:00
Sebastian Nagel 2afdf5b04d Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169
- completely add sitemap URLs from sitemap index if URL contains
  XML entities or CDATA
2017-08-12 17:28:08 +02:00
Ilya Kondratiev a49fba71b7 [#166] Fix.
* Create single ThreadLocal static field for all RSS date-time formats.
* Add messages to assertions.
2017-06-20 23:06:11 +03:00
Ilya Kondratiev b8aaffdf31 [ISSUE-166] RSS parser ignores item's pubDate
* Converts RSS timestamp format to w3c format
* Use pubDate of item in new SiteMapURL creation
2017-06-17 00:03:55 +03:00
Sebastian Nagel 4dd61ded84 SAX parser to stop URL at closing </loc> fixes #153
- on opening <loc> and <url> assume forgotten closing </url> tag
  only if there is more than white space in the buffer
2017-06-06 21:04:28 +02:00
Julien Nioche 7bf499a98a SAX parser to stop URL at closing </loc> fixes #153 2017-06-05 16:58:39 +01:00
Sebastian Nagel ce42b00150 Merge pull request #152 from sebastian-nagel/sitemaps-disable-xml-resolvers
Disable validation of XML sitemaps and install noop entity resolver, fixes #151
2017-05-04 22:33:35 +02:00
Sebastian Nagel 4f1534a6f4 SAX parser does not extract URLs from Atom feeds, fixes #159
- call delegated method startElement(...) from DelegatorHandler
- add heuristics to select link URL
- add Atom feed unit test
- add static method to encapsulate reading of test resource files
2017-03-28 11:54:04 +02:00
Sebastian Nagel 772f02fcb0 Fix parsing of gzipped text sitemaps, fixes #143
- detect gzip embedded media type to decide
  whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Sebastian Nagel 6759ee73e9 SAX parser to fail on valid but unknown XML formats, fixes #154 2017-03-08 14:23:55 +01:00
Sebastian Nagel 9b936ecca4 Disable validation of XML sitemaps and install noop entity resolver, fixes #151 2017-03-03 00:01:12 +01:00
Sebastian Nagel f7c7cab7a8 Merge branch 'matt-deboer-master'
- provide SAX parser optionally to DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Sebastian Nagel 960b7a8fed remove unused import 2017-02-02 15:18:05 +01:00
Sebastian Nagel 61a500ad21 Use constants from StandardCharsets where applicable, fixes #141 2017-02-02 14:59:14 +01:00