mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-22 19:06:03 +02:00
Commit Graph

108 Commits

Author SHA1 Message Date
Julien Nioche 0dc699f3e9 Updated CHANGES pre-1.0 release 2019-03-19 22:04:31 +00:00
Sebastian Nagel e8b598b2e8 [Sitemaps] Unit tests depend on system timezone, fixes #238 (#239)
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
  add note that UTC is assumed if no time zone is contained in
  date string
2019-03-19 15:00:04 +00:00
Sebastian Nagel 40531efe25 EffectiveTldFinder: upgrade public suffix list, implements #219 (#235)
- upgrade the public suffix list to the latest version retrieved right now from
     https://publicsuffix.org/list/public_suffix_list.dat
  resp.
     802c469416/public_suffix_list.dat
2019-03-14 11:22:50 +00:00
Sebastian Nagel 0349fbe1f0 Update changelog for #144/#234 2019-03-14 10:29:17 +01:00
Sebastian Nagel eb74336bd3 Update changelog for #225 and #226 2019-02-21 23:01:32 +01:00
Sebastian Nagel 916415d262 Merge branch 'master' into cc-231-etld-invalid-idns 2019-02-21 22:16:48 +01:00
Sebastian Nagel 40b1c44d68 Update changelog for #231 2019-02-21 22:15:40 +01:00
Sebastian Nagel 67db8bf1be [Sitemaps] Trim Unicode whitespace around URLs, fixes #224 2019-02-20 16:27:16 +01:00
Sebastian Nagel 78e935f83b Update changelog for #213 2019-02-03 13:49:11 +01:00
Sebastian Nagel ab9e33a5f9 Update changelog for #220 and #221 2019-01-18 17:35:33 +01:00
Sebastian Nagel 862af9416f Sitemap extension support
- add extension support to SiteMapTester
- list extension attributes in SiteMapURL.toString()
- update change log
2018-09-28 12:14:02 +02:00
Sebastian Nagel 9318de951f Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (#217)
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
  but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests

* Fix Javadoc error and warnings, update change log

* Remove obsolete dependency to jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
  by updating to Java 8 date and time API
2018-09-24 10:09:58 +01:00
Ken Krugler a5c5091d64 Update CHANGES.txt 2018-07-31 17:14:27 -07:00
Julien Nioche 6213784e8b Updated README for 0.10 release 2018-06-07 09:20:43 +01:00
Julien Nioche 0da1b8b8b5 Minor changes + applied formatting pre 0.10 release 2018-06-05 11:33:27 +01:00
Julien Nioche 8195140e21 Update CHANGES.txt
added #211
2018-06-05 11:23:00 +01:00
Julien Nioche a8b474551a Update CHANGES.txt
Add main to SimpleRobotRulesParser for testing (#193)
2018-06-04 21:28:09 +01:00
Ken Krugler d99c034dd0 Merge branch 'master' into issue-134 2018-05-14 11:20:17 -07:00
Aecio Santos 47c2cad8b8 Add getters/setters and update CHANGES.txt 2018-05-14 12:00:02 -04:00
Aecio Santos aaa3113e55 Update CHANGES.txt 2018-05-14 11:51:43 -04:00
Aecio Santos fd1e7fcffe SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos 7bef14d386 Make RobotRules accessible #134
- Makes SimpleRobotRulesParser._rules property protected
  and adds getters for SimpleRobotRulesParser._rules and
  RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
  to SimpleRobotRules to allow access to concrete class without
  nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Julien Nioche e25309d26c Add JAX-B dependencies to POM (#207)
* Add JAX-B dependencies to POM, fixes #196

* mentioned in CHANGES.txt
2018-05-03 11:04:03 +01:00
Sebastian Nagel 7d3eccfa63 Add changelog entry and fix unit test 2018-04-25 14:06:33 +02:00
Sebastian Nagel 0ef7cf87fa Improve sitemap parsing
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel a9277acde2 Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel a6b3178fc7 Simplify MIME detection:
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel 907be2343f Format fix: add braces, complete CHANGES.txt 2018-04-16 13:36:06 +02:00
Ken Krugler 12155888bc Add reference to issue #199 2018-04-02 12:59:17 -07:00
Sebastian Nagel 49bf37c6d9 Update CHANGES.txt 2017-12-08 09:42:52 +01:00
Ken Krugler 2b58c5050c merge with master 2017-11-05 14:55:30 -08:00
Ken Krugler aeb0cb91a2 Update CHANGES.txt 2017-11-05 14:53:58 -08:00
Julien Nioche af0a013776 Released 0.9 2017-10-31 09:42:57 +00:00
Julien Nioche f3e37f37da Updated change log prior to 0.9 release 2017-10-27 11:08:47 +01:00
Sebastian Nagel 2afdf5b04d Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169
- completely add sitemap URLs from sitemap index if URL contains
  XML entities or CDATA
2017-08-12 17:28:08 +02:00
Ken Krugler 7e08c1da49 Update CHANGES.txt 2017-06-20 15:18:47 -07:00
Julien Nioche 694e74207b release notes for 0.8 2017-06-09 10:15:44 +01:00
Julien Nioche 2c72ba8708 Update CHANGES.txt
added tika 1.15 to changes
2017-06-02 15:02:20 +01:00
Sebastian Nagel 02e62c12cb Disable XML resolvers: update changelog 2017-05-04 22:36:40 +02:00
Julien Nioche 4ba1295c17 Update forbiddenapis to v2.0. Fixes #99 2017-03-20 15:58:55 +00:00
Sebastian Nagel 772f02fcb0 Fix parsing of gzipped text sitemaps, fixes #143
- detect gzip embedded media type to decide
  whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Sebastian Nagel f7c7cab7a8 Merge branch 'matt-deboer-master'
- provide SAX parser optionally to DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Sebastian Nagel 61a500ad21 Use constants from StandardCharsets where applicable, fixes #141 2017-02-02 14:59:14 +01:00
Sebastian Nagel 49b3097083 Increase size limit of sitemaps (10MB -> 50MB), fixes #132 2017-02-02 12:00:47 +01:00
Julien Nioche eefeda558c Update CHANGES.txt
#137
2017-01-13 17:31:10 +00:00
Sebastian Nagel cb38a5fc8f BasicURLNormalizer: NPE for URLs without authority
- check whether URL.getAuthority() returns null
- recompose URLs without authority with empty authority/host
2017-01-11 17:05:53 +01:00
Sebastian Nagel e39aa60373 BasicURLNormalizer to remove empty port 2016-12-09 14:54:00 +01:00
Ken Krugler 5783046f4e Issue #96
Remove fetcher support
2016-11-27 09:18:21 -08:00
Julien Nioche 1a6c1b0dce Released 0.7 2016-11-24 10:07:27 +00:00
Julien Nioche 9ea4f1b514 added ref to #126
and changed presentation of issue number
2016-09-30 12:33:06 +01:00