mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-28 20:36:05 +02:00
Commit Graph

128 Commits

Author SHA1 Message Date
Sebastian Nagel 78d7e7e85f
Sitemaps to implement Serializable, fixes #244 (#294)
- make all sitemap classes including extensions to implement the
  Serializable interface
- extend sitemap parser unit tests to check object serialization
  on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
  etc.)
2020-06-22 12:51:40 +01:00
Sebastian Nagel e9772be011
[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 (#293)
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
2020-06-19 13:15:48 +01:00
Julien Nioche d4d8d6531c
removed duplicate entry 2020-06-16 14:41:37 +01:00
Evan Halley ed0933f2b3
fixing NewsAttribute.equals(), comparing against that.publicationDate, updating the unit test, added change to Changes.txt (#291) 2020-06-16 14:39:05 +01:00
Julien Nioche 5d30196ffb Added #288 to CHANGES 2020-06-15 16:02:12 +01:00
Evan Halley c04e3f17e7
Adding asMap to ExtensionMetadata Interface (#288)
* added abstract method to extension metadata

* implemented asmap in image/link/mobile/news attributes

* implemented asmap in videoattributes

* adding video attributes unit test

* added news attributes unit tests

* unit test for link attributes

* unit tests for image and mobile attributes

* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString

* using constants instead of strings in more attributes

* cleaned up the imports

* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage

* added related issue to the changelog

* reverting change to NewsAttributes.equal, that causes a unit test failure
2020-06-15 15:55:20 +01:00
Avi Hayun cb21c29a57 Upgrade the toString() method of the Base/simple RobotRules #264 2020-04-29 20:00:43 +03:00
Avi Hayun 89ccfc308b
Merge branch 'master' into master 2020-04-29 08:54:15 +03:00
Sebastian Nagel ac3fd3a464 Robots parser: deduplicate sitemap links, fixes #261
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
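The deduplication described in the commit above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name SitemapDedupSketch and the method dedupe are hypothetical. It only shows why a LinkedHashSet fits here: duplicates are dropped while the order of first occurrence is preserved.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of crawler-commons.
class SitemapDedupSketch {

    // Returns the sitemap URLs without duplicates, keeping first-seen order.
    static List<String> dedupe(List<String> sitemapUrls) {
        // LinkedHashSet ignores repeated elements and iterates in insertion order.
        Set<String> unique = new LinkedHashSet<>(sitemapUrls);
        return new ArrayList<>(unique);
    }
}
```

A plain HashSet would also remove duplicates, but it would not guarantee that the sitemaps come back in the order they were listed in robots.txt.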
Sebastian Nagel c070453c5e EffectiveTldFinder to log loading of public suffix list, fixes #284
- log the location of the public suffix list (effective_tld_names.dat)
  during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel 8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 (#286)
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
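The null check named in the commit above might look roughly like this. The class NullSafeDateSketch and its method toDate are assumptions made for illustration; the real VideoAttributes code may be structured differently.

```java
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper, not part of crawler-commons.
class NullSafeDateSketch {

    // Converts a ZonedDateTime to a legacy Date, passing null through
    // instead of throwing a NullPointerException.
    static Date toDate(ZonedDateTime zdt) {
        if (zdt == null) {
            return null;
        }
        return Date.from(zdt.toInstant());
    }
}
```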
Sebastian Nagel f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 (#282)
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
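Clipping a string before it is logged, as the commit above describes, can be sketched like this. The names LogClipSketch, MAX_LOG_CHARS and clip are hypothetical, and whether the actual parser appends a marker such as an ellipsis after clipping is not stated in the commit message, so none is added here.

```java
// Hypothetical helper, not part of crawler-commons.
class LogClipSketch {

    // Limit taken from the commit message ("1024 chars").
    static final int MAX_LOG_CHARS = 1024;

    // Returns the input unchanged if it is short enough,
    // otherwise only its first MAX_LOG_CHARS characters.
    static String clip(String s) {
        if (s == null || s.length() <= MAX_LOG_CHARS) {
            return s;
        }
        return s.substring(0, MAX_LOG_CHARS);
    }
}
```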
Sebastian Nagel 6c271f5fab SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
- move the counter _numWarnings into the ParseState object local to
  parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
  of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
  log messages beforehand
2020-01-08 13:55:46 +01:00
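The pattern from the commit above (a counter kept in a per-call state object, with the final value published through a ThreadLocal) can be sketched as below. Everything here is a simplified assumption: WarningCounterSketch, its ParseState, and the rule that a non-empty line without a colon counts as a warning are illustrative only and do not reproduce SimpleRobotRulesParser.

```java
// Hypothetical sketch, not the crawler-commons implementation.
class WarningCounterSketch {

    // State local to one parse call, so concurrent calls never share a counter.
    private static class ParseState {
        int numWarnings;
    }

    // Holds the warning count of the most recent parse on the current thread.
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    // Toy "parser": every non-empty line without a ':' counts as one warning.
    void parseContent(String content) {
        ParseState state = new ParseState();
        for (String line : content.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.contains(":")) {
                state.numWarnings++;
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    // Returns the count produced by the last parseContent call on this thread.
    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

With this layout a single parser instance can be shared across threads, and each thread reads back only the count from its own last parse.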
Sebastian Nagel f7cc8b694a Make mvn run parameterized tests, fixes #273
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel 7419e96d74 [BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
fixes #247
2019-11-12 13:32:00 +01:00
Sebastian Nagel 2979aaf4aa [BasicNormalizer] Trailing question mark in url query test should be added,
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel e38bb95b4c Update CHANGES.txt to reflect latest code contributions 2019-10-15 16:34:00 +02:00
Avi Hayun f571f4e8c7 Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:38:57 +03:00
Sebastian Nagel b449fdf024 Robots parser to always handle absolute sitemap URL even without valid base URL,
fixes #240
2019-03-26 15:05:12 +01:00
Julien Nioche 651ce08afe CHANGES ready for next cycle of improvements 2019-03-21 21:02:10 +00:00
Julien Nioche 0dc699f3e9 Updated CHANGES pre-1.0 release 2019-03-19 22:04:31 +00:00
Sebastian Nagel e8b598b2e8 [Sitemaps] Unit tests depend on system timezone, fixes #238 (#239)
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
  add note that UTC is assumed if no time zone is contained in
  date string
2019-03-19 15:00:04 +00:00
Sebastian Nagel 40531efe25 EffectiveTldFinder: upgrade public suffix list, implements #219 (#235)
- upgrade the public suffix list to the latest version retrieved right now from
     https://publicsuffix.org/list/public_suffix_list.dat
  resp.
     802c469416/public_suffix_list.dat
2019-03-14 11:22:50 +00:00
Sebastian Nagel 0349fbe1f0 Update changelog for #144/#234 2019-03-14 10:29:17 +01:00
Sebastian Nagel eb74336bd3 Update changelog for #225 and #226 2019-02-21 23:01:32 +01:00
Sebastian Nagel 916415d262
Merge branch 'master' into cc-231-etld-invalid-idns 2019-02-21 22:16:48 +01:00
Sebastian Nagel 40b1c44d68 Update changelog for #231 2019-02-21 22:15:40 +01:00
Sebastian Nagel 67db8bf1be [Sitemaps] Trim Unicode whitespace around URLs, fixes #224 2019-02-20 16:27:16 +01:00
Sebastian Nagel 78e935f83b Update changelog for #213 2019-02-03 13:49:11 +01:00
Sebastian Nagel ab9e33a5f9 Update changelog for #220 and #221 2019-01-18 17:35:33 +01:00
Sebastian Nagel 862af9416f Sitemap extension support
- add extension support to SiteMapTester
- list extension attributes in SiteMapURL.toString()
- update change log
2018-09-28 12:14:02 +02:00
Sebastian Nagel 9318de951f Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (#217)
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
  but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests

* Fix Javadoc error and warnings, update change log

* Remove obsolete dependency to jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
  by updating to Java 8 date and time API
2018-09-24 10:09:58 +01:00
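The switch described in the commit above relies on DateTimeFormatter being immutable and thread-safe, so one shared instance can replace a ThreadLocal<DateFormat>. A minimal sketch, assuming a hypothetical class SitemapDateSketch that returns null on unparsable input (the real parser's error handling may differ):

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of crawler-commons.
class SitemapDateSketch {

    // Safe to share between threads; no ThreadLocal wrapper needed.
    private static final DateTimeFormatter FORMATTER = DateTimeFormatter.ISO_OFFSET_DATE_TIME;

    // Parses a date-time with offset; seconds are optional in this format.
    // Returns null if the string does not match.
    static ZonedDateTime parse(String value) {
        try {
            return ZonedDateTime.parse(value, FORMATTER);
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

Because ISO_OFFSET_DATE_TIME treats the seconds field as optional, a value such as 2018-09-24T10:09+01:00 parses without a separate regex for that case.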
Ken Krugler a5c5091d64 Update CHANGES.txt 2018-07-31 17:14:27 -07:00
Julien Nioche 6213784e8b Updated README for 0.10 release 2018-06-07 09:20:43 +01:00
Julien Nioche 0da1b8b8b5 Minor changes + applied formatting pre 0.10 release 2018-06-05 11:33:27 +01:00
Julien Nioche 8195140e21
Update CHANGES.txt
added #211
2018-06-05 11:23:00 +01:00
Julien Nioche a8b474551a
Update CHANGES.txt
Add main to SimpleRobotRulesParser for testing (#193)
2018-06-04 21:28:09 +01:00
Ken Krugler d99c034dd0
Merge branch 'master' into issue-134 2018-05-14 11:20:17 -07:00
Aecio Santos 47c2cad8b8 Add getters/setters and update CHANGES.txt 2018-05-14 12:00:02 -04:00
Aecio Santos aaa3113e55 Update CHANGES.txt 2018-05-14 11:51:43 -04:00
Aecio Santos fd1e7fcffe SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos 7bef14d386 Make RobotRules accessible #134
- Makes SimpleRobotRulesParser._rules property protected
  and adds getters for SimpleRobotRulesParser._rules and
  RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
  to SimpleRobotRules to allow access to concrete class without
  nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Julien Nioche e25309d26c
Add JAX-B dependencies to POM (#207)
* Add JAX-B dependencies to POM, fixes #196

* mentioned in CHANGES.txt
2018-05-03 11:04:03 +01:00
Sebastian Nagel 7d3eccfa63 Add changelog entry and fix unit test 2018-04-25 14:06:33 +02:00
Sebastian Nagel 0ef7cf87fa Improve sitemap parsing
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel a9277acde2
Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel a6b3178fc7 Simplify MIME detection:
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel 907be2343f Format fix: add braces, complete CHANGES.txt 2018-04-16 13:36:06 +02:00
Ken Krugler 12155888bc
Add reference to issue #199 2018-04-02 12:59:17 -07:00
Sebastian Nagel 49bf37c6d9 Update CHANGES.txt 2017-12-08 09:42:52 +01:00