mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-10 16:06:04 +02:00
Commit Graph

438 Commits

Author SHA1 Message Date
Julien Nioche 24e9e87db4 [maven-release-plugin] prepare release crawler-commons-1.1 2020-06-25 09:52:34 +01:00
Julien Nioche 1447aa4799 JAVADOC fix in NewsAttributes 2020-06-25 09:49:08 +01:00
Sebastian Nagel 774c5c8092
Improvements to BasicURLNormalizer (#292)
- better percent-encoding of URL paths and queries, fixes #263
- hostnames:
  * convert IDNs from Unicode to Punycode, fixes #248
  * remove trailing dot
- normalize path `/..` to `/`
- also normalize path of file:/ URLs
2020-06-22 13:51:39 +01:00
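
As an illustration of the hostname items in the commit above, a minimal sketch using the JDK's `java.net.IDN`. The class and method names (`HostNormalizationSketch`, `normalizeHost`) are hypothetical and this is not the BasicURLNormalizer code itself:

```java
import java.net.IDN;

// Hypothetical helper, not part of crawler-commons.
final class HostNormalizationSketch {

    /** Removes one trailing dot and converts Unicode labels to Punycode (ASCII). */
    static String normalizeHost(String host) {
        String h = host;
        if (h.endsWith(".")) {
            // "example.com." and "example.com" name the same host
            h = h.substring(0, h.length() - 1);
        }
        // IDN.toASCII leaves plain ASCII host names unchanged
        return IDN.toASCII(h);
    }

    public static void main(String[] args) {
        // "b\u00fccher" is "bücher" written with a Unicode escape
        System.out.println(normalizeHost("b\u00fccher.example."));
    }
}
```

`IDN.toASCII` throws `IllegalArgumentException` for labels it cannot convert, so a caller would likely need to decide how to treat such hosts.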
Sebastian Nagel 78d7e7e85f
Sitemaps to implement Serializable, fixes #244 (#294)
- make all sitemap classes including extensions to implement the
  Serializable interface
- extend sitemap parser unit tests to check object serialization
  on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
  etc.)
2020-06-22 12:51:40 +01:00
Sebastian Nagel e9772be011
[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 (#293)
- add unit test
2020-06-19 13:15:48 +01:00
Julien Nioche d4d8d6531c
removed duplicate entry 2020-06-16 14:41:37 +01:00
Evan Halley ed0933f2b3
fixing NewsAttribute.equals(), comparing against that.publicationDate, updating the unit test, added change to Changes.txt (#291) 2020-06-16 14:39:05 +01:00
Sebastian Nagel 48aecd2c68
Fix build failure (forbidden apis) (#290)
- make String::format not depend on system locale
2020-06-16 09:59:04 +01:00
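
The kind of change described above can be sketched as follows, assuming the fix amounts to passing an explicit locale to `String.format`. The class and method names are hypothetical, and `Locale.ROOT` is one reasonable choice rather than necessarily the one used in the commit:

```java
import java.util.Locale;

// Hypothetical example, not taken from the crawler-commons sources.
final class LocaleSafeFormatSketch {

    /** Formats a number with two decimals, independent of the system locale. */
    static String formatPriority(double priority) {
        // Without the Locale argument, a German default locale would yield "0,50"
        return String.format(Locale.ROOT, "%.2f", priority);
    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.GERMANY);
        System.out.println(formatPriority(0.5));
    }
}
```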
Julien Nioche 5d30196ffb Added #288 to CHANGES 2020-06-15 16:02:12 +01:00
Evan Halley c04e3f17e7
Adding asMap to ExtensionMetadata Interface (#288)
* added abstract method to extension metadata

* implemented asmap in image/link/mobile/news attributes

* implemented asmap in videoattributes

* adding video attributes unit test

* added news attributes unit tests

* unit test for link attributes

* unit tests for image and mobile attributes

* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString

* using constants instead of strings in more attributes

* cleaned up the imports

* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage

* added related issue to the changelog

* reverting change to NewsAttributes.equal, that causes a unit test failure
2020-06-15 15:55:20 +01:00
Ken Krugler 2d727b2cfa
Merge pull request #268 from Chaiavi/BetterRobotsToString
Better robots to string
2020-04-30 07:02:16 -07:00
Avi Hayun cb21c29a57 Upgrade the toString() method of the Base/simple RobotRules #264 2020-04-29 20:00:43 +03:00
Avi Hayun 95207d0928
Merge pull request #4 from Chaiavi/master
Updating my fork
2020-04-29 09:08:43 +03:00
Avi Hayun 0265b98254
Merge branch 'BetterRobotsToString' into master 2020-04-29 08:59:49 +03:00
Avi Hayun 1e4c673057
Merge pull request #2 from crawler-commons/master
Updates till 2020-04-29
2020-04-29 08:54:48 +03:00
Avi Hayun 89ccfc308b
Merge branch 'master' into master 2020-04-29 08:54:15 +03:00
Sebastian Nagel 636a286ddb
Merge pull request #280 from sebastian-nagel/cc-261-robots-parser-dedup-sitemaps
Robots parser: deduplicate sitemap links, fixes #261
2020-03-24 10:34:17 +01:00
Sebastian Nagel ac3fd3a464 Robots parser: deduplicate sitemap links, fixes #261
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
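
A minimal sketch of why a `LinkedHashSet` fits this purpose: it drops duplicates while keeping the order in which sitemap URLs were first seen. The class `SitemapDedupSketch` and its methods are hypothetical and only mirror the idea, not the actual robots rules classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical container, for illustration only.
final class SitemapDedupSketch {

    // Insertion-ordered set: repeated URLs are ignored, first-seen order is kept
    private final Set<String> sitemaps = new LinkedHashSet<>();

    void addSitemap(String url) {
        sitemaps.add(url);
    }

    List<String> getSitemaps() {
        return new ArrayList<>(sitemaps);
    }

    public static void main(String[] args) {
        SitemapDedupSketch rules = new SitemapDedupSketch();
        rules.addSitemap("https://example.com/sitemap-a.xml");
        rules.addSitemap("https://example.com/sitemap-b.xml");
        rules.addSitemap("https://example.com/sitemap-a.xml"); // duplicate
        System.out.println(rules.getSitemaps());
    }
}
```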
Sebastian Nagel 80fb6e3b68
Merge pull request #285 from sebastian-nagel/cc-284-log-loading-public-suffix-list
EffectiveTldFinder to log loading of public suffix list #284
2020-03-24 10:23:08 +01:00
Sebastian Nagel c070453c5e EffectiveTldFinder to log loading of public suffix list, fixes #284
- log the location of the public suffix list (effective_tld_names.dat)
  during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel 8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 (#286)
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
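
The null check described above might look like the following sketch. `NullSafeDateSketch.toDate` is a hypothetical helper; the actual getter in VideoAttributes may be structured differently:

```java
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper, for illustration only.
final class NullSafeDateSketch {

    /** Converts a ZonedDateTime to a legacy Date, passing null through. */
    static Date toDate(ZonedDateTime dateTime) {
        if (dateTime == null) {
            // Calling dateTime.toInstant() here would throw a NullPointerException
            return null;
        }
        return Date.from(dateTime.toInstant());
    }

    public static void main(String[] args) {
        System.out.println(toDate(null));
        System.out.println(toDate(ZonedDateTime.parse("2020-02-17T15:33:38Z")));
    }
}
```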
Sebastian Nagel f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 (#282)
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
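
Clipping a string before logging it can be sketched as below. Only the 1024-character limit comes from the commit message; the class, constant, and method names are assumptions:

```java
// Hypothetical helper, not the SimpleRobotRulesParser implementation.
final class LogClipSketch {

    static final int MAX_LOG_LENGTH = 1024;

    /** Returns the string unchanged if short enough, else its first 1024 chars. */
    static String clip(String message) {
        if (message == null || message.length() <= MAX_LOG_LENGTH) {
            return message;
        }
        return message.substring(0, MAX_LOG_LENGTH);
    }

    public static void main(String[] args) {
        String longLine = new String(new char[5000]).replace('\0', 'x');
        System.out.println(clip(longLine).length());
    }
}
```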
Sebastian Nagel df3d3a446a
Merge pull request #279 from sebastian-nagel/cc-278-simple-robots-parser-num-warnings-thread-safe
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
2020-01-08 14:08:08 +01:00
Sebastian Nagel 6c271f5fab SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
- move the counter _numWarnings into the ParseState object local to
  parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
  of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
  log messages beforehand
2020-01-08 13:55:46 +01:00
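
The pattern described in this commit, a per-call state object plus a `ThreadLocal` for the last result, can be sketched roughly as follows. All names and the toy "missing colon" warning rule are hypothetical; this is not the parser's real logic:

```java
// Hypothetical sketch of the pattern, not the SimpleRobotRulesParser code.
final class WarningCounterSketch {

    /** State local to one parse call, so concurrent calls never share a counter. */
    private static final class ParseState {
        int numWarnings;
    }

    // Each thread sees the warning count of its own most recent parse
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    void parseContent(String[] lines) {
        ParseState state = new ParseState();
        for (String line : lines) {
            // Toy rule (an assumption): a line without ':' counts as a warning
            if (!line.contains(":")) {
                state.numWarnings++;
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    int getNumWarnings() {
        return lastNumWarnings.get();
    }

    public static void main(String[] args) {
        WarningCounterSketch parser = new WarningCounterSketch();
        parser.parseContent(new String[] { "User-agent: *", "garbage", "Disallow: /tmp" });
        System.out.println(parser.getNumWarnings());
    }
}
```

With this layout a single parser instance can be shared between threads, at the cost that `getNumWarnings()` is only meaningful on the thread that performed the parse.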
Sebastian Nagel c9c0ac6eda
Merge pull request #274 from sebastian-nagel/cc-273-maven-exec-parameterized-tests
Make mvn run parameterized tests, fixes #273
2019-11-12 13:55:17 +01:00
Sebastian Nagel f7cc8b694a Make mvn run parameterized tests, fixes #273
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel 8509bdad01
Merge pull request #275 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
[BasicNormalizer] Empty path before query to be normalized to `/`
2019-11-12 13:34:18 +01:00
Sebastian Nagel 7419e96d74 [BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
fixes #247
2019-11-12 13:32:00 +01:00
Avi Hayun 0d9287a3f4
Merge pull request #272 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
[BasicNormalizer] Trailing question mark in url query test should be added
2019-10-17 13:39:21 +03:00
Sebastian Nagel 2979aaf4aa [BasicNormalizer] Trailing question mark in url query test should be added,
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel e38bb95b4c Update CHANGES.txt to reflect latest code contributions 2019-10-15 16:34:00 +02:00
Sebastian Nagel 7e5347eca6
Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check
EffectiveTldFinder to validate returned domain names for length restrictions
2019-10-15 16:24:47 +02:00
Avi Hayun f2b005ef1d Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:48:53 +03:00
Avi Hayun f571f4e8c7 Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:38:57 +03:00
Avi Hayun bb7523d978 Updated the toString method to include a list of sitemaps if there are less than 10 sitemaps in the robots file 2019-09-11 18:22:41 +03:00
Avi Hayun e0d071eacb
Merge pull request #1 from crawler-commons/master
Syncing back from master
2019-09-08 13:56:20 +03:00
Avi Hayun e55bf1f697 Updated the gitignore file to include all files which should be ignor… (#265)
* Updated the gitignore file to include all files which should be ignored due to the following frameworks: Java, Eclipse, Intellij Idea, Maven.

File generated initially using: https://www.gitignore.io/
This submit attempts to solve: #260
2019-09-05 08:34:23 +01:00
Julien Nioche 559d45cbba
use open jdk in travis 2019-09-04 09:29:58 +01:00
Avi Hayun 1c74fc6ca0 The robots.txt test now supports the junit5 parameterizedTest, gaining us the ability to have each robot rule be tested separately, so if one rule fails, the others can still show success in their own unit tests. (#256)
IMHO it also makes the tests more readable but that is in the eyes of the beholder

This file is smaller by 40 lines, and has 172 unit tests instead of 46 tests
2019-07-23 13:13:45 +01:00
Avi Hayun 2cfddca060
Merge pull request #254 from Chaiavi/NormalizerUsingParameterizedTest
Updated the tests to use parameterizedTest, separating each test
2019-07-22 13:59:07 +03:00
Avi Hayun 3597dee5a7 Updated the tests to use parameterizedTest, making the test file more readable; each normalized pair is now a test in itself, which makes it much easier to identify the exact failure points 2019-07-16 21:17:25 +03:00
Avi Hayun 5bda3636d2
Upgraded to Junit v5.5 (#250)
* Upgraded to Junit v5.5
Updated the annotations and assertions accordingly

* Removed unneeded before and after

* This is a technical debt

Fixing a styling issue I caused about 4 years ago

Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82

* Fixed according to @sebastian-nagel code review
2019-07-15 21:29:03 +03:00
Sebastian Nagel fba716c765 EffectiveTldFinder to validate returned domain names for length restrictions,
fixes #251
2019-07-15 12:39:04 +02:00
Avi Hayun 2aa107e441
fixed a typo 2019-07-11 19:34:24 +03:00
Sebastian Nagel 3355b3ca1d
Merge pull request #241 from sebastian-nagel/cc-240-robotstxt-sitemap-url
Robots parser to always handle absolute sitemap URL even without valid base URL
2019-04-04 17:26:12 +02:00
Sebastian Nagel b449fdf024 Robots parser to always handle absolute sitemap URL even without valid base URL,
fixes #240
2019-03-26 15:05:12 +01:00
Sebastian Nagel ec2ad7c790 Updated README for 1.0 release 2019-03-25 10:30:48 +01:00
Julien Nioche 651ce08afe CHANGES ready for next cycle of improvements 2019-03-21 21:02:10 +00:00
Julien Nioche 7c6a3f2ccd [maven-release-plugin] prepare for next development iteration 2019-03-19 22:07:14 +00:00
Julien Nioche 5504239c6d [maven-release-plugin] prepare release crawler-commons-1.0 2019-03-19 22:06:56 +00:00