mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-04 14:36:04 +02:00
Commit Graph

622 Commits

Author SHA1 Message Date
Sebastian Nagel 636a286ddb
Merge pull request #280 from sebastian-nagel/cc-261-robots-parser-dedup-sitemaps
Robots parser: deduplicate sitemap links, fixes #261
2020-03-24 10:34:17 +01:00
Sebastian Nagel ac3fd3a464 Robots parser: deduplicate sitemap links, fixes #261
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
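
The deduplication described in this commit can be sketched as follows. This is a minimal illustration, not the crawler-commons code: the class `SitemapCollector` and its methods are hypothetical names. Only the use of a `LinkedHashSet` comes from the commit message.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder class for illustration; not the crawler-commons API.
class SitemapCollector {
    // A LinkedHashSet drops duplicates while keeping first-seen order.
    private final Set<String> sitemaps = new LinkedHashSet<>();

    void addSitemap(String url) {
        sitemaps.add(url); // adding an already-present URL is a no-op
    }

    List<String> getSitemaps() {
        return new ArrayList<>(sitemaps); // defensive copy in insertion order
    }
}
```

Adding the same URL twice leaves a single entry, and the order of the remaining entries follows their first appearance in the robots.txt file.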
Sebastian Nagel 80fb6e3b68
Merge pull request #285 from sebastian-nagel/cc-284-log-loading-public-suffix-list
EffectiveTldFinder to log loading of public suffix list #284
2020-03-24 10:23:08 +01:00
Sebastian Nagel c070453c5e EffectiveTldFinder to log loading of public suffix list, fixes #284
- log the location of the public suffix list (effective_tld_names.dat)
  during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel 8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 (#286)
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
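
A null-safe conversion of the kind this fix describes might look like the sketch below. The helper class `DateConversion` and the method `toDate` are assumed names for illustration. The actual change lives in `VideoAttributes` and may be written differently.

```java
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper; illustrates the null check only.
class DateConversion {
    // Returns null for a missing date instead of throwing a NullPointerException.
    static Date toDate(ZonedDateTime zdt) {
        return zdt == null ? null : Date.from(zdt.toInstant());
    }
}
```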
Sebastian Nagel f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 (#282)
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
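
Clipping a string before logging it could be done roughly like this. The names `LogClip`, `clip` and `MAX_LOG_CHARS` are assumptions. Only the 1024-character limit is taken from the commit message.

```java
// Hypothetical helper; not the method used in SimpleRobotRulesParser.
class LogClip {
    static final int MAX_LOG_CHARS = 1024; // limit named in the commit message

    // Returns the input unchanged if it is short enough, else its first 1024 chars.
    // Note: a plain substring may cut a surrogate pair in half at the boundary.
    static String clip(String s) {
        if (s == null || s.length() <= MAX_LOG_CHARS) {
            return s;
        }
        return s.substring(0, MAX_LOG_CHARS);
    }
}
```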
Sebastian Nagel df3d3a446a
Merge pull request #279 from sebastian-nagel/cc-278-simple-robots-parser-num-warnings-thread-safe
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
2020-01-08 14:08:08 +01:00
Sebastian Nagel 6c271f5fab SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
- move the counter _numWarnings into the ParseState object local to
  parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
  of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
  log messages beforehand
2020-01-08 13:55:46 +01:00
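
The first two points of this commit can be illustrated with a small sketch: a counter held in a per-call state object, and the final count published through a `ThreadLocal`. The class `WarningCountingParser` and its toy rule (a non-empty line without a colon counts as a warning) are assumptions made for the example and do not reproduce the real parser.

```java
// Hypothetical parser skeleton; shows the threading pattern only.
class WarningCountingParser {
    // Per-call state: every parse() gets its own counter, so concurrent
    // calls on a shared parser instance never touch the same field.
    private static final class ParseState {
        int numWarnings;
    }

    // Each thread sees the count of its own most recent parse() call.
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    void parse(String content) {
        ParseState state = new ParseState();
        for (String line : content.split("\n")) {
            // Toy rule for the sketch: a non-empty line without ':' is a warning.
            if (!line.isEmpty() && !line.contains(":")) {
                state.numWarnings++;
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

With this layout, a thread that has not parsed anything yet reads 0, and two threads parsing at the same time each read back their own count.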
Sebastian Nagel c9c0ac6eda
Merge pull request #274 from sebastian-nagel/cc-273-maven-exec-parameterized-tests
Make mvn run parameterized tests, fixes #273
2019-11-12 13:55:17 +01:00
Sebastian Nagel f7cc8b694a Make mvn run parameterized tests, fixes #273
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel 8509bdad01
Merge pull request #275 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
[BasicNormalizer] Empty path before query to be normalized to `/`
2019-11-12 13:34:18 +01:00
Sebastian Nagel 7419e96d74 [BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
fixes #247
2019-11-12 13:32:00 +01:00
Avi Hayun 0d9287a3f4
Merge pull request #272 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
[BasicNormalizer] Trailing question mark in url query test should be added
2019-10-17 13:39:21 +03:00
Sebastian Nagel 2979aaf4aa [BasicNormalizer] Trailing question mark in url query test should be added,
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel e38bb95b4c Update CHANGES.txt to reflect latest code contributions 2019-10-15 16:34:00 +02:00
Sebastian Nagel 7e5347eca6
Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check
EffectiveTldFinder to validate returned domain names for length restrictions
2019-10-15 16:24:47 +02:00
Avi Hayun f2b005ef1d Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:48:53 +03:00
Avi Hayun f571f4e8c7 Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:38:57 +03:00
Avi Hayun bb7523d978 Updated the toString method to include a list of sitemaps if there are fewer than 10 sitemaps in the robots file 2019-09-11 18:22:41 +03:00
Avi Hayun e0d071eacb
Merge pull request #1 from crawler-commons/master
Syncing back from master
2019-09-08 13:56:20 +03:00
Avi Hayun e55bf1f697 Updated the gitignore file to include all files which should be ignor… (#265)
* Updated the gitignore file to include all files which should be ignored due to the following frameworks: Java, Eclipse, Intellij Idea, Maven.

File generated initially using: https://www.gitignore.io/
This submit attempts to solve: #260
2019-09-05 08:34:23 +01:00
Julien Nioche 559d45cbba
Use OpenJDK in Travis 2019-09-04 09:29:58 +01:00
Avi Hayun 1c74fc6ca0 The robots.txt test now supports the junit5 parameterizedTest, gaining us the ability to have each robot rule be tested separately, so if one rule fails, the others can still show success in their own unit tests. (#256)
IMHO it also makes the tests more readable but that is in the eyes of the beholder

This file is smaller by 40 lines, and has 172 unit tests instead of 46 tests
2019-07-23 13:13:45 +01:00
Avi Hayun 2cfddca060
Merge pull request #254 from Chaiavi/NormalizerUsingParameterizedTest
Updated the tests to use parameterizedTest, separating each test
2019-07-22 13:59:07 +03:00
Avi Hayun 3597dee5a7 Updated the tests to use parameterizedTest, making the test file more readable; each normalized pair is now a test in itself, which makes it much easier to identify exact failure points 2019-07-16 21:17:25 +03:00
Avi Hayun 5bda3636d2
Upgraded to Junit v5.5 (#250)
* Upgraded to Junit v5.5
Updated the annotations and assertions accordingly

* Removed unneeded before and after

* This is a technical debt

Fixing a styling issue I caused about 4 years ago

Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82

* Fixed according to @sebastian-nagel code review
2019-07-15 21:29:03 +03:00
Sebastian Nagel fba716c765 EffectiveTldFinder to validate returned domain names for length restrictions,
fixes #251
2019-07-15 12:39:04 +02:00
Avi Hayun 2aa107e441
fixed a typo 2019-07-11 19:34:24 +03:00
Sebastian Nagel 3355b3ca1d
Merge pull request #241 from sebastian-nagel/cc-240-robotstxt-sitemap-url
Robots parser to always handle absolute sitemap URL even without valid base URL
2019-04-04 17:26:12 +02:00
Sebastian Nagel b449fdf024 Robots parser to always handle absolute sitemap URL even without valid base URL,
fixes #240
2019-03-26 15:05:12 +01:00
Sebastian Nagel ec2ad7c790 Updated README for 1.0 release 2019-03-25 10:30:48 +01:00
Julien Nioche 651ce08afe CHANGES ready for next cycle of improvements 2019-03-21 21:02:10 +00:00
Julien Nioche 7c6a3f2ccd [maven-release-plugin] prepare for next development iteration 2019-03-19 22:07:14 +00:00
Julien Nioche 5504239c6d [maven-release-plugin] prepare release crawler-commons-1.0 2019-03-19 22:06:56 +00:00
Julien Nioche 0dc699f3e9 Updated CHANGES pre-1.0 release 2019-03-19 22:04:31 +00:00
Julien Nioche 6d7f3a3fbf Revert "[maven-release-plugin] prepare release crawler-commons-1.0"
This reverts commit 2be967020e.
2019-03-19 22:02:19 +00:00
Sebastian Nagel e8b598b2e8 [Sitemaps] Unit tests depend on system timezone, fixes #238 (#239)
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
  add note that UTC is assumed if no time zone is contained in
  date string
2019-03-19 15:00:04 +00:00
Julien Nioche 9a5236c484 [maven-release-plugin] prepare for next development iteration 2019-03-16 21:31:53 +00:00
Julien Nioche 2be967020e [maven-release-plugin] prepare release crawler-commons-1.0 2019-03-16 21:31:34 +00:00
Sebastian Nagel 40531efe25 EffectiveTldFinder: upgrade public suffix list, implements #219 (#235)
- upgrade the public suffix list to the latest version retrieved right now from
     https://publicsuffix.org/list/public_suffix_list.dat
  resp.
     802c469416/public_suffix_list.dat
2019-03-14 11:22:50 +00:00
Sebastian Nagel cd57a5b86a
Merge pull request #234 from sebastian-nagel/cc-144-xml-skip-leading-whitespace
Detection and parsing of XML sitemaps fails with whitespace before XML declaration #144
2019-03-14 10:34:06 +01:00
Sebastian Nagel 0349fbe1f0 Update changelog for #144/#234 2019-03-14 10:29:17 +01:00
Sebastian Nagel a69bf09cd4 Detection and parsing of XML sitemaps fails with whitespace before XML declaration,
fixes #144
- implement InputStream skipping over white space at beginning of file
- use for XML sitemaps in combination with BOMInputStream,
  so that white space or empty lines before <?xml ...> do not
  cause the parser to fail
2019-03-07 16:17:09 +01:00
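
One way to skip white space at the start of a stream is sketched below with a `PushbackInputStream`. This is an assumption about a possible approach, not the implementation from this commit. The class `LeadingWhitespace` is a hypothetical name, and byte order mark handling (done with `BOMInputStream` according to the commit message) is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper; illustrates skipping leading white space only.
class LeadingWhitespace {
    // Returns a stream positioned at the first byte that is not
    // a space, tab, carriage return or line feed.
    static InputStream skip(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 1);
        int b;
        while ((b = pb.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                pb.unread(b); // put the first significant byte back
                break;
            }
        }
        return pb;
    }
}
```

An XML parser reading from the returned stream then sees `<?xml ...>` as the very first bytes, even if the file started with blank lines.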
Sebastian Nagel 3f3121568f
Merge pull request #233 from sebastian-nagel/cc-225-226-xml-handlers-buffer-characters
Buffer characters when characters(...) is called, fixes #225 and #226
2019-02-28 16:40:00 +01:00
Sebastian Nagel 2539b07948 Buffer characters when characters(...) is called (instead of immediately processing
the current character chunk)
- use `localName` instead of calling `currentElement()` where applicable
- remove unnecessary null checks of character buffer
2019-02-25 14:32:26 +01:00
Sebastian Nagel eb74336bd3 Update changelog for #225 and #226 2019-02-21 23:01:32 +01:00
Sebastian Nagel 82859c2b1a Buffer characters when characters(...) is called (instead of immediately processing
the current character chunk)
- fix errors when character chunks are interrupted by CDATA sections or character entities
  - fixes #225 XMLIndexHandler needs to accumulate the lastmod date string before parsing
  - fixes #226 XMLHandler needs to append text in characters() vs. immediately processing
  - provide character buffer in DelegatorHandler, so that derived classes
    can append characters to it and finally get the buffered content
- code cleanup in all handler classes:
  - add @Override annotations
  - remove stub method implementations
2019-02-21 23:01:32 +01:00
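
The buffering pattern this commit describes can be sketched with a plain SAX `DefaultHandler`. A SAX parser may deliver the text of one element in several `characters(...)` calls, for example around a character entity or a CDATA section, so the handler appends each chunk and reads the buffer only in `endElement`. The class `LocCollector` is a hypothetical example and not one of the crawler-commons handler classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; collects the text of every <loc> element.
class LocCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> locs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // append only, never process a chunk here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The default factory is not namespace-aware, so qName holds the tag name.
        if ("loc".equals(qName)) {
            locs.add(buffer.toString().trim());
        }
        buffer.setLength(0);
    }

    static List<String> parse(String xml) {
        try {
            LocCollector handler = new LocCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.locs;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the text is assembled before it is used, a URL that contains `&amp;` or a CDATA section comes out as one string regardless of how the parser splits the chunks.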
Sebastian Nagel b5a9517bc4
Merge pull request #232 from sebastian-nagel/cc-231-etld-invalid-idns
EffectiveTldFinder throws IllegalArgumentException on IDN domain names containing prohibited characters, fixes #231
2019-02-21 22:27:59 +01:00
Sebastian Nagel 916415d262
Merge branch 'master' into cc-231-etld-invalid-idns 2019-02-21 22:16:48 +01:00
Sebastian Nagel 40b1c44d68 Update changelog for #231 2019-02-21 22:15:40 +01:00