Sebastian Nagel
636a286ddb
Merge pull request #280 from sebastian-nagel/cc-261-robots-parser-dedup-sitemaps
...
Robots parser: deduplicate sitemap links, fixes #261
2020-03-24 10:34:17 +01:00
Sebastian Nagel
ac3fd3a464
Robots parser: deduplicate sitemap links, fixes #261
...
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
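The deduplication described in this commit can be sketched as follows. This is a minimal illustration, not the project's code: the class `SitemapCollector` and its methods are hypothetical names. It relies only on the documented behaviour of `java.util.LinkedHashSet`, which rejects duplicates while preserving insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder class for illustration; not the crawler-commons rules class.
class SitemapCollector {
    // LinkedHashSet drops repeated URLs and keeps the order of first appearance.
    private final Set<String> sitemaps = new LinkedHashSet<>();

    void addSitemap(String url) {
        sitemaps.add(url);
    }

    // Return a copy so callers cannot modify the internal set.
    List<String> getSitemaps() {
        return new ArrayList<>(sitemaps);
    }
}
```

Adding the same URL twice leaves a single entry, in the position where it was first seen.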
Sebastian Nagel
80fb6e3b68
Merge pull request #285 from sebastian-nagel/cc-284-log-loading-public-suffix-list
...
EffectiveTldFinder to log loading of public suffix list #284
2020-03-24 10:23:08 +01:00
Sebastian Nagel
c070453c5e
EffectiveTldFinder to log loading of public suffix list, fixes #284
...
- log the location of the public suffix list (effective_tld_names.dat)
during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel
8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 ( #286 )
...
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
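A null check of the kind this commit mentions might look like the sketch below. The class and method names are assumptions made for illustration; only `Date.from` and `ZonedDateTime.toInstant` are real JDK calls.

```java
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper for illustration; not the actual VideoAttributes code.
class DateConversion {
    // Returns null for a missing date instead of throwing a NullPointerException.
    static Date toDate(ZonedDateTime dateTime) {
        if (dateTime == null) {
            return null;
        }
        return Date.from(dateTime.toInstant());
    }
}
```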
Sebastian Nagel
f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 ( #282 )
...
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
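Clipping a string before it reaches the logger can be sketched as below. The helper name `clip` and the constant are hypothetical, and whether the real parser appends a truncation marker is not stated in the commit, so this sketch simply cuts the string.

```java
// Hypothetical helper for illustration; not the actual SimpleRobotRulesParser code.
class LogClip {
    // Limit taken from the commit message.
    static final int MAX_LENGTH = 1024;

    // Returns the input unchanged if it is short enough, otherwise its first 1024 chars.
    static String clip(String s) {
        if (s == null || s.length() <= MAX_LENGTH) {
            return s;
        }
        return s.substring(0, MAX_LENGTH);
    }
}
```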
Sebastian Nagel
df3d3a446a
Merge pull request #279 from sebastian-nagel/cc-278-simple-robots-parser-num-warnings-thread-safe
...
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
2020-01-08 14:08:08 +01:00
Sebastian Nagel
6c271f5fab
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
...
- move the counter _numWarnings into the ParseState object local to
parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
log messages beforehand
2020-01-08 13:55:46 +01:00
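The pattern described above (a counter local to each parse call, with the final value published through a `ThreadLocal`) can be sketched roughly as follows. All names and the toy "warning" rule are assumptions for illustration; the real parser's logic is more involved.

```java
// Hypothetical, simplified parser for illustration only.
class WarningCountingParser {
    // Per-call state: each parseContent() invocation gets its own counter.
    private static final class ParseState {
        int numWarnings;
    }

    // Stores the warning count of the last parse done by the current thread.
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    void parseContent(String content) {
        ParseState state = new ParseState();
        for (String line : content.split("\n")) {
            // Toy rule (an assumption): a non-empty line without ':' counts as a warning.
            if (!line.isEmpty() && !line.contains(":")) {
                state.numWarnings++;
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    // Returns the count for the calling thread's most recent parse.
    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

Because the counter lives in a local object and the result in a `ThreadLocal`, a parse running on another thread cannot change the value seen by the current thread.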
Sebastian Nagel
c9c0ac6eda
Merge pull request #274 from sebastian-nagel/cc-273-maven-exec-parameterized-tests
...
Make mvn run parameterized tests, fixes #273
2019-11-12 13:55:17 +01:00
Sebastian Nagel
f7cc8b694a
Make mvn run parameterized tests, fixes #273
...
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel
8509bdad01
Merge pull request #275 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
...
[BasicNormalizer] Empty path before query to be normalized to `/`
2019-11-12 13:34:18 +01:00
Sebastian Nagel
7419e96d74
[BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
...
fixes #247
2019-11-12 13:32:00 +01:00
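The normalization rule in this commit can be illustrated with `java.net.URI`. This is a sketch under assumptions: the class name is hypothetical, it only handles hierarchical URLs that have an authority, and it passes any query through unchanged, which may differ from what BasicNormalizer does with an empty query.

```java
import java.net.URI;

// Hypothetical helper for illustration; not the BasicNormalizer implementation.
class EmptyPathNormalizer {
    // Replaces an empty path with "/" and keeps query and fragment as given.
    static String normalize(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority()).append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

For example, `normalize("http://example.com?a=1")` yields `http://example.com/?a=1`.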
Avi Hayun
0d9287a3f4
Merge pull request #272 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
...
[BasicNormalizer] Trailing question mark in url query test should be added
2019-10-17 13:39:21 +03:00
Sebastian Nagel
2979aaf4aa
[BasicNormalizer] Trailing question mark in url query test should be added,
...
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel
e38bb95b4c
Update CHANGES.txt to reflect latest code contributions
2019-10-15 16:34:00 +02:00
Sebastian Nagel
7e5347eca6
Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check
...
EffectiveTldFinder to validate returned domain names for length restrictions
2019-10-15 16:24:47 +02:00
Avi Hayun
f2b005ef1d
Updated the CHANGES.txt with the latest issues I fixed
2019-09-11 22:48:53 +03:00
Avi Hayun
f571f4e8c7
Updated the CHANGES.txt with the latest issues I fixed
2019-09-11 22:38:57 +03:00
Avi Hayun
bb7523d978
Updated the toString method to include a list of sitemaps if there are fewer than 10 sitemaps in the robots file
2019-09-11 18:22:41 +03:00
Avi Hayun
e0d071eacb
Merge pull request #1 from crawler-commons/master
...
Syncing back from master
2019-09-08 13:56:20 +03:00
Avi Hayun
e55bf1f697
Updated the gitignore file to include all files which should be ignor… ( #265 )
...
* Updated the gitignore file to include all files which should be ignored due to the following frameworks: Java, Eclipse, IntelliJ IDEA, Maven.
File generated initially using: https://www.gitignore.io/
This submit attempts to solve: #260
2019-09-05 08:34:23 +01:00
Julien Nioche
559d45cbba
use open jdk in travis
2019-09-04 09:29:58 +01:00
Avi Hayun
1c74fc6ca0
The robots.txt test now supports the junit5 parameterizedTest, gaining us the ability to have each robot rule be tested separately, so if one rule fails, the others can still show success in their own unit tests. ( #256 )
...
IMHO it also makes the tests more readable but that is in the eyes of the beholder
This file is smaller by 40 lines and has 172 unit tests instead of 46
2019-07-23 13:13:45 +01:00
Avi Hayun
2cfddca060
Merge pull request #254 from Chaiavi/NormalizerUsingParameterizedTest
...
Updated the tests to use parameterizedTest, separating each test
2019-07-22 13:59:07 +03:00
Avi Hayun
3597dee5a7
Updated the tests to use parameterizedTest, making the test file more readable; each normalized pair is now a test in itself, which makes it much easier to pinpoint failures
2019-07-16 21:17:25 +03:00
Avi Hayun
5bda3636d2
Upgraded to Junit v5.5 ( #250 )
...
* Upgraded to Junit v5.5
Updated the annotations and assertions accordingly
* Removed unneeded before and after
* This is a technical debt
Fixing a styling issue I caused about 4 years ago
Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82
* Fixed according to @sebastian-nagel code review
2019-07-15 21:29:03 +03:00
Sebastian Nagel
fba716c765
EffectiveTldFinder to validate returned domain names for length restrictions,
...
fixes #251
2019-07-15 12:39:04 +02:00
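A length check of this kind could be sketched as below. The limits are the usual DNS restrictions (at most 63 characters per label and 253 characters for the whole name in textual form). The class and method names are hypothetical, and the sketch assumes the name is already in ASCII (Punycode) form.

```java
// Hypothetical helper for illustration; not the EffectiveTldFinder implementation.
class DomainLengthCheck {
    static final int MAX_DOMAIN_LENGTH = 253; // whole name, textual form
    static final int MAX_LABEL_LENGTH = 63;   // each dot-separated label

    // Assumes an ASCII (Punycode) host name without a trailing dot.
    static boolean hasValidLength(String domain) {
        if (domain == null || domain.isEmpty() || domain.length() > MAX_DOMAIN_LENGTH) {
            return false;
        }
        // Limit -1 keeps trailing empty labels so that "a." is rejected too.
        for (String label : domain.split("\\.", -1)) {
            if (label.isEmpty() || label.length() > MAX_LABEL_LENGTH) {
                return false;
            }
        }
        return true;
    }
}
```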
Avi Hayun
2aa107e441
fixed a typo
2019-07-11 19:34:24 +03:00
Sebastian Nagel
3355b3ca1d
Merge pull request #241 from sebastian-nagel/cc-240-robotstxt-sitemap-url
...
Robots parser to always handle absolute sitemap URL even without valid base URL
2019-04-04 17:26:12 +02:00
Sebastian Nagel
b449fdf024
Robots parser to always handle absolute sitemap URL even without valid base URL,
...
fixes #240
2019-03-26 15:05:12 +01:00
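The behaviour named in this commit (an absolute sitemap URL is accepted even when the base URL is unusable) can be sketched with `java.net.URI`. The class and method names are assumptions, and the actual parser may resolve URLs differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not the robots.txt parser's code.
class SitemapUrlResolver {
    // Returns the sitemap URL as an absolute URL, or null if it cannot be resolved.
    static String resolve(String baseUrl, String sitemap) {
        try {
            URI target = new URI(sitemap.trim());
            if (target.isAbsolute()) {
                // Absolute sitemap URLs do not need the base URL at all.
                return target.toString();
            }
            if (baseUrl == null) {
                return null;
            }
            URI base = new URI(baseUrl);
            return base.isAbsolute() ? base.resolve(target).toString() : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Only relative sitemap references depend on the base URL; with an invalid base they are dropped rather than guessed.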
Sebastian Nagel
ec2ad7c790
Updated README for 1.0 release
2019-03-25 10:30:48 +01:00
Julien Nioche
651ce08afe
CHANGES ready for next cycle of improvements
2019-03-21 21:02:10 +00:00
Julien Nioche
7c6a3f2ccd
[maven-release-plugin] prepare for next development iteration
2019-03-19 22:07:14 +00:00
Julien Nioche
5504239c6d
[maven-release-plugin] prepare release crawler-commons-1.0
2019-03-19 22:06:56 +00:00
Julien Nioche
0dc699f3e9
Updated CHANGES pre-1.0 release
2019-03-19 22:04:31 +00:00
Julien Nioche
6d7f3a3fbf
Revert "[maven-release-plugin] prepare release crawler-commons-1.0"
...
This reverts commit 2be967020e.
2019-03-19 22:02:19 +00:00
Sebastian Nagel
e8b598b2e8
[Sitemaps] Unit tests depend on system timezone, fixes #238 ( #239 )
...
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
add note that UTC is assumed if no time zone is contained in
date string
2019-03-19 15:00:04 +00:00
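The documented rule (UTC is assumed when the date string carries no time zone) can be illustrated with `java.time`. This is a sketch only: the class name is hypothetical and it handles just two ISO-8601 shapes, not the full set of formats a sitemap parser has to accept.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration; not the real convertToZonedDateTime.
class UtcDefaultDateParser {
    static ZonedDateTime parse(String date) {
        try {
            // Succeeds when the string carries an offset or zone, e.g. "+01:00" or "Z".
            return ZonedDateTime.parse(date);
        } catch (DateTimeParseException e) {
            // No zone information: interpret the local date-time as UTC.
            return LocalDateTime.parse(date).atZone(ZoneOffset.UTC);
        }
    }
}
```

Because the fallback never consults the system default zone, the result does not depend on the machine's time zone setting.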
Julien Nioche
9a5236c484
[maven-release-plugin] prepare for next development iteration
2019-03-16 21:31:53 +00:00
Julien Nioche
2be967020e
[maven-release-plugin] prepare release crawler-commons-1.0
2019-03-16 21:31:34 +00:00
Sebastian Nagel
40531efe25
EffectiveTldFinder: upgrade public suffix list, implements #219 ( #235 )
...
- upgrade the public suffix list to the latest version retrieved right now from
https://publicsuffix.org/list/public_suffix_list.dat
resp.
802c469416/public_suffix_list.dat
2019-03-14 11:22:50 +00:00
Sebastian Nagel
cd57a5b86a
Merge pull request #234 from sebastian-nagel/cc-144-xml-skip-leading-whitespace
...
Detection and parsing of XML sitemaps fails with whitespace before XML declaration #144
2019-03-14 10:34:06 +01:00
Sebastian Nagel
0349fbe1f0
Update changelog for #144/#234
2019-03-14 10:29:17 +01:00
Sebastian Nagel
a69bf09cd4
Detection and parsing of XML sitemaps fails with whitespace before XML declaration,
...
fixes #144
- implement InputStream skipping over white space at beginning of file
- use for XML sitemaps in combination with BOMInputStream,
so that white space or empty lines before <?xml ...> do not
cause the parser to fail
2019-03-07 16:17:09 +01:00
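Skipping leading white space before handing a stream to an XML parser can be sketched as below. This is a simplification under assumptions: the helper name is hypothetical, it skips eagerly when the stream is wrapped instead of subclassing a filter stream, and it works byte by byte, so it only suits ASCII-compatible encodings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration; not the project's InputStream class.
class SkipLeadingWhitespace {
    // Consumes leading white space bytes and returns a stream positioned
    // at the first non-white-space byte (or at end of stream).
    static InputStream wrap(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 1);
        int b;
        while ((b = pin.read()) != -1) {
            if (!Character.isWhitespace(b)) {
                pin.unread(b); // put the first content byte back
                break;
            }
        }
        return pin;
    }
}
```

A byte order mark would still need separate handling, which the commit addresses by combining the stream with BOMInputStream.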
Sebastian Nagel
3f3121568f
Merge pull request #233 from sebastian-nagel/cc-225-226-xml-handlers-buffer-characters
...
Buffer characters when characters(...) is called, fixes #225 and #226
2019-02-28 16:40:00 +01:00
Sebastian Nagel
2539b07948
Buffer characters when characters(...) is called (instead of immediately processing
...
the current character chunk)
- use `localName` instead of calling `currentElement()` where applicable
- remove unnecessary null checks of character buffer
2019-02-25 14:32:26 +01:00
Sebastian Nagel
eb74336bd3
Update changelog for #225 and #226
2019-02-21 23:01:32 +01:00
Sebastian Nagel
82859c2b1a
Buffer characters when characters(...) is called (instead of immediately processing
...
the current character chunk)
- fix errors when character chunks are interrupted by CDATA sections or character entities
- fixes #225 XMLIndexHandler needs to accumulate the lastmod date string before parsing
- fixes #226 XMLHandler needs to append text in characters() vs. immediately processing
- provide character buffer in DelegatorHandler, so that derived classes
can append characters to it and finally get the buffered content
- code cleanup in all handler classes:
- add @Override annotations
- remove stub method implementations
2019-02-21 23:01:32 +01:00
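The buffering approach described in this commit can be sketched with the JDK's SAX API. A SAX parser may deliver the text of one element in several `characters(...)` calls, for instance around character entities or CDATA sections, so the text is appended to a buffer and only processed in `endElement`. The handler below is a hypothetical, much reduced example and not one of the project's handler classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: collects the text of <loc> elements.
class BufferingLocHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> locs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0); // start with an empty buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // only accumulate, never process here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The default parser is not namespace-aware, so qName holds the element name.
        if ("loc".equals(qName)) {
            locs.add(buffer.toString().trim());
        }
        buffer.setLength(0);
    }

    static List<String> parseLocs(String xml) throws Exception {
        BufferingLocHandler handler = new BufferingLocHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.locs;
    }
}
```

However the parser splits the text into chunks, the full element content is available once `endElement` is reached.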
Sebastian Nagel
b5a9517bc4
Merge pull request #232 from sebastian-nagel/cc-231-etld-invalid-idns
...
EffectiveTldFinder throws IllegalArgumentException on IDN domain names containing prohibited characters, fixes #231
2019-02-21 22:27:59 +01:00
Sebastian Nagel
916415d262
Merge branch 'master' into cc-231-etld-invalid-idns
2019-02-21 22:16:48 +01:00
Sebastian Nagel
40b1c44d68
Update changelog for #231
2019-02-21 22:15:40 +01:00