Julien Nioche
24e9e87db4
[maven-release-plugin] prepare release crawler-commons-1.1
2020-06-25 09:52:34 +01:00
Julien Nioche
1447aa4799
JAVADOC fix in NewsAttributes
2020-06-25 09:49:08 +01:00
Sebastian Nagel
774c5c8092
Improvements to BasicURLNormalizer ( #292 )
...
- better percent-encoding of URL paths and queries, fixes #263
- hostnames:
* convert IDNs from Unicode to Punycode, fixes #248
* remove trailing dot
- normalize path `/..` to `/`
- also normalize path of file:/ URLs
2020-06-22 13:51:39 +01:00
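The hostname steps listed in 774c5c8092 (Unicode to Punycode, remove trailing dot) can be sketched roughly as below. This is only an illustration under assumptions, not the code of BasicURLNormalizer: the class name `HostNormalization` and the method `normalizeHost` are hypothetical, and the order of the steps is a guess. `java.net.IDN.toASCII` is the standard JDK call for the Punycode conversion.

```java
import java.net.IDN;
import java.util.Locale;

// Hypothetical helper, not part of crawler-commons.
class HostNormalization {

    // Lowercase the host, drop a single trailing dot, then convert
    // Unicode labels to their ASCII (Punycode) form.
    static String normalizeHost(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        return IDN.toASCII(h);
    }

    public static void main(String[] args) {
        // "B\u00fccher" is "Bücher" written with a Unicode escape.
        System.out.println(normalizeHost("B\u00fccher.example."));
    }
}
```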
Sebastian Nagel
78d7e7e85f
Sitemaps to implement Serializable, fixes #244 ( #294 )
...
- make all sitemap classes, including extensions, implement the
Serializable interface
- extend sitemap parser unit tests to check object serialization
on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
etc.)
2020-06-22 12:51:40 +01:00
Sebastian Nagel
e9772be011
[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 ( #293 )
...
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
2020-06-19 13:15:48 +01:00
Julien Nioche
d4d8d6531c
removed duplicate entry
2020-06-16 14:41:37 +01:00
Evan Halley
ed0933f2b3
fixing NewsAttribute.equals(), comparing against that.publicationDate, updating the unit test, added change to Changes.txt ( #291 )
2020-06-16 14:39:05 +01:00
Sebastian Nagel
48aecd2c68
Fix build failure (forbidden apis) ( #290 )
...
- make String::format not depend on system locale
2020-06-16 09:59:04 +01:00
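A minimal sketch of what "make String::format not depend on system locale" in 48aecd2c68 usually means in practice: pass an explicit `Locale` instead of relying on the JVM default. The class `LocaleSafeFormat` and the method `formatPriority` are hypothetical names for illustration; which call sites the commit actually touched is not stated here.

```java
import java.util.Locale;

// Hypothetical example class, not part of crawler-commons.
class LocaleSafeFormat {

    // With Locale.ROOT the decimal separator is always '.', whatever
    // the default locale of the JVM happens to be.
    static String formatPriority(double priority) {
        return String.format(Locale.ROOT, "%.2f", priority);
    }

    public static void main(String[] args) {
        System.out.println(formatPriority(0.5)); // prints "0.50"
    }
}
```

Without the explicit locale, the same call could print `0,50` on a system configured for, say, German or French.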
Julien Nioche
5d30196ffb
Added #288 to CHANGES
2020-06-15 16:02:12 +01:00
Evan Halley
c04e3f17e7
Adding asMap to ExtensionMetadata Interface ( #288 )
...
* added abstract method to extension metadata
* implemented asmap in image/link/mobile/news attributes
* implemented asmap in videoattributes
* adding video attributes unit test
* added news attributes unit tests
* unit test for link attributes
* unit tests for image and mobile attributes
* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString
* using constants instead of strings in more attributes
* cleaned up the imports
* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage
* added related issue to the changelog
* reverting change to NewsAttributes.equal, that causes a unit test failure
2020-06-15 15:55:20 +01:00
Ken Krugler
2d727b2cfa
Merge pull request #268 from Chaiavi/BetterRobotsToString
...
Better robots to string
2020-04-30 07:02:16 -07:00
Avi Hayun
cb21c29a57
Upgrade the toString() method of the Base/simple RobotRules #264
2020-04-29 20:00:43 +03:00
Avi Hayun
95207d0928
Merge pull request #4 from Chaiavi/master
...
Updating my fork
2020-04-29 09:08:43 +03:00
Avi Hayun
0265b98254
Merge branch 'BetterRobotsToString' into master
2020-04-29 08:59:49 +03:00
Avi Hayun
1e4c673057
Merge pull request #2 from crawler-commons/master
...
Updates till 2020-04-29
2020-04-29 08:54:48 +03:00
Avi Hayun
89ccfc308b
Merge branch 'master' into master
2020-04-29 08:54:15 +03:00
Sebastian Nagel
636a286ddb
Merge pull request #280 from sebastian-nagel/cc-261-robots-parser-dedup-sitemaps
...
Robots parser: deduplicate sitemap links, fixes #261
2020-03-24 10:34:17 +01:00
Sebastian Nagel
ac3fd3a464
Robots parser: deduplicate sitemap links, fixes #261
...
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
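The approach named in ac3fd3a464, a LinkedHashSet to store and unify sitemap URLs, can be sketched as follows. The class `SitemapUrlDedup` and its methods are hypothetical stand-ins, not the actual robots rules classes; the point is only that a `LinkedHashSet` drops duplicates while keeping first-seen order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a rules object that collects sitemap URLs.
class SitemapUrlDedup {

    // LinkedHashSet ignores repeated entries and preserves insertion order.
    private final Set<String> sitemaps = new LinkedHashSet<>();

    void addSitemap(String url) {
        sitemaps.add(url);
    }

    // Return a copy so callers cannot modify the internal set.
    List<String> getSitemaps() {
        return new ArrayList<>(sitemaps);
    }

    public static void main(String[] args) {
        SitemapUrlDedup rules = new SitemapUrlDedup();
        rules.addSitemap("https://example.com/sitemap.xml");
        rules.addSitemap("https://example.com/news.xml");
        rules.addSitemap("https://example.com/sitemap.xml"); // duplicate, ignored
        System.out.println(rules.getSitemaps());
    }
}
```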
Sebastian Nagel
80fb6e3b68
Merge pull request #285 from sebastian-nagel/cc-284-log-loading-public-suffix-list
...
EffectiveTldFinder to log loading of public suffix list #284
2020-03-24 10:23:08 +01:00
Sebastian Nagel
c070453c5e
EffectiveTldFinder to log loading of public suffix list, fixes #284
...
- log the location of the public suffix list (effective_tld_names.dat)
during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel
8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 ( #286 )
...
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
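The null check described in 8522cfdd34 might look roughly like the sketch below. The helper class `NullSafeDates` and the method `toDate` are hypothetical; the real change lives in VideoAttributes and may be structured differently.

```java
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper, not part of crawler-commons.
class NullSafeDates {

    // Convert only when a value is present; otherwise pass null through
    // instead of letting value.toInstant() throw a NullPointerException.
    static Date toDate(ZonedDateTime value) {
        return value == null ? null : Date.from(value.toInstant());
    }

    public static void main(String[] args) {
        System.out.println(toDate(null)); // prints "null"
        System.out.println(toDate(ZonedDateTime.now()));
    }
}
```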
Sebastian Nagel
f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 ( #282 )
...
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
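Clipping strings before they reach the log, as f83c43bdd4 describes, can be sketched like this. Only the 1024-character limit comes from the commit message; the class `LogClip`, the method `clip`, and the choice to cut without appending a marker are assumptions.

```java
// Hypothetical helper, not part of crawler-commons.
class LogClip {

    // Limit taken from the commit message.
    static final int MAX_LENGTH = 1024;

    // Return the string unchanged if it is short enough (or null),
    // otherwise keep only the first MAX_LENGTH characters.
    static String clip(String s) {
        if (s == null || s.length() <= MAX_LENGTH) {
            return s;
        }
        return s.substring(0, MAX_LENGTH);
    }

    public static void main(String[] args) {
        String longLine = new String(new char[5000]).replace('\0', 'x');
        System.out.println(clip(longLine).length()); // prints "1024"
    }
}
```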
Sebastian Nagel
df3d3a446a
Merge pull request #279 from sebastian-nagel/cc-278-simple-robots-parser-num-warnings-thread-safe
...
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
2020-01-08 14:08:08 +01:00
Sebastian Nagel
6c271f5fab
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
...
- move the counter _numWarnings into the ParseState object local to
parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
log messages beforehand
2020-01-08 13:55:46 +01:00
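The design described in 6c271f5fab, a per-call ParseState that owns the warnings counter plus a ThreadLocal that holds the result for getNumWarnings(), can be sketched as below. The names `ParseState`, `parseContent` and `getNumWarnings` come from the commit message; the class `WarningCountingParser`, the field `lastNumWarnings`, and the rule for what counts as a warning (a line without a colon) are invented for illustration.

```java
// Hypothetical, simplified parser; not the real SimpleRobotRulesParser.
class WarningCountingParser {

    // State local to one parseContent(...) call, so concurrent calls
    // never share a counter.
    private static class ParseState {
        int numWarnings;
    }

    // Each thread sees the warnings count of its own most recent parse.
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    void parseContent(String content) {
        ParseState state = new ParseState();
        for (String line : content.split("\n")) {
            // Assumed rule for this sketch: a line without ':' is malformed.
            if (!line.contains(":")) {
                state.numWarnings++;
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    int getNumWarnings() {
        return lastNumWarnings.get();
    }

    public static void main(String[] args) {
        WarningCountingParser parser = new WarningCountingParser();
        parser.parseContent("User-agent: *\nbogus\nDisallow: /private\nnope");
        System.out.println(parser.getNumWarnings()); // prints "2"
    }
}
```

Because the counter is never a shared mutable field, a single parser instance can be used from several threads without the counts interfering with each other.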
Sebastian Nagel
c9c0ac6eda
Merge pull request #274 from sebastian-nagel/cc-273-maven-exec-parameterized-tests
...
Make mvn run parameterized tests, fixes #273
2019-11-12 13:55:17 +01:00
Sebastian Nagel
f7cc8b694a
Make mvn run parameterized tests, fixes #273
...
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel
8509bdad01
Merge pull request #275 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
...
[BasicNormalizer] Empty path before query to be normalized to `/`
2019-11-12 13:34:18 +01:00
Sebastian Nagel
7419e96d74
[BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
...
fixes #247
2019-11-12 13:32:00 +01:00
Avi Hayun
0d9287a3f4
Merge pull request #272 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
...
[BasicNormalizer] Trailing question mark in url query test should be added
2019-10-17 13:39:21 +03:00
Sebastian Nagel
2979aaf4aa
[BasicNormalizer] Trailing question mark in url query test should be added,
...
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel
e38bb95b4c
Update CHANGES.txt to reflect latest code contributions
2019-10-15 16:34:00 +02:00
Sebastian Nagel
7e5347eca6
Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check
...
EffectiveTldFinder to validate returned domain names for length restrictions
2019-10-15 16:24:47 +02:00
Avi Hayun
f2b005ef1d
Updated the CHANGES.txt with the latest issues I fixed
2019-09-11 22:48:53 +03:00
Avi Hayun
f571f4e8c7
Updated the CHANGES.txt with the latest issues I fixed
2019-09-11 22:38:57 +03:00
Avi Hayun
bb7523d978
Updated the toString method to include a list of sitemaps if there are fewer than 10 sitemaps in the robots file
2019-09-11 18:22:41 +03:00
Avi Hayun
e0d071eacb
Merge pull request #1 from crawler-commons/master
...
Syncing back from master
2019-09-08 13:56:20 +03:00
Avi Hayun
e55bf1f697
Updated the gitignore file to include all files which should be ignor… ( #265 )
...
* Updated the gitignore file to include all files which should be ignored due to the following frameworks: Java, Eclipse, Intellij Idea, Maven.
File generated initially using: https://www.gitignore.io/
This submit attempts to solve: #260
2019-09-05 08:34:23 +01:00
Julien Nioche
559d45cbba
use open jdk in travis
2019-09-04 09:29:58 +01:00
Avi Hayun
1c74fc6ca0
The robots.txt test now supports the junit5 parameterizedTest, gaining us the ability to have each robot rule be tested separately, so if one rule fails, the others can still show success in their own unit tests. ( #256 )
...
IMHO it also makes the tests more readable but that is in the eyes of the beholder
This file is smaller by 40 lines, and has 172 unit tests instead of 46 tests
2019-07-23 13:13:45 +01:00
Avi Hayun
2cfddca060
Merge pull request #254 from Chaiavi/NormalizerUsingParameterizedTest
...
Updated the tests to use parameterizedTest, separating each test
2019-07-22 13:59:07 +03:00
Avi Hayun
3597dee5a7
Updated the tests to use parameterizedTest, making the test file more readable; each normalized pair is now a test in itself, which makes it much easier to identify the exact failure points
2019-07-16 21:17:25 +03:00
Avi Hayun
5bda3636d2
Upgraded to Junit v5.5 ( #250 )
...
* Upgraded to Junit v5.5
Updated the annotations and assertions accordingly
* Removed unneeded before and after
* This is a technical debt
Fixing a styling issue I caused about 4 years ago
Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82
* Fixed according to @sebastian-nagel code review
2019-07-15 21:29:03 +03:00
Sebastian Nagel
fba716c765
EffectiveTldFinder to validate returned domain names for length restrictions,
...
fixes #251
2019-07-15 12:39:04 +02:00
Avi Hayun
2aa107e441
fixed a typo
2019-07-11 19:34:24 +03:00
Sebastian Nagel
3355b3ca1d
Merge pull request #241 from sebastian-nagel/cc-240-robotstxt-sitemap-url
...
Robots parser to always handle absolute sitemap URL even without valid base URL
2019-04-04 17:26:12 +02:00
Sebastian Nagel
b449fdf024
Robots parser to always handle absolute sitemap URL even without valid base URL,
...
fixes #240
2019-03-26 15:05:12 +01:00
Sebastian Nagel
ec2ad7c790
Updated README for 1.0 release
2019-03-25 10:30:48 +01:00
Julien Nioche
651ce08afe
CHANGES ready for next cycle of improvements
2019-03-21 21:02:10 +00:00
Julien Nioche
7c6a3f2ccd
[maven-release-plugin] prepare for next development iteration
2019-03-19 22:07:14 +00:00
Julien Nioche
5504239c6d
[maven-release-plugin] prepare release crawler-commons-1.0
2019-03-19 22:06:56 +00:00