Sebastian Nagel
295a2966fe
BasicURLNormalizer: Normalize URL without a scheme, resolves #271
...
- prefix URL without a scheme with `http://`
- decrease log level for malformed URL (info -> debug)
2020-11-12 16:35:12 +01:00
Sebastian Nagel
d63317533b
Merge PR #302 (branch rzo1:295-public-suffix-list-download-during-build), resolved #295, closes #302
2020-09-17 15:49:32 +02:00
Sebastian Nagel
8c3691195e
[EffectiveTldFinder] upgrade public suffix list / download latest effective_tld_names.dat during Maven build
...
- update changelog
2020-09-17 15:48:02 +02:00
Sebastian Nagel
643a17d533
[BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
...
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exceptions caused by invalid
percent-encoded sequences are properly caught
2020-09-17 15:30:47 +02:00
Sebastian Nagel
cf1ec67999
[BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
...
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exceptions caused by invalid
percent-encoded sequences are properly caught
2020-09-15 09:14:09 +01:00
Richard Zowalla
35f0638069
introduces download-maven-plugin to download 'effective_tld_names.dat' during the Maven build
...
ensures no cached files are used for the release profile
2020-08-06 14:47:58 +02:00
Sebastian Nagel
e47e1536b0
[sitemaps] Replace priority "NaN" by default value, fixes #296 (#297)
...
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
2020-07-24 15:49:11 +01:00
Evan Halley
9affd1d992
Adding duration to the map generated by VideoAttributes.asMap (#301)
2020-07-24 15:46:59 +01:00
Ken Krugler
ce9579409e
Merge pull request #299 from sebastian-nagel/cc-276-robots-document-max-crawl-delay
...
[Robots] Maximum values (crawl-delay and warnings): document and make visible
2020-07-15 09:31:12 -07:00
Sebastian Nagel
e16d022dea
[Robots] Maximum values (crawl-delay and warnings): document and make visible, fixes #276
...
- change visibility (private -> public) of DEFAULT_MAX_CRAWL_DELAY and DEFAULT_MAX_WARNINGS
- document methods to set/get max crawl delay and max warnings
2020-07-15 17:47:32 +02:00
Ken Krugler
29b9d39f48
Merge pull request #298 from sebastian-nagel/cc-267-sitemap-document-strict-allowpartial
...
[SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors,
2020-07-15 07:25:00 -07:00
Sebastian Nagel
7f9fd49ba3
[SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors, fixes #267
2020-07-15 15:49:43 +02:00
Julien Nioche
2de8b23635
Update README.md
...
fixed version for javadoc
2020-06-29 18:13:25 +01:00
Julien Nioche
f59d32be94
Release 1.1
2020-06-29 18:08:17 +01:00
Julien Nioche
d2c162f1cc
[maven-release-plugin] prepare for next development iteration
2020-06-25 09:53:20 +01:00
Julien Nioche
24e9e87db4
[maven-release-plugin] prepare release crawler-commons-1.1
2020-06-25 09:52:34 +01:00
Julien Nioche
1447aa4799
JAVADOC fix in NewsAttributes
2020-06-25 09:49:08 +01:00
Sebastian Nagel
774c5c8092
Improvements to BasicURLNormalizer (#292)
...
- better percent-encoding of URL paths and queries, fixes #263
- hostnames:
* convert IDNs from Unicode to Punycode, fixes #248
* remove trailing dot
- normalize path `/..` to `/`
- also normalize path of file:/ URLs
2020-06-22 13:51:39 +01:00
Sebastian Nagel
78d7e7e85f
Sitemaps to implement Serializable, fixes #244 (#294)
...
- make all sitemap classes, including extensions, implement the
Serializable interface
- extend sitemap parser unit tests to check object serialization
on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
etc.)
2020-06-22 12:51:40 +01:00
Sebastian Nagel
e9772be011
[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 (#293)
...
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
2020-06-19 13:15:48 +01:00
Julien Nioche
d4d8d6531c
removed duplicate entry
2020-06-16 14:41:37 +01:00
Evan Halley
ed0933f2b3
fixing NewsAttributes.equals(), comparing against that.publicationDate, updating the unit test, added change to Changes.txt (#291)
2020-06-16 14:39:05 +01:00
Sebastian Nagel
48aecd2c68
Fix build failure (forbidden apis) (#290)
...
- make String::format not depend on system locale
2020-06-16 09:59:04 +01:00
Julien Nioche
5d30196ffb
Added #288 to CHANGES
2020-06-15 16:02:12 +01:00
Evan Halley
c04e3f17e7
Adding asMap to ExtensionMetadata Interface (#288)
...
* added abstract method to extension metadata
* implemented asMap in image/link/mobile/news attributes
* implemented asMap in VideoAttributes
* adding video attributes unit test
* added news attributes unit tests
* unit test for link attributes
* unit tests for image and mobile attributes
* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString
* using constants instead of strings in more attributes
* cleaned up the imports
* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage
* added related issue to the changelog
* reverting change to NewsAttributes.equals, which causes a unit test failure
2020-06-15 15:55:20 +01:00
Ken Krugler
2d727b2cfa
Merge pull request #268 from Chaiavi/BetterRobotsToString
...
Better robots to string
2020-04-30 07:02:16 -07:00
Avi Hayun
cb21c29a57
Upgrade the toString() method of the Base/simple RobotRules #264
2020-04-29 20:00:43 +03:00
Avi Hayun
95207d0928
Merge pull request #4 from Chaiavi/master
...
Updating my fork
2020-04-29 09:08:43 +03:00
Avi Hayun
0265b98254
Merge branch 'BetterRobotsToString' into master
2020-04-29 08:59:49 +03:00
Avi Hayun
1e4c673057
Merge pull request #2 from crawler-commons/master
...
Updates till 2020-04-29
2020-04-29 08:54:48 +03:00
Avi Hayun
89ccfc308b
Merge branch 'master' into master
2020-04-29 08:54:15 +03:00
Sebastian Nagel
636a286ddb
Merge pull request #280 from sebastian-nagel/cc-261-robots-parser-dedup-sitemaps
...
Robots parser: deduplicate sitemap links, fixes #261
2020-03-24 10:34:17 +01:00
Sebastian Nagel
ac3fd3a464
Robots parser: deduplicate sitemap links, fixes #261
...
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
Sebastian Nagel
80fb6e3b68
Merge pull request #285 from sebastian-nagel/cc-284-log-loading-public-suffix-list
...
EffectiveTldFinder to log loading of public suffix list #284
2020-03-24 10:23:08 +01:00
Sebastian Nagel
c070453c5e
EffectiveTldFinder to log loading of public suffix list, fixes #284
...
- log the location of the public suffix list (effective_tld_names.dat)
during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel
8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 (#286)
...
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
Sebastian Nagel
f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 (#282)
...
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
Sebastian Nagel
df3d3a446a
Merge pull request #279 from sebastian-nagel/cc-278-simple-robots-parser-num-warnings-thread-safe
...
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
2020-01-08 14:08:08 +01:00
Sebastian Nagel
6c271f5fab
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
...
- move the counter _numWarnings into the ParseState object local to
parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing, the warnings count is stored in a ThreadLocal member variable
of SimpleRobotRulesParser; the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
log messages beforehand
2020-01-08 13:55:46 +01:00
Sebastian Nagel
c9c0ac6eda
Merge pull request #274 from sebastian-nagel/cc-273-maven-exec-parameterized-tests
...
Make mvn run parameterized tests, fixes #273
2019-11-12 13:55:17 +01:00
Sebastian Nagel
f7cc8b694a
Make mvn run parameterized tests, fixes #273
...
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel
8509bdad01
Merge pull request #275 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
...
[BasicNormalizer] Empty path before query to be normalized to `/`
2019-11-12 13:34:18 +01:00
Sebastian Nagel
7419e96d74
[BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query, fixes #247
2019-11-12 13:32:00 +01:00
Avi Hayun
0d9287a3f4
Merge pull request #272 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
...
[BasicNormalizer] Trailing question mark in url query test should be added
2019-10-17 13:39:21 +03:00
Sebastian Nagel
2979aaf4aa
[BasicNormalizer] Trailing question mark in url query test should be added, fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel
e38bb95b4c
Update CHANGES.txt to reflect latest code contributions
2019-10-15 16:34:00 +02:00
Sebastian Nagel
7e5347eca6
Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check
...
EffectiveTldFinder to validate returned domain names for length restrictions
2019-10-15 16:24:47 +02:00
Avi Hayun
f2b005ef1d
Updated the CHANGES.txt with the latest issues I fixed
2019-09-11 22:48:53 +03:00
Avi Hayun
f571f4e8c7
Updated the CHANGES.txt with the latest issues I fixed
2019-09-11 22:38:57 +03:00
Avi Hayun
bb7523d978
Updated the toString method to include a list of sitemaps if there are fewer than 10 sitemaps in the robots file
2019-09-11 18:22:41 +03:00