mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-24 19:36:06 +02:00
Commit Graph

503 Commits

Author SHA1 Message Date
Sebastian Nagel 295a2966fe BasicURLNormalizer: Normalize URL without a scheme, resolves #271
- prefix URL without a scheme with `http://`
- decrease log level for malformed URL (info -> debug)
2020-11-12 16:35:12 +01:00
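Illustration (not the actual BasicURLNormalizer code): the scheme prefixing described in this commit could be sketched roughly as follows. The class `SchemePrefixSketch`, the method `withDefaultScheme`, and the choice to detect a scheme by looking for `://` are all assumptions made for this example.

```java
import java.util.regex.Pattern;

class SchemePrefixSketch {
    // Assumption: a URL "has a scheme" if it starts with an RFC 3986 scheme
    // name followed by "://". This keeps "example.com:8080/" from being
    // mistaken for a scheme, but it also means "mailto:" style URLs would be
    // prefixed, which a real normalizer would have to handle separately.
    private static final Pattern HAS_SCHEME =
            Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*://");

    /** Returns the URL unchanged if it has a scheme, else prefixes "http://". */
    static String withDefaultScheme(String url) {
        String trimmed = url.trim();
        if (HAS_SCHEME.matcher(trimmed).lookingAt()) {
            return trimmed;
        }
        return "http://" + trimmed;
    }

    public static void main(String[] args) {
        System.out.println(withDefaultScheme("example.com/index.html"));
        System.out.println(withDefaultScheme("https://example.com/"));
    }
}
```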
Sebastian Nagel d63317533b Merge PR #302 (branch rzo1:295-public-suffix-list-download-during-build), resolved #295, closes #302 2020-09-17 15:49:32 +02:00
Sebastian Nagel 8c3691195e [EffectiveTldFinder] upgrade public suffix list / download latest effective_tld_names.dat during Maven build
- update changelog
2020-09-17 15:48:02 +02:00
Sebastian Nagel 643a17d533 [BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exception caused by invalid
  percent-encoded sequences are properly caught
2020-09-17 15:30:47 +02:00
Sebastian Nagel cf1ec67999
[BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exception caused by invalid
  percent-encoded sequences are properly caught
2020-09-15 09:14:09 +01:00
Richard Zowalla 35f0638069 introduces download-maven-plugin to download the 'effective_tld_names.dat' during the maven build
ensures no cached files are used for the release profile
2020-08-06 14:47:58 +02:00
Sebastian Nagel e47e1536b0
[sitemaps] Replace priority "NaN" by default value, fixes #296 (#297)
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
2020-07-24 15:49:11 +01:00
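Illustration (not the project's code): replacing a "NaN" priority by a default might look like the sketch below. The names `PrioritySketch` and `parsePriority` are hypothetical, the default of 0.5 is the one given by the sitemaps.org protocol, and falling back to the default for unparsable input is an extra assumption not stated in the commit.

```java
class PrioritySketch {
    // Default priority as defined by the sitemaps.org protocol.
    static final double DEFAULT_PRIORITY = 0.5;

    /** Parses a sitemap priority, substituting the default for NaN. */
    static double parsePriority(String value) {
        double parsed;
        try {
            // Double.parseDouble accepts the literal "NaN" and returns Double.NaN.
            parsed = Double.parseDouble(value.trim());
        } catch (NumberFormatException e) {
            // Assumption for this sketch: unparsable input also gets the default.
            return DEFAULT_PRIORITY;
        }
        return Double.isNaN(parsed) ? DEFAULT_PRIORITY : parsed;
    }
}
```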
Evan Halley 9affd1d992
Adding duration to the map generated by VideoAttributes.asMap (#301) 2020-07-24 15:46:59 +01:00
Ken Krugler ce9579409e
Merge pull request #299 from sebastian-nagel/cc-276-robots-document-max-crawl-delay
[Robots] Maximum values (crawl-delay and warnings): document and make visible
2020-07-15 09:31:12 -07:00
Sebastian Nagel e16d022dea [Robots] Maximum values (crawl-delay and warnings): document and make visible,
fixes #276
- change visibility (private -> public) of DEFAULT_MAX_CRAWL_DELAY and DEFAULT_MAX_WARNINGS
- document methods to set/get max crawl delay and max warnings
2020-07-15 17:47:32 +02:00
Ken Krugler 29b9d39f48
Merge pull request #298 from sebastian-nagel/cc-267-sitemap-document-strict-allowpartial
[SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors,
2020-07-15 07:25:00 -07:00
Sebastian Nagel 7f9fd49ba3 [SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors,
fixes #267
2020-07-15 15:49:43 +02:00
Julien Nioche 2de8b23635
Update README.md
fixed version for javadoc
2020-06-29 18:13:25 +01:00
Julien Nioche f59d32be94 Release 1.1 2020-06-29 18:08:17 +01:00
Julien Nioche d2c162f1cc [maven-release-plugin] prepare for next development iteration 2020-06-25 09:53:20 +01:00
Julien Nioche 24e9e87db4 [maven-release-plugin] prepare release crawler-commons-1.1 2020-06-25 09:52:34 +01:00
Julien Nioche 1447aa4799 JAVADOC fix in NewsAttributes 2020-06-25 09:49:08 +01:00
Sebastian Nagel 774c5c8092
Improvements to BasicURLNormalizer (#292)
- better percent-encoding of URL paths and queries, fixes #263
- hostnames:
  * convert IDNs from Unicode to Punycode, fixes #248
  * remove trailing dot
- normalize path `/..` to `/`
- also normalize path of file:/ URLs
2020-06-22 13:51:39 +01:00
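Illustration (not the actual BasicURLNormalizer code): two of the hostname steps listed in this commit, removing a trailing dot and converting an IDN from Unicode to Punycode, can be sketched with the JDK's `java.net.IDN`. The class `HostNormalizeSketch` and the method `normalizeHost` are hypothetical, and the lowercasing step is an assumption added for the example.

```java
import java.net.IDN;
import java.util.Locale;

class HostNormalizeSketch {
    /** Strips one trailing dot and converts a Unicode host name to its ASCII (Punycode) form. */
    static String normalizeHost(String host) {
        String h = host;
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        // IDN.toASCII throws IllegalArgumentException for invalid input;
        // a real normalizer would need to catch that.
        return IDN.toASCII(h).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalizeHost("b\u00fccher.de")); // Unicode host "bücher.de"
        System.out.println(normalizeHost("example.com."));
    }
}
```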
Sebastian Nagel 78d7e7e85f
Sitemaps to implement Serializable, fixes #244 (#294)
- make all sitemap classes including extensions to implement the
  Serializable interface
- extend sitemap parser unit tests to check object serialization
  on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
  etc.)
2020-06-22 12:51:40 +01:00
Sebastian Nagel e9772be011
[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 (#293)
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262

* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
2020-06-19 13:15:48 +01:00
Julien Nioche d4d8d6531c
removed duplicate entry 2020-06-16 14:41:37 +01:00
Evan Halley ed0933f2b3
fixing NewsAttribute.equals(), comparing against that.publicationDate, updating the unit test, added change to Changes.txt (#291) 2020-06-16 14:39:05 +01:00
Sebastian Nagel 48aecd2c68
Fix build failure (forbidden apis) (#290)
- make String::format not depend on system locale
2020-06-16 09:59:04 +01:00
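Illustration: the usual way to make `String.format` independent of the system locale is to pass an explicit `Locale`. The sketch below assumes `Locale.ROOT` is the locale of choice; the class and method names are hypothetical and the commit may use a different call site or locale.

```java
import java.util.Locale;

class LocaleFormatSketch {
    /** Formats a number with two decimals, always using '.' as decimal separator. */
    static String formatFixed(double value) {
        // Without the Locale argument, a default locale such as de_DE
        // would render 0.5 as "0,50" instead of "0.50".
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        System.out.println(formatFixed(0.5));
    }
}
```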
Julien Nioche 5d30196ffb Added #288 to CHANGES 2020-06-15 16:02:12 +01:00
Evan Halley c04e3f17e7
Adding asMap to ExtensionMetadata Interface (#288)
* added abstract method to extension metadata

* implemented asmap in image/link/mobile/news attributes

* implemented asmap in videoattributes

* adding video attributes unit test

* added news attributes unit tests

* unit test for link attributes

* unit tests for image and mobile attributes

* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString

* using constants instead of strings in more attributes

* cleaned up the imports

* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage

* added related issue to the changelog

* reverting change to NewsAttributes.equal, that causes a unit test failure
2020-06-15 15:55:20 +01:00
Ken Krugler 2d727b2cfa
Merge pull request #268 from Chaiavi/BetterRobotsToString
Better robots to string
2020-04-30 07:02:16 -07:00
Avi Hayun cb21c29a57 Upgrade the toString() method of the Base/simple RobotRules #264 2020-04-29 20:00:43 +03:00
Avi Hayun 95207d0928
Merge pull request #4 from Chaiavi/master
Updating my fork
2020-04-29 09:08:43 +03:00
Avi Hayun 0265b98254
Merge branch 'BetterRobotsToString' into master 2020-04-29 08:59:49 +03:00
Avi Hayun 1e4c673057
Merge pull request #2 from crawler-commons/master
Updates till 2020-04-29
2020-04-29 08:54:48 +03:00
Avi Hayun 89ccfc308b
Merge branch 'master' into master 2020-04-29 08:54:15 +03:00
Sebastian Nagel 636a286ddb
Merge pull request #280 from sebastian-nagel/cc-261-robots-parser-dedup-sitemaps
Robots parser: deduplicate sitemap links, fixes #261
2020-03-24 10:34:17 +01:00
Sebastian Nagel ac3fd3a464 Robots parser: deduplicate sitemap links, fixes #261
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
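Illustration (not the parser's actual code): a `LinkedHashSet` drops duplicates while keeping first-seen order, which is what makes it suitable for unifying sitemap URLs. The class `SitemapDedupSketch` and the method `dedup` are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class SitemapDedupSketch {
    /** Returns the URLs without duplicates, preserving the order of first occurrence. */
    static List<String> dedup(List<String> sitemapUrls) {
        return new ArrayList<>(new LinkedHashSet<>(sitemapUrls));
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "https://example.com/sitemap1.xml",
                "https://example.com/sitemap2.xml",
                "https://example.com/sitemap1.xml");
        System.out.println(dedup(urls));
    }
}
```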
Sebastian Nagel 80fb6e3b68
Merge pull request #285 from sebastian-nagel/cc-284-log-loading-public-suffix-list
EffectiveTldFinder to log loading of public suffix list #284
2020-03-24 10:23:08 +01:00
Sebastian Nagel c070453c5e EffectiveTldFinder to log loading of public suffix list, fixes #284
- log the location of the public suffix list (effective_tld_names.dat)
  during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel 8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 (#286)
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
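Illustration (not the VideoAttributes code): a null check before converting a `ZonedDateTime` to a `Date` can be as small as the sketch below. The class `DateConversionSketch` and the method `toDate` are hypothetical.

```java
import java.time.ZonedDateTime;
import java.util.Date;

class DateConversionSketch {
    /** Converts to java.util.Date, returning null instead of throwing an NPE for null input. */
    static Date toDate(ZonedDateTime dateTime) {
        if (dateTime == null) {
            return null;
        }
        return Date.from(dateTime.toInstant());
    }
}
```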
Sebastian Nagel f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 (#282)
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
Sebastian Nagel df3d3a446a
Merge pull request #279 from sebastian-nagel/cc-278-simple-robots-parser-num-warnings-thread-safe
SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
2020-01-08 14:08:08 +01:00
Sebastian Nagel 6c271f5fab SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
- move the counter _numWarnings into the ParseState object local to
  parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
  of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
  log messages beforehand
2020-01-08 13:55:46 +01:00
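Illustration (a simplified sketch, not SimpleRobotRulesParser itself): the pattern described in this commit keeps the counter in a state object local to each parse call and publishes the final count through a `ThreadLocal`. Everything below, including the class `WarningCountSketch` and the rule that a non-empty line without a colon counts as a warning, is an assumption made for the example.

```java
class WarningCountSketch {
    /** Per-call state: lives on the calling thread's stack, so no sharing between threads. */
    private static final class ParseState {
        int numWarnings;
    }

    // Each thread sees the warning count of its own most recent parse.
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    void parseContent(String content) {
        ParseState state = new ParseState();
        for (String line : content.split("\n")) {
            // Assumed toy rule: a non-empty line without ':' is malformed.
            if (!line.isEmpty() && !line.contains(":")) {
                state.numWarnings++;
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

Because the counter never lives in a shared field, two threads using the same parser instance cannot overwrite each other's counts.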
Sebastian Nagel c9c0ac6eda
Merge pull request #274 from sebastian-nagel/cc-273-maven-exec-parameterized-tests
Make mvn run parameterized tests, fixes #273
2019-11-12 13:55:17 +01:00
Sebastian Nagel f7cc8b694a Make mvn run parameterized tests, fixes #273
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel 8509bdad01
Merge pull request #275 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
[BasicNormalizer] Empty path before query to be normalized to `/`
2019-11-12 13:34:18 +01:00
Sebastian Nagel 7419e96d74 [BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
fixes #247
2019-11-12 13:32:00 +01:00
Avi Hayun 0d9287a3f4
Merge pull request #272 from sebastian-nagel/cc-247-urlnormalizer-trailing-question-mark
[BasicNormalizer] Trailing question mark in url query test should be added
2019-10-17 13:39:21 +03:00
Sebastian Nagel 2979aaf4aa [BasicNormalizer] Trailing question mark in url query test should be added,
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel e38bb95b4c Update CHANGES.txt to reflect latest code contributions 2019-10-15 16:34:00 +02:00
Sebastian Nagel 7e5347eca6
Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check
EffectiveTldFinder to validate returned domain names for length restrictions
2019-10-15 16:24:47 +02:00
Avi Hayun f2b005ef1d Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:48:53 +03:00
Avi Hayun f571f4e8c7 Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:38:57 +03:00
Avi Hayun bb7523d978 Updated the toString method to include a list of sitemaps if there are less than 10 sitemaps in the robots file 2019-09-11 18:22:41 +03:00