mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-24 19:36:06 +02:00
Commit Graph

145 Commits

Author SHA1 Message Date
Sebastian Nagel 1f9e238db4 Prepare release of crawler-commons-1.1
- update CHANGES.txt
- complete KEYS
2021-10-06 21:41:41 +02:00
Sebastian Nagel 0493878f80
Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions (#326)
* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
(fixes #322)
- compare URL strings so that java.net.URL::equals cannot trigger unwanted and potentially slow
  DNS lookups to resolve the host part. Replace:
  - Objects::equals in equals methods of sitemap extensions
  - URL::equals and URL::hashCode in SiteMapIndex and SiteMapURL
- enable check for URL::equals and URL::hashCode in Forbidden API Checker

* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
- avoid NPEs in equals and hashCode methods

* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
- avoid NPE, return null as before if null is passed to SitemapIndex::getSitemap
2021-10-06 12:07:02 +03:00
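The idea in the commit above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class `UrlStringEquality` and its methods are hypothetical names. It compares the string form of two `java.net.URL` objects, so no host name is ever resolved, and it tolerates null values.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Hypothetical helper for illustration; not part of crawler-commons. */
final class UrlStringEquality {

    private UrlStringEquality() {
    }

    /** Null-safe equality on the string form; never calls URL::equals. */
    static boolean sameUrl(URL a, URL b) {
        if (a == null || b == null) {
            return a == b;
        }
        return a.toString().equals(b.toString());
    }

    /** Hash code consistent with sameUrl; 0 for null, never calls URL::hashCode. */
    static int urlHash(URL url) {
        return url == null ? 0 : url.toString().hashCode();
    }

    /** Convenience factory that wraps the checked exception. */
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(spec, e);
        }
    }
}
```

A class holding a URL field could delegate its own `equals` and `hashCode` to these two methods. Note that string comparison treats `http://example.com/` and `http://EXAMPLE.com/` as different, so it is stricter than `URL::equals`.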
Sebastian Nagel 4b45097441 Add a builder API for configuring the BasicURLNormalizer
- allow to normalize host names to Unicode (add to changelog)
2021-10-05 10:21:34 +02:00
Aécio Santos 12e2c389b2
Add a builder API for configuring the BasicURLNormalizer
Usage example:
```
normalizer = BasicURLNormalizer.newBuilder()
  .idnNormalization(IdnNormalization.PUNYCODE)
  .queryParamsToRemove(
    asList("sid", "phpsessid", "sessionid", "jsessionid")
  )
  .build();
```

Closes #321.
2021-10-04 10:15:09 -04:00
Sebastian Nagel 4841242390 Fix XXE vulnerability in Sitemap parser
- add unit test to verify that the parser is not vulnerable
  to XInclude attacks
- apply code formatter
- add changelog entry
2021-10-01 10:07:14 +02:00
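For readers unfamiliar with this class of vulnerability, here is a generic sketch of hardening a JAXP `DocumentBuilderFactory` against XXE and XInclude attacks. The commit message does not say which settings the sitemap parser actually uses, so the settings below are an assumption, and `HardenedXmlParsing` is a hypothetical class. Rejecting every DOCTYPE is the strictest option and may be too strict for some real-world sitemaps.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

/** Hypothetical illustration; not the crawler-commons implementation. */
final class HardenedXmlParsing {

    private HardenedXmlParsing() {
    }

    static DocumentBuilderFactory newHardenedFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Assumption: reject any DOCTYPE, which rules out entity declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Do not process xi:include elements.
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory;
    }

    /** Returns true if the document parses, false if the parser rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = newHardenedFactory().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("XML parser does not support hardening", e);
        }
    }
}
```

With this configuration a document that declares an external entity is rejected at parse time instead of being expanded.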
Sebastian Nagel a10cf2540a Merge branch 'aecio:aecio/query-params-normalization', fixes #246, closes #309
- rebase to master and squash commits
- fix failing sitemaps unit tests with URL filtering using BasicURLNormalizer
  (sort query params in test sitemap)
- CHANGES.txt: updated to follow style, added missing entry for preceding commit
2021-09-21 12:34:39 +02:00
Aécio Santos 94bac65639 Query parameters normalization
- Sort query parameters (fix #246)
- Optionally remove common irrelevant query parameters
2021-09-21 12:02:00 +02:00
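The two steps named in the commit above (sorting query parameters and optionally removing irrelevant ones) can be sketched roughly like this. It is a simplified illustration under stated assumptions: `QueryParamSketch` and `normalizeQuery` are hypothetical names, parameters are sorted as raw `name=value` strings, and percent-encoding is not handled. The real BasicURLNormalizer may differ in all of these details.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration; not the BasicURLNormalizer code. */
final class QueryParamSketch {

    private QueryParamSketch() {
    }

    /**
     * Sorts the name=value pairs of a query string and drops every pair whose
     * (lower-cased) name is listed in namesToRemove.
     */
    static String normalizeQuery(String query, Set<String> namesToRemove) {
        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            if (!namesToRemove.contains(name.toLowerCase(Locale.ROOT))) {
                kept.add(pair);
            }
        }
        Collections.sort(kept);
        return String.join("&", kept);
    }
}
```

Sorting makes URLs that differ only in parameter order normalize to the same string, which helps a crawler avoid fetching the same page twice.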
Avi Hayun 0ea45f4c5c
Normalizing CHANGES.txt (#313)
* This normalization basically adds the [Unit_Name] in front of the issue when it is obvious and when it is missing
Added the [Domains] unit name (as in the java package name)
Didn't touch the issues changelog prior to v0.7

This resolves #270

* Updated according to Sebastian's code review
2021-08-11 17:16:22 +03:00
Avi Hayun 44304581bc
Readme.md Overhaul (#312)
Added Table-of-Contents
Removed issue tracking section
Added Maven installation
Added License
2021-08-09 09:00:06 +03:00
Sebastian Nagel 295a2966fe BasicURLNormalizer: Normalize URL without a scheme, resolves #271
- prefix URL without a scheme with `http://`
- decrease log level for malformed URL (info -> debug)
2020-11-12 16:35:12 +01:00
Sebastian Nagel 8c3691195e [EffectiveTldFinder] upgrade public suffix list / download latest effective_tld_names.dat during Maven build
- update changelog
2020-09-17 15:48:02 +02:00
Sebastian Nagel 643a17d533 [BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exceptions caused by invalid
  percent-encoded sequences are properly caught
2020-09-17 15:30:47 +02:00
Sebastian Nagel e47e1536b0
[sitemaps] Replace priority "NaN" by default value, fixes #296 (#297)
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
2020-07-24 15:49:11 +01:00
Evan Halley 9affd1d992
Adding duration to the map generated by VideoAttributes.asMap (#301) 2020-07-24 15:46:59 +01:00
Sebastian Nagel e16d022dea [Robots] Maximum values (crawl-delay and warnings): document and make visible,
fixes #276
- change visibility (private -> public) of DEFAULT_MAX_CRAWL_DELAY and DEFAULT_MAX_WARNINGS
- document methods to set/get max crawl delay and max warnings
2020-07-15 17:47:32 +02:00
Sebastian Nagel 7f9fd49ba3 [SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors,
fixes #267
2020-07-15 15:49:43 +02:00
Julien Nioche f59d32be94 Release 1.1 2020-06-29 18:08:17 +01:00
Sebastian Nagel 78d7e7e85f
Sitemaps to implement Serializable, fixes #244 (#294)
- make all sitemap classes, including extensions, implement the
  Serializable interface
- extend sitemap parser unit tests to check object serialization
  on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
  etc.)
2020-06-22 12:51:40 +01:00
Sebastian Nagel e9772be011
[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 (#293)
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262

* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
2020-06-19 13:15:48 +01:00
Julien Nioche d4d8d6531c
removed duplicate entry 2020-06-16 14:41:37 +01:00
Evan Halley ed0933f2b3
fixing NewsAttribute.equals(), comparing against that.publicationDate, updating the unit test, added change to CHANGES.txt (#291) 2020-06-16 14:39:05 +01:00
Julien Nioche 5d30196ffb Added #288 to CHANGES 2020-06-15 16:02:12 +01:00
Evan Halley c04e3f17e7
Adding asMap to ExtensionMetadata Interface (#288)
* added abstract method to extension metadata

* implemented asmap in image/link/mobile/news attributes

* implemented asmap in videoattributes

* adding video attributes unit test

* added news attributes unit tests

* unit test for link attributes

* unit tests for image and mobile attributes

* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString

* using constants instead of strings in more attributes

* cleaned up the imports

* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage

* added related issue to the changelog

* reverting change to NewsAttributes.equal, that causes a unit test failure
2020-06-15 15:55:20 +01:00
Avi Hayun cb21c29a57 Upgrade the toString() method of the Base/simple RobotRules #264 2020-04-29 20:00:43 +03:00
Avi Hayun 89ccfc308b
Merge branch 'master' into master 2020-04-29 08:54:15 +03:00
Sebastian Nagel ac3fd3a464 Robots parser: deduplicate sitemap links, fixes #261
- use a LinkedHashSet to store and unify sitemap URLs
2020-03-24 10:31:38 +01:00
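A `LinkedHashSet`, as mentioned in the commit above, drops duplicates while keeping first-seen order. The following is a minimal sketch; the class `SitemapLinkCollector` is a hypothetical stand-in for whatever object the robots parser uses to collect sitemap URLs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of deduplicating sitemap links. */
final class SitemapLinkCollector {

    // LinkedHashSet ignores repeated entries and iterates in insertion order.
    private final Set<String> sitemaps = new LinkedHashSet<>();

    void addSitemap(String url) {
        sitemaps.add(url);
    }

    /** Returns the unique sitemap URLs in the order they were first added. */
    List<String> getSitemaps() {
        return new ArrayList<>(sitemaps);
    }
}
```

Compared with a plain `HashSet`, the order stays stable, so callers that previously received a list see the links in the same sequence as in robots.txt.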
Sebastian Nagel c070453c5e EffectiveTldFinder to log loading of public suffix list, fixes #284
- log the location of the public suffix list (effective_tld_names.dat)
  during initialization
- log errors while loading as errors
- improve documentation about public suffix list and class initialization
2020-02-17 16:41:25 +01:00
Sebastian Nagel 8522cfdd34
[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 (#286)
- check for null values before converting ZonedDateTime to Date
2020-02-17 15:33:38 +00:00
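The null check described in the commit above amounts to something like the following sketch. `NullSafeDates.toDate` is a hypothetical helper, not the actual VideoAttributes code.

```java
import java.time.ZonedDateTime;
import java.util.Date;

/** Hypothetical illustration of a null-safe date conversion. */
final class NullSafeDates {

    private NullSafeDates() {
    }

    /** Converts to java.util.Date, passing null through instead of throwing an NPE. */
    static Date toDate(ZonedDateTime value) {
        return value == null ? null : Date.from(value.toInstant());
    }
}
```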
Sebastian Nagel f83c43bdd4
SimpleRobotRulesParser: Trim log messages, fixes #281 (#282)
- clip strings shown in log message to 1024 chars
2020-02-03 11:25:59 +00:00
Sebastian Nagel 6c271f5fab SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278
- move the counter _numWarnings into the ParseState object local to
  parseContent(...) which makes counting and suppressing warnings thread-safe
- after parsing the warnings count is stored in a ThreadLocal member variable
  of SimpleRobotRulesParser, the value is returned by getNumWarnings()
- use parameterized method calls for logging, avoid concatenating
  log messages beforehand
2020-01-08 13:55:46 +01:00
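The pattern described in the commit above, a counter kept in a per-call state object with the final count published through a `ThreadLocal`, might look roughly like this. The class `WarningCountingParser` and its toy rule (a non-empty line without a colon counts as a warning) are assumptions made for the example; only the names `ParseState`, `parseContent` and `getNumWarnings` come from the commit message.

```java
/** Hypothetical illustration of a thread-safe warnings counter. */
final class WarningCountingParser {

    /** State local to one parseContent call, never shared between threads. */
    private static final class ParseState {
        int numWarnings;
    }

    // Each thread sees only the count of its own most recent parse.
    private final ThreadLocal<Integer> lastNumWarnings = ThreadLocal.withInitial(() -> 0);

    void parseContent(String content) {
        ParseState state = new ParseState();
        for (String line : content.split("\n")) {
            // Toy rule for the sketch: a non-empty line without ':' is a warning.
            if (!line.isEmpty() && !line.contains(":")) {
                state.numWarnings++;
            }
        }
        lastNumWarnings.set(state.numWarnings);
    }

    /** Number of warnings from the last parseContent call on the calling thread. */
    int getNumWarnings() {
        return lastNumWarnings.get();
    }
}
```

Because the counter lives on the stack of each call, two threads sharing one parser instance cannot corrupt each other's counts. The trade-off is that `getNumWarnings()` must be called on the same thread that did the parsing.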
Sebastian Nagel f7cc8b694a Make mvn run parameterized tests, fixes #273
- make test classes public
- upgrade Maven surefire plugin to recent version
- depend on junit-jupiter-engine
2019-11-12 13:49:40 +01:00
Sebastian Nagel 7419e96d74 [BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query,
fixes #247
2019-11-12 13:32:00 +01:00
Sebastian Nagel 2979aaf4aa [BasicNormalizer] Trailing question mark in url query test should be added,
fixes #247
2019-10-17 11:58:36 +02:00
Sebastian Nagel e38bb95b4c Update CHANGES.txt to reflect latest code contributions 2019-10-15 16:34:00 +02:00
Avi Hayun f571f4e8c7 Updated the CHANGES.txt with the latest issues I fixed 2019-09-11 22:38:57 +03:00
Sebastian Nagel b449fdf024 Robots parser to always handle absolute sitemap URL even without valid base URL,
fixes #240
2019-03-26 15:05:12 +01:00
Julien Nioche 651ce08afe CHANGES ready for next cycle of improvements 2019-03-21 21:02:10 +00:00
Julien Nioche 0dc699f3e9 Updated CHANGES pre-1.0 release 2019-03-19 22:04:31 +00:00
Sebastian Nagel e8b598b2e8 [Sitemaps] Unit tests depend on system timezone, fixes #238 (#239)
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
  add note that UTC is assumed if no time zone is contained in
  date string
2019-03-19 15:00:04 +00:00
Sebastian Nagel 40531efe25 EffectiveTldFinder: upgrade public suffix list, implements #219 (#235)
- upgrade the public suffix list to the latest version retrieved right now from
     https://publicsuffix.org/list/public_suffix_list.dat
  resp.
     802c469416/public_suffix_list.dat
2019-03-14 11:22:50 +00:00
Sebastian Nagel 0349fbe1f0 Update changelog for #144/#234 2019-03-14 10:29:17 +01:00
Sebastian Nagel eb74336bd3 Update changelog for #225 and #226 2019-02-21 23:01:32 +01:00
Sebastian Nagel 916415d262
Merge branch 'master' into cc-231-etld-invalid-idns 2019-02-21 22:16:48 +01:00
Sebastian Nagel 40b1c44d68 Update changelog for #231 2019-02-21 22:15:40 +01:00
Sebastian Nagel 67db8bf1be [Sitemaps] Trim Unicode whitespace around URLs, fixes #224 2019-02-20 16:27:16 +01:00
Sebastian Nagel 78e935f83b Update changelog for #213 2019-02-03 13:49:11 +01:00
Sebastian Nagel ab9e33a5f9 Update changelog for #220 and #221 2019-01-18 17:35:33 +01:00
Sebastian Nagel 862af9416f Sitemap extension support
- add extension support to SiteMapTester
- list extension attributes in SiteMapURL.toString()
- update change log
2018-09-28 12:14:02 +02:00
Sebastian Nagel 9318de951f Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (#217)
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
  but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests

* Fix Javadoc error and warnings, update change log

* Remove obsolete dependency to jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
  by updating to Java 8 date and time API
2018-09-24 10:09:58 +01:00
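As a rough sketch of the approach in the commit above: `DateTimeFormatter` instances are immutable and thread-safe, so they can be shared as constants without a `ThreadLocal`, and `ISO_OFFSET_DATE_TIME` already accepts times without seconds. The class `SitemapDateSketch` is hypothetical, and treating a date without a time zone as UTC is an assumption made for this example.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical illustration; not the crawler-commons date parser. */
final class SitemapDateSketch {

    private SitemapDateSketch() {
    }

    /** Parses a W3C-style date, returning null if the string is not understood. */
    static ZonedDateTime parse(String date) {
        if (date == null) {
            return null;
        }
        String trimmed = date.trim();
        try {
            // Handles e.g. 2018-09-24T10:09:58Z and 2018-09-24T10:09+01:00.
            return ZonedDateTime.parse(trimmed, DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException e) {
            // fall through to the date-only format
        }
        try {
            // Assumption: a plain date is interpreted as midnight UTC.
            return LocalDate.parse(trimmed, DateTimeFormatter.ISO_LOCAL_DATE)
                    .atStartOfDay(ZoneOffset.UTC);
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```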
Ken Krugler a5c5091d64 Update CHANGES.txt 2018-07-31 17:14:27 -07:00