mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-03 22:26:15 +02:00
Commit Graph

622 Commits

Author SHA1 Message Date
Sebastian Nagel 24da43e4c2 [maven-release-plugin] prepare for next development iteration 2021-10-06 22:24:07 +02:00
Sebastian Nagel b5b500f58b [maven-release-plugin] prepare release crawler-commons-1.2 2021-10-06 22:24:00 +02:00
Sebastian Nagel 1f9e238db4 Prepare release of crawler-commons-1.1
- update CHANGES.txt
- complete KEYS
2021-10-06 21:41:41 +02:00
Sebastian Nagel 0493878f80
Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions (#326)
* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
(fixes #322)
- compare URL strings to avoid that java.net.URL::equals triggers unwanted and potentially slow
  DNS lookups to resolve the host part. Replace:
  - Objects::equals in equals methods of sitemap extensions
  - URL::equals and URL::hashCode in SiteMapIndex and SiteMapURL
- enable check for URL::equals and URL::hashCode in Forbidden API Checker

* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
- avoid NPEs in equals and hashCode methods

* Sitemaps: avoid calling java.net.URL::equals in equals method of sitemaps and sitemap extensions
- avoid NPE, return null as before if null is passed to SitemapIndex::getSitemap
2021-10-06 12:07:02 +03:00
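Note on the commit above: the idea of comparing URL strings instead of calling java.net.URL::equals can be sketched as follows. This is a minimal illustration, not the library's code: the class `UrlKey` and its factory method `of` are hypothetical names, and the real SiteMapIndex and SiteMapURL classes may implement the comparison differently.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Objects;

/** Hypothetical wrapper (not a crawler-commons class) holding a URL. */
final class UrlKey {
    private final URL url;

    UrlKey(URL url) {
        this.url = url;
    }

    /** Convenience factory for the example; wraps the checked exception. */
    static UrlKey of(String spec) {
        try {
            return new UrlKey(new URL(spec));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof UrlKey)) {
            return false;
        }
        UrlKey that = (UrlKey) o;
        // Compare string forms: URL::equals may resolve host names via DNS.
        return Objects.equals(asString(url), asString(that.url));
    }

    @Override
    public int hashCode() {
        // Hash the string form too, so that URL::hashCode is never called.
        return Objects.hashCode(asString(url));
    }

    /** Null-safe conversion, so equals and hashCode cannot throw an NPE. */
    private static String asString(URL u) {
        return u == null ? null : u.toString();
    }
}
```

String comparison is stricter than URL::equals (two host names resolving to the same address are no longer equal), which is the trade-off accepted in exchange for avoiding network access.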
Sebastian Nagel ec1f2e54ec
Merge pull request #324 from aecio/issue-321-builder
Add a builder API for configuring the BasicURLNormalizer
2021-10-05 10:22:14 +02:00
Sebastian Nagel 4b45097441 Add a builder API for configuring the BasicURLNormalizer
- allow to normalize host names to Unicode (add to changelog)
2021-10-05 10:21:34 +02:00
Sebastian Nagel 10d3021055 Add a builder API for configuring the BasicURLNormalizer
- allow to normalize host names to Unicode
2021-10-04 17:24:26 +02:00
Aécio Santos 12e2c389b2
Add a builder API for configuring the BasicURLNormalizer
Usage example:
```
normalizer = BasicURLNormalizer.newBuilder()
  .idnNormalization(IdnNormalization.PUNYCODE)
  .queryParamsToRemove(
    asList("sid", "phpsessid", "sessionid", "jsessionid")
  )
  .build();
```

Closes #321.
2021-10-04 10:15:09 -04:00
Sebastian Nagel 47ee966024 Merge branch 'kovyrin/sitemap-xxe'
Fix XXE vulnerability in Sitemap parser #323
2021-10-01 10:10:54 +02:00
Sebastian Nagel 4841242390 Fix XXE vulnerability in Sitemap parser
- add unit test to verify that the parser is not vulnerable
  to XInclude attacks
- apply code formatter
- add changelog entry
2021-10-01 10:07:14 +02:00
Oleksiy Kovyrin 2b66ad2060 Do not use a temporary file 2021-09-30 17:38:35 -04:00
Oleksiy Kovyrin 7555bcbbbe Disable entity resolution features in Java SAX XML parser to avoid XXE vulnerabilities while parsing Sitemaps 2021-09-29 12:56:17 -04:00
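Note on the commit above: disabling entity resolution in a Java SAX parser typically looks like the sketch below. The exact set of features changed by this commit is not shown in the log, so the feature list here is an assumption based on the standard SAX and Xerces feature URIs; the class `HardenedSax` and its methods are hypothetical names.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper (not a crawler-commons class). */
final class HardenedSax {
    static final String EXTERNAL_GENERAL_ENTITIES =
            "http://xml.org/sax/features/external-general-entities";
    static final String EXTERNAL_PARAMETER_ENTITIES =
            "http://xml.org/sax/features/external-parameter-entities";
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Returns a factory that does not resolve external entities or DTDs. */
    static SAXParserFactory newFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            factory.setFeature(EXTERNAL_GENERAL_ENTITIES, false);
            factory.setFeature(EXTERNAL_PARAMETER_ENTITIES, false);
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
        } catch (ParserConfigurationException | SAXException e) {
            // Fail loudly rather than fall back to an unprotected parser.
            throw new IllegalStateException("Cannot harden SAX parser", e);
        }
        // XInclude processing is off by default; set it explicitly anyway.
        factory.setXIncludeAware(false);
        return factory;
    }

    /** Reads a feature back, wrapping the checked exceptions. */
    static boolean isEnabled(SAXParserFactory factory, String feature) {
        try {
            return factory.getFeature(feature);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Parsers created with `HardenedSax.newFactory().newSAXParser()` inherit these settings.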
Sebastian Nagel a10cf2540a Merge branch 'aecio:aecio/query-params-normalization', fixes #246, closes #309
- rebase to master and squash commits
- fix failing sitemaps unit tests with URL filtering using BasicURLNormalizer
  (sort query params in test sitemap)
- CHANGES.txt: updated to follow style, added missing entry for preceding commit
2021-09-21 12:34:39 +02:00
Aécio Santos 94bac65639 Query parameters normalization
- Sort query parameters (fix #246)
- Allows to (optionally) remove common irrelevant query parameters
2021-09-21 12:02:00 +02:00
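Note on the commit above: sorting query parameters makes URLs that differ only in parameter order normalize to the same string. The sketch below shows the idea under a simplifying assumption: it sorts the raw `name=value` pairs lexicographically, which may differ from the ordering BasicURLNormalizer actually applies. The class `QuerySort` is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical helper (not a crawler-commons class). */
final class QuerySort {
    /**
     * Sorts the "&"-separated parameters of a query string.
     * The query is passed without the leading "?".
     */
    static String sortQuery(String query) {
        if (query == null || query.isEmpty()) {
            return query;
        }
        String[] params = query.split("&");
        // Lexicographic order of the raw "name=value" pairs.
        Arrays.sort(params);
        return String.join("&", params);
    }
}
```

With this sketch, `b=2&a=1` and `a=1&b=2` both become `a=1&b=2`.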
Sebastian Nagel 7a8bbb6ba3
Merge pull request #307 from sebastian-nagel/cc-305-sitemaps-normalize-urls
Allow to normalize URLs in sitemaps, resolves #305
2021-08-14 13:45:21 +02:00
Avi Hayun 0ea45f4c5c
Normalizing CHANGES.txt (#313)
* This normalization basically adds the [Unit_Name] in front of the issue when it is obvious and when it is missing
Added the [Domains] unit name (as in the java package name)
Didn't touch the issues changelog prior to v0.7

This resolves #270

* Updated according to Sebastian's code review
2021-08-11 17:16:22 +03:00
Avi Hayun 44304581bc
Readme.md Overhaul (#312)
Added Table-of-Contents
Removed issue tracking section
Added Maven installation
Added License
2021-08-09 09:00:06 +03:00
Sebastian Nagel 386608f7e8 Allow to normalize URLs in sitemaps, resolves #305
- extend SiteMapParser by methods to register a URLFilter (function)
  used to normalize or filter (if null is returned) URLs found in
  sitemaps
- implement URL filtering in sitemap parsers / XML handlers
- add unit tests to verify URL filtering for text and XML sitemaps
2020-12-08 15:28:58 +01:00
Sebastian Nagel 9630f4c09c
Merge pull request #306 from sebastian-nagel/cc-271-urlnormalizer-basic-url-without-scheme
Normalize URL without a scheme, resolves #271
2020-11-13 12:15:04 +01:00
Sebastian Nagel 295a2966fe BasicURLNormalizer: Normalize URL without a scheme, resolves #271
- prefix URL without a scheme with `http://`
- decrease log level for malformed URL (info -> debug)
2020-11-12 16:35:12 +01:00
Sebastian Nagel d63317533b Merge PR #302 (branch rzo1:295-public-suffix-list-download-during-build), resolved #295, closes #302 2020-09-17 15:49:32 +02:00
Sebastian Nagel 8c3691195e [EffectiveTldFinder] upgrade public suffix list / download latest effective_tld_names.dat during Maven build
- update changelog
2020-09-17 15:48:02 +02:00
Sebastian Nagel 643a17d533 [BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exception caused by invalid
  percent-encoded sequences are properly caught
2020-09-17 15:30:47 +02:00
Sebastian Nagel cf1ec67999
[BasicNormalizer] decode percent-encoded host names, fixes #303 (#304)
- extend unit tests to host names including IDNs
- add unit test to verify that runtime exception caused by invalid
  percent-encoded sequences are properly caught
2020-09-15 09:14:09 +01:00
Richard Zowalla 35f0638069 introduces download-maven-plugin to download the 'effective_tld_names.dat' during the maven build
ensures no cached files are used for the release profile
2020-08-06 14:47:58 +02:00
Sebastian Nagel e47e1536b0
[sitemaps] Replace priority "NaN" by default value, fixes #296 (#297)
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
2020-07-24 15:49:11 +01:00
Evan Halley 9affd1d992
Adding duration to the map generated by VideoAttributes.asMap (#301) 2020-07-24 15:46:59 +01:00
Ken Krugler ce9579409e
Merge pull request #299 from sebastian-nagel/cc-276-robots-document-max-crawl-delay
[Robots] Maximum values (crawl-delay and warnings): document and make visible
2020-07-15 09:31:12 -07:00
Sebastian Nagel e16d022dea [Robots] Maximum values (crawl-delay and warnings): document and make visible,
fixes #276
- change visibility (private -> public) of DEFAULT_MAX_CRAWL_DELAY and DEFAULT_MAX_WARNINGS
- document methods to set/get max crawl delay and max warnings
2020-07-15 17:47:32 +02:00
Ken Krugler 29b9d39f48
Merge pull request #298 from sebastian-nagel/cc-267-sitemap-document-strict-allowpartial
[SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors,
2020-07-15 07:25:00 -07:00
Sebastian Nagel 7f9fd49ba3 [SiteMapParser] Document options *strict* and *allowPartial* in SiteMapParser constructors,
fixes #267
2020-07-15 15:49:43 +02:00
Julien Nioche 2de8b23635
Update README.md
fixed version for javadoc
2020-06-29 18:13:25 +01:00
Julien Nioche f59d32be94 Release 1.1 2020-06-29 18:08:17 +01:00
Julien Nioche d2c162f1cc [maven-release-plugin] prepare for next development iteration 2020-06-25 09:53:20 +01:00
Julien Nioche 24e9e87db4 [maven-release-plugin] prepare release crawler-commons-1.1 2020-06-25 09:52:34 +01:00
Julien Nioche 1447aa4799 JAVADOC fix in NewsAttributes 2020-06-25 09:49:08 +01:00
Sebastian Nagel 774c5c8092
Improvements to BasicURLNormalizer (#292)
- better percent-encoding of URL paths and queries, fixes #263
- hostnames:
  * convert IDNs from Unicode to Punycode, fixes #248
  * remove trailing dot
- normalize path `/..` to `/`
- also normalize path of file:/ URLs
2020-06-22 13:51:39 +01:00
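Note on the commit above: two of the listed host name steps (Unicode to Punycode, trailing dot removal) can be illustrated with the JDK's java.net.IDN class. This is a sketch under assumptions, not the normalizer's code: the class `HostNormalization` is a hypothetical name, and the lowercasing step is added here for illustration only.

```java
import java.net.IDN;
import java.util.Locale;

/** Hypothetical helper (not a crawler-commons class). */
final class HostNormalization {
    /** Removes one trailing dot and converts Unicode labels to Punycode. */
    static String normalizeHost(String host) {
        String h = host.endsWith(".")
                ? host.substring(0, host.length() - 1)
                : host;
        // IDN.toASCII converts Unicode labels to their "xn--" ASCII form
        // and leaves ASCII-only labels as they are, hence the lowercasing.
        return IDN.toASCII(h).toLowerCase(Locale.ROOT);
    }
}
```

For example, the host `b\u00fccher.example.` (a u-umlaut in the first label, plus a trailing dot) becomes `xn--bcher-kva.example`.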
Sebastian Nagel 78d7e7e85f
Sitemaps to implement Serializable, fixes #244 (#294)
- make all sitemap classes including extensions to implement the
  Serializable interface
- extend sitemap parser unit tests to check object serialization
  on various types of sitemaps (index, Atom/RSS feeds, video sitemaps,
  etc.)
2020-06-22 12:51:40 +01:00
Sebastian Nagel e9772be011
[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 (#293)
* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262

* [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262
- add unit test
2020-06-19 13:15:48 +01:00
Julien Nioche d4d8d6531c
removed duplicate entry 2020-06-16 14:41:37 +01:00
Evan Halley ed0933f2b3
fixing NewsAttribute.equals(), comparing against that.publicationDate, updating the unit test, added change to Changes.txt (#291) 2020-06-16 14:39:05 +01:00
Sebastian Nagel 48aecd2c68
Fix build failure (forbidden apis) (#290)
- make String::format not depend on system locale
2020-06-16 09:59:04 +01:00
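Note on the commit above: String::format without an explicit locale uses the system default, so the decimal separator of a formatted number can change between machines. Passing a fixed locale avoids this. The sketch below is an illustration only: the class `LocaleSafeFormat` and the method `formatPriority` are hypothetical names, and the log does not show which format calls were changed.

```java
import java.util.Locale;

/** Hypothetical helper (not a crawler-commons class). */
final class LocaleSafeFormat {
    /** Formats with Locale.ROOT, so the decimal separator is always ".". */
    static String formatPriority(double priority) {
        return String.format(Locale.ROOT, "%.1f", priority);
    }

    /** Same format with an explicit locale, to show the difference. */
    static String formatPriority(Locale locale, double priority) {
        return String.format(locale, "%.1f", priority);
    }
}
```

Calling `formatPriority(0.5)` yields `0.5` on every system, whereas the same format with `Locale.GERMANY` yields `0,5`.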
Julien Nioche 5d30196ffb Added #288 to CHANGES 2020-06-15 16:02:12 +01:00
Evan Halley c04e3f17e7
Adding asMap to ExtensionMetadata Interface (#288)
* added abstract method to extension metadata

* implemented asmap in image/link/mobile/news attributes

* implemented asmap in videoattributes

* adding video attributes unit test

* added news attributes unit tests

* unit test for link attributes

* unit tests for image and mobile attributes

* added constants to news and link attributes
fixing a small issue in NewsAttributes.toString

* using constants instead of strings in more attributes

* cleaned up the imports

* decreasing the visibility of LinkAttributes.PARAMS_PREFIX
adding a comment explaining its usage

* added related issue to the changelog

* reverting change to NewsAttributes.equal, that causes a unit test failure
2020-06-15 15:55:20 +01:00
Ken Krugler 2d727b2cfa
Merge pull request #268 from Chaiavi/BetterRobotsToString
Better robots to string
2020-04-30 07:02:16 -07:00
Avi Hayun cb21c29a57 Upgrade the toString() method of the Base/simple RobotRules #264 2020-04-29 20:00:43 +03:00
Avi Hayun 95207d0928
Merge pull request #4 from Chaiavi/master
Updating my fork
2020-04-29 09:08:43 +03:00
Avi Hayun 0265b98254
Merge branch 'BetterRobotsToString' into master 2020-04-29 08:59:49 +03:00
Avi Hayun 1e4c673057
Merge pull request #2 from crawler-commons/master
Updates till 2020-04-29
2020-04-29 08:54:48 +03:00
Avi Hayun 89ccfc308b
Merge branch 'master' into master 2020-04-29 08:54:15 +03:00