Crawler-Commons Change Log
Current Development 1.5-SNAPSHOT (yyyy-mm-dd)
- [Sitemaps] Google Sitemap PageMap extensions (josepowera, sebastian-nagel, Richard Zowalla, jnioche) #388, #442
- [Domains] Installation of a gzip-compressed public suffix list from Maven cache breaks EffectiveTldFinder (sebastian-nagel, Richard Zowalla) #441, #443
- Upgrade dependencies (dependabot) #437, #444
- Upgrade Maven plugins (dependabot) #434, #438, #439
Release 1.4 (2023-07-13)
- [Robots.txt] Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (sebastian-nagel, Richard Zowalla) #245, #360
- [Robots.txt] Close groups of rules as defined in RFC 9309 (kkrugler, garyillyes, jnioche, sebastian-nagel) #114, #390, #430
- [Robots.txt] Empty disallow statement not to clear other rules (sebastian-nagel, jnioche) #422, #424
- [Robots.txt] SimpleRobotRulesParser main() to follow five redirects (sebastian-nagel, jnioche) #428
- [Robots.txt] Add more spelling variants and typos of robots.txt directives (sebastian-nagel, jnioche) #425
- [Robots.txt] Document effect of rules merging in combination with multiple agent names (sebastian-nagel, Richard Zowalla) #423, #426
- [Robots.txt] Pass empty collection of agent names to select rules for any robot (wildcard user-agent name) (sebastian-nagel, Richard Zowalla) #427
- [Robots.txt] Rename default user-agent / robot name in unit tests (sebastian-nagel, Richard Zowalla) #429
- [Robots.txt] Add unit tests based on examples in RFC 9309 (sebastian-nagel, Richard Zowalla) #420
- [BasicNormalizer] Query parameters normalization in BasicURLNormalizer (aecio, sebastian-nagel, Richard Zowalla) #308, #421
- [Robots.txt] Deduplicate robots rules before matching (sebastian-nagel, jnioche) #416
- [Robots.txt] SimpleRobotRulesParser main to use the new API method (sebastian-nagel, jnioche) #413
- Generate JaCoCo reports when testing (jnioche) #409, #412
- Push Code Coverage to Coveralls (Richard Zowalla, jnioche) #414
- [Robots.txt] Path analysis bug with URL decoding if allow/disallow path contains escaped wildcard characters (tkalistratov, sebastian-nagel, Richard Zowalla) #195, #408
- [Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters (sebastian-nagel, Richard Zowalla, aecio) #389, #401
- [Robots.txt] Improve readability of robots.txt unit tests (sebastian-nagel, Richard Zowalla) #383
- Upgrade project to use Java 11 (Avi Hayun, Richard Zowalla, aecio, sebastian-nagel) #320, #376
- [Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (sebastian-nagel, Richard Zowalla) #362
- [Robots.txt] Matching user-agent names does not conform to robots.txt RFC (YossiTamari, sebastian-nagel) #192
- [Robots.txt] Improve robots check draft RFC compliance (Eduardo Jimenez) #351
- Upgrade dependencies (dependabot) #379, #384, #394, #399, #404, #419
- Upgrade Maven plugins (dependabot) #377, #381, #386, #396, #397, #398, #400, #402, #403, #405, #406, #407, #415, #418
- Javadoc: ensure JavaScript search is working (sebastian-nagel, Richard Zowalla, aecio) #378, #380
Release 1.3 (2022-07-19)
- [Sitemaps] Disable support for DTDs in XML sitemaps and feeds by default (Kenneth Wong) #371
- Migrate Continuous Integration from Travis to GitHub Actions (Valery Yatsynovich) #333
- Upgrade dependencies (dependabot, Richard Zowalla) #334, #339, #345, #346, #347, #350, #354, #361, #369
- Upgrade Maven plugins (dependabot, Richard Zowalla, sebastian-nagel) #328, #329, #330, #331, #335, #336, #337, #338, #340, #341, #343, #356, #363, #364, #366, #373, #374
- Update pom.xml to address Maven warnings and deprecations (sebastian-nagel, Richard Zowalla, Avi Hayun) #342
- Enable Dependabot (Valery Yatsynovich) #327
- Remove test dependency on mockito-core (Richard Zowalla) #367
- Drop provided dependency on servlet-api (Richard Zowalla) #368
Release 1.2 (2021-10-06)
- [Sitemaps] Avoid calling java.net.URL::equals in equals method of sitemaps and extensions (sebastian-nagel) #322
- [URLs] Provide a builder class to configure the URL normalizer (aecio) #321, #324
- [URLs] Make normalization of IDNs configurable (to ASCII or Unicode) via builder (aecio, sebastian-nagel) #324
- [Sitemaps] Fix XXE vulnerability in Sitemap parser (kovyrin) #323
- [URLs] Sorting the Query Parameters (aecio) #246, #309
- [URLs] Allow optional removal of common irrelevant query parameters (aecio) #309
- [Sitemaps] Allow normalizing URLs in sitemaps (murderinc, sebastian-nagel) #305
- Normalize CHANGES.txt (Avi Hayun) #270
- Readme.MD Overhaul of TOC, Installation, License (Avi Hayun) #311
- [URLs] Normalize URL without a scheme (Avi Hayun, sebastian-nagel) #271
- [Domains] EffectiveTldFinder: upgrade public suffix list / Download latest effective_tld_names.dat during Maven build (Richard Zowalla) #295, #302
- [URLs] Decode percent-encoded host names (sebastian-nagel) #303
- [Sitemaps] Document options *strict* and *allowPartial* in SiteMapParser constructors (sebastian-nagel) #267
- [Robots.txt] Maximum values (crawl-delay and warnings): document and make visible (sebastian-nagel, Avi Hayun) #276
- [Sitemaps] Replace priority "NaN" by default value (sebastian-nagel) #296
- [Sitemaps] Adding duration to the map generated by VideoAttributes.asMap (evanhalley) #300
Release 1.1 (2020-06-29)
- [Sitemaps] Sitemaps to implement Serializable (cdalexndr, sebastian-nagel) #244
- [Sitemaps] Allow deduplication of sitemap links in sitemap indexes (sebastian-nagel) #262
- [Robots.txt] Upgrade the toString() method of the Base/Simple RobotRules (Avi Hayun) #264
- Upgrade GitIgnore (Avi Hayun) #260
- [Robots.txt] Deduplicate sitemap links (sebastian-nagel) #261
- [Domains] EffectiveTldFinder to log loading of public suffix list (sebastian-nagel) #284
- [Sitemaps] SiteMapParser getPublicationDate in VideoAttributes may throw NPE (panthony, sebastian-nagel) #283
- [Robots.txt] SimpleRobotRulesParser: Trim log messages (jnioche, sebastian-nagel) #281
- [Robots.txt] SimpleRobotRulesParser: counter _numWarnings not thread-safe (sebastian-nagel, kkrugler) #278
- ParameterizedTest not executed by mvn builds (sebastian-nagel) #273
- [URLs] Empty path before query to be normalized to `/` (Avi Hayun, sebastian-nagel) #247
- [Domains] EffectiveTldFinder to validate returned domain names for length restrictions (sebastian-nagel, Avi Hayun) #251
- Upgrade unit tests to use JUnit v5.x and parameterized tests (Avi Hayun) #249, #253, #255
- [Robots.txt] Robots parser to always handle absolute sitemap URL even without valid base URL (pr3mar, kkrugler, sebastian-nagel) #240
- [Sitemaps] Adding asMap to ExtensionMetadata Interface (evanhalley) #288
- [Sitemaps] NewsAttribute.equals() compares the instance variable PublicationDate with itself (evanhalley) #289
Release 1.0 (2019-03-19)
- [Sitemaps] Unit tests depend on system timezone (kkrugler, sebastian-nagel) #238
- [Domains] EffectiveTldFinder: upgrade public suffix list (sebastian-nagel) #219
- [Sitemaps] Detection and parsing of XML sitemaps fails with whitespace before XML declaration (sebastian-nagel, jnioche) #144
- [Sitemaps] XMLHandler needs to append text in characters() vs. immediately processing (kkrugler, sebastian-nagel) #226
- [Sitemaps] XMLIndexHandler needs to accumulate the lastmod date string before parsing (kkrugler, sebastian-nagel) #225
- [Domains] EffectiveTldFinder throws IllegalArgumentException on IDN domain names containing prohibited characters (sebastian-nagel) #231
- [Sitemaps] Trim Unicode whitespace around URLs (sebastian-nagel, kkrugler) #224
- [Sitemaps] Sitemap index: stop URL at closing </loc> (sebastian-nagel, kkrugler) #213
- [Sitemaps] Allow empty price in video sitemaps (sebastian-nagel) #221
- [Sitemaps] With a different locale, the price tag can be formatted with ',' instead of '.', leading to an NPE (goldenlink) #220
- [Sitemaps] Add support for sitemap extensions (tuxnco, sebastian-nagel) #35, #36, #149, #162
- [Sitemaps] Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (sebastian-nagel) #217
- [Robots.txt] Fix for handling URLs with query parameters but no path (kkrugler) #215
Release 0.10 (2018-06-05)
- Add JAX-B dependencies to POM (jnioche) #207
- [Sitemaps] Add method to parse and iterate sitemap SiteMapParser#walkSiteMap(URL,Consumer) (Luc Boruta) #190
- [Sitemaps] Sitemap file location to ignore query part of URL (sebastian-nagel) #202
- [Sitemaps] Link extraction from RSS feeds fails on XML entities (sebastian-nagel) #204
- [Sitemaps] Resolve relative links in RSS feeds (sebastian-nagel) #203
- [Sitemaps] Extract links from <guid> elements (sebastian-nagel) #201
- [Sitemaps] Limit on "bad url" log messages (sebastian-nagel) #145
- [Domains] EffectiveTldFinder to parse Internationalized Domain Names (sebastian-nagel) #179
- [Domains] Add main() to EffectiveTldFinder (sebastian-nagel) #187
- [Domains] Handle new suffixes in PaidLevelDomain (kkrugler) #183
- Remove Tika dependency (kkrugler) #199
- [Sitemaps] Improve MIME detection for sitemaps (sebastian-nagel) #200
- [Robots.txt] Make RobotRules accessible (aecio via kkrugler) #134
- [Robots.txt] SimpleRobotRulesParser: Expose MAX_WARNINGS and MAX_CRAWL_DELAY (aecio via kkrugler) #194
- [Robots.txt] Added main to SimpleRobotRulesParser for testing (sebastian-nagel) #193
- [Sitemaps] Allow for legacy URIs when checking sitemap namespaces (sebastian-nagel) #211
Release 0.9 (2017-10-27)
- [Sitemaps] Removed DOM-based sitemap parser (jnioche) #177
- [Domains] Incorrect domains returned by EffectiveTldFinder (sebastian-nagel) #172
- [Sitemaps] Add namespace aware DOM/SAX parsing for XML Sitemaps (Marko Milicevic, jnioche, sebastian-nagel) #176
- Upgraded Tika 1.16 (jnioche) #175
- [Sitemaps] Sitemap SAX parsing mangles target URLs (jnioche, sebastian-nagel) #169
- [Sitemaps] RSS parser ignores pubDate of link (MichealKum via kkrugler) #166
Release 0.8 (2017-06-09)
- Upgraded Tika 1.15 (jnioche) #163
- [Sitemaps] Disable XML resolvers (sebastian-nagel) #151
- Update forbiddenapis to v2.3 (jnioche) #99
- [Sitemaps] gzipped text files fail to parse (sebastian-nagel) #143
- [Sitemaps] Optionally use SAX parser (matt-deboer, jnioche, sebastian-nagel) #116
- [Sitemaps] Properly log XML parsing errors (sebastian-nagel) #146
- Use StandardCharsets where applicable (sebastian-nagel) #141
- [Sitemaps] Increase sitemap size limit to 50MB (Avi Hayun) #132
- Remove dependencies to system-specific locale (sebastian-nagel) #137
- [URLs] BasicURLNormalizer: NPE for URLs without authority (sebastian-nagel) #136
- [URLs] BasicURLNormalizer to strip empty port (sebastian-nagel) #133
- Remove deprecated HTTP fetcher (kkrugler) #96
Release 0.7 (2016-11-24)
- Upgrade to JDK 1.8 (lewismc) #126
- [Sitemaps] SitemapParser methods now protected (michaellavelle) #124
- [Sitemaps] Faster parsing of dates (jnioche) #117
- Upgraded Tika 1.13 (jnioche) #113
- Fix license headers (jnioche) #108
- Rename package crawlercommons.url (jnioche) #107
- [Sitemaps] Sitemap url is not extracted if user agent matches earlier in file (srwilson, kkrugler) #112
- Deprecate HTTP fetcher support (kkrugler) #92
- [URLs] Added URLFilter interface + BasicURLNormalizer (jnioche) #106
- [Domains] Updated tld names from publicsuffix.org (jnioche) #100
- Upgraded http-client to version 4.5.1 (aecio via kkrugler) #84
- Upgraded Tika 1.10 (jnioche) #89
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #82
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #60
- Simplify pom file (jnioche, lewismc) #77
- Upgrade javac.src.version and javac.target.version to 1.7 or 1.8 (lewismc) #93
- [Sitemaps] Not able to detect RSS feeds (yogendrasoni via kkrugler) #87
- [Robots.txt] Added javadoc comments to the SimpleRobotRulesParser class (MichaelRoeder, kkrugler) #95
Release 0.6 (2015-05-27)
- Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler)
- Issue 76: maven-java-formatter-plugin (jnioche)
- Issue 73: Switch groupID in pom from com.google.code.crawler-commons to crawler-commons (jnioche)
- Issue 71: Upgrade to Tika 1.8 (jnioche)
- Issue 68: [Robots.txt] Path matching should be case-sensitive (kkrugler)
- Issue 67: [Sitemaps] Parsing of lastMod date should use time portion (kkrugler)
- Issue 59: [Robots.txt] Let SimpleRobotRules and its members implement the Serializable interface (kkrugler)
- Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag (Avi Hayun)
- Issue 64: Upgraded to Tika 1.7 (jnioche)
- Issue 32: [Robots.txt] Resolve relative URL for sitemaps (jnioche)
- Issue 62: [Sitemaps] Add new parseSiteMap method (jnioche)
- Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them (Avi Hayun)
- Issue 51: Upgrade httpclient to the latest version (Avi Hayun)
- Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily (Avi Hayun)
- Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowercased, which shouldn't happen (Avi Hayun)
- Issue 50: Add Fetch Report to FetchedResult (lewismc, avraham2)
- Issue 55: [Sitemaps] SitemapUrl "setPriority(String str)" should check for proper value (Avi Hayun)
Release 0.5 (2014-10-15)
- Issue 53: Spaces in a comma separated list of names in a User-agent: line cause rules to be applicable to all agents (kkrugler)
- Issue 45: [Sitemaps] Upgrade code after release of Tika v1.6 (Avi Hayun)
- Issue 48: Upgraded to Tika 1.6 (jnioche)
- Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases (Avi Hayun)
- Issue 40: [Sitemaps] Add Tika MediaType Support (Avi Hayun)
- Issue 39: [Sitemaps] Add a convenience method to the parser that takes only a URL argument (Avi Hayun via lewismc)
- Issue 42: [Sitemaps] Add more JUnit tests (Avi Hayun via lewismc)
- Issue 37: Upgrade the Slf4j logging Library to v1.7.7 (Avi Hayun via kkrugler)
- Issue 41: Upgrade to JUnit v4 conventions in SiteMapParser (Avi Hayun via lewismc)
- Issue 34: Upgrade the Slf4j logging in SiteMaps (Avi Hayun via lewismc)
Release 0.4 (2014-04-11)
- Issue 13: Fix deprecation in Crawler Commons Code (lewismc via kkrugler)
- Issue 8: Upgrade of httpclient to v4.2.6 (Fuad Efendi, lewismc via kkrugler)
- Issue 18: Support matching against query parameters in robots.txt rules (alparslanavci, kkrugler)
- Issue 21: Follow Google example of giving Allow directives higher match weight than Disallow directives (y.vladimirov via kkrugler)
- Issue 22: Use longest-match-wins approach to matching URLs in robots.txt (kkrugler)
- Issue 17: Support Googlebot-compatible regular expressions in URL specifications (alparslanavci, kkrugler)
- Issue 31: Missing top level domains (jnioche, kkrugler)
- Issue 23: Trivial improvements to UserAgent (lewismc)
- Issue 30: SitemapIndex should allow to skip sitemaps (Sebastian Nagel, kkrugler)
- cleanup of ANT build remnants [lib and lib-ext] (jnioche)
Release 0.3 (2013-10-11)
- Upgraded to Tika 1.4 (jnioche)
- [SiteMap] added utility class for testing sitemaps (jnioche)
- Issue 16: remove ant scripts and configuration (lewismc)
- Issue 27: [SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString() (jnioche)
- Issue 26: [SiteMap] Set correct default priority for URL in a sitemap file (jnioche)
- Issue 25: [Robots.txt] Robots parser should not lowercase sitemap URLs (jnioche)
- Issue 29: [SiteMap] try urls when <loc> element is missing (jnioche)
Release 0.2 (2013-02-02)
- Move to pure Maven for CC build lifecycle (lewismc)
- Move Javadoc out of core code (lewismc)
- Substantiate Javadoc (lewismc)
- Review default.properties (lewismc)
- add HTTP status code & reason to FetchedResult (Fuad Efendi via kkrugler)
- support for multiple user agent names (Tejas Patil via kkrugler)
- added javadoc generation, publish in /doc/javadoc (kkrugler)
- switch to using eclipse-formatter.properties (kkrugler)
- support robots.txt files that have UTF-16LE and UTF-16BE BOMs (kkrugler)
- support for user agent names that contain spaces (kkrugler)
- fixed handling of BOM in sitemaps (Vivek Magotra via kkrugler)
- refactoring of SiteMap objects (Hannes Schwarz via jnioche)
- added simple support for the file: protocol (kkrugler)
- cleaned up packaging and added "install" target (kkrugler)
Release 0.1
- parsing robots.txt
- parsing sitemaps
- URL analyzer which returns Top Level Domains
- a simple HttpFetcher