mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-22 19:06:03 +02:00
Commit Graph

471 Commits

Author SHA1 Message Date
Sebastian Nagel 67db8bf1be [Sitemaps] Trim Unicode whitespace around URLs, fixes #224 2019-02-20 16:27:16 +01:00
Sebastian Nagel 680df29f0f EffectiveTldFinder throws IllegalArgumentException on IDN domain names
containing prohibited characters, fixes #231
- catch IllegalArgumentException when converting IDNs to ASCII
- validate domain names returned by getAssignedDomain(...)
- add unit tests, complete Javadoc
2019-02-20 14:59:58 +01:00
Sebastian Nagel 6d3bbd2512
Merge pull request #227 from sebastian-nagel/cc-213-auto-close-sitemap-in-sitemapindex
Sitemap index: stop URL at closing </loc>, fixes #213
2019-02-04 22:24:18 +01:00
Sebastian Nagel 78e935f83b Update changelog for #213 2019-02-03 13:49:11 +01:00
Sebastian Nagel 2bdc1923b1 Merge branch 'master' into cc-213-auto-close-sitemap-in-sitemapindex 2019-02-03 13:47:31 +01:00
Sebastian Nagel 5706ee1e97 Sitemap index: stop URL at closing </loc>
- bug fix: <sitemap> element is closed, not <url>
- add unit test to cover resetting of attributes
2019-02-03 13:44:14 +01:00
Sebastian Nagel ab9e33a5f9 Update changelog for #220 and #221 2019-01-18 17:35:33 +01:00
Sebastian Nagel 102189f4c0
Merge pull request #223 from sebastian-nagel/cc-video-sitemap-extension-220-221
Allow empty price in video sitemaps, fixes #221, closes #220
2019-01-18 17:31:41 +01:00
Sebastian Nagel e0ef1ee29d Remove final from method parameters 2019-01-18 17:27:23 +01:00
Sebastian Nagel 13fab2e033 Sitemap index: stop URL at closing </loc>, fixes #213
- at start of a <loc> element auto-close any unclosed
  <sitemap> element and add the sitemap if there is a
  valid URL from the previous <loc> element
2018-12-13 15:01:23 +01:00
Sebastian Nagel 439c350ab3 Allow empty price in video sitemaps, fixes #221 2018-12-07 16:44:47 +01:00
Sebastian Nagel f361597447 Inline unit test for locale-specific number formats in video:price tags (#220)
- simplify unit tests to contain only tested or required video attributes
- fix order of attributes/elements to follow XSD schema spec
2018-12-07 16:44:16 +01:00
Sebastian Nagel 4de43a7534 Merge branch 'goldenlink-master' to fix #220 2018-12-07 16:43:17 +01:00
Aurelien Violette 3c12c715c8 Fix object comparison on testing. 2018-12-03 12:55:30 +01:00
Aurelien Violette e802cfa8eb With a different locale, the price tag can be formatted with ',' instead of '.', leading to an NPE.
The NPE occurs because parseFloat returns a Float object that is null after a NumberFormatException, but VideoPrice accepts only a primitive float.
To work around this and avoid recurring errors, I've changed the VideoPrice price field to a Float object, which accepts null in that case.

This is far from ideal, and parseFloat would benefit from being able to parse different locale formats. Still, as a first quick fix, this allows the rest of the file to be parsed,
whereas the previous error caused parsing of the whole file to fail.
2018-12-03 12:44:15 +01:00
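The failure mode described in the commit above can be reproduced with a minimal sketch. The class `PriceParsingSketch` and the helper `parseFloatOrNull` are hypothetical names chosen for illustration; they are not the project's code. The sketch only assumes what the message states: a helper that returns a null `Float` on `NumberFormatException`, and a consumer that expects a primitive `float`.

```java
class PriceParsingSketch {

    /**
     * Hypothetical helper (not the project's method): returns null instead
     * of throwing when the text is not a valid float literal.
     */
    public static Float parseFloatOrNull(String text) {
        try {
            return Float.valueOf(text);
        } catch (NumberFormatException e) {
            // "1,99" (comma as decimal separator) ends up here
            return null;
        }
    }

    /** Returns true if unboxing the given value throws a NullPointerException. */
    public static boolean unboxingFails(Float boxed) {
        try {
            float primitive = boxed; // auto-unboxing: throws if boxed is null
            return primitive != primitive; // false for ordinary (non-NaN) values
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Float dotPrice = parseFloatOrNull("1.99");
        Float commaPrice = parseFloatOrNull("1,99");
        System.out.println("dot price parsed: " + (dotPrice != null));
        System.out.println("comma price parsed: " + (commaPrice != null));
        System.out.println("unboxing comma price fails: " + unboxingFails(commaPrice));
    }
}
```

Keeping the value as a nullable `Float` moves the decision about a missing price to the caller, so that a single bad value need not abort the rest of the file.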
Sebastian Nagel 0519d14023
Merge pull request #218 from sebastian-nagel/support-sitemap-extensions
Support sitemap extensions
2018-10-12 21:53:35 +02:00
Sebastian Nagel 862af9416f Sitemap extension support
- add extension support to SiteMapTester
- list extension attributes in SiteMapURL.toString()
- update change log
2018-09-28 12:14:02 +02:00
Sebastian Nagel 66745b29a1 Add support for the Mobile sitemap extension 2018-09-28 12:05:07 +02:00
Sebastian Nagel b924bd0828 Sitemap extension support
- optionally parse elements in the namespace of sitemap extensions:
  - Google video sitemaps (resolves #35)
  - Google image sitemaps (resolves #36)
  - Google news sitemaps
  - alternate links in sitemaps (resolves #149)
- the code is taken from Tanguy Moal's (@tuxnco) PR #162
  with the following modifications:
  - port from DOM to SAX parser
  - keep specific extensions separate from the "core" sitemap classes
2018-09-28 12:04:39 +02:00
Sebastian Nagel 9318de951f Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (#217)
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
  but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests

* Fix Javadoc error and warnings, update change log

* Remove obsolete dependency to jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
  by updating to Java 8 date and time API
2018-09-24 10:09:58 +01:00
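As a rough illustration of the points in the commit above, the following sketch parses sitemap-style timestamps with a shared `DateTimeFormatter`. The class `SitemapDateSketch` and the method `parseOrNull` are hypothetical names, not the project's API. The sketch relies only on documented `java.time` behavior: `DateTimeFormatter` instances are immutable and thread-safe, and `ISO_OFFSET_DATE_TIME` treats the seconds field as optional.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class SitemapDateSketch {

    // Immutable and thread-safe, so a single shared constant can replace
    // a ThreadLocal<DateFormat>.
    private static final DateTimeFormatter OFFSET_FORMAT =
            DateTimeFormatter.ISO_OFFSET_DATE_TIME;

    /** Hypothetical helper: returns null when the text cannot be parsed. */
    public static OffsetDateTime parseOrNull(String text) {
        try {
            return OffsetDateTime.parse(text, OFFSET_FORMAT);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // With seconds
        System.out.println(parseOrNull("2018-09-24T10:09:58+01:00"));
        // Time zone but no seconds: accepted without an extra regex
        System.out.println(parseOrNull("2018-09-24T10:09+01:00"));
        // Not a date at all
        System.out.println(parseOrNull("not a date"));
    }
}
```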
Ken Krugler bef1b8437e
Merge pull request #216 from crawler-commons/kkrugler_215_slash-disallow-all
Fix for URL with only a query string (no path)
2018-08-01 10:17:19 -07:00
Ken Krugler 54d90deaa4 Fix up formatting 2018-08-01 10:15:42 -07:00
Ken Krugler a5c5091d64 Update CHANGES.txt 2018-07-31 17:14:27 -07:00
Ken Krugler d5f614b67a Fix for URL with only a query string (no path) 2018-07-31 17:10:12 -07:00
Julien Nioche 6213784e8b Updated README for 0.10 release 2018-06-07 09:20:43 +01:00
Julien Nioche 1c23d5966e [maven-release-plugin] prepare for next development iteration 2018-06-05 11:36:19 +01:00
Julien Nioche 939fbf439e [maven-release-plugin] prepare release crawler-commons-0.10 2018-06-05 11:36:12 +01:00
Julien Nioche 0da1b8b8b5 Minor changes + applied formatting pre 0.10 release 2018-06-05 11:33:27 +01:00
Julien Nioche 8195140e21
Update CHANGES.txt
added #211
2018-06-05 11:23:00 +01:00
Sebastian Nagel d98a3f14cf Allow for legacy URIs when checking sitemap namespaces (#211)
* Allow for legacy URIs when checking sitemap namespaces
- e.g., allow legacy namespace URI but ignore URLs
  from image and video sitemap extensions
- resolve relative namespace URIs
- add namespace URIs of sitemap extensions (news, images, videos)

* Address kkrugler's review comments:
- document addition of sitemap namespace required by sitemap
  protocol specification when calling setStrictNamespace(true)
- remove early return on <rss> root element
2018-06-05 11:20:26 +01:00
Julien Nioche a8b474551a
Update CHANGES.txt
Add main to SimpleRobotRulesParser for testing (#193)
2018-06-04 21:28:09 +01:00
Sebastian Nagel 0c75e75496 Add main to SimpleRobotRulesParser for testing (#193)
* Add main to SimpleRobotRulesParser for testing
- implement toString() for robot rules
- fix line breaks in comments

* Do not detect MIME type as Tika dependency has been removed
2018-06-04 21:26:45 +01:00
Ken Krugler 993c4fdb9e
Merge pull request #208 from aecio/issue-134
Make RobotRules accessible #134
2018-05-14 11:21:42 -07:00
Ken Krugler d99c034dd0
Merge branch 'master' into issue-134 2018-05-14 11:20:17 -07:00
Ken Krugler 67adc27a8c
Merge pull request #209 from aecio/issue-194
SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
2018-05-14 11:16:17 -07:00
Aecio Santos 47c2cad8b8 Add getters/setters and update CHANGES.txt 2018-05-14 12:00:02 -04:00
Aecio Santos aaa3113e55 Update CHANGES.txt 2018-05-14 11:51:43 -04:00
Aecio Santos fd1e7fcffe SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos 7bef14d386 Make RobotRules accessible #134
- Makes SimpleRobotRulesParser._rules property protected
  and adds getters for SimpleRobotRulesParser._rules and
  RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
  to SimpleRobotRules to allow access to concrete class without
  nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Julien Nioche e25309d26c
Add JAX-B dependencies to POM (#207)
* Add JAX-B dependencies to POM, fixes #196

* mentioned in CHANGES.txt
2018-05-03 11:04:03 +01:00
Sebastian Nagel 8ca46ffe2c Merge pull request #190 from thunken:features/for-each
Add SiteMapParser#walkSiteMap(URL,Consumer)
2018-04-25 14:08:27 +02:00
Sebastian Nagel 7d3eccfa63 Add changelog entry and fix unit test 2018-04-25 14:06:33 +02:00
Luc Boruta 355684ff20 Add methods to parse and iterate sitemap
SiteMapParser#walkSiteMap(URL,Consumer)
2018-04-25 14:04:58 +02:00
Sebastian Nagel fa76a594b6
Merge pull request #205 from sebastian-nagel/cc-201-202-203-204
Improve sitemap parsing
2018-04-25 09:51:49 +02:00
Sebastian Nagel 0ef7cf87fa Improve sitemap parsing
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel a9277acde2
Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel a6b3178fc7 Simplify MIME detection:
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel 6714ea57d6 Fix error message format string 2018-04-18 14:53:21 +02:00
Sebastian Nagel 72aa177738 RDF-based RSS feeds: map MIME type, detect from content 2018-04-18 14:44:10 +02:00
Sebastian Nagel 4780678af0 Improve logging of content type detection for gzip-compressed sitemaps 2018-04-18 14:43:52 +02:00