mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-12 00:26:03 +02:00
Commit Graph

401 Commits

Author SHA1 Message Date
Avi Hayun e55bf1f697 Updated the gitignore file to include all files which should be ignor… (#265)
* Updated the gitignore file to include all files which should be ignored due to the following frameworks: Java, Eclipse, Intellij Idea, Maven.

File generated initially using: https://www.gitignore.io/
This submit attempts to solve: #260
2019-09-05 08:34:23 +01:00
Julien Nioche 559d45cbba
use open jdk in travis 2019-09-04 09:29:58 +01:00
Avi Hayun 1c74fc6ca0 The robots.txt test now supports the junit5 parameterizedTest, gaining us the ability to have each robot rule be tested separately, so if one rule fails, the others can still show success in their own unit tests. (#256)
IMHO it also makes the tests more readable but that is in the eyes of the beholder

This file is smaller by 40 lines and has 172 unit tests instead of 46
2019-07-23 13:13:45 +01:00
Avi Hayun 2cfddca060
Merge pull request #254 from Chaiavi/NormalizerUsingParameterizedTest
Updated the tests to use parameterizedTest, separating each test
2019-07-22 13:59:07 +03:00
Avi Hayun 3597dee5a7 Updated the tests to use parameterizedTest, making the test file more readable; each normalized pair is now a test in itself, which makes it much easier to identify the exact failure points 2019-07-16 21:17:25 +03:00
Avi Hayun 5bda3636d2
Upgraded to Junit v5.5 (#250)
* Upgraded to Junit v5.5
Updated the annotations and assertions accordingly

* Removed unneeded before and after

* This is a technical debt

Fixing a styling issue I caused about 4 years ago

Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82

* Fixed according to @sebastian-nagel code review
2019-07-15 21:29:03 +03:00
Avi Hayun 2aa107e441
fixed a typo 2019-07-11 19:34:24 +03:00
Sebastian Nagel 3355b3ca1d
Merge pull request #241 from sebastian-nagel/cc-240-robotstxt-sitemap-url
Robots parser to always handle absolute sitemap URL even without valid base URL
2019-04-04 17:26:12 +02:00
Sebastian Nagel b449fdf024 Robots parser to always handle absolute sitemap URL even without valid base URL,
fixes #240
2019-03-26 15:05:12 +01:00
Sebastian Nagel ec2ad7c790 Updated README for 1.0 release 2019-03-25 10:30:48 +01:00
Julien Nioche 651ce08afe CHANGES ready for next cycle of improvements 2019-03-21 21:02:10 +00:00
Julien Nioche 7c6a3f2ccd [maven-release-plugin] prepare for next development iteration 2019-03-19 22:07:14 +00:00
Julien Nioche 5504239c6d [maven-release-plugin] prepare release crawler-commons-1.0 2019-03-19 22:06:56 +00:00
Julien Nioche 0dc699f3e9 Updated CHANGES pre-1.0 release 2019-03-19 22:04:31 +00:00
Julien Nioche 6d7f3a3fbf Revert "[maven-release-plugin] prepare release crawler-commons-1.0"
This reverts commit 2be967020e.
2019-03-19 22:02:19 +00:00
Sebastian Nagel e8b598b2e8 [Sitemaps] Unit tests depend on system timezone, fixes #238 (#239)
- fix unit test to format data in time zone UTC
- improve documentation of `convertToZonedDateTime`:
  add note that UTC is assumed if no time zone is contained in
  date string
2019-03-19 15:00:04 +00:00
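The UTC default described in this commit can be illustrated with a minimal sketch. The class and method names below are hypothetical and this is not the project's `convertToZonedDateTime`; it only assumes that a date string without a time zone should be interpreted as UTC.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only.
class UtcDefaultDates {

    static ZonedDateTime parseAssumingUtc(String text) {
        try {
            // The string carries its own offset, e.g. 2019-03-19T15:00:04+01:00
            return OffsetDateTime.parse(text, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toZonedDateTime();
        } catch (DateTimeParseException e) {
            // no offset present, try the zone-less forms below
        }
        try {
            // Date and time without a zone: assume UTC
            return LocalDateTime.parse(text, DateTimeFormatter.ISO_LOCAL_DATE_TIME).atZone(ZoneOffset.UTC);
        } catch (DateTimeParseException e) {
            // not a date-time, try a plain date
        }
        // Plain date: midnight UTC (throws DateTimeParseException if this fails too)
        return LocalDate.parse(text, DateTimeFormatter.ISO_LOCAL_DATE).atStartOfDay(ZoneOffset.UTC);
    }
}
```

Because the result never depends on `ZoneId.systemDefault()`, a test that compares the parsed value against an instant formatted in UTC should behave the same on any machine.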
Julien Nioche 9a5236c484 [maven-release-plugin] prepare for next development iteration 2019-03-16 21:31:53 +00:00
Julien Nioche 2be967020e [maven-release-plugin] prepare release crawler-commons-1.0 2019-03-16 21:31:34 +00:00
Sebastian Nagel 40531efe25 EffectiveTldFinder: upgrade public suffix list, implements #219 (#235)
- upgrade the public suffix list to the latest version retrieved right now from
     https://publicsuffix.org/list/public_suffix_list.dat
  resp.
     802c469416/public_suffix_list.dat
2019-03-14 11:22:50 +00:00
Sebastian Nagel cd57a5b86a
Merge pull request #234 from sebastian-nagel/cc-144-xml-skip-leading-whitespace
Detection and parsing of XML sitemaps fails with whitespace before XML declaration #144
2019-03-14 10:34:06 +01:00
Sebastian Nagel 0349fbe1f0 Update changelog for #144/#234 2019-03-14 10:29:17 +01:00
Sebastian Nagel a69bf09cd4 Detection and parsing of XML sitemaps fails with whitespace before XML declaration,
fixes #144
- implement InputStream skipping over white space at beginning of file
- use for XML sitemaps in combination with BOMInputStream,
  so that white space or empty lines before <?xml ...> do not
  cause the parser to fail
2019-03-07 16:17:09 +01:00
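The idea of skipping white space before the XML declaration might look roughly like the sketch below. It is an assumption-laden illustration built on the JDK's `PushbackInputStream`, not the stream class added in this commit; the class and method names are hypothetical, and BOM handling (done with `BOMInputStream` in the commit) is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration only.
class LeadingWhitespace {

    // Consumes leading XML white space (space, tab, CR, LF) and returns a
    // stream positioned at the first other byte. Assumes an ASCII-compatible
    // encoding such as UTF-8.
    static InputStream skipLeadingWhitespace(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 1);
        int b;
        while ((b = pushback.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                pushback.unread(b); // put the first non-whitespace byte back
                break;
            }
        }
        return pushback;
    }
}
```

A parser that is handed the returned stream would then see `<?xml ...>` as the first bytes even if the file started with blank lines.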
Sebastian Nagel 3f3121568f
Merge pull request #233 from sebastian-nagel/cc-225-226-xml-handlers-buffer-characters
Buffer characters when characters(...) is called, fixes #225 and #226
2019-02-28 16:40:00 +01:00
Sebastian Nagel 2539b07948 Buffer characters when characters(...) is called (instead of immediately processing
the current character chunk)
- use `localName` instead of calling `currentElement()` where applicable
- remove unnecessary null checks of character buffer
2019-02-25 14:32:26 +01:00
Sebastian Nagel eb74336bd3 Update changelog for #225 and #226 2019-02-21 23:01:32 +01:00
Sebastian Nagel 82859c2b1a Buffer characters when characters(...) is called (instead of immediately processing
the current character chunk)
- fix errors when character chunks are interrupted by CDATA sections or character entities
  - fixes #225 XMLIndexHandler needs to accumulate the lastmod date string before parsing
  - fixes #226 XMLHandler needs to append text in characters() vs. immediately processing
  - provide character buffer in DelegatorHandler, so that derived classes
    can append characters to it and finally get the buffered content
- code cleanup in all handler classes:
  - add @Override annotations
  - remove stub method implementations
2019-02-21 23:01:32 +01:00
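Why buffering matters: a SAX parser may deliver the text of one element in several `characters(...)` calls, for example around character entities or CDATA sections. The sketch below shows the general pattern of appending in `characters()` and processing in `endElement()`. The class `BufferingLocHandler` is hypothetical and much simpler than the project's handler classes; it assumes a non-namespace-aware parser, so the element name arrives in `qName`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only.
class BufferingLocHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> locs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only append here: this chunk may be just a fragment of the element text
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("loc".equals(qName)) {
            // The complete text is available only once the element is closed
            locs.add(buffer.toString().trim());
        }
        buffer.setLength(0);
    }

    static List<String> parse(String xml) throws Exception {
        BufferingLocHandler handler = new BufferingLocHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.locs;
    }
}
```

A handler that processed each chunk immediately could end up with only the last fragment of a URL such as `http://example.com/?a=1&amp;b=2`, which is the class of error reported in #225 and #226.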
Sebastian Nagel b5a9517bc4
Merge pull request #232 from sebastian-nagel/cc-231-etld-invalid-idns
EffectiveTldFinder throws IllegalArgumentException on IDN domain names containing prohibited characters, fixes #231
2019-02-21 22:27:59 +01:00
Sebastian Nagel 916415d262
Merge branch 'master' into cc-231-etld-invalid-idns 2019-02-21 22:16:48 +01:00
Sebastian Nagel 40b1c44d68 Update changelog for #231 2019-02-21 22:15:40 +01:00
Sebastian Nagel 4d6b27c36e
Merge pull request #228 from sebastian-nagel/cc-224-sitemaps-trim-unicode-whitespace
[Sitemaps] Trim Unicode whitespace around URLs, fixes #224
2019-02-20 22:24:11 +01:00
Sebastian Nagel 67db8bf1be [Sitemaps] Trim Unicode whitespace around URLs, fixes #224 2019-02-20 16:27:16 +01:00
Sebastian Nagel 680df29f0f EffectiveTldFinder throws IllegalArgumentException on IDN domain names
containing prohibited characters, fixes #231
- catch IllegalArgumentException when converting IDNs to ASCII
- validate domain names returned by getAssignedDomain(...)
- add unit tests, complete Javadoc
2019-02-20 14:59:58 +01:00
Sebastian Nagel 6d3bbd2512
Merge pull request #227 from sebastian-nagel/cc-213-auto-close-sitemap-in-sitemapindex
Sitemap index: stop URL at closing </loc>, fixes #213
2019-02-04 22:24:18 +01:00
Sebastian Nagel 78e935f83b Update changelog for #213 2019-02-03 13:49:11 +01:00
Sebastian Nagel 2bdc1923b1 Merge branch 'master' into cc-213-auto-close-sitemap-in-sitemapindex 2019-02-03 13:47:31 +01:00
Sebastian Nagel 5706ee1e97 Sitemap index: stop URL at closing </loc>
- bug fix: <sitemap> element is closed, not <url>
- add unit test to cover resetting of attributes
2019-02-03 13:44:14 +01:00
Sebastian Nagel ab9e33a5f9 Update changelog for #220 and #221 2019-01-18 17:35:33 +01:00
Sebastian Nagel 102189f4c0
Merge pull request #223 from sebastian-nagel/cc-video-sitemap-extension-220-221
Allow empty price in video sitemaps, fixes #221, closes #220
2019-01-18 17:31:41 +01:00
Sebastian Nagel e0ef1ee29d Remove final from method parameters 2019-01-18 17:27:23 +01:00
Sebastian Nagel 13fab2e033 Sitemap index: stop URL at closing </loc>, fixes #213
- at start of a <loc> element auto-close any unclosed
  <sitemap> element and add the sitemap if there is a
  valid URL from the previous <loc> element
2018-12-13 15:01:23 +01:00
Sebastian Nagel 439c350ab3 Allow empty price in video sitemaps, fixes #221 2018-12-07 16:44:47 +01:00
Sebastian Nagel f361597447 Inline unit test for locale-specific number formats in video:price tags (#220)
- simplify unit tests to contain only tested or required video attributes
- fix order of attributes/elements to follow XSD schema spec
2018-12-07 16:44:16 +01:00
Sebastian Nagel 4de43a7534 Merge branch 'goldenlink-master' to fix #220 2018-12-07 16:43:17 +01:00
Aurelien Violette 3c12c715c8 Fix object comparison on testing. 2018-12-03 12:55:30 +01:00
Aurelien Violette e802cfa8eb When a different locale is used, the price tag can be formatted with ',' instead of '.', leading to an NPE.
The NPE occurs because parseFloat returns a Float object that is null after a NumberFormatException, while VideoPrice accepts only a primitive float.
To work around this and avoid recurring errors, I've changed the VideoPrice price field to a Float object, which accepts null in that case.

This is far from ideal, and parseFloat would benefit from being able to parse locale-specific number formats. As a first quick fix, though, this allows the rest of the file to be parsed,
whereas the previous error caused parsing of the whole file to fail.
2018-12-03 12:44:15 +01:00
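The failure mode and the quick fix can be sketched as follows. The names are hypothetical and this is not the project's `parseFloat` or `VideoPrice` code; it assumes only what the message states, namely that an unparsable price becomes `null` and that the receiving field must therefore be a boxed `Float`.

```java
// Hypothetical helper for illustration only.
class VideoPriceSketch {

    // Returns null instead of throwing when the value is not a plain
    // decimal number, e.g. "1,99" written with a comma as decimal separator.
    static Float parseFloatOrNull(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Float.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // A boxed Float field can hold the null result. Assigning it to a
    // primitive float instead would auto-unbox and throw a NullPointerException.
    Float price;
}
```

With a boxed field, `parseFloatOrNull("1,99")` simply leaves the price unset and parsing of the remaining entries can continue.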
Sebastian Nagel 0519d14023
Merge pull request #218 from sebastian-nagel/support-sitemap-extensions
Support sitemap extensions
2018-10-12 21:53:35 +02:00
Sebastian Nagel 862af9416f Sitemap extension support
- add extension support to SiteMapTester
- list extension attributes in SiteMapURL.toString()
- update change log
2018-09-28 12:14:02 +02:00
Sebastian Nagel 66745b29a1 Add support for the Mobile sitemap extension 2018-09-28 12:05:07 +02:00
Sebastian Nagel b924bd0828 Sitemap extension support
- optionally parse elements in the namespace of sitemap extensions:
  - Google video sitemaps (resolves #35)
  - Google image sitemaps (resolves #36)
  - Google news sitemaps
  - alternate links in sitemaps (resolves #149)
- the code is taken from Tanguy Moal's (@tuxnco) PR #162
  with the following modifications:
  - port from DOM to SAX parser
  - keep specific extensions separate from the "core" sitemap classes
2018-09-28 12:04:39 +02:00
Sebastian Nagel 9318de951f Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (#217)
* Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps
- use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat>
- simplify parsing of RSS publication dates
- remove obsolete regex pattern to catch dates with time zone
  but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME)
- extend unit tests

* Fix Javadoc error and warnings, update change log

* Remove obsolete dependency to jaxb-api
- import of javax.xml.bind.DatatypeConverter has been removed
  by updating to Java 8 date and time API
2018-09-24 10:09:58 +01:00
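A short sketch of the approach named in this commit: `DateTimeFormatter` instances are immutable and thread-safe, so a single shared constant can replace a `ThreadLocal<DateFormat>`, and `ISO_OFFSET_DATE_TIME` treats the seconds field as optional. The class name is hypothetical and the code is not taken from the project.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only.
class SharedFormatterSketch {

    // Safe to share between threads, unlike java.text.SimpleDateFormat
    private static final DateTimeFormatter W3C_DATE_TIME = DateTimeFormatter.ISO_OFFSET_DATE_TIME;

    // Parses date-times with an offset, with or without seconds,
    // e.g. 2018-09-24T10:09:58+01:00 or 2018-09-24T10:09+01:00
    static Instant parse(String text) {
        return OffsetDateTime.parse(text, W3C_DATE_TIME).toInstant();
    }
}
```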