mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-09 23:56:04 +02:00
Commit Graph

294 Commits

Author SHA1 Message Date
Julien Nioche 9a54f53c14 [maven-release-plugin] prepare release crawler-commons-0.9 2017-10-27 11:13:07 +01:00
Julien Nioche f3e37f37da Updated change log prior to 0.9 release 2017-10-27 11:08:47 +01:00
Julien Nioche ee69049db0 Sitemap: removed DOM parser, fixes #177 (#181) 2017-10-27 07:48:22 +01:00
Sebastian Nagel d2de87cf92 Normalize RSS pubDate, fixes #166 for SiteMapParserSAX (#180) 2017-10-18 14:07:05 +01:00
Sebastian Nagel b34e951a90 Merge pull request #178 from sebastian-nagel/cc-172-effective-tld-finder
Improve EffectiveTldFinder, fixes #172
2017-10-17 22:12:29 +02:00
Sebastian Nagel 0b70328671 EffectiveTLDFinder improvements, fixes #172
- update public suffix list to recent version of
      https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid
      effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)
- improve parsing of public suffix list
2017-10-17 17:15:16 +02:00
Sebastian Nagel 8dea17e003 Add tests from publicsuffixlist.org 2017-10-17 13:29:14 +02:00
Sebastian Nagel 5e9d70fec6 EffectiveTldFinder: optionally exclude private domains (opposed to ICANN domains) 2017-10-17 13:29:14 +02:00
Sebastian Nagel bba44925a5 Update public suffix list to recent version of
https://publicsuffix.org/list/public_suffix_list.dat
2017-10-17 13:23:15 +02:00
Sebastian Nagel 01d675fc37 Incorrect domains returned by EffectiveTldFinder (#172)
- correctly recognize exceptions to wildcard rules as domains
- do not disallow TLDs with last element not being a TLD (e.g., .ac.za)
- partially fix IDNs: punycoded IDNs are now recognized
- add unit test for uppercase / mixed case host names
2017-10-17 13:23:15 +02:00
Julien Nioche 6adb771b72 Add namespace aware DOM/SAX parsing for XML Sitemaps (#176)
* Add namespace aware DOM/SAX parsing for XML Sitemaps.  RSS and Atom parsing is also namespace aware, but finding elements is left "relaxed" by only matching on the element "localName".

* Lenient namespacing in non strict mode + applied formatting

* Introduced separate field strictNamespace to sitemapparsers + added test to saxparser

* Fixes Javadoc

* Fixes the fix for the Javadoc

* Allow to set strictNamespace in SiteMapTester

- Fix strict namespace handling in SitemapParserSAX:
- pass strictNamespace from DelegatorHandler to delegates
- ignore text if inside an element of invalid namespace
- use SAX parser in unit test
- set exception and pass it to calling DelegatorHandler if namespace
  does not match
2017-10-17 10:47:17 +01:00
Julien Nioche 5e60792a0b Upgraded Tika 1.16, fixes #175 2017-09-08 14:34:17 +01:00
Sebastian Nagel 2afdf5b04d Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169
- completely add sitemap URLs from sitemap index if URL contains
  XML entities or CDATA
2017-08-12 17:28:08 +02:00
Ken Krugler 7e08c1da49 Update CHANGES.txt 2017-06-20 15:18:47 -07:00
Ken Krugler 98e75e419c Merge pull request #167 from MichealKum/sitemap_166_rss_ignores_item_pubDate
ISSUE #166 RSS parser ignores item's pubDate
2017-06-20 15:15:34 -07:00
Ilya Kondratiev a49fba71b7 [#166] Fix.
* Create single TreadLocal static filed for all RSS date-time formats.
* Add messages to assertations.
2017-06-20 23:06:11 +03:00
Ilya Kondratiev b8aaffdf31 [ISSUE-166] RSS parser ignores item's pubDate
* Converts RSS timestamp format to w3c format
* Use pubDate of item in new SiteMapURL creation
2017-06-17 00:03:55 +03:00
Julien Nioche 694e74207b release notes for 0.8 2017-06-09 10:15:44 +01:00
Julien Nioche 7ea41105d7 [maven-release-plugin] prepare for next development iteration 2017-06-07 16:09:32 +01:00
Julien Nioche 58f8e22dc2 [maven-release-plugin] prepare release crawler-commons-0.8 2017-06-07 16:09:25 +01:00
Julien Nioche 79032dbaf5 Merge pull request #165 from crawler-commons/153
SAX parser to stop URL at closing </loc> fixes #153
2017-06-06 22:02:46 +01:00
Sebastian Nagel 4dd61ded84 SAX parser to stop URL at closing </loc> fixes #153
- on opening <loc> and <url> assume forgotten closing </url> tag
  only if there is more than white space in the buffer
2017-06-06 21:04:28 +02:00
Ken Krugler 90c2800869 Merge pull request #164 from crawler-commons/153
SAX parser to stop URL at closing </loc> fixes #153
2017-06-05 12:35:22 -07:00
Julien Nioche 7bf499a98a SAX parser to stop URL at closing </loc> fixes #153 2017-06-05 16:58:39 +01:00
Julien Nioche 2c72ba8708 Update CHANGES.txt
added tika 1.15 to changes
2017-06-02 15:02:20 +01:00
Julien Nioche e2f38b3903 Merge pull request #163 from crawler-commons/tika1.15
Upgraded Tika 1.15
2017-06-02 14:59:05 +01:00
Julien Nioche f8c20c8e4a Upgraded Tika 1.15 2017-06-02 12:22:44 +01:00
Sebastian Nagel 02e62c12cb Disable XML resolvers: update changelog 2017-05-04 22:36:40 +02:00
Sebastian Nagel ce42b00150 Merge pull request #152 from sebastian-nagel/sitemaps-disable-xml-resolvers
Disable validation of XML sitemaps and install noop entity resolver, fixes #151
2017-05-04 22:33:35 +02:00
Julien Nioche 462a7f40b7 Merge pull request #160 from sebastian-nagel/cc-159-sax-atom-feed
SAX parser does not extract URLs from Atom feeds, fixes #159
2017-03-28 13:29:06 +01:00
Sebastian Nagel 4f1534a6f4 SAX parser does not extract URLs from Atom feeds, fixes #159
- call delegated method startElement(...) from DelegatorHandler
- add heuristics to select link URL
- add Atom feed unit test
- add static method to encapsulate reading of test resource files
2017-03-28 11:54:04 +02:00
Julien Nioche 4ba1295c17 Update forbiddenapis to v2.0. Fixes #99 2017-03-20 15:58:55 +00:00
Julien Nioche 9d67dabb1f Merge pull request #157 from sebastian-nagel/cc-143-sitemap-text-gzipped
Fix parsing of gzipped text sitemaps, fixes #143
2017-03-20 15:48:57 +00:00
Sebastian Nagel 772f02fcb0 Fix parsing of gzipped text sitemaps, fixes #143
- detect gzip embedded media type to decide
  whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Julien Nioche 266ac09fff Merge pull request #155 from sebastian-nagel/cc-154-sax-parser-fail-on-unknown-xml
SAX parser to fail on valid but unknown XML formats, fixes #154
2017-03-08 13:58:46 +00:00
Sebastian Nagel 6759ee73e9 SAX parser to fail on valid but unknown XML formats, fixes #154 2017-03-08 14:23:55 +01:00
Sebastian Nagel 9b936ecca4 Disable validation of XML sitemaps and install noop entity resolver, fixes #151 2017-03-03 00:01:12 +01:00
Julien Nioche cb8e760102 Merge pull request #150 from sebastian-nagel/cc-116-sitemap-sax-parser
Optional SAX Parser for Sitemaps
2017-02-28 09:20:09 +00:00
Sebastian Nagel cc09c889fe Merge branch 'master' into cc-116-sitemap-sax-parser 2017-02-27 23:12:55 +01:00
Sebastian Nagel f7c7cab7a8 Merge branch 'matt-deboer-master'
- provide SAX parser optionally to DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Ken Krugler d5d4e1b751 Merge pull request #147 from sebastian-nagel/cc-146-log-xml-errors
Looks good, thanks @sebastian-nagel
2017-02-07 13:39:54 -08:00
Sebastian Nagel be982c1a1c SiteMapParser: install error handlers in DOM builder to log XML errors, fixes #146 2017-02-07 22:06:03 +01:00
Ken Krugler 2d751a04b8 Merge pull request #142 from sebastian-nagel/cc-141-use-standardcharsets
Use constants from StandardCharsets where applicable, fixes #141.

Thanks @sebastian-nagel!
2017-02-02 09:44:00 -08:00
Sebastian Nagel 960b7a8fed remove unused import 2017-02-02 15:18:05 +01:00
Sebastian Nagel 61a500ad21 Use constants from StandardCharsets where applicable, fixes #141 2017-02-02 14:59:14 +01:00
Ken Krugler 375aae3d99 Merge pull request #140 from sebastian-nagel/cc-132-sitemap-limit-50mb
Increase size limit of sitemaps (10MB -> 50MB), fixes #132
2017-02-02 03:57:45 -08:00
Sebastian Nagel 49b3097083 Increase size limit of sitemaps (10MB -> 50MB), fixes #132 2017-02-02 12:00:47 +01:00
Sebastian Nagel 9a49088ec8 Add new committer 2017-01-16 15:47:44 +01:00
Julien Nioche eefeda558c Update CHANGES.txt
#137
2017-01-13 17:31:10 +00:00
Julien Nioche 040c02dad8 Merge pull request #138 from sebastian-nagel/137-default-locale
Remove dependencies to system-specific locale, fixes #137
2017-01-13 17:29:16 +00:00