Julien Nioche
9a54f53c14
[maven-release-plugin] prepare release crawler-commons-0.9
2017-10-27 11:13:07 +01:00
Julien Nioche
f3e37f37da
Updated change log prior to 0.9 release
2017-10-27 11:08:47 +01:00
Julien Nioche
ee69049db0
Sitemap: removed DOM parser, fixes #177 ( #181 )
2017-10-27 07:48:22 +01:00
Sebastian Nagel
d2de87cf92
Normalize RSS pubDate, fixes #166 for SiteMapParserSAX ( #180 )
2017-10-18 14:07:05 +01:00
Sebastian Nagel
b34e951a90
Merge pull request #178 from sebastian-nagel/cc-172-effective-tld-finder
...
Improve EffectiveTldFinder, fixes #172
2017-10-17 22:12:29 +02:00
Sebastian Nagel
0b70328671
EffectiveTLDFinder improvements, fixes #172
...
- update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)
- improve parsing of public suffix list
2017-10-17 17:15:16 +02:00
Sebastian Nagel
8dea17e003
Add tests from publicsuffixlist.org
2017-10-17 13:29:14 +02:00
Sebastian Nagel
5e9d70fec6
EffectiveTldFinder: optionally exclude private domains (as opposed to ICANN domains)
2017-10-17 13:29:14 +02:00
Sebastian Nagel
bba44925a5
Update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat
2017-10-17 13:23:15 +02:00
Sebastian Nagel
01d675fc37
Incorrect domains returned by EffectiveTldFinder ( #172 )
...
- correctly recognize exceptions to wildcard rules as domains
- do not disallow TLDs with last element not being a TLD (e.g., .ac.za)
- partially fix IDNs: punycoded IDNs are now recognized
- add unit test for uppercase / mixed case host names
2017-10-17 13:23:15 +02:00
Julien Nioche
6adb771b72
Add namespace aware DOM/SAX parsing for XML Sitemaps ( #176 )
...
* Add namespace aware DOM/SAX parsing for XML Sitemaps. RSS and Atom parsing is also namespace aware, but finding elements is left "relaxed" by only matching on the element "localName".
* Lenient namespacing in non strict mode + applied formatting
* Introduced separate field strictNamespace to sitemapparsers + added test to saxparser
* Fixes Javadoc
* Fixes the fix for the Javadoc
* Allow to set strictNamespace in SiteMapTester
- Fix strict namespace handling in SitemapParserSAX:
  - pass strictNamespace from DelegatorHandler to delegates
  - ignore text if inside an element of invalid namespace
  - use SAX parser in unit test
  - set exception and pass it to calling DelegatorHandler if namespace does not match
2017-10-17 10:47:17 +01:00
Julien Nioche
5e60792a0b
Upgraded Tika 1.16, fixes #175
2017-09-08 14:34:17 +01:00
Sebastian Nagel
2afdf5b04d
Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169
...
- completely add sitemap URLs from sitemap index if URL contains XML entities or CDATA
2017-08-12 17:28:08 +02:00
Ken Krugler
7e08c1da49
Update CHANGES.txt
2017-06-20 15:18:47 -07:00
Ken Krugler
98e75e419c
Merge pull request #167 from MichealKum/sitemap_166_rss_ignores_item_pubDate
...
ISSUE #166 RSS parser ignores item's pubDate
2017-06-20 15:15:34 -07:00
Ilya Kondratiev
a49fba71b7
[ #166 ] Fix.
...
* Create single ThreadLocal static field for all RSS date-time formats.
* Add messages to assertions.
2017-06-20 23:06:11 +03:00
Ilya Kondratiev
b8aaffdf31
[ISSUE-166] RSS parser ignores item's pubDate
...
* Convert RSS timestamp format to W3C format
* Use pubDate of item in new SiteMapURL creation
2017-06-17 00:03:55 +03:00
Julien Nioche
694e74207b
release notes for 0.8
2017-06-09 10:15:44 +01:00
Julien Nioche
7ea41105d7
[maven-release-plugin] prepare for next development iteration
2017-06-07 16:09:32 +01:00
Julien Nioche
58f8e22dc2
[maven-release-plugin] prepare release crawler-commons-0.8
2017-06-07 16:09:25 +01:00
Julien Nioche
79032dbaf5
Merge pull request #165 from crawler-commons/153
...
SAX parser to stop URL at closing </loc> fixes #153
2017-06-06 22:02:46 +01:00
Sebastian Nagel
4dd61ded84
SAX parser to stop URL at closing </loc> fixes #153
...
- on opening <loc> and <url> assume forgotten closing </url> tag only if there is more than white space in the buffer
2017-06-06 21:04:28 +02:00
Ken Krugler
90c2800869
Merge pull request #164 from crawler-commons/153
...
SAX parser to stop URL at closing </loc> fixes #153
2017-06-05 12:35:22 -07:00
Julien Nioche
7bf499a98a
SAX parser to stop URL at closing </loc> fixes #153
2017-06-05 16:58:39 +01:00
Julien Nioche
2c72ba8708
Update CHANGES.txt
...
added tika 1.15 to changes
2017-06-02 15:02:20 +01:00
Julien Nioche
e2f38b3903
Merge pull request #163 from crawler-commons/tika1.15
...
Upgraded Tika 1.15
2017-06-02 14:59:05 +01:00
Julien Nioche
f8c20c8e4a
Upgraded Tika 1.15
2017-06-02 12:22:44 +01:00
Sebastian Nagel
02e62c12cb
Disable XML resolvers: update changelog
2017-05-04 22:36:40 +02:00
Sebastian Nagel
ce42b00150
Merge pull request #152 from sebastian-nagel/sitemaps-disable-xml-resolvers
...
Disable validation of XML sitemaps and install noop entity resolver, fixes #151
2017-05-04 22:33:35 +02:00
Julien Nioche
462a7f40b7
Merge pull request #160 from sebastian-nagel/cc-159-sax-atom-feed
...
SAX parser does not extract URLs from Atom feeds, fixes #159
2017-03-28 13:29:06 +01:00
Sebastian Nagel
4f1534a6f4
SAX parser does not extract URLs from Atom feeds, fixes #159
...
- call delegated method startElement(...) from DelegatorHandler
- add heuristics to select link URL
- add Atom feed unit test
- add static method to encapsulate reading of test resource files
2017-03-28 11:54:04 +02:00
Julien Nioche
4ba1295c17
Update forbiddenapis to v2.0. Fixes #99
2017-03-20 15:58:55 +00:00
Julien Nioche
9d67dabb1f
Merge pull request #157 from sebastian-nagel/cc-143-sitemap-text-gzipped
...
Fix parsing of gzipped text sitemaps, fixes #143
2017-03-20 15:48:57 +00:00
Sebastian Nagel
772f02fcb0
Fix parsing of gzipped text sitemaps, fixes #143
...
- detect gzip embedded media type to decide whether to parse as text or XML
2017-03-20 16:24:46 +01:00
Julien Nioche
266ac09fff
Merge pull request #155 from sebastian-nagel/cc-154-sax-parser-fail-on-unknown-xml
...
SAX parser to fail on valid but unknown XML formats, fixes #154
2017-03-08 13:58:46 +00:00
Sebastian Nagel
6759ee73e9
SAX parser to fail on valid but unknown XML formats, fixes #154
2017-03-08 14:23:55 +01:00
Sebastian Nagel
9b936ecca4
Disable validation of XML sitemaps and install noop entity resolver, fixes #151
2017-03-03 00:01:12 +01:00
Julien Nioche
cb8e760102
Merge pull request #150 from sebastian-nagel/cc-116-sitemap-sax-parser
...
Optional SAX Parser for Sitemaps
2017-02-28 09:20:09 +00:00
Sebastian Nagel
cc09c889fe
Merge branch 'master' into cc-116-sitemap-sax-parser
2017-02-27 23:12:55 +01:00
Sebastian Nagel
f7c7cab7a8
Merge branch 'matt-deboer-master'
...
- provide SAX parser as an optional alternative to the DOM-based parser
- SiteMapTester: trigger usage of SAX parser by property sitemap.useSax
2017-02-27 23:09:45 +01:00
Ken Krugler
d5d4e1b751
Merge pull request #147 from sebastian-nagel/cc-146-log-xml-errors
...
Looks good, thanks @sebastian-nagel
2017-02-07 13:39:54 -08:00
Sebastian Nagel
be982c1a1c
SiteMapParser: install error handlers in DOM builder to log XML errors, fixes #146
2017-02-07 22:06:03 +01:00
Ken Krugler
2d751a04b8
Merge pull request #142 from sebastian-nagel/cc-141-use-standardcharsets
...
Use constants from StandardCharsets where applicable, fixes #141 .
Thanks @sebastian-nagel!
2017-02-02 09:44:00 -08:00
Sebastian Nagel
960b7a8fed
remove unused import
2017-02-02 15:18:05 +01:00
Sebastian Nagel
61a500ad21
Use constants from StandardCharsets where applicable, fixes #141
2017-02-02 14:59:14 +01:00
Ken Krugler
375aae3d99
Merge pull request #140 from sebastian-nagel/cc-132-sitemap-limit-50mb
...
Increase size limit of sitemaps (10MB -> 50MB), fixes #132
2017-02-02 03:57:45 -08:00
Sebastian Nagel
49b3097083
Increase size limit of sitemaps (10MB -> 50MB), fixes #132
2017-02-02 12:00:47 +01:00
Sebastian Nagel
9a49088ec8
Add new committer
2017-01-16 15:47:44 +01:00
Julien Nioche
eefeda558c
Update CHANGES.txt
...
#137
2017-01-13 17:31:10 +00:00
Julien Nioche
040c02dad8
Merge pull request #138 from sebastian-nagel/137-default-locale
...
Remove dependencies to system-specific locale, fixes #137
2017-01-13 17:29:16 +00:00