Sebastian Nagel
8a34e25c41
Merge pull request #206 from sebastian-nagel/cc-145-bad-url-warnings
Sitemaps limit on "bad url" log messages, fixes #145
2018-04-16 13:37:39 +02:00
Sebastian Nagel
907be2343f
Format fix: add braces, complete CHANGES.txt
2018-04-16 13:36:06 +02:00
Sebastian Nagel
af084f7ee4
Sitemaps limit on "bad url" log messages, fixes #145
- degrade log level to debug for lines which are not valid
- only log first 1024 characters of line
2018-04-13 11:16:47 +02:00
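The two bullets in the commit above (invalid lines logged at debug level, capped at 1024 characters) could look roughly like the following. This is a minimal sketch, not the code from the commit: the class `BadUrlLogSketch`, the methods `truncateForLog` and `logBadUrl`, and the use of `java.util.logging` (with `FINE` standing in for debug) are assumptions made for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration only; not part of crawler-commons.
final class BadUrlLogSketch {
    private static final Logger LOG = Logger.getLogger(BadUrlLogSketch.class.getName());

    // Cap taken from the commit message: only the first 1024 characters are logged.
    static final int MAX_LOGGED_CHARS = 1024;

    // Returns the line unchanged if it is short enough,
    // otherwise only its first MAX_LOGGED_CHARS characters.
    static String truncateForLog(String line) {
        return line.length() <= MAX_LOGGED_CHARS ? line : line.substring(0, MAX_LOGGED_CHARS);
    }

    // Logs an invalid line at FINE (the java.util.logging counterpart of debug),
    // so that a sitemap full of bad lines does not flood the default log output.
    static void logBadUrl(String line) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.log(Level.FINE, "Bad url: [{0}]", truncateForLog(line));
        }
    }
}
```

Checking `isLoggable` first avoids building the truncated copy when debug output is switched off.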
Sebastian Nagel
34c19d811a
Fix MIME detection for sitemaps:
- avoid NPE if no MIME type has been detected
- allow optional leading white space before MIME patterns (after optional BOM)
2018-04-10 19:17:16 +02:00
Ken Krugler
12155888bc
Add reference to issue #199
2018-04-02 12:59:17 -07:00
Ken Krugler
14153c4eae
Merge pull request #198 from crawler-commons/kkrugler_no-tika
Remove Tika dependency
2018-04-02 12:55:40 -07:00
Ken Krugler
165888ba7e
Use locale with String.format()
2018-03-28 14:40:55 -07:00
Ken Krugler
2cff74427d
Remove Tika dependency
2018-03-27 15:40:50 -07:00
Julien Nioche
5132651a6e
Upgraded Tika to 1.17, fixes #189
2018-03-20 17:20:07 +00:00
Ken Krugler
4b13456bc9
Merge pull request #188 from sebastian-nagel/cc-179-effective-tldfinder-parse-idns
EffectiveTldFinder to parse Internationalized Domain Names (IDN)
2017-12-08 09:16:37 -08:00
Sebastian Nagel
49bf37c6d9
Update CHANGES.txt
2017-12-08 09:42:52 +01:00
Sebastian Nagel
6de3f26da8
- add missing license header
- add debug log messages
- add unit test for invalid hostname (www..example..com)
2017-12-07 21:23:26 +01:00
Sebastian Nagel
95c6e3886a
EffectiveTldFinder to parse Internationalized Domain Names (IDN), fixes #179
- allow eTLDs to appear as IDN variants
  * one or more dot-separated parts as IDN,
  * optionally others as punycoded IDNA
- speed-up lookup of public suffixes / eTLDs using a suffix trie
- update public suffix list to recent version
- add unit tests to test for wildcard eTLDs
2017-12-06 21:47:01 +01:00
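The suffix-trie lookup mentioned in the commit above can be illustrated as follows. This is a simplified sketch under stated assumptions, not the EffectiveTldFinder implementation: the class `SuffixTrieSketch` and its methods `add` and `longestSuffix` are hypothetical, and wildcard rules, exception rules, and IDN/punycode handling from the public suffix list are deliberately left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified trie keyed on host name labels, read right to left.
final class SuffixTrieSketch {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean isSuffix; // true if the path from the root to this node is a listed suffix
    }

    private final Node root = new Node();

    // Registers a suffix such as "co.uk": labels are inserted in reverse order (uk -> co).
    void add(String suffix) {
        String[] labels = suffix.toLowerCase(Locale.ROOT).split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(labels[i], k -> new Node());
        }
        node.isSuffix = true;
    }

    // Walks the host name from its last label towards the first and returns the
    // longest registered suffix, or null if none matches.
    String longestSuffix(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        Node node = root;
        int best = -1;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.get(labels[i]);
            if (node == null) {
                break;
            }
            if (node.isSuffix) {
                best = i;
            }
        }
        if (best < 0) {
            return null;
        }
        return String.join(".", Arrays.copyOfRange(labels, best, labels.length));
    }
}
```

With this layout a lookup touches at most one node per label of the host name, rather than testing every candidate suffix string separately.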
Sebastian Nagel
b63b9b62ce
Merge pull request #187 from sebastian-nagel/effective-tldfinder-main
Add main() to EffectiveTldFinder
2017-12-06 16:52:41 +01:00
Sebastian Nagel
5c1ad13cea
Add main() to EffectiveTldFinder
2017-12-01 10:54:03 +01:00
Sebastian Nagel
bde59ff33e
Merge pull request #186 from sebastian-nagel/cc-185-effective-tld-finder-private-domains
Allow to include/exclude private domains / effective TLDs without re-initialization, fixes #185
2017-12-01 10:22:56 +01:00
Sebastian Nagel
f598af8d28
- address TODO of #184: do not use private suffixes for PaidLevelDomain detection
- adapt unit tests: 'de.com' is a private suffix, consequently 'xxx.de.com' isn't a PaidLevelDomain
- rebase on master, apply code format
2017-11-06 14:03:35 +01:00
Sebastian Nagel
cda1bdee82
Allow to include/exclude private domains / effective TLDs without re-initialization, fixes #185
- add optional flag excludePrivate to getAssignedDomain(...) / getEffectiveTLD(...)
- remove methods which exclude private eTLD during initialization
2017-11-06 11:48:30 +01:00
Ken Krugler
ac78f25356
Merge pull request #184 from crawler-commons/183_paidleveldomain-with-tld
Use TLD for PLD code
2017-11-05 15:04:01 -08:00
Ken Krugler
2b58c5050c
merge with master
2017-11-05 14:55:30 -08:00
Ken Krugler
aeb0cb91a2
Update CHANGES.txt
2017-11-05 14:53:58 -08:00
Ken Krugler
03163d0b3f
Add pending test for private domains
2017-11-05 14:51:14 -08:00
Julien Nioche
af0a013776
Released 0.9
2017-10-31 09:42:57 +00:00
Ken Krugler
cb701a9da9
Use slf4j formatting support
And avoid forbidden API check for calling String.format without a locale.
2017-10-28 09:59:28 -07:00
Ken Krugler
c0a158d086
Fix minor typo
2017-10-27 13:53:28 -07:00
Ken Krugler
569ec9b8cd
Use TLD for PLD code
2017-10-27 13:51:29 -07:00
Julien Nioche
535f96236c
[maven-release-plugin] prepare for next development iteration
2017-10-27 11:13:15 +01:00
Julien Nioche
9a54f53c14
[maven-release-plugin] prepare release crawler-commons-0.9
2017-10-27 11:13:07 +01:00
Julien Nioche
f3e37f37da
Updated change log prior to 0.9 release
2017-10-27 11:08:47 +01:00
Julien Nioche
ee69049db0
Sitemap: removed DOM parser, fixes #177 ( #181 )
2017-10-27 07:48:22 +01:00
Sebastian Nagel
d2de87cf92
Normalize RSS pubDate, fixes #166 for SiteMapParserSAX ( #180 )
2017-10-18 14:07:05 +01:00
Sebastian Nagel
b34e951a90
Merge pull request #178 from sebastian-nagel/cc-172-effective-tld-finder
Improve EffectiveTldFinder, fixes #172
2017-10-17 22:12:29 +02:00
Sebastian Nagel
0b70328671
EffectiveTLDFinder improvements, fixes #172
- update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)
- improve parsing of public suffix list
2017-10-17 17:15:16 +02:00
Sebastian Nagel
8dea17e003
Add tests from publicsuffixlist.org
2017-10-17 13:29:14 +02:00
Sebastian Nagel
5e9d70fec6
EffectiveTldFinder: optionally exclude private domains (opposed to ICANN domains)
2017-10-17 13:29:14 +02:00
Sebastian Nagel
bba44925a5
Update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat
2017-10-17 13:23:15 +02:00
Sebastian Nagel
01d675fc37
Incorrect domains returned by EffectiveTldFinder ( #172 )
- correctly recognize exceptions to wildcard rules as domains
- do not disallow TLDs with last element not being a TLD (e.g., .ac.za)
- partially fix IDNs: punycoded IDNs are now recognized
- add unit test for uppercase / mixed case host names
2017-10-17 13:23:15 +02:00
Julien Nioche
6adb771b72
Add namespace aware DOM/SAX parsing for XML Sitemaps ( #176 )
* Add namespace aware DOM/SAX parsing for XML Sitemaps. RSS and Atom parsing is also namespace aware, but finding elements is left "relaxed" by only matching on the element "localName".
* Lenient namespacing in non strict mode + applied formatting
* Introduced separate field strictNamespace to sitemapparsers + added test to saxparser
* Fixes Javadoc
* Fixes the fix for the Javadoc
* Allow to set strictNamespace in SiteMapTester
- Fix strict namespace handling in SitemapParserSAX:
  - pass strictNamespace from DelegatorHandler to delegates
  - ignore text if inside an element of invalid namespace
  - use SAX parser in unit test
  - set exception and pass it to calling DelegatorHandler if namespace does not match
2017-10-17 10:47:17 +01:00
Julien Nioche
5e60792a0b
Upgraded Tika to 1.16, fixes #175
2017-09-08 14:34:17 +01:00
Sebastian Nagel
2afdf5b04d
Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169
- completely add sitemap URLs from sitemap index if URL contains XML entities or CDATA
2017-08-12 17:28:08 +02:00
Ken Krugler
7e08c1da49
Update CHANGES.txt
2017-06-20 15:18:47 -07:00
Ken Krugler
98e75e419c
Merge pull request #167 from MichealKum/sitemap_166_rss_ignores_item_pubDate
ISSUE #166 RSS parser ignores item's pubDate
2017-06-20 15:15:34 -07:00
Ilya Kondratiev
a49fba71b7
[ #166 ] Fix.
* Create single ThreadLocal static field for all RSS date-time formats.
* Add messages to assertions.
2017-06-20 23:06:11 +03:00
Ilya Kondratiev
b8aaffdf31
[ISSUE-166] RSS parser ignores item's pubDate
* Converts RSS timestamp format to w3c format
* Use pubDate of item in new SiteMapURL creation
2017-06-17 00:03:55 +03:00
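The conversion described in the commit above (an RSS `pubDate` timestamp rewritten in W3C date-time form) can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the class `RssDateSketch` and the method `toW3cDateTime` are hypothetical, and `java.time` is used here as a stand-in for whatever date formats the actual fix relies on.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only; not part of crawler-commons.
final class RssDateSketch {

    // Parses an RFC 1123 style timestamp as used by RSS pubDate
    // (e.g. "Sat, 17 Jun 2017 00:03:55 +0300") and returns it in
    // ISO 8601 / W3C date-time form with an explicit UTC offset.
    // Returns null if the input is null or cannot be parsed.
    static String toW3cDateTime(String pubDate) {
        if (pubDate == null) {
            return null;
        }
        try {
            OffsetDateTime parsed =
                    OffsetDateTime.parse(pubDate.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return parsed.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

The `java.time` formatters are immutable and thread-safe, so this variant needs no per-thread formatter instances; real-world feeds often deviate from RFC 1123, which this sketch does not attempt to handle.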
Julien Nioche
694e74207b
release notes for 0.8
2017-06-09 10:15:44 +01:00
Julien Nioche
7ea41105d7
[maven-release-plugin] prepare for next development iteration
2017-06-07 16:09:32 +01:00
Julien Nioche
58f8e22dc2
[maven-release-plugin] prepare release crawler-commons-0.8
2017-06-07 16:09:25 +01:00
Julien Nioche
79032dbaf5
Merge pull request #165 from crawler-commons/153
SAX parser to stop URL at closing </loc> fixes #153
2017-06-06 22:02:46 +01:00
Sebastian Nagel
4dd61ded84
SAX parser to stop URL at closing </loc> fixes #153
- on opening <loc> and <url> assume forgotten closing </url> tag only if there is more than white space in the buffer
2017-06-06 21:04:28 +02:00
Ken Krugler
90c2800869
Merge pull request #164 from crawler-commons/153
SAX parser to stop URL at closing </loc> fixes #153
2017-06-05 12:35:22 -07:00