Julien Nioche
939fbf439e
[maven-release-plugin] prepare release crawler-commons-0.10
2018-06-05 11:36:12 +01:00
Julien Nioche
0da1b8b8b5
Minor changes + applied formatting pre 0.10 release
2018-06-05 11:33:27 +01:00
Julien Nioche
8195140e21
Update CHANGES.txt
added #211
2018-06-05 11:23:00 +01:00
Sebastian Nagel
d98a3f14cf
Allow for legacy URIs when checking sitemap namespaces (#211)
* Allow for legacy URIs when checking sitemap namespaces
- e.g., allow legacy namespace URI but ignore URLs
from image and video sitemap extensions
- resolve relative namespace URIs
- add namespace URIs of sitemap extensions (news, images, videos)
* Address kkrugler's review comments:
- document addition of sitemap namespace required by sitemap
protocol specification when calling setStrictNamespace(true)
- remove early return on <rss> root element
2018-06-05 11:20:26 +01:00
Julien Nioche
a8b474551a
Update CHANGES.txt
Add main to SimpleRobotRulesParser for testing (#193)
2018-06-04 21:28:09 +01:00
Sebastian Nagel
0c75e75496
Add main to SimpleRobotRulesParser for testing (#193)
* Add main to SimpleRobotRulesParser for testing
- implement toString() for robot rules
- fix line breaks in comments
* Do not detect MIME type as Tika dependency has been removed
2018-06-04 21:26:45 +01:00
Ken Krugler
993c4fdb9e
Merge pull request #208 from aecio/issue-134
Make RobotRules accessible #134
2018-05-14 11:21:42 -07:00
Ken Krugler
d99c034dd0
Merge branch 'master' into issue-134
2018-05-14 11:20:17 -07:00
Ken Krugler
67adc27a8c
Merge pull request #209 from aecio/issue-194
SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
2018-05-14 11:16:17 -07:00
Aecio Santos
47c2cad8b8
Add getters/setters and update CHANGES.txt
2018-05-14 12:00:02 -04:00
Aecio Santos
aaa3113e55
Update CHANGES.txt
2018-05-14 11:51:43 -04:00
Aecio Santos
fd1e7fcffe
SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194
- Makes MAX_CRAWL_DELAY configurable through class constructor
2018-05-13 20:10:55 -04:00
Aecio Santos
7bef14d386
Make RobotRules accessible #134
- Makes SimpleRobotRulesParser._rules property protected
and adds getters for SimpleRobotRulesParser._rules and
RobotRules's properties
- Changes SimpleRobotRulesParser return type from BaseRobotRules
to SimpleRobotRules to allow access to concrete class without
nasty type casts while still obeying super class contract
2018-05-13 20:07:54 -04:00
Julien Nioche
e25309d26c
Add JAX-B dependencies to POM (#207)
* Add JAX-B dependencies to POM, fixes #196
* mentioned in CHANGES.txt
2018-05-03 11:04:03 +01:00
Sebastian Nagel
8ca46ffe2c
Merge pull request #190 from thunken:features/for-each
Add SiteMapParser#walkSiteMap(URL,Consumer)
2018-04-25 14:08:27 +02:00
Sebastian Nagel
7d3eccfa63
Add changelog entry and fix unit test
2018-04-25 14:06:33 +02:00
Luc Boruta
355684ff20
Add methods to parse and iterate sitemap
SiteMapParser#walkSiteMap(URL,Consumer)
2018-04-25 14:04:58 +02:00
Sebastian Nagel
fa76a594b6
Merge pull request #205 from sebastian-nagel/cc-201-202-203-204
Improve sitemap parsing
2018-04-25 09:51:49 +02:00
Sebastian Nagel
0ef7cf87fa
Improve sitemap parsing
- ignore query part of URL to determine sitemap location prefix
for URL validation, fixes #202
- resolve relative links in RSS feeds, fixes #203
- allow non-continuous content (containing XML entities or CDATA)
when parsing links in RSS feeds, fixes #204
- extract links from <guid> elements in RSS feeds, fixes #201
2018-04-25 09:36:27 +02:00
Sebastian Nagel
a9277acde2
Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions
Improve MIME detection for sitemaps
2018-04-25 09:19:27 +02:00
Sebastian Nagel
a6b3178fc7
Simplify MIME detection:
- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
2018-04-24 14:32:28 +02:00
Sebastian Nagel
6714ea57d6
Fix error message format string
2018-04-18 14:53:21 +02:00
Sebastian Nagel
72aa177738
RDF-based RSS feeds: map MIME type, detect from content
2018-04-18 14:44:10 +02:00
Sebastian Nagel
4780678af0
Improve logging of content type detection for gzip-compressed sitemaps
2018-04-18 14:43:52 +02:00
Sebastian Nagel
8a34e25c41
Merge pull request #206 from sebastian-nagel/cc-145-bad-url-warnings
Sitemaps limit on "bad url" log messages, fixes #145
2018-04-16 13:37:39 +02:00
Sebastian Nagel
907be2343f
Format fix: add braces, complete CHANGES.txt
2018-04-16 13:36:06 +02:00
Sebastian Nagel
af084f7ee4
Sitemaps limit on "bad url" log messages, fixes #145
- degrade log level to debug for lines which are not valid
- only log first 1024 characters of line
2018-04-13 11:16:47 +02:00
Sebastian Nagel
34c19d811a
Fix MIME detection for sitemaps:
- avoid NPE if no MIME type has been detected
- allow optional leading white space before MIME patterns
(after optional BOM)
2018-04-10 19:17:16 +02:00
Ken Krugler
12155888bc
Add reference to issue #199
2018-04-02 12:59:17 -07:00
Ken Krugler
14153c4eae
Merge pull request #198 from crawler-commons/kkrugler_no-tika
Remove Tika dependency
2018-04-02 12:55:40 -07:00
Ken Krugler
165888ba7e
Use locale with String.format()
2018-03-28 14:40:55 -07:00
Ken Krugler
2cff74427d
Remove Tika dependency
2018-03-27 15:40:50 -07:00
Julien Nioche
5132651a6e
Upgraded Tika to 1.17, fixes #189
2018-03-20 17:20:07 +00:00
Ken Krugler
4b13456bc9
Merge pull request #188 from sebastian-nagel/cc-179-effective-tldfinder-parse-idns
EffectiveTldFinder to parse Internationalized Domain Names (IDN)
2017-12-08 09:16:37 -08:00
Sebastian Nagel
49bf37c6d9
Update CHANGES.txt
2017-12-08 09:42:52 +01:00
Sebastian Nagel
6de3f26da8
- add missing license header
- add debug log messages
- add unit test for invalid hostname (www..example..com)
2017-12-07 21:23:26 +01:00
Sebastian Nagel
95c6e3886a
EffectiveTldFinder to parse Internationalized Domain Names (IDN), fixes #179
- allow eTLDs to appear as IDN variants
* one or more dot-separated parts as IDN,
* optionally others as punycoded IDNA
- speed up lookup of public suffixes / eTLDs using a suffix trie
- update public suffix list to recent version
- add unit tests to test for wildcard eTLDs
2017-12-06 21:47:01 +01:00
Sebastian Nagel
b63b9b62ce
Merge pull request #187 from sebastian-nagel/effective-tldfinder-main
Add main() to EffectiveTldFinder
2017-12-06 16:52:41 +01:00
Sebastian Nagel
5c1ad13cea
Add main() to EffectiveTldFinder
2017-12-01 10:54:03 +01:00
Sebastian Nagel
bde59ff33e
Merge pull request #186 from sebastian-nagel/cc-185-effective-tld-finder-private-domains
Allow including/excluding private domains / effective TLDs without re-initialization, fixes #185
2017-12-01 10:22:56 +01:00
Sebastian Nagel
f598af8d28
- address TODO of #184: do not use private suffixes for PaidLevelDomain detection
- adapt unit tests: 'de.com' is a private suffix,
consequently 'xxx.de.com' isn't a PaidLevelDomain
- rebase on master, apply code format
2017-11-06 14:03:35 +01:00
Sebastian Nagel
cda1bdee82
Allow including/excluding private domains / effective TLDs without re-initialization, fixes #185
- add optional flag excludePrivate to getAssignedDomain(...) / getEffectiveTLD(...)
- remove methods which exclude private eTLD during initialization
2017-11-06 11:48:30 +01:00
Ken Krugler
ac78f25356
Merge pull request #184 from crawler-commons/183_paidleveldomain-with-tld
Use TLD for PLD code
2017-11-05 15:04:01 -08:00
Ken Krugler
2b58c5050c
merge with master
2017-11-05 14:55:30 -08:00
Ken Krugler
aeb0cb91a2
Update CHANGES.txt
2017-11-05 14:53:58 -08:00
Ken Krugler
03163d0b3f
Add pending test for private domains
2017-11-05 14:51:14 -08:00
Julien Nioche
af0a013776
Released 0.9
2017-10-31 09:42:57 +00:00
Ken Krugler
cb701a9da9
Use slf4j formatting support
This also avoids the forbidden API check failing on calls to
String.format without a locale.
2017-10-28 09:59:28 -07:00
Ken Krugler
c0a158d086
Fix minor typo
2017-10-27 13:53:28 -07:00
Ken Krugler
569ec9b8cd
Use TLD for PLD code
2017-10-27 13:51:29 -07:00