crawler-commons

mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-25 11:46:03 +02:00

Author	SHA1	Message	Date
Julien Nioche	0da1b8b8b5	Minor changes + applied formatting pre 0.10 release	2018-06-05 11:33:27 +01:00
Sebastian Nagel	d98a3f14cf	Allow for legacy URIs when checking sitemap namespaces (#211) * Allow for legacy URIs when checking sitemap namespaces - e.g., allow legacy namespace URI but ignore URLs from image and video sitemap extensions - resolve relative namespace URIs - add namespace URIs of sitemap extensions (news, images, videos) * Address kkrugler's review comments: - document addition of sitemap namespace required by sitemap protocol specification when calling setStrictNamespace(true) - remove early return on <rss> root element	2018-06-05 11:20:26 +01:00
Sebastian Nagel	0c75e75496	Add main to SimpleRobotRulesParser for testing (#193) * Add main to SimpleRobotRulesParser for testing - implement toString() for robot rules - fix line breaks in comments * Do not detect MIME type as Tika dependency has been removed	2018-06-04 21:26:45 +01:00
Ken Krugler	d99c034dd0	Merge branch 'master' into issue-134	2018-05-14 11:20:17 -07:00
Aecio Santos	47c2cad8b8	Add getters/setters and update CHANGES.txt	2018-05-14 12:00:02 -04:00
Aecio Santos	fd1e7fcffe	SimpleRobotRulesParser: Expose MAX_CRAWL_DELAY #194 - Makes MAX_CRAWL_DELAY configurable through class constructor	2018-05-13 20:10:55 -04:00
Aecio Santos	7bef14d386	Make RobotRules accessible #134 - Makes SimpleRobotRulesParser._rules property protected and adds getters for SimpleRobotRulesParser._rules and RobotRules's properties - Changes SimpleRobotRulesParser return type from BaseRobotRules to SimpleRobotRules to allow access to concrete class without nasty type casts while still obeying super class contract	2018-05-13 20:07:54 -04:00
Sebastian Nagel	7d3eccfa63	Add changelog entry and fix unit test	2018-04-25 14:06:33 +02:00
Luc Boruta	355684ff20	Add methods to parse and iterate sitemap SiteMapParser#walkSiteMap(URL,Consumer)	2018-04-25 14:04:58 +02:00
Sebastian Nagel	0ef7cf87fa	Improve sitemap parsing - ignore query part of URL to determine sitemap location prefix for URL validation, fixes #202 - resolve relative links in RSS feeds, fixes #203 - allow non-continuous content (containing XML entities or CDATA) when parsing links in RSS feeds, fixes #204 - extract links from <guid> elements in RSS feeds, fixes #201	2018-04-25 09:36:27 +02:00
Sebastian Nagel	a9277acde2	Merge pull request #200 from sebastian-nagel/cc-198-fix-regressions Improve MIME detection for sitemaps	2018-04-25 09:19:27 +02:00
Sebastian Nagel	a6b3178fc7	Simplify MIME detection: - handle BOM and leading white space together - remove parameter to detect patterns at a specific offset	2018-04-24 14:32:28 +02:00
Sebastian Nagel	6714ea57d6	Fix error message format string	2018-04-18 14:53:21 +02:00
Sebastian Nagel	72aa177738	RDF-based RSS feeds: map MIME type, detect from content	2018-04-18 14:44:10 +02:00
Sebastian Nagel	4780678af0	Improve logging of content type detection for gzip-compressed sitemaps	2018-04-18 14:43:52 +02:00
Sebastian Nagel	907be2343f	Format fix: add braces, complete CHANGES.txt	2018-04-16 13:36:06 +02:00
Sebastian Nagel	af084f7ee4	Sitemaps limit on "bad url" log messages, fixes #145 - degrade log level to debug for lines which are not valid - only log first 1024 characters of line	2018-04-13 11:16:47 +02:00
Sebastian Nagel	34c19d811a	Fix MIME detection for sitemaps: - avoid NPE if no MIME type has been detected - allow optional leading white space before MIME patterns (after optional BOM)	2018-04-10 19:17:16 +02:00
Ken Krugler	165888ba7e	Use locale with String.format()	2018-03-28 14:40:55 -07:00
Ken Krugler	2cff74427d	Remove Tika dependency	2018-03-27 15:40:50 -07:00
Sebastian Nagel	6de3f26da8	- add missing license header - add debug log messages - add unit test for invalid hostname (www..example..com)	2017-12-07 21:23:26 +01:00
Sebastian Nagel	95c6e3886a	EffectiveTldFinder to parse Internationalized Domain Names (IDN), fixes #179 - allow eTLDs to appear as IDN variants * one or more dot-separated parts as IDN, * optionally others as punycoded IDNA - speed-up lookup of public suffixes / eTLDs using a suffix trie - update public suffix list to recent version - add unit tests to test for wildcard eTLDs	2017-12-06 21:47:01 +01:00
Sebastian Nagel	5c1ad13cea	Add main() to EffectiveTldFinder	2017-12-01 10:54:03 +01:00
Sebastian Nagel	f598af8d28	- address TODO of #184: do not use private suffixes for PaidLevelDomain detection - adapt unit tests: 'de.com' is a private suffix, consequently 'xxx.de.com' isn't a PaidLevelDomain - rebase on master, apply code format	2017-11-06 14:03:35 +01:00
Sebastian Nagel	cda1bdee82	Allow to include/exclude private domains / effective TLDs without re-initialization, fixes #185 - add optional flag excludePrivate to getAssignedDomain(...) / getEffectiveTLD(...) - remove methods which exclude private eTLD during initialization	2017-11-06 11:48:30 +01:00
Ken Krugler	03163d0b3f	Add pending test for private domains	2017-11-05 14:51:14 -08:00
Ken Krugler	cb701a9da9	Use slf4j formatting support and avoid forbidden api check for calling String.format without a locale.	2017-10-28 09:59:28 -07:00
Ken Krugler	c0a158d086	Fix minor typo	2017-10-27 13:53:28 -07:00
Ken Krugler	569ec9b8cd	Use TLD for PLD code	2017-10-27 13:51:29 -07:00
Julien Nioche	ee69049db0	Sitemap: removed DOM parser, fixes #177 (#181)	2017-10-27 07:48:22 +01:00
Sebastian Nagel	d2de87cf92	Normalize RSS pubDate, fixes #166 for SiteMapParserSAX (#180)	2017-10-18 14:07:05 +01:00
Sebastian Nagel	0b70328671	EffectiveTLDFinder improvements, fixes #172 - update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat - add method flag to force a check whether the domain has a valid effective TLD listed in the public suffix list - fix mixed case hostnames (wWW.eXample.com) - improve parsing of public suffix list	2017-10-17 17:15:16 +02:00
Sebastian Nagel	8dea17e003	Add tests from publicsuffixlist.org	2017-10-17 13:29:14 +02:00
Sebastian Nagel	5e9d70fec6	EffectiveTldFinder: optionally exclude private domains (opposed to ICANN domains)	2017-10-17 13:29:14 +02:00
Sebastian Nagel	bba44925a5	Update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat	2017-10-17 13:23:15 +02:00
Sebastian Nagel	01d675fc37	Incorrect domains returned by EffectiveTldFinder (#172) - correctly recognize exceptions to wildcard rules as domains - do not disallow TLDs with last element not being a TLD (e.g., .ac.za) - partially fix IDNs: punycoded IDNs are now recognized - add unit test for uppercase / mixed case host names	2017-10-17 13:23:15 +02:00
Julien Nioche	6adb771b72	Add namespace aware DOM/SAX parsing for XML Sitemaps (#176) * Add namespace aware DOM/SAX parsing for XML Sitemaps. RSS and Atom parsing is also namespace aware, but finding elements is left "relaxed" by only matching on the element "localName". * Lenient namespacing in non strict mode + applied formatting * Introduced separate field strictNamespace to sitemapparsers + added test to saxparser * Fixes Javadoc * Fixes the fix for the Javadoc * Allow to set strictNamespace in SiteMapTester - Fix strict namespace handling in SitemapParserSAX: - pass strictNamespace from DelegatorHandler to delegates - ignore text if inside an element of invalid namespace - use SAX parser in unit test - set exception and pass it to calling DelegatorHandler if namespace does not match	2017-10-17 10:47:17 +01:00
Sebastian Nagel	2afdf5b04d	Sitemap SAX parser mangles sitemap URLs in sitemap index, fixes #169 - completely add sitemap URLs from sitemap index if URL contains XML entities or CDATA	2017-08-12 17:28:08 +02:00
Ilya Kondratiev	a49fba71b7	[#166] Fix. * Create single ThreadLocal static field for all RSS date-time formats. * Add messages to assertions.	2017-06-20 23:06:11 +03:00
Ilya Kondratiev	b8aaffdf31	[ISSUE-166] RSS parser ignores item's pubDate * Converts RSS timestamp format to w3c format * Use pubDate of item in new SiteMapURL creation	2017-06-17 00:03:55 +03:00
Sebastian Nagel	4dd61ded84	SAX parser to stop URL at closing </loc> fixes #153 - on opening <loc> and <url> assume forgotten closing </url> tag only if there is more than white space in the buffer	2017-06-06 21:04:28 +02:00
Julien Nioche	7bf499a98a	SAX parser to stop URL at closing </loc> fixes #153	2017-06-05 16:58:39 +01:00
Sebastian Nagel	ce42b00150	Merge pull request #152 from sebastian-nagel/sitemaps-disable-xml-resolvers Disable validation of XML sitemaps and install noop entity resolver, fixes #151	2017-05-04 22:33:35 +02:00
Sebastian Nagel	4f1534a6f4	SAX parser does not extract URLs from Atom feeds, fixes #159 - call delegated method startElement(...) from DelegatorHandler - add heuristics to select link URL - add Atom feed unit test - add static method to encapsulate reading of test resource files	2017-03-28 11:54:04 +02:00
Sebastian Nagel	772f02fcb0	Fix parsing of gzipped text sitemaps, fixes #143 - detect gzip embedded media type to decide whether to parse as text or XML	2017-03-20 16:24:46 +01:00
Sebastian Nagel	6759ee73e9	SAX parser to fail on valid but unknown XML formats, fixes #154	2017-03-08 14:23:55 +01:00
Sebastian Nagel	9b936ecca4	Disable validation of XML sitemaps and install noop entity resolver, fixes #151	2017-03-03 00:01:12 +01:00
Sebastian Nagel	f7c7cab7a8	Merge branch 'matt-deboer-master' - provide SAX parser optionally to DOM-based parser - SiteMapTester: trigger usage of SAX parser by property sitemap.useSax	2017-02-27 23:09:45 +01:00
Sebastian Nagel	960b7a8fed	remove unused import	2017-02-02 15:18:05 +01:00
Sebastian Nagel	61a500ad21	Use constants from StandardCharsets where applicable, fixes #141	2017-02-02 14:59:14 +01:00

157 Commits