crawler-commons

mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-24 03:26:04 +02:00

Author	SHA1	Message	Date
Julien Nioche	1447aa4799	JAVADOC fix in NewsAttributes	2020-06-25 09:49:08 +01:00
Sebastian Nagel	774c5c8092	Improvements to BasicURLNormalizer (#292) - better percent-encoding of URL paths and queries, fixes #263 - hostnames: * convert IDNs from Unicode to Punycode, fixes #248 * remove trailing dot - normalize path `/..` to `/` - also normalize path of file:/ URLs	2020-06-22 13:51:39 +01:00
Sebastian Nagel	78d7e7e85f	Sitemaps to implement Serializable, fixes #244 (#294) - make all sitemap classes including extensions to implement the Serializable interface - extend sitemap parser unit tests to check object serialization on various types of sitemaps (index, Atom/RSS feeds, video sitemaps, etc.)	2020-06-22 12:51:40 +01:00
Sebastian Nagel	e9772be011	[sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 (#293) * [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 * [sitemaps] Allow to deduplicate sitemap links in sitemap indexes, fixes #262 - add unit test	2020-06-19 13:15:48 +01:00
Evan Halley	ed0933f2b3	fixing NewsAttribute.equals(), comparing against that.publicationDate, updating the unit test, added change to Changes.txt (#291)	2020-06-16 14:39:05 +01:00
Sebastian Nagel	48aecd2c68	Fix build failure (forbidden apis) (#290) - make String::format not depend on system locale	2020-06-16 09:59:04 +01:00
Evan Halley	c04e3f17e7	Adding asMap to ExtensionMetadata Interface (#288) * added abstract method to extension metadata * implemented asmap in image/link/mobile/news attributes * implemented asmap in videoattributes * adding video attributes unit test * added news attributes unit tests * unit test for link attributes * unit tests for image and mobile attributes * added constants to news and link attributes fixing a small issue in NewsAttributes.toString * using constants instead of strings in more attributes * cleaned up the imports * decreasing the visibility of LinkAttributes.PARAMS_PREFIX adding a comment explaining its usage * added related issue to the changelog * reverting change to NewsAttributes.equal, that causes a unit test failure	2020-06-15 15:55:20 +01:00
Avi Hayun	cb21c29a57	Upgrade the toString() method of the Base/simple RobotRules #264	2020-04-29 20:00:43 +03:00
Avi Hayun	0265b98254	Merge branch 'BetterRobotsToString' into master	2020-04-29 08:59:49 +03:00
Sebastian Nagel	ac3fd3a464	Robots parser: deduplicate sitemap links, fixes #261 - use a LinkedHashSet to store and unify sitemap URLs	2020-03-24 10:31:38 +01:00
Sebastian Nagel	c070453c5e	EffectiveTldFinder to log loading of public suffix list, fixes #284 - log the location of the public suffix list (effective_tld_names.dat) during initialization - log errors while loading as errors - improve documentation about public suffix list and class initialization	2020-02-17 16:41:25 +01:00
Sebastian Nagel	8522cfdd34	[SiteMapParser] getPublicationDate in VideoAttributes may throw NPE, fixes #283 (#286) - check for null values before converting ZonedDateTime to Date	2020-02-17 15:33:38 +00:00
Sebastian Nagel	f83c43bdd4	SimpleRobotRulesParser: Trim log messages, fixes #281 (#282) - clip strings shown in log message to 1024 chars	2020-02-03 11:25:59 +00:00
Sebastian Nagel	6c271f5fab	SimpleRobotRulesParser: counter _numWarnings not thread-safe, fixes #278 - move the counter _numWarnings into the ParseState object local to parseContent(...) which makes counting and suppressing warnings thread-safe - after parsing the warnings count is stored in a ThreadLocal member variable of SimpleRobotRulesParser, the value is returned by getNumWarnings() - use parameterized method calls for logging, avoid concatenating log messages beforehand	2020-01-08 13:55:46 +01:00
Sebastian Nagel	f7cc8b694a	Make mvn run parameterized tests, fixes #273 - make test classes public - upgrade Maven surefire plugin to recent version - depend on junit-jupiter-engine	2019-11-12 13:49:40 +01:00
Sebastian Nagel	7419e96d74	[BasicNormalizer] Empty path to be normalized to / even if there is an (empty) query, fixes #247	2019-11-12 13:32:00 +01:00
Sebastian Nagel	2979aaf4aa	[BasicNormalizer] Trailing question mark in url query test should be added, fixes #247	2019-10-17 11:58:36 +02:00
Sebastian Nagel	7e5347eca6	Merge pull request #252 from sebastian-nagel/cc-251-domain-max-length-check EffectiveTldFinder to validate returned domain names for length restrictions	2019-10-15 16:24:47 +02:00
Avi Hayun	bb7523d978	Updated the toString method to include a list of sitemaps if there are less than 10 sitemaps in the robots file	2019-09-11 18:22:41 +03:00
Avi Hayun	1c74fc6ca0	The robots.txt test now supports the junit5 parameterizedTest, gaining us the ability to have each robot rule be tested separately, so if one rule fails, the others can still show success in their own unit tests. (#256) IMHO it also makes the tests more readable but that is in the eyes of the beholder This file is smaller by 40 lines, and has 172 unit tests instead of 46 tests	2019-07-23 13:13:45 +01:00
Avi Hayun	3597dee5a7	Updated the tests to use parameterizedTest, thus making the test file more readable and each normalized pair is now a test in itself, thus making it much easier to identify exact failure points	2019-07-16 21:17:25 +03:00
Avi Hayun	5bda3636d2	Upgraded to Junit v5.5 (#250) * Upgraded to Junit v5.5 Updated the annotations and assertions accordingly * Removed unneeded before and after * This is a technical debt Fixing a styling issue I caused about 4 years ago Details can be found here: https://github.com/crawler-commons/crawler-commons/pull/82 * Fixed according to @sebastian-nagel code review	2019-07-15 21:29:03 +03:00
Sebastian Nagel	fba716c765	EffectiveTldFinder to validate returned domain names for length restrictions, fixes #251	2019-07-15 12:39:04 +02:00
Sebastian Nagel	b449fdf024	Robots parser to always handle absolute sitemap URL even without valid base URL, fixes #240	2019-03-26 15:05:12 +01:00
Sebastian Nagel	e8b598b2e8	[Sitemaps] Unit tests depend on system timezone, fixes #238 (#239) - fix unit test to format data in time zone UTC - improve documentation of `convertToZonedDateTime`: add note that UTC is assumed if no time zone is contained in date string	2019-03-19 15:00:04 +00:00
Sebastian Nagel	40531efe25	EffectiveTldFinder: upgrade public suffix list, implements #219 (#235) - upgrade the public suffix list to the latest version retrieved right now from https://publicsuffix.org/list/public_suffix_list.dat resp. `802c469416/public_suffix_list.dat`	2019-03-14 11:22:50 +00:00
Sebastian Nagel	a69bf09cd4	Detection and parsing of XML sitemaps fails with whitespace before XML declaration, fixes #144 - implement InputStream skipping over white space at beginning of file - use for XML sitemaps in combination with BOMInputStream, so that white space or empty lines before <?xml ...> do not cause the parser to fail	2019-03-07 16:17:09 +01:00
Sebastian Nagel	2539b07948	Buffer characters when characters(...) is called (instead of immediately processing the current character chunk) - use `localName` instead of calling `currentElement()` where applicable - remove unnecessary null checks of character buffer	2019-02-25 14:32:26 +01:00
Sebastian Nagel	82859c2b1a	Buffer characters when characters(...) is called (instead of immediately processing the current character chunk) - fix errors when character chunks are interrupted by CDATA sections or character entities - fixes #225 XMLIndexHandler needs to accumulate the lastmod date string before parsing - fixes #226 XMLHandler needs to append text in characters() vs. immediately processing - provide character buffer in DelegatorHandler, so that derived classes can append characters to it and finally get the buffered content - code cleanup in all handler classes: - add @Override annotations - remove stub method implementations	2019-02-21 23:01:32 +01:00
Sebastian Nagel	916415d262	Merge branch 'master' into cc-231-etld-invalid-idns	2019-02-21 22:16:48 +01:00
Sebastian Nagel	67db8bf1be	[Sitemaps] Trim Unicode whitespace around URLs, fixes #224	2019-02-20 16:27:16 +01:00
Sebastian Nagel	680df29f0f	EffectiveTldFinder throws IllegalArgumentException on IDN domain names containing prohibited characters, fixes #231 - catch IllegalArgumentException when converting IDNs to ASCII - validate domain names returned by getAssignedDomain(...) - add unit tests, complete Javadoc	2019-02-20 14:59:58 +01:00
Sebastian Nagel	2bdc1923b1	Merge branch 'master' into cc-213-auto-close-sitemap-in-sitemapindex	2019-02-03 13:47:31 +01:00
Sebastian Nagel	5706ee1e97	Sitemap index: stop URL at closing </loc> - bug fix: <sitemap> element is closed, not <url> - add unit test to cover resetting of attributes	2019-02-03 13:44:14 +01:00
Sebastian Nagel	e0ef1ee29d	Remove final from method parameters	2019-01-18 17:27:23 +01:00
Sebastian Nagel	13fab2e033	Sitemap index: stop URL at closing </loc>, fixes #213 - at start of a <loc> element auto-close any unclosed <sitemap> element and add the sitemap if there is a valid URL from the previous <loc> element	2018-12-13 15:01:23 +01:00
Sebastian Nagel	439c350ab3	Allow empty price in video sitemaps, fixes #221	2018-12-07 16:44:47 +01:00
Sebastian Nagel	f361597447	Inline unit test for locale-specific number formats in video:price tags (#220) - simplify unit tests to contain only tested or required video attributes - fix order of attributes/elements to follow XSD schema spec	2018-12-07 16:44:16 +01:00
Aurelien Violette	3c12c715c8	Fix object comparison on testing.	2018-12-03 12:55:30 +01:00
Aurelien Violette	e802cfa8eb	In case of the use of a different locale, the price tag can be formatted with ',' instead of '.', leading to an NPE. The NPE is generated because parseFloat returns a Float object that can be null in case of a NumberFormatException, but VideoPrice accepts only float. To bypass this issue and avoid recurring errors, I've changed the VideoPrice price field to a Float object, which accepts null in that case. It is far from ideal, and parseFloat would benefit from being able to parse different locale formatting. Anyway, as a first quick fix, this allows the rest of the file to be parsed, whereas the previous error caused the whole file to fail while parsing.	2018-12-03 12:44:15 +01:00
Sebastian Nagel	862af9416f	Sitemap extension support - add extension support to SiteMapTester - list extension attributes in SiteMapURL.toString() - update change log	2018-09-28 12:14:02 +02:00
Sebastian Nagel	66745b29a1	Add support for the Mobile sitemap extension	2018-09-28 12:05:07 +02:00
Sebastian Nagel	b924bd0828	Sitemap extension support - optionally parse elements in the namespace of sitemap extensions: - Google video sitemaps (resolves #35) - Google image sitemaps (resolves #36) - Google news sitemaps - alternate links in sitemaps (resolves #149) - the code is taken from Tanguy Moal's (@tuxnco) PR #162 with the following modifications: - port from DOM to SAX parser - keep specific extensions separate from the "core" sitemap classes	2018-09-28 12:04:39 +02:00
Sebastian Nagel	9318de951f	Use the Java 8 date and time API (java.time) to parse dates in sitemaps (#217) Use the Java 8 date and time API (java.time) to parse dates in sitemaps - use thread-safe DateTimeFormatter instead of ThreadLocal<DateFormat> - simplify parsing of RSS publication dates - remove obsolete regex pattern to catch dates with time zone but without seconds (covered by DateTimeFormatter.ISO_OFFSET_DATE_TIME) - extend unit tests Fix Javadoc error and warnings, update change log * Remove obsolete dependency to jaxb-api - import of javax.xml.bind.DatatypeConverter has been removed by updating to Java 8 date and time API	2018-09-24 10:09:58 +01:00
Ken Krugler	54d90deaa4	Fix up formatting	2018-08-01 10:15:42 -07:00
Ken Krugler	d5f614b67a	Fix for URL with only a query string (no path)	2018-07-31 17:10:12 -07:00
Julien Nioche	0da1b8b8b5	Minor changes + applied formatting pre 0.10 release	2018-06-05 11:33:27 +01:00
Sebastian Nagel	d98a3f14cf	Allow for legacy URIs when checking sitemap namespaces (#211) * Allow for legacy URIs when checking sitemap namespaces - e.g., allow legacy namespace URI but ignore URLs from image and video sitemap extensions - resolve relative namespace URIs - add namespace URIs of sitemap extensions (news, images, videos) * Address kkrugler's review comments: - document addition of sitemap namespace required by sitemap protocol specification when calling setStrictNamespace(true) - remove early return on <rss> root element	2018-06-05 11:20:26 +01:00
Sebastian Nagel	0c75e75496	Add main to SimpleRobotRulesParser for testing (#193) * Add main to SimpleRobotRulesParser for testing - implement toString() for robot rules - fix line breaks in comments * Do not detect MIME type as Tika dependency has been removed	2018-06-04 21:26:45 +01:00
Ken Krugler	d99c034dd0	Merge branch 'master' into issue-134	2018-05-14 11:20:17 -07:00

203 Commits