mirror of https://github.com/crawler-commons/crawler-commons synced 2024-05-18 01:56:06 +02:00
crawler-commons/src/test/resources/normalizer/weirdToNormalizedUrls.csv
Sebastian Nagel e5563c3049 [BasicNormalizer] Query parameters normalization in BasicURLNormalizer,
closes #308
- add unit test to prove that an empty query is removed
2023-06-13 09:59:07 +02:00

# Weird URL, Normalized URL
http://foo.com/%66oo.html,http://foo.com/foo.html
http://foo.com/%66oo.htm%6c,http://foo.com/foo.html
http://foo.com/%66oo.ht%6dl,http://foo.com/foo.html
http://foo.com/%66oo.ht%6d%6c,http://foo.com/foo.html
http://foo.com/%66oo.htm%C0,http://foo.com/foo.htm%C0
http://foo.com/%66oo.htm%1A,http://foo.com/foo.htm%1A
http://foo.com/%66oo.htm%c0,http://foo.com/foo.htm%C0
https://www.example.com/search/%2a/,https://www.example.com/search/%2A/
https://www.example.com/topic/9%2f11/,https://www.example.com/topic/9%2F11/
http://foo.com/you%20too.html,http://foo.com/you%20too.html
http://foo.com/you too.html,http://foo.com/you%20too.html
http://foo.com/file.html%23cz,http://foo.com/file.html%23cz
http://foo.com/fast/dir%2fcz,http://foo.com/fast/dir%2Fcz
http://foo.com/!,http://foo.com/%1A!
http://foo.com/!,http://foo.com/%01!
http://mydomain.com/en Español.aspx,http://mydomain.com/en%20Espa%C3%B1ol.aspx
http://x.com/s?m=10&q=a%26b,http://x.com/s?m=10&q=a%26b
http://x.com/show?http%3A%2F%2Fx.com%2Fb,http://x.com/show?http%3A%2F%2Fx.com%2Fb
http://google.com/search?q=c%2B%2B,http://google.com/search?q=c%2B%2B
http://x.com/s?q=a+b,http://x.com/s?q=a+b
http://bücher.de/,http://xn--bcher-kva.de/
http://êxample.com,http://xn--xample-hva.com/
https://нэб.рф/,https://xn--90ax2c.xn--p1ai/
https://www.0251-sachverst%c3%a4ndiger.de/,https://www.xn--0251-sachverstndiger-ozb.de/
http://x.com/./a/../%66.html,http://x.com/f.html
http://x.com/?x[y]=1,http://x.com/?x%5By%5D=1
http://x.com/foo€,http://x.com/foo%C2%80
http://x.com/foo%c2%80,http://x.com/foo%C2%80
"http://foo.com/ ",http://foo.com/
http://foo.com/,http://foo.com/
http://Foo.Com/index.html,http://foo.com/index.html
http://Foo.Com/index.html,http://foo.com/index.html
https://example%2Ecom/,https://example.com/
http://foo.com:80/index.html,http://foo.com/index.html
http://foo.com:81/,http://foo.com:81/
http://example.com:/,http://example.com/
https://example.com:/foobar.html,https://example.com/foobar.html
http://foo.com,http://foo.com/
http://foo.com/foo.html#ref,http://foo.com/foo.html
http://foo.com/%66oo.html,http://foo.com/foo.html
http://foo.com/aa/./foo.html,http://foo.com/aa/foo.html
http://foo.com/aa/../,http://foo.com/
http://foo.com/aa/bb/../,http://foo.com/aa/
http://foo.com/aa/..,http://foo.com/
http://foo.com/aa/bb/cc/../../foo.html,http://foo.com/aa/foo.html
http://foo.com/aa/bb/../cc/dd/../ee/foo.html,http://foo.com/aa/cc/ee/foo.html
http://foo.com/../foo.html,http://foo.com/foo.html
http://foo.com/../../foo.html,http://foo.com/foo.html
http://foo.com/../aa/../foo.html,http://foo.com/foo.html
http://foo.com/aa/../../foo.html,http://foo.com/foo.html
http://foo.com/aa/../bb/../foo.html/../../,http://foo.com/
http://foo.com/../aa/foo.html,http://foo.com/aa/foo.html
http://foo.com/../aa/../foo.html,http://foo.com/foo.html
http://foo.com/a..a/foo.html,http://foo.com/a..a/foo.html
http://foo.com/a..a/../foo.html,http://foo.com/foo.html
http://foo.com/foo.foo/../foo.html,http://foo.com/foo.html
http://foo.com//aa/bb/foo.html,http://foo.com/aa/bb/foo.html
http://foo.com/aa//bb/foo.html,http://foo.com/aa/bb/foo.html
http://foo.com/aa/bb//foo.html,http://foo.com/aa/bb/foo.html
http://foo.com//aa//bb//foo.html,http://foo.com/aa/bb/foo.html
http://foo.com////aa////bb//foo.html,http://foo.com/aa/bb/foo.html
http://foo.com////aa////bb////foo.html,http://foo.com/aa/bb/foo.html
http://foo.com/aa?referer=http://bar.com,http://foo.com/aa?referer=http://bar.com
http://foo.com/..,http://foo.com/
file:///foo/bar.txt,file:///foo/bar.txt
ftp:/,ftp:/
http:,http:/
http:////,http:/
http:///////,http:/
http://example.com?,http://example.com/
http://example.com?a=1,http://example.com/?a=1
http://example.com/?,http://example.com/
https://www.last.fm/music/Prefuse+73/_/90%+of+My+Mind+Is+With+You,https://www.last.fm/music/Prefuse+73/_/90%25+of+My+Mind+Is+With+You
"http://foo.com/{{stuff}} ",http://foo.com/%7B%7Bstuff%7D%7D
"http://www.example.com/a/c/../b/search?q=foobar""",http://www.example.com/a/b/search?q=foobar%22
http://www.example.com/a/c/../b/search?q=foobar%,http://www.example.com/a/b/search?q=foobar%25
http://www.example.com/a/c/../b/search?q=foobar<,http://www.example.com/a/b/search?q=foobar%3C
http://www.example.com/a/c/../b/search?q=foobar>,http://www.example.com/a/b/search?q=foobar%3E
http://www.example.com/a/c/../b/search?q=foobar^,http://www.example.com/a/b/search?q=foobar%5E
http://www.example.com/a/c/../b/search?q=foobar`,http://www.example.com/a/b/search?q=foobar%60
http://www.example.com/a/c/../b/search?q=foobar|,http://www.example.com/a/b/search?q=foobar%7C
http://www.example.com/p%zz%77%v,http://www.example.com/p%25zzw%25v
http://www.example.com/search?q=foobar%,http://www.example.com/search?q=foobar%25
http://www.example.com/search?q=foobar%2,http://www.example.com/search?q=foobar%252
http://www.example.com/search?q=foobar%25,http://www.example.com/search?q=foobar%25
http://www.example.com/search?q=foobar%252,http://www.example.com/search?q=foobar%252
HTTP://foo.com/,http://foo.com/
# no protocol/scheme, see #271
foo.com/index.html,http://foo.com/index.html
ftp://foo.com/index.html,ftp://foo.com/index.html
file:/path/index.html,file:/path/index.html
https://www.example.org./,https://www.example.org/
file:/var/www/html/////./bar/index.html,file:/var/www/html/bar/index.html
file:/var/www/html/foo/../bar/index.html,file:/var/www/html/bar/index.html
http://example.com/?b=1&a=1,http://example.com/?a=1&b=1
http://foo.com/foo.html?b=1&a=1,http://foo.com/foo.html?a=1&b=1
http://foo.com/index?a=1&b=2,http://foo.com/index?a=1&b=2
http://foo.com/index?b=2&a=1,http://foo.com/index?a=1&b=2
http://foo.com/index?b=2&a=1#c,http://foo.com/index?a=1&b=2
https://foo.com/search?q=tl;dr,https://foo.com/search?q=tl;dr
http://foo.com/index?a=1&b,http://foo.com/index?a=1&b
http://foo.com/index?a=1&b=,http://foo.com/index?a=1&b
http://foo.com/index?a=1&b#c,http://foo.com/index?a=1&b
http://foo.com/index?b&a=1,http://foo.com/index?a=1&b
http://foo.com/index?b=&a=1,http://foo.com/index?a=1&b
http://foo.com/index?b=1&a=1&,http://foo.com/index?a=1&b=1
http://foo.com/index?&b=1&a=1,http://foo.com/index?a=1&b=1
http://foo.com/index?&=1&a=1,http://foo.com/index?a=1
http://foo.com/index?=1&b=1&a=1,http://foo.com/index?a=1&b=1
http://example.com/?,http://example.com/
https://foo.com/?one/valid_query/without_%2F_params,https://foo.com/?one/valid_query/without_%2F_params
http://foo.com/asdf/page.php?article%2F1234,http://foo.com/asdf/page.php?article%2F1234
https://www.example.com/path/file-with-a-*.html,https://www.example.com/path/file-with-a-*.html
https://www.example.com/path/foo-$,https://www.example.com/path/foo-$
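
The query-parameter rows in this file (for example `?b=2&a=1#c` becoming `?a=1&b=2`, `?a=1&b=` becoming `?a=1&b`, and a bare `?` being dropped) can be illustrated with a short Python sketch. This is not the BasicURLNormalizer code: `normalize_query` and `normalize_url_query` are hypothetical helper names, and the rules (sort by parameter name, drop parameters with an empty name, drop a trailing `=` on empty values, strip the fragment) are inferred only from the expected values in the rows. The sketch deliberately ignores percent-encoding, host, and path normalization.

```python
def normalize_query(query: str) -> str:
    """Normalize a raw query string (the part after '?', without the fragment).

    Assumed rules, inferred from the fixture rows only:
    - split on '&', then split each part on the first '='
    - drop parts whose name is empty ('', '=1')
    - sort by parameter name (stable, so equal names keep their order)
    - write 'name' instead of 'name=' when the value is empty
    """
    params = []
    for part in query.split("&"):
        name, _sep, value = part.partition("=")
        if not name:
            continue  # covers '', '=' and '=1'
        params.append((name, value))
    params.sort(key=lambda p: p[0])
    return "&".join(n if v == "" else f"{n}={v}" for n, v in params)


def normalize_url_query(url: str) -> str:
    """Apply normalize_query to a full URL string and drop the fragment.

    Only the query and fragment are touched; everything before '?' is
    passed through unchanged.
    """
    without_fragment, _, _fragment = url.partition("#")
    base, _, query = without_fragment.partition("?")
    query = normalize_query(query)
    return f"{base}?{query}" if query else base


# A few (weird, normalized) pairs copied from the query rows of this file.
ROWS = [
    ("http://foo.com/index?b=2&a=1#c", "http://foo.com/index?a=1&b=2"),
    ("http://foo.com/index?a=1&b=", "http://foo.com/index?a=1&b"),
    ("http://foo.com/index?&=1&a=1", "http://foo.com/index?a=1"),
    ("http://example.com/?", "http://example.com/"),
    ("http://foo.com/asdf/page.php?article%2F1234",
     "http://foo.com/asdf/page.php?article%2F1234"),
]


def mismatches(rows):
    """Return the rows for which the sketch disagrees with the fixture."""
    return [(w, e, normalize_url_query(w))
            for w, e in rows if normalize_url_query(w) != e]


if __name__ == "__main__":
    print(mismatches(ROWS))
```

Sorting by name alone is a design guess; the rows do not show how two parameters with the same name but different values should be ordered.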