Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some common characters should not be treaded as ambiguous Unicode characters #20999

Closed
wxiaoguang opened this issue Aug 30, 2022 · 8 comments · Fixed by #22016
Closed

Some common characters should not be treaded as ambiguous Unicode characters #20999

wxiaoguang opened this issue Aug 30, 2022 · 8 comments · Fixed by #22016
Labels
topic/ui Change the appearance of the Gitea UI type/bug
Milestone

Comments

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Aug 30, 2022

Gitea 1.18-dev

https://try.gitea.io/wxiaoguang/test/src/branch/master/test-chars.md

Some common characters should not be treaded as ambiguous Unicode characters.

Many CJK punctuations are quite common in daily usage, they should not be marked as ambiguous character.

Otherwise the misleading warning appears on every page which contains CJK texts.

image

@lunny lunny added the topic/ui Change the appearance of the Gitea UI label Aug 30, 2022
@wxiaoguang wxiaoguang added this to the 1.18.0 milestone Oct 7, 2022
@lunny lunny modified the milestones: 1.18.0, 1.19.0 Oct 26, 2022
@lunny
Copy link
Member

lunny commented Nov 27, 2022

At least these Chinese characters are normal and should not be warning.

@zeripath
Copy link
Contributor

zeripath commented Nov 28, 2022

If your locale is zh-CN these will not be shown as ambiguous. Further if you look carefully, it's only the when the characters (e.g. (FULL-WIDTH COMMA)) are next to English words/latin text that they're shown as ambiguous. That's because the ambiguous detection is looking locally at the specific words and not the whole sentence - it has to do this in order to prevent misbehaviour elsewhere.

@zeripath
Copy link
Contributor

I think what we should do is only show ambiguous warnings on Source Code and not on rendered files - as per VSCode.

@zeripath
Copy link
Contributor

At least these Chinese characters are normal and should not be warning.

Only if you're using them in Chinese text - they're very definitely ambiguous otherwise. The difficulty is always what is "Chinese" or "English" text. The algorithm - which is the same as that used in vscode - does this on a word by word basis. One could try to come up with some heuristic to determine whether a paragraph as a whole is some language or other - but that's almost certainly still an open problem in NLP.

The best solution I suspect is just drop the warning on Rendered pages, default to unescaped and allow people to escape if they are interested.

@wxiaoguang
Copy link
Contributor Author

If your locale is zh-CN these will not be shown as ambiguous.

It still shows.

https://gitea.com/xorm/xorm/src/branch/master/README_CN.md?lang=zh-CN

image

@silverwind
Copy link
Member

silverwind commented Dec 1, 2022

Event gitea's own README is marked as containing "ambiguous Unicode characters" because of the pronouncation characters. I think it's silly that we even mark these and I think the set of matched characters should be reduced to the absolute minimum that may be actually malicious, e.g. non-standard whitespace, text reversal and such.

image

@zeripath
Copy link
Contributor

zeripath commented Dec 3, 2022

I think the only answer is to not render the warning on rendered pages.

This is a simple 3 line diff:

diff --git a/templates/repo/view_file.tmpl b/templates/repo/view_file.tmpl
index 0fe0a1319..9d82cc018 100644
--- a/templates/repo/view_file.tmpl
+++ b/templates/repo/view_file.tmpl
@@ -58,7 +58,9 @@
 		</div>
 	</h4>
 	<div class="ui attached table unstackable segment">
-		{{template "repo/unicode_escape_prompt" dict "EscapeStatus" .EscapeStatus "root" $}}
+		{{if not (or .IsMarkup .IsRenderedHTML)}}
+			{{template "repo/unicode_escape_prompt" dict "EscapeStatus" .EscapeStatus "root" $}}
+		{{end}}
 		<div class="file-view{{if .IsMarkup}} markup {{.MarkupType}}{{else if .IsRenderedHTML}} plain-text{{else if .IsTextSource}} code-view{{end}}">
 			{{if .IsMarkup}}
 				{{if .FileContent}}{{.FileContent | Safe}}{{end}}

@zeripath
Copy link
Contributor

zeripath commented Dec 3, 2022

If your locale is zh-CN these will not be shown as ambiguous.

It still shows.

OK well that's a bug. Looking at ambiguous_gen the locale exceptions there are looking for zh-hant/zh-hans so we'll need to add a mapping from zh-CN to zh-hans

zeripath added a commit to zeripath/gitea that referenced this issue Dec 3, 2022
The real sensitivity of ambiguous characters is in source code - therefore warning about
them in rendered pages causes too many warnings. Therefore simply remove the warning
on rendered pages.

The escape button will remain available and it is present on the view source page.

Fix go-gitea#20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit to zeripath/gitea that referenced this issue Dec 3, 2022
Although there are per-locale fallbacks for ambiguity the locale names for Chinese do not
quite match our locales. This PR simply maps zh-CN on to zh-hans and other zh variants on to
zh-hant.

Ref go-gitea#20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
lunny pushed a commit that referenced this issue Dec 3, 2022
The real sensitivity of ambiguous characters is in source code -
therefore warning about them in rendered pages causes too many warnings.
Therefore simply remove the warning on rendered pages.

The escape button will remain available and it is present on the view
source page.

Fix #20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit to zeripath/gitea that referenced this issue Dec 3, 2022
…2016)

Backport go-gitea#22016

The real sensitivity of ambiguous characters is in source code -
therefore warning about them in rendered pages causes too many warnings.
Therefore simply remove the warning on rendered pages.

The escape button will remain available and it is present on the view
source page.

Fix go-gitea#20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit that referenced this issue Dec 4, 2022
…22018)

Backport #22016

The real sensitivity of ambiguous characters is in source code -
therefore warning about them in rendered pages causes too many warnings.
Therefore simply remove the warning on rendered pages.

The escape button will remain available and it is present on the view
source page.

Fix #20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit that referenced this issue Dec 4, 2022
…se (#22019)

Although there are per-locale fallbacks for ambiguity the locale names
for Chinese do not quite match our locales. This PR simply maps zh-CN on
to zh-hans and other zh variants on to zh-hant.

Ref #20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit to zeripath/gitea that referenced this issue Dec 4, 2022
…se (go-gitea#22019)

Backport go-gitea#22019

Although there are per-locale fallbacks for ambiguity the locale names
for Chinese do not quite match our locales. This PR simply maps zh-CN on
to zh-hans and other zh variants on to zh-hant.

Ref go-gitea#20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
lunny pushed a commit that referenced this issue Dec 5, 2022
…se (#22019) (#22030)

Backport #22019

Although there are per-locale fallbacks for ambiguity the locale names
for Chinese do not quite match our locales. This PR simply maps zh-CN on
to zh-hans and other zh variants on to zh-hant.

Ref #20999

Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: Lauris BH <lauris@nix.lv>
@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
topic/ui Change the appearance of the Gitea UI type/bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants