Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong display of cyrillic symbols in UTF-8 file #19743

Closed
sIspravnikov opened this issue May 18, 2022 · 10 comments · Fixed by #19773
Closed

Wrong display of cyrillic symbols in UTF-8 file #19743

sIspravnikov opened this issue May 18, 2022 · 10 comments · Fixed by #19773
Labels

Comments

@sIspravnikov
Copy link

sIspravnikov commented May 18, 2022

Description

I have a file in UTF-8 encoding with cyrillic comment.
When i open this file in gitea web view, display of cyrillic symbols seems broken.
But in gitea file editor, diff page and others, this symbols displays correct.

Gitea Version

1.16.7

Can you reproduce the bug on the Gitea demo site?

Yes

Log Gist

No response

Screenshots

View file:

1
Edit file:

2
Diff changes in file:

3

reproduced it also on demo site
https://try.gitea.io/sIspravnikov/test/src/branch/main/build.gradle.kts

Git Version

2.30.3

Operating System

ubuntu 20.04

How are you running Gitea?

official docker-container

Database

PostgreSQL

@silverwind
Copy link
Member

File displays fine in raw view for me, so this is definitely a bug. Also it triggers the "hidden unicode" incorrectly. Maybe something for @zeripath to check out.

@ashimokawa
Copy link
Contributor

This seems to be a bug of the code formatter.

Can reproduce on codeberg.org also:

https://codeberg.org/test/test/src/commit/d94df248a7937afc16bb6d3c98cf17a8b7862ff0/build.gradle.kts

but when I delete everything except the function with the cyrillic characters they suddenly look correct.

https://codeberg.org/test/test/src/branch/main/build.gradle.kts

@sIspravnikov
Copy link
Author

In issue #14434 was an idea, about chardet buffer size 1024 bytes, so i made a branch on demosite, where i placed cyrillic comment in beginning of the file:
https://try.gitea.io/sIspravnikov/test/commit/0682cf1ba2862f1a39e734ecb6fa6a7bcdaa57fb
and view page now display everything correct:
https://try.gitea.io/sIspravnikov/test/src/branch/test/build.gradle.kts
And @ashimokawa have done simillar, when everything, except the function with the cyrillic characters, was deleted, cyrillic symbols now catched in first 1024 bytes buffer.

So, i think, @lunny idea about chardet looks correct.

@supermangithu

This comment was marked as off-topic.

@wxiaoguang

This comment was marked as off-topic.

@zeripath
Copy link
Contributor

Hi I'm just looking at this.

This looks like a double encoding utf8 problem.

The problem is not the escapecontrolreader - there are test cases to ensure that it's doing what it should do on utf8 code.

The problem will be earlier that that.

@zeripath
Copy link
Contributor

The issue is that the file is being detected as ISO-8859-1.

In fact if you check the debug logs you will see:

2022/05/21 09:36:27 ...s/charset/charset.go:190:DetectEncoding() [D] [6288a48b] Detected encoding: ISO-8859-1

@zeripath
Copy link
Contributor

I bet the reason why the detection is failing is that the 2048th byte is within a utf8 character - and... therefore a slight change in the file would cause the correct rendering.

Is this is a somewhat carefully calculated failing example? That kind of information would have been helpful information to provide - because it would have helped us to immediately understand where the problem was.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented May 21, 2022

This example is the 2048 problem.

func renderFile(ctx *context.Context, entry *git.TreeEntry, treeLink, rawLink string) {
...
	buf := make([]byte, 1024)
	n, _ := util.ReadAtMost(dataRc, buf)
	buf = buf[:n]
...
		rd := charset.ToUTF8WithFallbackReader(io.MultiReader(bytes.NewReader(buf), dataRc))

The ToUTF8WithFallbackReader only reads first 2048 to detect, which may break UTF-8 runes.

image

Then there is a weighted algorithm to decide (guess) which encoding should be used.

Usually there should be no problem. But, with this sample:

image

The top confidence is not UTF-8 here.

The confidence of the result seems related to some statistics-based models (just a guess, haven't look into it), so a strings.Repeat("a" ...) won't trigger the bug, the repeated "a"'s top confidence result is still UTF-8.

@wxiaoguang
Copy link
Contributor

Here is a designed test case how to trigger the bug:

func TestDetectEncodingEx(t *testing.T) {
	for testLen := 0; testLen < 2048; testLen++ {
		pattern := "    test { () }\n"
		input := ""
		for len(input) < testLen {
			input += pattern
		}
		input = input[:testLen]
		input += "// Выключаем"
		rd := ToUTF8WithFallbackReader(bytes.NewReader([]byte(input)))
		r, _ := io.ReadAll(rd)
		assert.EqualValuesf(t, input, string(r), "testing string len=%d", testLen)
	}
}

It will always fail:

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -127,2 +127,2 @@
        	            	     test { () }
        	            	-    te// Выключаем
        	            	+    te// Выключаем
        	Test:       	TestDetectEncodingEx
        	Messages:   	testing string len=2038

zeripath added a commit to zeripath/gitea that referenced this issue May 21, 2022
…esenting utf-8

Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.

This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.

Fix go-gitea#19743

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit that referenced this issue May 21, 2022
…esenting utf-8 (#19773)

Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.

This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.

Fix #19743

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit to zeripath/gitea that referenced this issue May 21, 2022
…esenting utf-8 (go-gitea#19773)

Backport go-gitea#19773

Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.

This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.

Fix go-gitea#19743

Signed-off-by: Andrew Thornton <art27@cantab.net>
lunny pushed a commit that referenced this issue May 21, 2022
…esenting utf-8 (#19773) (#19774)

Backport #19773

Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.

This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.

Fix #19743

Signed-off-by: Andrew Thornton <art27@cantab.net>
AbdulrhmnGhanem pushed a commit to kitspace/gitea that referenced this issue Aug 24, 2022
…esenting utf-8 (go-gitea#19773)

Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.

This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.

Fix go-gitea#19743

Signed-off-by: Andrew Thornton <art27@cantab.net>
@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants