Detect truncated utf-8 characters at the end of content as still representing utf-8 #19773

Merged
zeripath merged 6 commits into go-gitea:main from the fix-19743-improve-encoding-detection branch on May 21, 2022

Conversation

zeripath
Contributor

Our character detection algorithm can incorrectly detect utf-8 content as iso-8859-x
if the partially read file ends with a truncated character.

This PR changes the detection algorithm to treat content that ends with a truncated
utf-8 character as still being utf-8.
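
To illustrate the idea, here is a minimal sketch of tolerating a truncated trailing character before validating a buffer as utf-8. It is not the code from this PR: `trimTruncatedUTF8` and `looksLikeUTF8` are hypothetical names chosen for this example, and it relies only on the standard `unicode/utf8` package.

```go
package main

import "unicode/utf8"

// trimTruncatedUTF8 (hypothetical helper, not part of Gitea) returns buf
// without a trailing incomplete UTF-8 sequence. An incomplete sequence is
// at most utf8.UTFMax-1 bytes long, so only the last three bytes are checked.
func trimTruncatedUTF8(buf []byte) []byte {
	for i := 1; i < utf8.UTFMax && i <= len(buf); i++ {
		start := len(buf) - i
		if utf8.RuneStart(buf[start]) {
			// buf[start] begins the final sequence. Drop that sequence
			// only if it was cut off before all of its bytes arrived.
			if !utf8.FullRune(buf[start:]) {
				return buf[:start]
			}
			break
		}
	}
	return buf
}

// looksLikeUTF8 (hypothetical) reports whether buf is valid UTF-8 once a
// truncated final character, if any, is ignored.
func looksLikeUTF8(buf []byte) bool {
	return utf8.Valid(trimTruncatedUTF8(buf))
}
```

With this sketch, the 11-byte prefix of `"Привет"` (which cuts the final two-byte character in half) is still reported as utf-8, while a plain `utf8.Valid` call on the same bytes would fail. One consequence of the approach is that a buffer in another encoding that ends with a lone lead byte has that byte ignored too, so the rest of the buffer alone decides the result.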

Fix #19743

Signed-off-by: Andrew Thornton art27@cantab.net

@zeripath
Contributor Author

zeripath commented May 21, 2022

Strangely I'm having some difficulty creating a test that replicates this issue from within the charset module.

I'm not certain as to what's going on that means that I can't replicate this.


I've been able to add a testcase.

GiteaBot added the "lgtm/need 2" label (This PR needs two approvals by maintainers to be considered for merging.) May 21, 2022
@wxiaoguang
Contributor

See my comment in #19743, there is a test case.

GiteaBot added the "lgtm/need 1" label (This PR needs approval from one additional maintainer to be merged.) and removed the "lgtm/need 2" label May 21, 2022
@codecov-commenter

codecov-commenter commented May 21, 2022

Codecov Report

❗ No coverage uploaded for pull request base (main@876cad0).
The diff coverage is 85.00%.

❗ Current head 72a0a45 differs from pull request most recent head 928f95d. Consider uploading reports for the commit 928f95d to get more accurate results

@@           Coverage Diff           @@
##             main   #19773   +/-   ##
=======================================
  Coverage        ?   47.29%           
=======================================
  Files           ?      957           
  Lines           ?   133317           
  Branches        ?        0           
=======================================
  Hits            ?    63058           
  Misses          ?    62599           
  Partials        ?     7660           

| Impacted Files | Coverage Δ |
|---|---|
| modules/charset/charset.go | 71.73% <85.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lunny
Member

lunny commented May 21, 2022

#19743

It's better to include this test case.

@zeripath
Contributor Author

> #19743
>
> It's better to include this test case.

I've already included a specific test case but I can add that if you really want it.

GiteaBot added the "lgtm/done" label (This PR has enough approvals to get merged. There are no important open reservations anymore.) and removed the "lgtm/need 1" label May 21, 2022
@zeripath zeripath merged commit bc4764f into go-gitea:main May 21, 2022
@zeripath zeripath deleted the fix-19743-improve-encoding-detection branch May 21, 2022 13:06
zeripath added a commit to zeripath/gitea that referenced this pull request May 21, 2022
…esenting utf-8 (go-gitea#19773)

Backport go-gitea#19773

lunny pushed a commit that referenced this pull request May 21, 2022
…esenting utf-8 (#19773) (#19774)

Backport #19773

zjjhot added a commit to zjjhot/gitea that referenced this pull request May 21, 2022
* giteaofficial/main:
  Prevent NPE when cache service is disabled (go-gitea#19703)
  Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773)
  Add silentcodeg to MAINTAINERS (go-gitea#19771)
  Allows repo search to match against "owner/repo" pattern strings (go-gitea#19754)
  Update JS dependencies (go-gitea#19767)
  Nuke the incorrect permission report on /api/v1/notifications (go-gitea#19761)
zeripath added a commit to zeripath/gitea that referenced this pull request Jun 20, 2022
## [1.16.9](https://github.com/go-gitea/gitea/releases/tag/1.16.9) - 2022-06-20

* BUGFIXES
  * Fix permission check for delete tag (go-gitea#19985) (go-gitea#20001)
  * Only log non ErrNotExist errors in git.GetNote (go-gitea#19884) (go-gitea#19905)
  * Use exact search instead of fuzzy search for branch filter dropdown (go-gitea#19885) (go-gitea#19893)
  * Set Setpgid on child git processes (go-gitea#19865) (go-gitea#19881)
  * Import git from alpine 3.16 repository as 2.30.4 is needed for `safe.directory = '*'` to work but alpine 3.13 has 2.30.3 (go-gitea#19876)
  * Ensure responses are context.ResponseWriters (go-gitea#19843) (go-gitea#19859)
  * Fix count bug (go-gitea#19850)
  * Fix raw endpoint PDF file headers (go-gitea#19825) (go-gitea#19826)
  * Make WIP prefixes case insensitive, e.g. allow `Draft` as a WIP prefix (go-gitea#19780) (go-gitea#19811)
  * Fix NotificationUnreadCount (go-gitea#19802)
  * Prevent NPE when cache service is disabled (go-gitea#19703) (go-gitea#19783)
  * Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773) (go-gitea#19774)
  * Fix doctor pq: syntax error at or near "." quote user table name (go-gitea#19765) (go-gitea#19770)
  * Fix bug (go-gitea#19757)

Signed-off-by: Andrew Thornton <art27@cantab.net>
@zeripath zeripath mentioned this pull request Jun 20, 2022
AbdulrhmnGhanem pushed a commit to kitspace/gitea that referenced this pull request Aug 24, 2022
…esenting utf-8 (go-gitea#19773)

@wxiaoguang wxiaoguang mentioned this pull request Sep 24, 2022
1 task
@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. type/bug
Development

Successfully merging this pull request may close these issues.

Wrong display of cyrillic symbols in UTF-8 file
5 participants