Detect truncated utf-8 characters at the end of content as still representing utf-8 #19773

zeripath · 2022-05-21T11:14:39Z

Our character detection algorithm can incorrectly detect UTF-8 content as ISO-8859-x
if the partially read file ends in a truncated character.

This PR changes the detection algorithm to treat truncated UTF-8 characters at the end of the
buffer as still representing UTF-8.
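
For illustration only, here is a minimal sketch of the idea, not the code from this PR. The helper names `trimIncompleteTail` and `looksLikeUTF8` are hypothetical; only the standard `unicode/utf8` package is assumed. The sketch drops an incomplete multi-byte sequence from the end of the buffer before validating the rest.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trimIncompleteTail is a hypothetical helper (not Gitea's implementation).
// It returns buf without a trailing UTF-8 sequence that was cut off by a
// partial read. Complete or plainly invalid tails are left untouched.
func trimIncompleteTail(buf []byte) []byte {
	// A UTF-8 sequence is at most utf8.UTFMax (4) bytes long, so an
	// incomplete one can occupy at most the last 3 bytes of the buffer.
	for i := 1; i < utf8.UTFMax && i <= len(buf); i++ {
		start := len(buf) - i
		b := buf[start]
		if !utf8.RuneStart(b) {
			continue // continuation byte: keep scanning backwards
		}
		// b begins a rune. Trim only if it is a multi-byte lead whose
		// sequence runs past the end of the buffer.
		if b >= utf8.RuneSelf && !utf8.FullRune(buf[start:]) {
			return buf[:start]
		}
		return buf
	}
	return buf
}

// looksLikeUTF8 (also hypothetical) reports whether buf is valid UTF-8
// once a truncated final character has been ignored.
func looksLikeUTF8(buf []byte) bool {
	return utf8.Valid(trimIncompleteTail(buf))
}

func main() {
	full := []byte("你好")
	cut := full[:len(full)-1] // simulate a read that stops mid-character

	fmt.Println(utf8.Valid(cut))    // strict validation of the cut buffer
	fmt.Println(looksLikeUTF8(cut)) // validation that tolerates the cut tail

	// ISO-8859-1 bytes with a high byte in the middle are still rejected.
	fmt.Println(looksLikeUTF8([]byte("caf\xe9 au lait")))
}
```

One limitation of this sketch: a buffer whose very last byte happens to look like a UTF-8 lead byte is ambiguous, so a real detector would still need its other heuristics for that case.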

Fix #19743

Signed-off-by: Andrew Thornton <art27@cantab.net>


zeripath · 2022-05-21T11:15:52Z

Strangely, I'm having some difficulty creating a test that replicates this issue from within the charset module.

I'm not certain what is preventing me from replicating it.

Update: I've been able to add a testcase.

wxiaoguang · 2022-05-21T11:19:57Z

See my comment in #19743, there is a test case.


codecov-commenter · 2022-05-21T11:45:46Z

Codecov Report

❗ No coverage uploaded for pull request base (main@876cad0).
The diff coverage is 85.00%.

❗ Current head 72a0a45 differs from pull request most recent head 928f95d. Consider uploading reports for the commit 928f95d to get more accurate results

@@           Coverage Diff           @@
##             main   #19773   +/-   ##
=======================================
  Coverage        ?   47.29%           
=======================================
  Files           ?      957           
  Lines           ?   133317           
  Branches        ?        0           
=======================================
  Hits            ?    63058           
  Misses          ?    62599           
  Partials        ?     7660

Impacted Files	Coverage Δ
modules/charset/charset.go	`71.73% <85.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 876cad0...928f95d.

lunny · 2022-05-21T12:09:11Z

#19743

It's better to include this test case.

zeripath · 2022-05-21T12:19:38Z

> #19743
>
> It's better to include this test case.

I've already included a specific test case, but I can add that one if you really want it.



* giteaofficial/main:
  Prevent NPE when cache service is disabled (go-gitea#19703)
  Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773)
  Add silentcodeg to MAINTAINERS (go-gitea#19771)
  Allows repo search to match against "owner/repo" pattern strings (go-gitea#19754)
  Update JS dependencies (go-gitea#19767)
  Nuke the incorrect permission report on /api/v1/notifications (go-gitea#19761)

## [1.16.9](https://github.com/go-gitea/gitea/releases/tag/1.16.9) - 2022-06-20

* BUGFIXES
  * Fix permission check for delete tag (go-gitea#19985) (go-gitea#20001)
  * Only log non ErrNotExist errors in git.GetNote (go-gitea#19884) (go-gitea#19905)
  * Use exact search instead of fuzzy search for branch filter dropdown (go-gitea#19885) (go-gitea#19893)
  * Set Setpgid on child git processes (go-gitea#19865) (go-gitea#19881)
  * Import git from alpine 3.16 repository as 2.30.4 is needed for `safe.directory = '*'` to work but alpine 3.13 has 2.30.3 (go-gitea#19876)
  * Ensure responses are context.ResponseWriters (go-gitea#19843) (go-gitea#19859)
  * Fix count bug (go-gitea#19850)
  * Fix raw endpoint PDF file headers (go-gitea#19825) (go-gitea#19826)
  * Make WIP prefixes case insensitive, e.g. allow `Draft` as a WIP prefix (go-gitea#19780) (go-gitea#19811)
  * Fix NotificationUnreadCount (go-gitea#19802)
  * Prevent NPE when cache service is disabled (go-gitea#19703) (go-gitea#19783)
  * Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773) (go-gitea#19774)
  * Fix doctor pq: syntax error at or near "." quote user table name (go-gitea#19765) (go-gitea#19770)
  * Fix bug (go-gitea#19757)

Signed-off-by: Andrew Thornton <art27@cantab.net>


zeripath added type/bug backport/v1.16 labels May 21, 2022

zeripath added this to the 1.17.0 milestone May 21, 2022

GiteaBot added the lgtm/need 2 label (This PR needs two approvals by maintainers to be considered for merging.) May 21, 2022

wxiaoguang approved these changes May 21, 2022


GiteaBot added the lgtm/need 1 label (This PR needs approval from one additional maintainer to be merged.) and removed the lgtm/need 2 label (This PR needs two approvals by maintainers to be considered for merging.) May 21, 2022

zeripath added 2 commits May 21, 2022 12:43

Add testcase

72a0a45

Signed-off-by: Andrew Thornton <art27@cantab.net>

Merge branch 'main' into fix-19743-improve-encoding-detection

928f95d

zeripath added 3 commits May 21, 2022 13:20

as per lunny

ec56d54

Signed-off-by: Andrew Thornton <art27@cantab.net>

Merge branch 'main' into fix-19743-improve-encoding-detection

aaaedea

Merge remote-tracking branch 'zeripath/fix-19743-improve-encoding-det…

1169025

…ection' into fix-19743-improve-encoding-detection

lunny approved these changes May 21, 2022


GiteaBot added the lgtm/done label (This PR has enough approvals to get merged. There are no important open reservations anymore.) and removed the lgtm/need 1 label (This PR needs approval from one additional maintainer to be merged.) May 21, 2022

zeripath merged commit bc4764f into go-gitea:main May 21, 2022

zeripath deleted the fix-19743-improve-encoding-detection branch May 21, 2022 13:06

zeripath mentioned this pull request May 21, 2022

Detect truncated utf-8 characters at the end of content as still representing utf-8 (#19773) #19774

Merged

zeripath mentioned this pull request Jun 20, 2022

Changelog for 1.16.9 #20059

Merged

wxiaoguang mentioned this pull request Sep 24, 2022

file encoding detection bug #14434

Closed

1 task

go-gitea locked and limited conversation to collaborators May 3, 2023
