mirror of https://github.com/git/git.git synced 2024-05-05 03:36:31 +02:00

Merge branch 'ls/checkout-encoding'

The new "checkout-encoding" attribute can ask Git to convert the
contents to the specified encoding when checking out to the working
tree (and the other way around when checking in).

* ls/checkout-encoding:
  convert: add round trip check based on 'core.checkRoundtripEncoding'
  convert: add tracing for 'working-tree-encoding' attribute
  convert: check for detectable errors in UTF encodings
  convert: add 'working-tree-encoding' attribute
  utf8: add function to detect a missing UTF-16/32 BOM
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: teach same_encoding() alternative UTF encoding names
  strbuf: add a case insensitive starts_with()
  strbuf: add xstrdup_toupper()
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
Junio C Hamano 2018-05-08 15:59:22 +09:00
commit 1ac0ce4d32
13 changed files with 737 additions and 5 deletions


@@ -530,6 +530,12 @@ core.autocrlf::
This variable can be set to 'input',
in which case no output conversion is performed.
core.checkRoundtripEncoding::
A comma and/or whitespace separated list of encodings that Git
performs UTF-8 round trip checks on if they are used in a
`working-tree-encoding` attribute (see linkgit:gitattributes[5]).
The default value is `SHIFT-JIS`.
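As a rough illustration of how a comma and/or whitespace separated list such as the `core.checkRoundtripEncoding` value can be matched against an encoding name, here is a minimal standalone sketch. `encoding_in_list()` is a hypothetical helper and not part of Git; it splits the list into tokens and compares whole tokens case-insensitively, which is one possible approach and differs from the substring search used by `check_roundtrip()` in convert.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper (not part of Git): return 1 if `name` appears as a
 * whole, case-insensitive token in the comma/whitespace separated `list`,
 * and 0 otherwise.
 */
int encoding_in_list(const char *list, const char *name)
{
	size_t name_len;

	assert(list && name);
	name_len = strlen(name);

	while (*list) {
		const char *start;
		size_t len;

		/* skip any run of separators */
		while (*list && (*list == ',' || isspace((unsigned char)*list)))
			list++;

		/* collect one token */
		start = list;
		while (*list && *list != ',' && !isspace((unsigned char)*list))
			list++;
		len = (size_t)(list - start);

		if (len > 0 && len == name_len && !strncasecmp(start, name, len))
			return 1;
	}
	return 0;
}
```

Because every token is compared in full, a list such as "UTF-16LE, UTF-16" still matches the name "UTF-16" through its second entry, even though the name is also a prefix of the first entry.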
core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and


@@ -279,6 +279,94 @@ few exceptions. Even though...
catch potential problems early, safety triggers.
`working-tree-encoding`
^^^^^^^^^^^^^^^^^^^^^^^
Git recognizes files encoded in ASCII or one of its supersets (e.g.
UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
encodings (e.g. UTF-16) are interpreted as binary and consequently
built-in Git text processing tools (e.g. 'git diff') as well as most Git
web front ends do not visualize the contents of these files by default.
In these cases you can tell Git the encoding of a file in the working
directory with the `working-tree-encoding` attribute. If a file with this
attribute is added to Git, then Git reencodes the content from the
specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
content in its internal data structure (called "the index"). On checkout
the content is reencoded back to the specified encoding.
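To make the check-in direction concrete, the sketch below converts BOM-less UTF-16LE bytes to UTF-8 by hand. It is only an illustration under simplifying assumptions: Git itself delegates the conversion to its `reencode_string_len()` helper (see `encode_to_git()` in convert.c), and `utf16le_to_utf8()` is a hypothetical name introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Git): convert `len` bytes of BOM-less
 * UTF-16LE text in `src` to UTF-8 in `dst`. Returns the number of bytes
 * written, or -1 on malformed input or when fewer than 4 bytes of output
 * space remain before a character is written (a deliberately simple check).
 */
long utf16le_to_utf8(const unsigned char *src, size_t len,
		     unsigned char *dst, size_t cap)
{
	size_t i = 0, o = 0;

	assert(src && dst);
	if (len % 2)
		return -1; /* UTF-16 code units are two bytes each */

	while (i < len) {
		uint32_t cp = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
		i += 2;

		if (cp >= 0xD800 && cp <= 0xDBFF) {
			/* high surrogate: a low surrogate must follow */
			uint32_t lo;
			if (i >= len)
				return -1;
			lo = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
			if (lo < 0xDC00 || lo > 0xDFFF)
				return -1;
			i += 2;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			return -1; /* unpaired low surrogate */
		}

		if (cap - o < 4)
			return -1; /* not enough room for the longest sequence */

		if (cp < 0x80) {
			dst[o++] = (unsigned char)cp;
		} else if (cp < 0x800) {
			dst[o++] = (unsigned char)(0xC0 | (cp >> 6));
			dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
		} else if (cp < 0x10000) {
			dst[o++] = (unsigned char)(0xE0 | (cp >> 12));
			dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
		} else {
			dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
			dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
			dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
		}
	}
	return (long)o;
}
```

The checkout direction is the mirror image: the stored UTF-8 bytes are decoded to code points and written back out in the encoding named by the attribute.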
Please note that using the `working-tree-encoding` attribute may have a
number of pitfalls:
- Alternative Git implementations (e.g. JGit or libgit2) and older Git
versions (as of March 2018) do not support the `working-tree-encoding`
attribute. If you decide to use the `working-tree-encoding` attribute
in your repository, then it is strongly recommended to ensure that all
clients working with the repository support it.
For example, Microsoft Visual Studio resource files (`*.rc`) or
PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
If you declare `*.ps1` files as UTF-16 and you add `foo.ps1` with
a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
stored as UTF-8 internally. A client without `working-tree-encoding`
support will check out `foo.ps1` as a UTF-8 encoded file. This will
typically cause trouble for the users of this file.
If a Git client that does not support the `working-tree-encoding`
attribute adds a new file `bar.ps1`, then `bar.ps1` will be
stored "as-is" internally (in this example probably as UTF-16).
A client with `working-tree-encoding` support will interpret the
internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
That operation will fail and cause an error.
- Reencoding content to non-UTF encodings can cause errors as the
conversion might not be UTF-8 round trip safe. If you suspect your
encoding to not be round trip safe, then add it to
`core.checkRoundtripEncoding` to make Git check the round trip
encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
set) is known to have round trip issues with UTF-8 and is checked by
default.
- Reencoding content requires resources that might slow down certain
Git operations (e.g. 'git checkout' or 'git add').
Use the `working-tree-encoding` attribute only if you cannot store a file
in UTF-8 encoding and if you want Git to be able to process the content
as text.
As an example, use the following attributes if your '*.ps1' files are
UTF-16 encoded with byte order mark (BOM) and you want Git to perform
automatic line ending conversion based on your platform.
------------------------
*.ps1 text working-tree-encoding=UTF-16
------------------------
Use the following attributes if your '*.ps1' files are UTF-16 little
endian encoded without BOM and you want Git to use Windows line endings
in the working directory. Please note, it is highly recommended to
explicitly define the line endings with `eol` if the `working-tree-encoding`
attribute is used to avoid ambiguity.
------------------------
*.ps1 text working-tree-encoding=UTF-16LE eol=CRLF
------------------------
You can get a list of all available encodings on your platform with the
following command:
------------------------
iconv --list
------------------------
If you do not know the encoding of a file, then you can use the `file`
command to guess the encoding:
------------------------
file foo.ps1
------------------------
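When a file starts with a byte order mark, the BOM alone already hints at a suitable attribute value. The following is a minimal sketch of that idea, far cruder than what the `file` command does; `suggest_encoding_from_bom()` is a hypothetical helper and not part of Git. It returns the endianness-neutral names because a file that carries a BOM should be declared as UTF-16 or UTF-32 rather than as one of the LE/BE variants.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): suggest a working-tree-encoding
 * value based on a leading BOM, or return NULL if the data does not start
 * with a UTF-16/32 BOM.
 */
const char *suggest_encoding_from_bom(const unsigned char *data, size_t len)
{
	static const unsigned char utf32_be[] = { 0x00, 0x00, 0xFE, 0xFF };
	static const unsigned char utf32_le[] = { 0xFF, 0xFE, 0x00, 0x00 };
	static const unsigned char utf16_be[] = { 0xFE, 0xFF };
	static const unsigned char utf16_le[] = { 0xFF, 0xFE };

	assert(data || !len);

	/* Test the 4-byte BOMs first: the UTF-32LE BOM starts with the UTF-16LE one. */
	if (len >= 4 && (!memcmp(data, utf32_be, 4) || !memcmp(data, utf32_le, 4)))
		return "UTF-32";
	if (len >= 2 && (!memcmp(data, utf16_be, 2) || !memcmp(data, utf16_le, 2)))
		return "UTF-16";
	return NULL;
}
```

This is a heuristic only: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32, and files without a BOM are not classified at all.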
`ident`
^^^^^^^


@@ -1239,6 +1239,11 @@ static int git_default_core_config(const char *var, const char *value)
return 0;
}
if (!strcmp(var, "core.checkroundtripencoding")) {
check_roundtrip_encoding = xstrdup(value);
return 0;
}
if (!strcmp(var, "core.notesref")) {
notes_ref_name = xstrdup(value);
return 0;

convert.c (276 changes)

@@ -7,6 +7,7 @@
#include "sigchain.h"
#include "pkt-line.h"
#include "sub-process.h"
#include "utf8.h"
/*
* convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,241 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
}
static int validate_encoding(const char *path, const char *enc,
const char *data, size_t len, int die_on_error)
{
/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
if (istarts_with(enc, "UTF")) {
/*
* Check for detectable errors in UTF encodings
*/
if (has_prohibited_utf_bom(enc, data, len)) {
const char *error_msg = _(
"BOM is prohibited in '%s' if encoded as %s");
/*
* This advice is shown for UTF-??BE and UTF-??LE encodings.
* We cut off the last two characters of the encoding name
* to generate the encoding name suitable for BOMs.
*/
const char *advise_msg = _(
"The file '%s' contains a byte order "
"mark (BOM). Please use UTF-%s as "
"working-tree-encoding.");
const char *stripped = NULL;
char *upper = xstrdup_toupper(enc);
upper[strlen(upper)-2] = '\0';
/* Skip "UTF" and then an optional "-" so that "stripped" is never NULL here */
if (skip_prefix(upper, "UTF", &stripped))
skip_prefix(stripped, "-", &stripped);
advise(advise_msg, path, stripped);
free(upper);
if (die_on_error)
die(error_msg, path, enc);
else {
return error(error_msg, path, enc);
}
} else if (is_missing_required_utf_bom(enc, data, len)) {
const char *error_msg = _(
"BOM is required in '%s' if encoded as %s");
const char *advise_msg = _(
"The file '%s' is missing a byte order "
"mark (BOM). Please use UTF-%sBE or UTF-%sLE "
"(depending on the byte order) as "
"working-tree-encoding.");
const char *stripped = NULL;
char *upper = xstrdup_toupper(enc);
/* Skip "UTF" and then an optional "-" so that "stripped" is never NULL here */
if (skip_prefix(upper, "UTF", &stripped))
skip_prefix(stripped, "-", &stripped);
advise(advise_msg, path, stripped, stripped);
free(upper);
if (die_on_error)
die(error_msg, path, enc);
else {
return error(error_msg, path, enc);
}
}
}
return 0;
}
static void trace_encoding(const char *context, const char *path,
const char *encoding, const char *buf, size_t len)
{
static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
struct strbuf trace = STRBUF_INIT;
int i;
strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
for (i = 0; i < len && buf; ++i) {
strbuf_addf(
&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
i,
(unsigned char) buf[i],
(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
((i+1) % 8 && (i+1) < len ? ' ' : '\n')
);
}
strbuf_addchars(&trace, '\n', 1);
trace_strbuf(&coe, &trace);
strbuf_release(&trace);
}
static int check_roundtrip(const char *enc_name)
{
/*
* check_roundtrip_encoding contains a string of comma and/or
* space separated encodings (eg. "UTF-16, ASCII, CP1125").
* Search for the given encoding in that string.
*/
const char *found = strcasestr(check_roundtrip_encoding, enc_name);
const char *next;
int len;
if (!found)
return 0;
next = found + strlen(enc_name);
len = strlen(check_roundtrip_encoding);
return (found && (
/*
* check that the found encoding is at the
* beginning of check_roundtrip_encoding or
* that it is prefixed with a space or comma
*/
found == check_roundtrip_encoding || (
(isspace(found[-1]) || found[-1] == ',')
)
) && (
/*
* check that the found encoding is at the
* end of check_roundtrip_encoding or
* that it is suffixed with a space or comma
*/
next == check_roundtrip_encoding + len || (
next < check_roundtrip_encoding + len &&
(isspace(next[0]) || next[0] == ',')
)
));
}
static const char *default_encoding = "UTF-8";
static int encode_to_git(const char *path, const char *src, size_t src_len,
struct strbuf *buf, const char *enc, int conv_flags)
{
char *dst;
int dst_len;
int die_on_error = conv_flags & CONV_WRITE_OBJECT;
/*
* No encoding is specified or there is nothing to encode.
* Tell the caller that the content was not modified.
*/
if (!enc || (src && !src_len))
return 0;
/*
* Looks like we got called from "would_convert_to_git()".
* This means Git wants to know if it would encode (= modify!)
* the content. Let's answer with "yes", since an encoding was
* specified.
*/
if (!buf && !src)
return 1;
if (validate_encoding(path, enc, src, src_len, die_on_error))
return 0;
trace_encoding("source", path, enc, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc,
&dst_len);
if (!dst) {
/*
* We could add the blob "as-is" to Git. However, on checkout
* we would try to reencode to the original encoding. This
* would fail and we would leave the user with a messed-up
* working tree. Let's try to avoid this by screaming loud.
*/
const char* msg = _("failed to encode '%s' from %s to %s");
if (die_on_error)
die(msg, path, enc, default_encoding);
else {
error(msg, path, enc, default_encoding);
return 0;
}
}
trace_encoding("destination", path, default_encoding, dst, dst_len);
/*
* UTF supports lossless conversion round tripping [1] and conversions
* between UTF and other encodings are mostly round trip safe as
* Unicode aims to be a superset of all other character encodings.
* However, certain encodings (e.g. SHIFT-JIS) are known to have round
* trip issues [2]. Check the round trip conversion for all encodings
* listed in core.checkRoundtripEncoding.
*
* The round trip check is only performed if content is written to Git.
* This ensures that no information is lost during conversion to/from
* the internal UTF-8 representation.
*
* Please note, the code below is not tested because I was not able to
* generate a faulty round trip without an iconv error. Iconv errors
* are already caught above.
*
* [1] http://unicode.org/faq/utf_bom.html#gen2
* [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
*/
if (die_on_error && check_roundtrip(enc)) {
char *re_src;
int re_src_len;
re_src = reencode_string_len(dst, dst_len,
enc, default_encoding,
&re_src_len);
trace_printf("Checking roundtrip encoding for %s...\n", enc);
trace_encoding("reencoded source", path, enc,
re_src, re_src_len);
if (!re_src || src_len != re_src_len ||
memcmp(src, re_src, src_len)) {
const char* msg = _("encoding '%s' from %s to %s and "
"back is not the same");
die(msg, path, enc, default_encoding);
}
free(re_src);
}
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
}
static int encode_to_worktree(const char *path, const char *src, size_t src_len,
struct strbuf *buf, const char *enc)
{
char *dst;
int dst_len;
/*
* No encoding is specified or there is nothing to encode.
* Tell the caller that the content was not modified.
*/
if (!enc || (src && !src_len))
return 0;
dst = reencode_string_len(src, src_len, enc, default_encoding,
&dst_len);
if (!dst) {
error("failed to encode '%s' from %s to %s",
path, default_encoding, enc);
return 0;
}
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
}
static int crlf_to_git(const struct index_state *istate,
const char *path, const char *src, size_t len,
struct strbuf *buf,
@@ -978,6 +1214,24 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
return 1;
}
static const char *git_path_check_encoding(struct attr_check_item *check)
{
const char *value = check->value;
if (ATTR_UNSET(value) || !strlen(value))
return NULL;
if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
die(_("true/false are no valid working-tree-encodings"));
}
/* Don't encode to the default encoding */
if (same_encoding(value, default_encoding))
return NULL;
return value;
}
static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
{
const char *value = check->value;
@@ -1033,6 +1287,7 @@ struct conv_attrs {
enum crlf_action attr_action; /* What attr says */
enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
int ident;
const char *working_tree_encoding; /* Supported encoding or default encoding if NULL */
};
static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,7 +1296,8 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
if (!check) {
check = attr_check_initl("crlf", "ident", "filter",
-"eol", "text", NULL);
+"eol", "text", "working-tree-encoding",
+NULL);
user_convert_tail = &user_convert;
git_config(read_convert_config, NULL);
}
@@ -1064,6 +1320,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
else if (eol_attr == EOL_CRLF)
ca->crlf_action = CRLF_TEXT_CRLF;
}
ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
} else {
ca->drv = NULL;
ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1401,13 @@ int convert_to_git(const struct index_state *istate,
src = dst->buf;
len = dst->len;
}
ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
if (ret && dst) {
src = dst->buf;
len = dst->len;
}
if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
if (ret && dst) {
@@ -1167,6 +1431,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
die("%s: clean filter '%s' failed", path, ca.drv->name);
encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
}
@@ -1198,6 +1463,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
}
}
ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
if (ret) {
src = dst->buf;
len = dst->len;
}
ret_filter = apply_filter(
path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1935,9 @@ struct stream_filter *get_stream_filter(const char *path, const struct object_id
if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
return NULL;
if (ca.working_tree_encoding)
return NULL;
if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
return NULL;
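The advice messages in `validate_encoding()` print only the bit width of the encoding ("Please use UTF-%s ..."), which is derived from the configured name by upper-casing it, dropping a trailing LE/BE where applicable, and skipping the "UTF" or "UTF-" prefix. A standalone sketch of that derivation, without Git's `xstrdup_toupper()` and `skip_prefix()` helpers, might look as follows. `utf_name_suffix()` and its parameters are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper (not part of Git): copy the part of a UTF encoding
 * name that follows "UTF" or "UTF-" into `out`, upper-cased. If
 * `strip_endianness` is non-zero, the last two characters (e.g. "LE" or
 * "BE") are dropped first. Returns 0 on success and -1 if `enc` is not a
 * UTF name, is too short, or does not fit into `out`.
 */
int utf_name_suffix(const char *enc, int strip_endianness,
		    char *out, size_t cap)
{
	const char *p;
	size_t len, i;

	assert(enc && out);
	len = strlen(enc);
	if (len < 3 || strncasecmp(enc, "UTF", 3))
		return -1;

	p = enc + 3;
	len -= 3;
	if (*p == '-') { /* "UTF-16" and "UTF16" are treated alike */
		p++;
		len--;
	}
	if (strip_endianness) {
		if (len < 2)
			return -1;
		len -= 2;
	}
	if (len + 1 > cap)
		return -1;

	for (i = 0; i < len; i++)
		out[i] = (char)toupper((unsigned char)p[i]);
	out[len] = '\0';
	return 0;
}
```

Writing into a caller-supplied buffer is a design choice of this sketch; it avoids the allocate-and-free pair that the Git code needs around its upper-cased copy.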


@@ -12,6 +12,7 @@ struct index_state;
#define CONV_EOL_RNDTRP_WARN (1<<1) /* Warn if CRLF to LF to CRLF is different */
#define CONV_EOL_RENORMALIZE (1<<2) /* Convert CRLF to LF */
#define CONV_EOL_KEEP_CRLF (1<<3) /* Keep CRLF line endings as is */
#define CONV_WRITE_OBJECT (1<<4) /* Content is written to the index */
extern int global_conv_flags_eol;
@@ -55,6 +56,7 @@ struct delayed_checkout {
};
extern enum eol core_eol;
extern char *check_roundtrip_encoding;
extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
const char *path);
extern const char *get_wt_convert_stats_ascii(const char *path);


@@ -55,6 +55,7 @@ int check_replace_refs = 1; /* NEEDSWORK: rename to read_replace_refs */
char *git_replace_ref_base;
enum eol core_eol = EOL_UNSET;
int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
char *check_roundtrip_encoding = "SHIFT-JIS";
unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
enum rebase_setup_type autorebase = AUTOREBASE_NEVER;


@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, va_list params);
extern void set_die_is_recursing_routine(int (*routine)(void));
extern int starts_with(const char *str, const char *prefix);
extern int istarts_with(const char *str, const char *prefix);
/*
* If the string "str" begins with the string found in "prefix", return 1.


@@ -142,7 +142,7 @@ static int get_conv_flags(unsigned flags)
if (flags & HASH_RENORMALIZE)
return CONV_EOL_RENORMALIZE;
else if (flags & HASH_WRITE_OBJECT)
-return global_conv_flags_eol;
+return global_conv_flags_eol | CONV_WRITE_OBJECT;
else
return 0;
}


@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
return 0;
}
int istarts_with(const char *str, const char *prefix)
{
for (; ; str++, prefix++)
if (!*prefix)
return 1;
else if (tolower(*str) != tolower(*prefix))
return 0;
}
int skip_to_optional_arg_default(const char *str, const char *prefix,
const char **arg, const char *def)
{
@@ -793,7 +802,18 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
-result[i] = '\0';
return result;
}
char *xstrdup_toupper(const char *string)
{
char *result;
size_t len, i;
len = strlen(string);
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = toupper(string[i]);
return result;
}


@@ -616,6 +616,7 @@ __attribute__((format (printf,2,3)))
extern int fprintf_ln(FILE *fp, const char *fmt, ...);
char *xstrdup_tolower(const char *);
char *xstrdup_toupper(const char *);
/**
* Create a newly allocated string using printf format. You can do this easily

t/t0028-working-tree-encoding.sh (executable file, 245 changes)

@@ -0,0 +1,245 @@
#!/bin/sh
test_description='working-tree-encoding conversion via gitattributes'
. ./test-lib.sh
GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
test_expect_success 'setup test files' '
git config core.eol lf &&
text="hallo there!\ncan you read me?" &&
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
printf "$text" >test.utf8.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
# Line ending tests
printf "one\ntwo\nthree\n" >lf.utf8.raw &&
printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
# BOM tests
printf "\0a\0b\0c" >nobom.utf16be.raw &&
printf "a\0b\0c\0" >nobom.utf16le.raw &&
printf "\376\377\0a\0b\0c" >bebom.utf16be.raw &&
printf "\377\376a\0b\0c\0" >lebom.utf16le.raw &&
printf "\0\0\0a\0\0\0b\0\0\0c" >nobom.utf32be.raw &&
printf "a\0\0\0b\0\0\0c\0\0\0" >nobom.utf32le.raw &&
printf "\0\0\376\377\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
printf "\377\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
# Add only UTF-16 file, we will add the UTF-32 file later
cp test.utf16.raw test.utf16 &&
cp test.utf32.raw test.utf32 &&
git add .gitattributes test.utf16 &&
git commit -m initial
'
test_expect_success 'ensure UTF-8 is stored in Git' '
test_when_finished "rm -f test.utf16.git" &&
git cat-file -p :test.utf16 >test.utf16.git &&
test_cmp_bin test.utf8.raw test.utf16.git
'
test_expect_success 're-encode to UTF-16 on checkout' '
test_when_finished "rm -f test.utf16.raw" &&
rm test.utf16 &&
git checkout test.utf16 &&
test_cmp_bin test.utf16.raw test.utf16
'
test_expect_success 'check $GIT_DIR/info/attributes support' '
test_when_finished "rm -f test.utf32.git" &&
test_when_finished "git reset --hard HEAD" &&
echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
git add test.utf32 &&
git cat-file -p :test.utf32 >test.utf32.git &&
test_cmp_bin test.utf8.raw test.utf32.git
'
for i in 16 32
do
test_expect_success "check prohibited UTF-${i} BOM" '
test_when_finished "git reset --hard HEAD" &&
echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
echo "*.utf${i}le text working-tree-encoding=utf-${i}LE" >>.gitattributes &&
# Here we add UTF-16 (resp. UTF-32) files with BOM (big/little-endian)
# but we tell Git to treat them as UTF-16BE/UTF-16LE (resp. UTF-32BE/UTF-32LE).
# In these cases the BOM is prohibited.
cp bebom.utf${i}be.raw bebom.utf${i}be &&
test_must_fail git add bebom.utf${i}be 2>err.out &&
test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out &&
cp lebom.utf${i}le.raw lebom.utf${i}be &&
test_must_fail git add lebom.utf${i}be 2>err.out &&
test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out &&
cp bebom.utf${i}be.raw bebom.utf${i}le &&
test_must_fail git add bebom.utf${i}le 2>err.out &&
test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out &&
test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out &&
cp lebom.utf${i}le.raw lebom.utf${i}le &&
test_must_fail git add lebom.utf${i}le 2>err.out &&
test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out &&
test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out
'
test_expect_success "check required UTF-${i} BOM" '
test_when_finished "git reset --hard HEAD" &&
echo "*.utf${i} text working-tree-encoding=utf-${i}" >>.gitattributes &&
cp nobom.utf${i}be.raw nobom.utf${i} &&
test_must_fail git add nobom.utf${i} 2>err.out &&
test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
test_i18ngrep "use UTF-${i}BE or UTF-${i}LE" err.out &&
cp nobom.utf${i}le.raw nobom.utf${i} &&
test_must_fail git add nobom.utf${i} 2>err.out &&
test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
test_i18ngrep "use UTF-${i}BE or UTF-${i}LE" err.out
'
test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
test_when_finished "git reset --hard HEAD^" &&
cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
cp crlf.utf${i}.raw eol.utf${i} &&
cat >expectIndexLF <<-EOF &&
i/lf w/-text attr/text eol.utf${i}
EOF
git add eol.utf${i} &&
git commit -m eol &&
# UTF-${i} with CRLF (Windows line endings)
rm eol.utf${i} &&
git -c core.eol=crlf checkout eol.utf${i} &&
test_cmp_bin crlf.utf${i}.raw eol.utf${i} &&
# Although the file has CRLF in the working tree,
# ensure LF in the index
git ls-files --eol eol.utf${i} >actual &&
test_cmp expectIndexLF actual &&
# UTF-${i} with LF (Unix line endings)
rm eol.utf${i} &&
git -c core.eol=lf checkout eol.utf${i} &&
test_cmp_bin lf.utf${i}.raw eol.utf${i} &&
# The file has LF in the working tree, ensure LF in the index
git ls-files --eol eol.utf${i} >actual &&
test_cmp expectIndexLF actual
'
done
test_expect_success 'check unsupported encodings' '
test_when_finished "git reset --hard HEAD" &&
echo "*.set text working-tree-encoding" >.gitattributes &&
printf "set" >t.set &&
test_must_fail git add t.set 2>err.out &&
test_i18ngrep "true/false are no valid working-tree-encodings" err.out &&
echo "*.unset text -working-tree-encoding" >.gitattributes &&
printf "unset" >t.unset &&
git add t.unset &&
echo "*.empty text working-tree-encoding=" >.gitattributes &&
printf "empty" >t.empty &&
git add t.empty &&
echo "*.garbage text working-tree-encoding=garbage" >.gitattributes &&
printf "garbage" >t.garbage &&
test_must_fail git add t.garbage 2>err.out &&
test_i18ngrep "failed to encode" err.out
'
test_expect_success 'error if encoding round trip is not the same during refresh' '
BEFORE_STATE=$(git rev-parse HEAD) &&
test_when_finished "git reset --hard $BEFORE_STATE" &&
# Add and commit a UTF-16 file but skip the "working-tree-encoding"
# filter. Consequently, the in-repo representation is UTF-16 and not
# UTF-8. This simulates a Git version that has no working tree encoding
# support.
echo "*.utf16le text working-tree-encoding=utf-16le" >.gitattributes &&
echo "hallo" >nonsense.utf16le &&
TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16le) &&
git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16le &&
COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
git update-ref refs/heads/master $COMMIT &&
test_must_fail git checkout HEAD^ 2>err.out &&
test_i18ngrep "error: .* overwritten by checkout:" err.out
'
test_expect_success 'error if encoding garbage is already in Git' '
BEFORE_STATE=$(git rev-parse HEAD) &&
test_when_finished "git reset --hard $BEFORE_STATE" &&
# Skip the UTF-16 filter for the added file
# This simulates a Git version that has no working-tree-encoding support
cp nobom.utf16be.raw nonsense.utf16 &&
TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
git update-ref refs/heads/master $COMMIT &&
git diff 2>err.out &&
test_i18ngrep "error: BOM is required" err.out
'
test_expect_success 'check roundtrip encoding' '
test_when_finished "rm -f roundtrip.shift roundtrip.utf16" &&
test_when_finished "git reset --hard HEAD" &&
text="hallo there!\nroundtrip test here!" &&
printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
# SHIFT-JIS encoded files are round-trip checked by default...
GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 |
grep "Checking roundtrip encoding for SHIFT-JIS" &&
git reset &&
# ... unless we overwrite the Git config!
! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage \
add .gitattributes roundtrip.shift 2>&1 |
grep "Checking roundtrip encoding for SHIFT-JIS" &&
git reset &&
# UTF-16 encoded files should not be round-trip checked by default...
! GIT_TRACE=1 git add roundtrip.utf16 2>&1 |
grep "Checking roundtrip encoding for UTF-16" &&
git reset &&
# ... unless we tell Git to check it!
GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-16, UTF-32" \
add roundtrip.utf16 2>&1 |
grep "Checking roundtrip encoding for utf-16" &&
git reset &&
# ... unless we tell Git to check it!
# (here we also check that the casing of the encoding is irrelevant)
GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-32, utf-16" \
add roundtrip.utf16 2>&1 |
grep "Checking roundtrip encoding for utf-16" &&
git reset
'
test_done

utf8.c (65 changes)

@@ -401,18 +401,40 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width,
strbuf_release(&sb_dst);
}
/*
* Returns true (1) if the src encoding name matches the dst encoding
* name directly or one of its alternative names. E.g. UTF-16BE is the
* same as UTF16BE.
*/
static int same_utf_encoding(const char *src, const char *dst)
{
if (istarts_with(src, "utf") && istarts_with(dst, "utf")) {
/* src[3] or dst[3] might be '\0' */
int i = (src[3] == '-' ? 4 : 3);
int j = (dst[3] == '-' ? 4 : 3);
return !strcasecmp(src+i, dst+j);
}
return 0;
}
int is_encoding_utf8(const char *name)
{
if (!name)
return 1;
-if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
+if (same_utf_encoding("utf-8", name))
return 1;
return 0;
}
int same_encoding(const char *src, const char *dst)
{
-if (is_encoding_utf8(src) && is_encoding_utf8(dst))
+static const char utf8[] = "UTF-8";
+if (!src)
+src = utf8;
+if (!dst)
+dst = utf8;
+if (same_utf_encoding(src, dst))
return 1;
return !strcasecmp(src, dst);
}
@@ -538,6 +560,45 @@ char *reencode_string_len(const char *in, int insz,
}
#endif
static int has_bom_prefix(const char *data, size_t len,
const char *bom, size_t bom_len)
{
return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
}
static const char utf16_be_bom[] = {0xFE, 0xFF};
static const char utf16_le_bom[] = {0xFF, 0xFE};
static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
{
return (
(same_utf_encoding("UTF-16BE", enc) ||
same_utf_encoding("UTF-16LE", enc)) &&
(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
) || (
(same_utf_encoding("UTF-32BE", enc) ||
same_utf_encoding("UTF-32LE", enc)) &&
(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
);
}
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
return (
(same_utf_encoding(enc, "UTF-16")) &&
!(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
) || (
(same_utf_encoding(enc, "UTF-32")) &&
!(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
);
}
/*
* Returns first character length in bytes for multi-byte `text` according to
* `encoding`.
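The alias rule described for `same_utf_encoding()` (for example, UTF-16BE is the same as UTF16BE) can be tried out in isolation. The sketch below restates it using only POSIX string functions; `utf_names_match()` is a hypothetical stand-in written for this example and is not the function from utf8.c.

```c
#include <assert.h>
#include <strings.h>

/*
 * Hypothetical stand-in (not part of Git): return 1 if both names are UTF
 * encoding names that differ at most in letter case and in the optional
 * dash after "UTF"; return 0 otherwise, including for non-UTF names.
 */
int utf_names_match(const char *a, const char *b)
{
	assert(a && b);

	if (strncasecmp(a, "utf", 3) || strncasecmp(b, "utf", 3))
		return 0;

	/* step over "UTF" and one optional dash in each name */
	a += 3;
	b += 3;
	if (*a == '-')
		a++;
	if (*b == '-')
		b++;

	return !strcasecmp(a, b);
}
```

Non-UTF names deliberately yield 0 here; in the code above, `same_encoding()` falls back to a plain `strcasecmp()` of the two names for that case.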

utf8.h (28 changes)

@@ -70,4 +70,32 @@ typedef enum {
void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
const char *s);
/*
* If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
* BOM must not be used [1]. The same applies for the UTF-32 equivalents.
* The function returns true if this rule is violated.
*
* [1] http://unicode.org/faq/utf_bom.html#bom10
*/
int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
/*
* If the endianness is not defined in the encoding name, then we
* require a BOM. The function returns true if a required BOM is missing.
*
* The Unicode standard instructs to assume big-endian if there is no
* BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
* used in HTML5 recommends to assume little-endian to "deal with
* deployed content" [3].
*
* Therefore, strictly requiring a BOM seems to be the safest option for
* content in Git.
*
* [1] http://unicode.org/faq/utf_bom.html#gen6
* [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
* Section 3.10, D98, page 132
* [3] https://encoding.spec.whatwg.org/#utf-16le
*/
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
#endif
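Taken together, the two header comments above describe a small decision table: a BOM is prohibited when the encoding name states the endianness, and required when it does not. A minimal sketch of that table follows, assuming the caller has already parsed the encoding name into a bit width and an "endianness named" flag. `check_bom_policy()`, `has_utf_bom()` and `enum bom_status` are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum bom_status { BOM_OK, BOM_PROHIBITED, BOM_MISSING };

/* Return 1 if `data` starts with a big- or little-endian BOM of the given width. */
static int has_utf_bom(int bits, const unsigned char *data, size_t len)
{
	static const unsigned char be16[] = { 0xFE, 0xFF };
	static const unsigned char le16[] = { 0xFF, 0xFE };
	static const unsigned char be32[] = { 0x00, 0x00, 0xFE, 0xFF };
	static const unsigned char le32[] = { 0xFF, 0xFE, 0x00, 0x00 };

	if (!data)
		return 0;
	if (bits == 16)
		return len >= 2 &&
		       (!memcmp(data, be16, 2) || !memcmp(data, le16, 2));
	return len >= 4 &&
	       (!memcmp(data, be32, 4) || !memcmp(data, le32, 4));
}

/*
 * Hypothetical helper (not part of Git): apply the BOM rules to content
 * declared as UTF-16/32 (endianness_in_name == 0) or as one of the
 * LE/BE variants (endianness_in_name != 0).
 */
enum bom_status check_bom_policy(int bits, int endianness_in_name,
				 const unsigned char *data, size_t len)
{
	int bom;

	assert(bits == 16 || bits == 32);
	bom = has_utf_bom(bits, data, len);

	if (endianness_in_name && bom)
		return BOM_PROHIBITED; /* e.g. UTF-16LE content with a BOM */
	if (!endianness_in_name && !bom)
		return BOM_MISSING;    /* e.g. UTF-16 content without a BOM */
	return BOM_OK;
}
```

As in the functions declared above, either byte order of the BOM counts as "a BOM is present"; the sketch does not check that the BOM agrees with the endianness in the name.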