This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy.
Fixes #552
This implements surrogate handling on decoding as it is in the standard library. Lone escaped surrogates and any raw surrogates in the input result in surrogates in the output, and escaped surrogate pairs get decoded into non-BMP characters. Note that raw surrogate pairs get treated differently on platforms/compilers with 16-bit `wchar_t`, e.g. Microsoft Windows.
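As a reference point, the three decoding cases above can be checked against the standard library's `json` module. This is only a sketch of the stdlib behaviour being matched; it does not call ujson, and the variable names are illustrative.

```python
import json

# Escaped surrogate pair: decoded into a single non-BMP code point.
pair = json.loads('"\\ud83d\\ude00"')

# Lone escaped surrogate: kept as a surrogate code point in the result.
lone = json.loads('"\\ud83d"')

# Raw (unescaped) lone surrogate already present in the input string.
raw = json.loads('"\ud83d"')

# Print code points rather than the strings themselves, since lone
# surrogates cannot be written to a strict UTF-8 stdout.
print(len(pair), hex(ord(pair)))
print(len(lone), hex(ord(lone)))
print(raw == lone)
```

Raw surrogate pairs are left out of the sketch because, as noted above, their handling depends on the width of `wchar_t`.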
`JSON_EncodeObject` returned `NULL` when an error occurred, but without freeing the buffer. This leaked memory whenever the buffer was internally allocated (because the caller's buffer was insufficient or none was provided at all) and any error occurred. Similarly, `objToJSON` did not clean up the buffer in all error conditions.
This adds the missing buffer free in `JSON_EncodeObject` (iff the buffer was allocated internally) and refactors the error handling in `objToJSON` slightly to also free the buffer when a Python exception occurred without the encoder's `errorMsg` being set.
This allows surrogates anywhere in the input, compatible with the json module from the standard library.
This also refactors two interfaces:
- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.
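For comparison, here is a rough sketch of the stdlib behaviour on the encoding side, and of why an explicit length matters. It uses only the standard `json` module and the `surrogatepass` error handler, not ujson. The `payload` bytes and the `strlen_view` slice are hypothetical stand-ins for the encoder's `char*` output and for what a `strlen`-based conversion would see.

```python
import json

lone = "\ud83d"  # lone high surrogate, not encodable as strict UTF-8

# With the default ensure_ascii=True the surrogate is escaped.
escaped = json.dumps(lone)

# With ensure_ascii=False it is passed through unchanged.
raw = json.dumps(lone, ensure_ascii=False)

# Strict UTF-8 rejects surrogates; "surrogatepass" keeps them intact
# in both directions.
encoded = raw.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

# Hypothetical encoder output containing a NUL byte. A strlen-style scan
# stops at the first NUL, while an explicit length keeps all five bytes.
payload = b'"a\x00b"'
strlen_view = payload[: payload.index(b"\x00")]

print(escaped)
print(decoded == raw)
print(len(payload), len(strlen_view))
```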
Fixes #156
Fixes #447
Fixes #537
Supersedes #284
In DEBUG mode, this ensures that all buffer appends are safe.
It also refactors direct `memcpy` calls into a helper `Buffer_memcpy` function that ensures correct buffer pointer movement and has a similar safety check.
* Removed the reservations in `Buffer_EscapeStringUnvalidated` and `Buffer_EscapeStringValidated`, as they are not needed and may hide other bugs.
* The debug check in `Buffer_EscapeStringValidated` was triggering incorrectly.
* The reservation on `JT_RAW` was much larger than necessary; the value is copied directly, so the factor of six is not needed and may hide other bugs.
* Added explicit, accurate reservations everywhere else.
Add a few extra memory reserve calls to account for the extra space that
indentation needs.
These kinds of memory issues are hard to spot because the buffer is resized in
powers of 2, meaning that a miscalculation only shows symptoms if the
required buffer size is estimated to be just below a power of 2 but is actually
just above. Add a debug mode which replaces the power-of-2 scheme with reserving
only the memory explicitly requested and adds some overflow checks.
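To illustrate why power-of-2 growth hides under-reservations, here is a small Python model of the idea. It is not the C encoder: `ModelBuffer`, `write_with_bug`, and `survives` are hypothetical names, and the off-by-one is a deliberately planted bug.

```python
class ModelBuffer:
    """Toy model of a growable output buffer that tracks sizes only."""

    def __init__(self, exact, initial=16):
        self.exact = exact        # True models the debug mode
        self.capacity = initial
        self.used = 0

    def reserve(self, extra):
        needed = self.used + extra
        if needed <= self.capacity:
            return
        if self.exact:
            # Debug mode: grow to exactly what was requested.
            self.capacity = needed
        else:
            # Normal mode: keep doubling until the request fits.
            while self.capacity < needed:
                self.capacity *= 2

    def append(self, nbytes):
        # Overflow check; in C this write would silently corrupt memory.
        if self.used + nbytes > self.capacity:
            raise OverflowError("write past end of buffer")
        self.used += nbytes


def write_with_bug(buf, payload):
    buf.reserve(payload)      # planted bug: one byte short
    buf.append(payload + 1)


def survives(exact, payload):
    try:
        write_with_bug(ModelBuffer(exact), payload)
        return True
    except OverflowError:
        return False


print(survives(exact=False, payload=20))  # slack up to 32 hides the bug
print(survives(exact=False, payload=32))  # request lands exactly on 32
print(survives(exact=True, payload=20))   # exact reservation exposes it
```

In this model the doubling scheme only trips on the payload that lands exactly on a power of 2, while the exact mode catches the same bug for an ordinary payload.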
Previously, we'd output a couple of newlines between the start and end
of the object, whereas the stdlib doesn't emit any whitespace when the
object is empty.
In my testing, the only difference in indented serialization now is
float representation.
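For reference, this is the stdlib behaviour being matched, shown with the standard `json` module only (no ujson calls):

```python
import json

# Empty containers are emitted without any inner whitespace,
# even when an indent is requested.
empty_obj = json.dumps({}, indent=4)
empty_arr = json.dumps([], indent=4)

# Nested empty containers stay compact while their parent is indented.
nested = json.dumps({"a": [], "b": {}}, indent=2)

print(empty_obj)
print(empty_arr)
print(nested)
```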
Fix segfaults on musl libc when ultrajson runs in a thread. On musl libc
the default thread stack size is only 80k, so allocating a 128k buffer on
the stack is guaranteed to crash. There doesn't seem to be any evident
performance benefit to using a big buffer on the stack either, so we just
reduce the default.
Fixes #254