- no longer pass the array length to `utf8_decode`
- add `utf8_decode_len` for border cases
- use switch based dispatch in `utf8_decode_len` to work around a gcc 12.2 optimizer bug
Ensure proper UTF-8 encoding (1 to 4 bytes).
Handle invalid encodings (return 0xFFFD and consume a single byte)
Individually encoded surrogate code points are accepted.
- add `utf8_scan()` to analyze a byte array for UTF-8 contents
detects invalid encoding, computes number of codepoints and content kind:
plain ASCII, 8-bit, 16-bit or larger codepoints.
- add `utf8_encode_len(c)` to compute the number of bytes to encode `c`
- rename `unicode_to_utf8` as `utf8_encode`
- rename `unicode_from_utf8` as `utf8_decode`
- add `utf8_decode_buf8(dest, size, src, len)` to decode a UTF-8 encoded
byte array known to contain only ASCII and 8-bit codepoints.
- add `utf8_decode_buf16(dest, size, src, len)` to decode a UTF-8 encoded
byte array into an array of 16-bit codepoints using UTF-16 surrogate pairs
for non-BMP1 codepoints.
- add `utf8_encode_buf8(dest, size, src, len)` to encode an array of 8-bit
codepoints as a UTF-8 encoded null terminated string
- add `utf16_encode_buf8(dest, size, src, len)` to decode an array of 16-bit
codepoints (including surrogate pairs) as a UTF-8 encoded null terminated string
- detect invalid UTF-8 encoding in RegExp parser
- simplify `JS_AtomGetStrRT`, `JS_NewStringLen` using the above functions
- simplify UTF-8 decoding and error testing
* Fix bug in `GET_PREV_CHAR` macro
- pass `cbuf_type` variable to `XXX_CHAR` macros in `lre_exec_backtrack()`
- improve readability of these macros
- fix `GET_PREV_CHAR` macro: `cptr` was decremented twice on invalid high surrogate.
- minimize non functional changes
* Fix big endian serialization
Big endian serialization was broken because:
- it partially relied on `WORDS_ENDIAN` (unconditionally undef'd in cutils.h)
- endianness was not handled at all in the bc reader.
- `bc_tag_str` was missing the `"RegExp"` string
- `lre_byte_swap()` was broken for `REOP_range` and `REOP_range32`
Modifications:
- remove `WORDS_ENDIAN`
- use `bc_put_u32()` / `bc_put_u64()` in `JS_WriteBigInt()`
- use `bc_get_u32()` / `bc_get_u64()` in `JS_ReadBigInt()`
- handle host endianness in `bc_get_u16()`, `bc_get_u32()`, `bc_get_u64()` and
`JS_ReadFunctionBytecode()`
- handle optional littleEndian argument as specified in
`js_dataview_getValue()` and `js_dataview_setValue()`
- fix `bc_tag_str` and `lre_byte_swap()`
The non-Unicode version of the pattern was already accepted.
test262 tests it in an inverted sense in
test/built-ins/RegExp/unicode_restricted_identity_escape.js but
it appears to be per spec and both V8 and Spidermonkey accept it.
Fixes: https://github.com/quickjs-ng/quickjs/issues/286
Add REOP_char8 that matches single bytes. Compresses bytecode for the
ASCII common case by 33% and reduces regexp_ascii benchmark running time
by 4%. The regexp_utf16 benchmark is unaffected.