Commit graph

21 commits

Author SHA1 Message Date
Charlie Gordon
1baa6763f8
Improve UTF-8 decoding and encoding functions (#410)
Ensure proper UTF-8 encoding (1 to 4 bytes).
Handle invalid encodings (return 0xFFFD and consume a single byte)
Individually encoded surrogate code points are accepted.

- add `utf8_scan()` to analyze a byte array for UTF-8 contents:
  it detects invalid encodings and computes the number of codepoints and the
  content kind: plain ASCII, 8-bit, 16-bit or larger codepoints.
- add `utf8_encode_len(c)` to compute the number of bytes to encode `c`
- rename `unicode_to_utf8` as `utf8_encode`
- rename `unicode_from_utf8` as `utf8_decode`
- add `utf8_decode_buf8(dest, size, src, len)` to decode a UTF-8 encoded
  byte array known to contain only ASCII and 8-bit codepoints.
- add `utf8_decode_buf16(dest, size, src, len)` to decode a UTF-8 encoded
  byte array into an array of 16-bit codepoints using UTF-16 surrogate pairs
  for non-BMP codepoints.
- add `utf8_encode_buf8(dest, size, src, len)` to encode an array of 8-bit
  codepoints as a UTF-8 encoded null-terminated string
- add `utf8_encode_buf16(dest, size, src, len)` to encode an array of 16-bit
  codepoints (including surrogate pairs) as a UTF-8 encoded null-terminated string
- detect invalid UTF-8 encoding in RegExp parser
- simplify `JS_AtomGetStrRT`, `JS_NewStringLen` using the above functions
- simplify UTF-8 decoding and error testing
2024-05-21 14:08:33 +02:00
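The decoding rules listed in the commit above (1 to 4 bytes, 0xFFFD on invalid input with a single byte consumed, individually encoded surrogates accepted) can be illustrated with a minimal sketch. The names `sketch_utf8_encode_len` and `sketch_utf8_decode` and their signatures are hypothetical and chosen for this example only; they are not the signatures of `utf8_encode_len` or `utf8_decode` in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode code point c (1 to 4). */
static size_t sketch_utf8_encode_len(uint32_t c)
{
    if (c < 0x80) return 1;
    if (c < 0x800) return 2;
    if (c < 0x10000) return 3;
    return 4;
}

/* Hypothetical decoder: decodes one code point from p[0..len) and stores the
   number of bytes consumed in *consumed. Invalid input yields 0xFFFD and
   consumes exactly one byte. Encoded surrogates (U+D800..U+DFFF) are accepted. */
static uint32_t sketch_utf8_decode(const uint8_t *p, size_t len, size_t *consumed)
{
    /* smallest code point that legitimately needs n continuation bytes */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t n, i;

    assert(len > 0);
    c = p[0];
    *consumed = 1;
    if (c < 0x80)
        return c;                       /* plain ASCII */
    if (c >= 0xC2 && c <= 0xDF) { n = 1; c &= 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { n = 2; c &= 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { n = 3; c &= 0x07; }
    else return 0xFFFD;                 /* stray continuation or invalid lead byte */
    if (len < n + 1)
        return 0xFFFD;                  /* truncated sequence */
    for (i = 1; i <= n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0xFFFD;              /* missing continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min_cp[n] || c > 0x10FFFF)
        return 0xFFFD;                  /* overlong or out of range */
    *consumed = n + 1;
    return c;
}
```

Consuming a single byte on error lets the caller resynchronize at the next byte instead of skipping bytes that may start a valid sequence.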
Charlie Gordon
5abbeacc62
Fix bug in GET_PREV_CHAR macro (#278)
- pass `cbuf_type` variable to `XXX_CHAR` macros in `lre_exec_backtrack()`
- improve readability of these macros
- fix `GET_PREV_CHAR` macro: `cptr` was decremented twice on invalid high surrogate.
- minimize non-functional changes
2024-03-03 17:12:52 +01:00
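The double-decrement problem described in the commit above can be illustrated with a small sketch of backward iteration over a UTF-16 buffer. This is not the `GET_PREV_CHAR` macro itself; `sketch_prev_char16` is a hypothetical function written for this example. It peeks at the preceding code unit without moving the pointer, so the pointer moves a second time only when a valid surrogate pair is found.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: steps *pp back by one code point and returns it.
   Precondition: *pp > start. An unpaired surrogate is returned as-is and
   moves the pointer by exactly one code unit. */
static uint32_t sketch_prev_char16(const uint16_t **pp, const uint16_t *start)
{
    const uint16_t *p = *pp;
    uint32_t c;

    assert(p > start);
    c = *--p;                           /* first (and usually only) decrement */
    if (c >= 0xDC00 && c < 0xE000 && p > start) {
        uint32_t hi = p[-1];            /* peek without moving the pointer */
        if (hi >= 0xD800 && hi < 0xDC00) {
            p--;                        /* second decrement only for a valid pair */
            c = 0x10000 + ((hi - 0xD800) << 10) + (c - 0xDC00);
        }
    }
    *pp = p;
    return c;
}
```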
Charlie Gordon
708dbcbf5b
Fix big endian serialization (#269)
Big endian serialization was broken because:
- it partially relied on `WORDS_BIGENDIAN` (unconditionally undef'd in cutils.h)
- endianness was not handled at all in the bc reader.
- `bc_tag_str` was missing the `"RegExp"` string
- `lre_byte_swap()` was broken for `REOP_range` and `REOP_range32`

Modifications:
- remove `WORDS_BIGENDIAN`
- use `bc_put_u32()` / `bc_put_u64()` in `JS_WriteBigInt()`
- use `bc_get_u32()` / `bc_get_u64()` in `JS_ReadBigInt()`
- handle host endianness in `bc_get_u16()`, `bc_get_u32()`, `bc_get_u64()` and
  `JS_ReadFunctionBytecode()`
- handle optional littleEndian argument as specified in
  `js_dataview_getValue()` and `js_dataview_setValue()`
- fix `bc_tag_str` and `lre_byte_swap()`
2024-03-02 18:38:29 +01:00
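One portable way to make serialized integers independent of host endianness is to read and write them byte by byte in a fixed order, as in the sketch below. The fixed little-endian order and the names `sketch_put_u32_le` and `sketch_get_u32_le` are assumptions of this example; the actual `bc_put_u32()` and `bc_get_u32()` functions may use a different technique, such as a conditional byte swap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical writer: stores v in little-endian byte order, whatever the host is. */
static void sketch_put_u32_le(uint8_t *buf, uint32_t v)
{
    assert(buf != NULL);
    buf[0] = (uint8_t)(v & 0xFF);
    buf[1] = (uint8_t)((v >> 8) & 0xFF);
    buf[2] = (uint8_t)((v >> 16) & 0xFF);
    buf[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Hypothetical reader: rebuilds the value from little-endian bytes using
   shifts, so no preprocessor endianness macro is needed. */
static uint32_t sketch_get_u32_le(const uint8_t *buf)
{
    assert(buf != NULL);
    return (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}
```

Because the shifts operate on values rather than on memory layout, the same code produces the same bytes on little-endian and big-endian hosts.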
Ben Noordhuis
f406d6f78c
Accept /[\-]/u as a valid regular expression (#288)
The non-Unicode version of the pattern was already accepted.

test262 tests it in an inverted sense in
test/built-ins/RegExp/unicode_restricted_identity_escape.js but
it appears to be per spec and both V8 and SpiderMonkey accept it.

Fixes: https://github.com/quickjs-ng/quickjs/issues/286
2024-03-02 13:29:15 +01:00
Ben Noordhuis
f0ef9e1593
Implement RegExp 'v' flag, part 1 (#229)
This commit implements the flag itself and teaches the regex engine to
reject previously accepted patterns when in unicodeSets mode.

Refs: https://github.com/quickjs-ng/quickjs/issues/228
2023-12-21 19:37:31 +01:00
Ben Noordhuis
f6ed206bd5
Change regexp flags field from uint8 to uint16 (#185)
I need the extra bits to store the 'v' flag as described in
https://github.com/tc39/proposal-regexp-v-flag
2023-12-09 16:47:05 +01:00
Ben Noordhuis
f7d2169999
Rename LRE_FLAG_UTF16 to LRE_FLAG_UNICODE (#186)
Prep work for https://github.com/tc39/proposal-regexp-v-flag a.k.a.
UnicodeSets.
2023-12-08 10:58:00 +01:00
Ben Noordhuis
42b708622c
Use named constant for regexp bytecode size field (#183)
2023-12-07 23:00:32 +01:00
Linus Groh
3b034b84d9
Fix null pointer arithmetic UB in libregexp (#136)
This is a patch I originally wrote for the Kiesel JS engine:
https://codeberg.org/kiesel-js/kiesel/src/branch/main/patches/libregexp.patch
2023-11-29 14:43:02 +01:00
Ben Noordhuis
5c3077e091
Implement RegExp serialization (#153)
JS_WriteObject() and JS_ReadObject() now support RegExp objects.
2023-11-29 08:50:53 +01:00
Saúl Ibarra Corretgé
a721bda7b5
Drop CONFIG_ALL_UNICODE and enable it by default
2023-11-20 10:52:04 +01:00
Ben Noordhuis
bef2a12566
DRY surrogate pair handling (#95)
2023-11-20 09:46:02 +01:00
Ben Noordhuis
d1960d1bfe
Implement RegExp 'd' flag (#86)
2023-11-20 09:45:44 +01:00
Ben Noordhuis
e2bc6441f8
Optimize RegExp ASCII literal matching (#94)
Add REOP_char8 that matches single bytes. Compresses bytecode for the
ASCII common case by 33% and reduces regexp_ascii benchmark running time
by 4%. The regexp_utf16 benchmark is unaffected.
2023-11-19 17:26:45 +01:00
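The 33% figure in the commit above is consistent with a literal-match instruction shrinking from 3 bytes (opcode plus 16-bit operand) to 2 bytes (opcode plus 8-bit operand). The sketch below shows that arithmetic under an assumed instruction layout; the opcode values, the `c <= 0xFF` threshold, and the name `sketch_emit_char` are hypothetical and not taken from libregexp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode values, for illustration only. */
enum { SKETCH_OP_CHAR16 = 1, SKETCH_OP_CHAR8 = 2 };

/* Hypothetical emitter: writes one literal-match instruction for c into out
   and returns the number of bytes written. Assumes c <= 0xFFFF. */
static size_t sketch_emit_char(uint8_t *out, uint32_t c)
{
    assert(c <= 0xFFFF);
    if (c <= 0xFF) {
        out[0] = SKETCH_OP_CHAR8;       /* 1 opcode byte + 1 operand byte */
        out[1] = (uint8_t)c;
        return 2;
    }
    out[0] = SKETCH_OP_CHAR16;          /* 1 opcode byte + 2 operand bytes */
    out[1] = (uint8_t)(c & 0xFF);
    out[2] = (uint8_t)(c >> 8);
    return 3;
}
```

Under this layout, a pattern made only of single-byte literals needs 2 bytes per character instead of 3, a reduction of one third.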
Ben Noordhuis
b56cbb143c
Implement extended named capture group identifiers (#90)
Perfectly reasonable and not at all uncommon regular expressions like
/(?<𝑓𝑜𝑥>fox).*(?<𝓓𝓸𝓰>dog)/ are now accepted.
2023-11-19 11:01:38 +01:00
Ben Noordhuis
162a8b7409
Remove trailing whitespace (#46)
Not purely cosmetic because it breaks navigation with { and } in the
One True Editor.
2023-11-12 10:01:40 +01:00
bellard
b1f67dfc1a
2020-11-08 release
2020-11-08 14:30:56 +01:00
bellard
7c312df422
2020-09-06 release
2020-09-06 19:10:15 +02:00
bellard
8900766099
2020-07-05 release
2020-09-06 19:07:30 +02:00
bellard
383e2b06c8
2020-03-16 release
2020-09-06 19:02:03 +02:00
bellard
91459fb672
2020-01-05 release
2020-09-06 18:53:08 +02:00