Tokens

Tokens (also called terminals) cannot be further divided. There are the following token types used in the grammar:

Name

Names (or identifiers) consist of a letter or underscore (_), followed by any number of letters, digits and underscores. For example:

# valid identifiers
hello  i18n  _foo_  Gänsefüßchen

# invalid identifiers
kebab-case  42  👍‍

A letter is any code point with the Alphabetic property, which can be matched in most regex flavors with \p{Alpha}. A digit is any code point from the Number general categories, which can be matched in most regex flavors with \pN.

Note that group names have more restrictions than variable names: They must be ASCII-only and may not contain underscores.

Identifiers may not be one of the following reserved words:

U
let
lazy
greedy
range
base
atomic
enable
disable
if
else
recursion
regex
test

There are some contextual keywords that have a special meaning only in a certain context:

match
reject
as
in
unicode

Contextual keywords can be used as variable and group names without issues.

Number

A whole number without a sign and without leading zeros. For example:

# valid numbers
0  1  42  10000

# invalid numbers
042  -30  +30  30.1  10_000  10,000

String

A string is a sequence of code points surrounded by single or double quotes. In double quoted strings, double quotes and backslashes are escaped by preceding them with a backslash. No other escapes are supported. Single quoted strings don’t support any escaping:

# valid strings
'test'  "test"  "C:\\User\\Dwayne \"The Rock\" Johnson"  'C:\User\Dwayne "The Rock" Johnson'

'this is a
multiline string'

"this is a
multiline string"

# invalid strings
"\n"  "\uFFFF"  '\''

Within string literals, \r\n (CRLF) sequences are replaced with a single \n (LF). This is because text editors do not display the type of line ending, so users might save a Pomsky file with the wrong file ending by accident. In most regex engines, \n matches a line break regardless of the platform convention used.

StringOneChar

Same as String, with the limitation that the string must contain exactly one code point. Example:

'a'  'ŧ'  "\\"

CodePoint

A codepoint consists of U, +, and 1 to 6 hexadecimal digits (0-9, a-f, A-F). It must represent a valid Unicode scalar value. This means that it must be a valid codepoint, but not a UTF-16 surrogate. For example:

# valid codepoints
U+0  U+10  U+FFF  U+10FFFF  U + FF

# invalid codepoints
U+300000  U+00000001  U+D800  U+FGHI

The code point token is ‘special’ in that the + may be surrounded by spaces.

Punctuation

Punctuation tokens consist of visible ASCII characters. Most punctuation tokens are exactly one character, except for <<, >>, and ::. The full list of supported punctuation tokens is

>> << :: ^ $ < > % * + ? | : ( ) { } , ! [ - ] . ; =

Pomsky’s lexer can also lex a variety of illegal constructs, e.g. backslash escapes like \g<0> and groups such as (:?), in order to show more useful error messages.