Tokens

Tokens (also called terminals) cannot be further divided. The grammar uses the following token types: names, numbers, strings, single-character strings, code points, and punctuation.

Names (or identifiers) consist of a letter or underscore (_), followed by any number of letters, digits and underscores. For example:

# valid identifiers
hello i18n _foo_ Gänsefüßchen
# invalid identifiers
kebab-case 42 👍‍

A letter is any code point with the Alphabetic property, which can be matched in most regex flavors with \p{Alpha}. A digit is any code point from the Number general categories, which can be matched in most regex flavors with \pN.

Note that group names are more restricted than variable names: they must be ASCII-only and may not contain underscores.
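The identifier rule can be sketched as a small Python check. This is an illustration, not Pomsky's implementation: the function names are hypothetical, and `is_letter` only approximates the Alphabetic property by testing for the letter general categories, because Python's standard library does not expose that property directly.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # Approximation (assumption): general categories Lu, Ll, Lt, Lm, Lo.
    # The real Alphabetic property covers a few more code points.
    return unicodedata.category(ch).startswith("L")


def is_digit(ch: str) -> bool:
    # Any code point in a Number general category (Nd, Nl, No).
    return unicodedata.category(ch).startswith("N")


def is_identifier(name: str) -> bool:
    """Letter or underscore first, then letters, digits and underscores."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if not (first == "_" or is_letter(first)):
        return False
    return all(c == "_" or is_letter(c) or is_digit(c) for c in rest)


def is_group_name(name: str) -> bool:
    """Group names: identifiers that are ASCII-only and contain no underscore."""
    return is_identifier(name) and name.isascii() and "_" not in name
```

Reserved words are not checked here; a name that passes `is_identifier` may still be rejected for that reason.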

Identifiers may not be one of the following reserved words:

  • U
  • let
  • lazy
  • greedy
  • range
  • base
  • atomic
  • enable
  • disable
  • if
  • else
  • recursion
  • regex
  • test

There are some contextual keywords that have a special meaning only in a certain context:

  • match
  • reject
  • as
  • in
  • unicode

Contextual keywords can be used as variable and group names without issues.
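As a rough sketch of how the two keyword lists differ in practice, the following Python snippet classifies a word. The names `classify_word`, `RESERVED_WORDS` and `CONTEXTUAL_KEYWORDS` are made up for this example; only the word lists themselves come from the text above.

```python
RESERVED_WORDS = frozenset({
    "U", "let", "lazy", "greedy", "range", "base", "atomic",
    "enable", "disable", "if", "else", "recursion", "regex", "test",
})

CONTEXTUAL_KEYWORDS = frozenset({"match", "reject", "as", "in", "unicode"})


def classify_word(word: str) -> str:
    """Return 'reserved', 'contextual' or 'plain' for a word."""
    if word in RESERVED_WORDS:
        return "reserved"      # may not be used as an identifier
    if word in CONTEXTUAL_KEYWORDS:
        return "contextual"    # special only in certain positions
    return "plain"


def usable_as_name(word: str) -> bool:
    # Only reserved words are ruled out; contextual keywords are fine.
    return classify_word(word) != "reserved"
```

Note that the comparison is case-sensitive, so `U` is reserved while `u` is classified as plain.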

A number is a whole number without a sign and without leading zeros. For example:

# valid numbers
0 1 42 10000
# invalid numbers
042 -30 +30 30.1 10_000 10,000
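The number rule maps onto a short regular expression. The Python sketch below is one way to write it; `is_number` is a hypothetical helper, and the pattern uses `[0-9]` rather than `\d` on the assumption that only ASCII digits are meant.

```python
import re

# Either a lone zero, or a non-zero digit followed by more digits.
NUMBER = re.compile(r"0|[1-9][0-9]*")


def is_number(text: str) -> bool:
    return NUMBER.fullmatch(text) is not None
```

With this pattern, all four valid examples above match and all six invalid ones are rejected.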

A string is a sequence of code points surrounded by single or double quotes. In double-quoted strings, double quotes and backslashes are escaped by preceding them with a backslash; no other escapes are supported. Single-quoted strings don't support any escaping:

# valid strings
'test' "test" "C:\\User\\Dwayne \"The Rock\" Johnson" 'C:\User\Dwayne "The Rock" Johnson'
'this is a
multiline string'
"this is a
multiline string"
# invalid strings
"\n" "\uFFFF" '\''

Within string literals, \r\n (CRLF) sequences are replaced with a single \n (LF). This is because text editors usually do not display the type of line ending, so users might accidentally save a Pomsky file with the wrong line endings. In most regex engines, \n matches a line break regardless of the platform convention used.
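To make the escaping and line-ending rules concrete, here is a minimal Python sketch that turns a string literal into its content. It is an illustration under stated assumptions, not the actual lexer: `parse_string_literal` is a hypothetical name, and the input is assumed to be exactly one literal including its quotes.

```python
def parse_string_literal(literal: str) -> str:
    """Return the content of a quoted literal, or raise ValueError."""
    if len(literal) < 2 or literal[0] not in "'\"":
        raise ValueError("not a string literal")
    quote = literal[0]
    body = literal[1:]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == quote:
            # The closing quote must be the last character.
            if i != len(body) - 1:
                raise ValueError("unexpected text after closing quote")
            # Normalize CRLF to LF, as described above.
            return "".join(out).replace("\r\n", "\n")
        if ch == "\\" and quote == '"':
            # In double-quoted strings only \" and \\ are valid escapes.
            nxt = body[i + 1] if i + 1 < len(body) else ""
            if nxt not in ('"', "\\"):
                raise ValueError("unsupported escape sequence")
            out.append(nxt)
            i += 2
            continue
        # Any other character, including a backslash inside single
        # quotes, is taken literally.
        out.append(ch)
        i += 1
    raise ValueError("unterminated string literal")
```

In this sketch the double-quoted and single-quoted spellings of the `C:\User\...` example above produce the same content, and `'\''` is rejected because the second quote already closes the string.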

A single-character string is the same as a string, with the limitation that it must contain exactly one code point. For example:

'a' 'ŧ' "\\"

A code point consists of U, +, and 1 to 6 hexadecimal digits (0-9, a-f, A-F). It must represent a valid Unicode scalar value, that is, a valid code point (at most U+10FFFF) that is not a UTF-16 surrogate (U+D800 to U+DFFF). For example:

# valid codepoints
U+0 U+10 U+FFF U+10FFFF U + FF
# invalid codepoints
U+300000 U+00000001 U+D800 U+FGHI

The code point token is ‘special’ in that the + may be surrounded by spaces.
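The code point rules can be checked with a regular expression plus a range test. The following Python sketch assumes that only plain spaces may surround the `+`; `parse_code_point` is a hypothetical helper that returns the numeric value, or `None` for an invalid token.

```python
import re
from typing import Optional

# "U", optional spaces, "+", optional spaces, then 1 to 6 hex digits.
CODE_POINT = re.compile(r"U *\+ *([0-9a-fA-F]{1,6})")


def parse_code_point(token: str) -> Optional[int]:
    m = CODE_POINT.fullmatch(token)
    if m is None:
        return None
    value = int(m.group(1), 16)
    # Must be a Unicode scalar value: in range and not a surrogate.
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        return None
    return value
```

`U+00000001` is rejected by the digit count alone, while `U+300000` and `U+D800` pass the pattern and fail the range test.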

Punctuation tokens consist of visible ASCII characters. Most punctuation tokens are exactly one character, except for <<, >>, and ::. The full list of supported punctuation tokens is:

>> << :: ^ $ < > % * + ? | : ( ) { } , ! [ - ] . ; =
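Because some punctuation tokens are prefixes of others (`<` and `<<`, `:` and `::`), a lexer has to decide which one to pick. The Python sketch below assumes a longest-match strategy and skips whitespace between tokens; both choices, as well as the name `lex_punctuation`, are assumptions made for illustration and may differ from what Pomsky's lexer does.

```python
PUNCTUATION = [
    ">>", "<<", "::", "^", "$", "<", ">", "%", "*", "+", "?", "|", ":",
    "(", ")", "{", "}", ",", "!", "[", "-", "]", ".", ";", "=",
]

# Try two-character tokens before one-character tokens.
BY_LENGTH = sorted(PUNCTUATION, key=len, reverse=True)


def lex_punctuation(text: str) -> list:
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for tok in BY_LENGTH:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # No punctuation token starts at this position.
            raise ValueError(f"not a punctuation token: {text[i]!r}")
    return tokens
```

Under these assumptions, `<<` lexes as one token, whereas `< <` lexes as two.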

Pomsky’s lexer can also lex a variety of illegal constructs, e.g. backslash escapes like \g<0> and groups such as (:?), in order to show more useful error messages.