Common Elements of Lexemes

The following tables list common elements of lexemes.

Concept Rule Representation Description

Decimal digit DG [0-9] One character from '0'..'9'.
Octal digit OC [0-7] One character from '0'..'7'.
Hexadecimal digit HX [0-9a-fA-F] Any of the characters '0'..'9' and any of the letters 'A'..'F' and 'a'..'f'.
Single letter LT [A-Za-z_$] Any of the characters 'A'..'Z', 'a'..'z', and the underscore (_) and dollar sign ($) characters.
Single letter from the International Character Set LT18N [A-Za-z_$\200-\377] Any of the characters 'A'..'Z', 'a'..'z', the underscore (_) and dollar sign ($) characters, and any character in the top half of the 8-bit character set.
Shell 'word' WD [^ \t;\n'"] Any character except space, tab, semicolon (;), linefeed, less than (<), greater than (>), and quotes (' or ").
File name FL [^ \t\n\}\;\>\<] Any character except space, tab, semicolon (;), linefeed, right brace (}), less than (<), greater than (>), and tick (`).
Optional exponent Exponent [eE][+-]?{DG}+ Numbers often allow an optional exponent. It is represented as an 'e' or 'E' followed by an optional plus (+) or minus (-), and then one or more decimal digits.
Whitespace Whitespace [ \t]+ Whitespace is often used to separate two lexemes that would otherwise be misconstrued as a single lexeme. For example, stop in is two keywords, but stopin is an identifier. Apart from this separating property, Whitespace is usually ignored. Whitespace is a sequence of one or more tabs or spaces.
String literal stringChar ([^"\\\n]|([\\]({simpleEscape}|{octalEscape}|{hexEscape}))) Any character except the terminating quote character ("), or a newline (\n). If the character is a backslash (\), it is followed by an escaped sequence of characters.
Character literal charChar ([^'\\\n]|([\\]({simpleEscape}|{octalEscape}|{hexEscape}))) Any character except the terminating quote (') character, or a newline (\n). If the character is a backslash (\), it is followed by an escaped sequence of characters.
Environment variable identifier EID [^ \t\n;='"&\|] Any character except space, tab, linefeed, less-than (<), greater-than (>), semicolon (;), equal sign (=), quotes (' or "), ampersand (&), backslash (\), and bar (|).
Universal character name UCN \\u{HX}{4}|\\U{HX}{8} A universal character name is a backslash (\) followed by either a lowercase 'u' and 4 hexadecimal digits, or an uppercase 'U' and 8 hexadecimal digits.
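
To make a few of these rules concrete, the following sketch transcribes DG, HX, Exponent, UCN, and Whitespace into Python regular expressions. It is only an illustration: the Python names and the helper functions matches and split_lexemes are hypothetical and are not part of the rules above, and Python's re syntax is assumed to be close enough to the lex-style notation for these particular rules.

```python
import re

# Illustrative Python transcriptions of some rules from the table.
DG = r"[0-9]"                        # decimal digit
HX = r"[0-9a-fA-F]"                  # hexadecimal digit
EXPONENT = rf"[eE][+-]?{DG}+"        # optional sign, then one or more digits
UCN = rf"\\u{HX}{{4}}|\\U{HX}{{8}}"  # \uXXXX or \UXXXXXXXX
WHITESPACE = r"[ \t]+"               # one or more spaces or tabs


def matches(rule: str, text: str) -> bool:
    """Return True if the whole of text matches rule (hypothetical helper)."""
    return re.fullmatch(f"(?:{rule})", text) is not None


def split_lexemes(text: str) -> list[str]:
    """Split text on Whitespace, dropping empty pieces (hypothetical helper)."""
    return [piece for piece in re.split(WHITESPACE, text) if piece]


if __name__ == "__main__":
    print(matches(EXPONENT, "e+10"))   # an exponent with an explicit sign
    print(matches(UCN, "\\u00e9"))     # a backslash, 'u', and 4 hex digits
    print(split_lexemes("stop in"))    # two candidate lexemes
    print(split_lexemes("stopin"))     # a single candidate lexeme
```

The split on Whitespace mirrors the separating property described in the table: "stop in" yields two pieces, while "stopin" stays in one piece.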

The escaped sequence of characters can take one of the following three forms:

Concept Rule Representation Description
Simple escape simpleEscape ([A-Za-z'"?#*\\]) One of 'A'-'Z' or 'a'-'z'. Some of these have special meanings, the most common being 'n' for newline and 't' for tab. Can be a quote (' or ") character that does not finish the literal, a question mark (?), a pound sign (#), an asterisk (*), or a backslash (\), which then becomes part of the string literal rather than causing a further escape sequence.
Octal escape octalEscape (OC{1,3}) 1 to 3 octal digits, the combined numeric value of which is the character that becomes part of the string literal.
Hexadecimal escape hexEscape ([xX]HX{1,8}) An 'x' or an 'X' followed by 1 to 8 hexadecimal digits, the combined numeric value of which is the character that becomes part of the string literal.
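
As a rough illustration of how the three escape forms combine with stringChar, the sketch below builds the rules in Python and checks whole string literals. It rests on several assumptions: OC and HX inside the escape rules are read as references to the Octal digit and Hexadecimal digit rules, the STRING_LITERAL pattern (a quote, any number of stringChar matches, a closing quote) is hypothetical and does not appear in the tables, and is_string_literal is an invented helper name.

```python
import re

OC = r"[0-7]"
HX = r"[0-9a-fA-F]"

# The three escape forms, i.e. what may follow a backslash.
SIMPLE_ESCAPE = r"[A-Za-z'\"?#*\\]"
OCTAL_ESCAPE = rf"{OC}{{1,3}}"
HEX_ESCAPE = rf"[xX]{HX}{{1,8}}"

# stringChar: an ordinary character, or a backslash followed by an escape.
STRING_CHAR = (
    rf'(?:[^"\\\n]|\\(?:{SIMPLE_ESCAPE}|{OCTAL_ESCAPE}|{HEX_ESCAPE}))'
)

# Assumption: a complete literal is a run of stringChar between two quotes.
STRING_LITERAL = rf'"{STRING_CHAR}*"'


def is_string_literal(text: str) -> bool:
    """Return True if text is a complete string literal (hypothetical helper)."""
    return re.fullmatch(STRING_LITERAL, text) is not None


if __name__ == "__main__":
    for candidate in ['"hello"', '"tab\\there"', '"\\101\\x41"', '"open']:
        print(repr(candidate), is_string_literal(candidate))
```

The sketch only validates literals. It does not decode escapes into character values, which would also require deciding how an 'x' that matches both simpleEscape and hexEscape should be interpreted.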