Common Elements of Lexemes

The following tables list common elements of lexemes.

Concept	Rule	Representation	Description
Decimal digit	DG	[0-9]	One character from '0'..'9'.
Octal digit	OC	[0-7]	One character from '0'..'7'.
Hexadecimal digit	HX	[0-9a-fA-F]	One character from '0'..'9', 'A'..'F', or 'a'..'f'.
Single letter	LT	[A-Za-z_$]	Any of the characters 'A'..'Z', 'a'..'z', and the underscore (_) and dollar sign ($) characters.
Single letter from the International Character Set	LT18N	[A-Za-z_$\200-\377]	Any of the characters 'A'..'Z', 'a'..'z', the underscore (_) and dollar sign ($) characters, and any character in the top half of the 8-bit character set.
Shell 'word'	WD	[^ \t;\n'"<>]	Any character except space, tab, semicolon (;), linefeed, less than (<), greater than (>), and quotes (' or ").
File name	FL	[^ \t\n\}\;\>\<\`]	Any character except space, tab, semicolon (;), linefeed, right brace (}), less than (<), greater than (>), and tick (`).
Optional exponent	Exponent	[eE][+-]?{DG}+	Numbers often allow an optional exponent. It is represented as an 'e' or 'E' followed by an optional plus (+) or minus (-), and then one or more decimal digits.
Whitespace	Whitespace	[ \t]+	Whitespace is often used to separate two lexemes that would otherwise be misconstrued as a single lexeme. For example, stop in is two keywords, but stopin is an identifier. Apart from this separating property, Whitespace is usually ignored. Whitespace is a sequence of one or more tabs or spaces.
String literal	stringChar	([^"\\\n]|([\\]({simpleEscape}|{octalEscape}|{hexEscape})))	Any character except the terminating quote character ("), a backslash (\), or a newline (\n); or a backslash (\) followed by an escape sequence.
Character literal	charChar	([^'\\\n]|([\\]({simpleEscape}|{octalEscape}|{hexEscape})))	Any character except the terminating quote character ('), a backslash (\), or a newline (\n); or a backslash (\) followed by an escape sequence.
Environment variable identifier	EID	[^ \t\n<>;='"&\\|]	Any character except space, tab, linefeed, less than (<), greater than (>), semicolon (;), equal sign (=), quotes (' or "), ampersand (&), backslash (\), and bar (|).
Universal character name	UCN	\\u{HX}{4}|\\U{HX}{8}	A universal character name is a backslash (\) followed by either a lowercase 'u' and 4 hexadecimal digits, or an uppercase 'U' and 8 hexadecimal digits.
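
As an informal illustration, a few of the rules above can be translated into Python regular expressions. This is only a sketch: the Python variable names and the helper `matches` are assumptions made for the example, and lex-style `{RULE}` references are expanded by plain string substitution.

```python
import re

# Patterns copied from the table; the Python names are illustrative only.
DG = r"[0-9]"
HX = r"[0-9a-fA-F]"
EXPONENT = rf"[eE][+-]?{DG}+"          # [eE][+-]?{DG}+
WHITESPACE = r"[ \t]+"
UCN = rf"\\u{HX}{{4}}|\\U{HX}{{8}}"    # \\u{HX}{4}|\\U{HX}{8}


def matches(rule: str, text: str) -> bool:
    """Return True if the whole of text matches the rule."""
    return re.fullmatch(rule, text) is not None


# An exponent needs at least one digit after the optional sign.
print(matches(EXPONENT, "e+10"), matches(EXPONENT, "e+"))

# Whitespace separates lexemes: "stop in" splits in two, "stopin" does not.
print(re.split(WHITESPACE, "stop in"), re.split(WHITESPACE, "stopin"))
```

Note that Python's `re` module tries alternatives from left to right, whereas lex-style scanners prefer the longest match, so a direct translation may behave differently for rules whose alternatives overlap.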

The escape sequence can take one of the following three forms:

Concept	Rule	Representation	Description
Simple escape	simpleEscape	([A-Za-z'"?#*\\])	One letter from 'A'..'Z' or 'a'..'z', some of which have special meanings, the most common being 'n' for newline and 't' for tab. Alternatively, a quote (' or ") that does not terminate the literal, a question mark (?), a pound sign (#), an asterisk (*), or a backslash (\); the character then becomes part of the literal rather than starting a further escape sequence.
Octal escape	octalEscape	({OC}{1,3})	1 to 3 octal digits, whose combined numeric value gives the character that becomes part of the literal.
Hexadecimal escape	hexEscape	([xX]{HX}{1,8})	An 'x' or an 'X' followed by 1 to 8 hexadecimal digits, whose combined numeric value gives the character that becomes part of the literal.
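
The way the three escape forms combine with the stringChar rule can be sketched in Python. This is an illustration under stated assumptions, not the scanner's actual implementation: the names `classify_escape` and `STRING_LITERAL` are hypothetical, and the surrounding double quotes are added here only to test complete literals.

```python
import re
from typing import Optional

OC = r"[0-7]"
HX = r"[0-9a-fA-F]"
SIMPLE_ESCAPE = r"""[A-Za-z'"?#*\\]"""
OCTAL_ESCAPE = rf"{OC}{{1,3}}"
HEX_ESCAPE = rf"[xX]{HX}{{1,8}}"

# A backslash followed by one of the three forms. Hex is listed first so
# that "\x41" is read as one hex escape rather than the simple escape "\x".
ESCAPE = rf"\\(?:{HEX_ESCAPE}|{OCTAL_ESCAPE}|{SIMPLE_ESCAPE})"

# stringChar: an ordinary character or an escape sequence.
STRING_CHAR = rf'(?:[^"\\\n]|{ESCAPE})'

# Hypothetical complete literal: zero or more stringChars between quotes.
STRING_LITERAL = re.compile(rf'"{STRING_CHAR}*"')


def classify_escape(body: str) -> Optional[str]:
    """Classify the text that follows the backslash, or return None."""
    for name, rule in (("hex", HEX_ESCAPE),
                       ("octal", OCTAL_ESCAPE),
                       ("simple", SIMPLE_ESCAPE)):
        if re.fullmatch(rule, body):
            return name
    return None


print(classify_escape("n"), classify_escape("101"), classify_escape("x41"))
print(STRING_LITERAL.fullmatch('"a\\tb"') is not None)
```

A backslash followed by a character outside all three forms, such as the digit 8, matches none of the rules, so a literal containing it is rejected by this sketch.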