Limitations of regex in parsing semantics
Parsing languages is a widely researched subject. Most languages are structured. Linear languages appear in the form of bytecode or stack-based assembly formats, which have the benefit of being streamable. Ideally, we would also like a way to recover structure from a linear string. For this to be remotely possible, certain formalisms are required, at least according to mainstream PL design approaches.
- Context-free grammars and pushdown automata for syntactic analysis
- Attribute grammars or other formalisms for semantic analysis
- Full Turing machines for complete language processing
Could regular expressions achieve this? No.
This is also regarded as obvious and there is a hilarious anecdote on StackOverflow where a question was posed about parsing HTML with regular expressions.
1. Introduction #
Regular expressions constitute a formal language for pattern matching that corresponds to regular languages, the weakest class in the Chomsky hierarchy. We will look into why alternative parsing mechanisms with higher computational capabilities are required for the analysis of hierarchically structured PLs and formal systems.
We make the claim that regular expressions, being equivalent to finite state automata, cannot recognize nested hierarchical structures of arbitrary depth due to their inherent memory limitations. This makes them inadequate for parsing Turing-complete languages, which require at minimum the computational power of a pushdown automaton (PDA) for syntax analysis and of a Turing machine for semantic evaluation.
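To make the memory limitation concrete, here is a sketch in Python (the pattern and names are illustrative, not a standard idiom): a regex without recursive extensions can only match parentheses nested to a depth that is hard-coded into the pattern itself.

```python
import re

# Matches parenthesized expressions nested to depth at most 2:
# each additional level of nesting must be spelled out in the pattern,
# because the automaton has no stack to count with.
DEPTH_2 = re.compile(r"^\((?:[^()]|\([^()]*\))*\)$")

assert DEPTH_2.fullmatch("(a)") is not None        # depth 1: ok
assert DEPTH_2.fullmatch("((a)(b))") is not None   # depth 2: ok
assert DEPTH_2.fullmatch("(((a)))") is None        # depth 3: rejected
```

No single finite pattern of this shape can accept all depths at once; that is precisely the claim formalized below.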
2. Automata theory #
2.1 Formal definitions #
We begin by reviewing existing definitions before our analysis.
Definition 2.1.1 A deterministic finite automaton (DFA) is a 5-tuple M = (Q, Σ, δ, q₀, F) where:
- Q is a finite set of states
- Σ is a finite alphabet
- δ : Q × Σ → Q is the transition function
- q₀ ∈ Q is the initial state
- F ⊆ Q is the set of accepting states
Definition 2.1.2 A nondeterministic finite automaton (NFA) is a 5-tuple N = (Q, Σ, Δ, q₀, F) where:
- Q, Σ, q₀ and F are defined as in the DFA
- Δ : Q × Σ → 𝒫(Q) is the transition relation, where 𝒫(Q) is the power set of Q
Definition 2.1.3 A regular language is a language that can be recognized by a DFA or NFA.
Definition 2.1.4 A regular expression over an alphabet Σ is defined inductively:
- ∅ (empty set) is a regular expression
- ε (empty string) is a regular expression
- For each a ∈ Σ, a is a regular expression
- If r and s are regular expressions, then rs (concatenation), r|s (alternation) and r* (Kleene star) are regular expressions
Theorem 2.1.5 (Kleene's Theorem) A language is regular if and only if it can be described by a regular expression.
2.2 The Pumping Lemma for regular languages #
The Pumping Lemma is a method for proving that certain languages are not regular.
Theorem 2.2.1 (Pumping Lemma) For any regular language L there exists an integer p ≥ 1 (the pumping length) such that every string w ∈ L with |w| ≥ p can be written as w = xyz where:
- |y| ≥ 1
- |xy| ≤ p
- For all i ≥ 0, xyⁱz ∈ L
2.3 Proof of non-regularity for context-free languages #
Theorem 2.3.1 The language L = { aⁿbⁿ : n ≥ 0 } is not regular.
Proof: By contradiction. Assume that L is regular with pumping length p, and consider w = aᵖbᵖ ∈ L.
Since |xy| ≤ p, the substring y consists only of a's. Pumping it to xy²z then produces more a's than b's, a string outside L, contradicting the Pumping Lemma. ∎
This is a quick illustration that even the basic nested structure of matching opening and closing symbols (aⁿ against bⁿ) is beyond regular languages.
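The pumping argument can be checked mechanically. A sketch in Python, assuming a hypothetical pumping length p = 4; the decomposition below is the one forced by |xy| ≤ p, which places y entirely inside the block of a's:

```python
def in_language(w: str) -> bool:
    """Membership test for L = { a^n b^n : n >= 0 }."""
    n = len(w) // 2
    return w == "a" * n + "b" * n

p = 4                      # hypothetical pumping length
w = "a" * p + "b" * p      # w = aaaabbbb, |w| >= p
x, y, z = "", "a", w[1:]   # a split with |xy| <= p and |y| >= 1

assert in_language(x + y + z)          # i = 1: the original string
assert not in_language(x + y * 2 + z)  # i = 2: too many a's
assert not in_language(x + z)          # i = 0: too few a's
```

Every admissible split behaves the same way, which is exactly the contradiction the proof exploits.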
3. Representation of hierarchical structures #
3.1 The Chomsky hierarchy #
The Chomsky hierarchy provides a classification of formal grammars and their corresponding languages.
Definition 3.1.1 A formal grammar is a 4-tuple G = (N, Σ, P, S) where:
- N is a finite set of non-terminal symbols
- Σ is a finite set of terminal symbols, disjoint from N
- P is a finite set of production rules
- S ∈ N is the start symbol
It classifies grammars based on the restrictions on their production rules.
- Type-0 (unrestricted): No restrictions on productions. Recognizable by Turing machines.
- Type-1 (context-sensitive): Productions of the form αAβ → αγβ where A ∈ N and α, β, γ ∈ (N ∪ Σ)* with γ ≠ ε. Recognizable by linear-bounded automata.
- Type-2 (context-free): Productions of the form A → γ where A ∈ N and γ ∈ (N ∪ Σ)*. Recognizable by pushdown automata.
- Type-3 (regular): Productions of the form A → aB or A → a where A, B ∈ N and a ∈ Σ. Recognizable by finite automata.
3.2 Computational discrepancy between regular and context-free #
Theorem 3.2.1 The class of regular languages is strictly contained within the class of context-free languages.
Proof: Every regular language is trivially context-free, since Type-3 grammars are a restricted form of Type-2 grammars. We must show that there exists at least one context-free language that is not regular. The language { aⁿbⁿ : n ≥ 0 } is generated by the context-free grammar S → aSb | ε, yet by Theorem 2.3.1 it is not regular. ∎
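A context-free grammar for this language is S → aSb | ε, and it translates directly into a recursive recognizer; a minimal sketch in Python:

```python
def derives(w: str) -> bool:
    """Recognize L = { a^n b^n } via the grammar S -> aSb | epsilon."""
    if w == "":                     # S -> epsilon
        return True
    if w.startswith("a") and w.endswith("b"):
        return derives(w[1:-1])     # S -> a S b
    return False

assert derives("aaabbb")
assert not derives("aabbb")
```

The recursion depth mirrors the derivation depth, which is the "memory" a finite automaton cannot supply.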
3.3 Recursive nesting and memory requirements #
To reiterate, the limitation of regular expressions is their inability to maintain state beyond a fixed, finite amount. This is separate from the issue that certain grammars cannot even be expressed as regular expressions in the first place.
Theorem 3.3.1 Any language requiring unbounded counting or memory cannot be recognized by a finite automaton.
Proof: Let M be a finite automaton with k states. If a language requires distinguishing unboundedly many counts, then among any k + 1 prefixes with pairwise different counts, two must drive M into the same state. From that state onward M makes identical accept/reject decisions for every suffix, so it cannot separate the two counts, a contradiction. ∎
Another idea is to parse a string line by line with basic production rules, e.g. a Markdown parser with rules applied to every line. A single set of regular expressions could define a rule for matching each line, although this would not account for the entire CommonMark specification, which includes HTML blocks and multiline Markdown constructs!
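A minimal sketch of such a line-oriented classifier in Python; the rule list is hypothetical and covers only a tiny fraction of CommonMark:

```python
import re

# One regex per line-level production; the first match wins.
RULES = [
    ("heading",   re.compile(r"^(#{1,6})\s+(.*)$")),
    ("list_item", re.compile(r"^[-*+]\s+(.*)$")),
    ("blank",     re.compile(r"^\s*$")),
    ("paragraph", re.compile(r"^(.*)$")),   # catch-all
]

def classify(line: str) -> str:
    for name, pattern in RULES:
        if pattern.match(line):
            return name
    return "paragraph"

assert classify("## Title") == "heading"
assert classify("- item") == "list_item"
```

Multi-line constructs such as HTML blocks or fenced code cannot be classified one line at a time like this; the classifier would need state that spans lines, which is where the purely regular approach breaks down.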
For hierarchical structures like balanced parentheses or matched XML tags, the required memory grows with nesting depth. This category of languages is, by definition, inherently unparsable by regular expressions alone.
4. Proof theoretic approach #
4.1 Logical characterization of regular languages #
Is there a way to characterize regular languages?
Regular languages can be characterized using monadic second-order logic (MSO) over strings.
Theorem 4.1.1 (Büchi-Elgot-Trakhtenbrot) A language L ⊆ Σ* is regular if and only if it is definable in MSO logic over strings.
MSO can express local properties and finite counting but cannot express the global matching constraints required for balanced structures.
4.2 Hierarchical structures as inductive types #
From a proof-theoretic perspective, hierarchical structures like trees are naturally represented as inductive types.
Definition 4.2.1 An inductive type is a type specified by a finite set of constructors, each of which may take arguments of the type being defined.
As a quick example, binary trees can be defined in Standard ML as:
datatype BinaryTree = Leaf | Node of BinaryTree * BinaryTree
Regular expressions correspond to finite automata which cannot recursively traverse such structures.
4.3 Curry-Howard correspondence #
Theorem 4.3.1 (Curry-Howard Isomorphism) There is a one-to-one correspondence between:
- Proofs in intuitionistic logic and typed λ-calculus terms
- Propositions and types
- Proof normalization and term evaluation
Type checking in PLs requires validating nested structures and scope rules. Any such verification requires at least context-free power, since it must maintain a type environment and verify that variables are declared before use, a non-regular property.
In both cases, we validate recursive tree structures where child elements must satisfy parent constraints according to compositional rules. XML-like tag matching (i.e. <p>...</p>) is isomorphic to type judgment validation in the STLC, where expression typing follows Γ ⊢ e:τ with nesting requirements that demonstrably exceed the pattern-matching capabilities of regular languages.
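The analogy can be made concrete: matching tags needs exactly the kind of stack that a type checker uses for its environment. A sketch in Python, where the regex merely tokenizes and the stack does the actual matching (names are illustrative):

```python
import re

TAG = re.compile(r"</?([a-zA-Z]+)>")

def well_nested(s: str) -> bool:
    """Check <a>...</a> nesting with an explicit stack.

    The regex only *tokenizes*; the matching itself needs the stack,
    which is precisely what a finite automaton lacks.
    """
    stack = []
    for m in TAG.finditer(s):
        if m.group(0).startswith("</"):
            if not stack or stack.pop() != m.group(1):
                return False        # close with no matching open
        else:
            stack.append(m.group(1))
    return not stack                # every open tag was closed

assert well_nested("<p><b>hi</b></p>")
assert not well_nested("<p><b>hi</p></b>")
```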
5. Language recognition #
5.1 Typed λ-calculus and PLs #
Most PLs have features that cannot be recognized by regular expressions.
Definition 5.1.1 The simply typed λ-calculus (STLC) has the following syntax:
- Types: τ ::= ι | τ → τ
- Terms: e ::= x | λx:τ. e | e e
The problem of type checking, determining whether a term e has type τ in a context Γ, requires traversing the recursive structure of e and is therefore beyond what regular languages can express.
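The judgment Γ ⊢ e:τ is decided by structural recursion over the term. A minimal sketch in Python, with an ad hoc tuple encoding of terms and types (the encoding is an assumption made for this example, not a standard one):

```python
def typecheck(env, e):
    """Check Γ ⊢ e : τ for the simply typed λ-calculus.

    Terms: ("var", x) | ("lam", x, t, body) | ("app", f, arg)
    Types: "iota" | ("arrow", t1, t2)
    """
    kind = e[0]
    if kind == "var":
        return env[e[1]]                       # look x up in Γ
    if kind == "lam":
        _, x, t, body = e
        bt = typecheck({**env, x: t}, body)    # extend Γ with x:τ
        return ("arrow", t, bt)
    if kind == "app":
        ft = typecheck(env, e[1])
        at = typecheck(env, e[2])
        assert ft[0] == "arrow" and ft[1] == at, "type mismatch"
        return ft[2]

# λx:ι. x  has type  ι → ι
assert typecheck({}, ("lam", "x", "iota", ("var", "x"))) == ("arrow", "iota", "iota")
```

Note that the checker recurses over an already-built tree: the nesting the parser had to recover is the very structure the type rules traverse.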
General recursion in Turing-complete languages is supported through fixed-point combinators.
5.2 Recursion and fixed-point combinators #
Definition 5.2.1 The Y combinator is defined as Y = λf. (λx. f (x x)) (λx. f (x x)).
For any function f, Y f = f (Y f), i.e. Y f is a fixed point of f.
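In a strict language the Y combinator as written diverges, so implementations use its call-by-value variant, the Z combinator. A sketch in Python:

```python
# Z = λf. (λx. f (λv. x x v)) (λx. f (λv. x x v))
# The eta-expansion (λv. x x v) delays evaluation under Python's
# strict semantics, where plain Y would recurse forever.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Factorial defined without any self-reference:
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))

assert fact(5) == 120
```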
5.3 Formal proof of parsing requirements #
Theorem 5.3.1 Parsing a Turing-complete language requires at minimum a pushdown automaton.
Proof: Consider a simple Turing-complete language with function definitions and calls. To correctly parse this language, we require the following:
- Match function declarations with their corresponding bodies
- Ensure variables are used within their corresponding scope
- Verify that function calls correspond to declared functions
Let us focus on the specific problem of checking balanced nested function calls, i.e. strings of the form f(f(…f(x)…)).
This language contains the subset of strings made of n copies of "f(", then "x", then n copies of ")", which is isomorphic to { aⁿbⁿ : n ≥ 0 } and therefore not regular by Theorem 2.3.1. A pushdown automaton, by contrast, recognizes it by pushing a symbol for each "f(" and popping one for each ")", so a PDA is both necessary and sufficient for this fragment. ∎
6. PL constructs #
6.1 Balanced parentheses and nested blocks #
Theorem 6.1.1 The language of balanced parentheses (the Dyck language) is not regular.
Proof: This follows from a direct application of the Pumping Lemma, similar to the proof in Theorem 2.3.1.
Nearly the same proofs apply to nested code blocks, XML tags and other hierarchical constructs.
6.2 Recursive descent parsing with a stack #
Theorem 6.2.1 Parsing nested structures of unbounded depth requires a stack of unbounded size.
Proof: Consider the Dyck language of balanced parentheses. After reading a prefix of n opening parentheses, a parser must remember n exactly in order to accept precisely n closing parentheses. Since n is unbounded, infinitely many configurations must be distinguished, which no fixed-size memory can do; an unbounded stack (or equivalent) is required. ∎
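Because plain parentheses have a single bracket type, the stack degenerates to a counter, but the counter is still unbounded. A sketch in Python that recognizes balanced strings and reports the peak depth reached:

```python
def max_depth(w: str) -> int:
    """Recognize balanced parentheses, returning the peak nesting depth.

    Raises ValueError on unbalanced input. The counter plays the role
    of the stack: its maximum value grows with nesting depth, so no
    fixed amount of memory suffices for all inputs.
    """
    depth, peak = 0, 0
    for c in w:
        if c == "(":
            depth += 1
            peak = max(peak, depth)
        elif c == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced")
    if depth != 0:
        raise ValueError("unbalanced")
    return peak

assert max_depth("(()(()))") == 3
```

With several bracket types (as in real grammars) the counter no longer suffices and a genuine stack of symbols is needed, as in the tag-matching example earlier.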
7. Context-sensitive and recursive features #
Parser generators like LALRPOP implement deterministic algorithms such as LALR(1) that handle context-free languages by constructing pushdown automata with fixed lookahead. This offers a massive performance improvement over recursive descent parsing by pre-computing parse tables and avoiding backtracking, yielding linear-time parsing.
PEG (Parsing Expression Grammar) parsers represent an alternative approach that works well for unambiguous parsing. Unlike CFG parsers, PEG uses unlimited lookahead and ordered choice, which eliminates many cases of ambiguity in patterns that would require semantic predicates in LALR, although recursive PEG implementations can easily run into stack depth issues.
The cost of intuitive grammar composition is memory.
Definition 7.1 A context-sensitive feature requires knowledge of the surrounding context beyond trivial nesting structure.
Examples include:
- Variable declaration before use
- Type checking with polymorphism
Theorem 7.2 Complete Turing-complete language processing requires the power of a Turing machine.
Proof: By definition, a Turing-complete language can encode any computable function. If a strictly weaker computational model could process such a language completely, it would thereby compute functions beyond its own theoretical capabilities, a contradiction. ∎
8. Language recognition with categorical semantics #
8.1 Algebraic structure of regular languages #
From a category-theoretic perspective, regular languages can be characterized as the free Kleene algebra generated by an alphabet.
Definition 8.1.1 A Kleene algebra is a structure (K, +, ·, *, 0, 1) where:
- (K, +, 0) is a commutative, idempotent monoid
- (K, ·, 1) is a monoid
- Multiplication distributes over addition
- * is the Kleene star, with a* the least solution of 1 + a·x ≤ x
Regular expressions form the free Kleene algebra generated by the alphabet Σ.
Definition 8.1.2 An algebraic theory consists of operations and equations specifying how these operations interact.
A wordier way of saying this can be found on Wikipedia: an algebraic theory is a theory that uses axioms stated entirely in terms of equations between terms with free variables. Context-free languages, in turn, can be characterized using algebraic theories with recursively defined operations.
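Languages over an alphabet themselves form a Kleene algebra, with union as +, concatenation as · and the Kleene star as *. A Python sketch; the star is truncated to a maximum word length only so the sets stay finite:

```python
def concat(A, B):
    """Language concatenation: A · B = { ab : a in A, b in B }."""
    return {a + b for a in A for b in B}

def star(A, max_len):
    """Kleene star of a language, truncated to words of length <= max_len
    (the truncation is only so the set stays finite)."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in concat(frontier, A)
                    if len(w) <= max_len} - result
        result |= frontier
    return result

# (a + b)* up to length 2: ε, a, b, aa, ab, ba, bb
assert star({"a"} | {"b"}, 2) == {"", "a", "b", "aa", "ab", "ba", "bb"}
```

The empty language plays 0, {ε} plays 1, and the Kleene algebra equations (distributivity, 1 + a·a* ≤ a*, and so on) hold for these set operations.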
References #
- Wikipedia — Büchi–Elgot–Trakhtenbrot theorem, Curry–Howard correspondence
- nLab — algebraic theory
- Pierce, B. C. (2002). Types and Programming Languages.
- Hopcroft, J. E., Motwani, R. & Ullman, J. D. (2013). Introduction to Automata Theory, Languages, and Computation. Pearson.
- Kozen, D. (1994). A completeness theorem for Kleene algebras and the algebra of regular events. Cornell University.