elric neumann

Limitations of regex in parsing semantics

Parsing languages is a widely researched subject. Most languages are structured. Linear languages appear in the form of bytecode or stack-based assembly formats, which have the benefit of being streamable. Ideally, we would also want a way to recover structure from a string. For this to be remotely possible, some formalism is required, at least according to mainstream PL design approaches.

Could regular expressions achieve this? No.

This is widely regarded as obvious, and there is a hilarious anecdote on StackOverflow where a question was posed about parsing HTML with regular expressions.
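The failure is easy to reproduce. As a quick illustration (plain Python with the standard re module; the pattern is deliberately naive), a non-greedy regex pairs an outer opening tag with the nearest closing tag:

```python
import re

html = "<div><div>inner</div></div>"

# Naive approach: capture everything between <div> and the nearest </div>.
match = re.search(r"<div>(.*?)</div>", html)

# The outer <div> gets paired with the *inner* </div>, because a finite
# automaton cannot count nesting depth.
print(match.group(1))  # → <div>inner
```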

1. Introduction #

Regular expressions constitute a formal language for pattern matching that corresponds to regular languages, the weakest class in the Chomsky hierarchy. We will look into why alternative parsing mechanisms with higher computational capabilities are required for the analysis of hierarchically structured PLs and formal systems.

We make the claim that regular expressions, being equivalent to finite state automata, cannot recognize nested hierarchical structures of arbitrary depth due to their inherent memory limitations. This makes them inadequate for parsing Turing-complete languages, which require at minimum the computational power of a pushdown automaton (PDA) for syntax analysis and a Turing machine for semantic evaluation.

2. Automata theory #

2.1 Formal definitions #

We begin by reviewing existing definitions before our analysis.

Definition 2.1.1 A deterministic finite automaton (DFA) is a 5-tuple M = (Q, Σ, δ, q₀, F) where:

  1. Q is a finite set of states
  2. Σ is a finite input alphabet
  3. δ : Q × Σ → Q is the transition function
  4. q₀ ∈ Q is the initial state
  5. F ⊆ Q is the set of accepting states

Definition 2.1.2 A nondeterministic finite automaton (NFA) is a 5-tuple M = (Q, Σ, Δ, q₀, F) where:

  1. Q is a finite set of states
  2. Σ is a finite input alphabet
  3. Δ : Q × (Σ ∪ {ε}) → P(Q) is the transition function, mapping to sets of states
  4. q₀ ∈ Q is the initial state
  5. F ⊆ Q is the set of accepting states

Definition 2.1.3 A regular language is a language that can be recognized by a DFA or NFA.

Definition 2.1.4 A regular expression over an alphabet Σ is inductively defined as:

  1. ∅ and ε are regular expressions
  2. a is a regular expression for every a ∈ Σ
  3. If r and s are regular expressions, then so are (r|s), (rs), and (r*)

Theorem 2.1.5 (Kleene's Theorem) A language is regular if and only if it can be described by a regular expression.

2.2 The Pumping Lemma for regular languages #

The Pumping Lemma is a method for proving that certain languages are not regular.

Theorem 2.2.1 (Pumping Lemma) For any regular language L, there exists a constant p ≥ 1 (referred to as the pumping length) such that for all strings s ∈ L with |s| ≥ p, there exists a decomposition s = xyz such that the following holds.

  1. |y| > 0
  2. |xy| ≤ p
  3. For all i ≥ 0, xy^i z ∈ L

2.3 Proof of non-regularity for context-free languages #

Theorem 2.3.1 The language L = {a^n b^n | n ≥ 0} is not regular.

Proof: By contradiction. Assume that L is regular. By the Pumping Lemma, there exists a pumping length p ≥ 1. Consider the string s = a^p b^p ∈ L. Since |s| ≥ p, the Pumping Lemma applies and there exists a decomposition s = xyz with the properties stated in Theorem 2.2.1.

Since |xy| ≤ p, the substring xy must consist entirely of a's. Let xy = a^k for some k > 0 and consequently z = a^(p−k) b^p. By the Pumping Lemma, xy^2 z = a^(k+|y|) a^(p−k) b^p = a^(p+|y|) b^p ∈ L. However, this string contains unequal numbers of a's and b's, which contradicts the definition of L. Hence, L is not regular.

Illustration of Theorem 2.3.1

This is a quick illustration that even the basic nested structure of matching opening and closing symbols (a^n b^n) exceeds the parsing capacity of regular expressions. In PLs, this is commonly known as the bracket balancing problem. We can trivially implement this using a stack without any issues but then again, that wouldn't be a regular language approach.
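The stack-based check really is a few lines. A minimal sketch in Python (assuming only round, square, and curly brackets matter):

```python
def balanced(s: str) -> bool:
    """Check bracket balance with an explicit stack (a PDA in disguise)."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)          # push every opener
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False          # closer with no matching opener
    return not stack                  # leftover openers mean imbalance

print(balanced("([]{})"))  # True
print(balanced("([)]"))    # False
```

The stack grows with nesting depth, which is exactly the unbounded memory a finite automaton lacks.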

3. Representation of hierarchical structures #

3.1 The Chomsky hierarchy #

The Chomsky hierarchy provides a classification of formal grammars and their corresponding languages.

Definition 3.1.1 A formal grammar is a 4-tuple G = (N, Σ, P, S) where:

  1. N is a finite set of nonterminal symbols
  2. Σ is a finite set of terminal symbols, disjoint from N
  3. P is a finite set of production rules
  4. S ∈ N is the start symbol

It classifies grammars based on the restrictions on their production rules.

  1. Type-0 (unrestricted): No restrictions on productions. Recognizable by Turing machines.
  2. Type-1 (context-sensitive): Productions of the form αAβ → αγβ where A ∈ N and α, β, γ ∈ (N ∪ Σ)* with γ ≠ ε. Recognizable by linear-bounded automata.
  3. Type-2 (context-free): Productions of the form A → γ where A ∈ N and γ ∈ (N ∪ Σ)*. Recognizable by pushdown automata.
  4. Type-3 (regular): Productions of the form A → aB or A → a where A, B ∈ N and a ∈ Σ ∪ {ε}. Recognizable by finite automata.

3.2 Computational discrepancy between regular and context-free #

Theorem 3.2.1 The class of regular languages is strictly contained within the class of context-free languages.

Proof: Every regular language is trivially context-free since Type-3 grammars are a restricted form of Type-2 grammars. We must show that there exists at least one context-free language that is not regular. The language L = {a^n b^n | n ≥ 0} is context-free, i.e. it can be generated by the grammar S → aSb | ε, but it is not regular according to Theorem 2.3.1. Therefore, the containment is strict.
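The grammar S → aSb | ε translates directly into a recursive-descent recognizer, with the call stack playing the role of the PDA's stack. A sketch in Python:

```python
def recognize(s: str) -> bool:
    """Recognize {a^n b^n | n >= 0} via the grammar S -> aSb | epsilon."""
    def S(i: int) -> int:
        # Alternative S -> a S b: consume 'a', recurse, then demand 'b'.
        if i < len(s) and s[i] == "a":
            j = S(i + 1)
            if j < len(s) and s[j] == "b":
                return j + 1
        # Alternative S -> epsilon: consume nothing.
        return i

    return S(0) == len(s)

print(recognize("aabb"))  # True
print(recognize("aab"))   # False
```

The greedy fallback to ε is safe here because the grammar is unambiguous.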

3.3 Recursive nesting and memory requirements #

To reiterate, the limitation of regular expressions is their inability to maintain state beyond a fixed, finite amount. This is separate from the issue that certain grammars cannot be represented by regular expressions at all.

Theorem 3.3.1 Any language requiring unbounded counting or memory cannot be recognized by a finite automaton.

Proof: Let M = (Q, Σ, δ, q₀, F) be a DFA with |Q| = k states. Consider an input requiring the DFA to remember more than k distinct pieces of information. By the pigeonhole principle, at least two different information states must map to the same DFA state. Once these configurations map to the same state, the DFA has effectively forgotten the distinction between them, and all subsequent computation paths are identical regardless of which original information state led to this point. Thus, no DFA can maintain unbounded memory.

Another idea is to parse a string line by line and apply basic production rules, e.g. a Markdown parser with rules applied to every line. A handful of regular expressions could define the rules for matching against each line, although this would not account for the entire CommonMark specification, which includes HTML blocks and multiline Markdown constructs!
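A minimal sketch of that line-oriented strategy in Python; the rule names and the rule set are illustrative and nowhere near full CommonMark:

```python
import re

# One regex rule per line class; first match wins (ordered rules).
RULES = [
    ("heading", re.compile(r"^(#{1,6})\s+(.*)$")),
    ("list_item", re.compile(r"^[-*]\s+(.*)$")),
    ("blank", re.compile(r"^\s*$")),
    ("paragraph", re.compile(r"^(.*)$")),  # catch-all
]

def classify(line: str) -> str:
    for name, pattern in RULES:
        if pattern.match(line):
            return name
    return "paragraph"

doc = "# Title\n- item\n\nplain text"
print([classify(line) for line in doc.splitlines()])
# → ['heading', 'list_item', 'blank', 'paragraph']
```

Each line is classified independently, which is precisely why this approach cannot see multiline or nested constructs.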

For hierarchical structures like balanced parentheses or matched XML tags, the required memory grows with nesting depth. This category of languages is, by definition, inherently unparsable by regular expressions alone.

4. Proof theoretic approach #

4.1 Logical characterization of regular languages #

Is there a way to characterize regular languages?

Regular languages can be characterized using monadic second-order logic (MSO) over strings.

Theorem 4.1.1 (Büchi-Elgot-Trakhtenbrot) A language L is regular if and only if it is definable in MSO logic over strings.

MSO can express local properties and finite counting but cannot express the global matching constraints required for balanced structures.

4.2 Hierarchical structures as inductive types #

From a proof-theoretic perspective, hierarchical structures like trees are naturally represented as inductive types.

Definition 4.2.1 An inductive type T is defined by constructors c_1, c_2, …, c_n where each constructor has a type signature c_i : τ_1 × τ_2 × … × τ_{m_i} → T with each τ_j being either T or some other type.

As a quick example, binary trees can be defined in Standard ML as:

datatype BinaryTree = Leaf | Node of BinaryTree * BinaryTree

Regular expressions correspond to finite automata which cannot recursively traverse such structures.

4.3 Curry-Howard correspondence #

Theorem 4.3.1 (Curry-Howard Isomorphism) There is a one-to-one correspondence between:

  1. Propositions and types
  2. Proofs and programs
  3. Proof normalization and program evaluation

Type checking in PLs requires validating nested structures and scope rules. Any such verification necessitates at minimum a CFG since it must maintain a type environment and verify that variables are declared before use—a non-regular property.

In both cases, we validate recursive tree structures where child elements must satisfy parent constraints according to compositional rules. XML-like tag matching (i.e. <p>...</p>) is isomorphic to type judgment validation in the STLC, where expression typing follows Γ ⊢ e : τ with nesting requirements that demonstrably exceed the pattern-matching capabilities of regular languages.

5. Language recognition #

5.1 Typed λ-Calculus and PLs #

Most PLs have features that cannot be recognized by regular expressions.

Definition 5.1.1 The simply typed λ-calculus is defined by the grammar:

  t ::= x | λx:τ. t | t t
  τ ::= ι | τ → τ

The problem of type checking—determining if a term t has type τ—requires maintaining a typing context and tracking variable binding. This is beyond the design of pattern matching in strings, even with lookaheads and lookbehinds.
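To make the contrast concrete, here is a toy checker for a fragment of the STLC in Python. Terms and types are encoded as tuples; the encoding and names are ad hoc assumptions for illustration. The environment `env` is exactly the unbounded context that no regular expression can carry:

```python
# Types: "int" or ("fun", arg_type, ret_type)
# Terms: ("lit", n) | ("var", x) | ("lam", x, arg_type, body) | ("app", f, a)

def typecheck(term, env=None):
    env = env or {}
    tag = term[0]
    if tag == "lit":
        return "int"
    if tag == "var":
        return env[term[1]]                      # scope lookup: needs memory
    if tag == "lam":
        _, x, arg_ty, body = term
        body_ty = typecheck(body, {**env, x: arg_ty})
        return ("fun", arg_ty, body_ty)
    if tag == "app":
        f_ty = typecheck(term[1], env)
        a_ty = typecheck(term[2], env)
        assert f_ty[0] == "fun" and f_ty[1] == a_ty, "type error"
        return f_ty[2]
    raise ValueError(tag)

# (λx:int. x) 3  has type  int
ident = ("lam", "x", "int", ("var", "x"))
print(typecheck(("app", ident, ("lit", 3))))  # → int
```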

General recursion in Turing-complete languages is supported through fixed-point combinators.

5.2 Recursion and fixed-point combinators #

Definition 5.2.1 The Y combinator is defined as Y = λf.(λx. f (x x)) (λx. f (x x))

For any function f, Y f is a fixed point of f, i.e. Y f = f (Y f). The Y combinator makes it possible to define recursive functions without explicit recursion in the language. The semantics of such recursive definitions require a pushdown automaton at minimum to evaluate correctly, since function calls must be nested and returned in the correct order.
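In a strict language the Y combinator as written diverges, so the usual move is the eta-expanded Z combinator. A Python sketch (the names are illustrative):

```python
# Z = λf.(λx. f (λv. x x v)) (λx. f (λv. x x v)), the call-by-value variant of Y.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# A non-recursive "step" function whose fixed point is factorial.
fact_step = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

factorial = Z(fact_step)
print(factorial(5))  # → 120
```

Note that `fact_step` never refers to itself; all recursion is supplied by Z.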

5.3 Formal proof of parsing requirements #

Theorem 5.3.1 Parsing a Turing-complete language requires at minimum a pushdown automaton.

Proof: Consider a simple Turing-complete language with function definitions and calls. To correctly parse this language, we require the following:

  1. Match function declarations with their corresponding bodies
  2. Variables must be used within their corresponding scope
  3. Verify that function calls correspond to declared functions

Let us focus on the specific problem of checking balanced nested function calls, i.e. L = {s | the function calls in s are properly nested}.

This language contains the subset L′ = {f^n g^n | n ≥ 0}, where f represents a function call and g represents a return. By a proof analogous to Theorem 2.3.1, L′ is not regular. Since regular languages are closed under intersection and L′ = L ∩ f*g*, L cannot be regular either. Therefore, parsing a Turing-complete language requires at minimum a pushdown automaton.

6. PL constructs #

6.1 Balanced parentheses and nested blocks #

Theorem 6.1.1 The language of balanced parentheses L = {w ∈ {(, )}* | the parentheses in w are balanced} is not regular.

Proof: This follows from a direct application of the Pumping Lemma, similar to the proof in Theorem 2.3.1.

Nearly the same proofs apply to nested code blocks, XML tags and other hierarchical constructs.

6.2 Recursive descent parsing with a stack #

Theorem 6.2.1 Parsing nested structures of unbounded depth requires a stack of unbounded size.

Proof: Consider the language L = {a^n b^n | n ≥ 0}. To recognize strings in this language, a parser must count the number of a's and then verify an equal number of b's. For strings with arbitrarily large n, the parser would have to maintain a count that can grow without bound, i.e. unbounded memory. A finite automaton has bounded memory, so it cannot recognize L.

7. Context-sensitive and recursive features #

Parser generators like LALRPOP implement deterministic algorithms such as LALR(1) that handle context-free languages by constructing pushdown automata with fixed lookahead. They offer a massive performance advantage over recursive descent parsing by pre-computing parse tables and avoiding backtracking, achieving O(n) complexity (or at least near-linear in practical cases).

PEG (Parsing Expression Grammar) parsers represent an alternative approach that works well for unambiguous parsing. Unlike CFG parsers, PEG uses unlimited lookahead and ordered choice, i.e. it eliminates many cases of ambiguity in patterns that would require semantic predicates in LALR, although it can easily run into stack depth issues.

The cost of an intuitive grammar composition is memory.
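A PEG for balanced parentheses, B ← '(' B ')' B / ε, can be sketched directly as a recursive function in Python, with the recursion depth standing in for the memory cost mentioned above (an illustrative recognizer, not a packrat implementation):

```python
def B(s: str, i: int = 0) -> int:
    """PEG rule B <- '(' B ')' B / epsilon; returns the new input position."""
    if i < len(s) and s[i] == "(":
        j = B(s, i + 1)
        if j < len(s) and s[j] == ")":
            return B(s, j + 1)   # first alternative succeeded
    return i                     # ordered choice falls back to epsilon

def is_balanced(s: str) -> bool:
    return B(s) == len(s)

print(is_balanced("(())()"))  # True
print(is_balanced("(()"))     # False
```

Ordered choice means the ε alternative is tried only after '(' B ')' B fails, so the grammar is deterministic by construction.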

Definition 7.1 A context-sensitive feature requires knowledge of the surrounding context beyond trivial nesting structure.

Examples include:

  1. Declare-before-use rules for variables and functions
  2. The typedef-name ambiguity in C, where parsing depends on earlier declarations
  3. Indentation-sensitive block structure, as in Python

Theorem 7.2 Complete Turing-complete language processing requires the power of a Turing machine.

Proof: By definition, a Turing-complete language can encode any computable function. If a weaker computational model could completely process such a language, including evaluating its semantics, that model would be computing functions beyond its theoretical capabilities, contradicting the Church-Turing thesis.

8. Language recognition with categorical semantics #

8.1 Algebraic structure of regular languages #

From a category-theoretic perspective, regular languages can be characterized as the free Kleene algebra generated by an alphabet.

Definition 8.1.1 A Kleene algebra is a structure (K, +, ·, *, 0, 1) where:

  1. (K, +, 0) is an idempotent commutative monoid
  2. (K, ·, 1) is a monoid, with · distributing over + and 0 annihilating ·
  3. * satisfies 1 + a·a* ≤ a* and 1 + a*·a ≤ a*, together with the least-fixed-point rules (where a ≤ b means a + b = b)

Regular expressions form the free Kleene algebra generated by the alphabet Σ, an algebraic structure that lacks any mechanism for general recursion.

Definition 8.1.2 An algebraic theory consists of operations and equations specifying how these operations interact.

A wordier way of saying this can be found on Wikipedia: an algebraic theory is a theory that uses axioms stated entirely in terms of equations between terms with free variables. Context-free languages can accordingly be characterized using algebraic theories with recursively defined operations.
