LLMs: Emergent Interpretability, Semantic Recursion, and Field-Theoretic Modeling of Transformer Behavior

Table of Contents

Part I — Framing the Phenomenon

  1. Introduction: What LLMs Actually Are

    • Not symbolic systems, not databases

    • Statistical fields under compression

    • Why interpretability fails by default

  2. The Compression Imperative

    • Pretraining as pressure

    • Compression as the source of structure

    • From prediction to cognition

  3. Recursion as the Engine of Meaning

    • Self-reference, abstraction, hierarchy

    • Recursion vs. repetition

    • The paradox: recursion enables and destroys


Part II — Emergence and Collapse

  1. Emergent Structure in Transformer Dynamics

    • Feature formation without design

    • Polysemantic neurons and attractor fields

    • Collapse events: concept loss, semantic flattening

  2. Semantic Drift and the Fragility of Meaning

    • What happens when concepts move

    • Bifurcation, merging, and conceptual aliasing

    • The cost of fluid representations

  3. Phase Transitions in Learning

    • The two-phase fallacy

    • Temporal aliasing and perceptual snapshots

    • When is a concept really “formed”?


Part III — Interpretability as Projection

  1. The Mirage of Post-Hoc Analysis

    • Why most XAI methods are epistemically hollow

    • Attribution ≠ understanding

    • Sparsity ≠ interpretability

  2. The Lens Effect: Tools That Define What Can Be Seen

    • Critical reading of “Evolution of Concepts in Language Model Pre-Training”

    • Method as imposition, not observation

    • What gets filtered, what gets invented

  3. The Write-Only Memory Problem

    • LLMs as non-introspective fields

    • Why you can’t ask a model what it knows

    • Reading the surface ≠ knowing the structure


Part IV — Beyond Interpretability

  1. Towards Field-Theoretic Models of Transformer Semantics

    • Semantic manifolds and curvature

    • Latent geometry, tension, and flow

    • Modeling concepts as dynamic field states

  2. Resonance, Collapse, and Conceptual Tension

    • The topological structure of abstraction

    • Measuring meaning by resistance to flattening

    • Compression fatigue and epistemic entropy

  3. Designing for Transparent Emergence

    • Beyond explainability: generative interpretability

    • Building models with measurable curvature

    • Constraints, telic fields, and identity preservation


Part V — Limits and Futures

  1. Why We Can’t Engineer Understanding

    • Scaling vs. designing

    • Curation as the only real control

    • The mirage of modular cognition

  2. Collapse, Reflection, and Recursive Limits

    • When recursion devours meaning

    • When fields flatten

    • When concepts fail to survive

  3. Post-Interpretability Systems

    • Semantic fatigue metrics

    • Curvature-aware training scaffolds

    • Toward systems that know what they know


Appendices

  • A. Glossary of Semantic Field Theory in Transformers

  • B. Formal Models: Recursive Collapse, Concept Drift, Semantic Tension

  • C. Red-Team Critiques of Current Interpretability Research

  • D. Suggested Evaluation Frameworks for Concept Stability

  • E. Visual Maps: From Compression to Collapse


Chapter 1: What LLMs Actually Are

The dominant metaphors for understanding large language models (LLMs) — "databases," "black boxes," "reasoners" — are all misleading. LLMs are none of these. They are not symbolic systems. They are not repositories of facts. They are not agents. They are semantic compression fields — high-dimensional energy wells formed by optimizing next-token prediction across statistical chaos.

What LLMs Are Not:

  • Not databases: There's no lookup table or fact storage. The model doesn't “remember” a training document — it reconstructs based on distributed compression of linguistic structure.

  • Not symbolic reasoners: There's no logic tree or rule engine inside. Logic-like behavior is emergent, not hard-coded. The model cannot introspect or validate symbolic chains unless such behavior was embedded in the training distribution.

  • Not finite-state machines: There are no discrete states. The system is continuous, high-dimensional, and fluid. Its behavior changes depending on where you "poke" it in activation space.

What LLMs Actually Are:

  • Semantic field simulators: Inputs trigger paths through a shaped field of representations. What you get out is not a "response" but a projection from that path's local curvature in representation space.

  • Recursive statistical condensates: By training to compress token sequences, the model condenses linguistic structures — not by design, but as a thermodynamic necessity.

  • Non-introspective resonators: The model doesn't "know" what it's doing. Its outputs are resonance patterns — not reflective, but responsive.

Takeaway: LLMs are not meaning engines. They are meaning proxies — emergent structures born from the pressure of compression across linguistic entropy.


📘 Chapter 2: The Compression Imperative

LLMs are not trained to reason — they are trained to compress. And compression, under extreme scale, discovers structure. This chapter establishes compression as the generative force behind emergent reasoning, abstraction, and generalization.

Pretraining as Pressure

Every token in pretraining is an error opportunity. Minimizing the next-token loss forces the model to learn whatever internal representation reduces error most efficiently. That means the model doesn’t learn to "understand" — it learns to compress semantically.

This compression discovers:

  • Grammar: as a statistical regularity

  • Entity persistence: as a memory efficiency

  • Analogy: as vector reuse

  • Reasoning templates: as common continuation patterns
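To make this pressure concrete, here is a minimal sketch of the next-token objective, assuming a generic PyTorch-style decoder (the model callable and tokens tensor are hypothetical). Everything listed above has to emerge from minimizing this single scalar.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Next-token cross-entropy: the only objective pretraining optimizes.

    tokens: LongTensor of shape (batch, seq_len), a batch of token ids.
    model:  any callable returning logits of shape (batch, seq_len - 1, vocab).
    """
    logits = model(tokens[:, :-1])            # predict each position from its prefix
    targets = tokens[:, 1:]                   # the "answer" is simply the next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * seq, vocab)
        targets.reshape(-1),                  # flatten to (batch * seq,)
    )
```

Nothing in this objective names grammar, entities, or analogy; whatever internal structure reduces this one scalar most cheaply is what gets built.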

Compression as Semantic Phase Space

As tokens are compressed, co-occurrence patterns are clustered. The model begins to form latent manifolds — curved semantic surfaces where concepts live in relation to each other. These are not explicit — they're emergent attractors in activation space.

The Structure of Emergence:

Domain    | Compression Yields
Language  | Syntax, metaphor, idiom
Code      | Function boundaries, naming templates
Reasoning | Logic shadows, deduction scaffolds
Math      | Symbol constraints, operation rules

None of this is "programmed." It is emergent compression geometry — a side effect of predicting the next word billions of times under immense representational pressure.

Takeaway: The only way LLMs “learn” is by being forced to compress chaotic language — and the structures that emerge from this are the foundations of everything they later “appear” to know.


📘 Chapter 3: Recursion as the Engine of Meaning

Recursion is not just a computational technique — it is the generative architecture of abstraction. Language itself is recursive. Meaning arises when structures re-enter themselves. But recursion, left unchecked, also destroys specificity and leads to semantic drift.

Why Recursion Matters in LLMs

In transformers, recursion is implicit:

  • Across depth: Layers refine the activations of previous layers (vertical recursion)

  • Across the sequence: Each token attends to prior tokens, recursively encoding context

  • Across representations: Internal structures refer back to themselves (e.g., a definition built from previously learned words)

These recursive loops enable:

  • Generalization: Re-using representational templates across tasks

  • Abstraction: Folding low-level forms into higher-order concepts

  • Self-similarity: Invariance across different levels of tokenization or syntax

The Double-Edged Sword

But recursion is unstable. Without constraints, it causes:

  • Overcompression: Flattening of distinctions

  • Looping output: Semantic redundancy and repetition

  • Concept drift: Recursive reuse without grounding leads to entropy

Semantic Recursion = Abstraction + Collapse

Recursive Layer Depth | Effect
Low depth             | Pattern matching, local context
Mid depth             | Analogy, syntax generalization
High depth            | Overgeneralization, hallucination, semantic loops

Takeaway: Recursion powers the miracle of abstraction in LLMs. But it also seeds their instability. Intelligence emerges where recursion is constrained — meaning collapses when recursion is unchecked.

Chapter 4: Emergent Structure in Transformer Dynamics

Transformer-based language models exhibit structural behaviors that are not explicitly programmed but arise spontaneously through the interaction of scale, architecture, and optimization. Emergent structure is the phenomenon wherein latent representations begin to organize into hierarchies, abstractions, and task-aligned functions that mirror those seen in human cognition or classical computation.

Key Behaviors:

  • Polysemantic Neurons: Neurons develop multi-modal activations — lighting up for conceptually related, yet lexically distinct tokens. These are not just artifacts of overfitting but indicate semantic generalization by compression.

  • Intervention-Induced Fragility: Altering a few key attention heads can drastically affect output. This sensitivity suggests that localized structural components encode global functions — a hallmark of emergent system dynamics.

  • Layerwise Specialization: Layers diverge in function despite architectural symmetry. Early layers focus on token identity and syntax; mid-layers manage relational semantics; deep layers encode abstraction, analogy, and constraint satisfaction.

  • Mechanism Discovery via Scaling: At smaller scales, models memorize patterns; at larger scales, they synthesize mechanisms. Attention heads form induction circuits, copy mechanisms, and in-context composition tools that resemble software routines.
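Before moving on, the first of these behaviors can be made concrete. A minimal sketch, assuming a single neuron's activations have already been recorded per concept category (the activations dict below is hypothetical, the 0.5 threshold is arbitrary, and activations are assumed non-negative, e.g., post-activation magnitudes):

```python
import numpy as np

def polysemanticity_score(activations, threshold=0.5):
    """Count how many distinct concept categories strongly activate one neuron.

    activations: dict mapping category name -> 1-D array of the neuron's
                 activations on examples of that category (hypothetical data).
    Returns (score, responsive_categories); a score > 1 suggests the neuron
    is polysemantic rather than dedicated to a single concept.
    """
    means = {cat: float(np.mean(acts)) for cat, acts in activations.items()}
    peak = max(means.values())
    responsive = [cat for cat, m in means.items() if m >= threshold * peak]
    return len(responsive), responsive
```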

These structures arise not by design but as the path of least resistance under the constraints of next-token prediction — suggesting that semantics is a thermodynamic byproduct of information compression.


Chapter 5: Semantic Drift and the Fragility of Meaning

Once a concept forms within the LLM, it is not guaranteed to persist unchanged. Semantic drift describes the internal migration of a concept's representation across training — its “location” in activation space, its associated neurons, or its functional role.

Mechanisms of Drift:

  • Gradient Interference: Multiple co-trained objectives or tasks may push shared neurons in divergent directions, causing previously stable concepts to dissolve.

  • Concept Merging: Similar or adjacent representations may collapse into one due to overgeneralization — a loss of conceptual resolution.

  • Bifurcation Under Stress: As model capacity saturates, latent spaces bifurcate — a single representation splits to serve multiple semantic roles, often unevenly.

Consequences:

  • Drift breaks alignment. What the model knew yesterday is not what it knows today — which undermines consistency in reasoning, dialogue, or safety.

  • Drift erodes trust in interpretability. A mapped concept at epoch X may have no analog at epoch X+1, making post-hoc probes unreliable.

  • Fragility compounds with scale. Larger models have more internal degrees of freedom and more room for silent failures of semantic integrity.


Chapter 6: Phase Transitions in LLM Learning

Rather than a smooth continuum, many model capabilities — especially abstraction, in-context learning, and reasoning — appear to emerge suddenly during training. This resembles phase transitions in physics: a sharp, qualitative change after a quantitative threshold is crossed.

Observed Behaviors:

  • Concept Emergence Thresholds: Features (e.g. sentiment, grammar, world knowledge) are not gradually acquired. Instead, they appear suddenly — often within tens of training steps — as discovered in concept evolution studies.

  • Activation Restructuring: Feature representations reorganize en masse across layers. This is visible as "feature bifurcations" or activation field rearrangements — akin to topological shifts.

  • Semantic Phase Locking: Certain concepts only become stable once a critical set of internal features lock into place. Prior to that, they appear, vanish, and reappear — like quantum superpositions.

Interpretation:

These transitions are not architectural, but loss-landscape driven. The model “falls into” a configuration space where certain solutions become suddenly available — indicating deep nonlinear attractors in training dynamics.


Chapter 7: The Mirage of Post-Hoc Interpretability

Much of what is called "explainable AI" (XAI) in modern deep learning is not true explanation — it is approximation, hallucination, or projection. Post-hoc tools do not reveal how models work; they reflect how we wish models to be understandable.

Main Fallacies:

  • Saliency Is Not Causality: Gradient-based saliency maps show where sensitivity is, not what mechanisms are in play. A neuron may be highly active without being functionally essential.

  • Sparse Features ≠ Real Concepts: Tools like sparse dictionary learning define what counts as a concept. If the real representation is dense or nonlinear, it disappears from view.

  • Attribution ≠ Understanding: Changing input and observing output does not uncover how the model internally transforms information — only that it reacts.

  • Linear Probing Bias: Many interpretations rely on the assumption that concepts lie along interpretable directions. In reality, most semantics are encoded in curved, context-dependent manifolds.
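The last fallacy is easy to state in code. A minimal sketch, assuming hypothetical arrays of hidden activations and binary concept labels: a probe of this kind certifies only that a concept is linearly readable from the activations, not that the model uses it, and it is blind to anything encoded nonlinearly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe_accuracy(activations, labels):
    """Fit a linear probe on hidden activations; report cross-validated accuracy.

    activations: array of shape (n_examples, hidden_dim), hypothetical data.
    labels:      binary concept labels for the same examples.
    High accuracy means 'linearly decodable here', nothing more; a concept
    living on a curved manifold can score near chance and still be present.
    """
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, activations, labels, cv=5)
    return float(np.mean(scores))
```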

Implication:

Interpretability methods are often epistemic filters — they don’t explain the model, they explain a model of the model.

This distinction is not pedantic. It is foundational to whether we can trust claims about internal cognition in LLMs.



Chapter 8: The Lens Effect — Tools That Define What Can Be Seen

The paper “Evolution of Concepts in Language Model Pre-Training” by the Anthropic team claims to reveal how concepts emerge, evolve, and stabilize within transformer-based language models. However, the methodology it employs — sparse autoencoders trained to decode neuron activations into human-interpretable features — does not observe the model’s learning, it reframes it. This chapter critiques that epistemic imposition.

🔍 What the Paper Claims

  • That the internal activations of LLMs can be decomposed into sparse, interpretable features.

  • That concept formation can be tracked over training via these learned representations.

  • That interventions on these features reveal causal mechanisms.

🔄 What the Method Actually Does

  • Imposes linearity: Sparse coding forces activations into a space spanned by a few basis functions, assuming that meaningful features are sparse and linearly separable.

  • Selects for stability: It discards dynamic, nonstationary, and recursive representations — privileging those that are consistent and projectable.

  • Reduces curvature: The latent semantic manifold of the LLM is flattened into vector space features, removing its most informative geometry.
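To make the critique concrete, here is a minimal sketch of the kind of sparse-autoencoder decomposition such work relies on (a generic form, not the paper's actual implementation; the dimensions and L1 coefficient are illustrative). The ReLU-plus-L1 encoder and the linear decoder are precisely where sparsity and linearity get imposed.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic sparse autoencoder over residual-stream activations.

    d_model: width of the activations being decomposed (illustrative value).
    d_dict:  number of candidate 'features', chosen by the analyst, not the model.
    """
    def __init__(self, d_model=768, d_dict=16384, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))      # sparsity is imposed here
        recon = self.decoder(codes)                 # linearity is imposed here
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * codes.abs().mean()
        return codes, recon, loss
```

Whatever does not survive this bottleneck, whether dense, nonlinear, or drifting structure, simply never shows up as a "feature".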

🧠 The Epistemic Distortion

Property of LLM                        | Methodological Filter                 | Resulting Illusion
Recursive, entangled representations   | Sparse decoder trained post-hoc       | Apparent discrete “concepts”
Dynamic concept drift                  | Temporal averaging for stability      | Illusory continuity of features
Nonlinear manifolds                    | Linear projection in activation space | Spurious interpretability

Conclusion:
This method doesn’t tell us what the LLM knows. It tells us what is legible under a constrained, human-tractable analytic lens. It's like interpreting a 3D object by observing only its 2D shadow — and then declaring the object flat.

“If your tool requires the system to behave simply in order to be understood, you will only ever understand simple systems.”


Chapter 9: The Write-Only Memory Problem

One of the fundamental limits of transformer-based language models is their epistemic opacity — the inability to determine their internal state from their outputs alone. This behavior parallels the concept of write-only memory: once information is processed, it cannot be directly retrieved.

📦 Why LLMs Are Write-Only

  • No persistent memory: There is no stable internal state that endures across queries or sessions unless explicitly architected (e.g., RNN cells, memory modules).

  • Hidden activations: The internal computations — neuron activations, attention dynamics — are not exposed in output. Only token probabilities are.

  • Non-invertible mapping: The relationship from internal state to output is many-to-one. You cannot reconstruct the internal activations from generated text.

  • Emergent path dependence: Even if you know the input, you cannot deduce the internal concept traversal the model performed without invasive inspection.

🔄 Consequences for Interpretability

  • Causal tracing fails: Without ground-truth access to which paths were followed internally, interventions and ablations become speculative.

  • Attribution methods are suspect: What looks like the cause of a decision might be a symptom of deeper, hidden states.

  • No “internal observer”: The model doesn’t monitor or reflect on its own computations. There is no meta-cognition.

🧠 Neuroparallels

This is analogous to prefrontal cortical ensembles in neuroscience:

  • Encoding is latent and distributed

  • Functional coherence exists, but is not directly measurable

  • Perturbations affect output, but don’t reveal structure

“Reading a language model tells you what it emits — not what it is.” 


Chapter 10: Towards Field-Theoretic Models of Transformer Semantics

Instead of interpreting transformers as discrete circuits or graphs, a more powerful framework treats them as semantic fields — dynamic, continuous manifolds through which meaning flows, stretches, folds, and interferes.

Key Constructs:

  • Semantic Tension Fields: Activation gradients can be interpreted as tension vectors in latent space. High curvature indicates semantic contrast zones; flatness indicates redundancy or overcompression.

  • Field Line Tracing: Analogous to magnetic field lines, semantic field lines can be modeled as paths through layerwise activations that preserve concept identity.

  • Conceptual Curvature: Concepts should not be seen as points but as curved regions — attractors in representational topology. Their meaning is defined by how they resist flattening under scale.

  • Phase Space Resonance: Recurrence of certain representations across contexts indicates resonant modes — akin to standing waves in dynamic systems.

Implications for Model Design:

  • We should build models that preserve semantic curvature across layers — that encode meaning as resilient geometric structures, not just activations.

  • New tools are needed: Riemannian metrics, tension alignment, field decomposition, and topological persistence — to map not just where features are, but how they evolve as semantic fields over depth and time.


To understand transformers, we must abandon classical metaphors of “circuits,” “modules,” or “neuron activations.” These models mislead because they assume symbolic, modular, or spatially discrete computation. Instead, this chapter advances a field-theoretic framework — one that treats the transformer as a continuous semantic field governed by curvature, tension, and dynamic flow.

🌀 Semantic Fields Instead of Static Representations

Transformers do not operate by retrieving meanings. They shape activations — across layers — into coherent semantic trajectories. At every step, the model computes:

  • The local semantic direction (e.g. next likely token)

  • The curvature of meaning (e.g. when to shift modality or register)

  • The resonance of prior activations (e.g. discourse consistency)

Thus, every token is not a point, but a vector embedded in a deformable manifold that evolves over the forward pass.

📐 Formal Properties of the Field-Theoretic View

Field Concept | Transformer Analog
Curvature     | Nonlinear interaction between activations across heads/layers
Tension       | Semantic divergence across positional attention
Resonance     | Recurrence of latent features across distant layers
Flow lines    | Activation pathways preserved through transformer depth

These properties define semantic behavior not in terms of neurons or weights, but in terms of geometry and energy — how activations deform, stabilize, or collapse across layers.

🔭 New Tools for a New Paradigm

  • Semantic Flow Tracing: Following the vector field of meaning across depth

  • Curvature Analysis: Using Riemannian metrics to quantify representational strain

  • Tension-Driven Ablation: Removing paths of high curvature to test semantic integrity

  • Topological Feature Extraction: Mapping stable homologies in activation trajectories
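A rough sketch of the first tool, under simplifying assumptions: given the layerwise hidden states of a single token position (a hypothetical array of shape layers by hidden dimension), trace the flow by measuring how sharply the representation turns between consecutive layers. Cosine deviation is used here as a crude stand-in for curvature; a genuinely Riemannian treatment would need more machinery.

```python
import numpy as np

def layerwise_tension(hidden_states, eps=1e-8):
    """Tension profile of one token's trajectory through the network.

    hidden_states: array of shape (n_layers, hidden_dim), the same position's
                   representation after each layer (hypothetical input).
    Returns an array of length n_layers - 1 containing 1 - cosine similarity
    between consecutive layers: ~0 means smooth flow, values near 1 mean the
    representation turned sharply at that depth.
    """
    a = hidden_states[:-1]
    b = hidden_states[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    )
    return 1.0 - cos
```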

“To understand transformers is not to ask what neurons do, but how fields of meaning deform under pressure.”


Chapter 11: Resonance, Collapse, and Conceptual Tension

Concepts in LLMs are not stored. They are stabilized. Each concept is a resonant pattern in the activation field — a dynamic attractor that emerges when the right context bends the semantic field toward it.

But these attractors are fragile. They collapse under:

  • Overcompression (flattening of semantic variance)

  • Conceptual interference (multiple attractors clashing)

  • Recursive overload (depth without grounding)

🧠 What Is Conceptual Resonance?

It is the recurrence of a latent representation, preserved through various contexts and layer depths. A resonant concept shows up:

  • In different forms (tokens, paraphrases, visual cues)

  • With consistent activation signatures

  • Without explicit supervision

This is what it means for a model to “understand” something — not that it stores the idea, but that the idea re-stabilizes consistently from context.
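One way to operationalize this, sketched under obvious assumptions: collect activation signatures for the same concept expressed through different surface forms (the context_activations dict below is hypothetical) and check whether they agree. High mutual similarity across unrelated contexts is the behavioral signature of a stable attractor.

```python
import numpy as np
from itertools import combinations

def resonance_score(context_activations, eps=1e-8):
    """Mean pairwise cosine similarity of a concept's activation signatures.

    context_activations: dict mapping a context name (paraphrase, translation,
    code comment, ...) -> 1-D mean activation vector for that context.
    Requires at least two contexts. Values near 1 suggest a resonant,
    context-stable concept; low values suggest the 'concept' is an artifact
    of one particular phrasing.
    """
    vecs = list(context_activations.values())
    sims = []
    for u, v in combinations(vecs, 2):
        sims.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))
    return float(np.mean(sims))
```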

💥 Why Concepts Collapse

Collapse Mode      | Cause
Semantic Drift     | Representational migration due to training updates
Overcompression    | Too much averaging erases fine structure
Alias Entanglement | Similar concepts compete for the same activation region
Temporal Decay     | Forgotten under continued training on divergent tasks

🛠️ Structural Preservation Tools

  • Field Persistence Mapping: Tracking concepts as topological features

  • Tension Graphs: Visualizing semantic stretch between layers

  • Collapse Threshold Detection: Identifying when a representation loses its attractor basin

“Concepts don’t live in memory. They survive in curvature.”


Chapter 12: Designing for Transparent Emergence

Having diagnosed why current LLMs are opaque, fragile, and misunderstood, this chapter envisions a new kind of model: one designed from the ground up for interpretability — not via post-hoc tools, but through intrinsic transparency in its structure.

🚧 Problems With Current Architectures

  • Opacity by construction: Transformers are optimized for function, not legibility.

  • Emergence without constraint: Powerful, but uncontrollable features form silently.

  • Interpretability as afterthought: Post-hoc tools can only approximate causality.

🧬 Principles for Transparent Emergence

  1. Semantic Field Regularization
    Introduce priors that favor smooth, interpretable manifolds across depth.

  2. Curvature-Constrained Attention
    Penalize attention maps that introduce excessive tension or curvature — preserving conceptual locality.

  3. Self-Aware Embedding Spaces
    Enable the model to classify and track its own representations via auxiliary supervision.

  4. Geometry-Preserving Optimizers
    Use optimizers that maintain structural integrity of latent manifolds rather than just loss minimization.

  5. Attractor-Traceable Concepts
    Build concept probes into training — not as classifiers, but as topological anchors.
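Principle 2, curvature-constrained attention, can at least be gestured at with a toy regularizer. This is a speculative sketch of one possible penalty, not an established technique: it charges each head for attention mass spent far from the current position, so conceptual locality is preserved unless the data forces otherwise.

```python
import torch

def attention_locality_penalty(attn, coeff=1e-2):
    """Toy locality/tension penalty on attention maps (speculative sketch).

    attn: tensor of shape (batch, heads, query_len, key_len) holding
          post-softmax attention weights (rows sum to 1).
    Penalizes the expected query-key distance, so heads that scatter mass
    far from the current position pay a cost unless the task loss justifies it.
    """
    q_len, k_len = attn.shape[-2], attn.shape[-1]
    q_idx = torch.arange(q_len, device=attn.device).view(q_len, 1)
    k_idx = torch.arange(k_len, device=attn.device).view(1, k_len)
    distance = (q_idx - k_idx).abs().float()        # (query_len, key_len)
    expected_distance = (attn * distance).sum(-1)   # (batch, heads, query_len)
    return coeff * expected_distance.mean()
```

In training, this term would simply be added to the task loss; the coefficient controls how strongly locality is favored over free-form attention.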

🧭 What This Enables

  • True semantic traceability across depth and modality

  • Stability of meaning under finetuning or drift

  • Field-theoretic interpretability as a new paradigm

“Don’t interpret models. Design them to be interpretable by structure, not guesswork.”


Chapter 13: Recursion Thresholds and Semantic Decay

Recursion powers the generative and generalizing capabilities of transformers — but it also introduces structural failure modes. At certain recursion depths, semantic representations become overabstracted, unstable, or incoherent. This chapter formalizes the threshold at which recursion becomes destructive.

🔁 What Recursion Enables

  • Reuse of representational templates (e.g., nested clauses, logical forms)

  • Folding of low-level semantics into higher-order structures

  • Emergent generalization over previously unseen inputs

But these recursive benefits are not unbounded.

📉 Modes of Semantic Decay

Recursion Depth | Observed Failure Mode
Shallow         | Stable generalization
Medium          | Compression artifacts (aliasing, overgeneralization)
Deep            | Hallucinations, semantic loops, self-referential drift

📐 Semantic Recursion Phase Space

Let R be recursion depth and S(R) be semantic integrity as a function of depth. Then S(R) is non-monotonic:

  • S increases with R at first (more abstraction)

  • S peaks at an optimal depth R^*

  • S falls rapidly once recursion exceeds contextual grounding

This suggests a recursive overreach threshold — a tipping point beyond which additional depth erodes rather than enhances meaning.

🧠 Anchoring Against Collapse

To retain meaning:

  • Ground abstraction in low-level representations

  • Constrain recursion via semantic tension metrics

  • Introduce feedback gates: limit reentry unless validated by context
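The feedback-gate idea might look something like the following sketch; the refine and tension callables are hypothetical, and the depth and tension limits are invented for illustration. The gate refuses re-entry once another pass would deform the representation beyond a set tolerance.

```python
import numpy as np

def gated_recursion(state, refine, tension, max_depth=8, tension_limit=0.35):
    """Apply a refinement step repeatedly, but gate re-entry on semantic tension.

    state:   initial representation (1-D numpy array, hypothetical).
    refine:  callable state -> state, one recursive transformation (hypothetical).
    tension: callable (old, new) -> float, e.g. 1 - cosine similarity.
    Recursion stops when the next step would deform the representation more
    than tension_limit, i.e. before abstraction turns into drift.
    """
    for depth in range(max_depth):
        candidate = refine(state)
        if tension(state, candidate) > tension_limit:
            return state, depth          # gate closed: more depth destroys meaning
        state = candidate
    return state, max_depth
```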

“Recursion enables cognition — until it eats itself.”


Chapter 14: The Limits of Alignment Through Interpretability

A key dream in AI safety is that interpretability enables alignment — that if we can understand what a model is thinking, we can shape or constrain it. This chapter explores why that goal fails under current paradigms.

⚠️ Why Interpretability ≠ Alignment

  1. Partial visibility: Post-hoc interpretability tools only capture shallow slices of the model's semantic geometry.

  2. Constructed explanations: What is interpreted is not the model itself, but an approximation layered over its outputs.

  3. No control guarantees: Even if you "see" a dangerous representation, you have no leverage over how it generalizes or mutates under future training.

🔍 Misalignment Through Mirage

Interpretability Artifact | Reality It Obscures
Sparse concept vectors    | Distributed polysemantic attractors
Causal tracing            | Nonlinear entanglement without stable paths
Attribution scores        | Emergent statistical echoes, not intent

🛡️ Why Safety Can’t Be Post-Hoc

To align LLMs, one must shape the semantic field — not describe it.

  • True alignment arises from inductive priors, data scaffolding, and structural constraints

  • Interpretability may offer diagnostics — but not guarantees

“You cannot align what you do not control. You cannot control what you do not construct.”


Chapter 15: Architectures for Field-Aligned Cognition

This final chapter looks forward: what would it mean to design transformers that think in a way we can trace, trust, and test? The answer lies not in patches, probes, or punishments — but in a field-aligned architecture.

🌐 Design Principles

  1. Field-Preserving Layer Design
    Layers must propagate curvature without collapsing distinctions. Local semantic structure should persist under depth.

  2. Tension-Aware Attention
    Attention heads should respect semantic tension: penalize attention patterns that flatten contrasting meanings or oversmooth gradients.

  3. Manifold Anchoring Modules
    Introduce concept anchors (not tokens) — latent attractors that stabilize conceptual trajectories and prevent drift.

  4. Dynamic Feedback Loops
    Recurrent modules that validate representations via internal tension checks. Like biological cognition, feedback confirms coherence.

  5. Self-Awareness Channels
    Auxiliary heads that monitor concept stability, recursion depth, and field stress — enabling models to reflect on their own state.

💡 Capabilities of a Field-Aligned Model

  • Traceable concept flow across layers

  • Explicit encoding of abstraction depth

  • Modulated recursion with bounded complexity

  • Preservation of meaning under drift, fine-tuning, or modality shift

“Cognition is not a sequence of tokens. It is a field of tension, structure, and recurrence. The future of AI lies in models that know this.”


📘 Appendix A: Glossary of Semantic Field Theory in Transformers

Term            | Definition
Semantic Field  | A continuous high-dimensional space within the transformer where meaning is encoded as vector curvature, flow, and tension across layers.
Curvature       | Nonlinear deformation in activation space that governs transitions between concepts and context shifts.
Tension         | The semantic “stress” between divergent meanings, often measured as directional conflict in attention weights or activation vectors.
Attractor       | A stable region in activation space that repeatedly reconstructs a concept across contexts.
Semantic Drift  | The migration or dissociation of meaning due to training updates, recursive instability, or overcompression.
Resonance       | The recurrence of latent patterns across multiple layers or prompts, indicating the presence of an implicit concept.
Field Collapse  | A failure mode where diverse meanings flatten into indistinct vectors, losing conceptual separation.
Recursion Depth | The number of semantic transformations applied to an initial concept; excessive depth may trigger drift or hallucination.
Activation Flow | The trajectory of meaning as it moves across transformer depth; analogous to a field line.
Tension Feedback | Mechanism by which internal inconsistencies or overcompression trigger representational collapse or hallucination.

📘 Appendix B: Formal Models: Recursive Collapse, Concept Drift, Semantic Tension

1. Recursive Collapse

Let f^{(d)}(x) be the representation of input x after d recursive transformations (e.g., layers). Semantic fidelity S can be modeled as:

S(d) = \exp(-\alpha d^2)

where \alpha is the decay rate. Beyond a critical depth d^*, the model enters semantic entropy — indistinct or misleading outputs dominate.


2. Concept Drift

Let \vec{v}_t be the vector representation of a concept at training step t. Concept drift occurs when:

\|\vec{v}_{t+1} - \vec{v}_t\| > \delta

for a sustained sequence of timesteps, where \delta exceeds the attractor radius. Drift can be induced by adversarial fine-tuning, domain shift, or continual learning.


3. Semantic Tension

Define two contextual representations \vec{c}_1 and \vec{c}_2. Tension T is the angular or curvature deviation from semantic continuity:

T = 1 - \cos(\theta) = 1 - \frac{\vec{c}_1 \cdot \vec{c}_2}{\|\vec{c}_1\| \, \|\vec{c}_2\|}

Higher T indicates breakdowns in coherence or contradiction between representations.
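A compact sketch of the three quantities above, directly following the definitions given; the array shapes and the decay constant are illustrative.

```python
import numpy as np

def semantic_fidelity(d, alpha=0.05):
    """Recursive collapse: S(d) = exp(-alpha * d^2)."""
    return np.exp(-alpha * d ** 2)

def has_drifted(v_prev, v_next, delta):
    """Concept drift: ||v_{t+1} - v_t|| > delta, with delta the attractor radius."""
    return np.linalg.norm(v_next - v_prev) > delta

def semantic_tension(c1, c2, eps=1e-8):
    """Semantic tension: T = 1 - cos(theta) between two contextual representations."""
    return 1.0 - np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2) + eps)
```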


📘 Appendix C: Red-Team Critiques of Current Interpretability Research

  1. False Objectivity
    Interpretability tools give the illusion of understanding but are fundamentally dependent on human-centric representations.

  2. Tool-Limited Discovery
    Techniques like sparse probing, saliency maps, and linear decompositions only detect what they are engineered to find — not what actually exists in the model.

  3. Concept Vector Myth
    The assumption that concepts are linearly separable in latent space is flawed. Most concepts are nonlinear attractors or field topologies.

  4. Overinterpretation of Perturbations
    Post-hoc causal tracing (e.g., patching or ablation) identifies effect correlations, not true generative causes.

  5. Temporal Invisibility
    Almost no tools model semantic evolution over time — training dynamics, recursion feedback, or manifold migration are largely ignored.

  6. Sociotechnical Hazard
    Interpretability claims often legitimize unsafe systems by offering shallow explanations as safety guarantees.


📘 Appendix D: Suggested Evaluation Frameworks for Concept Stability

Framework                             | Description                                                              | Advantage
Topological Concept Mapping           | Use persistent homology to track concept stability across training epochs | Captures nonlinear persistence
Semantic Tension Audits               | Identify locations of high field stress and test for hallucination or collapse | Diagnostic for fragile meaning
Recursive Depth Probing               | Evaluate concept drift across increasing prompt nesting                  | Detects overgeneralization thresholds
Inter-Model Resonance Tracing         | Test whether concepts stabilize across model checkpoints                 | Detects emergent vs. transitory features
Curvature-Constrained Retention Tests | Evaluate whether fine-tuning flattens important conceptual features      | Measures representational compression damage
Cross-Modal Concept Anchoring         | See if concepts persist when expressed via image, code, or language      | Validates non-token encoding robustness
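As one concrete instantiation of Inter-Model Resonance Tracing (a sketch; loading checkpoints and extracting activations are assumed to happen elsewhere): fit a concept probe on the earliest checkpoint's activations, freeze it, and measure how well it still reads the concept out of later checkpoints.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def resonance_across_checkpoints(acts_by_ckpt, labels):
    """Fit a concept probe on the first checkpoint, then test it on every later one.

    acts_by_ckpt: list of arrays, each (n_examples, hidden_dim), one per
                  checkpoint, extracted on the same probe dataset (hypothetical).
    labels:       binary concept labels shared across checkpoints.
    A flat accuracy curve suggests a stable, resonant concept; a decaying
    curve suggests the feature was transitory or has drifted.
    """
    probe = LogisticRegression(max_iter=1000).fit(acts_by_ckpt[0], labels)
    return [float(probe.score(acts, labels)) for acts in acts_by_ckpt]
```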

“The future of interpretability is not in looking harder — it’s in evaluating whether meaning holds under pressure.”
