Methodology
About AUGUR
A live seismograph for collective consciousness — applying statistical physics to cultural discourse.
Scientific Basis
AUGUR applies Early Warning Signal (EWS) theory from statistical physics to cultural discourse. EWS theory predicts that complex systems approaching a phase transition exhibit measurable precursors: rising variance, increasing autocorrelation (critical slowing down), and topological changes in their state space. These phenomena were originally described in the context of ecological collapses (Scheffer et al., 2009), climate tipping points (Lenton et al., 2011), and financial crises (Dakos et al., 2012).
AUGUR extends this framework to the semantic embedding space of online discourse. Each week, posts from eight Reddit domains are embedded into a 384-dimensional vector space. The geometry of this space (its variance, autocorrelation structure, topological loops, and spectral properties) encodes the collective informational state of each domain. Shifts in that geometry, measured as Kendall-tau trends over an 8-week rolling window, constitute the Early Warning Signal.
Differentiation from semantic change detection (SCD): SCD methods (e.g., Hamilton et al. 2016) detect what changed — retrospectively. AUGUR detects structural precursors before a transition occurs — prospectively. The goal is EWS prediction, not retrospective shift identification.
Pipeline
1. Reddit scrape: PRAW API, 200 posts/subreddit/week across 8 Reddit domains, exponential backoff on rate-limit. Historical data via Arctic Shift (Pushshift replacement).
2. Sentence embeddings: sentence-transformers/all-MiniLM-L6-v2, 384 dimensions, batch size 64. Model is fixed; swapping embedders would invalidate UMAP models.
3. UMAP dimensionality reduction: 10D for TDA input (n_neighbors=15, min_dist=0.1); 3D for visualization (n_neighbors=10). Models trained once via backfill, then reused weekly.
4. Topological Data Analysis: ripser persistent homology on 10D UMAP embeddings. Computes H0 (connected components) and H1 (loops) over a Vietoris-Rips complex (MAX_EDGE_LENGTH=0.8). Topological Persistence Entropy (TPE) summarizes diagram complexity.
5. Spectral analysis: kNN graph (k=10) Laplacian. Spectral gap = λ₂ − λ₁. A falling spectral gap signals community fragmentation, a known EWS precursor.
6. EWS computation: Rolling 8-week window. Kendall-tau trend statistics computed for variance (τ_var), Centroid Directional Consistency (τ_cdc, stored as ac1), TPE (τ_tpe), and spectral gap (τ_sg). Permutation p-values (n=1000) for all four signals.
7. CCI aggregation: Composite Criticality Index, a weighted sum of four normalized τ values. τ_sg is negated before normalization so that a falling gap actively raises CCI. Bootstrap CI (n=200) provides uncertainty bounds. Range [0, 1].
8. GDELT enrichment: Weekly event volume and average tone from GDELT Project 2.0 for each domain keyword set. Provides external validation context.
9. Persist → Export: Upsert into Supabase weekly_snapshots table. Export static latest.json to public/data/ for edge caching on Vercel.
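The spectral step (step 5) can be illustrated with a small NumPy sketch. This is not AUGUR's implementation: the function name `spectral_gap`, the Euclidean metric, the unweighted symmetrized adjacency, and the unnormalized Laplacian are all assumptions, since the pipeline description only fixes k=10 and the definition λ₂ − λ₁.

```python
import numpy as np


def spectral_gap(points: np.ndarray, k: int = 10) -> float:
    """Gap between the two smallest eigenvalues of a kNN-graph Laplacian.

    Assumes more than k points; metric and graph construction are illustrative.
    """
    n = points.shape[0]
    # Pairwise Euclidean distances (assumption: the metric is not stated in the source).
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbour

    # Unweighted adjacency: connect i and j if either is among the other's k nearest.
    nearest = np.argsort(dist, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    adj = np.zeros((n, n))
    adj[rows, nearest.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)

    # Unnormalized Laplacian L = D - A; eigvalsh returns eigenvalues in ascending order.
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigenvalues = np.linalg.eigvalsh(laplacian)
    return float(eigenvalues[1] - eigenvalues[0])


rng = np.random.default_rng(0)
one_cloud = rng.normal(size=(40, 3))
two_clouds = np.vstack([rng.normal(size=(20, 3)), rng.normal(size=(20, 3)) + 100.0])
gap_connected = spectral_gap(one_cloud)
gap_fragmented = spectral_gap(two_clouds)
```

For a connected graph λ₁ is 0, so the gap reduces to λ₂. When the point cloud splits into disconnected groups, λ₂ also drops to 0, which matches the fragmentation reading given in step 5.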
Composite Criticality Index (CCI)
The CCI aggregates four Kendall-tau trend statistics, each computed over an 8-week rolling window:
- τ_var: rising variance; critical slowing down (Dakos et al. 2012)
- τ_cdc: rising Centroid Directional Consistency; high-dimensional slowing down proxy
- τ_tpe: changing Topological Persistence Entropy; structural complexity trend
- τ_sg: falling spectral gap; community fragmentation (negated before normalization)
CCI = 0.25·n(τ_var) + 0.35·n(τ_cdc) + 0.25·n(τ_tpe) + 0.15·n(−τ_sg), clipped to [0, 1]
where n(·) is min-max normalisation against the 8-week historical range, and τ_sg is negated before normalisation so that a falling gap raises CCI.
Note: τ_cdc is stored as ac1 in the database for backward compatibility. Centroid Directional Consistency (CDC) measures cosine similarity between consecutive week centroids in 384D space, the high-dimensional analogue of lag-1 autocorrelation.
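As a rough illustration of the CDC definition in the note, the sketch below computes the cosine similarity between each week's centroid and the previous week's. The function name, the input layout (one array of post embeddings per week), and the 0.0 fallback for a zero-length centroid are assumptions made for the example, not details taken from the AUGUR codebase.

```python
import numpy as np


def centroid_directional_consistency(weekly_embeddings):
    """Cosine similarity of consecutive weekly centroids (one value per week pair)."""
    centroids = [np.asarray(week, dtype=float).mean(axis=0) for week in weekly_embeddings]
    cdc = []
    for prev, curr in zip(centroids, centroids[1:]):
        denom = np.linalg.norm(prev) * np.linalg.norm(curr)
        # Assumption: report 0.0 when a centroid has zero norm, to avoid dividing by zero.
        cdc.append(float(prev @ curr / denom) if denom > 0 else 0.0)
    return cdc


# Toy 2D example: week 2 is orthogonal to week 1, week 3 points the same way as week 2.
weeks = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
    [[0.0, 1.0]],
]
cdc_series = centroid_directional_consistency(weeks)
```

The resulting series has one fewer entry than the number of weeks; τ_cdc would then be the Kendall-tau trend of that series over the rolling window.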
| State | CCI Range | Interpretation |
|---|---|---|
| Stable | 0.0 – 0.3 | No significant EWS trend detected |
| Elevated | 0.3 – 0.6 | Weak trends present; monitor closely |
| High | 0.6 – 0.8 | Multiple signals converging |
| Critical | 0.8 – 1.0 | Strong EWS; transition likely near |
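The formula and the state table can be combined into a short sketch. The weights and thresholds come from the text above; the helper names, the dictionary layout, the neutral value 0.5 for a flat history, and the choice to treat each range's lower bound as inclusive are assumptions, since the source does not specify them.

```python
WEIGHTS = {"var": 0.25, "cdc": 0.35, "tpe": 0.25, "sg": 0.15}


def min_max(value, history):
    """Min-max normalise value against the range of history."""
    lo, hi = min(history), max(history)
    if hi == lo:
        return 0.5  # assumption: neutral score when the history is flat
    return (value - lo) / (hi - lo)


def composite_criticality_index(taus, histories):
    """Weighted sum of normalised tau values, clipped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        sign = -1.0 if name == "sg" else 1.0  # negate the spectral gap before normalising
        value = sign * taus[name]
        history = [sign * h for h in histories[name]]
        total += weight * min_max(value, history)
    return min(1.0, max(0.0, total))


def cci_state(cci):
    """Map a CCI value to the states in the table (lower bounds inclusive by assumption)."""
    if cci < 0.3:
        return "Stable"
    if cci < 0.6:
        return "Elevated"
    if cci < 0.8:
        return "High"
    return "Critical"


history = {name: [-1.0, 1.0] for name in WEIGHTS}
strong = composite_criticality_index({"var": 1.0, "cdc": 1.0, "tpe": 1.0, "sg": -1.0}, history)
neutral = composite_criticality_index({"var": 0.0, "cdc": 0.0, "tpe": 0.0, "sg": 0.0}, history)
```

Negating both the current τ_sg and its history keeps the normalisation consistent, so a strongly falling gap maps to a normalised value near 1.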
Validation Results
The validation set contains 27 ground-truth transition events spanning all 8 domains, including subtle transitions (seasonal shifts, policy-driven changes) and not just dramatic outliers. The events are split into 17 training events (pre-2023) and 10 held-out test events (2023–2025). Results are auto-populated from public/data/validation.json after the pipeline runs.
| Domain | Event | Date | Detected | Lead Time | Peak CCI |
|---|---|---|---|---|---|
| mental_health | COVID-19 mental health crisis | 2020-03-15 | — | — | — |
| mental_health | Seasonal depression surge (2021) | 2021-10-15 | — | — | — |
| mental_health | r/therapy normalisation wave | 2022-07-01 | — | — | — |
| economics | Crypto mania cultural peak | 2021-11-08 | — | — | — |
| economics | r/antiwork discourse peak | 2022-01-25 | — | — | — |
| economics | Inflation anxiety discourse shift | 2022-06-10 | — | — | — |
| economics | Trump 'Liberation Day' tariff announcements | 2025-04-02 | — | — | — |
| technology | ChatGPT / AI discourse explosion | 2022-11-30 | — | — | — |
| technology | GPT-4 release discourse shift | 2023-03-14 | — | — | — |
| technology | GPT-4o launch | 2024-05-13 | — | — | — |
| technology | OpenAI o1 (reasoning model) release | 2024-09-12 | — | — | — |
| technology | DeepSeek R1 release | 2025-01-20 | — | — | — |
| relationships | Pandemic isolation discourse shift | 2020-04-01 | — | — | — |
| relationships | Post-Dobbs discourse shift | 2022-06-24 | — | — | — |
| science_trust | COVID vaccine authorization discourse | 2020-12-11 | — | — | — |
| science_trust | Conspiracy infodemic peak | 2020-11-03 | — | — | — |
| spirituality | Pandemic meaning-seeking wave | 2020-04-01 | — | — | — |
| spirituality | Mindfulness mainstreaming shift | 2021-07-01 | — | — | — |
| climate_environment | IPCC AR6 'Code Red' report | 2021-08-09 | — | — | — |
| climate_environment | Inflation Reduction Act (IRA) passage | 2022-08-16 | — | — | — |
| climate_environment | COP28 'beginning of the end of fossil fuels' | 2023-12-13 | — | — | — |
| climate_environment | Global temperature records broken (July 2023) | 2023-07-15 | — | — | — |
| climate_environment | Los Angeles wildfires (January 2025) | 2025-01-10 | — | — | — |
| geopolitics | Taliban capture of Kabul | 2021-08-15 | — | — | — |
| geopolitics | Russia invades Ukraine | 2022-02-24 | — | — | — |
| geopolitics | Hamas attack and Israel-Gaza war | 2023-10-07 | — | — | — |
| geopolitics | US Presidential Election 2024 | 2024-11-05 | — | — | — |
Metrics populate automatically after the backfill pipeline runs.
Ablation Studies
Each EWS component is tested in isolation, leave-one-out, and with alternative weight configurations. The full four-signal CCI is expected to outperform single-signal and classical baselines. Results are read from public/data/ablations.json.
| Configuration | Hit Rate | Lead Time | FP Rate | AUC |
|---|---|---|---|---|
| Variance only (single-signal) | — | — | — | — |
| CDC only (single-signal) | — | — | — | — |
| TPE only (single-signal) | — | — | — | — |
| Spectral gap only (single-signal) | — | — | — | — |
| LOO: no variance (leave-one-out) | — | — | — | — |
| LOO: no CDC (leave-one-out) | — | — | — | — |
| LOO: no TPE (leave-one-out) | — | — | — | — |
| LOO: no spectral gap (leave-one-out) | — | — | — | — |
| Baseline: variance trend (classical baseline) | — | — | — | — |
| Baseline: scalar AC1 (Dakos 2012 AC1) | — | — | — | — |
| Classical VAR+CDC (Dakos 2012 composite) | — | — | — | — |
| Equal weights (0.25×4) (weight ablation) | — | — | — | — |
| 6-signal CCI (full) (extended signals) | — | — | — | — |
| Full CCI (tuned) (current system) | — | — | — | — |
Ablation results populate after backfill + ablation.py run.
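One way to enumerate the single-signal, leave-one-out, and equal-weight configurations is sketched below. The source does not say how the remaining weights are handled when a signal is dropped; rescaling them to sum to 1 is an assumption of this example, as are the function and key names. It is not a description of ablation.py.

```python
BASE_WEIGHTS = {"var": 0.25, "cdc": 0.35, "tpe": 0.25, "sg": 0.15}


def ablation_configs(base):
    """Build weight dictionaries for the full, single-signal, leave-one-out and equal setups."""
    configs = {"full": dict(base)}
    for name in base:
        # Single-signal: all weight on one component.
        configs[f"{name}_only"] = {k: (1.0 if k == name else 0.0) for k in base}
        # Leave-one-out: drop one component and rescale the rest (assumed renormalisation).
        remaining_total = sum(v for k, v in base.items() if k != name)
        configs[f"loo_no_{name}"] = {
            k: (0.0 if k == name else v / remaining_total) for k, v in base.items()
        }
    # Equal weights across all components.
    configs["equal"] = {k: 1.0 / len(base) for k in base}
    return configs


configs = ablation_configs(BASE_WEIGHTS)
```

Each dictionary can be substituted for the tuned weights in the CCI formula, so every configuration is scored by the same code path.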
Technical Limitations
- UMAP reduction for TDA: Persistent homology is applied to UMAP-10D projections, not the full 384D space (O(n³) Vietoris-Rips would be intractable). UMAP preserves local structure but may distort global topology. Sensitivity analysis compares TPE trend directions across domains as a consistency check.
- CDC vs standard AC1: The "AC1" signal is Centroid Directional Consistency (cosine similarity of consecutive centroids), not the standard lag-1 Pearson autocorrelation of Dakos et al. 2012. CDC captures directional drift toward a new attractor, a theoretically distinct but related EWS proxy.
- CCI weights: The weights (0.25/0.35/0.25/0.15) are chosen empirically and give CDC the largest share, consistent with Dakos et al. (2012) demonstrating AC1 as the most robust classical EWS signal. The ablation studies test whether results are robust to weight perturbations.
- Vietoris-Rips threshold: MAX_EDGE_LENGTH = 0.8 (cosine distance). This sets the maximum filtration value for the persistence computation. Values in 0.6–0.9 have been tested; TPE trends are qualitatively consistent.
- Rolling window: 8 weeks is the default Kendall-tau window. Results are robust to 6–10 week windows (validated in sensitivity analysis). Very fast transitions (< 2 weeks) may not generate a meaningful lead signal.
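The rolling-window trend statistic and its permutation p-value can be sketched in plain Python. This is a minimal illustration: it uses Kendall tau-a against the time index (ties contribute zero), a two-sided permutation test, and the (hits + 1)/(n + 1) estimator, all of which are assumptions because the source only states the window length and n=1000 permutations.

```python
import random


def kendall_tau(values):
    """Kendall tau-a of a series against its time order; needs at least 2 values."""
    n = len(values)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            score += (diff > 0) - (diff < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)


def permutation_p_value(values, n_perm=1000, seed=0):
    """Two-sided p-value: share of shuffles whose |tau| reaches the observed |tau|."""
    rng = random.Random(seed)
    observed = abs(kendall_tau(values))
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(kendall_tau(shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def rolling_tau(series, window=8):
    """Tau over each trailing window; the first value appears once `window` points exist."""
    return [kendall_tau(series[i - window:i]) for i in range(window, len(series) + 1)]


weekly_variance = [0.10, 0.12, 0.15, 0.19, 0.24, 0.30, 0.37, 0.45, 0.54, 0.64]
taus_by_window = {w: rolling_tau(weekly_variance, window=w) for w in (6, 8, 10)}
p_value = permutation_p_value(weekly_variance[:8])
```

Calling `rolling_tau` with windows of 6, 8, and 10 weeks, as in `taus_by_window`, is one simple way to check whether a trend direction survives the window choice.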
Domain Limitations
- Reddit-only: platform-specific discourse, not representative of all populations or demographics.
- English-language posts only.
- CCI detects discourse shifts, not real-world events directly. Correlation with external events must be validated case-by-case.
- UMAP is non-deterministic; retraining on a different corpus may shift the embedding space and invalidate historical comparisons.
- Cosine similarity in embedding space conflates semantic and stylistic similarity: posts with similar syntax but opposite meaning may cluster together.
Ethics and Data Use
IRB determination: This research uses publicly available Reddit data collected from public subreddits with no access restrictions. All analysis is performed at the aggregate level; no individual users are identified, tracked, or profiled. No personally identifying information is stored. Under standard academic research ethics frameworks, publicly available text data analysed in aggregate does not constitute human subjects research requiring IRB review.
Reddit ToS compliance: Data collected via PRAW within API rate limits. Only public subreddits accessed. No scraping of private or restricted communities. Historical data via Arctic Shift API (community archive service operating with community cooperation). Raw post text is processed into embeddings and not exposed publicly.
Dual-use acknowledgment: A system that detects when a population's discourse approaches a critical juncture could theoretically be used to suppress, manipulate, or amplify discourse. AUGUR is a monitoring and detection tool, not a manipulation tool. Publishing the methodology enables defensive use: any actor aware of these signals can also monitor for them. Reddit's public nature means this analysis does not create new privacy risks beyond those that already exist.
Data availability: UMAP models are committed to the repository. Aggregate CCI time-series data is published via the Supabase API. Reddit post IDs (not text) are available upon request for research reproducibility. Derived embeddings can be shared for research purposes under standard academic data sharing agreements.
Data Sources
- Reddit: via PRAW; historical data via Arctic Shift (Pushshift replacement)
- GDELT Project 2.0: global event database, weekly aggregations
- sentence-transformers/all-MiniLM-L6-v2: Hugging Face Model Hub
- Supabase: PostgreSQL, weekly_snapshots + transitions + domain_config tables
Citation
Rana, T. (2024). AUGUR: A Live Seismograph for Collective Consciousness. GitHub. https://github.com/thetanishrana/augur
References
- [1] Scheffer, M., Bascompte, J., Brock, W. A., et al. (2009). Early-warning signals for critical transitions. Nature, 461, 53–59.
- [2] Dakos, V., Carpenter, S. R., Brock, W. A., et al. (2012). Methods for detecting early warnings of critical transitions in time series illustrated using simulated ecological data. PLOS ONE, 7(7), e41010.
- [3] Lenton, T. M., Livina, V. N., Dakos, V., van Nes, E. H., & Scheffer, M. (2011). Early warning of climate tipping points from critical slowing down: comparing methods to improve robustness. Philosophical Transactions of the Royal Society A, 370(1962), 1185–1204.
- [4] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of EMNLP 2019.
- [5] Wichers, M., Groot, P. C., & Psychosystems, E.S.M. Group. (2016). Critical slowing down as a personalized early warning signal for depression. Psychotherapy and Psychosomatics, 85(2), 114–116. (Extends CSD to clinical psychological transitions.)
- [6] Gutierrez-Roig, M., Sagarra, O., Oltra, A., et al. (2023). Temporal network sociomarkers for social emergencies. Information Sciences. (Network-structural EWS on digital platforms; differentiates from AUGUR's semantic approach.)
- [7] Giulianelli, M., Del Tredici, M., & Fernández, R. (2021). Analysing lexical semantic change with contextualised word representations. Proceedings of ACL 2021. (Retrospective semantic change detection; differentiated from AUGUR's prospective EWS framing.)
- [8] Bury, T. M., Sujith, R. I., Pavithran, I., et al. (2023). Deep learning for early warning signals of tipping points. Nature Communications. (ML alternative to classical EWS; AUGUR uses interpretable four-signal composite without labeled training data.)
- [9] Kulig, A., Drożdż, S., Kwapień, J., & Oświęcimka, P. (2024). Highly engaging Reddit events reveal semantic and temporal compression. PNAS Nexus. (Within-event semantic compression; AUGUR detects pre-event structural precursors.)
- [10] Ballester, A., Pastor, J. M., & Villacorta-Atienza, J. A. (2024). Topological Data Analysis for NLP: A Comprehensive Survey. arXiv:2411.10298. (Survey of 95+ TDA+NLP papers; contextualises AUGUR within the TDA+NLP literature.)
- [11] De Silva, V., et al. (2024). Detecting Narrative Shifts through Persistent Structures. arXiv:2506.14836. (Retrospective TDA on media discourse embeddings; AUGUR tracks pre-transition TPE trends prospectively.)
- [12] Perotti, A., et al. (2025). TDA-Based Controversy Detection in Reddit. arXiv:2503.03500. (TDA on Reddit for classification; AUGUR uses TDA for temporal EWS, not classification.)