Methodology

Every claim in REALMS is attached to a source text and passed through a four-stage verification pipeline before it appears on the public site. This page describes the pipeline end-to-end and shows the current error rate measured by an independent oracle sampler.


1. Source selection

We ingest three layers of source material. English-language Wikipedia articles form the first pass: they are broad, hyperlinked, and cover popular entities well. PubMed abstracts supply peer-reviewed anthropology and religious-studies citations for each entity. Internet Archive scans supply pre-1929 public-domain monographs (Frazer, Grey, Stevenson, Parrinder, Mooney, et al.) for primary ethnography.

We hard-exclude Tier 3 (modern entheogenic visionary literature) and Tier 4 (modern occult / fiction / UFO-adjacent) content — see the ethics page for why.

2. Extraction

Each source is fetched, cached, and broken into ~3,500-character chunks. Each chunk is sent through an LLM prompt (current version: v4) asking for a JSON list of entities with fixed fields, explicit relationship role-fields, and a verbatim quote_context that anchors every claim to the source text.
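
As an illustration of the chunking step, here is a minimal sketch that splits text into pieces of at most 3,500 characters. The function name chunk_text, the preference for breaking on a space, and the absence of overlap between chunks are assumptions made for the example; the production chunker may differ.

```python
def chunk_text(text: str, target: int = 3500) -> list:
    """Split text into chunks of at most `target` characters.

    Hypothetical helper: breaking on the last space before the limit is an
    assumption, chosen so that words are not cut in half.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + target, len(text))
        if end < len(text):
            # Back up to the last space inside the window, if there is one.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks
```

Because the chunks concatenate back to the original text, a quote_context found in a chunk can always be located in the source. A quote that straddles a chunk boundary would not be found in either chunk, which is one reason a real chunker might add overlap.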

3. Verification (four-stage integrity pipeline)

Stage 1: deterministic quote check. The quote_context must appear verbatim in the source chunk after both are Unicode-normalised and stripped of diacritics. Claims whose quote is missing are rejected outright.
Stage 2: semantic verification. An independent LLM (different from the extractor; currently Gemini 2.0 Flash) judges whether the quote supports, is ambiguous about, or contradicts the claim. Confidence below 0.85 downgrades the claim.
Stage 3: per-entity gate. An entity is accepted when at least 99% of its claims pass, flagged for review when the pass rate is at least 90% but below 99%, and rejected below 90%. The thresholds are configurable through the environment.
Stage 4: oracle sampling. A nightly job draws a random sample of recent extractions and asks a top-tier model (Claude Opus) to independently judge each claim. The aggregate error rate is what appears in the badge above.
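
The two deterministic stages (1 and 3) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, and the case folding and whitespace collapsing in normalise go beyond what the description above specifies.

```python
import unicodedata


def normalise(text: str) -> str:
    """Unicode-normalise and strip diacritics from text."""
    # NFKD decomposition separates base letters from combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: also fold case and collapse runs of whitespace.
    return " ".join(stripped.casefold().split())


def quote_in_chunk(quote_context: str, chunk: str) -> bool:
    """Stage 1: the normalised quote must occur in the normalised chunk."""
    needle = normalise(quote_context)
    return bool(needle) and needle in normalise(chunk)


def entity_gate(passed: int, total: int,
                accept: float = 0.99, review: float = 0.90) -> str:
    """Stage 3: classify an entity by the share of its claims that passed."""
    if total <= 0:
        # The text above does not say how an entity with no claims is handled.
        raise ValueError("entity has no claims to gate")
    rate = passed / total
    if rate >= accept:
        return "accepted"
    if rate >= review:
        return "flagged"
    return "rejected"
```

For example, quote_in_chunk("Quetzalcoatl", "Quetzalcóatl was revered") is true because the accent is stripped before comparison, and entity_gate(9, 10) returns "flagged".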

4. Normalisation & corroboration

We match entities by exact case-insensitive name first, then by a stem key (diacritic-stripped, plural-tolerant). Merges carry forward the union of alternate names, powers, domains, cultural associations, and provenance source IDs. Temporal fields follow earliest/widest semantics.
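
A rough sketch of the two-pass name match is shown below. The names stem_key and find_match are hypothetical, and the plural rule (dropping a single trailing "s" from words longer than three letters) is a deliberately naive assumption standing in for whatever plural tolerance the real matcher applies.

```python
import unicodedata


def stem_key(name: str) -> str:
    """Build a diacritic-stripped, plural-tolerant key for a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = []
    for word in base.casefold().split():
        # Naive plural tolerance (assumption): drop one trailing "s".
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def find_match(name: str, existing: list):
    """Return the existing name that matches, or None.

    Pass 1 is an exact case-insensitive comparison; pass 2 compares stem keys.
    """
    folded = name.casefold()
    for candidate in existing:
        if candidate.casefold() == folded:
            return candidate
    key = stem_key(name)
    for candidate in existing:
        if stem_key(candidate) == key:
            return candidate
    return None
```

A rule this simple also trims names that merely end in "s" (Isis becomes "isi"), so a production matcher would need an exception list or a proper stemmer.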

Each extraction carries an LLM-reported confidence. Entity-level consensus_confidence is the average across extractions. We additionally compute a corroboration tier:

tier_3 (triangulated) — at least one Wikipedia, one PubMed, and one archive.org source.
tier_2 (corroborated) — two or more distinct source types.
tier_1 (single tradition) — one source type.
tier_0 (stub) — referenced by another entity but never directly extracted.
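
The tier rules and the entity-level average can be sketched as follows. The source-type labels ("wikipedia", "pubmed", "archive_org") and the function names are assumptions for illustration. The tier_3 check runs before the tier_2 check, since an entity with all three source types also satisfies "two or more".

```python
from statistics import mean
from typing import Optional

# Hypothetical labels for the three source layers.
TRIANGULATION = {"wikipedia", "pubmed", "archive_org"}


def corroboration_tier(source_types: set) -> str:
    """Map the set of source types behind an entity to a tier label."""
    if not source_types:
        return "tier_0"  # stub: referenced but never directly extracted
    if TRIANGULATION <= source_types:
        return "tier_3"  # all three layers present
    if len(source_types) >= 2:
        return "tier_2"
    return "tier_1"


def consensus_confidence(confidences: list) -> Optional[float]:
    """Average the LLM-reported confidences across extractions."""
    # Returning None for an entity with no extractions is an assumption.
    return mean(confidences) if confidences else None
```

For instance, an entity extracted from Wikipedia and PubMed only would be tier_2, and two extractions with confidences 0.5 and 1.0 would give a consensus_confidence of 0.75.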

5. Cross-database linking

A matcher proposes candidate Wikidata and VIAF identifiers for each entity. A link is auto-accepted only when the top candidate's confidence exceeds 0.85 and is at least twice the runner-up's confidence. Otherwise the candidates are queued for manual review.
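
The auto-accept rule might be written like this. It is a sketch only: the function name, the (identifier, confidence) tuple shape, and the handling of a lone candidate (accepted if it clears the threshold) are assumptions, and "beats the runner-up by 2×" is read as a ratio of confidences.

```python
def auto_accept(candidates: list, min_conf: float = 0.85, margin: float = 2.0):
    """Return the identifier to auto-link, or None to queue for manual review.

    `candidates` is a list of (identifier, confidence) pairs.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    top_id, top_conf = ranked[0]
    if top_conf <= min_conf:
        return None  # the top candidate must exceed the threshold
    if len(ranked) > 1 and top_conf < margin * ranked[1][1]:
        return None  # the runner-up is too close
    return top_id
```

With candidates scored 0.9 and 0.4 the top one is accepted; with 0.9 and 0.5 the pair goes to manual review, because 0.9 is less than twice 0.5.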

6. Human review

Low-confidence, single-source, and isolated entities surface in a review queue. Reviewers can approve, reject, edit (whitelisted fields only), merge, or propose an external-database link. Every action is logged in the review_actions table, including old and new values; nothing is silently overwritten.

How to cite

Every entity page has a Cite this entity button that generates APA, MLA, Chicago, BibTeX, and CSL-JSON (Zotero / Mendeley) forms. The dataset is licensed CC-BY-4.0 — please attribute REALMS and link back to the entity URL.

Known limitations

Source language is currently English-only; many primary sources are not in English and are under-represented.
Wikipedia is a secondary source. Where it disagrees with primary ethnography, the archive.org layer should take precedence — but our merge strategy currently treats them symmetrically.
Some entity types (e.g., collective ancestor spirits) are difficult to disambiguate from abstract categories; the LLM occasionally over-extracts. These surface in the flagged queue.
Coverage is uneven: traditions with more digitized public-domain scholarship are over-represented.

Reproducibility

The pipeline source is at github.com/apeters247/realms. Every entity row records its extraction_instances and provenance_sources as arrays of integer IDs, and every source row records its URL, DOI, hash, and fetch date. Machine-readable dataset exports are available at /export/entities.json, /export/relationships.csv, and per-entity at /export/entity/<id>.json / .bib / .csl.json.