The full research protocol
Evidence under adversarial review.
The end-to-end methodology behind every claim audit, peptide grade, and source breakdown. Written to be executable by a human editorial team or by an AI workflow with multiple independent reviewers, adversarial passes, and a final arbitration layer.
The shorter, reader-facing summary lives on /methodology. This page is the full protocol — what we run, in what order, with what gates.
§ 01
TL;DR
Peptigrade should not run on a single “smart model” reading a few papers and improvising a verdict. The protocol is designed around:
- duplicate claim decomposition
- duplicate, transparent search planning and screening
- duplicate independent extraction
- explicit risk-of-bias appraisal
- adversarial challenge from multiple postures
- arbitration with abstention when evidence is thin
- full provenance from verdict back to source text
The default failure mode is not “confident answer from incomplete evidence.” The default failure mode is UNVALIDATED, SPECULATIVE, or Low confidence until the workflow earns a stronger conclusion.
§ 02
Scope — what this protocol is for
Use this protocol for:
- Claim audits — a public claim or popular framing is adjudicated against the literature.
- Peptide × outcome grades — a molecule is graded for a specific use, not as a blanket product.
- Source breakdowns — a podcast, post, paper, or newsletter is decomposed and traced upstream.
Do not use this protocol to:
- provide medical advice
- recommend dosing to an individual
- infer legality from incomplete regulatory information
- validate claims from memory without source retrieval
§ 03
Scientific scaffolding
This protocol is stitched from established frameworks. We do not adopt any one of them wholesale; we borrow the parts that are useful and explicit.
| Framework | What it contributes here |
|---|---|
| GRADE | Certainty-of-evidence logic: downgrade for risk of bias, inconsistency, indirectness, imprecision, and publication bias. |
| Cochrane RoB 2 | Domain-based risk-of-bias assessment for randomized trials. |
| ROBINS-I style reasoning | Structured bias assessment for non-randomized human studies when RCTs do not exist. |
| PRISMA 2020 | Transparent reporting of search, screening, inclusion, exclusion, and evidence flow. |
| Cochrane duplicate extraction | Independent screening and extraction by at least two reviewers, with predefined disagreement resolution. |
| SciFact / FEVER | Atomic claim decomposition, rationale-linked verification, and explicit support / contradiction / not-enough-information logic. |
| FEVER adversarial evaluation | Stress-testing verdicts against omitted evidence, perturbations, and brittle reasoning. |
| PubPeer / Retraction Watch / registries | Post-publication integrity checks, retraction status, and protocol / registration cross-checks. |
Two principles matter most for AI execution:
- Systematic-review discipline beats model eloquence.
- Adversarial review beats single-pass confidence.
§ 04
Core design principles
These are non-negotiable if the protocol is executed by LLMs.
- § I
Evidence first, never memory first
No model may assert that a claim is supported, contradicted, or untested unless the underlying sources were actually retrieved and logged. A model may use prior knowledge to suggest search terms, but not to decide the verdict.
- § II
Separate retrieval, extraction, and judgment
The same pass that retrieves sources should not be trusted to synthesize them without review. Retrieval, extraction, bias appraisal, and final adjudication should be split into distinct steps or agents.
- § III
Require duplicate independent passes for subjective steps
Claim decomposition, search planning, screening, extraction, and final synthesis all contain judgment. They must be run independently by at least two reviewers or two isolated model contexts before arbitration.
- § IV
Force adversarial postures
At least one reviewer must make the strongest defensible case for the claim; at least one must make the strongest case against. A third reviewer looks for mismatches, omissions, and overgeneralization.
- § V
Do not let mechanism substitute for efficacy
Mechanistic plausibility can raise interest. It cannot validate a human efficacy claim on its own.
- § VI
Human claims require human evidence
Animal and in-vitro evidence can support mechanistic or preclinical sub-claims. They cannot validate a claim about efficacy, safety, or speed of effect in humans.
- § VII
Reviews summarize; primary studies decide
Systematic reviews and meta-analyses are high-value synthesis sources, but they do not replace reading the primary studies that drive the conclusion when a claim is contested or high stakes.
- § VIII
Abstention is a valid result
If the search is incomplete, sources conflict, route/dose/population mismatch is large, or evidence is sparse, the workflow should stop at UNVALIDATED, SPECULATIVE, CONTESTED, or Low confidence rather than force a stronger answer.
- § IX
Every public sentence must be traceable
Every sentence in a verdict summary must map to cited sources, specific evidence snippets or extracted fields, and a documented reasoning step in the audit record.
§ 05
Machine-operable artifacts
An AI workflow should emit these artifacts for every reviewed claim.
| Artifact | Purpose | Minimum contents |
|---|---|---|
| Decomposition packets | Preserves independent framing before reconciliation | Decomposer A output, Decomposer B output, qualifier diffs, reconciliation notes |
| Claim manifest | Defines what is being tested | Original claim, normalized claim, population, intervention, comparator, outcome, route, time horizon, claim type, scope key |
| Search plans | Makes query coverage auditable before execution | Plan A, Plan B / contradiction-seeking plan, query families, required source classes, completeness checklist |
| Search ledger | Makes retrieval auditable | Databases, queries, dates, filters, PRISMA-style counts, missing full texts, citation-chasing log |
| Screening log | Shows inclusion / exclusion decisions | Candidate source, include/exclude decision, reason, reviewer IDs |
| Study inventory | Deduplicated source list | PMID / DOI / registry ID, title, design, year, status, duplicate-linking across reports |
| Evidence cards | Structured study extraction | Population, route, dose, comparator, endpoints, timepoint, effect size, adverse events, exact snippets |
| Bias cards | Structured quality appraisal | RoB 2 / ROBINS-I domains, integrity flags, funding, registration, protocol deviations |
| Contradiction map | Prevents one-sided synthesis | Which sources support, contradict, or merely mention each sub-claim, and why |
| Adversarial review log | Preserves disagreement | Steelman, skeptic, mismatch, omitted-evidence cases, unresolved disputes |
| Decision log | Shows how the verdict was reached | Status, confidence, decisive studies, unresolved uncertainty, escalation notes |
| Public claim sentences | Enforces sentence-level provenance | Sentence text, source evidence cards, snippet IDs, allowed strength |
| Protocol compliance report | Makes completeness machine-checkable | Stage, required artifact, gate result, failure response, sign-off |
| Publication packet | What ships | Final summary, citations, status, confidence, claim-review JSON-LD, last verified date, linter result |
Note
If the workflow cannot produce these artifacts, it has not completed the protocol.
§ 06
Claim unit & normalization
Duplicate claim decomposition
Before any retrieval runs, the workflow produces two independent packets — Claim Decomposition A and Claim Decomposition B — splitting the source material into atomic sub-claims, assigning qualifiers, and stating what evidence would count as direct support or contradiction.
The reconciliation step is deterministic:
- compare claim boundaries
- compare population / route / outcome / time qualifiers
- compare claim-type assignments
- merge exact duplicates
- escalate materially different framings instead of averaging them away
If the two decomposers disagree on what the claim is actually asserting, the workflow has not earned the right to search yet.
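The comparison steps above can be sketched as a small deterministic function. This is a minimal illustration, not the production reconciler: the `SubClaimDraft` shape, the `reconcileDecompositions` name, and matching sub-claims by exact normalized text are all assumptions made for the example.

```typescript
// Hypothetical shape for one decomposer's sub-claim; field names are illustrative.
type SubClaimDraft = {
  normalizedText: string;
  claimType: string;
  population?: string;
  route?: string;
  outcome?: string;
  timeHorizon?: string;
};

type Reconciliation = {
  merged: SubClaimDraft[]; // exact duplicates across A and B, kept once
  escalated: string[];     // reasons a framing difference needs review
};

const QUALIFIER_FIELDS = ["claimType", "population", "route", "outcome", "timeHorizon"] as const;

function reconcileDecompositions(a: SubClaimDraft[], b: SubClaimDraft[]): Reconciliation {
  const merged: SubClaimDraft[] = [];
  const escalated: string[] = [];
  const matchedInB: boolean[] = b.map(() => false);

  for (const draftA of a) {
    // Assumption: claim boundaries match when the normalized text is identical.
    let index = -1;
    for (let i = 0; i < b.length; i++) {
      if (b[i].normalizedText === draftA.normalizedText) {
        index = i;
        break;
      }
    }
    if (index === -1) {
      escalated.push(`only in A: ${draftA.normalizedText}`);
      continue;
    }
    matchedInB[index] = true;
    const draftB = b[index];
    const diffs = QUALIFIER_FIELDS.filter((field) => draftA[field] !== draftB[field]);
    if (diffs.length === 0) {
      merged.push(draftA); // exact duplicate: merge
    } else {
      // Different qualifiers or claim type: escalate instead of averaging.
      escalated.push(`qualifier mismatch (${diffs.join(", ")}): ${draftA.normalizedText}`);
    }
  }

  b.forEach((draftB, i) => {
    if (!matchedInB[i]) escalated.push(`only in B: ${draftB.normalizedText}`);
  });
  return { merged, escalated };
}
```

A non-empty `escalated` list would block the search stage under this sketch, which mirrors the rule that materially different framings are escalated rather than silently merged.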
Atomic claim extraction
A claim is the smallest assertion that can be tested against evidence. Long sentences should be split until each sub-claim has a single testable burden.
Yes
Good atomic claims
- “BPC-157 improves tendon healing in rodent Achilles tendon injury models.”
- “BPC-157 improves tendon healing in humans with chronic tendinopathy.”
- “BPC-157 acts through FAK-paxillin signaling in tendon fibroblasts.”
No
Bad atomic claims
- “BPC-157 is the Wolverine peptide.”
- “BPC-157 heals everything fast and safely.”
Every atomic claim is normalized into this frame:
type ClaimManifest = {
  originalText: string;
  normalizedText: string;
  claimType:
    | "empirical"
    | "mechanistic"
    | "comparative"
    | "quantitative"
    | "predictive"
    | "definitional";
  population?: string;
  intervention?: string;
  comparator?: string;
  outcome?: string;
  route?: string;
  timeHorizon?: string;
  scopeKey?: string;
  scopeNotes?: string[];
};

For internal evidence review, the meaningful unit is usually closer to peptide × outcome × population × route × time horizon. Public pages may collapse presentation for readability, but the protocol retains these qualifiers internally. If a qualifier is implicit or unknown, it is marked as such and certainty is capped.
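One way to make "marked as such" concrete is to derive the scope key from the qualifiers and record which ones were missing. The sketch below is an assumption throughout: the `buildScopeKey` name, the `|` separator, and the literal `unknown` placeholder are illustrative choices, not a documented key format.

```typescript
// Hypothetical subset of ClaimManifest fields that define the review unit.
type ScopeQualifiers = {
  intervention?: string;
  outcome?: string;
  population?: string;
  route?: string;
  timeHorizon?: string;
};

const SCOPE_FIELDS = ["intervention", "outcome", "population", "route", "timeHorizon"] as const;

function buildScopeKey(q: ScopeQualifiers): { scopeKey: string; unknownQualifiers: string[] } {
  const unknownQualifiers: string[] = [];
  const parts = SCOPE_FIELDS.map((field) => {
    const value = q[field];
    if (value === undefined || value.trim() === "") {
      // Missing qualifiers are recorded explicitly rather than dropped.
      unknownQualifiers.push(field);
      return "unknown";
    }
    return value.trim().toLowerCase().replace(/\s+/g, "-");
  });
  return { scopeKey: parts.join("|"), unknownQualifiers };
}
```

A downstream step could then cap certainty whenever `unknownQualifiers` is non-empty for a central sub-claim.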
§ 07
Search & screening protocol
Duplicate and adversarial search planning
Search execution does not begin from a single planner’s instincts. Each finalized ClaimManifest produces:
- Search Plan A — direct-support planner optimized to find the best on-point evidence.
- Search Plan B / contradiction-seeking plan — planner optimized to find null, contradictory, safety, integrity, and scope-mismatch evidence.
These plans are reconciled deterministically. The reconciliation verifies that direct-support and contradiction queries exist for every central sub-claim, that safety / integrity / registry / regulatory checks are present where relevant, that synonyms and scope qualifiers are not missing, and that query coverage is logged before execution.
type SearchPlan = {
  subClaimId: string;
  queryFamilies: Array<{
    family:
      | "direct"
      | "contradiction"
      | "safety"
      | "integrity"
      | "registry"
      | "regulatory";
    query: string;
    rationale: string;
  }>;
  requiredSourceClasses: string[];
  coverageChecklist: string[];
};

Retrieval completeness checklist
For each sub-claim, the planner explicitly considers whether the query set covers:
- canonical peptide name
- synonyms, aliases, development codes
- common misspellings and transliterations
- route terms when route matters
- outcome and endpoint synonyms
- population / indication / disease terms
- comparator or control terminology
- dose or exposure terminology
- human clinical terms
- animal model terms for preclinical sub-claims
- target / pathway terms for mechanistic sub-claims
- negative, null, failed, no-effect terms
- safety and adverse-event terms
- retraction, correction, expression-of-concern terms
- registry identifiers and trial terminology
- regulatory body names and status terminology
“Not searched because irrelevant” is acceptable. Silent omission is not.
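The "no silent omission" rule lends itself to a deterministic check. The sketch below assumes each checklist item gets a logged decision, either covered by at least one query or skipped with a stated reason; the `ChecklistDecision` type and `findSilentOmissions` function are hypothetical names for illustration.

```typescript
// Hypothetical record of how the planner handled one checklist item.
type ChecklistDecision =
  | { item: string; decision: "covered"; queries: string[] }
  | { item: string; decision: "not_searched"; reason: string };

function findSilentOmissions(required: string[], decisions: ChecklistDecision[]): string[] {
  const problems: string[] = [];
  for (const item of required) {
    const logged = decisions.filter((d) => d.item === item);
    if (logged.length === 0) {
      // Nothing recorded at all: this is the silent omission the rule forbids.
      problems.push(`${item}: no decision logged`);
      continue;
    }
    for (const d of logged) {
      if (d.decision === "covered" && d.queries.length === 0) {
        problems.push(`${item}: marked covered without a query`);
      }
      if (d.decision === "not_searched" && d.reason.trim() === "") {
        problems.push(`${item}: skipped without a reason`);
      }
    }
  }
  return problems;
}
```

An empty result means every checklist item was either searched or explicitly waived with a reason; anything else would send the plan back to search planning.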
Databases and source classes
At minimum, the workflow searches or checks:
- PubMed / MEDLINE
- trial registries (ClinicalTrials.gov and regional equivalents)
- Crossref or DOI resolution
- PubPeer
- Retraction Watch or equivalent retraction source
- relevant regulatory bodies (FDA, EMA, WADA, etc.)
Preprints may be included only when methodology is inspectable, no peer-reviewed equivalent exists, and the verdict is explicitly capped for certainty.
Citation chasing and record linkage
For central claims and decisive studies, retrieval is not complete after keyword search alone. The workflow performs and logs:
- backward citation review of decisive reviews and pivotal primary studies
- forward citation review of decisive primary studies
- registry-to-publication matching for registered trials
- duplicate-report linking across abstracts, articles, supplements, and registry records
- dead-end logging when cited primary sources cannot be retrieved in full text
PRISMA-style search ledger
type SearchDatabaseRecord = {
  database: string;
  queryFamily:
    | "direct"
    | "contradiction"
    | "safety"
    | "integrity"
    | "registry"
    | "regulatory";
  query: string;
  searchedAt: string;
  filters?: string[];
  retrieved: number;
};

type SearchLedger = {
  subClaimId: string;
  databases: SearchDatabaseRecord[];
  totalRetrieved: number;
  afterDeduplication: number;
  screened: number;
  excluded: number;
  included: number;
  fullTextUnavailable: number;
  decisiveSourcesMissingFullText: string[];
  citationChasing: {
    backwardFrom: string[];
    forwardFrom: string[];
    registryMatched: string[];
    duplicateReportsLinked: string[];
  };
};

Screening rules
Screening runs independently in two isolated contexts.
Yes
Include when a source
- directly tests the normalized claim
- informs a necessary upstream premise
- is needed to resolve contradiction or integrity concerns
No
Exclude when a source
- is clearly off-topic
- only repeats another source without new data
- is vendor marketing copy posing as research
- is anecdotal context that cannot raise evidence quality
Every exclusion must have a reason code. Screening is not complete until A/B decisions reconcile, exclusion reason codes are finalized, PRISMA counts reconcile to the deduplicated inventory, and decisive full-text gaps are logged.
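A count-reconciliation check could look like the sketch below. The arithmetic identities are assumptions about how the ledger fields relate (per-database counts sum to the total, every deduplicated record is screened, and each screened record is either included or excluded); the protocol text does not spell them out, so treat them as one plausible reading.

```typescript
// Hypothetical flattened view of the SearchLedger counts.
type LedgerCounts = {
  perDatabaseRetrieved: number[];
  totalRetrieved: number;
  afterDeduplication: number;
  screened: number;
  excluded: number;
  included: number;
};

function checkLedgerCounts(ledger: LedgerCounts): string[] {
  const errors: string[] = [];
  const summed = ledger.perDatabaseRetrieved.reduce((acc, n) => acc + n, 0);
  if (summed !== ledger.totalRetrieved) {
    errors.push(`per-database counts sum to ${summed}, ledger says ${ledger.totalRetrieved}`);
  }
  if (ledger.afterDeduplication > ledger.totalRetrieved) {
    errors.push("deduplicated count exceeds total retrieved");
  }
  // Assumption: every record in the deduplicated inventory is screened.
  if (ledger.screened !== ledger.afterDeduplication) {
    errors.push("screened count does not match the deduplicated inventory");
  }
  // Assumption: screening yields exactly one include/exclude decision per record.
  if (ledger.included + ledger.excluded !== ledger.screened) {
    errors.push("included + excluded does not equal screened");
  }
  return errors;
}
```

Because the check involves no model judgment, it can run as a hard gate before extraction begins.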
§ 08
Study extraction
Duplicate extraction
Study characteristics and outcome data are extracted independently by two reviewers. This is especially important for primary outcomes, adverse events, effect sizes, route and dose, population definitions, and follow-up duration.
For AI workflows, “independent” means the second extractor must not see the first extractor’s filled form before submitting its own.
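Once both extractors have submitted, their forms can be compared field by field before reconciliation. The sketch below is illustrative only: `CriticalFields` is a hypothetical subset of the evidence card, and `diffExtractions` uses strict equality, so real workflows would likely need normalization (units, whitespace) first.

```typescript
// Hypothetical subset of evidence-card fields treated as critical.
type CriticalFields = {
  endpoint: string;
  route?: string;
  dose?: string;
  n?: number;
  effectSize?: string;
  adverseEvents?: string;
};

type FieldDiscrepancy = { field: string; a: string; b: string };

const CRITICAL_FIELDS = ["endpoint", "route", "dose", "n", "effectSize", "adverseEvents"] as const;

function diffExtractions(a: CriticalFields, b: CriticalFields): FieldDiscrepancy[] {
  const discrepancies: FieldDiscrepancy[] = [];
  for (const field of CRITICAL_FIELDS) {
    const valueA = a[field];
    const valueB = b[field];
    if (valueA !== valueB) {
      discrepancies.push({
        field,
        a: valueA === undefined ? "(missing)" : String(valueA),
        b: valueB === undefined ? "(missing)" : String(valueB),
      });
    }
  }
  return discrepancies;
}
```

Each discrepancy would then be reconciled against the source text or escalated, never resolved by picking one extractor's value at random.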
Evidence-card fields
type EvidenceCard = {
  studyId: string;
  design:
    | "systematic_review"
    | "meta_analysis"
    | "rct"
    | "nonrandomized_human"
    | "case_series"
    | "animal"
    | "in_vitro"
    | "preprint"
    | "registry";
  population: string;
  n?: number;
  route?: string;
  dose?: string;
  comparator?: string;
  endpoint: string;
  endpointType: "clinical" | "surrogate" | "mechanistic" | "safety";
  timepoint?: string;
  resultSummary: string;
  effectSize?: string;
  adverseEvents?: string;
  exactQuotesOrSnippets: string[];
};

The extractor must label whether the endpoint is clinical, surrogate, mechanistic, or safety. This prevents mechanistic wins from being mistaken for clinical wins.
§ 09
Bias & integrity assessment
Study-design ladder
When evidence conflicts, we prefer:
- systematic reviews / meta-analyses of high-quality human trials
- registered, peer-reviewed human RCTs
- non-randomized human comparative studies
- case series / uncontrolled pilots
- animal in-vivo studies
- in-vitro / cell / mechanistic studies
- expert opinion without new data
- anecdote / testimony
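The ladder above can be encoded as an ordered list so that preference between conflicting sources is computed rather than improvised. This is a sketch under assumptions: the tier identifiers and the `RankedSource` type are invented for the example, and the ordering only breaks ties on design. It does not replace the bias appraisal described next.

```typescript
// Tier identifiers are hypothetical labels for the ladder rungs, strongest first.
const DESIGN_LADDER = [
  "systematic_review_or_meta_analysis",
  "registered_human_rct",
  "nonrandomized_human_comparative",
  "case_series_or_uncontrolled_pilot",
  "animal_in_vivo",
  "in_vitro_or_mechanistic",
  "expert_opinion",
  "anecdote",
] as const;

type DesignTier = (typeof DESIGN_LADDER)[number];
type RankedSource = { id: string; tier: DesignTier };

// Lower rank means higher on the ladder.
function ladderRank(tier: DesignTier): number {
  return DESIGN_LADDER.indexOf(tier);
}

// Returns the source preferred on design alone; keeps `a` when tiers are equal.
function preferredSource(a: RankedSource, b: RankedSource): RankedSource {
  return ladderRank(b.tier) < ladderRank(a.tier) ? b : a;
}
```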
Domain-based bias appraisal
For randomized trials we use RoB 2 style domains:
- randomization process
- deviations from intended interventions
- missing outcome data
- outcome measurement
- selective reporting
For non-randomized human studies we use ROBINS-I style reasoning:
- confounding
- participant selection
- intervention classification
- deviations from intended interventions
- missing data
- outcome measurement
- selective reporting
For every included study, we also flag:
- preregistration or registry record present / absent
- funding source and sponsor
- author conflicts
- replication status
- retraction status
- PubPeer or major integrity concerns
Integrity override
Note
A source with serious integrity concerns is not silently averaged into the body of evidence. It is flagged, discussed explicitly, and down-weighted or excluded by rule.
§ 10
Adversarial review architecture
This is the part that makes the protocol AI-ready instead of AI-flavored.
Required roles
| Role | Task |
|---|---|
| Decomposer A | Independently splits the source claim into atomic sub-claims and normalizes them |
| Decomposer B | Independently repeats decomposition without seeing A |
| Search planner A | Builds a direct-support query plan for each sub-claim |
| Search planner B / contradiction-seeking planner | Builds null, contradictory, safety, integrity, and scope-mismatch queries |
| Retriever / query executor | Runs the reconciled search plan and builds the search ledger |
| Screening reviewer A | Independently screens candidate records |
| Screening reviewer B | Independently screens the same records without seeing A |
| Extractor A | Independently extracts structured evidence cards |
| Extractor B | Independently extracts the same studies without seeing A |
| Bias auditor | Scores risk of bias and integrity concerns |
| Steelman reviewer | Makes the strongest defensible case that the claim is true |
| Skeptic reviewer | Makes the strongest defensible case that the claim is false or overstated |
| Mismatch reviewer | Looks for population / route / dose / endpoint / timeframe mismatches |
| Omitted-evidence reviewer | Searches specifically for null, contradictory, or inconvenient evidence |
| Arbitrator | Resolves disagreements, assigns status, assigns confidence, writes the decision log |
| Publication linter | Runs deterministic wording and provenance checks before anything ships |
If a single model is used for all roles, contexts are isolated and prior outputs are not revealed until arbitration. Better still, use heterogeneous models, or at least heterogeneous prompting and retrieval contexts.
Required adversarial questions
Before a verdict is finalized, the workflow must answer:
- What is the strongest evidence for the claim?
- What is the strongest evidence against the claim?
- Are the supportive studies actually testing the same population, route, dose, comparator, and endpoint as the claim?
- Are supportive results clinical outcomes or only mechanistic / surrogate outcomes?
- Are contradictory or null studies being omitted because they are harder to explain?
- Are reviews being used to smuggle in unsupported primary-study conclusions?
- Does the claim generalize beyond the studied tissue, route, timeframe, or population?
- Is the verdict being driven by one lab, one paper, or one uncontrolled pilot?
- Is there any integrity signal that should cap certainty regardless of apparent effect?
Overstatement check
Many public claims are not cleanly true or false. They are partly grounded, then overstated. For composite claims, the workflow computes the formal status of each sub-claim and determines whether the public framing overgeneralizes the evidence.
OVERSTATED exists precisely so a composite claim is not collapsed into VALIDATED or FALSIFIED when the fairer answer is “the underlying phenomenon exists in narrow scope, but the public framing outruns the evidence.”
§ 11
Unified workflow
Every claim audit runs through these stages in order.
- Stage 01
Run duplicate claim decomposition
- Input
- source artifact · verbatim claim text · target population (if known)
- Output
- Decomposition A · Decomposition B · reconciled claim manifest · atomic sub-claims
- Gate
- every sub-claim is testable; route / outcome / population explicit; material framing differences reconciled or escalated
- Stage 02
Duplicate and adversarial search planning
- Input
- reconciled claim manifest · atomic sub-claims
- Output
- Search Plan A · Search Plan B / contradiction-seeking plan · reconciled query set
- Gate
- direct-support and contradiction queries for every central sub-claim; coverage checklist logged; required source classes explicit
- Stage 03
Transparent retrieval and screening
- Input
- normalized sub-claims · reconciled search plan
- Output
- search ledger · candidate sources · screening log
- Gate
- search terms saved; contradictory-evidence search explicit; A/B screening reconciled; PRISMA counts reconcile; full-text gaps logged
- Stage 04
Trace provenance recursively
- Input
- screened source set · each claim
- Output
- classification per claim: Direct · Inherited · Speculative
- Gate
- primary source reached or dead end recorded; inherited claims not credited as direct evidence
- Stage 05
Extract evidence in duplicate
- Input
- included studies
- Output
- reconciled evidence cards
- Gate
- discrepancies reconciled or escalated; multiple reports merged; exact snippets back every important extracted claim
- Stage 06
Appraise bias and integrity
- Input
- evidence cards
- Output
- bias cards with design classification, domain-based assessment, integrity check, replication note
- Gate
- the workflow can explain why higher- and lower-quality studies were weighted differently
- Stage 07
Build support and contradiction maps
- Input
- evidence cards · bias cards
- Output
- per sub-claim: supports · contradicts · mentions — plus directness and evidence-type annotations
- Gate
- every sub-claim has a contradiction map, even if empty
- Stage 08
Run adversarial review
- Input
- all prior artifacts
- Output
- steelman, skeptic, mismatch, and omitted-evidence reviews
- Gate
- at least one adversarial pass challenged the favored interpretation; unresolved disagreements logged, not buried
- Stage 09
Adjudicate status and confidence
- Input
- evidence cards · bias cards · maps · adversarial findings
- Output
- sub-claim statuses · overall status · confidence
- Gate
- final summary does not outrun the evidence cards; confidence capped by unresolved gaps, directness problems, or integrity concerns
- Stage 10
Publish with provenance and linting
- Input
- decision log
- Output
- verdict summary · what you can / cannot say · citations · evidence chain · last verified date · structured data · sentence-level provenance · linter report
- Gate
- every public sentence maps to evidence cards and snippets; linter checks pass or are explicitly waived by policy; wording does not exceed allowed strength
§ 12
Protocol compliance matrix
The minimum machine-checkable completion spec for a run.
| Stage | Required artifact | Acceptance gate | Failure response |
|---|---|---|---|
| Claim decomposition | DecompositionPacket[], ClaimManifest | Central claims have population, intervention, outcome, route, and time qualifiers where relevant | Re-enter decomposition |
| Search planning | SearchPlan[] | Direct and contradiction-seeking queries logged; coverage checklist complete | Re-enter search planning |
| Retrieval and screening | SearchLedger, ScreeningLog | A/B screening reconciled; PRISMA counts reconcile; full-text gaps logged | Re-enter retrieval or screening |
| Provenance tracing | Provenance trace records | Central claims reach primary source or dead end is logged | Re-enter provenance tracing |
| Extraction | EvidenceCard[] | Critical fields reconciled; exact snippets captured | Re-enter extraction reconciliation |
| Bias and integrity | BiasCard[] | Decisive sources receive bias and integrity review | Re-enter bias appraisal |
| Contradiction mapping | ContradictionMap | Every sub-claim mapped to support, contradiction, or mention | Re-enter contradiction mapping |
| Adversarial review | AdversarialReviewLog | Steelman, skeptic, mismatch, and omitted-evidence passes complete | Re-enter adversarial review |
| Arbitration | DecisionLog | Verdict traceable to evidence cards and capped by rule | Re-enter adjudication or escalate |
| Publication | PublicClaimSentence[], PublicationPacket, linter output | Every sentence has provenance; wording checks pass | Re-enter publication packet only |
§ 13
Publication & provenance controls
Sentence-level provenance is a first-class object, not a formatting afterthought.
type PublicClaimSentence = {
  sentenceId: string;
  text: string;
  sourceEvidenceCardIds: string[];
  sourceSnippetIds: string[];
  allowedStrength:
    | "directly_supported"
    | "qualified_support"
    | "context_only"
    | "not_allowed";
};

The publication packet refuses to ship sentences marked not_allowed or sentences whose wording exceeds the allowed strength of their linked evidence.
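The refusal rule could be enforced with a gate like the following sketch. It assumes an extra `wordingStrength` field, assigned upstream by a reviewer or classifier, that records how strongly the sentence is actually worded; that field, the `SentenceForGate` type, and the strength ordering are hypothetical additions for illustration.

```typescript
type AllowedStrength = "directly_supported" | "qualified_support" | "context_only" | "not_allowed";

// Hypothetical gate input: PublicClaimSentence fields plus an assumed wordingStrength.
type SentenceForGate = {
  sentenceId: string;
  sourceEvidenceCardIds: string[];
  sourceSnippetIds: string[];
  allowedStrength: AllowedStrength;
  wordingStrength: Exclude<AllowedStrength, "not_allowed">;
};

// Assumed ordering, weakest to strongest.
const STRENGTH_ORDER: AllowedStrength[] = [
  "not_allowed",
  "context_only",
  "qualified_support",
  "directly_supported",
];

function blockedSentences(sentences: SentenceForGate[]): string[] {
  const blocked: string[] = [];
  for (const s of sentences) {
    const noProvenance = s.sourceEvidenceCardIds.length === 0 || s.sourceSnippetIds.length === 0;
    const tooStrong =
      STRENGTH_ORDER.indexOf(s.wordingStrength) > STRENGTH_ORDER.indexOf(s.allowedStrength);
    if (s.allowedStrength === "not_allowed" || noProvenance || tooStrong) {
      blocked.push(s.sentenceId);
    }
  }
  return blocked;
}
```

Under this sketch the packet ships only when `blockedSentences` returns an empty list.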
Publication linter
Before publication, deterministic wording checks run:
- no dosing advice
- no individualized medical advice
- no unsupported safety reassurance
- no “proven,” “clinically established,” or “safe” unless policy and evidence status allow it
- no human efficacy wording from animal-only or in-vitro evidence
- no review article cited as if it were primary evidence
- no factual biomedical sentence without citation coverage
- no sentence stronger than the weakest central sub-claim permits
Any failure blocks publication or requires an explicit, logged override.
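Some of these checks reduce to plain pattern matching. The sketch below covers only the restricted-term rule ("proven," "clinically established," "safe"); the rule identifiers, the `lintSentence` function, and the policy-override parameter are assumptions, and the other checks in the list would need evidence-card lookups that are not shown here.

```typescript
type LintFinding = { rule: string; match: string };

// Rule names are hypothetical; the terms come from the restricted-wording check.
const RESTRICTED_TERMS: { rule: string; pattern: RegExp }[] = [
  { rule: "no-proven", pattern: /\bproven\b/i },
  { rule: "no-clinically-established", pattern: /\bclinically established\b/i },
  { rule: "no-safe", pattern: /\bsafe\b/i }, // word boundary: does not match "safety" or "unsafe"
];

function lintSentence(text: string, rulesAllowedByPolicy: string[] = []): LintFinding[] {
  const findings: LintFinding[] = [];
  for (const term of RESTRICTED_TERMS) {
    // A rule is skipped only when policy and evidence status explicitly allow it.
    if (rulesAllowedByPolicy.indexOf(term.rule) !== -1) continue;
    const m = term.pattern.exec(text);
    if (m !== null) findings.push({ rule: term.rule, match: m[0] });
  }
  return findings;
}
```

Any non-empty findings list would block publication unless an explicit override is logged.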
§ 14
Status taxonomy
Eight formal statuses:
| Status | Meaning |
|---|---|
| VALIDATED | Multiple independent, reasonably high-quality sources directly support the claim in the relevant scope. |
| CONTESTED | Comparable evidence points in different directions and the disagreement is material, not superficial. |
| UNVALIDATED | The claim has not been tested adequately enough to justify support or falsification. |
| OVERSTATED | Composite popular framing extends beyond what the underlying evidence supports. Parts may be validated in narrow scope; the framing as stated is not. |
| FALSIFIED | Better evidence directly contradicts the claim, or central sub-claims are directly false. Requires replicated, positively contradictory evidence. |
| WITHDRAWN | The supporting source base is retracted or fatally compromised. |
| DEPENDENT | The claim is only true if one or more upstream premises are true, and those premises remain unsettled. |
| SPECULATIVE | No traceable source adequately supports the exact proposition. |
Decision rules for composite claims
Do not use a naive “worst sub-claim wins” rule. Instead:
- identify the central necessary sub-claims
- identify the supporting but non-central sub-claims
- classify each sub-claim separately
- assign the overall status based on the central necessary sub-claims plus any material overstatement
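A simplified version of these rules can be written down to show how they differ from "worst sub-claim wins." Everything in the sketch is an assumption about one reasonable encoding: the `SubClaimResult` shape, the `NEEDS_ARBITRATION` sentinel, and the specific rule that overstatement applies only when at least one central sub-claim is VALIDATED. The actual arbitration step involves judgment that this does not capture.

```typescript
type ClaimStatus =
  | "VALIDATED"
  | "CONTESTED"
  | "UNVALIDATED"
  | "OVERSTATED"
  | "FALSIFIED"
  | "WITHDRAWN"
  | "DEPENDENT"
  | "SPECULATIVE";

// Hypothetical per-sub-claim result; `central` marks central necessary sub-claims.
type SubClaimResult = { id: string; central: boolean; status: ClaimStatus };

function overallStatus(
  results: SubClaimResult[],
  framingOverstates: boolean
): ClaimStatus | "NEEDS_ARBITRATION" {
  // Non-central sub-claims are classified but do not drive the overall status.
  const central = results.filter((r) => r.central);
  if (central.length === 0) return "NEEDS_ARBITRATION";

  const anyValidated = central.some((r) => r.status === "VALIDATED");
  if (framingOverstates && anyValidated) return "OVERSTATED";

  const first = central[0].status;
  const uniform = central.every((r) => r.status === first);
  // Mixed central statuses are handed to the arbitrator instead of auto-resolved.
  return uniform ? first : "NEEDS_ARBITRATION";
}
```

Note that a FALSIFIED non-central sub-claim does not drag the overall status down in this sketch, which is the behavior the rule against "worst sub-claim wins" asks for.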
§ 15
Confidence taxonomy
Confidence is separate from status.
| Confidence | When to assign |
|---|---|
| High | Search broad and explicit; evidence directly on point; disagreement limited; major bias concerns resolved; replication status reasonably clear. |
| Moderate | Core evidence present, but gaps in directness, search completeness, replication, or bias resolution. |
| Low | Retrieval gaps, dead ends, sparse studies, serious indirectness, or unresolved integrity problems materially weaken certainty. |
Confidence is judged on five axes: search completeness, directness, quality / bias profile, consistency, and integrity / provenance clarity. The final confidence does not exceed the weakest material axis.
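The weakest-axis rule amounts to taking a minimum over ordered levels. The sketch below assumes every axis is scored on the same Low / Moderate / High scale and that all five axes are material; the `ConfidenceAxes` field names and the `confidenceCeiling` function are illustrative.

```typescript
type ConfidenceLevel = "Low" | "Moderate" | "High";

// Hypothetical per-axis scores, one per axis named in the text.
type ConfidenceAxes = {
  searchCompleteness: ConfidenceLevel;
  directness: ConfidenceLevel;
  qualityBias: ConfidenceLevel;
  consistency: ConfidenceLevel;
  integrityProvenance: ConfidenceLevel;
};

const LEVEL_ORDER: ConfidenceLevel[] = ["Low", "Moderate", "High"];

// Returns the highest confidence the arbitrator may assign: the weakest axis.
function confidenceCeiling(axes: ConfidenceAxes): ConfidenceLevel {
  const levels: ConfidenceLevel[] = [
    axes.searchCompleteness,
    axes.directness,
    axes.qualityBias,
    axes.consistency,
    axes.integrityProvenance,
  ];
  let weakest = LEVEL_ORDER.length - 1;
  for (const level of levels) {
    weakest = Math.min(weakest, LEVEL_ORDER.indexOf(level));
  }
  return LEVEL_ORDER[weakest];
}
```

The result is a ceiling, not a verdict: the arbitrator may still assign something lower.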
Confidence caps and status ceilings
| Condition | Max status | Max confidence | Required action |
|---|---|---|---|
| Human efficacy claim supported only by animal or in-vitro evidence | UNVALIDATED or OVERSTATED | Low | State preclinical scope explicitly |
| Decisive source lacks full text | Depends | Low | Log the gap and avoid strong verdicts |
| One uncontrolled human case series only | Usually UNVALIDATED | Low or Moderate | Explain directness and bias limits |
| Comparable direct support and contradiction exist | CONTESTED | Moderate | Map the contradiction explicitly |
| Serious unresolved integrity concern | WITHDRAWN, UNVALIDATED, or hold | Low | Down-weight, exclude, or escalate |
| Review-only support while readable primary studies remain unread | Block verdict | None | Read the primary studies first |
| Route, dose, population, or time qualifiers missing but central | UNVALIDATED | Low | Narrow the claim or rerun retrieval |
| Preprint-only decisive support | Depends | Low | Mark status as provisional and cap certainty |
§ 16
Hard rules & abstention
Hard rules
- No human efficacy claim may be marked VALIDATED on animal or in-vitro evidence alone.
- No claim may be marked FALSIFIED merely because support is absent; absence of evidence is usually UNVALIDATED.
- No review article may be the sole basis for a decisive status if the primary studies are available.
- No final verdict may ignore contradictory evidence found in search.
- No study may be counted twice across multiple reports.
- No final summary may introduce claims absent from the evidence cards.
Abstention triggers
Cap the verdict at UNVALIDATED or Low confidence when:
- full text is not available for decisive sources
- the claim depends on route / dose / population details not reported in the literature
- only uncontrolled pilots exist
- only one lab drives the literature
- the protocol / registry record materially conflicts with the paper and cannot be reconciled
- serious integrity concerns remain unresolved
§ 17
Calibration & regression testing
Treat the protocol itself as something that must be measured. We track at least:
- decisive-source recall
- false inclusion rate
- false exclusion rate
- critical-field extraction accuracy
- verdict agreement with expert benchmark sets
- rate of overconfident wrong verdicts
- disagreement rates across decomposition, search planning, screening, and extraction
Watch
Near-perfect agreement is not automatically a success condition. If supposedly independent reviewers agree almost all the time, verify the pipeline is not leaking context or collapsing into prompt coupling. Thresholds are calibrated empirically against benchmark claims — not treated as universal constants.
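A disagreement rate for any paired A/B step can be computed directly from the two decision lists. In the sketch below, the function names and the idea of a minimum expected disagreement are assumptions; the floor itself would have to come from benchmark calibration, as the note above says, so no default value is suggested.

```typescript
// Fraction of paired decisions on which reviewers A and B differ.
function disagreementRate(a: string[], b: string[]): number {
  if (a.length !== b.length) throw new Error("decision lists must be the same length");
  if (a.length === 0) throw new Error("no decisions to compare");
  let differing = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) differing++;
  }
  return differing / a.length;
}

// Flags runs where reviewers agree more often than calibration says is plausible,
// which may indicate leaked context rather than genuine consensus.
function looksSuspiciouslyAligned(rate: number, minExpectedDisagreement: number): boolean {
  return rate < minExpectedDisagreement;
}
```

A flagged run is a prompt to audit context isolation, not proof that the reviewers were coupled.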
§ 18
Operating modes
Both modes use the same machinery above.
Path 1
Forward · Claim → Evidence
Start with a public claim. Decompose it, retrieve evidence, challenge it adversarially, then publish a claim audit.
- Entry
- A claim such as “BPC-157 has wolverine-like effects”
- Output
- /claims/[slug]
Path 2
Backward · Artifact → Recursive origins
Start with an artifact. Decompose every claim inside it, trace each upstream, and show which parts are direct, inherited, speculative, or overstated.
- Entry
- A podcast, newsletter, paper, or post
- Output
- /breakdowns/[slug]
§ 19
Practical implementation notes for LLM workflows
Independence
If you want fair adversarial review, do not let later reviewers inherit earlier reviewers’ conclusions. Shared context creates correlated errors.
Deterministic checks
Use non-LLM validation wherever possible for:
- DOI / PMID resolution
- deduplication
- publication dates
- registry status
- retraction status
- schema validation of outputs
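As one example of a non-LLM check, identifier strings can be classified and normalized before any network lookup. The sketch below validates format only; actual DOI or PMID resolution would require a call to an external service and is not shown. The `classifyIdentifier` name and the patterns are assumptions (a DOI-like prefix of `10.` followed by 4 to 9 digits and a slash, and a PMID as a positive integer).

```typescript
type IdentifierCheck = {
  raw: string;
  kind: "doi" | "pmid" | "unknown";
  normalized: string | null;
};

function classifyIdentifier(raw: string): IdentifierCheck {
  const trimmed = raw.trim();
  // Strip common resolver and label prefixes before testing the DOI shape.
  const doiCandidate = trimmed
    .replace(/^https?:\/\/(dx\.)?doi\.org\//i, "")
    .replace(/^doi:\s*/i, "");
  if (/^10\.\d{4,9}\/\S+$/.test(doiCandidate)) {
    // DOIs are case-insensitive, so lowercase to make deduplication reliable.
    return { raw, kind: "doi", normalized: doiCandidate.toLowerCase() };
  }
  if (/^[1-9]\d*$/.test(trimmed)) {
    return { raw, kind: "pmid", normalized: trimmed };
  }
  return { raw, kind: "unknown", normalized: null };
}
```

Normalized identifiers also give deduplication a stable key, so the same study cited by URL and by bare DOI is not counted twice.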
Calibration
Prompt the arbitrator to justify why a stronger status was earned rather than defaulting to the weaker one. If that justification is thin, downgrade.
Escalation
Escalate to human review when:
- the claim is legally or medically high stakes
- the studies are highly technical and difficult to normalize
- the adversarial reviewers disagree on central sub-claims
- integrity concerns are non-trivial
§ 20
Reference anchors
This protocol is grounded in:
- GRADE certainty-of-evidence logic
- Cochrane RoB 2 and ROBINS-I style bias assessment
- PRISMA 2020 reporting discipline
- Cochrane duplicate independent extraction practice
- SciFact scientific claim verification
- FEVER claim verification and FEVER adversarial evaluation
- post-publication integrity checks via PubPeer, retraction tracking, and trial registries
These anchors matter because they push the workflow toward transparency over vibes, duplication over single-pass confidence, adversarial challenge over confirmation bias, and abstention over overclaiming.
§ 21
Review cadence
The canonical cadence SLAs — per-trigger response times (same-day, 7-day, 30-day, quarterly, annual) — live in the internal grading-and-reassessment protocol. This is the short version; if the two disagree, the internal protocol wins.
- Re-run a claim audit within 30 days when new peer-reviewed evidence materially affects a central sub-claim.
- Re-run immediately (same day) when a decisive source is retracted, corrected, or flagged for serious integrity issues.
- Re-run same day when a major regulatory change alters the real-world status of the molecule or use case.
- Re-run stale claims at least annually even if no trigger fires, and run a quarterly housekeeping audit across the catalog.
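The triggers above map naturally to a deadline calculation. This sketch mirrors only the short version on this page: the trigger names, the `rerunDueBy` function, and treating "annually" as 365 days from the reference date are assumptions, and the internal protocol's SLAs take precedence wherever they differ.

```typescript
// Hypothetical trigger labels for the re-run rules listed above.
type ReviewTrigger =
  | "new_evidence"       // new peer-reviewed evidence affecting a central sub-claim
  | "integrity_event"    // retraction, correction, or serious integrity flag
  | "regulatory_change"  // major regulatory change
  | "annual_staleness";  // no trigger fired since last verification

const RESPONSE_DAYS: Record<ReviewTrigger, number> = {
  new_evidence: 30,
  integrity_event: 0,
  regulatory_change: 0,
  annual_staleness: 365, // assumption: "annually" approximated as 365 days
};

// Returns the latest acceptable re-run time, computed in UTC milliseconds.
// For annual_staleness, pass the last verified date as `observedAt`.
function rerunDueBy(trigger: ReviewTrigger, observedAt: Date): Date {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(observedAt.getTime() + RESPONSE_DAYS[trigger] * msPerDay);
}
```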
Last updated: 2026-04-20