The full research protocol

Evidence under adversarial review.

The end-to-end methodology behind every claim audit, peptide grade, and source breakdown. Written to be executable by a human editorial team or by an AI workflow with multiple independent reviewers, adversarial passes, and a final arbitration layer.

The shorter, reader-facing summary lives on /methodology. This page is the full protocol — what we run, in what order, with what gates.

§ 01

TL;DR

Peptigrade should not run on a single “smart model” reading a few papers and improvising a verdict. The protocol is designed around:

  • duplicate claim decomposition
  • duplicate and transparent search planning and screening
  • duplicate independent extraction
  • explicit risk-of-bias appraisal
  • adversarial challenge from multiple postures
  • arbitration with abstention when evidence is thin
  • full provenance from verdict back to source text

The default failure mode is not “confident answer from incomplete evidence.” It is UNVALIDATED, SPECULATIVE, or Low confidence until the workflow earns a stronger conclusion.

§ 02

Scope — what this protocol is for

Use this protocol for:

  • Claim audits — a public claim or popular framing is adjudicated against the literature.
  • Peptide × outcome grades — a molecule is graded for a specific use, not as a blanket product.
  • Source breakdowns — a podcast, post, paper, or newsletter is decomposed and traced upstream.

Do not use this protocol to:

  • provide medical advice
  • recommend dosing to an individual
  • infer legality from incomplete regulatory information
  • validate claims from memory without source retrieval

§ 03

Scientific scaffolding

This protocol is stitched from established frameworks. We do not adopt any one of them wholesale; we borrow the parts that are useful and explicit.

Framework | What it contributes here
GRADE | Certainty-of-evidence logic: downgrade for risk of bias, inconsistency, indirectness, imprecision, and publication bias.
Cochrane RoB 2 | Domain-based risk-of-bias assessment for randomized trials.
ROBINS-I style reasoning | Structured bias assessment for non-randomized human studies when RCTs do not exist.
PRISMA 2020 | Transparent reporting of search, screening, inclusion, exclusion, and evidence flow.
Cochrane duplicate extraction | Independent screening and extraction by at least two reviewers, with predefined disagreement resolution.
SciFact / FEVER | Atomic claim decomposition, rationale-linked verification, and explicit support / contradiction / not-enough-information logic.
FEVER adversarial evaluation | Stress-testing verdicts against omitted evidence, perturbations, and brittle reasoning.
PubPeer / Retraction Watch / registries | Post-publication integrity checks, retraction status, and protocol / registration cross-checks.

Two principles matter most for AI execution:

  1. Systematic-review discipline beats model eloquence.
  2. Adversarial review beats single-pass confidence.

§ 04

Core design principles

These are non-negotiable if the protocol is executed by LLMs.

  1. § I

    Evidence first, never memory first

    No model may assert that a claim is supported, contradicted, or untested unless the underlying sources were actually retrieved and logged. A model may use prior knowledge to suggest search terms, but not to decide the verdict.

  2. § II

    Separate retrieval, extraction, and judgment

    The same pass that retrieves sources should not be trusted to synthesize them without review. Retrieval, extraction, bias appraisal, and final adjudication should be split into distinct steps or agents.

  3. § III

    Require duplicate independent passes for subjective steps

    Claim decomposition, search planning, screening, extraction, and final synthesis all contain judgment. They must be run independently by at least two reviewers or two isolated model contexts before arbitration.

  4. § IV

    Force adversarial postures

    At least one reviewer must make the strongest defensible case for the claim; at least one must make the strongest case against. A third reviewer looks for mismatches, omissions, and overgeneralization.

  5. § V

    Do not let mechanism substitute for efficacy

    Mechanistic plausibility can raise interest. It cannot validate a human efficacy claim on its own.

  6. § VI

    Human claims require human evidence

    Animal and in-vitro evidence can support mechanistic or preclinical sub-claims. They cannot validate a claim about efficacy, safety, or speed of effect in humans.

  7. § VII

    Reviews summarize; primary studies decide

    Systematic reviews and meta-analyses are high-value synthesis sources, but they do not replace reading the primary studies that drive the conclusion when a claim is contested or high stakes.

  8. § VIII

    Abstention is a valid result

    If the search is incomplete, sources conflict, route/dose/population mismatch is large, or evidence is sparse, the workflow should stop at UNVALIDATED, SPECULATIVE, CONTESTED, or Low confidence rather than force a stronger answer.

  9. § IX

    Every public sentence must be traceable

    Every sentence in a verdict summary must map to cited sources, specific evidence snippets or extracted fields, and a documented reasoning step in the audit record.

§ 05

Machine-operable artifacts

An AI workflow should emit these artifacts for every reviewed claim.

Artifact | Purpose | Minimum contents
Decomposition packets | Preserves independent framing before reconciliation | Decomposer A output, Decomposer B output, qualifier diffs, reconciliation notes
Claim manifest | Defines what is being tested | Original claim, normalized claim, population, intervention, comparator, outcome, route, time horizon, claim type, scope key
Search plans | Makes query coverage auditable before execution | Plan A, Plan B / contradiction-seeking plan, query families, required source classes, completeness checklist
Search ledger | Makes retrieval auditable | Databases, queries, dates, filters, PRISMA-style counts, missing full texts, citation-chasing log
Screening log | Shows inclusion / exclusion decisions | Candidate source, include/exclude decision, reason, reviewer IDs
Study inventory | Deduplicated source list | PMID / DOI / registry ID, title, design, year, status, duplicate-linking across reports
Evidence cards | Structured study extraction | Population, route, dose, comparator, endpoints, timepoint, effect size, adverse events, exact snippets
Bias cards | Structured quality appraisal | RoB 2 / ROBINS-I domains, integrity flags, funding, registration, protocol deviations
Contradiction map | Prevents one-sided synthesis | Which sources support, contradict, or merely mention each sub-claim, and why
Adversarial review log | Preserves disagreement | Steelman, skeptic, mismatch, omitted-evidence cases, unresolved disputes
Decision log | Shows how the verdict was reached | Status, confidence, decisive studies, unresolved uncertainty, escalation notes
Public claim sentences | Enforces sentence-level provenance | Sentence text, source evidence cards, snippet IDs, allowed strength
Protocol compliance report | Makes completeness machine-checkable | Stage, required artifact, gate result, failure response, sign-off
Publication packet | What ships | Final summary, citations, status, confidence, claim-review JSON-LD, last verified date, linter result

Note

If the workflow cannot produce these artifacts, it has not completed the protocol.

§ 06

Claim unit & normalization

Duplicate claim decomposition

Before any retrieval runs, the workflow produces two independent packets — Claim Decomposition A and Claim Decomposition B — splitting the source material into atomic sub-claims, assigning qualifiers, and stating what evidence would count as direct support or contradiction.

The reconciliation step is deterministic:

  • compare claim boundaries
  • compare population / route / outcome / time qualifiers
  • compare claim-type assignments
  • merge exact duplicates
  • escalate materially different framings instead of averaging them away

If the two decomposers disagree on what the claim is actually asserting, the workflow has not earned the right to search yet.
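
The reconciliation rules above can be sketched as a pure function. This is a minimal illustration, not the production reconciler: the `SubClaim` shape, the `boundaryKey` normalization, and the `reconcile` name are assumptions made for the example, and exact string matching stands in for whatever boundary comparison a real run would use.

```ts
// Hypothetical, trimmed-down sub-claim shape; field names are assumptions.
type SubClaim = {
  normalizedText: string;
  claimType: string;
  population?: string;
  route?: string;
  outcome?: string;
  timeHorizon?: string;
};

type Reconciliation = {
  merged: SubClaim[]; // exact duplicates present in both packets
  escalated: { a?: SubClaim; b?: SubClaim; reason: string }[];
};

const QUALIFIERS = ["population", "route", "outcome", "timeHorizon"] as const;

// Case- and whitespace-insensitive key used to compare claim boundaries.
const boundaryKey = (c: SubClaim): string =>
  c.normalizedText.trim().toLowerCase().replace(/\s+/g, " ");

function reconcile(a: SubClaim[], b: SubClaim[]): Reconciliation {
  const result: Reconciliation = { merged: [], escalated: [] };
  const remainingB = b.slice();

  for (const claimA of a) {
    const idx = remainingB.map(boundaryKey).indexOf(boundaryKey(claimA));
    if (idx === -1) {
      result.escalated.push({ a: claimA, reason: "claim boundary only in A" });
      continue;
    }
    const claimB = remainingB.splice(idx, 1)[0];

    // Compare qualifiers and claim-type assignments field by field.
    const diffs: string[] = QUALIFIERS.filter((q) => claimA[q] !== claimB[q]);
    if (claimA.claimType !== claimB.claimType) diffs.push("claimType");

    if (diffs.length === 0) {
      result.merged.push(claimA); // exact duplicate: merge
    } else {
      // Materially different framing: escalate, never average.
      result.escalated.push({ a: claimA, b: claimB, reason: `differs on: ${diffs.join(", ")}` });
    }
  }
  for (const claimB of remainingB) {
    result.escalated.push({ b: claimB, reason: "claim boundary only in B" });
  }
  return result;
}
```

A non-empty `escalated` list is the signal that search should not start yet.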

Atomic claim extraction

A claim is the smallest assertion that can be tested against evidence. Long sentences should be split until each sub-claim has a single testable burden.

Yes

Good atomic claims

  • “BPC-157 improves tendon healing in rodent Achilles tendon injury models.”
  • “BPC-157 improves tendon healing in humans with chronic tendinopathy.”
  • “BPC-157 acts through FAK-paxillin signaling in tendon fibroblasts.”

No

Bad atomic claims

  • “BPC-157 is the Wolverine peptide.”
  • “BPC-157 heals everything fast and safely.”

Every atomic claim is normalized into this frame:

```ts
type ClaimManifest = {
  originalText: string;
  normalizedText: string;
  claimType:
    | "empirical"
    | "mechanistic"
    | "comparative"
    | "quantitative"
    | "predictive"
    | "definitional";
  population?: string;
  intervention?: string;
  comparator?: string;
  outcome?: string;
  route?: string;
  timeHorizon?: string;
  scopeKey?: string;
  scopeNotes?: string[];
};
```

For internal evidence review, the meaningful unit is usually closer to peptide × outcome × population × route × time horizon. Public pages may collapse presentation for readability, but the protocol retains these qualifiers internally. If a qualifier is implicit or unknown, it is marked as such and certainty is capped.
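
As a rough sketch of how the internal unit could be keyed, the function below joins the five qualifiers into a single string and records which ones are unknown. The key format, the `"unknown"` placeholder, and the `buildScope` name are illustrative assumptions; the protocol does not prescribe a `scopeKey` syntax.

```ts
// Hypothetical subset of the claim manifest; only the scope qualifiers.
type Qualifiers = {
  intervention?: string; // the peptide
  outcome?: string;
  population?: string;
  route?: string;
  timeHorizon?: string;
};

const SCOPE_FIELDS = ["intervention", "outcome", "population", "route", "timeHorizon"] as const;
type ScopeField = (typeof SCOPE_FIELDS)[number];

type Scope = {
  scopeKey: string;
  unknown: ScopeField[];    // qualifiers that were implicit or missing
  certaintyCapped: boolean; // true whenever any qualifier is unknown
};

function buildScope(q: Qualifiers): Scope {
  const unknown: ScopeField[] = [];
  const parts = SCOPE_FIELDS.map((field) => {
    const value = (q[field] ?? "").trim().toLowerCase();
    if (value === "") {
      unknown.push(field); // mark the gap instead of guessing
      return "unknown";
    }
    return value.replace(/\s+/g, "-");
  });
  return { scopeKey: parts.join("|"), unknown, certaintyCapped: unknown.length > 0 };
}
```

For example, a claim with peptide, outcome, and population but no route or time horizon yields a key ending in `unknown|unknown` and `certaintyCapped: true`.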

§ 08

Study extraction

Duplicate extraction

Study characteristics and outcome data are extracted independently by two reviewers. This is especially important for primary outcomes, adverse events, effect sizes, route and dose, population definitions, and follow-up duration.

For AI workflows, “independent” means the second extractor must not see the first extractor’s filled form before submitting its own.

Evidence-card fields

```ts
type EvidenceCard = {
  studyId: string;
  design:
    | "systematic_review"
    | "meta_analysis"
    | "rct"
    | "nonrandomized_human"
    | "case_series"
    | "animal"
    | "in_vitro"
    | "preprint"
    | "registry";
  population: string;
  n?: number;
  route?: string;
  dose?: string;
  comparator?: string;
  endpoint: string;
  endpointType: "clinical" | "surrogate" | "mechanistic" | "safety";
  timepoint?: string;
  resultSummary: string;
  effectSize?: string;
  adverseEvents?: string;
  exactQuotesOrSnippets: string[];
};
```

The extractor must label whether the endpoint is clinical, surrogate, mechanistic, or safety. This prevents mechanistic wins from being mistaken for clinical wins.
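
Reconciling two independent extractions can start with a deterministic field-by-field diff. The sketch below assumes a trimmed card shape and treats the fields named under "Duplicate extraction" (plus `endpointType`) as critical; that mapping, the `diffCards` name, and the plain normalized-string comparison are assumptions for illustration only.

```ts
// Trimmed-down card with only the fields compared here (an assumption).
type CardFields = {
  studyId: string;
  population: string;
  route?: string;
  dose?: string;
  endpoint: string;
  endpointType: "clinical" | "surrogate" | "mechanistic" | "safety";
  timepoint?: string;
  effectSize?: string;
  adverseEvents?: string;
};

const CRITICAL_FIELDS = [
  "population", "route", "dose", "endpoint",
  "endpointType", "timepoint", "effectSize", "adverseEvents",
] as const;

type Discrepancy = { studyId: string; field: string; a?: string; b?: string };

// Returns every critical field on which Extractor A and Extractor B disagree.
function diffCards(a: CardFields, b: CardFields): Discrepancy[] {
  if (a.studyId !== b.studyId) throw new Error("cards describe different studies");
  const norm = (v?: string): string => (v ?? "").trim().toLowerCase();
  const out: Discrepancy[] = [];
  for (const field of CRITICAL_FIELDS) {
    if (norm(a[field]) !== norm(b[field])) {
      out.push({ studyId: a.studyId, field, a: a[field], b: b[field] });
    }
  }
  return out;
}
```

An `endpointType` discrepancy deserves particular attention, since it is the case where one extractor read a result as clinical and the other as surrogate or mechanistic.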

§ 09

Bias & integrity assessment

Study-design ladder

When evidence conflicts, we prefer:

  1. systematic reviews / meta-analyses of high-quality human trials
  2. registered, peer-reviewed human RCTs
  3. non-randomized human comparative studies
  4. case series / uncontrolled pilots
  5. animal in-vivo studies
  6. in-vitro / cell / mechanistic studies
  7. expert opinion without new data
  8. anecdote / testimony
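
One way to make the ladder machine-usable is an ordered list with a rank lookup. The tier identifiers below are invented labels for the eight rungs above, and `preferenceOrder` is a hypothetical helper. The ordering is only a preference when evidence conflicts; it does not replace the bias and integrity appraisal that follows.

```ts
// Hypothetical identifiers for the eight rungs, highest preference first.
const DESIGN_LADDER = [
  "systematic_review_or_meta_analysis",
  "registered_rct",
  "nonrandomized_comparative",
  "case_series_or_uncontrolled_pilot",
  "animal_in_vivo",
  "in_vitro_or_mechanistic",
  "expert_opinion",
  "anecdote",
] as const;

type DesignTier = (typeof DESIGN_LADDER)[number];
type Source = { id: string; tier: DesignTier };

// 1 is the most preferred rung, 8 the least.
const tierRank = (tier: DesignTier): number => DESIGN_LADDER.indexOf(tier) + 1;

// Returns a copy of the sources sorted from most to least preferred design.
function preferenceOrder(sources: Source[]): Source[] {
  return sources.slice().sort((x, y) => tierRank(x.tier) - tierRank(y.tier));
}
```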

Domain-based bias appraisal

For randomized trials we use RoB 2 style domains:

  • randomization process
  • deviations from intended interventions
  • missing outcome data
  • outcome measurement
  • selective reporting

For non-randomized human studies we use ROBINS-I style reasoning:

  • confounding
  • participant selection
  • intervention classification
  • deviations from intended interventions
  • missing data
  • outcome measurement
  • selective reporting

For every included study, we also flag:

  • preregistration or registry record present / absent
  • funding source and sponsor
  • author conflicts
  • replication status
  • retraction status
  • PubPeer or major integrity concerns

Integrity override

Note

A source with serious integrity concerns is not silently averaged into the body of evidence. It is flagged, discussed explicitly, and down-weighted or excluded by rule.

§ 10

Adversarial review architecture

This is the part that makes the protocol AI-ready instead of AI-flavored.

Required roles

Role | Task
Decomposer A | Independently splits the source claim into atomic sub-claims and normalizes them
Decomposer B | Independently repeats decomposition without seeing A
Search planner A | Builds a direct-support query plan for each sub-claim
Search planner B / contradiction-seeking planner | Builds null, contradictory, safety, integrity, and scope-mismatch queries
Retriever / query executor | Runs the reconciled search plan and builds the search ledger
Screening reviewer A | Independently screens candidate records
Screening reviewer B | Independently screens the same records without seeing A
Extractor A | Independently extracts structured evidence cards
Extractor B | Independently extracts the same studies without seeing A
Bias auditor | Scores risk of bias and integrity concerns
Steelman reviewer | Makes the strongest defensible case that the claim is true
Skeptic reviewer | Makes the strongest defensible case that the claim is false or overstated
Mismatch reviewer | Looks for population / route / dose / endpoint / timeframe mismatches
Omitted-evidence reviewer | Searches specifically for null, contradictory, or inconvenient evidence
Arbitrator | Resolves disagreements, assigns status, assigns confidence, writes the decision log
Publication linter | Runs deterministic wording and provenance checks before anything ships

If a single model is used for all roles, contexts are isolated and prior outputs are not revealed until arbitration. Better still, use heterogeneous models, or at least heterogeneous prompting and retrieval contexts.

Required adversarial questions

Before a verdict is finalized, the workflow must answer:

  1. What is the strongest evidence for the claim?
  2. What is the strongest evidence against the claim?
  3. Are the supportive studies actually testing the same population, route, dose, comparator, and endpoint as the claim?
  4. Are supportive results clinical outcomes or only mechanistic / surrogate outcomes?
  5. Are contradictory or null studies being omitted because they are harder to explain?
  6. Are reviews being used to smuggle in unsupported primary-study conclusions?
  7. Does the claim generalize beyond the studied tissue, route, timeframe, or population?
  8. Is the verdict being driven by one lab, one paper, or one uncontrolled pilot?
  9. Is there any integrity signal that should cap certainty regardless of apparent effect?

Overstatement check

Many public claims are not cleanly true or false. They are partly grounded, then overstated. For composite claims, the workflow determines the formal status of each sub-claim and checks whether the public framing overgeneralizes the evidence.

OVERSTATED exists precisely so a composite claim is not collapsed into VALIDATED or FALSIFIED when the fairer answer is “the underlying phenomenon exists in narrow scope, but the public framing outruns the evidence.”

§ 11

Unified workflow

Every claim audit runs through these stages in order.

  1. Stage 01

    Run duplicate claim decomposition

    Input
    source artifact · verbatim claim text · target population (if known)
    Output
    Decomposition A · Decomposition B · reconciled claim manifest · atomic sub-claims
    Gate
    every sub-claim is testable; route / outcome / population explicit; material framing differences reconciled or escalated
  2. Stage 02

    Duplicate and adversarial search planning

    Input
    reconciled claim manifest · atomic sub-claims
    Output
    Search Plan A · Search Plan B / contradiction-seeking plan · reconciled query set
    Gate
    direct-support and contradiction queries for every central sub-claim; coverage checklist logged; required source classes explicit
  3. Stage 03

    Transparent retrieval and screening

    Input
    normalized sub-claims · reconciled search plan
    Output
    search ledger · candidate sources · screening log
    Gate
    search terms saved; contradictory-evidence search explicit; A/B screening reconciled; PRISMA counts reconcile; full-text gaps logged
  4. Stage 04

    Trace provenance recursively

    Input
    screened source set · each claim
    Output
    classification per claim: Direct · Inherited · Speculative
    Gate
    primary source reached or dead end recorded; inherited claims not credited as direct evidence
  5. Stage 05

    Extract evidence in duplicate

    Input
    included studies
    Output
    reconciled evidence cards
    Gate
    discrepancies reconciled or escalated; multiple reports merged; exact snippets back every important extracted claim
  6. Stage 06

    Appraise bias and integrity

    Input
    evidence cards
    Output
    bias cards with design classification, domain-based assessment, integrity check, replication note
    Gate
    the workflow can explain why higher- and lower-quality studies were weighted differently
  7. Stage 07

    Build support and contradiction maps

    Input
    evidence cards · bias cards
    Output
    per sub-claim: supports · contradicts · mentions — plus directness and evidence-type annotations
    Gate
    every sub-claim has a contradiction map, even if empty
  8. Stage 08

    Run adversarial review

    Input
    all prior artifacts
    Output
    steelman, skeptic, mismatch, and omitted-evidence reviews
    Gate
    at least one adversarial pass challenged the favored interpretation; unresolved disagreements logged, not buried
  9. Stage 09

    Adjudicate status and confidence

    Input
    evidence cards · bias cards · maps · adversarial findings
    Output
    sub-claim statuses · overall status · confidence
    Gate
    final summary does not outrun the evidence cards; confidence capped by unresolved gaps, directness problems, or integrity concerns
  10. Stage 10

    Publish with provenance and linting

    Input
    decision log
    Output
    verdict summary · what you can / cannot say · citations · evidence chain · last verified date · structured data · sentence-level provenance · linter report
    Gate
    every public sentence maps to evidence cards and snippets; linter checks pass or are explicitly waived by policy; wording does not exceed allowed strength

§ 12

Protocol compliance matrix

The minimum machine-checkable completion spec for a run.

Stage | Required artifact | Acceptance gate | Failure response
Claim decomposition | DecompositionPacket[], ClaimManifest | Central claims have population, intervention, outcome, route, and time qualifiers where relevant | Re-enter decomposition
Search planning | SearchPlan[] | Direct and contradiction-seeking queries logged; coverage checklist complete | Re-enter search planning
Retrieval and screening | SearchLedger, ScreeningLog | A/B screening reconciled; PRISMA counts reconcile; full-text gaps logged | Re-enter retrieval or screening
Provenance tracing | Provenance trace records | Central claims reach primary source or dead end is logged | Re-enter provenance tracing
Extraction | EvidenceCard[] | Critical fields reconciled; exact snippets captured | Re-enter extraction reconciliation
Bias and integrity | BiasCard[] | Decisive sources receive bias and integrity review | Re-enter bias appraisal
Contradiction mapping | ContradictionMap | Every sub-claim mapped to support, contradiction, or mention | Re-enter contradiction mapping
Adversarial review | AdversarialReviewLog | Steelman, skeptic, mismatch, and omitted-evidence passes complete | Re-enter adversarial review
Arbitration | DecisionLog | Verdict traceable to evidence cards and capped by rule | Re-enter adjudication or escalate
Publication | PublicClaimSentence[], PublicationPacket, linter output | Every sentence has provenance; wording checks pass | Re-enter publication packet only
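
A compliance report along these lines can be evaluated without any model in the loop. In the sketch below, `ComplianceRow`, `gatePassed`, and `evaluateRun` are hypothetical names; the only logic taken from the matrix is that a missing artifact or a failed gate triggers that stage's failure response.

```ts
// One row of the matrix as data; field names are assumptions.
type ComplianceRow = {
  stage: string;
  requiredArtifacts: string[];
  gatePassed: boolean;
  failureResponse: string;
};

type RunVerdict = { complete: boolean; actions: string[] };

// A run is complete only if every stage emitted its artifacts and passed its gate.
function evaluateRun(rows: ComplianceRow[], emittedArtifacts: string[]): RunVerdict {
  const actions: string[] = [];
  for (const row of rows) {
    const missing = row.requiredArtifacts.filter(
      (name) => emittedArtifacts.indexOf(name) === -1
    );
    if (missing.length > 0 || !row.gatePassed) {
      const detail = missing.length > 0 ? ` (missing: ${missing.join(", ")})` : "";
      actions.push(`${row.stage}: ${row.failureResponse}${detail}`);
    }
  }
  return { complete: actions.length === 0, actions };
}
```

The returned `actions` list tells the orchestrator which stages to re-enter, in matrix order.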

§ 13

Publication & provenance controls

Sentence-level provenance is a first-class object, not a formatting afterthought.

```ts
type PublicClaimSentence = {
  sentenceId: string;
  text: string;
  sourceEvidenceCardIds: string[];
  sourceSnippetIds: string[];
  allowedStrength:
    | "directly_supported"
    | "qualified_support"
    | "context_only"
    | "not_allowed";
};
```

The publication packet refuses to ship sentences marked not_allowed or sentences whose wording exceeds the allowed strength of their linked evidence.

Publication linter

Before publication, deterministic wording checks run:

  • no dosing advice
  • no individualized medical advice
  • no unsupported safety reassurance
  • no “proven,” “clinically established,” or “safe” unless policy and evidence status allow it
  • no human efficacy wording from animal-only or in-vitro evidence
  • no review article cited as if it were primary evidence
  • no factual biomedical sentence without citation coverage
  • no sentence stronger than the weakest central sub-claim permits

Any failure blocks publication or requires an explicit, logged override.
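
A few of these checks are simple enough to sketch directly. The example below covers only three of them (the `not_allowed` refusal, citation coverage, and the restricted-term rule); the `lintSentence` name, the `restrictedTermsAllowed` flag, and the regular expressions are assumptions, and the remaining checks need evidence-level context that a per-sentence function does not have.

```ts
type AllowedStrength =
  | "directly_supported"
  | "qualified_support"
  | "context_only"
  | "not_allowed";

type Sentence = {
  sentenceId: string;
  text: string;
  sourceEvidenceCardIds: string[];
  sourceSnippetIds: string[];
  allowedStrength: AllowedStrength;
};

// Terms that need policy and evidence status to allow them (patterns are illustrative).
const RESTRICTED_TERMS: { label: string; pattern: RegExp }[] = [
  { label: "proven", pattern: /\bproven\b/i },
  { label: "clinically established", pattern: /\bclinically established\b/i },
  { label: "safe", pattern: /\bsafe\b/i },
];

// Returns a list of failures; an empty list means these three checks pass.
function lintSentence(s: Sentence, restrictedTermsAllowed = false): string[] {
  const failures: string[] = [];
  if (s.allowedStrength === "not_allowed") {
    failures.push("sentence is marked not_allowed");
  }
  if (s.sourceEvidenceCardIds.length === 0 || s.sourceSnippetIds.length === 0) {
    failures.push("missing citation coverage");
  }
  if (!restrictedTermsAllowed) {
    for (const term of RESTRICTED_TERMS) {
      if (term.pattern.test(s.text)) failures.push(`restricted wording: ${term.label}`);
    }
  }
  return failures;
}
```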

§ 14

Status taxonomy

Eight formal statuses:

Status | Meaning
VALIDATED | Multiple independent, reasonably high-quality sources directly support the claim in the relevant scope.
CONTESTED | Comparable evidence points in different directions and the disagreement is material, not superficial.
UNVALIDATED | The claim has not been tested adequately enough to justify support or falsification.
OVERSTATED | Composite popular framing extends beyond what the underlying evidence supports. Parts may be validated in narrow scope; the framing as stated is not.
FALSIFIED | Better evidence directly contradicts the claim, or central sub-claims are directly false. Requires replicated, positively contradictory evidence.
WITHDRAWN | The supporting source base is retracted or fatally compromised.
DEPENDENT | The claim is only true if one or more upstream premises are true, and those premises remain unsettled.
SPECULATIVE | No traceable source adequately supports the exact proposition.

Decision rules for composite claims

Do not use a naive “worst sub-claim wins” rule. Instead:

  1. identify the central necessary sub-claims
  2. identify the supporting but non-central sub-claims
  3. classify each sub-claim separately
  4. assign the overall status based on the central necessary sub-claims plus any material overstatement
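
The steps above leave the final assignment to the arbitrator, so any code can only illustrate one possible reading. The sketch below assumes three simplifications that are not stated in the protocol: non-central sub-claims never drive the overall status, a framing that overstates while at least one central sub-claim is VALIDATED resolves to OVERSTATED, and any remaining disagreement among central sub-claims is handed back for arbitration rather than resolved mechanically.

```ts
type Status =
  | "VALIDATED" | "CONTESTED" | "UNVALIDATED" | "OVERSTATED"
  | "FALSIFIED" | "WITHDRAWN" | "DEPENDENT" | "SPECULATIVE";

type SubClaimResult = { id: string; central: boolean; status: Status };

// "NEEDS_ARBITRATION" is a hypothetical sentinel, not a formal status.
function compositeStatus(
  subClaims: SubClaimResult[],
  framingOverstates: boolean
): Status | "NEEDS_ARBITRATION" {
  // Steps 1-2: separate central necessary sub-claims from supporting ones.
  const central = subClaims.filter((s) => s.central);
  if (central.length === 0) return "NEEDS_ARBITRATION";

  // Step 4 (assumed reading): material overstatement over a partly valid core.
  const anyValidated = central.some((s) => s.status === "VALIDATED");
  if (framingOverstates && anyValidated) return "OVERSTATED";

  // If all central sub-claims agree and nothing is overstated, use that status.
  const first = central[0].status;
  const unanimous = central.every((s) => s.status === first);
  return unanimous && !framingOverstates ? first : "NEEDS_ARBITRATION";
}
```

Note that a weak non-central sub-claim does not drag the result down, which is the point of rejecting "worst sub-claim wins".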

§ 15

Confidence taxonomy

Confidence is separate from status.

Confidence | When to assign
High | Search broad and explicit; evidence directly on point; disagreement limited; major bias concerns resolved; replication status reasonably clear.
Moderate | Core evidence present, but gaps in directness, search completeness, replication, or bias resolution.
Low | Retrieval gaps, dead ends, sparse studies, serious indirectness, or unresolved integrity problems materially weaken certainty.

Confidence is judged on five axes: search completeness, directness, quality / bias profile, consistency, and integrity / provenance clarity. The final confidence does not exceed the weakest material axis.
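
The "weakest axis" rule amounts to taking a minimum over an ordered scale. The sketch below assumes every axis is material and is scored on the same Low / Moderate / High scale; the `AxisScores` field names and the `confidenceCeiling` helper are illustrative. The result is a ceiling, so the arbitrator may still assign something lower.

```ts
type Confidence = "Low" | "Moderate" | "High";

// Ordered from weakest to strongest.
const CONFIDENCE_ORDER: Confidence[] = ["Low", "Moderate", "High"];

// One score per axis; field names are assumptions.
type AxisScores = {
  searchCompleteness: Confidence;
  directness: Confidence;
  qualityBias: Confidence;
  consistency: Confidence;
  integrityProvenance: Confidence;
};

// Returns the highest confidence the run may claim: the weakest axis score.
function confidenceCeiling(axes: AxisScores): Confidence {
  const scores: Confidence[] = [
    axes.searchCompleteness,
    axes.directness,
    axes.qualityBias,
    axes.consistency,
    axes.integrityProvenance,
  ];
  return scores.reduce((weakest, score) =>
    CONFIDENCE_ORDER.indexOf(score) < CONFIDENCE_ORDER.indexOf(weakest) ? score : weakest
  );
}
```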

Confidence caps and status ceilings

Condition | Max status | Max confidence | Required action
Human efficacy claim supported only by animal or in-vitro evidence | UNVALIDATED or OVERSTATED | Low | State preclinical scope explicitly
Decisive source lacks full text | Depends | Low | Log the gap and avoid strong verdicts
One uncontrolled human case series only | Usually UNVALIDATED | Low or Moderate | Explain directness and bias limits
Comparable direct support and contradiction exist | CONTESTED | Moderate | Map the contradiction explicitly
Serious unresolved integrity concern | WITHDRAWN, UNVALIDATED, or hold | Low | Down-weight, exclude, or escalate
Review-only support while readable primary studies remain unread | Block verdict | None | Read the primary studies first
Route, dose, population, or time qualifiers missing but central | UNVALIDATED | Low | Narrow the claim or rerun retrieval
Preprint-only decisive support | Depends | Low | Mark status as provisional and cap certainty

§ 16

Hard rules & abstention

Hard rules

  • No human efficacy claim may be marked VALIDATED on animal or in-vitro evidence alone.
  • No claim may be marked FALSIFIED merely because support is absent; absence of evidence is usually UNVALIDATED.
  • No review article may be the sole basis for a decisive status if the primary studies are available.
  • No final verdict may ignore contradictory evidence found in search.
  • No study may be counted twice across multiple reports.
  • No final summary may introduce claims absent from the evidence cards.

Abstention triggers

Cap the verdict at UNVALIDATED or Low confidence when:

  • full text is not available for decisive sources
  • the claim depends on route / dose / population details not reported in the literature
  • only uncontrolled pilots exist
  • only one lab drives the literature
  • the protocol / registry record materially conflicts with the paper and cannot be reconciled
  • serious integrity concerns remain unresolved

§ 17

Calibration & regression testing

Treat the protocol itself as something that must be measured. We track at least:

  • decisive-source recall
  • false inclusion rate
  • false exclusion rate
  • critical-field extraction accuracy
  • verdict agreement with expert benchmark sets
  • rate of overconfident wrong verdicts
  • disagreement rates across decomposition, search planning, screening, and extraction

Watch

Near-perfect agreement is not automatically a success condition. If supposedly independent reviewers agree almost all the time, verify the pipeline is not leaking context or collapsing into prompt coupling. Thresholds are calibrated empirically against benchmark claims — not treated as universal constants.
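
A disagreement metric and the leak warning can both be computed from paired decisions. In this sketch the decision labels are arbitrary strings and the suspicion threshold is a required parameter with no default, because the protocol calibrates thresholds empirically; `agreementRate` and `independenceCheck` are hypothetical helper names.

```ts
// Fraction of items on which two supposedly independent reviewers agree.
function agreementRate(reviewerA: string[], reviewerB: string[]): number {
  if (reviewerA.length === 0 || reviewerA.length !== reviewerB.length) {
    throw new Error("decision lists must be non-empty and the same length");
  }
  let same = 0;
  for (let i = 0; i < reviewerA.length; i++) {
    if (reviewerA[i] === reviewerB[i]) same++;
  }
  return same / reviewerA.length;
}

// Flags suspiciously high agreement; the threshold must come from calibration.
function independenceCheck(
  rate: number,
  suspiciousAbove: number
): "ok" | "check_for_context_leak" {
  return rate > suspiciousAbove ? "check_for_context_leak" : "ok";
}
```

The disagreement rate tracked in the list above is simply `1 - agreementRate(...)` for each stage.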

§ 18

Operating modes

Both modes use the same machinery above.

Path 1

Forward · Claim → Evidence

Start with a public claim. Decompose it, retrieve evidence, challenge it adversarially, then publish a claim audit.

Entry
A claim such as 'BPC-157 has wolverine-like effects'
Output
/claims/[slug]

Path 2

Backward · Artifact → Recursive origins

Start with an artifact. Decompose every claim inside it, trace each upstream, and show which parts are direct, inherited, speculative, or overstated.

Entry
A podcast, newsletter, paper, or post
Output
/breakdowns/[slug]

§ 19

Practical implementation notes for LLM workflows

Independence

If you want fair adversarial review, do not let later reviewers inherit earlier reviewers’ conclusions. Shared context creates correlated errors.

Deterministic checks

Use non-LLM validation wherever possible for:

  • DOI / PMID resolution
  • deduplication
  • publication dates
  • registry status
  • retraction status
  • schema validation of outputs
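
As an example of a non-LLM check, identifier normalization and deduplication can be done with string handling alone. The sketch below validates DOI syntax only (actual resolution needs a network lookup, which is not shown), and the `StudyRef` shape, `normalizeDoi`, and `dedupe` are illustrative names. References with no identifier are kept, since they cannot be deduplicated deterministically.

```ts
type StudyRef = { title: string; doi?: string; pmid?: string };

// Strips common prefixes and checks basic DOI syntax; returns null if malformed.
function normalizeDoi(raw: string): string | null {
  const stripped = raw
    .trim()
    .toLowerCase() // DOI names are case-insensitive
    .replace(/^https?:\/\/(dx\.)?doi\.org\//, "")
    .replace(/^doi:\s*/, "");
  return /^10\.\d{4,9}\/\S+$/.test(stripped) ? stripped : null;
}

// Keeps the first reference seen for each normalized DOI or PMID.
function dedupe(refs: StudyRef[]): StudyRef[] {
  const seen: Record<string, boolean> = {};
  const kept: StudyRef[] = [];
  for (const ref of refs) {
    const keys: string[] = [];
    const doi = ref.doi ? normalizeDoi(ref.doi) : null;
    if (doi) keys.push("doi:" + doi);
    const pmid = ref.pmid ? ref.pmid.trim() : "";
    if (/^\d+$/.test(pmid)) keys.push("pmid:" + pmid);

    if (keys.some((k) => seen[k] === true)) continue; // already in the inventory
    for (const k of keys) seen[k] = true;
    kept.push(ref);
  }
  return kept;
}
```

This does not link records that share a study but have disjoint identifiers; that duplicate-linking across reports still needs a separate step.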

Calibration

Prompt the arbitrator to justify why a stronger status was earned rather than defaulting to the weaker one. If that justification is thin, downgrade.

Escalation

Escalate to human review when:

  • the claim is legally or medically high stakes
  • the studies are highly technical and difficult to normalize
  • the adversarial reviewers disagree on central sub-claims
  • integrity concerns are non-trivial

§ 20

Reference anchors

This protocol is grounded in:

  • GRADE certainty-of-evidence logic
  • Cochrane RoB 2 and ROBINS-I style bias assessment
  • PRISMA 2020 reporting discipline
  • Cochrane duplicate independent extraction practice
  • SciFact scientific claim verification
  • FEVER claim verification and FEVER adversarial evaluation
  • post-publication integrity checks via PubPeer, retraction tracking, and trial registries

These anchors matter because they push the workflow toward transparency over vibes, duplication over single-pass confidence, adversarial challenge over confirmation bias, and abstention over overclaiming.

§ 21

Review cadence

The canonical cadence SLAs — per-trigger response times (same-day, 7-day, 30-day, quarterly, annual) — live in the internal grading-and-reassessment protocol. This is the short version; if the two disagree, the internal protocol wins.

  • Re-run a claim audit within 30 days when new peer-reviewed evidence materially affects a central sub-claim.
  • Re-run immediately (same day) when a decisive source is retracted, corrected, or flagged for serious integrity issues.
  • Re-run same day when a major regulatory change alters the real-world status of the molecule or use case.
  • Re-run stale claims at least annually even if no trigger fires, and run a quarterly housekeeping audit across the catalog.
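
The triggers above map naturally onto deadlines. The sketch below is a simplified illustration: the trigger identifiers and `rerunDeadline` are hypothetical, "annually" is approximated as 365 days, and the canonical SLAs remain the ones in the internal protocol.

```ts
type Trigger =
  | "new_peer_reviewed_evidence"
  | "decisive_source_integrity_event" // retraction, correction, or integrity flag
  | "major_regulatory_change"
  | "no_trigger"; // stale-claim backstop

// Days allowed between the triggering event and the re-run.
const RERUN_WITHIN_DAYS: Record<Trigger, number> = {
  new_peer_reviewed_evidence: 30,
  decisive_source_integrity_event: 0, // same day
  major_regulatory_change: 0,         // same day
  no_trigger: 365,                    // approximation of "at least annually"
};

// For "no_trigger", pass the claim's last verified date as eventDate.
function rerunDeadline(trigger: Trigger, eventDate: Date): Date {
  const due = new Date(eventDate.getTime());
  due.setUTCDate(due.getUTCDate() + RERUN_WITHIN_DAYS[trigger]);
  return due;
}
```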

Last updated: 2026-04-20