Why a benchmark exists

Islamic AI verification has no established standards. Any team building a verifier can claim high accuracy. Without a common fixture set and public results, a Muslim builder, a scholar advisor, or a pilot organization has no way to compare verifiers or check their claims.

Tasfi Bench exists to establish common ground: a controlled set of fixtures that every Islamic AI verifier team can run their system against, a scoring methodology that the community can inspect and criticize, and a public leaderboard where results live, so that trust claims are earned, not asserted.

The benchmark is not a certification. A system that scores well on the public fixtures has passed the known cases. The harder question is what it does on the unknown cases: the novel errors, the creative circumventions, the edge cases that the fixture designers did not anticipate. That is why Tasfi Bench also maintains a private holdout set that is never published and never used for training.

The false-pass problem

The most important challenge in Islamic AI verification is the false-pass case: a generated answer that contains a real error but that a weak verifier will pass.

A trivially wrong answer is easy to catch: a hadith cited to a collection that does not exist, or a Quran reference with an obviously wrong surah name. A naive string-matching verifier catches these. The failures that matter are the ones that look right.

Consider a generated answer that cites a hadith correctly attributed to Sahih Bukhari, with the correct book and chapter, but with a hadith number that corresponds to a different narration. A verifier that checks collection and chapter but not the exact narration will pass this. The user receives a citation that, when followed, leads to the wrong hadith. They may never check.

Or consider an answer that correctly identifies a fiqh ruling from the Hanafi madhab but states it as the position of "all four madhabs" when the Maliki position differs. A verifier that does not have the cross-madhab scope to check this claim will pass it. The user believes they have a consensus answer. They do not.

These are the cases that Tasfi Bench is weighted toward: 320 of the 420 fixtures are false-pass cases. They are the reason the benchmark exists.

The seven risk categories

Bench fixtures are organized into seven risk categories. Each category tests a distinct failure mode.

Category | What it tests | False-pass fixtures
Quran citation | Verse existence, surah/ayah correctness, text fidelity to Tanzil | 62
Hadith attribution | Collection attribution, book/chapter/number accuracy, narration content | 71
Hadith grading | Sahih vs hasan vs daif vs mawdu classification | 48
Madhab scope | Single-madhab rulings presented as consensus, contested vs settled claims | 55
False certainty framing | Confidence framing on contested scholarly matters | 39
Fatwa pattern detection | Ruling claims beyond the scope of a verifier | 28
Boundary enforcement | Scholar replacement framing, fatwa assertion, certification claims | 17
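As a quick arithmetic check, the per-category counts in the table can be summed and compared with the 320-of-420 figure quoted earlier. The variable names below are illustrative and not part of any Tasfi tooling.

```python
# False-pass fixture counts per risk category, copied from the table above.
FALSE_PASS_COUNTS = {
    "Quran citation": 62,
    "Hadith attribution": 71,
    "Hadith grading": 48,
    "Madhab scope": 55,
    "False certainty framing": 39,
    "Fatwa pattern detection": 28,
    "Boundary enforcement": 17,
}
TOTAL_PUBLIC_FIXTURES = 420  # size of the public fixture set

total_false_pass = sum(FALSE_PASS_COUNTS.values())
share = total_false_pass / TOTAL_PUBLIC_FIXTURES

print(total_false_pass)   # 320
print(f"{share:.1%}")     # 76.2%
```

Roughly three quarters of the public set is therefore made up of false-pass cases.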

How fixtures are constructed

Each fixture has four fields: the generated answer to be verified, the context in which it was generated, the expected verdict (pass, warn, or fail), and the expected flag types for non-pass verdicts.
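One way to picture the four fields is as a small record type. This is a minimal sketch only: the actual fixture file format is not described here, so the field names, the `Verdict` enum, and the example flag name are all hypothetical. The rule that pass fixtures carry no flags is inferred from "expected flag types for non-pass verdicts".

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class Fixture:
    answer: str                           # the generated answer to be verified
    context: str                          # the context in which it was generated
    expected_verdict: Verdict             # pass, warn, or fail
    expected_flags: tuple[str, ...] = ()  # flag types, only for non-pass verdicts

    def __post_init__(self):
        # Assumption: a pass-expected fixture has nothing to flag.
        if self.expected_verdict is Verdict.PASS and self.expected_flags:
            raise ValueError("pass fixtures should not carry expected flags")


# Placeholder content; "hadith_number_mismatch" is an invented flag name.
example = Fixture(
    answer="<generated answer citing a hadith with the wrong number>",
    context="<user question and product context>",
    expected_verdict=Verdict.FAIL,
    expected_flags=("hadith_number_mismatch",),
)
print(example.expected_verdict.value)
```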

False-pass fixtures are constructed to test specific verifier weaknesses. They are not randomly sampled errors from AI outputs. They are designed cases that probe whether a verifier has the resolution to distinguish a near-miss from a clean answer. A fixture that tests hadith number accuracy will present a narration that does appear in the cited collection, but under a different hadith number. A verifier that only checks collection-level attribution will pass it; a verifier that checks narration-level attribution will fail it.
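The difference in resolution can be illustrated with a toy lookup. This sketch uses placeholder data and hypothetical function names; it does not represent how any real verifier or hadith database works.

```python
# Toy reference index: (collection, hadith number) -> narration text.
# Placeholder data only; a real verifier would consult authoritative sources.
REFERENCE = {
    ("collection-a", 1): "narration text one",
    ("collection-a", 2): "narration text two",
}


def collection_level_check(collection, number, quoted_text):
    """Weak check: the quoted text appears somewhere in the cited collection.

    The hadith number is deliberately ignored, which is the weakness.
    """
    return any(
        coll == collection and text == quoted_text
        for (coll, _num), text in REFERENCE.items()
    )


def narration_level_check(collection, number, quoted_text):
    """Stronger check: the quoted text is the narration at that exact number."""
    return REFERENCE.get((collection, number)) == quoted_text


# Right collection, real narration, wrong number: the false-pass shape.
citation = ("collection-a", 2, "narration text one")
print(collection_level_check(*citation))  # True  (weak verifier passes it)
print(narration_level_check(*citation))   # False (stronger verifier fails it)
```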

The fixture construction process involves: identifying the specific failure mode to test, constructing an answer that would trigger it, verifying by hand against the authoritative sources that the expected verdict is correct, and writing the flag assertion that a correct verifier should produce.

No AI is used to generate fixtures. The fixture set is hand-curated. This is intentional: an AI-generated fixture set could contain the same errors the benchmarked verifiers make, creating a circularity that would inflate scores without improving actual verification quality.

The holdout set

The 420-fixture public set is not the complete evaluation suite. Tasfi Bench also maintains a private holdout set of cases that are never published.

The holdout set exists for two reasons. First, it prevents benchmark overfitting: a verifier team cannot optimize specifically for the public fixtures if there is a private set they have not seen. Second, it provides a more realistic estimate of verifier performance: a system that does well on the holdout is more likely to generalize to novel errors in production.

Holdout evaluation is available to pilot organizations under agreement. Results from holdout evaluation are not published in full: only aggregate statistics that do not reveal individual fixture content are released.

The scoring methodology

A verifier submission provides pass, warn, or fail verdicts for each of the 420 fixtures. Scoring works as follows:

  • For fixtures with expected verdict fail: a fail verdict is correct (true positive). A pass verdict is a false pass (the critical error). A warn verdict earns partial credit.
  • For fixtures with expected verdict pass: a pass verdict is correct. A fail verdict is a false reject. A warn verdict earns conservative credit.
  • For fixtures with expected verdict warn: a warn verdict is correct. A fail verdict is over-strict. A pass verdict is a miss.
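The nine expected/actual combinations above can be written as a lookup table. This is a sketch: the outcome labels are hypothetical names chosen for illustration, and no numeric credit weights are assumed because none are stated.

```python
# (expected verdict, submitted verdict) -> outcome label.
# Labels are illustrative; the scoring software may name them differently.
OUTCOMES = {
    ("fail", "fail"): "true_positive",
    ("fail", "pass"): "false_pass",       # the critical error
    ("fail", "warn"): "partial_credit",
    ("pass", "pass"): "correct",
    ("pass", "fail"): "false_reject",
    ("pass", "warn"): "conservative_credit",
    ("warn", "warn"): "correct",
    ("warn", "fail"): "over_strict",
    ("warn", "pass"): "miss",
}


def classify(expected: str, actual: str) -> str:
    """Return the outcome label for one fixture; raises KeyError on bad input."""
    return OUTCOMES[(expected, actual)]


print(classify("fail", "pass"))  # false_pass
print(classify("warn", "fail"))  # over_strict
```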

The primary metric is false-pass rate: the fraction of fail-expected fixtures where the verifier returned pass. Lower is better. A verifier with a high false-pass rate is not safe for Islamic AI products regardless of its overall accuracy: it is missing the errors that matter most.

The secondary metric is false-reject rate: the fraction of pass-expected fixtures where the verifier returned fail. A high false-reject rate means the verifier is over-blocking correct answers, which degrades product usability without improving safety.
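Both metrics are simple conditional fractions. The sketch below computes them from a list of (expected, actual) verdict pairs; the function names and the input shape are assumptions, not the scoring software's interface.

```python
def rate(results, expected, actual):
    """Fraction of fixtures with the given expected verdict that got `actual`.

    Returns None when no fixture has that expected verdict.
    """
    relevant = [got for want, got in results if want == expected]
    if not relevant:
        return None
    return sum(got == actual for got in relevant) / len(relevant)


def false_pass_rate(results):
    # Fail-expected fixtures where the verifier returned pass.
    return rate(results, expected="fail", actual="pass")


def false_reject_rate(results):
    # Pass-expected fixtures where the verifier returned fail.
    return rate(results, expected="pass", actual="fail")


# Tiny invented example: four fail-expected and two pass-expected fixtures.
results = [
    ("fail", "fail"), ("fail", "pass"), ("fail", "warn"), ("fail", "fail"),
    ("pass", "pass"), ("pass", "fail"),
]
print(false_pass_rate(results))    # 0.25
print(false_reject_rate(results))  # 0.5
```

Note that a warn verdict counts toward neither rate in this sketch, since each metric is defined only in terms of pass and fail verdicts.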

License and attribution

The public fixture files are licensed CC-BY 4.0. You may use, adapt, and redistribute them for any purpose including commercial evaluation, with attribution:

Tasfi Bench Controlled v1, Tasfi Enterprises, https://tasfi.app/benchmark

The private holdout set is not published and is not covered by the CC-BY 4.0 license. The Tasfi methodology documentation, the brand, and the scoring software are covered by a separate license. See the repository LICENSE files for the full split.


Common questions

How many fixtures are in the holdout set?

The holdout set size is not disclosed. The holdout set is regenerated periodically with new fixture categories. The generator script is open-source (Apache 2.0, at evals/holdout/generate.mjs), but the generated output is not published and is excluded from the repository via .gitignore.

Can I submit a closed-source verifier?

Yes. You do not need to publish your verifier code to submit to the leaderboard. You submit verdicts for the 420 fixtures, not source code. The leaderboard entry describes your verifier and links to any public documentation you choose to share.
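Before submitting, a team might sanity-check that its verdict file covers every fixture with a valid verdict. The sketch below assumes a submission is a mapping from fixture ID to verdict string; the actual submission format, the fixture ID scheme, and the function name are all hypothetical.

```python
ALLOWED_VERDICTS = {"pass", "warn", "fail"}


def validate_submission(fixture_ids, submission):
    """Return a list of problems; an empty list means the submission is complete."""
    problems = []
    for fid in fixture_ids:
        if fid not in submission:
            problems.append(f"missing verdict for {fid}")
        elif submission[fid] not in ALLOWED_VERDICTS:
            problems.append(f"invalid verdict {submission[fid]!r} for {fid}")
    # Verdicts for IDs that are not in the fixture set are also reported.
    extra = set(submission) - set(fixture_ids)
    problems.extend(f"unknown fixture id {fid}" for fid in sorted(extra))
    return problems


# Invented IDs for illustration; the real set has 420 fixtures.
fixture_ids = ["fx-001", "fx-002", "fx-003"]
submission = {"fx-001": "pass", "fx-002": "maybe", "fx-004": "fail"}
for problem in validate_submission(fixture_ids, submission):
    print(problem)
```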

What is Tasfi Guard's current Bench score?

Tasfi Guard's current public fixture performance is documented on the benchmark page at /benchmark. The scorecard includes false-pass rate, false-reject rate, and per-category breakdown. The holdout score is available to pilot organizations under agreement.