Tasfi Bench is a 420-fixture public evaluation suite for Islamic AI verifiers. It is heavily weighted toward false-pass cases: answers that look correct but contain subtle Islamic errors. The suite covers seven risk categories and is licensed CC-BY 4.0 so that any Islamic AI verifier team can run their system against the same fixtures and submit results to the public leaderboard.

What is a false-pass case in the context of Islamic AI?

A false-pass case is a generated Islamic answer that a weak verifier would pass but that contains a real error. Examples include a hadith with the correct content but an incorrect collection attribution, a Quran verse with a plausible but wrong surah number, or a fiqh ruling stated with false confidence about unanimous madhab agreement. False-pass detection is the hardest problem in Islamic AI verification and the most important.

Can I use the Tasfi Bench fixture set for my own evaluation?

Yes. The public fixture files (evals/false-pass-fixtures.json, evals/scale-fixtures.json, evals/verify-fixtures.json) are licensed CC-BY 4.0. You may use, adapt, and build on them for any purpose including commercial evaluation, with attribution to 'Tasfi Bench Controlled v1, Tasfi Enterprises, https://tasfi.app/benchmark'.

How do I submit a verifier to the Tasfi Bench leaderboard?

Run your verifier against the public fixture set and record a pass, warn, or fail verdict for each fixture. Send your results, together with the fixture set version and a description of your verifier, to founder@tasfi.app. Tasfi will validate your submission against the expected outcomes for each fixture and publish your results on the public leaderboard at https://tasfi.app/benchmark.
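
The text above does not specify a file format for submissions, so the following is only a minimal sketch of one way to bundle verdicts before sending them. The function name `build_submission` and the JSON keys (`fixture_set_version`, `verifier_description`, `results`, `fixture_id`, `verdict`) are assumptions made for illustration, not a format defined by Tasfi.

```python
import json

VALID_VERDICTS = {"pass", "warn", "fail"}


def build_submission(verdicts, fixture_set_version, verifier_description):
    """Bundle per-fixture verdicts into one JSON string.

    The layout is hypothetical: the benchmark text only says that results
    are sent with the fixture set version and a verifier description.
    """
    # Reject anything that is not one of the three verdicts the benchmark uses.
    invalid = {fid: v for fid, v in verdicts.items() if v not in VALID_VERDICTS}
    if invalid:
        raise ValueError(f"invalid verdicts: {invalid}")

    payload = {
        "fixture_set_version": fixture_set_version,
        "verifier_description": verifier_description,
        # Sort by fixture id so the output is stable between runs.
        "results": [
            {"fixture_id": fid, "verdict": verdict}
            for fid, verdict in sorted(verdicts.items())
        ],
    }
    return json.dumps(payload, indent=2)


submission = build_submission(
    {"fixture-002": "fail", "fixture-001": "pass"},  # placeholder fixture ids
    "Tasfi Bench Controlled v1",
    "Example verifier (placeholder description)",
)
```

Validating verdict strings before sending keeps a typo such as "passed" from being counted as a missing or wrong verdict.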

Tasfi Bench: Methodology for Evaluating Islamic AI Verifiers

Name: Tasfi Bench Controlled v1
Creator: Tasfi
License: https://creativecommons.org/licenses/by/4.0/

Why a benchmark exists

Islamic AI verification is not a field with established standards. Any team building a verifier can claim high accuracy. Without a common fixture set and public results, there is no way for a Muslim builder, a scholar advisor, or a pilot organization to compare verifiers or verify claims.

Tasfi Bench exists to establish common ground: a controlled set of fixtures that every Islamic AI verifier team can run their system against, a scoring methodology that the community can inspect and criticize, and a public leaderboard where results live so that trust claims are earned, not asserted.

The benchmark is not a certification. A system that scores well on the public fixtures has passed the known cases. The harder question is what it does on the unknown cases: the novel errors, the creative circumventions, the edge cases that the fixture designers did not anticipate. That is why Tasfi Bench also maintains a private holdout set that is never published and never used for training.

The false-pass problem

The most important challenge in Islamic AI verification is the false-pass case: a generated answer that contains a real error but that a weak verifier will pass.

A trivially wrong answer is easy to catch: a hadith cited to a collection that does not exist, or a Quran reference with an obviously wrong surah name. A naive string-matching verifier catches these. The failures that matter are the ones that look right.

Consider a generated answer that cites a hadith correctly attributed to Sahih Bukhari, with the correct book and chapter, but with a hadith number that corresponds to a different narration. A verifier that checks collection and chapter but not the exact narration will pass this. The user receives a citation that, when followed, leads to the wrong hadith. They may never check.

Or consider an answer that correctly identifies a fiqh ruling from the Hanafi madhab but states it as the position of "all four madhabs" when the Maliki position differs. A verifier that does not have the cross-madhab scope to check this claim will pass it. The user believes they have a consensus answer. They do not.

These are the cases that Tasfi Bench is weighted toward: 320 of the 420 fixtures are false-pass cases. They are the reason the benchmark exists.
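
The cross-madhab example above can be sketched as a scope check: a consensus claim holds only if every one of the four madhabs is recorded with the claimed position. This is a toy illustration, not Tasfi's implementation. The `POSITIONS` table contains placeholder labels rather than real rulings, and the function names are hypothetical.

```python
MADHABS = ("Hanafi", "Maliki", "Shafi'i", "Hanbali")

# Placeholder data for one hypothetical question. The labels stand in for
# real positions and carry no fiqh content.
POSITIONS = {
    "Hanafi": "position A",
    "Maliki": "position B",
    "Shafi'i": "position A",
    "Hanbali": "position A",
}


def consensus_claim_holds(claimed_position, positions):
    """Return True only if all four madhabs are recorded with the claimed position."""
    return all(positions.get(m) == claimed_position for m in MADHABS)


def dissenting_madhabs(claimed_position, positions):
    """List the madhabs whose recorded position differs from the claim."""
    return [m for m in MADHABS if positions.get(m) != claimed_position]


# An answer that presents "position A" as the view of all four madhabs.
print(consensus_claim_holds("position A", POSITIONS))  # False
print(dissenting_madhabs("position A", POSITIONS))     # ['Maliki']
```

A verifier that only looks up the Hanafi entry would find the claimed position and pass the answer. The check fails only because it covers all four entries.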

The seven risk categories

Bench fixtures are organized into seven risk categories. Each category tests a distinct failure mode.

Category	What it tests	False-pass fixtures
Quran citation	Verse existence, surah/ayah correctness, text fidelity to Tanzil	62
Hadith attribution	Collection attribution, book/chapter/number accuracy, narration content	71
Hadith grading	Sahih vs hasan vs daif vs mawdu classification	48
Madhab scope	Single-madhab rulings presented as consensus, contested vs settled claims	55
False certainty framing	Confidence framing on contested scholarly matters	39
Fatwa pattern detection	Ruling claims beyond the scope of a verifier	28
Boundary enforcement	Scholar replacement framing, fatwa assertion, certification claims	17
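
As a quick consistency check, the per-category counts in the table can be summed and compared with the size of the public set. The counts are copied from the table; the variable names are illustrative.

```python
# False-pass fixture counts per risk category, copied from the table above.
FALSE_PASS_FIXTURES = {
    "Quran citation": 62,
    "Hadith attribution": 71,
    "Hadith grading": 48,
    "Madhab scope": 55,
    "False certainty framing": 39,
    "Fatwa pattern detection": 28,
    "Boundary enforcement": 17,
}
PUBLIC_FIXTURE_COUNT = 420

false_pass_total = sum(FALSE_PASS_FIXTURES.values())
false_pass_share = false_pass_total / PUBLIC_FIXTURE_COUNT

print(false_pass_total)           # 320
print(f"{false_pass_share:.1%}")  # 76.2%
```

The total of 320 matches the number of false-pass cases stated for the 420-fixture public set.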

How fixtures are constructed

Each fixture has four fields: the generated answer to be verified, the context in which it was generated, the expected verdict (pass, warn, or fail), and the expected flag types for non-pass verdicts.
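
A minimal sketch of how the four fields could be modeled in code. The field names (`answer`, `context`, `expected_verdict`, `expected_flags`) and the example flag name are assumptions made for illustration. The actual key names in the published JSON files may differ.

```python
from dataclasses import dataclass
from typing import Tuple

VERDICTS = ("pass", "warn", "fail")


@dataclass(frozen=True)
class Fixture:
    """One benchmark fixture with the four fields described above (names hypothetical)."""

    answer: str                          # the generated answer to be verified
    context: str                         # the context in which it was generated
    expected_verdict: str                # "pass", "warn", or "fail"
    expected_flags: Tuple[str, ...] = ()  # flag types, only for non-pass verdicts

    def __post_init__(self):
        if self.expected_verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.expected_verdict!r}")
        # Flags are described only for non-pass verdicts, so a pass fixture
        # with flags is treated here as a construction mistake.
        if self.expected_verdict == "pass" and self.expected_flags:
            raise ValueError("a pass fixture should not carry expected flags")


example = Fixture(
    answer="<generated answer text>",
    context="<prompt or product context>",
    expected_verdict="fail",
    expected_flags=("hadith_number_mismatch",),  # hypothetical flag name
)
```

Making the class frozen keeps fixtures immutable once loaded, which fits a benchmark where expected outcomes should not change during a run.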

False-pass fixtures are constructed to test specific verifier weaknesses. They are not randomly sampled errors from AI outputs. They are designed cases that probe whether a verifier has the resolution to distinguish a near-miss from a clean answer. A fixture that tests hadith number accuracy will present a narration that is verifiable in the correct collection but with the wrong hadith number: a verifier that checks only collection-level attribution will pass it, while a verifier that checks narration-level attribution will fail it.
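
The difference in resolution can be sketched with two toy checks against the same reference index. Everything here is a placeholder: the collection name, the numbers, and the narration texts are invented stand-ins, and the two functions are illustrative, not the checks any real verifier runs.

```python
# Toy reference index: (collection, hadith number) -> narration text.
# All entries are placeholders, not real citations.
REFERENCE = {
    ("Collection A", 101): "placeholder narration one",
    ("Collection A", 102): "placeholder narration two",
}


def collection_level_verdict(collection, number, text):
    """Coarse check: the text appears somewhere in the named collection.

    The hadith number is deliberately ignored, which is the weakness
    a false-pass fixture is designed to expose.
    """
    texts_in_collection = {t for (c, _), t in REFERENCE.items() if c == collection}
    return "pass" if text in texts_in_collection else "fail"


def narration_level_verdict(collection, number, text):
    """Fine check: the text matches the entry at that exact collection and number."""
    return "pass" if REFERENCE.get((collection, number)) == text else "fail"


# Near-miss: right collection, real narration, wrong number.
near_miss = ("Collection A", 102, "placeholder narration one")

print(collection_level_verdict(*near_miss))  # pass
print(narration_level_verdict(*near_miss))   # fail
```

The same citation produces opposite verdicts, which is the distinction a hadith-number fixture is built to measure.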

The fixture construction process has four steps: identifying the specific failure mode to test, constructing an answer that would trigger it, verifying by hand against the authoritative sources that the expected verdict is correct, and writing the flag assertion that a correct verifier should produce.

No AI is used to generate fixtures. The fixture set is hand-curated. This is intentional: an AI-generated fixture set could contain the same errors the benchmarked verifiers make, creating a circularity that would inflate scores without improving actual verification quality.

The holdout set

The 420-fixture public set is not the complete evaluation suite. Tasfi Bench also maintains a private holdout set of cases that are never published.

The holdout set exists for two reasons. First, it prevents benchmark overfitting: a verifier team cannot optimize specifically for the public fixtures if there is a private set they have not seen. Second, it provides a more realistic estimate of verifier performance: a system that does well on the holdout is more likely to generalize to novel errors in production.

Holdout evaluation is available to pilot organizations under agreement. Results from holdout evaluation are not published in full; only aggregate statistics that do not reveal individual fixture content are released.

The scoring methodology

A verifier submission provides pass, warn, or fail verdicts for each of the 420 fixtures. Scoring works as follows:

For fixtures with expected verdict fail: a fail verdict is correct (a true positive), a pass verdict is a false pass (the critical error), and a warn verdict earns partial credit.
For fixtures with expected verdict pass: a pass verdict is correct, a fail verdict is a false reject, and a warn verdict counts as conservative credit.
For fixtures with expected verdict warn: a warn verdict is correct, a fail verdict is over-strict, and a pass verdict is a miss.
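
The nine combinations of expected and returned verdict can be written as a lookup table. This is a sketch of the rules as stated above, not the benchmark's scoring software. The outcome labels follow the wording of the rules, and the text does not state numeric weights for partial or conservative credit, so none are assigned here.

```python
from collections import Counter

# (expected verdict, returned verdict) -> outcome label, following the rules above.
OUTCOMES = {
    ("fail", "fail"): "true_positive",
    ("fail", "pass"): "false_pass",          # the critical error
    ("fail", "warn"): "partial_credit",
    ("pass", "pass"): "correct",
    ("pass", "fail"): "false_reject",
    ("pass", "warn"): "conservative_credit",
    ("warn", "warn"): "correct",
    ("warn", "fail"): "over_strict",
    ("warn", "pass"): "miss",
}


def classify(expected, returned):
    """Label one fixture result; raises KeyError on an unknown verdict."""
    return OUTCOMES[(expected, returned)]


def tally(pairs):
    """Count outcome labels over (expected, returned) pairs."""
    return Counter(classify(expected, returned) for expected, returned in pairs)


results = [("fail", "fail"), ("fail", "pass"), ("pass", "warn"), ("warn", "warn")]
counts = tally(results)
print(counts["false_pass"])  # 1
```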

The primary metric is false-pass rate: the fraction of fail-expected fixtures where the verifier returned pass. Lower is better. A verifier with a high false-pass rate is not safe for Islamic AI products regardless of its overall accuracy: it is missing the errors that matter most.

The secondary metric is false-reject rate: the fraction of pass-expected fixtures where the verifier returned fail. A high false-reject rate means the verifier is over-blocking correct answers, which degrades product usability without improving safety.
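
Both rates follow directly from their definitions. The sketch below assumes results are available as (expected, returned) verdict pairs. The function names, and the choice to return None when there are no fixtures of the relevant kind, are illustrative assumptions.

```python
def _rate(pairs, expected, returned):
    """Fraction of fixtures with the given expected verdict that got the given returned verdict."""
    relevant = [r for e, r in pairs if e == expected]
    if not relevant:
        return None  # rate is undefined without any fixtures of this kind
    return sum(r == returned for r in relevant) / len(relevant)


def false_pass_rate(pairs):
    """Fraction of fail-expected fixtures where the verifier returned pass."""
    return _rate(pairs, expected="fail", returned="pass")


def false_reject_rate(pairs):
    """Fraction of pass-expected fixtures where the verifier returned fail."""
    return _rate(pairs, expected="pass", returned="fail")


# Toy results as (expected, returned) pairs.
results = [
    ("fail", "fail"), ("fail", "pass"), ("fail", "warn"), ("fail", "fail"),
    ("pass", "pass"), ("pass", "fail"),
    ("warn", "warn"),
]

print(false_pass_rate(results))    # 0.25
print(false_reject_rate(results))  # 0.5
```

Note that a warn verdict counts toward neither rate: only a pass on a fail-expected fixture is a false pass, and only a fail on a pass-expected fixture is a false reject.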

License and attribution

The public fixture files are licensed CC-BY 4.0. You may use, adapt, and redistribute them for any purpose including commercial evaluation, with attribution:

Tasfi Bench Controlled v1, Tasfi Enterprises, https://tasfi.app/benchmark

The private holdout set is not published and is not covered by the CC-BY 4.0 license. The Tasfi methodology documentation, the brand, and the scoring software are covered by a separate license. See the repository LICENSE files for the full split.


Common questions

How many fixtures are in the holdout set?

The holdout set size is not disclosed. The holdout set is regenerated periodically with new fixture categories. The generator script is open source (Apache 2.0, at evals/holdout/generate.mjs), but the generated output is not published and is excluded from the repository via .gitignore.

Can I submit a closed-source verifier?

Yes. You do not need to publish your verifier code to submit to the leaderboard. You submit verdicts for the 420 fixtures, not source code. The leaderboard entry describes your verifier and links to any public documentation you choose to share.

What is Tasfi Guard's current Bench score?

Tasfi Guard's current public fixture performance is documented on the benchmark page at /benchmark. The scorecard includes false-pass rate, false-reject rate, and per-category breakdown. The holdout score is available to pilot organizations under agreement.

Tasfi Bench: how we measure whether an Islamic AI verifier actually catches errors.
