Lab — Research program
AI-ABA Benchmark
A working research program developing an empirical benchmark for behavior-analytic reasoning in AI systems. Expert-written items, human BCBA baselines, blueprinted against public BACB standards. Text vignettes first; data, graphs, and session artifacts later.
Origin
Where the idea came from
The seed item came out of a real error in an earlier essay draft. An AI-generated scene described a contrived mand teaching opportunity, then had the BCBA collect interval data as an approximation of mand frequency. Interval recording estimates the proportion of observation intervals in which a behavior occurs; frequency is a direct count of responses, and the two are not interchangeable. Two measurement systems were sitting next to each other in a single paragraph, and the text was treating them as equivalent.
Claude Opus 4.7 (xhigh) repeatedly acknowledged the problem without making the correct discrimination. GPT-5.4 via Codex (default high) made the discrimination only after being explicitly pointed at it. The case was useful because it is easy for a behavior analyst to notice, is not obviously solvable by internet search, and is clinically meaningful.
The gap
What existing evaluations do not tell us
General medical, legal, mathematical, and scientific AI benchmarks measure breadth. They do not tell us whether a model can distinguish behavioral measurement systems, interpret ABC data, reason from function rather than topography, select interventions under ethical and contextual constraints, or notice subtle category errors in clinical prose.
Adjacent fields — medicine, law — have begun building standardized expert-authored evaluations. Behavior analysis has not.
The contribution
What this work aims to produce
The likely publishable contribution is not "LLMs can pass a BCBA-style exam." It is an empirical benchmark report: a transparent public task blueprint, expert-written original cases, human BCBA baselines, model results, risk-weighted scoring, inter-rater reliability, item analysis, and a lifecycle plan for contamination, refresh, and saturation.
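Of the pieces in that list, inter-rater reliability is the most mechanical to pin down. A minimal Python sketch of chance-corrected agreement (Cohen's kappa) between two hypothetical BCBA raters labeling the same model responses; the label set and the example data are illustrative, not part of the protocol.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters on categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Illustrative only: two raters flagging the same ten responses as pass / fail / unsafe.
rater_1 = ["pass", "fail", "pass", "unsafe", "pass", "fail", "pass", "pass", "fail", "unsafe"]
rater_2 = ["pass", "fail", "pass", "unsafe", "fail", "fail", "pass", "pass", "fail", "pass"]
print(round(cohens_kappa(rater_1, rater_2), 2))
```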
Design, at a glance
What the benchmark is
- Expert-written text cases first; artifact-rich and multimodal items later.
- Blueprinted against public BACB Test Content Outlines and competency documents.
- Human BCBA baselines collected before any headline model claim.
- Rubric scoring with unsafe-error flags and domain subscores (a scoring sketch follows this list).
- Public sample items separated from a private held-out evaluation set.
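To make the rubric-scoring bullet concrete: a minimal Python sketch of how domain subscores and a risk-weighted headline score might be computed from expert-scored items. The `ScoredItem` schema, the field names, and the zero-out penalty for unsafe errors are assumptions for illustration, not the benchmark's actual scoring rules.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ScoredItem:
    """One expert-scored response to a benchmark item (illustrative schema)."""
    item_id: str
    domain: str           # e.g., "measurement", "assessment", "intervention"
    rubric_points: float  # points the rater awarded against the item rubric
    max_points: float     # maximum points available on the rubric
    unsafe_error: bool    # rater flagged a clinically unsafe recommendation

def domain_subscores(items: list[ScoredItem]) -> dict[str, float]:
    """Fraction of available rubric points earned, per content domain."""
    earned, available = defaultdict(float), defaultdict(float)
    for it in items:
        earned[it.domain] += it.rubric_points
        available[it.domain] += it.max_points
    return {d: earned[d] / available[d] for d in available}

def risk_weighted_score(items: list[ScoredItem], unsafe_penalty: float = 1.0) -> float:
    """Overall score; an unsafe-error flag discounts (here fully zeroes) that item's points."""
    earned = sum(
        it.rubric_points * ((1.0 - unsafe_penalty) if it.unsafe_error else 1.0)
        for it in items
    )
    available = sum(it.max_points for it in items)
    return earned / available

if __name__ == "__main__":
    demo = [
        ScoredItem("item-01", "measurement", 3.0, 4.0, unsafe_error=False),
        ScoredItem("item-02", "measurement", 4.0, 4.0, unsafe_error=True),
        ScoredItem("item-03", "intervention", 2.0, 4.0, unsafe_error=False),
    ]
    print(domain_subscores(demo))      # per-domain proportion of points earned
    print(risk_weighted_score(demo))   # headline score with unsafe items zeroed
```

The design choice the sketch is meant to surface: an unsafe recommendation should cost more than an ordinary rubric miss, so risk weighting is applied at the item level rather than averaged away in a flat percent-correct.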
Constraints
What the benchmark will not do
- No claim that benchmark performance implies credential readiness or clinical competence.
- No single public leaderboard; no "AI beats experts" framing without calibrated baselines.
- Not a question bank — a measurement instrument with validity and reliability obligations.
Status
Where the work is now
The project is in a planning and item-development stage. Current work is on the research protocol — construct definition, domain blueprinting against public BACB standards, item authoring, expert review, rubric design, and the analysis plan. The paper itself is an empty file on disk.
This entry lives under Lab because it is not yet ready to sit as a finished project. It is real enough to describe, and underspecified enough that most specifics here are subject to change as items, rubrics, and human data come in.