Research proposal
AI Behavior Analysis Benchmark
A working research program developing an empirical benchmark for behavior-analytic reasoning in AI systems. Expert-written items and human BCBA baselines, informed by publicly available practice standards. Text vignettes first; data, graphs, and session artifacts later.
Why now
People already ask AI for help. The benchmark tests where those answers hold up.
Fluent prose is not the same as judgment. The benchmark tests whether a model can identify what was measured, separate function from topography, interpret ABC relations, read the limits of the data, and avoid unsafe intervention advice.
Those results can guide use. Practitioners and supervisors get a map of tasks that need review. Educators see weak points to teach explicitly. Product teams see where generic AI needs domain guardrails.
The same map also points toward the skills worth building in people. When current systems struggle with measurement choices, function-based interpretation, causal limits, or safety boundaries, those are the discriminations new behavior analysts should practice deliberately and supervisors should protect.
Repeated runs can show which skills remain resistant to automation as the tools improve, and which are becoming assisted work.
Origin
Where the question started
Where AI missed
The idea came out of a real error in an earlier essay draft. An AI-generated scene described a contrived mand teaching opportunity, then had the BCBA collect interval data as an approximation of mand frequency. Two measurement systems were sitting next to each other in a single paragraph, and the text was treating them as equivalent.
Claude Opus 4.7 (xhigh) repeatedly acknowledged the problem without making the correct discrimination. GPT-5.4 via Codex (default high) made the discrimination only after being explicitly pointed at it. The case was useful because it is easy for a behavior analyst to notice, not obviously solved by internet search, and meaningful in the domains where behavior analysis is practiced.
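To make the measurement mismatch concrete, here is a minimal sketch of one way interval data and a frequency count can diverge. It assumes partial-interval recording, a 60-second session, and 10-second intervals; the timestamps and function names are hypothetical illustrations, not data or code from the essay draft.

```python
from typing import List


def event_count(event_times: List[float]) -> int:
    """Frequency: the number of responses observed in the session."""
    return len(event_times)


def partial_interval_score(
    event_times: List[float], session_length: float, interval_length: float
) -> float:
    """Partial-interval recording: fraction of intervals with at least one response."""
    n_intervals = int(session_length // interval_length)
    scored = [False] * n_intervals
    for t in event_times:
        idx = int(t // interval_length)
        if 0 <= idx < n_intervals:
            scored[idx] = True  # further responses in the same interval add nothing
    return sum(scored) / n_intervals


# Hypothetical mand timestamps in seconds (assumed for illustration only).
session_a = [1.0, 2.0, 3.0, 4.0, 5.0]  # five mands clustered in the first interval
session_b = [5.0, 15.0, 25.0]          # three mands spread over three intervals

for name, times in (("A", session_a), ("B", session_b)):
    print(name, event_count(times), partial_interval_score(times, 60.0, 10.0))
```

In this sketch session A has more mands than session B (5 versus 3) but fewer scored intervals (1 of 6 versus 3 of 6), so the two measures rank the sessions in opposite order. That is the kind of discrimination the draft skipped when it treated them as interchangeable.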
Where AI caught
Prompt
i need lrffc stimuli to test conditional (2 dimensions) listener responses (learner will be asked to point at targets in the image). i need the prompt for the image generator as well as the list of SDs and correct responses.
GPT-5.5 via OpenAI Codex (xhigh)
• I’ll build this as one clean visual array where each target is only correct when the learner attends to both parts of the SD, with planned distractors sharing one dimension.
That constraint, distractors sharing one dimension with the target, is what makes the probe measure conditional responding rather than single-cue responding. It is also a point practitioners often miss: without such distractors, a learner can attend to one dimension and still be correct, and the data won’t show it.
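The distractor constraint can be checked mechanically before a probe is run. The sketch below assumes two stimulus dimensions named `color` and `item` and an array with no duplicate stimuli; those names, the sample arrays, and the helper `single_cue_gaps` are hypothetical and do not come from the model output quoted above.

```python
from typing import List, NamedTuple


class Stimulus(NamedTuple):
    color: str  # assumed first dimension
    item: str   # assumed second dimension


def single_cue_gaps(array: List[Stimulus], target: Stimulus) -> List[str]:
    """Return the dimensions a learner could attend to alone and still select the target.

    A dimension is a gap when no distractor shares the target's value on it.
    """
    distractors = [s for s in array if s != target]
    gaps = []
    if not any(s.color == target.color for s in distractors):
        gaps.append("color")
    if not any(s.item == target.item for s in distractors):
        gaps.append("item")
    return gaps


target = Stimulus("red", "cup")

# Every dimension of the target is shared by some distractor.
full_array = [
    target,
    Stimulus("red", "ball"),
    Stimulus("blue", "cup"),
    Stimulus("blue", "ball"),
]

# No other cup is present, so attending to "cup" alone is enough.
weak_array = [target, Stimulus("red", "ball"), Stimulus("green", "shoe")]

print(single_cue_gaps(full_array, target))
print(single_cue_gaps(weak_array, target))
```

An empty result means the array forces attention to both parts of the SD for that target. A non-empty result names the dimension on which a correct point would be uninformative about conditional responding.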
The gap
What existing evaluations do not tell us
General medical, legal, mathematical, and scientific AI benchmarks measure breadth. They do not tell us whether a model can distinguish behavioral measurement systems, interpret ABC data, reason from function rather than topography, select interventions under ethical and contextual constraints, or notice subtle category errors in clinical prose.
Adjacent fields — medicine, law — have begun building standardized expert-authored evaluations in their domains. Behavior analysis has not.
The contribution
What this work aims to produce
The likely publishable contribution is not "LLMs can pass a BCBA-style exam." It is an empirical benchmark report: a transparent public task blueprint, expert-written original cases, human behavior-analyst baselines, model results, risk-weighted scoring, inter-rater reliability, item analysis, and a lifecycle plan for contamination, refresh, and saturation.
Status
Where the work is now
The project is in a planning and item-development stage. Current work is on the research protocol — construct definition, domain mapping against publicly available practice standards, item authoring, expert review, rubric design, and the analysis plan. The paper itself is an empty file on disk.
This is a research proposal — real enough to describe, but underspecified enough that most specifics here are subject to change as items, rubrics, and human data come in.