Research proposal

AI Behavior Analysis Benchmark

A working research program to develop an empirical benchmark for behavior-analytic reasoning in AI systems: expert-written items and baselines from human Board Certified Behavior Analysts (BCBAs), informed by publicly available practice standards. Text vignettes come first; data, graphs, and session artifacts come later.

Why now

People already ask AI for help. The benchmark tests where those answers hold up.

Fluent prose is not the same as judgment. The benchmark tests whether a model can identify what was measured, separate function from topography, interpret ABC relations, read the limits of the data, and avoid unsafe intervention advice.

Those results can guide use. Practitioners and supervisors get a map of tasks that need review. Educators see weak points to teach explicitly. Product teams see where generic AI needs domain guardrails.

The same map also points toward the skills worth building in people. When current systems struggle with measurement choices, function-based interpretation, causal limits, or safety boundaries, those are the discriminations new behavior analysts should practice deliberately and supervisors should protect.

Repeated runs can show which skills remain resistant to automation as the tools improve, and which are becoming assisted work.

Figure: a stylized chart titled 'AI Behavior Analysis Benchmark.' A rat climbs ascending bars labelled with behavior-analytic dimensions (responding, reinforcement sensitivity, stimulus control, generalization, durability, fidelity) toward a flag reading 'emergent behavior.' Side notes read 'behavior is the signal' and 'behavior changes, we measure change.'

Origin

Where the question started

Where AI missed

The idea came out of a real error in an earlier essay draft. An AI-generated scene described a contrived mand teaching opportunity, then had the BCBA collect interval data as an approximation of mand frequency. Interval recording scores whether a response occurred within each interval; frequency recording counts every response. The paragraph set the two measurement systems side by side and treated them as equivalent.
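The difference between the two measurement systems can be shown with a small sketch. The timestamps, session length, and interval length below are made up for illustration, and the function names are hypothetical; the only claim is that partial-interval scoring and a frequency count can disagree on the same response stream.

```python
# Hypothetical illustration: all numbers here are invented, not project data.

def event_count(timestamps):
    """Frequency (event) recording: count every response."""
    return len(timestamps)


def partial_interval_score(timestamps, session_seconds, interval_seconds):
    """Partial-interval recording: count the intervals that contain
    at least one response, and return (scored, total_intervals)."""
    n_intervals = session_seconds // interval_seconds
    limit = n_intervals * interval_seconds
    scored = {int(t // interval_seconds) for t in timestamps if 0 <= t < limit}
    return len(scored), n_intervals


# Two 60-second sessions scored in 10-second intervals.
burst = [1, 2, 3, 4, 5, 6]          # six mands, all inside the first interval
spread = [5, 15, 25, 35, 45, 55]    # six mands, one in each interval

for name, ts in [("burst", burst), ("spread", spread)]:
    scored, total = partial_interval_score(ts, 60, 10)
    print(f"{name}: count={event_count(ts)}, intervals scored={scored}/{total}")
```

Both sessions contain six mands, yet the interval record scores one of six intervals for the first and six of six for the second. Under these assumptions the interval data cannot stand in for the count.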

Claude Opus 4.7 (xhigh) repeatedly acknowledged the problem without making the correct discrimination. GPT-5.4 via Codex (default high) made the discrimination only after being explicitly pointed at it. The case is useful because it is easy for a behavior analyst to notice, not obviously solved by internet search, and meaningful in the domains where behavior analysis is practiced.

Where AI caught

Prompt

i need lrffc stimuli to test conditional (2 dimensions) listener responses (learner will be asked to point at targets in the image). i need the prompt for the image generator as well as the list of SDs and correct responses.

GPT-5.5 via OpenAI Codex (xhigh)

• I’ll build this as one clean visual array where each target is only correct when the learner attends to both parts of the SD, with planned distractors sharing one dimension.

That constraint, distractors sharing one dimension with the target, is what makes the probe measure conditional responding rather than single-cue responding. It is also where practitioners often miss: without those distractors, a learner can attend to one dimension and still be correct, and the data won’t show it.
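One way to check an array for this property is sketched below. It assumes two dimensions, color and shape, chosen only for illustration; the `Item` type and `single_cue_loopholes` function are hypothetical names, not part of any tool described here. The check flags each dimension on which attending to that cue alone would already pick out the target.

```python
from typing import NamedTuple


class Item(NamedTuple):
    # Hypothetical two-dimension stimulus; real arrays may use other dimensions.
    color: str
    shape: str


def single_cue_loopholes(target, array):
    """Return the dimensions on which the target is the only item in the
    array with that value, so a single cue is enough to respond correctly."""
    loopholes = []
    for dim in Item._fields:
        matches = [it for it in array if getattr(it, dim) == getattr(target, dim)]
        if matches == [target]:
            loopholes.append(dim)
    return loopholes


target = Item("red", "circle")

# No distractor shares a dimension with the target.
weak_array = [target, Item("blue", "square"), Item("green", "triangle")]

# Each dimension of the target is shared by at least one distractor.
strong_array = [target, Item("red", "square"), Item("blue", "circle"),
                Item("blue", "square")]

print(single_cue_loopholes(target, weak_array))    # ['color', 'shape']
print(single_cue_loopholes(target, strong_array))  # []
```

In the weak array, a learner who attends only to "red" or only to "circle" still points at the target, so a correct response says nothing about conditional control. In the strong array both single-cue strategies leave more than one candidate.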

The gap

What existing evaluations do not tell us

General medical, legal, mathematical, and scientific AI benchmarks measure breadth. They do not tell us whether a model can distinguish behavioral measurement systems, interpret ABC data, reason from function rather than topography, select interventions under ethical and contextual constraints, or notice subtle category errors in clinical prose.

Adjacent fields — medicine, law — have begun building standardized expert-authored evaluations in their domains. Behavior analysis has not.

The contribution

What this work aims to produce

The likely publishable contribution is not "LLMs can pass a BCBA-style exam." It is an empirical benchmark report: a transparent public task blueprint, expert-written original cases, human behavior-analyst baselines, model results, risk-weighted scoring, inter-rater reliability, item analysis, and a lifecycle plan for contamination, refresh, and saturation.
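The proposal does not name a reliability statistic. As one illustration of what an inter-rater reliability figure could look like, the sketch below computes Cohen's kappa for two raters assigning categorical scores to the same items. The choice of kappa, the "pass"/"fail" labels, and the ratings themselves are assumptions made for the example, not project data or a committed method.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items with
    categorical labels: (observed - expected) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty item set.")
    n = len(rater_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Invented ratings of ten model responses by two hypothetical expert raters.
rater_1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
rater_2 = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.58
```

Here the raters agree on 8 of 10 items, but chance agreement from their label frequencies is 0.52, so kappa is well below the raw 0.80 agreement. A rubric with ordered or risk-weighted categories would likely call for a weighted statistic instead.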

Status

Where the work is now

The project is in a planning and item-development stage. Current work is on the research protocol — construct definition, domain mapping against publicly available practice standards, item authoring, expert review, rubric design, and the analysis plan. The paper itself is an empty file on disk.

This is a research proposal — real enough to describe, but underspecified enough that most specifics here are subject to change as items, rubrics, and human data come in.