How to Measure Behavior Change After Training (Level 3)

How to measure behavior change after training: define observable behaviors, capture a baseline, follow up at 30/60/90 days, and link each response to one learner record.

Updated June 20, 2026
Use Case
Kirkpatrick Level 3 · Behavior Change

How to measure behavior change after training — the level where most evaluations quit.

Almost everyone collects reaction and quiz scores. Far fewer can answer the question that decides renewal: did the learner actually do the thing differently on the job sixty days later, and can you compare it to their baseline? That gap is not a survey problem; it is a record problem.

Before training
Baseline self-rating: confidence & current practice
Defined behaviors: 4–6 observable actions
Manager sign-off: what "applied" looks like

One learner record, carried to day 60
learner_id: persistent
baseline_practice: day 0
applied_day60: L3
manager_observed: L3

30 / 60 / 90 days after
Behavior delta: vs. the same learner's baseline
Manager + learner narrative: read & themed on arrival
Transfer evidence: cited, per learner, not a guess

~65% of programs never reach Level 3–4
12%: typical response to an unlinked follow-up blast
response when the link ties to the original record
30–90 days: the window where transfer actually shows up

Direct answer

What is Level 3 behavior change, and how do you measure it?

Two definitions written to be quoted — one for the level, one for the method.

Kirkpatrick Level 3 (behavior)

Level 3 measures whether learners apply the training on the job: the transfer from knowing to doing. It sits above reaction (Level 1) and learning (Level 2), and is the first level that nothing collected in the room, whether a smile sheet or an end-of-course quiz, can reach, because it has to be measured weeks later against the same learner's pre-training baseline.

How you measure it

Define four to six observable behaviors before training, capture a baseline, then re-measure at 30/60/90 days through learner and manager observation linked to the same record. The score is the change from baseline; the evidence is the narrative, read and themed on arrival rather than left in an inbox.
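The "change from baseline" score is simple arithmetic once both waves use the same items. A minimal sketch, assuming 1-5 ratings and hypothetical behavior keys; the function name `behavior_delta` and the sample numbers are illustrative, not part of any Sopact API:

```python
def behavior_delta(baseline, followup):
    """Per-behavior change from baseline, plus the mean change across shared items."""
    shared = sorted(set(baseline) & set(followup))
    if not shared:
        raise ValueError("no shared behavior items to compare")
    per_item = {key: followup[key] - baseline[key] for key in shared}
    mean_change = sum(per_item.values()) / len(per_item)
    return per_item, mean_change

# Illustrative ratings for one learner (1 = never, 5 = consistently); assumed scale.
baseline = {"one_on_one": 2, "feedback": 3, "agenda": 1, "goals": 2}
day60 = {"one_on_one": 4, "feedback": 3, "agenda": 3, "goals": 3}

per_item, mean_change = behavior_delta(baseline, day60)
print(per_item)      # {'agenda': 2, 'feedback': 0, 'goals': 1, 'one_on_one': 2}
print(mean_change)   # 1.25
```

Only items present in both waves are compared, which is why identical wording at baseline and follow-up matters: a reworded item has no day-zero counterpart to subtract from.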

L1 · Reaction L2 · Learning L3 · Behavior L4 · Results L5 · ROI

Relevant to: corporate L&D and leadership development · workforce & apprenticeship programs · nonprofit and grant-funded training · compliance and safety training · anyone comparing training providers on long-term transfer.

Why it stalls

Level 3 is where evaluations quit — and it's a record problem, not a willpower problem.

Programs rarely choose to stop at Level 2. They stop because the infrastructure to reach behavior change was never built. Here is the difference between the two postures.

Stalls at Level 2

  • Behavior is asserted from the reaction score — "they felt confident, so they must apply it"
  • No baseline, so even a real day-60 measure has nothing to compare against
  • The follow-up is a bulk email weeks later; 12% answer, none linked to intake
  • Manager observations arrive as free-text emails that can't be aggregated
  • "Did behavior change" becomes a manual reconciliation project that never happens

Reaches Level 3 by default

  • Four to six observable behaviors defined before the cohort starts
  • A baseline captured at intake on the same persistent learner ID
  • Personalized 30/60/90-day links tied to the original record — 3× the response
  • Manager observation collected as a structured rubric, scored automatically
  • The behavior delta and the cited narrative are one query, mid-cohort

The single differentiator is the baseline carried on one record. Without a pre-training reference point on the same learner ID, a 90-day survey measures a mood, not a change. With it, Level 3 stops being a stretch goal and becomes the default output of every cohort — and Level 4 results become reachable for the first time.

The method

Five steps to measure behavior change after training — done continuously.

The method is the same whether you run one cohort or fifty. What changes is whether the steps connect to one record or scatter across four tools. Step one is the one everyone skips.

STEP 1 · THE ONE EVERYONE SKIPS

Define the behaviors before training starts

Name four to six observable actions a learner should do differently on the job — specific enough that a manager could check yes/no. "Runs a structured one-on-one using the framework," not "is a better manager." Define them with the manager during design, not after the cohort ends. If you can't name the behavior, you can't measure its change.
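One way to keep Step 1 honest is to write the behaviors down as data before the cohort starts and run a quick check on them. A minimal sketch; the `Behavior` structure, the vague-word list, and every example behavior except the one-on-one are assumptions made for illustration:

```python
from dataclasses import dataclass

# Hypothetical structure; field names are illustrative, not a Sopact schema.
@dataclass(frozen=True)
class Behavior:
    key: str        # short stable identifier reused at baseline and follow-up
    statement: str  # observable action a manager could mark yes/no

# Words that usually signal a trait rather than an observable action (assumed list).
VAGUE_TERMS = ("better", "more confident", "improved", "stronger")

def validate_behavior_set(behaviors):
    """Return a list of problems; an empty list means the set passes these checks."""
    problems = []
    if not 4 <= len(behaviors) <= 6:
        problems.append(f"expected 4-6 behaviors, got {len(behaviors)}")
    keys = [b.key for b in behaviors]
    if len(set(keys)) != len(keys):
        problems.append("behavior keys must be unique")
    for b in behaviors:
        lowered = b.statement.lower()
        for term in VAGUE_TERMS:
            if term in lowered:
                problems.append(f"{b.key}: vague wording ('{term}')")
    return problems

behaviors = [
    Behavior("one_on_one", "Runs a structured one-on-one using the framework"),
    Behavior("feedback", "Gives written feedback within two days of a review"),
    Behavior("agenda", "Shares a meeting agenda before each team meeting"),
    Behavior("goals", "Records one goal per direct report each quarter"),
]
vague = behaviors[:3] + [Behavior("manager", "Is a better manager")]

print(validate_behavior_set(behaviors))  # []
print(validate_behavior_set(vague))      # ["manager: vague wording ('better')"]
```

A word list is a crude filter. The real test is still whether a manager could mark the behavior yes or no.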

STEP 2

Capture a baseline on day zero

At intake, ask the learner (and ideally the manager) to rate current practice on those same behaviors, on the same persistent learner ID. This is the reference point every later measure compares against — skip it and a 90-day survey measures a mood, not a change.
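The baseline rule can be enforced in the record itself, so a follow-up cannot be stored without a day-zero reference. A sketch under assumptions: the `LearnerRecord` class, its field names, and the 1-5 scale are hypothetical and stand in for whatever your tool uses:

```python
from dataclasses import dataclass, field

@dataclass
class LearnerRecord:
    learner_id: str                                # persistent across every wave
    baseline: dict = field(default_factory=dict)   # behavior key -> rating at day 0
    followups: dict = field(default_factory=dict)  # day (30/60/90) -> {behavior key -> rating}

    def capture_baseline(self, ratings):
        if self.baseline:
            raise ValueError("baseline already captured; day zero is recorded once")
        self.baseline = dict(ratings)

    def add_followup(self, day, ratings):
        if not self.baseline:
            raise ValueError("no baseline: the follow-up has nothing to compare against")
        if set(ratings) != set(self.baseline):
            raise ValueError("follow-up items must match the baseline items")
        self.followups[day] = dict(ratings)

# Illustrative learner and ratings (assumed 1-5 scale).
record = LearnerRecord("2841")
record.capture_baseline({"one_on_one": 2, "feedback": 3})
record.add_followup(60, {"one_on_one": 4, "feedback": 3})
print(record.followups[60])
```

Rejecting mismatched items is a deliberate choice here: it makes a missing or reworded question fail loudly at collection time rather than quietly at analysis time.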

STEP 3

Re-measure at 30, 60, and 90 days

Transfer shows up weeks later, not on the last day of class. Send the same behavior items at follow-up through a personalized link tied to the original record — not a bulk blast — so each response auto-links to the baseline and response rates hold near 35% instead of 12%.
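A personalized link is, at its simplest, a random token that maps back to one learner and one wave. A minimal sketch, assuming a placeholder URL and an in-memory lookup table; `issue_followup_links` and `resolve_response` are hypothetical names, and a real system would persist the index and expire tokens:

```python
import secrets

BASE_URL = "https://example.org/followup"  # placeholder, not a real endpoint

def issue_followup_links(learner_ids, day):
    """Create one unguessable token per learner for this wave.

    Returns (links, index): links maps learner_id -> URL,
    index maps token -> (learner_id, day).
    """
    links, index = {}, {}
    for learner_id in learner_ids:
        token = secrets.token_urlsafe(16)
        index[token] = (learner_id, day)
        links[learner_id] = f"{BASE_URL}?t={token}"
    return links, index

def resolve_response(token, index):
    """Map an incoming response back to (learner_id, day); None if the token is unknown."""
    return index.get(token)

links, index = issue_followup_links(["2841", "2842"], day=60)
token = links["2841"].split("?t=")[1]
print(resolve_response(token, index))  # ('2841', 60)
```

Because the token carries the identity, the respondent never retypes a name or email, and the response lands on the original record without any matching step.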

STEP 4

Add the manager's observation as a rubric

The learner's self-report is one signal; the manager's is the corroboration funders trust. Collect it as a structured rubric on the same four to six behaviors — not a free-text email — so it can be scored and aggregated across the cohort.
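Structured rubrics aggregate with a few lines of code, which is exactly what free-text emails cannot do. A sketch, assuming a 1-5 scale and four hypothetical behavior keys; the function names and sample ratings are illustrative:

```python
from statistics import mean

# Hypothetical behavior keys; use the four to six defined for your program.
BEHAVIORS = ("one_on_one", "feedback", "agenda", "goals")

def score_rubric(ratings):
    """Validate one manager rubric (1-5 per behavior) and return its mean score."""
    missing = [b for b in BEHAVIORS if b not in ratings]
    if missing:
        raise ValueError(f"rubric missing behaviors: {missing}")
    for b in BEHAVIORS:
        if not 1 <= ratings[b] <= 5:
            raise ValueError(f"{b}: rating must be between 1 and 5")
    return mean(ratings[b] for b in BEHAVIORS)

def cohort_means(rubrics):
    """Average each behavior across all managers' rubrics, keyed by behavior."""
    return {b: mean(r[b] for r in rubrics.values()) for b in BEHAVIORS}

# Illustrative manager ratings keyed by learner ID.
rubrics = {
    "2841": {"one_on_one": 4, "feedback": 3, "agenda": 5, "goals": 4},
    "2842": {"one_on_one": 2, "feedback": 3, "agenda": 3, "goals": 2},
}

print(score_rubric(rubrics["2841"]))
print(cohort_means(rubrics))
```

The per-behavior cohort mean shows which behaviors transferred and which did not, a finer signal than one overall score per learner.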

STEP 5

Read the narrative, don't just count the scores

The "why" lives in the open comment. Automated theme coding reads every manager and learner note on arrival and ties each theme to a learner record — so you get evidence like "38 of 120 comments cite applying the skill, e.g. #2841: I ran my first review using the framework," not just a number that moved.
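The shape of that evidence, a count plus a cited example tied to a learner, can be shown with a deliberately simple stand-in. This sketch uses keyword matching, which is not how automated theme coding works in practice; the theme names, phrases, and all comments except the one quoted above are assumptions:

```python
from collections import defaultdict

# Assumed keyword lists; real theme coding is far richer than substring matching.
THEMES = {
    "applied_skill": ("using the framework", "used the framework", "applied"),
    "no_time": ("no time", "too busy"),
}

def code_comments(comments):
    """comments: list of (learner_id, text). Returns theme -> list of (learner_id, text)."""
    coded = defaultdict(list)
    for learner_id, text in comments:
        lowered = text.lower()
        for theme, phrases in THEMES.items():
            if any(phrase in lowered for phrase in phrases):
                coded[theme].append((learner_id, text))
    return coded

def summarize(coded, theme, total):
    """Build a one-line evidence statement with a cited example."""
    hits = coded.get(theme, [])
    if not hits:
        return f"0 of {total} comments cite {theme}"
    learner_id, text = hits[0]
    return f"{len(hits)} of {total} comments cite {theme}, e.g. #{learner_id}: {text}"

comments = [
    ("2841", "I ran my first review using the framework"),
    ("2842", "Too busy this month to try it"),
    ("2843", "Applied the one-on-one structure twice"),
]
coded = code_comments(comments)
print(summarize(coded, "applied_skill", len(comments)))
```

Whatever does the coding, keeping the learner ID attached to every coded comment is what lets a theme be traced back to a specific record.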

From the field

When the record can finally read itself, the hidden pattern surfaces.

Open Play Foundation had run training programs for years. The pre-assessments, attendance, and feedback lived in different systems — so no one could see behavior across a cohort, only what each spreadsheet said in isolation.

"Those statistics that we're now running on Sopact immediately showed me there's something significantly wrong … things like that, we would never have been able to do in the past."

— Marco Botha, CEO, Open Play Foundation

That is Level 3 in practice. A reaction tool tells you learners enjoyed the session. A connected record shows you the cohort whose confidence rose but whose behavior never changed: the signal that used to sit unread across four files, surfaced in time to act on it. The pattern was always in the data; what was missing was one learner record that could read it.

Run this yourself

Draft your Level 3 rubric and follow-up survey in Claude or ChatGPT.

Step 1 of the method — naming the behaviors — is the hardest to do from a blank page. Paste this prompt into any chat assistant, describe your program, and it returns the observable behaviors, the baseline items, and the 30/60/90-day follow-up wording, ready to drop into your tool.

Behavior-change (Level 3) rubric & follow-up builder
Works in Claude, ChatGPT, Gemini, or Copilot
You are a training-evaluation designer. Help me build a Kirkpatrick Level 3 (behavior change) measurement for my program.

My program:
- What it teaches: [e.g. frontline manager fundamentals, 6 modules]
- The job role: [e.g. new team leads in retail stores]
- What "success on the job" looks like: [e.g. runs weekly one-on-ones, gives feedback]
- Follow-up I can realistically send: [e.g. at 30 and 90 days, to learner + their manager]

Produce, in order:

1. FOUR TO SIX OBSERVABLE BEHAVIORS — specific enough that a manager could mark each yes/no or rate 1-5. No vague traits like "more confident." Each behavior must describe an action done on the job.

2. A BASELINE INSTRUMENT — the same behaviors phrased for day-zero self-rating (and a manager version), so I have a pre-training reference point.

3. A 30/60/90-DAY FOLLOW-UP — the identical items re-phrased for "in the last month, how often did you…", plus 2 open-ended questions that capture WHY behavior did or didn't transfer.

4. A MANAGER RUBRIC — the same behaviors as a short structured form the manager fills in, with a 1-5 scale and one evidence prompt each.

5. ONE PARAGRAPH on what I must capture at intake so each follow-up links back to the same learner and the change is measurable.

Keep everything specific to MY role and behaviors. No generic survey filler.

Why this works: it forces the day-zero baseline and identical pre/post items that make a real behavior delta possible. Bring the output to a Sopact walkthrough and we'll wire it to one learner record so the 30/60/90-day responses link automatically.

FAQ

What teams ask about measuring behavior change after training.

Kirkpatrick Level 3 questions — from baselines and follow-up timing to comparing training providers on long-term transfer.

01 · How do I measure behavior change after training?

Define four to six observable behaviors before training, capture a baseline at intake, then re-measure at 30/60/90 days through learner and manager observation linked to the same record. The behavior-change score is the difference from baseline, not an absolute rating. Personalized follow-up links tied to the original learner record return roughly three times the response of a bulk email, and automated theme coding turns open-ended manager notes into evidence without manual work. The whole method depends on a persistent learner ID carried from day zero.

02 · How do I track post-training behavior versus a pre-training baseline?

Use identical behavior items at both points, attached to one persistent learner ID. At intake the learner (and ideally the manager) rates current practice; at follow-up the same items are re-phrased as "in the last month, how often did you…" The change between the two is the Level 3 measure. The common failure is having no baseline — without a day-zero reference point, a 90-day survey measures a mood, not a change. The second failure is unlinked records, which force manual name-matching that rarely finishes.
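The difference between linked and unlinked records shows up as soon as the two waves are joined. A minimal sketch, assuming each row is a dict with a `learner_id` key; the row layout and the function name `link_waves` are hypothetical:

```python
def link_waves(baseline_rows, followup_rows):
    """Join follow-up rows to baseline rows by learner_id.

    Returns (linked, orphans): linked maps learner_id -> (baseline_row, followup_row);
    orphans lists follow-up rows that match no baseline.
    """
    baseline_by_id = {row["learner_id"]: row for row in baseline_rows}
    linked, orphans = {}, []
    for row in followup_rows:
        base = baseline_by_id.get(row.get("learner_id"))
        if base is None:
            orphans.append(row)
        else:
            linked[row["learner_id"]] = (base, row)
    return linked, orphans

# Illustrative rows; the second follow-up came from a bulk email with no ID attached.
baseline_rows = [
    {"learner_id": "2841", "one_on_one": 2},
    {"learner_id": "2842", "one_on_one": 3},
]
followup_rows = [
    {"learner_id": "2841", "one_on_one": 4},
    {"learner_id": None, "name": "J. Smith", "one_on_one": 4},
]

linked, orphans = link_waves(baseline_rows, followup_rows)
print(sorted(linked))  # ['2841']
print(len(orphans))    # 1
```

Every orphan is a response that can only be rescued by manual name-matching, so the orphan count is a direct measure of how much reconciliation work an unlinked follow-up creates.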

03 · When should I send the follow-up survey: 30, 60, or 90 days?

Transfer shows up weeks after training, so measure at more than one point. Thirty days catches early application while the training is fresh; sixty to ninety days shows whether the behavior stuck once the novelty faded. Many programs send at 30 and 90. The key is that each wave uses the same items and links back to the same learner record, so you see a trajectory — rising, plateauing, or fading — rather than a single snapshot that could be a good or bad week.
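With two or more waves on the same record, the trajectory can be labeled mechanically. A sketch only: the 0.25-point tolerance and the rule of comparing the first and last wave are assumptions, not an established threshold:

```python
def classify_trajectory(waves, tolerance=0.25):
    """waves: {day: mean behavior score}. Labels the path from first to last wave.

    The tolerance is an assumed cut-off for 'no meaningful change'.
    """
    days = sorted(waves)
    if len(days) < 2:
        return "single snapshot"
    change = waves[days[-1]] - waves[days[0]]
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "fading"
    return "plateauing"

print(classify_trajectory({30: 3.0, 90: 4.0}))           # rising
print(classify_trajectory({30: 4.0, 60: 4.0, 90: 3.0}))  # fading
print(classify_trajectory({30: 3.5, 90: 3.6}))           # plateauing
print(classify_trajectory({60: 4.0}))                    # single snapshot
```

A single wave returns "single snapshot" on purpose: one reading cannot distinguish a lasting change from a good or bad week.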

04 · How do I measure the success of on-the-job training?

On-the-job training is measured the same way as classroom training at Level 3 — by observable behavior change against a baseline — with the manager as the primary observer. Define what "doing it right" looks like as a short rubric, rate it before and after, and pair the score with a brief evidence note. Because the work happens in the flow of the job, the manager's structured observation carries more weight than a self-report, and collecting it as a rubric rather than an email is what makes it aggregatable across people.

05 · Why do most training programs never reach Level 3?

Because Level 3 needs a baseline, a follow-up wave, and a persistent learner ID — and most stacks have none of the three. Reaction and quiz scores are easy and collected in the room. Behavior change requires connecting a 90-day response to the same learner's intake record, across tools that each assign their own ID. Without that link, the analysis becomes a manual reconciliation project that consumes most of the evaluation time and usually doesn't finish before the next cohort starts. It is an infrastructure gap, not a lack of intent.

06 · How do I compare training providers on long-term behavior change?

Ask each provider to show behavior-change evidence on real cohorts, not satisfaction scores. Specifically: do they define observable behaviors up front, capture a baseline, follow up at 30/60/90 days, and link every response to the same learner? A provider that can only show completion and a smile-sheet average has not measured transfer. The comparison that matters is whether the provider proves the skill was applied on the job weeks later, and can attribute it to specific learners and behaviors.

07 · What's the difference between Level 2 and Level 3?

Level 2 measures learning: did knowledge or skill increase, usually checked with a quiz before and after the training. Level 3 measures behavior: did the learner apply it on the job, measured weeks later. A learner can pass the Level 2 quiz and still change nothing at work; that gap is exactly what Level 3 exists to catch. Level 2 can be completed in the classroom by the last day; Level 3 requires a follow-up wave and a baseline carried on the same record. Most tools handle Level 2 and quietly stop there.

08 · Can AI help measure behavior change, or do I still need a system?

AI helps with one specific task, coding open-ended manager and learner notes into themes, but it can't replace the record that makes Level 3 possible. A generative tool can summarize one export, but it can't maintain a baseline, link a follow-up to the original intake, or hold one learner ID across cohorts. The durable pattern is a persistent data architecture that uses automated theme coding inside it, so the narrative behind a behavior change is read on arrival rather than left unread.

Stop asserting transfer. Start proving it.

Bring your program. We'll wire the Level 3 baseline to one record, live.

Tell us what you train, the behaviors that matter on the job, and when you can follow up. We'll show the connected cohort on Sopact — baseline at intake, 30/60/90-day behavior linked automatically, the narrative read on arrival.