The Kirkpatrick Model of Training Evaluation: Four Levels, Actually Measured
Last updated: April 2026
The funder email arrives on Tuesday. Not about satisfaction scores — those came back at 4.3 out of 5. The funder is asking whether participants applied the skills on the job, whether behavior changed at 90 days, whether the cohort produced business outcomes worth renewing. The question is pure Kirkpatrick Level 3 and Level 4, which is exactly where the answer gets stuck. Level 1 data lives in the post-training survey tool. Level 2 lives in the LMS. Level 3 data was never collected because the 90-day follow-up was "planned for next cycle." Level 4 data lives in the HRIS with no connection back to training. This is The Cascade Break — the structural failure where Donald Kirkpatrick's four-level cascade of Reaction → Learning → Behavior → Results is measured as four disconnected events because no persistent learner identity links the levels across the tools that produce them.
The Kirkpatrick Model is not broken. The model has been the global standard for training evaluation since 1959 and remains the framework every serious funder, accreditor, and L&D leader references. What breaks is the data architecture beneath it. The four levels are explicitly designed as a cascade — each level's evidence depends on the previous level's measurement. Break the cascade anywhere, and Levels 3 and 4 become narrative claims rather than measurements. This guide covers all four levels in practical terms, explains why 65% of programs stall at Level 2, and shows what the architecture looks like when the cascade actually holds.
The Cascade Break
Kirkpatrick's four levels are a cascade — until the data architecture snaps it
The Cascade Break is the structural failure where Kirkpatrick's Reaction → Learning → Behavior → Results cascade gets measured as four disconnected events because no persistent learner identity links the levels across the tools that produce them. The model isn't broken. The infrastructure is.
Level 1 · Reaction (90% measure) → Level 2 · Learning (83% measure) → Level 3 · Behavior (35% measure consistently) → Level 4 · Results (12% measure consistently)
01 · Name Level 4 first
Reverse design: the organizational outcome defines everything upstream, not the other way around.
02 · Assign the ID at enrollment
Persistent learner identity carries through every instrument — no email matching, no CSV merging.
03 · Measure the cascade
Pre, post, 90-day follow-up, and outcome all share the spine. The cascade stays intact.
04 · Ship the L1–L4 report
All four levels rendered from one spine. Funders see what the model promises, not what got reconciled.
The Kirkpatrick Model is a four-level framework for evaluating training programs that measures participant Reaction (Level 1), Learning (Level 2), Behavior change on the job (Level 3), and organizational Results (Level 4). Developed by Donald Kirkpatrick in 1959 and refined by James and Wendy Kirkpatrick in the New World Kirkpatrick Model, it remains the most widely used training evaluation framework in workforce development, corporate learning, healthcare training, military instruction, and leadership development. The model's power is not its complexity — the four levels are conceptually simple — but its causal logic: each level's outcome depends on the previous level's success.
The Kirkpatrick Model works when the four levels are measured against the same participants over time. It fails when Level 1 satisfaction averages get calculated on Monday, Level 2 pre-post deltas get calculated three weeks later on a different participant list, and Level 3 follow-up surveys go out to whoever opens a bulk email 90 days after program end. Averages across different participant populations are not a cascade — they are four independent snapshots with no causal link between them.
Sopact Sense treats Kirkpatrick implementation as an architecture problem first and an instrument design problem second. Every learner receives a persistent unique ID at enrollment that carries through every Level 1 post-survey, Level 2 rubric, Level 3 follow-up, and Level 4 outcome record — so the four-level cascade is a default output, not a heroic reconciliation effort.
Kirkpatrick Training Evaluation
Step 2: The Four Levels of the Kirkpatrick Model — Reaction, Learning, Behavior, Results
The four levels of the Kirkpatrick Model are Reaction, Learning, Behavior, and Results — measured in that order and answering progressively higher-stakes questions about training effectiveness. Each level requires specific instruments, specific timing, and a persistent participant identifier to link the level back to the learner who generated the previous level's data.
Four Levels · Reaction → Learning → Behavior → Results
The four levels of the Kirkpatrick Model, in practical terms
Each level answers a different stakeholder question and requires different instruments, timing, and a persistent participant ID to link back to the previous level. The cascade holds only when all four share a spine.
01 · Reaction
"Did participants find the training engaging and relevant?"
Participant satisfaction and perceived relevance of the training experience. The easiest level to measure and the least correlated with downstream outcomes — but it determines whether the next cohort enrolls.
Instruments: post-training survey, Net Promoter Score, smile sheet. Timing: within 24 hours of program end. Response rate: 85–95% when delivered on-site.
02 · Learning
"Did participants acquire the intended knowledge and skills?"
Paired pre-training and post-training assessment using identical items and identical rubrics, with the delta calculated per individual participant. Group averages without pre-post pairing are not Level 2 — they are separate snapshots.
Cascade Break: most programs stop here — the ID that links Level 2 data rarely carries into Level 3 follow-up. Levels 3 and 4 become narrative claims rather than measurements.
03 · Behavior
"Are participants applying the skills on the job?"
On-the-job application of the trained behaviors, measured through learner self-report and ideally manager or mentor observation against two to four specifically named behavioral criteria defined before the program begins.
04 · Results
"Did the training produce targeted organizational outcomes?"
Measurable organizational outcomes — productivity, retention, safety incidents, revenue, promotion velocity, customer satisfaction — linked to training records through the same persistent participant ID. Phillips ROI adds a financial translation layer (Level 5).
The cascade only holds on one spine. Sopact Sense assigns a persistent participant ID at enrollment and carries it through every Level 1–4 instrument automatically — so all four levels render from one record, not four reconciliations.
Level 1 — Reaction measures whether participants found the training engaging, relevant, and well-delivered. It is typically captured through a post-training satisfaction survey within 24 hours of program end. Response rates are high (often above 85%) and the data collection is simple. Level 1 tells you almost nothing about whether the program worked, but it tells you whether participants will recommend the program to peers — which affects enrollment for the next cohort.
Level 2 — Learning measures whether participants acquired the intended knowledge, skills, or attitudes. Level 2 requires a paired pre-training and post-training assessment using identical items scored against identical rubrics, with the delta calculated per individual participant — not as an average across two separate groups. Without a persistent learner ID linking pre to post, the Level 2 delta is mathematically impossible to compute for any specific person.
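The ID requirement is easy to see in code. The sketch below is illustrative, not a Sopact schema: the column names, IDs, and scores are invented. The point is structural — a per-participant delta only exists after an ID-keyed join of pre and post records.

```python
import pandas as pd

# Hypothetical Level 2 records keyed by a persistent learner ID assigned
# at enrollment (all names and numbers here are illustrative).
pre = pd.DataFrame({
    "learner_id": ["L001", "L002", "L003"],
    "score_pre":  [42, 55, 61],
})
post = pd.DataFrame({
    "learner_id": ["L001", "L002", "L003"],
    "score_post": [68, 60, 85],
})

# The per-participant delta is only computable after joining on the ID.
paired = pre.merge(post, on="learner_id", how="inner")
paired["delta"] = paired["score_post"] - paired["score_pre"]

# Unpaired group averages hide individual variation: here one learner
# gained 26 points while another gained only 5.
print(paired[["learner_id", "delta"]])
print("Mean individual gain:", round(paired["delta"].mean(), 1))
```

Without the shared `learner_id`, the `merge` has nothing to join on — which is exactly the state most programs are in when pre scores live in an LMS export and post scores live in a survey tool.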
Level 3 — Behavior measures whether participants applied the skills on the job, typically at 30, 60, or 90 days post-program. Level 3 is where 65% of training evaluations stall because the ID required to link the 90-day follow-up to the participant's original pre-training baseline lives in a tool that does not talk to the follow-up delivery tool. Level 3 is also where funders base renewal decisions — so the gap between what the model promises and what most programs actually deliver is expensive.
Level 4 — Results measures whether the training produced targeted organizational outcomes — productivity, retention, safety incidents, revenue, promotion velocity, customer satisfaction. Level 4 requires linking training records to business outcome data over 6 to 12 months post-program. Only around 12% of organizations consistently measure Level 4, and most of those measurements are narrative case studies rather than statistical attribution.
Step 3: Kirkpatrick Level 2 Evaluation Methods
Kirkpatrick Level 2 evaluation methods measure knowledge, skill, and attitude change through paired pre-training and post-training assessment against identical rubrics. The most common Level 2 methods are multiple-choice knowledge tests, performance observation scored against a competency rubric, self-reported confidence ratings on named skills, structured case-study analysis, and role-play evaluation against behavioral criteria. The method matters less than the structural requirement: pre and post must share identical items, identical scoring criteria, and — most critically — identical participant identifiers so the delta can be calculated per person.
Level 2 methods fail when pre-training scores live in an LMS export and post-training scores live in a separate survey tool with no shared ID. The typical retrofit is matching responses by email address, which breaks when participants use work email for pre and personal email for post, or when participants change names mid-program. Sopact Sense assigns the persistent ID at enrollment and inherits it into every Level 2 instrument automatically — so every paired pre-post delta is computable without reconciliation.
Step 4: Kirkpatrick Level 3 Evaluation — Measuring Behavior Change on the Job
Kirkpatrick Level 3 evaluation measures whether participants are applying the trained behaviors in their actual work environment, typically through a combination of learner self-report, manager or mentor observation, and direct performance data against named behavioral criteria. Level 3 is the level that matters most to funders and senior sponsors because it is the first level that demonstrates the training changed something beyond the classroom. It is also the level where most programs quietly fail.
The Level 3 measurement sequence has four components. First, define two to four specific observable behaviors the training is intended to produce — written as actions ("conducts structured 1:1s with direct reports weekly") not qualities ("is a better leader"). Second, establish a baseline self-report score against those exact behaviors at enrollment. Third, deliver matched post-program observation at 30, 60, or 90 days — from the learner, and ideally from a manager or mentor against the same behaviors. Fourth, generate the delta disaggregated by cohort, role, and context — paired with open-ended reflection on what enabled or blocked application.
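The four-component sequence can be sketched in a few lines. Everything here is an assumption for illustration — the behavior names, the 1–5 scale, and the one-point dissent threshold are placeholders, not a prescribed rubric. The structural point is that baseline, 90-day self-report, and mentor observation all sit on one ID, so deltas and learner–mentor disagreement fall out of the data directly.

```python
import pandas as pd

# Illustrative Level 3 records: baseline self-report at enrollment, then
# matched 90-day scores from the learner and a mentor, all on one ID.
records = pd.DataFrame({
    "learner_id":    ["L001", "L001", "L002", "L002"],
    "behavior":      ["structured_1on1s", "feedback_cadence"] * 2,
    "baseline_self": [2, 1, 3, 2],   # 1-5 scale, assumed
    "day90_self":    [4, 3, 4, 2],
    "day90_mentor":  [4, 1, 3, 2],
})

records["self_delta"] = records["day90_self"] - records["baseline_self"]
# Flag where learner and mentor disagree by more than one scale point.
# Dissent is a signal to read the open-ended reflections, not an error.
records["dissent"] = (records["day90_self"] - records["day90_mentor"]).abs() > 1
print(records[["learner_id", "behavior", "self_delta", "dissent"]])
```

In this invented cohort, L001 reports a large gain on feedback cadence that the mentor does not observe — exactly the kind of triangulated signal the fourth component pairs with qualitative reflection.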
See the workforce cohort example below for what Level 2 and Level 3 look like when pre, post, and mentor observation share the same persistent learner ID. The correlation example shows the deeper pattern: which Level 2 knowledge gains actually predicted Level 3 behavior change, and which did not.
02 · Correlation Analysis · Cross-Dimensional
Test Scores vs. Confidence — Qual + Quant Linked
Whether high test scores actually predict high confidence — or whether they're structurally independent
The scenario
"We want to know whether high test scores actually predict high confidence in our cohort — or whether they're independent. Our survey tool keeps these as separate exports. I need a single analysis that links the quantitative test score to the qualitative confidence measure and shows the relationship, or absence of one, clearly."
Quant axis: test scores — six rubric dimensions, 1–10 scale. Qual axis: confidence signals — extracted from open reflections. Both axes bound at collection.
Sopact Sense produced:
- Cross-dimensional correlation between quant test scores and AI-extracted confidence scores
- Visual correlation map — participant-level scatter across both dimensions
- Cluster analysis — high test/high confidence, high test/low confidence, and outlier patterns
- Plain-language interpretation of what the correlation means for program design
Why traditional approaches fail:
- Qualtrics: test scores in one export, open reflections in another — the statistician builds the join
- Consultant: a month of analyst time to score confidence from open-ends and merge with quant
- SPSS / R: expert-level statistical work before any visualization can begin
- ChatGPT: can attempt correlation, but output is non-deterministic — different clusters every run
The agentic difference
Confidence was never a separate variable to calculate — Sopact Sense's Intelligent Cell extracts the confidence score from every reflection as data arrives and stores it in a structured column alongside the quant score. The correlation isn't computed after analysis; it's visible from the moment the last response is submitted. Same-session reproducibility guaranteed.
Step 5: The New World Kirkpatrick Model — Reverse Design from Level 4
The New World Kirkpatrick Model, developed by James D. Kirkpatrick and Wendy Kayser Kirkpatrick, explicitly reverses the order of program design while keeping the order of measurement. Design starts at Level 4 by naming the organizational result the training must produce, then identifies the Level 3 critical behaviors that would produce that result, then defines the Level 2 knowledge and skills participants need to perform those behaviors, then designs the Level 1 learning experience to deliver that knowledge. Measurement still flows Level 1 → 2 → 3 → 4, but design flows in reverse — which is why the New World model ships Level 3 and 4 evidence instead of treating them as optional.
The reverse-design logic is not controversial; it is structurally correct. Level 4 results exist only if Level 3 behaviors happen. Level 3 behaviors exist only if Level 2 capability exists. Level 2 capability exists only if the Level 1 experience was well enough designed to transfer knowledge. If you design Level 1 first without knowing the Level 4 target, every downstream level is hope rather than plan. See the training evaluation sibling page for how the reverse-design logic integrates with the broader Kirkpatrick Ceiling framing.
Step 6: Kirkpatrick Model vs. Phillips ROI, CIRO, and Brinkerhoff
Kirkpatrick is the baseline. Phillips ROI extends Kirkpatrick with a fifth level that monetizes Level 4 outcomes for CFO-facing financial justification. CIRO evaluates Context, Input, Reaction, and Outcome — front-loading design quality before measuring outcomes. Brinkerhoff's Success Case Method studies the top and bottom performers through qualitative interviews to isolate what enabled success and what blocked it.
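Phillips' fifth level rests on one standard formula: ROI% = (net program benefits ÷ program costs) × 100, where benefits are Level 4 outcomes isolated from other factors and converted to currency. A toy calculation (all figures invented, not benchmarks):

```python
# Phillips ROI: ROI% = (net program benefits / program costs) * 100.
# Figures are illustrative only.
program_costs = 120_000        # design, delivery, participant time
monetized_benefits = 300_000   # isolated Level 4 outcomes converted to currency

net_benefits = monetized_benefits - program_costs
roi_pct = net_benefits / program_costs * 100
print(f"ROI = {roi_pct:.0f}%")  # (300000 - 120000) / 120000 * 100 = 150%
```

The arithmetic is trivial; the work is in the isolation and monetization steps that produce `monetized_benefits` — which is why Phillips is reserved for programs where a CFO-facing number justifies that effort.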
Kirkpatrick vs. Alternatives
Kirkpatrick, Phillips ROI, CIRO, Brinkerhoff — when each one fits
Four evaluation frameworks with overlapping but distinct purposes. Kirkpatrick is the global default. The others extend, augment, or reframe it for specific contexts. Most mature programs run Kirkpatrick as the spine and layer one alternative for a specific gap.
Frameworks compared: Kirkpatrick (four-level model, 1959) · Phillips ROI (five-level extension) · CIRO (Context / Input / Reaction / Outcome) · Brinkerhoff SCM (Success Case Method)

Purpose and scope

Primary question
- Kirkpatrick: Did training produce the full cascade — reaction, learning, behavior, results?
- Phillips ROI: Did training produce financial return justifying the cost?
- CIRO: Was the training well designed before outcomes are judged?
- Brinkerhoff SCM: What distinguished the top and bottom performers post-program?

Levels or stages
- Kirkpatrick: Four (Reaction, Learning, Behavior, Results)
- Phillips ROI: Five (adds an ROI layer on top of Kirkpatrick)
- CIRO: Four (Context, Input, Reaction, Outcome)
- Brinkerhoff SCM: Extreme-case qualitative interviews

Default for workforce training
- Kirkpatrick: Yes. The global standard.
- Phillips ROI: Only when CFO-facing ROI is required.
- CIRO: When program design quality is the primary concern.
- Brinkerhoff SCM: As a narrative layer, not standalone.

Typical use cases

Leadership development
- Kirkpatrick: Best fit. L3/L4 maps directly to promotion velocity, team retention, business unit performance.
- Phillips ROI: Common for enterprise programs with board-level sponsorship.
- CIRO: Rarely used.
- Brinkerhoff SCM: Strong companion for narrative case studies.

Sales training
- Kirkpatrick: Best fit. L4 links directly to quota, pipeline, and win rate via CRM.
- Phillips ROI: Very common — CRM data enables financial translation.
- CIRO: Uncommon.
- Brinkerhoff SCM: Used to study top versus bottom sellers.

Compliance and regulatory
- Kirkpatrick: Adequate — L1/L2 is typically sufficient.
- Phillips ROI: Needed when risk-avoidance ROI must be justified.
- CIRO: Uncommon.
- Brinkerhoff SCM: Rarely used.

Public sector and international development
- Kirkpatrick: Common — especially for workforce and health training.
- Phillips ROI: Less common — financial ROI is not always the question.
- CIRO: Common; the CIPP variant is widely used.
- Brinkerhoff SCM: Used for narrative reporting to funders.

Coaching and mentoring
- Kirkpatrick: Standard fit. L3 behavior change is the core measurement.
The practical rule: use Kirkpatrick as the spine. Add Phillips ROI when the CFO is in the funding conversation. Add Brinkerhoff Success Case Method when narrative depth matters to the funder. Use CIRO when the question is design quality, not outcomes. All four require the same underlying architecture — a persistent participant ID that carries through every instrument — which is exactly what Sopact Sense provides.
The choice between them is rarely binary. Most mature programs run Kirkpatrick as the default framework with Brinkerhoff layered in for narrative depth, or Phillips ROI for enterprise compliance and leadership development where financial justification is required. CIRO and CIPP are most common in multi-phase public sector and international development programs. See the training evaluation methods page for the full seven-method comparison.
Step 7: How to Apply the Kirkpatrick Model in Practice
Apply the Kirkpatrick Model by starting with the funder's or sponsor's actual question, naming which level answers that question, then designing backward from there. A funder asking "did the program produce outcomes?" needs Level 4. A funder asking "are participants applying the skills?" needs Level 3. A program director asking "is our instructional design working?" needs Level 2. A conference organizer asking "did attendees find it valuable?" needs Level 1.
The application sequence has five steps. Name the target level. Define the two to four specific observable behaviors that level measures. Design the instruments that will capture those behaviors at the required intervals. Assign a persistent participant ID before the first instrument runs. Build the report template against the ID chain before the cohort ends.
Masterclass · Level 3 without LMS dependency
Kirkpatrick Level 3 without an LMS — mentor observation as the measurement layer
Most Level 3 implementations fail because the LMS is treated as the only measurement surface — and the LMS participant ID doesn't travel to managers, mentors, or 90-day follow-up. This walkthrough shows mentor-based behavior evaluation running on a persistent learner ID spine, no LMS integration required.
01 · Named behaviors
Two to four observable behaviors defined before the cohort — scored identically by learner and mentor against the same rubric.
02 · Persistent ID
Mentor observations inherit the learner's ID automatically — no CSV matching, no LMS integration.
03 · Triangulated evidence
Self-report, mentor observation, and 90-day follow-up — agreement and dissent surface automatically.
Watch the walkthrough — then see the workforce and correlation examples above for what Level 2 pre-post deltas and Level 3 behavior evidence look like on the same spine.
Step 8: Build Your Kirkpatrick Implementation Before the Cohort Opens
The single highest-leverage decision in Kirkpatrick implementation is made before the first intake form is built — not after the cohort graduates and the funder question arrives. Programs that design the persistent ID architecture upfront produce Level 3 and Level 4 evidence as default outputs. Programs that retrofit after the fact produce narrative claims dressed as measurements.
Frequently Asked Questions
What is the Kirkpatrick Model?
The Kirkpatrick Model is a four-level framework for evaluating training programs, measuring Reaction (Level 1), Learning (Level 2), Behavior on the job (Level 3), and organizational Results (Level 4). Developed by Donald Kirkpatrick in 1959 and extended in the New World Kirkpatrick Model, it remains the most widely used training evaluation framework globally. The four levels form a causal cascade — each level's outcome depends on the previous level's success.
What are the four levels of the Kirkpatrick Model?
The four levels of the Kirkpatrick Model are Reaction, Learning, Behavior, and Results. Level 1 Reaction measures participant satisfaction with the training experience. Level 2 Learning measures knowledge and skill acquisition through pre-post assessment. Level 3 Behavior measures on-the-job application at 30 to 90 days post-program. Level 4 Results measures organizational outcomes — productivity, retention, revenue, safety incidents — over 6 to 12 months.
What is the Cascade Break in Kirkpatrick evaluation?
The Cascade Break is the structural failure where Kirkpatrick's four-level cascade (Reaction → Learning → Behavior → Results) gets measured as four disconnected events because no persistent learner identity links the levels across the tools that produce them. The model is not broken — the data architecture beneath it is. Sopact Sense closes the cascade by assigning persistent learner IDs at enrollment that carry through every subsequent instrument.
How do you measure Kirkpatrick Level 2 learning?
Measure Kirkpatrick Level 2 learning through paired pre-training and post-training assessment using identical items and identical scoring rubrics, with the delta calculated per individual participant. Common methods include multiple-choice knowledge tests, performance observation scored against a competency rubric, self-reported confidence ratings on named skills, structured case-study analysis, and role-play evaluation. The pre-post pair must share the same participant identifier or the delta is statistically meaningless.
How do you measure Kirkpatrick Level 3 behavior change?
Measure Kirkpatrick Level 3 behavior change by defining two to four specific observable behaviors at intake, capturing a baseline self-report score at enrollment, collecting matched scores from the learner and ideally a manager or mentor at 30, 60, or 90 days post-program, and pairing the statistical delta with open-ended reflection on what enabled or blocked application. All four measurements must share the same persistent participant ID.
How do you measure Kirkpatrick Level 4 results?
Measure Kirkpatrick Level 4 results by linking training records to business outcome data — productivity, retention, safety incidents, revenue, promotion velocity — over 6 to 12 months post-program through the persistent participant ID assigned at enrollment. Only around 12% of organizations consistently measure Level 4 because the participant ID in training records rarely matches the ID in HRIS or operational systems without manual reconciliation.
What is the New World Kirkpatrick Model?
The New World Kirkpatrick Model, developed by James D. Kirkpatrick and Wendy Kayser Kirkpatrick, reverses the order of program design while keeping the order of measurement. Design starts at Level 4 (target organizational result) and works backward through Level 3 behaviors, Level 2 capabilities, and Level 1 experience. Measurement still flows Level 1 through 4. Reverse-design ensures Level 3 and 4 evidence is built in from the start rather than retrofitted.
What is the difference between Kirkpatrick and Phillips ROI Model?
The Phillips ROI Model extends the Kirkpatrick Model with a fifth level that monetizes Level 4 organizational outcomes using the formula ROI% equals Net Program Benefits divided by Program Costs, multiplied by 100. Phillips is required when CFO-facing financial justification is needed for enterprise leadership development or compliance training investments. Kirkpatrick's four levels remain the underlying structure — Phillips adds the financial translation layer on top.
How is the Kirkpatrick Model used for leadership development evaluation?
The Kirkpatrick Model is used for leadership development evaluation by naming Level 4 outcomes (promotion velocity, team retention under the leader, 360-degree feedback scores, business unit performance) and working backward to Level 3 critical leadership behaviors and Level 2 capability development. Level 3 for leadership programs typically includes structured manager observation, direct-report 360 feedback, and self-report against named leadership behaviors — all tied to the same participant ID.
How is the Kirkpatrick Model used for sales training evaluation?
The Kirkpatrick Model is used for sales training evaluation by defining Level 4 outcomes (quota attainment, pipeline velocity, deal size, win rate) and working backward to Level 3 sales behaviors (discovery call structure, objection handling, closing technique), Level 2 product and methodology knowledge, and Level 1 training experience quality. Sales evaluation benefits from Level 4 availability — CRM systems already track outcomes — but requires persistent participant ID to link training to CRM records.
Can you use the Kirkpatrick Model for coaching evaluation?
Yes — the Kirkpatrick Model applies to coaching evaluation with Level 1 measuring coach-coachee fit and session quality, Level 2 measuring insight or skill acquisition through structured self-report, Level 3 measuring behavior change in the workplace at 60 to 90 days post-engagement, and Level 4 measuring outcomes aligned with the coaching goal. Coaching evaluation at Level 3 depends heavily on manager observation and 360-degree feedback tied to the coachee's persistent ID.
Why do most programs stop at Kirkpatrick Level 2?
Most programs stop at Kirkpatrick Level 2 because the data architecture required to link Level 3 and Level 4 evidence back to individual participants does not exist in standard training stacks. LMS platforms issue their own participant IDs that do not follow learners outside the LMS. Survey tools issue separate response IDs. HRIS systems use yet another identifier. Without a persistent learner ID chain, Level 3 and Level 4 become manual reconciliation projects that rarely finish before the reporting deadline.
How long does it take to implement the Kirkpatrick Model?
Implementing the Kirkpatrick Model takes between 2 weeks and 3 months depending on scope. A single-cohort Level 1–2 implementation can be stood up in 2 weeks with existing instruments. A full Level 1–4 implementation with persistent ID architecture, Level 3 follow-up cadence, and Level 4 outcome linkage typically takes 4 to 8 weeks of design work before the first cohort enrolls. The architecture is built once and reused across every subsequent cohort.
Close the Cascade Break · Pick Your Next Step
Implement all four Kirkpatrick levels — starting with the next cohort
The highest-leverage decision in Kirkpatrick implementation happens before the first intake form is built — not after graduation when the funder question arrives. Three concrete next steps. Pick the one that matches where your program stands today.
01 · Evaluate
See all four levels in live data
Open the workforce cohort and correlation examples above without a login. See Level 2 pre-post deltas, Level 3 mentor observations, and the citation chain linking every finding to source responses — all running on one persistent learner ID spine.
Three common workflows — application review, bound pre/post/mentor 360° evaluation, or accounting-system integration. The wizard on the training evaluation page names the architecture that fits each, plus where Sopact is the wrong answer.
Bring the Level 3 or Level 4 question you cannot currently answer. In 30 minutes we show what the cascade looks like with persistent IDs, named behaviors, mentor observation, and citation-backed evidence on your actual program data.