
Kirkpatrick Model: 4 Levels of Training Evaluation Explained

The Kirkpatrick four-level model of training evaluation: reaction, learning, behavior, results. Definitions, examples, sample questions per level, plus the New World Kirkpatrick upgrade for 2026.

Updated May 15, 2026
LEVEL 01 · REACTION How they felt Engagement, perceived relevance, risk flags. Captured continuously from open-ended responses, not a post-training smile sheet.
LEVEL 02 · LEARNING What they learned Pre to Post score deltas plus a six-axis skills radar with Pre and Mid overlaid. Concept mastery extracted from open responses.
LEVEL 03 · BEHAVIOR If they apply it Peer-rated effectiveness from 6 cohort members plus real-world application count, both on one persistent record.
LEVEL 04 · RESULTS If the org benefited Distribution shift, risk flags cleared, benchmark comparison. Multivariate regression for defensible attribution.
Definition

What is the Kirkpatrick model?

The Kirkpatrick model is a four-level framework for evaluating training programs, developed by Donald Kirkpatrick in 1959. The four levels measure progressively deeper effects of training: Level 1 Reaction (how participants felt), Level 2 Learning (what they learned), Level 3 Behavior (whether they apply it on the job), and Level 4 Results (whether the organization benefited). It remains the most widely used training evaluation model worldwide in 2026.

Donald Kirkpatrick developed the four-level model as part of his 1954 doctoral dissertation at the University of Wisconsin, then published it as a series of four articles in 1959 in the Training and Development Journal. His 1994 book Evaluating Training Programs: The Four Levels codified the framework and made it the de facto standard for corporate L&D evaluation. His son Jim Kirkpatrick and daughter-in-law Wendy Kayser Kirkpatrick continue the work today through Kirkpatrick Partners, including the 2016 New World Kirkpatrick Model update covered later on this page.

The reason the model has lasted seven decades is structural. Each level answers a question a different stakeholder actually asks. Participants and instructors care about Level 1 Reaction. L&D teams care about Level 2 Learning. Managers and HR business partners care about Level 3 Behavior. The CFO and the board care about Level 4 Results. A single evaluation that satisfies all four levels means the same dataset can answer four different audiences.

The challenge is that most enterprise training programs stop at Level 1. The end-of-session smile sheet captures Reaction and goes nowhere near Behavior or Results. Capturing Levels 3 and 4 traditionally required expensive 360 surveys 3 to 6 months post-training, which most programs cannot afford. The result is a generation of L&D programs that satisfy Kirkpatrick Level 1 on paper while having no defensible answer to "did the training change behavior" or "did the business benefit."

The AI-native upgrade to Kirkpatrick does not replace the four levels. It changes the data collection underneath them. Persistent participant IDs mean every Pre, Mid, Post, peer rating, audio reflection, and LMS event lands on the same row automatically. AI extraction on collection means open-ended responses become Level 1 sentiment, Level 2 evidence, and Level 3 indicators in real time. Cross-system joins with the LMS and feedback platforms make Level 4 attribution defensible. The rest of this page works through one cohort end to end, with each section mapped explicitly to the Kirkpatrick levels it satisfies.
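To make the persistent-record idea concrete before the walkthrough, here is a minimal sketch of what one participant record could look like. Python is used for all sketches on this page; the field names are illustrative, not Sopact's actual schema.

```python
# Illustrative sketch of a persistent participant record (not Sopact's schema).
# Every Pre, Mid, Post, peer rating, and LMS event lands on the same object,
# keyed by one persistent ID.
from dataclasses import dataclass, field

@dataclass
class ParticipantRecord:
    participant_id: str                               # e.g. "P-1247", never reissued
    cohort: str                                       # e.g. "Spring 2026"
    pre: dict = field(default_factory=dict)           # L1 sentiment + L2 baseline
    mid: dict = field(default_factory=dict)           # L2 mid-cycle + early L3 signal
    post: dict = field(default_factory=dict)          # L2 final score + reflections
    peer_ratings: list = field(default_factory=list)  # L3: six cohort raters at Post
    lms_events: list = field(default_factory=list)    # joined from the LMS on the same ID
```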

Interactive lifecycle · cohort program

Click any stage. Watch one record evolve.

12 weeks, 24 participants, one persistent learner ID each. Open-ended responses captured alongside scaled metrics. Mid-cycle coaching interviews ingested as structured evidence. AI narrative summaries written for every participant.

Cohort pulse
Communication Skills Cohort · Spring 2026 · 24 participants · 12-week program with weekly mentor sessions
100% Low confidence at Pre · 70% High confidence at Post · +1.2 peer rating Δ · 4 risk flags resolved
Coordinator view
Enroll a new participant
Marcus Thompson
m.thompson@example.org
Communication Skills · Spring 2026
12-week · weekly mentor sessions + peer practice
Self-referred
Sopact platform
Cohort table · 24 participants enrolled
ID · Name · Cohort · Source · Status
P-1247 · Marcus Thompson · Spring 2026 · Self-referred · Enrolled
P-1246 · Priya Sundaram · Spring 2026 · Sponsor-funded · Enrolled
P-1245 · James Liu · Spring 2026 · Sponsor-funded · Enrolled
P-1244 · Aisha Khan · Spring 2026 · Self-referred · Enrolled
P-1243 · Diego Ramirez · Spring 2026 · Sponsor-funded · Enrolled
+19 more · Spring 2026 · Mixed · Enrolled
Validation at intake · 24 enrolled, 2 records flagged. Duplicate email caught for P-1233 (existing in the Fall 2025 cohort). Missing email for P-1252, surfaced for HR re-collection. Persistent ID assigned to all 24. Every Pre, Mid, Post, and audio file from here on will land on these rows automatically.
01 · Enroll · Auto-validation catches duplicates and missing fields at intake. Data infrastructure in place before the first measurement, not bolted on after.
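A minimal sketch of what that intake validation could look like, assuming enrollment rows arrive as dicts with name and email keys; the helper is hypothetical, not Sopact's API.

```python
# Hypothetical intake validator: catches duplicate and missing emails,
# assigns a persistent ID to every clean row.
import itertools

_next_id = itertools.count(1248)  # arbitrary starting point for the sketch

def validate_intake(rows, existing_emails):
    enrolled, flagged = [], []
    seen = {e.lower() for e in existing_emails}
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        if not email:
            flagged.append((row.get("name"), "missing email - surface for re-collection"))
        elif email in seen:
            flagged.append((row.get("name"), "duplicate email - existing record"))
        else:
            seen.add(email)
            row["participant_id"] = f"P-{next(_next_id)}"  # persistent from here on
            enrolled.append(row)
    return enrolled, flagged
```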
Participant view · pre-assessment
Marcus answers 3 questions in week 1
Q1 · scale 0–100
Speaking confidence self-rating
48 / 100
Q2 · yes/no
Have you led a meeting or presentation in the past 30 days?
No
Yes
Q3 · open-ended · the one that matters most
What worries you most about speaking up in meetings or presenting?
I freeze when I have to speak up in meetings. I rehearse what I want to say a hundred times but never raise my hand. I'm afraid of looking stupid in front of people who are more senior.
Sopact platform · AI on collection
Marcus's record · open answer becomes structured data
AI
Extracted from Q3
P-1247 · Pre · Jan 13
Sentiment
Anxious · self-aware
Top fear
Looking unprepared in front of senior colleagues
Readiness
Low
Themes
freeze response · over-rehearsal · status anxiety
Predicted track
Cluster B · benefits most from low-stakes practice with peer pairs (weeks 2-4)
AI narrative summary · for the coach
Marcus shows classic over-preparation anxiety, with status concern (fear of looking unprepared to senior people) as the dominant theme. His response pattern matches participants who benefit most from low-stakes peer practice in weeks 2-4. Recommend pairing with Priya S. (similar profile) for weekly speaking drills. Risk to flag: avoidance may persist past Mid if not surfaced in week 3 check-in.
Cohort sentiment quadrant · all 24 at Pre
N=24 · plotted from open-ended responses
[Sentiment quadrant · axes: Anxious↔Excited and Uncertain↔Confident · Cluster A: 7 · Cluster B: 11 · Cluster C: 4 · Cluster D: 2 · Marcus plotted in Cluster B]
Top fears from 24 open-ended responses
AI clusters
46% Status anxiety
33% Freeze response
21% Visual aids
02 · Pre · The open question is the unlock. Q1 says 48. Q2 says No. Q3 says why: Marcus is in Cluster B, fearing exposure to senior people, ready for week-2 peer drills. The AI writes a coaching note specific to him from one sentence.
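A minimal sketch of the extraction-on-collection step that turns the open answer into structured fields. `call_llm` is a hypothetical stand-in for whatever model API you use, and the output schema simply mirrors the fields shown above.

```python
# Hypothetical extraction-on-collection: one open-ended answer in,
# structured L1/L2 signals out. call_llm is a placeholder, not a real API.
import json

EXTRACTION_PROMPT = (
    "From the participant's answer, return JSON with keys: "
    "sentiment, top_fear, readiness (Low/Moderate/High), themes (list of strings)."
)

def extract_pre_signals(open_response: str, call_llm) -> dict:
    raw = call_llm(prompt=EXTRACTION_PROMPT, text=open_response)
    signals = json.loads(raw)                    # assumes the model returns valid JSON
    signals["risk_flag"] = signals["readiness"] == "Low"  # surfaces the L1 risk signal
    return signals
```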
Mentor view · 45-min structured interview
Week 6 mentor session · Marcus and Tom Anderson
TA
Mentor: Tom Anderson · Marcus T. (P-1247)
Mid · interview · Feb 24, 2026 · 45 min · recorded with consent
Skills practiced this cycle
Marcus volunteered to speak in 4 group settings this cycle (target was 2). Two were full team meetings, one was an external client demo, one was a cross-team presentation. Self-rates the delivery quality 7/10.
Real situation faced
Marcus presented the quarterly update to 30 colleagues. Rehearsed three times, voice unsteady in the first 30 seconds. By the third slide his pacing settled and the points landed. Two colleagues asked questions, both got clear answers.
Confidence in own words
"It's still scary but no longer terrifying. I'm rehearsing less. I have a structure now. Slides help when my voice is unsteady. I still freeze when someone interrupts me mid-sentence."
Concern flagged
Has not yet led a meeting facing pushback or interruption. Defaults to one-on-one prep over group facilitation. Recommendation: weeks 7-9 facilitation module with mock interruptions.
Sopact platform · interview to structured data
AI processes 45 minutes into one record
AI
Mid interview extraction
P-1247 · 45 min audio + notes
Readiness
65 · +17 vs Pre
Speaking events
4 instances · target 2 · 200% of target
Confidence
Moderate · up from Low
Strengths
preparation discipline · structure adoption · recovery in delivery
Risk signal
interruption-response gap · flag for weeks 7-9 facilitation module
Marcus skills profile · 6 competencies
[Radar chart · Pre and Mid overlaid · axes: Voice, Structure, Slides, Pushback, Listening, Presence]
Cohort readiness shift · Pre to Mid
N=24 · 4 risk flags
17% Low
50% Moderate
33% High
Low 4 · Moderate 12 · High 8
03 · Mid · Interview · A 45-minute conversation produces richer evidence than any survey. AI extracts the score, the speaking-event count, the confidence shift, the strength tags, and a new risk signal in one pass. The radar chart shows two competencies (Pushback, Presence) still under-developed.
Participant view · week 12
Final assessment plus 360 plus audio
Q1 · scale 0–100
Final speaking confidence
82 / 100
Q2 · peer-rated effectiveness from 6 cohort members
Peer-rated effectiveness score
7.8 / 10
3:08 · audio reflection
"I gave the all-hands presentation last month. Knees shaking, voice steady. Sarah from the cohort told me afterward she could see I was nervous but my points landed. I want to facilitate the next program orientation."
Sopact platform · the full Pre to Post arc
Marcus's longitudinal record
12-week readiness trajectory
[Trajectory chart · Marcus (solid) vs cohort average (dashed) · W1: 48 → W6 (Mid): 65 → W12: 82 · y-axis 20-100]
AI narrative · final coaching note
Marcus completed the program with a +34 confidence score lift (48 to 82), outperforming cohort average of +24. His turning point was the quarterly update presentation in week 6, which broke the avoidance pattern surfaced at Pre. Peer-rated effectiveness rose from 6.2 to 7.8 over 12 weeks. Recommend: post-program facilitator role for the Summer 2026 cohort.
Score ΔPre to Post
+34
82 vs 48
Peer effectiveness
+1.6
7.8 vs 6.2
Risk status
Cleared
interruption gap resolved
04 · Post · The Pre baseline is what makes the Post reading mean something. From "I freeze in meetings" to giving the all-hands presentation. From 48 to 82. Peer-rated effectiveness rose +1.6 points. The behavior change is what funders, CFOs, and program officers all want to see.
Program manager view
Four canonical reports, one dataset
Funder · board · staff · participants
English, Portuguese, Spanish, French
Correlation · Impact · Multivariate · Cohort compare
Same 24 participants, same Pre + Mid + Post data. Four different report shapes for four different audiences. All reproducible at the click of a button.
Sopact platform · live preview
Impact snapshot · Spring cohort
+24
Avg confidence lift
+1.2
Peer effectiveness pts
88%
Completion rate
Click into Component 2 below to switch between the four reports: Correlation (confidence vs peer effectiveness), Impact (cohort-wide deltas), Impact in Portuguese (the multilingual toggle), and Multivariate (what predicts high-confidence completion).
05 · Reports · Exec, CHRO, board, participants. Same dataset, four report shapes. Multilingual is one click, not a translation project.
Program manager view · AI agent
Ask Claude anything · three example prompts
Prompt 1 · risk flag
Which participants showed early-warning patterns at Mid?
Prompt 2 · external benchmark
Compare our cohort confidence lift against industry benchmarks.
Prompt 3 · cross-system join
Join our data with internal feedback system. Which graduates now mentor others?
Sopact + Claude · joined live
Sample answer · prompt 2 preview
Avg confidence lift · our cohort vs benchmarks
Our Spring cohort
+24
Toastmasters P75
+18
Self-paced P50
+11
Claude's read · Your cohort outperforms these benchmarks by 6 to 13 points. Driver candidates from the multivariate analysis: 45-min Mid interviews (most programs use a 15-min check-in), AI-assisted coach narratives (cited in 19 of 24 exit reflections), and structured peer pairing in weeks 2-4. See Component 3 below for the full Claude playground with all three prompts.
06 · Action · Data + a plain-English question. No SQL, no BI ticket. AI joins, charts, explains. Three prompts · run all three in Component 3 below.
The four levels

Kirkpatrick four levels of training evaluation, explained

The four Kirkpatrick levels measure progressively deeper effects of training: Reaction (L1), Learning (L2), Behavior (L3), and Results (L4). Each level answers a question a different stakeholder actually asks. The data collection design determines which levels you can actually evaluate. Most programs stop at L1 because L2 to L4 traditionally required expensive separate studies. The AI-native approach captures all four levels from one persistent record.

Every measurement point in the cohort above maps to one or more Kirkpatrick levels. The Pre assessment captures L1 baseline reaction plus L2 starting score. The Mid interview captures L2 mid-cycle plus an L3 early-application signal. The Post score plus peer 360 captures L2 final, L3 sustained behavior, and inputs to the L4 distribution analysis. The same persistent record feeds all four levels. The four cards below cover each level in detail.

LEVEL 01 · REACTION

How participants felt about the training

8.7/10 average · 4 risk flags cleared

What it measures. Engagement, perceived relevance to the job, confidence in the instructor or content, and willingness to recommend the program. The most superficial level, but a leading indicator: a participant who flags low engagement at Pre is unlikely to show L2 learning at Post unless the program intervenes.

Traditional method. The end-of-session smile sheet survey. Captures Reaction once, at the worst possible moment (right after a long workshop, when participants want to leave). Produces inflated positivity that does not predict L2 to L4.

AI-native upgrade. Sentiment captured continuously from one open-ended question at Pre, Mid, and Post. AI extracts engagement, perceived relevance, and risk flags from each response. A risk flag at Pre week 1 is addressed in week 2, not noticed at Post week 12. The Spring 2026 Communication Skills cohort cleared 4 of 4 risk flags by Post.

Example Level 1 question. "What worries you most about applying these skills at work? Be specific about a recent situation if you can." AI extracts sentiment polarity, theme cluster, and risk flag from a single paragraph.

LEVEL 02 · LEARNING

What participants learned

+24 average confidence · 100% Low → 70% High

What it measures. Knowledge, skills, attitudes, confidence, or commitment acquired during the training. The classic Level 2 instrument is a Pre to Post score delta on a target competency. For Communication Skills, the Spring 2026 cohort moved an average of 24 confidence points (52 to 76 on a 0-100 scale). The distribution shift is more informative than the average: the cohort moved from 100% Low confidence at Pre to 70% High at Post.

Traditional method. Pre-test and post-test on a knowledge assessment. Captures recall but rarely captures application or attitude shifts. Often graded by the instructor, introducing scoring bias.

AI-native upgrade. Same scaled self-rating Pre and Post for clean delta. Six skill dimensions on a radar chart with Pre and Mid overlaid so the program manager sees which competencies still need work in week 7. AI extracts evidence of concept mastery from open-ended responses at each measurement point. Marcus Thompson moved Voice 4 to 9, Structure 3 to 8, Pushback 2 to 7 over 12 weeks.

Example Level 2 question. "On a 0 to 100 scale, how confident are you speaking up in cross-functional meetings?" Same question, asked at Pre, Mid (in the interview), and Post. The delta is the L2 signal.
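A minimal sketch of the Level 2 arithmetic, assuming a pandas DataFrame with one row per participant and pre_confidence / post_confidence columns; the Low/Moderate/High cut points (50 and 75) are illustrative, not the program's actual bands.

```python
# Level 2 signal: Pre-to-Post delta plus the distribution shift.
import pandas as pd

def level2_summary(df: pd.DataFrame) -> dict:
    delta = df["post_confidence"] - df["pre_confidence"]
    bands = pd.cut(df["post_confidence"], bins=[0, 50, 75, 100],
                   labels=["Low", "Moderate", "High"], include_lowest=True)
    return {
        "avg_delta": round(delta.mean(), 1),                        # e.g. +24
        "post_distribution": bands.value_counts(normalize=True).to_dict(),
    }
```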

LEVEL 03 · BEHAVIOR

Whether participants apply the learning on the job

Peer 6.4 → 7.6 · 9 application events

What it measures. Whether the learned skill or behavior shows up in the participant's daily work. The hardest Kirkpatrick level to capture defensibly. The Spring 2026 cohort's peer-rated effectiveness moved from 6.4 to 7.6 of 10 (+1.2 points), and Marcus Thompson logged 9 speaking events (meetings led, presentations given) during the 12 weeks.

Traditional method. 360 feedback survey 3 to 6 months post-training. Expensive (often requires HR to coordinate), slow (results arrive after the cohort has moved on), and most programs skip it entirely.

AI-native upgrade. Three signals captured during the program, all on one record. First, peer-rated effectiveness from 6 cohort members at Post. Second, real-world application count (speaking events for Communication Skills, customer calls for Sales Enablement, customer presentations for Customer Success). Third, LMS application activity joined via persistent ID. The Spring 2026 cohort's L3 signal was visible at week 12, not 3 months later.

Example Level 3 prompt. "Walk me through the hardest moment so far where you applied a skill from this program. What did you try? What worked?" Asked during the Mid-cycle interview. The story becomes Level 3 evidence captured as structured data, not anecdote.

LEVEL 04 · RESULTS

Whether the organization benefited

+6 to +15 points above benchmarks

What it measures. The targeted business or operational outcome the training was designed to produce. For Communication Skills, the L4 target was a peer-rated effectiveness lift of 1.0+ points with all risk flags cleared. The cohort delivered +1.2 peer effectiveness, cleared 4 of 4 risk flags, and its +24 confidence lift beat Toastmasters P75 (+18) by 6 points and the corporate L&D average (+9) by 15 points.

Traditional method. Business KPI attribution. Methodologically difficult because the training is one of many factors influencing the KPI. Most attempts produce numbers that the CFO discounts as unfalsifiable.

AI-native upgrade. Multivariate regression with standardized beta coefficients ranking program drivers. The Spring 2026 model returned: Mentor session minutes β=0.42, Peer pair sessions β=0.31, Speaking events β=0.24, AI narrative engagement β=0.18, LMS module completion β=0.09 (not significant). The model explains 68% of the variance. This is the defensible attribution that satisfies a finance team. The cross-system AI agent in Component 3 above lets a program manager interrogate the L4 result further with plain-English questions.
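A minimal sketch of how standardized betas could be produced, assuming a DataFrame with the five driver columns and the confidence delta; the column names are illustrative, and statsmodels OLS is one of several ways to fit the model.

```python
# Standardized-beta regression: z-score everything, fit OLS, read the betas.
import pandas as pd
import statsmodels.api as sm

DRIVERS = ["mentor_minutes", "peer_pair_sessions", "speaking_events",
           "ai_narrative_engagement", "lms_modules_completed"]  # assumed column names

def standardized_betas(df: pd.DataFrame):
    cols = DRIVERS + ["confidence_delta"]
    z = df[cols].apply(lambda c: (c - c.mean()) / c.std())      # z-score all variables
    model = sm.OLS(z["confidence_delta"], sm.add_constant(z[DRIVERS])).fit()
    return model.params[DRIVERS].sort_values(ascending=False), model.rsquared
```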

Example Level 4 question. "Compared to the Toastmasters P75 benchmark of +18 confidence points, did our cohort beat or miss?" The benchmark comparison surfaces in Report 2 of Component 2 above. The board uses this number, not the raw delta.

Level | What it measures | Traditional method | AI-native upgrade | Cohort signal
L1 · Reaction | How participants felt | End-of-session smile sheet | Continuous sentiment from open-ended responses | 4 of 4 risk flags cleared
L2 · Learning | What they learned | Pre-test, post-test | Skills radar with Pre/Mid overlay, AI evidence extraction | +24 confidence · 100% Low → 70% High
L3 · Behavior | If they apply it on the job | 360 survey 3-6 months post | Peer rating + application count + LMS activity, all on one record | Peer 6.4 → 7.6 · 9 events for Marcus
L4 · Results | If the organization benefited | Business KPI attribution | Multivariate regression with standardized β | +6 to +15 above 3 benchmarks
L5 · ROI (Phillips extension) | Financial ratio of benefits to costs | Cost-benefit analysis on L4 result | Monetize L4 outcome, divide by program cost | Optional · only when the CFO requests it
Apply it to your cohort

Run all four Kirkpatrick levels on your training program

Walk one of your past cohorts through the L1 to L4 measurement design. 30 minutes with a Sopact specialist. See the reports your CFO would actually accept.

Book a demo →
Component 2 · Reports

Four reports. One dataset. One click each.

Same 24 participants. Same Pre, Mid, Post evidence. Different shape for different audience. Multilingual is a toggle, not a translation project.

Correlation report

Confidence × peer-rated effectiveness

Spring 2026 Communication Skills cohort · N=24 · Pearson correlation analysis

Pearson r
0.74
Strong positive
P-value
<0.001
Highly significant
Sample size
24
complete records
Outliers
2
P-1244 · P-1232
The scatter
Self-rated confidence (Post) vs peer-rated effectiveness
r = 0.74 · slope 0.041
[Scatter plot · x: Post confidence (self-rated, 0-100) · y: Peer effectiveness (1-10) · Marcus T. and Aisha K. labeled · Aisha K. flagged as outlier]
Headline · Confidence and peer-rated effectiveness move together. A 10-point lift in self-reported confidence corresponds to a 0.4-point lift in peer ratings on average. The relationship is strong (r=0.74) and significant (p<0.001).
Why this matters · Internal feeling tracks external behavior. Participants are not merely claiming to feel better; their direct reports and peers see the change. The two outliers (P-1244 Aisha K. and P-1232) felt confident but did not change peer perception; both are flagged for follow-up.
Generated May 15, 2026 · Author Tom Anderson, Program Director · Source Sopact Sense
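For readers who want to reproduce the headline statistics, a minimal sketch assuming two equal-length sequences for the 24 participants; scipy's pearsonr and linregress are one way to get the r, p-value, and slope shown above.

```python
# Correlation report statistics: Pearson r, p-value, and regression slope.
from scipy import stats

def correlation_report(post_confidence, peer_effectiveness):
    r, p = stats.pearsonr(post_confidence, peer_effectiveness)
    slope, intercept, *_ = stats.linregress(post_confidence, peer_effectiveness)
    return {"pearson_r": round(r, 2),   # e.g. 0.74
            "p_value": p,               # e.g. < 0.001
            "slope": round(slope, 3)}   # peer points gained per confidence point
```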
Impact report · Q1 2026

Communication Skills Cohort · Spring 2026

Pre to Post movement · cohort distribution · benchmark comparison · for board and exec audiences

Avg confidence lift
+24
52 → 76 of 100
Completion rate
88%
21 of 24 finished
Peer effectiveness
+1.2
6.4 → 7.6 of 10
Risk flags cleared
4 of 4
100% resolved by Post
Cohort distribution shift
Pre · W1 · 100% Low confidence · N=24
Mid · W6 · 17% Low · 50% Moderate · 33% High · N=24
Post · W12 · 30% Moderate · 70% High confidence · N=21
Benchmarks · external comparison
Our Spring cohort
+24
Toastmasters P75
+18
Self-paced LMS P50
+11
Corporate L&D avg
+9
Bottom line for the board · The cohort outperformed every external benchmark by 6 to 15 points. Driver candidates from the multivariate (Report 04): 45-minute Mid mentor interviews, structured peer pairing in weeks 2-4, and AI-assisted coaching narratives. Recommend: continue the model for the Summer 2026 cohort with the same mentor-to-participant ratio.
Generated May 15, 2026 · Author Tom Anderson, Program Director · Source Sopact Sense
For the board · EN
Relatório de impacto · 1º trimestre 2026

Coorte de Habilidades de Comunicação · Primavera 2026

Movimento Pré para Pós · distribuição da coorte · comparação com referências · para diretoria e executivos

Ganho médio de confiança
+24
52 → 76 de 100
Taxa de conclusão
88%
21 de 24 concluíram
Efetividade entre pares
+1,2
6,4 → 7,6 de 10
Sinais de risco
4 de 4
100% resolvidos até Pós
Mudança de distribuição da coorte
Pré · S1 · 100% Baixa confiança · N=24
Meio · S6 · 17% Baixa · 50% Moderada · 33% Alta · N=24
Pós · S12 · 30% Moderada · 70% Alta confiança · N=21
Referências · comparação externa
Nossa coorte da Primavera
+24
Toastmasters P75
+18
LMS auto-guiado P50
+11
Média L&D corporativo
+9
Conclusão para a diretoria A coorte superou todas as referências externas em 6 a 15 pontos. Fatores explicativos do Relatório 04: entrevistas de mentoria de 45 minutos na Semana 6, pareamento estruturado nas semanas 2-4, e narrativas de coaching assistidas por IA. Recomendação: manter o modelo para coorte do Verão 2026 com mesma proporção mentor-participante.
Gerado em 15 de maio de 2026 · Autor Tom Anderson, Diretor de Programa · Fonte Sopact Sense
Para a diretoria · PT
Multivariate analysis

What predicts high-confidence completion

Linear regression · 5 program variables predicting Pre-to-Post confidence delta · N=24

R² · model fit
0.68
68% variance explained
F-statistic
7.83
p<0.001
Strongest predictor
β=.42
Mentor session minutes
Weakest predictor
β=.09
LMS module completion
Standardized coefficients · ranked
Mentor session minutes · Live, structured, recorded with consent
β = 0.42
p<0.001 ★
Peer pair sessions · Weekly 30-min practice with assigned partner
β = 0.31
p<0.001 ★
Speaking events count · Volunteered meetings, presentations, demos
β = 0.24
p<0.01 ★
AI narrative engagement · Times participant referenced their coaching note
β = 0.18
p<0.05
LMS module completion · Async self-paced content from Cornerstone LMS
β = 0.09
n.s.
The model says · Human elements drive confidence change. Mentor minutes, peer pairs, and real-world speaking events together explain 90% of the variance the model captures. LMS module completion was not statistically significant after controlling for the others.
Implication for Summer 2026 · If we cut anything, cut LMS modules first. Reallocating 2 hours per participant from async content to extra mentor minutes is projected to add 6 to 8 points of confidence lift. Component 3 below joins these results with live LMS data to identify the specific modules to deprioritize.
Generated May 15, 2026 · Author Tom Anderson, Program Director · Methods OLS regression, standardized coefficients
For program design · Analytical
In practice

Kirkpatrick model sample questions, by level

Three questions per measurement point usually outperform a 40-question survey: one scaled self-rating, one yes/no behavioral check, and one open-ended question that earns its keep. The same three-question pattern maps to all four Kirkpatrick levels when designed correctly. The open-ended question is where Level 1 sentiment, Level 2 evidence, and Level 3 behavioral signal all come from in a single paragraph.

The questions below are the actual instrument used for the Spring 2026 Communication Skills cohort. Each question is tagged with the Kirkpatrick level it primarily satisfies. A few questions satisfy multiple levels, which is the point: well-designed instruments map a single response to multiple levels of analysis.

LEVEL 1 REACTION · sample questions

How they felt

  • Asked at Pre week 1. "What worries you most about applying these skills at work?" Surfaces L1 risk + L2 baseline
  • Asked at Mid week 6. "What part of this program has felt most relevant so far? What has felt least relevant?" L1 mid-cycle reaction with directional signal
  • Asked at Post week 12. "Would you recommend this program to a peer in the same role? Why or why not?" L1 net promoter equivalent
LEVEL 2 LEARNING · sample questions

What they learned

  • Asked at Pre and Post. "On a 0 to 100 scale, how confident are you speaking up in cross-functional meetings?" Same question both times: the Pre-Post delta is the L2 signal
  • Asked at Pre and Post. "Rate yourself 1 to 10 on six dimensions: Voice, Structure, Slides, Pushback, Listening, Presence." Drives the skills radar with Pre/Mid overlay
  • Asked at Mid in interview. "Walk me through the framework you used when you had to handle pushback in that meeting." L2 evidence of concept mastery via narrative
LEVEL 3 BEHAVIOR · sample questions

If they apply it on the job

  • Asked at Pre and Post · participant. "In the last 30 days, have you volunteered to present in a meeting of 5 or more people?" Behavioral check, Pre-Post delta
  • Asked at Mid in interview. "How many speaking events have you participated in since Pre? Which were comfortable, which were not?" Application count + qualitative L3 detail
  • Asked at Post · 6 peers. "Rate this participant's effectiveness on Communication on a 1 to 10 scale. What changed since Q4?" External L3 signal, the Kirkpatrick gold standard
LEVEL 4 RESULTS · sample questions

If the organization benefited

  • Asked at planning · with business sponsor. "What is the operational outcome that would make this program worth running again?" Defines the L4 target before any L1-L3 design
  • Asked at Post · in the multivariate report. "Which program element correlates most strongly with confidence lift?" L4 driver attribution via standardized β coefficients
  • Asked at Post · in the impact report. "How did our cohort compare to Toastmasters P75, self-paced LMS P50, and corporate L&D average?" L4 benchmark comparison the board uses
Component 3 · Actionable insight

Ask Sopact + Claude. Plain English. Cross-system data.

No SQL. No BI ticket. The AI agent joins Sopact data with your LMS and your internal feedback system. Click a prompt to watch the answer come back with the sources tagged.

Connected systems · live
Last sync 4 min ago · 3 of 3 systems healthy
Sopact Sense
PARTICIPANT DATA
Pre + Mid + Post assessments, AI narratives, 24 participants, persistent IDs
Cornerstone LMS
LEARNING ACTIVITY
12 modules, completion rates, time in platform, quiz scores, last activity dates
Lattice Feedback
PEER + 360 SIGNALS
Peer feedback given and received, 360 review responses, public recognition counts
AI
Click any prompt · The AI agent will join data from Sopact + LMS + Feedback systems and stream the answer back with sources tagged.
Compare LMS engagement against Post confidence. Show me where the engagement paradox lives.
AI
Claude · joining Sopact Sense + Cornerstone LMS
1.4s · 48 records joined on P-ID
Joining 24 Sopact records with 24 LMS records on participant ID...

The engagement paradox lives in two participants who completed everything in the LMS but barely moved on Post confidence.

Plotting LMS module completion against Post confidence for the Spring 2026 cohort surfaces a quadrant pattern. Most participants cluster around the diagonal: high LMS engagement tracks with high Post confidence (top-right). But two outliers break the pattern in opposite directions.

LMS modules completed × Post confidence · N=24
2 outliers flagged · joined on P-ID
[Scatter plot · x: LMS modules completed (Cornerstone, 6/12 to 12/12) · y: Post confidence (Sopact, 0-100) · quadrants labeled "human elements worked," "the expected pattern," "need re-engagement," "engagement paradox" · Aisha K. and Diego R. flagged]

Aisha K. (P-1244) completed all 12 LMS modules with a 95 average quiz score, the highest in the cohort. Her Post confidence only rose +6 points (52 to 58), bottom quartile. Pattern matches participants who treat the LMS as a checklist exercise without internalizing the skill. Diego R. (P-1243) finished only 8 of 12 modules but his Post confidence jumped +22 points, driven by 14 attended peer-pair sessions and 9 volunteered speaking events.

What this means: LMS completion is not the change driver. Two participants saturated on async content and still showed the smallest growth. Three under-engaged on LMS but grew most. The human elements of the program carry the lift.

Sources joined: Sopact Sense · 24 Post records | Cornerstone LMS · 24 user records · 12 modules
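A minimal sketch of the persistent-ID join behind this answer, assuming two DataFrames that share a participant_id column; the other column names are illustrative.

```python
# Join Sopact and LMS records on the persistent ID, then isolate the
# "engagement paradox" quadrant: high LMS completion, low confidence lift.
import pandas as pd

def engagement_paradox(sopact: pd.DataFrame, lms: pd.DataFrame) -> pd.DataFrame:
    joined = sopact.merge(lms, on="participant_id", how="inner")
    saturated = joined["modules_completed"] >= joined["modules_total"]
    low_lift = joined["confidence_delta"] < joined["confidence_delta"].median()
    return joined[saturated & low_lift]  # completed everything, barely moved
```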
Rank LMS modules by their correlation with confidence lift. Which content actually moves the needle?
AI
Claude · ranking 6 LMS modules + 2 program elements
1.8s · Pearson r vs Pre-to-Post confidence Δ
Correlating module completion with confidence delta across 24 participants...

The human elements outrank every single LMS module. Mentor sessions (r=0.78) correlate more strongly with confidence lift than your best async module (r=0.61).

I correlated each program element with the Pre-to-Post confidence delta across 24 participants. Higher r means the element more reliably predicts a participant's confidence growth. Two non-LMS elements (mentor sessions, peer pairs) are ranked alongside the 6 Cornerstone LMS modules to show the comparison.

Pearson r · program element vs confidence Δ · N=24
Spring 2026 cohort
Mentor session minutes · SOPACT · live coaching
0.78
Peer-pair sessions · SOPACT · structured practice
0.67
Module 04 · Handling pushback · LMS · 22 min video + role-play
0.61
Module 06 · Executive presence · LMS · 18 min video + reflection
0.42
Module 05 · Active listening · LMS · 14 min video + worksheet
0.34
Module 02 · Structure your message · LMS · 16 min video + worksheet
0.18
Module 01 · Voice basics · LMS · 12 min video + quiz
0.12
Module 03 · Slides that work · LMS · 20 min video + assignment
0.09

What this means: The 22-minute video on handling pushback (Module 04) is the only async content with a meaningful signal. It is also the module that maps most directly to the pushback-and-interruption situations participants actually face, which probably explains the correlation. The five other modules sit at or below r=0.42.

Action: for Summer 2026, recommend keeping Module 04, replacing Modules 01 and 03 with one extended mentor session, and tracking whether the freed time materially shifts the cohort's Post confidence distribution.

Sources joined: Sopact Sense · 24 confidence deltas | Cornerstone LMS · per-module completion
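A minimal sketch of the ranking itself, assuming one DataFrame with a column per program element plus the confidence delta; pandas corrwith computes the Pearson r for each column in one call.

```python
# Rank program elements (LMS modules + human elements) by correlation
# with the Pre-to-Post confidence delta.
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, elements: list,
                        outcome: str = "confidence_delta") -> pd.Series:
    return df[elements].corrwith(df[outcome]).sort_values(ascending=False)
```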
Find graduates ready to mentor. Cross-reference completion, recent LMS activity, and peer-feedback giving.
AI
Claude · joining Sopact + Cornerstone + Lattice
2.3s · 72 records joined across 3 systems
Filtering Sopact graduates with active LMS sessions and high Lattice peer-feedback giving rates...

Five Spring 2026 graduates qualify as Summer 2026 mentors based on the three-system join.

Filter criteria applied across all three systems: Sopact · completed the program with Post confidence of 70 or higher. Cornerstone LMS · logged into the platform in the past 14 days, suggesting continued investment. Lattice · gave at least 4 pieces of peer feedback in the past month, indicating they are comfortable being a source of feedback for others. Five of 21 graduates meet all three criteria.

Marcus Thompson · P-1247 · Engineering
Δ +34 confidence · 12/12 modules · last active 6d ago · 9 peer feedbacks this month
SOPACT 82/100 · LMS ACTIVE · LATTICE 9 GIVEN
Assign →
Priya Sundaram · P-1246 · Sales
Δ +26 confidence · 12/12 modules · last active 3d ago · 7 peer feedbacks this month
SOPACT 78/100 · LMS ACTIVE · LATTICE 7 GIVEN
Assign →
James Liu · P-1245 · Operations
Δ +21 confidence · 11/12 modules · last active 9d ago · 6 peer feedbacks this month
SOPACT 76/100 · LMS ACTIVE · LATTICE 6 GIVEN
Assign →
Sarah Chen · P-1242 · Customer Success
Δ +22 confidence · 10/12 modules · last active 12d ago · 5 peer feedbacks this month
SOPACT 79/100 · LMS ACTIVE · LATTICE 5 GIVEN
Assign →
Diego Ramirez · P-1243 · Engineering
Δ +22 confidence · 8/12 modules · last active 4d ago · 4 peer feedbacks this month
SOPACT 71/100 · LMS ACTIVE · LATTICE 4 GIVEN
Assign →

Note on Diego: his SOPACT score is the lowest of the five at 71, but the lift was outsized (+22) and his Lattice giving rate suggests he learned through peer practice rather than module completion. Could be the strongest peer-style mentor for Cluster B participants in Summer 2026.

Sources joined: Sopact Sense · graduation status | Cornerstone LMS · last 14d activity | Lattice · peer feedback giving rate
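A minimal sketch of the three-system filter, assuming the Sopact, Cornerstone, and Lattice frames are already joined on participant_id; the column names and thresholds mirror the criteria described above and are otherwise assumptions.

```python
# Mentor-readiness filter across three joined systems.
import pandas as pd

def mentor_candidates(df: pd.DataFrame) -> pd.DataFrame:
    ready = (
        (df["post_confidence"] >= 70)            # Sopact: finished strong
        & (df["days_since_lms_login"] <= 14)     # Cornerstone: still engaged
        & (df["peer_feedback_given_30d"] >= 4)   # Lattice: comfortable giving feedback
    )
    return df[ready].sort_values("confidence_delta", ascending=False)
```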
Ask anything · join data from your connected systems · click a prompt above to try
Updates and alternatives

New World Kirkpatrick, plus how it compares to Phillips, CIRO, and Brinkerhoff

The New World Kirkpatrick Model, published by Jim and Wendy Kirkpatrick in 2016, updates Donald's original four levels with backward planning from Level 4 and the concept of Required Drivers as the bridge between Level 2 Learning and Level 3 Behavior. Phillips ROI Methodology adds a fifth level for financial return. CIRO reorders around context and input. Brinkerhoff's Success Case Method uses narrative evidence. Kirkpatrick still anchors most programs because its four levels map cleanly to questions stakeholders actually ask.

The original Kirkpatrick four levels came out in 1959. A lot has changed since then, including the rise of cheap LMS data, AI extraction on collection, and the demand from CFOs for defensible financial attribution. The model has been updated and challenged repeatedly. The two most important variants are the New World Kirkpatrick Model (2016) and the Phillips ROI Methodology (1996). Two less common but historically relevant alternatives are CIRO and Brinkerhoff. The table below covers all four.

2016 UPDATE

The New World Kirkpatrick Model

Published in 2016 by Jim Kirkpatrick (Donald's son) and Wendy Kayser Kirkpatrick through Kirkpatrick Partners. The four levels stay the same: Reaction, Learning, Behavior, Results. The big change is the planning sequence and the introduction of Required Drivers.

Backward planning. Start at Level 4 by defining the operational outcome with the business sponsor. Then design Level 3 Required Drivers (the systems, manager behaviors, and processes that reinforce on-the-job application). Then Level 2 Learning objectives. Then Level 1 Reaction design. This is the opposite of how most legacy programs work, which start at Level 1 and hope Level 4 will follow.

Required Drivers. The processes that reinforce L3 behavior change between training and 6 months post. Examples: a manager 1:1 in week 4 that explicitly asks about applying the new skill, a peer accountability pairing, a Slack channel where cohort members share weekly wins. The Kirkpatricks argue that the bridge between L2 and L3 is where most training programs fail, not in the training itself. The Spring 2026 Communication Skills cohort's mentor sessions (β=0.42 in the multivariate model) are an example of a Required Driver delivering measurable L3 lift.

Model | Levels or stages | Key idea | Best for | Where it falls short
Kirkpatrick (1959 original · 2016 New World) | 4 levels: Reaction, Learning, Behavior, Results | Progressively deeper effects of training, each level a different stakeholder question | Anchoring any training evaluation program; clearest stakeholder mapping | L3 and L4 are expensive without persistent IDs and cross-system data
Phillips ROI (1996) | 5 levels: Kirkpatrick's 4 plus Level 5 ROI | Monetize the Level 4 outcome, divide by program cost, report as a financial ratio | Justifying training spend to finance and the board; ROI calculations | Isolating training's contribution from other variables is methodologically difficult
CIRO (Warr, Bird, Rackham 1970) | 4 stages: Context, Input, Reaction, Outcome | Needs analysis first (Context), then resource fit (Input), then reactions, then outcome | Program design choices before training begins, especially HRD contexts | Less commonly understood by external stakeholders; lacks ROI calculation
Brinkerhoff (Success Case Method, 2003) | Not levels; narrative cases of high and low performers | Identify the top and bottom ~5% of participants, interview both, look for the design and manager behaviors that drive the gap | Surfacing which design choices and manager behaviors drive outcomes | Qualitative-heavy; hard to aggregate across large cohorts without AI extraction

The Spring 2026 Communication Skills cohort used Kirkpatrick as the spine with one Brinkerhoff overlay: Marcus Thompson (peer 7.8, +34 confidence lift) and Aisha K. (LMS 12/12 modules, +6 lift) were treated as the high-performer and low-mover cases. Their archetypes informed the Summer 2026 design changes. The cohort did not run Phillips L5 ROI because the business sponsor's question was operational, not financial. CIRO would have been overkill for a single-cohort program of 24 participants.

Frequently asked

Kirkpatrick model questions, answered

What is the Kirkpatrick model?

The Kirkpatrick model is a four-level framework for evaluating training programs, developed by Donald Kirkpatrick in 1959. The four levels measure progressively deeper effects of training: Level 1 Reaction (how participants felt), Level 2 Learning (what they learned), Level 3 Behavior (whether they apply it on the job), and Level 4 Results (whether the organization benefited). It remains the most widely used training evaluation model worldwide in 2026.

What are the four levels of the Kirkpatrick model?

Level 1 Reaction measures how participants felt about the training. Level 2 Learning measures what knowledge, skills, or attitudes they acquired, typically via Pre to Post score deltas. Level 3 Behavior measures whether they apply the learning on the job, usually via 360 feedback or peer-rated effectiveness. Level 4 Results measures whether the training produced the targeted business or operational outcome. The Phillips ROI Methodology adds a fifth level for financial return.

What does Level 1 Reaction measure in the Kirkpatrick model?

Level 1 measures how participants felt about the training: engagement, perceived relevance to their job, confidence in the instructor or content, and willingness to recommend the program. Traditional Level 1 measurement is the end-of-session smile sheet. AI-native Level 1 captures sentiment continuously from one open-ended question per measurement point, flagging risk responses in real time so a participant can be re-engaged in week 7 rather than counted as a failure at Post.

What does Level 2 Learning measure in the Kirkpatrick model?

Level 2 measures what participants learned during the training: knowledge, skills, attitudes, confidence, or commitment. The most common Level 2 instrument is a Pre to Post score delta on a target competency, expressed as a single number (such as +24 points on a 0 to 100 confidence scale). A skills radar chart with Pre and Mid overlaid reveals which sub-competencies have moved and which need additional work mid-cycle.

What does Level 3 Behavior measure in the Kirkpatrick model?

Level 3 measures whether participants apply the learning on the job. Traditional Level 3 instruments are 360 feedback surveys 3 to 6 months after training, which are expensive and slow, and most programs skip them entirely. AI-native Level 3 combines peer-rated effectiveness from cohort members at Post, real-world application counts logged during the program (such as speaking events or sales calls), and LMS application activity all on the same persistent participant record.

What does Level 4 Results measure in the Kirkpatrick model?

Level 4 measures whether the training produced the targeted business or operational outcome. Examples include cohort distribution shift from 100% Low confidence at Pre to 70% High at Post, 4 of 4 risk flags cleared, or peer-rated effectiveness lifting by more than 1.0 points. Isolating training's contribution from other variables is methodologically difficult, which is why multivariate regression with standardized beta coefficients is the AI-native approach to defensible Level 4 attribution.

What is the difference between Kirkpatrick and Phillips ROI models?

The Phillips ROI Methodology adds a fifth level on top of Kirkpatrick's four: Level 5 Return on Investment, expressed as a financial ratio of training benefits to training costs. Kirkpatrick stops at Level 4 Results, which captures operational outcomes. Phillips's L5 requires monetizing the Level 4 result and dividing by total program cost. Most enterprise L&D programs use Kirkpatrick as the spine and add Phillips L5 only when a CFO asks for ROI specifically.

What is the New World Kirkpatrick Model?

The New World Kirkpatrick Model, published by Jim and Wendy Kirkpatrick in 2016, is an update to Donald Kirkpatrick's original four levels. The major change is the emphasis on backward planning: start at Level 4 by defining the business result, then design Level 3 Required Drivers, then Level 2 Learning, then Level 1 Reaction. The model also introduces the concept of Required Drivers, which are the systems and processes that reinforce on-the-job behavior change as the bridge between Level 2 and Level 3.

How do you apply the Kirkpatrick model in 2026?

Define the Level 4 business result first, with the business sponsor. Design Level 1, 2, and 3 measurement instruments that map to it. Use persistent participant IDs so all Pre, Mid, Post, peer, audio, and LMS data lands on one record. Capture open-ended responses at each measurement point for AI extraction of L1 sentiment and L2 evidence. Conduct a Mid-cycle structured interview for L2 plus L3 mid-signal. Run multivariate regression on the final dataset for defensible L4 attribution.
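A minimal sketch of that measurement design expressed as data, mapping each measurement point to the Kirkpatrick levels it feeds; the instrument names are illustrative, following the three-question pattern described earlier.

```python
# Measurement plan: which instrument feeds which Kirkpatrick level.
MEASUREMENT_PLAN = {
    "pre":  {"instruments": ["confidence_0_100", "presented_last_30d", "open_worry"],
             "levels": ["L1", "L2"]},
    "mid":  {"instruments": ["structured_interview_45min"],
             "levels": ["L2", "L3"]},
    "post": {"instruments": ["confidence_0_100", "peer_rating_x6", "audio_reflection"],
             "levels": ["L2", "L3", "L4 input"]},
}
```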

Who created the Kirkpatrick model?

Donald Kirkpatrick created the four-level training evaluation model as part of his 1954 PhD dissertation at the University of Wisconsin. He published it as a series of four articles in 1959 in the Training and Development Journal, then expanded it in his 1994 book Evaluating Training Programs: The Four Levels. His son Jim Kirkpatrick and daughter-in-law Wendy Kayser Kirkpatrick continue the work today through Kirkpatrick Partners, including the 2016 New World Kirkpatrick Model update.

Go deeper

The full Kirkpatrick implementation playbook

From backward planning at Level 4 to multivariate attribution. Frameworks, sample instruments, report templates, and how persistent IDs make all four levels measurable from one record.

Read the stakeholder intelligence guide →
Get started

Capture all four Kirkpatrick levels from one persistent record

Walk one of your past training cohorts through the L1 to L4 measurement design above. See the reports your CFO would actually accept. 30 minutes with a Sopact specialist.