
The Kirkpatrick Model evaluates training across four levels: Reaction, Learning, Behavior, and Results. Learn why only 35% measure Level 4 — and how AI fixes the gap.
The Kirkpatrick Model is the world's most widely used framework for evaluating training effectiveness, breaking assessment into four levels: Reaction, Learning, Behavior, and Results. Developed by Donald Kirkpatrick in the 1950s, the model helps organizations move beyond satisfaction surveys to measure whether training investments actually change performance and deliver business outcomes. While nearly 90% of organizations evaluate Level 1 (Reaction), only 35% consistently measure Level 4 (Results) — revealing a measurement gap that costs organizations millions in unvalidated training spend each year.
This guide explains each level of the Kirkpatrick Model in practical terms, shows why most organizations get stuck at Levels 1 and 2, and demonstrates how modern data architecture finally makes Level 3 and Level 4 measurement operationally feasible — not just theoretically desirable.
The Kirkpatrick Model is a four-level evaluation framework designed to assess the effectiveness of training programs by measuring progressively deeper indicators of impact. Each level builds on the one before it, moving from immediate participant reactions to long-term organizational results.
Donald Kirkpatrick, a professor at the University of Wisconsin, first developed the framework for his doctoral research in the 1950s. He published the model formally in 1959 through a series of journal articles, and it became the dominant evaluation framework in learning and development over the following decades. His son James Kirkpatrick and daughter-in-law Wendy Kayser Kirkpatrick later evolved the model into the "New World Kirkpatrick Model," emphasizing the importance of starting evaluation planning at Level 4 and working backward.
The four levels are sequential in measurement but should be designed in reverse — starting with desired results and working backward to reaction. Here is what each level measures:
Level 1: Reaction measures how participants respond to the training experience. Did they find it engaging, relevant, and valuable? This is typically assessed through post-training surveys — often called "smile sheets." Roughly 80–90% of training events include Level 1 evaluation, making it by far the most common form of training assessment.
Level 2: Learning measures the degree to which participants acquired the intended knowledge, skills, and attitudes. This involves pre-test and post-test comparisons, skills demonstrations, or competency assessments. About 83% of organizations measure at this level.
Level 3: Behavior measures whether participants apply what they learned when they return to their work environment. This is where training evaluation becomes genuinely difficult — it requires observation, manager feedback, and follow-up measurement three to six months after training. Only about 60% of organizations evaluate at this level, and many of those do so inconsistently.
Level 4: Results measures the degree to which targeted organizational outcomes occur as a result of training and subsequent on-the-job application. This includes metrics like reduced costs, improved productivity, higher retention, increased sales, and safety improvements. Only about 35% of organizations consistently measure at Level 4.
The gap between Level 2 and Level 4 measurement is the central challenge of training evaluation. Organizations know how to ask whether participants liked training and whether they learned something. They struggle to prove whether training changed behavior and delivered business results. This is not a failure of the Kirkpatrick Model itself — it is an infrastructure problem.
The Kirkpatrick Model's four levels are simple to understand but operationally difficult to execute beyond Level 2. Research consistently shows a dramatic drop-off in evaluation rigor as organizations move up the levels. According to ATD research, approximately 90% of organizations implement Level 1 evaluation and 83% measure Level 2 — but only 35% consistently evaluate Level 4 business results.
This is not because L&D professionals don't understand the model. In fact, 80% of training professionals say evaluating training results is important to their organization. The problem is structural: the data infrastructure required for Levels 3 and 4 simply doesn't exist in most organizations.
Level 1 evaluation dominates because it is easy. A post-training survey takes five minutes to administer and produces instant data. But reaction data has a well-documented weakness: participant satisfaction does not reliably predict learning transfer or behavioral change. A training program can receive excellent satisfaction scores while producing zero measurable impact on job performance. Conversely, challenging, uncomfortable training experiences sometimes produce the strongest behavioral changes.
When organizations rely primarily on Level 1 data, they optimize for the wrong outcomes. Training programs evolve to be more "enjoyable" rather than more effective. The smile sheet becomes the goal rather than the diagnostic tool it was intended to be.
Level 2 evaluation — measuring knowledge and skill acquisition — is more rigorous but still operates in a controlled environment. Pre-test and post-test comparisons can show that learners gained knowledge during a training session. But acquiring knowledge in a classroom is fundamentally different from applying that knowledge on the job under real-world conditions.
The gap between knowing and doing is where most training investments fail to translate into organizational value. And it is precisely this gap — the transition from Level 2 to Level 3 — where traditional measurement infrastructure breaks down.
Level 3 (Behavior) requires tracking whether individuals actually change their on-the-job behavior after training. This demands several capabilities that most organizations lack:
First, it requires longitudinal tracking — measuring the same individuals over time, not just at the point of training completion. Behavioral change emerges over weeks and months, not hours.
Second, it requires multi-source data collection — manager observations, peer feedback, self-assessments, and ideally objective performance metrics, all linked to the same individual who completed training.
Third, it requires persistent participant identity — the ability to connect a person's training completion record with their subsequent performance data, survey responses, and behavioral observations across different systems and time periods.
Most organizations collect training data in an LMS, performance data in an HRIS, manager feedback in a separate survey tool, and business results in yet another system. These systems rarely share participant identifiers. Without connected data, Level 3 evaluation becomes a manual research project rather than an operational process.
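The linking problem above can be made concrete with a minimal sketch. The records, field names, and IDs here are hypothetical illustrations, not any platform's actual schema; the point is only that a shared `participant_id` is what lets records from disconnected systems merge into one view.

```python
# Hypothetical records from three disconnected systems. All field names
# (participant_id, retention_12m, etc.) are illustrative assumptions.
lms_records = [
    {"participant_id": "P-001", "course": "Leadership 101", "completed": True},
    {"participant_id": "P-002", "course": "Leadership 101", "completed": True},
]
hris_records = [
    {"participant_id": "P-001", "retention_12m": True, "team_engagement": 4.1},
]
survey_records = [
    {"participant_id": "P-001", "manager_observed_behaviors": 3},
]

def link_by_participant(*sources):
    """Merge records from multiple systems into one combined record
    per participant, keyed on the shared identifier."""
    linked = {}
    for source in sources:
        for record in source:
            pid = record["participant_id"]
            linked.setdefault(pid, {}).update(record)
    return linked

linked = link_by_participant(lms_records, hris_records, survey_records)
# P-002 completed training but has no Level 3/4 data yet -- exactly the
# gap that appears when systems lack a shared identifier.
```

Without the shared key, each of these merges becomes a manual reconciliation task, which is why Level 3 evaluation turns into a research project in most organizations.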
Level 4 (Results) is often described as the most difficult level to measure, but Kirkpatrick Partners argues this is a misconception. The real challenge is not measuring business results — those metrics usually already exist somewhere in the organization. The challenge is connecting training activities to those results with sufficient confidence to make decisions.
Business outcomes are influenced by many factors beyond training: market conditions, management quality, organizational culture, technology changes, and dozens of other variables. Isolating the contribution of a specific training program requires either controlled comparison groups or sophisticated analytical approaches that account for confounding variables.
The New World Kirkpatrick Model addresses this by introducing Return on Expectations (ROE) — defined by key stakeholders before training begins — and Contributive ROI (cROI), which acknowledges that training contributes to results rather than causing them in isolation. These are more realistic approaches than traditional ROI calculations, but they still require connected data infrastructure to execute.
The modern approach to applying the Kirkpatrick Model follows the "reverse" design principle championed by the New World Kirkpatrick Model: start at Level 4 and work backward. Define desired business results first, then identify the behaviors that drive those results, then design learning that builds those behaviors, and finally create an experience that engages participants.
Before designing any training program, answer these questions: What organizational outcome are we trying to improve? How will we know if it improved? What data already exists to measure it?
For a sales training program, the Level 4 metric might be average deal size or win rate. For leadership development, it might be employee retention on that leader's team. For safety training, it might be incident rates. For workforce development programs, it might be employment outcomes or wage increases.
Identify both leading indicators (early signals that change is happening) and lagging indicators (the ultimate business outcomes). Most Level 4 data already exists somewhere in the organization — in CRM systems, HRIS platforms, financial reporting, or operational dashboards. The evaluation plan should specify where this data lives and how it will be accessed.
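An evaluation plan of this kind can be captured as a simple data structure before training begins. The metrics, systems, and cadences below are illustrative assumptions for a sales training program, not a prescribed schema.

```python
# Illustrative Level 4 evaluation plan: each metric names where the
# data already lives and how often it will be pulled. All values here
# are assumptions for the sake of the sketch.
evaluation_plan = {
    "lagging_indicators": [
        {"metric": "win_rate", "source": "CRM", "cadence": "quarterly"},
        {"metric": "avg_deal_size", "source": "CRM", "cadence": "quarterly"},
    ],
    "leading_indicators": [
        {"metric": "discovery_calls_per_week", "source": "CRM", "cadence": "weekly"},
    ],
}

def data_sources(plan):
    """List every system the plan depends on, so access can be arranged
    before training starts rather than at reporting time."""
    return sorted({m["source"] for group in plan.values() for m in group})
```

Enumerating the source systems up front surfaces access problems (permissions, missing identifiers) while there is still time to solve them.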
With Level 4 outcomes defined, identify the three to five critical behaviors that, if performed consistently, would drive those results. These should be observable, measurable actions — not abstract qualities.
For example, if the Level 4 goal is improved customer satisfaction scores, the Level 3 critical behaviors might include: using active listening techniques during customer calls, following the prescribed issue resolution workflow, and proactively following up within 24 hours of issue resolution.
Level 3 measurement requires a plan for how and when these behaviors will be observed. Options include manager observation checklists, self-reporting surveys administered at intervals (30, 60, 90 days post-training), peer feedback, and automated tracking through business systems where possible. The key is that each of these data collection points must be linked to the individual participant. For organizations running training evaluation programs across multiple cohorts, this means persistent participant IDs that connect training records to behavioral observations.
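The interval schedule described above is easy to automate once each participant's completion date is recorded. This is a minimal sketch; the 30/60/90-day offsets come from the text, and the function name is a hypothetical.

```python
from datetime import date, timedelta

def level3_schedule(completion_date, offsets_days=(30, 60, 90)):
    """Return follow-up measurement dates anchored to each participant's
    own training completion date, per the 30/60/90-day cadence."""
    return [completion_date + timedelta(days=d) for d in offsets_days]

checks = level3_schedule(date(2026, 3, 3))
# First behavioral check lands 30 days after completion: 2026-04-02
```

Anchoring the schedule per participant, rather than per cohort, matters when enrollment is rolling: everyone is measured at the same point in their own transfer curve.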
With critical behaviors defined, determine what knowledge, skills, and attitudes learners need to perform those behaviors. Design assessments that measure acquisition of these specific capabilities — not general satisfaction.
Effective Level 2 assessment uses pre-test and post-test designs, skills demonstrations evaluated against rubrics, scenario-based assessments that require application of knowledge, and confidence and commitment checks that predict transfer likelihood.
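One common way to score pre-test/post-test designs is Hake's normalized gain, which measures the fraction of the possible improvement actually achieved, so learners who start near the ceiling are not penalized. The model does not mandate this particular statistic; it is offered here as one standard option.

```python
def normalized_gain(pre, post, max_score):
    """Hake's normalized gain: improvement achieved as a fraction of
    the improvement that was possible given the pre-test score."""
    if max_score == pre:
        return 0.0  # no headroom left to improve
    return (post - pre) / (max_score - pre)

normalized_gain(40, 70, 100)  # 0.5: half the possible gain realized
```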
With learning objectives defined, design the experience to be relevant, engaging, and practical. Level 1 evaluation should focus on three dimensions: relevance ("Will I use this?"), engagement ("Did this hold my attention?"), and satisfaction ("Was this a good use of my time?").
The New World Kirkpatrick Model recommends formative Level 1 evaluation — pulse checks during training, not just end-of-course surveys — so that facilitators can course-correct in real time rather than discovering problems after the fact.
The Kirkpatrick Model was developed in an era when training evaluation was a research exercise — a periodic study conducted after the fact to justify training budgets. The model itself is sound. The problem is that most organizations implement it using tools and processes designed for a fundamentally different purpose.
In the traditional approach, training evaluation follows this pattern: deploy a course, collect smile sheets, run a knowledge assessment, wait six months, send a follow-up survey, manually compile data in spreadsheets, produce an annual report. By the time Level 3 and Level 4 data is available, the training program has already been running for months — and the insights arrive too late to improve the current cohort's outcomes.
This batch evaluation model has several structural flaws. Data is collected in disconnected systems — LMS, survey tools, HRIS, business intelligence platforms — with no shared participant identity. Follow-up surveys achieve low response rates because they are manual and disconnected from participants' regular workflow. Analysis requires dedicated evaluation staff who spend weeks or months reconciling data from different sources. Results are delivered as static reports that describe the past rather than informing current decisions.
AI-native data architecture makes a fundamentally different approach possible. Instead of treating evaluation as a post-hoc research project, modern platforms can embed measurement into the training lifecycle from the point of enrollment.
The key architectural difference is persistent unique participant IDs. When a participant enrolls in training, they receive an identifier that follows them through every subsequent touchpoint: pre-training assessment, training completion, post-training knowledge checks, 30-day behavioral surveys, 90-day manager observations, and performance data from business systems. All of this data is automatically linked to a single longitudinal record.
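A longitudinal record of this kind is, at its simplest, an ordered event history keyed by the persistent ID. The event names and fields below are illustrative assumptions, sketching the shape of the data rather than any product's implementation.

```python
from datetime import date

# Minimal sketch of a longitudinal participant record: every touchpoint
# is appended under one persistent ID. Event names are illustrative.
timeline = {}

def record_event(pid, when, event, **data):
    """Append a dated touchpoint to the participant's ordered history."""
    timeline.setdefault(pid, []).append({"date": when, "event": event, **data})

record_event("P-001", date(2026, 1, 10), "pre_assessment", score=42)
record_event("P-001", date(2026, 1, 24), "training_completed")
record_event("P-001", date(2026, 2, 23), "30_day_behavior_survey", behaviors_applied=3)
# One ID, one ordered history -- the join key every later analysis relies on.
```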
With connected data architecture, training evaluation stops being an annual reporting exercise and becomes continuous intelligence. L&D teams can see which training programs are producing behavioral change in real time — not six months after the fact. They can identify which learner segments are struggling with transfer and intervene before the training investment is lost. They can correlate training program variations with Level 3 and Level 4 outcomes to continuously improve program design.
Platforms like Sopact Sense are built on this architectural principle. Instead of bolting evaluation tools onto existing LMS systems, they provide the data infrastructure layer that makes connected, longitudinal measurement operational. The Intelligent Suite — Intelligent Cell for individual assessment analysis, Intelligent Row for participant-level tracking, Intelligent Column for theme extraction from qualitative feedback, and Intelligent Grid for cross-cohort comparison — provides the analysis capabilities that transform raw evaluation data into actionable training intelligence.
This is not about replacing the Kirkpatrick Model. It is about giving organizations the infrastructure to actually implement all four levels rather than getting stuck at Level 1 and Level 2.
The Kirkpatrick Model applies across virtually every type of organizational training. Here is how each level manifests in common training scenarios.
Leadership development. Level 1: Participants rate the relevance and quality of leadership workshops and coaching sessions. Level 2: 360-degree assessments measuring leadership competency before and after the program. Level 3: Managers demonstrate specific leadership behaviors — conducting regular one-on-ones, providing structured feedback, delegating effectively — measured through team surveys and behavioral observation. Level 4: Retention rates on participating leaders' teams, engagement scores, and promotion readiness pipeline metrics.
Sales training. Level 1: Sales representatives rate the training's relevance to their daily challenges. Level 2: Skills assessments measuring product knowledge and objection handling capability. Level 3: CRM data showing whether reps are following the prescribed sales methodology — discovery call frameworks, proposal structures, follow-up cadences. Level 4: Win rates, average deal size, time to close, and revenue per representative.
Compliance training. Level 1: Employees rate clarity and relevance of compliance content. Level 2: Knowledge assessments confirming understanding of policies and regulations. Level 3: Audit results showing adherence to compliance procedures in daily operations. Level 4: Reduction in compliance violations, regulatory fines, and associated legal costs.
Workforce development. Level 1: Participants rate training quality, instructor effectiveness, and relevance to career goals. Level 2: Pre-test and post-test comparisons measuring technical skill acquisition and confidence growth. Level 3: Employment outcomes, on-the-job application of skills, and employer satisfaction surveys collected 90+ days after program completion. Level 4: Wage increases, job retention rates, career advancement, and program-wide employment placement rates.
For organizations managing training across multiple cohorts and programs, the challenge is not evaluating any single program — it is maintaining consistent evaluation infrastructure across all programs simultaneously. This requires training effectiveness measurement systems that automate data collection, link participant records longitudinally, and generate cross-program comparisons without manual data reconciliation.
The Kirkpatrick Model's longevity — over seven decades — means that significant institutional knowledge has accumulated about how organizations misapply it. Here are the most consequential errors.
The original model was published as a sequence (Level 1 → 2 → 3 → 4), which led many practitioners to treat it as a linear progression. In practice, all four levels should be planned simultaneously, with Level 4 outcomes defined first. Organizations that design training without defining desired business results upfront have no anchor for their evaluation efforts.
When business metrics improve after training, it is tempting to attribute the improvement entirely to the training program. But business outcomes are influenced by many factors. The Kirkpatrick Model works best when paired with control groups where feasible, or when Level 3 behavioral data supplies the causal chain connecting training to results.
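When a comparison group is available, one standard way to estimate training's contribution is a difference-in-differences calculation: the trained group's change minus the untrained group's change over the same period, which nets out shared factors like market conditions. This is a general analytical technique, not something the Kirkpatrick Model itself prescribes, and the numbers below are invented for illustration.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences estimate: the trained group's change
    minus the comparison group's change over the same period."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical: trained reps' win rate rose from 22% to 31%, while
# untrained reps' rose from 22% to 25% over the same quarter.
effect = diff_in_diff(0.22, 0.31, 0.22, 0.25)
# Roughly 6 percentage points of the improvement is attributable to training.
```

This assumes the two groups would have trended alike without training; violations of that assumption are exactly the confounders the paragraph above warns about.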
Behavioral change does not happen immediately. Assessing behavior one week after training is too early — learners are still in the enthusiasm phase. Waiting twelve months is too late — you have lost the ability to intervene. The optimal window for initial Level 3 measurement is 30–90 days post-training, with follow-up checks at 6 and 12 months.
When Level 1 data lives in one survey tool, Level 2 data in an LMS, Level 3 data in a separate observation platform, and Level 4 data in business intelligence systems, no one can connect the story across levels. Integrated data architecture with shared participant identifiers is essential for the model to work as intended.
The New World Kirkpatrick Model explicitly acknowledges that the work environment affects transfer. Even excellent training fails if the organizational culture, management practices, or available tools do not support the desired behaviors. Level 3 evaluation should assess environmental barriers alongside behavioral change.
While the Kirkpatrick Model is the most widely used training evaluation framework, several alternatives and extensions exist. Understanding how they relate helps practitioners choose the right approach.
Jack Phillips extended the Kirkpatrick framework by adding a fifth level: Return on Investment. Phillips' Level 5 converts Level 4 results into monetary values and compares them to program costs. This is useful when financial justification is the primary goal but requires significant analytical rigor to isolate training's financial contribution.
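Phillips' Level 5 calculation is a simple formula: net program benefits as a percentage of fully loaded program costs. The hard part, as the text notes, is not the arithmetic but isolating and monetizing the benefits; the figures below are invented for illustration.

```python
def phillips_roi_percent(monetary_benefits, program_costs):
    """Phillips' Level 5 ROI: net benefits as a percentage of fully
    loaded program costs."""
    return (monetary_benefits - program_costs) / program_costs * 100

# Hypothetical: $150k in monetized benefits against $100k in costs.
phillips_roi_percent(150_000, 100_000)  # 50.0% return
```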
Robert Brinkerhoff's Success Case Method focuses on identifying the most and least successful participants and understanding what made the difference. Rather than measuring average outcomes, it finds extreme cases and investigates the factors that enabled or prevented success. This is particularly useful for understanding why training transfers for some learners and not others.
The CIPP model (Context, Input, Process, Product) provides a broader evaluation framework that encompasses needs assessment and program design — areas the Kirkpatrick Model addresses less directly. CIPP is more commonly used in educational evaluation than corporate training.
Roger Kaufman extended evaluation beyond organizational results to societal impact — asking whether the training ultimately contributes to broader societal value. This is particularly relevant for nonprofit training programs and workforce development where the goal extends beyond organizational performance to community-level outcomes.
Each of these frameworks has merit. The Kirkpatrick Model's advantage is its simplicity, flexibility, and universal recognition. Most practitioners benefit from using Kirkpatrick as the foundation and borrowing elements from other frameworks as needed.
Artificial intelligence does not replace the Kirkpatrick Model — it makes the higher levels operationally feasible for the first time. Here is how AI transforms measurement at each level.
Traditional Level 1 evaluation relies on Likert-scale ratings that produce aggregate scores but miss nuance. AI-powered analysis of open-ended feedback responses extracts themes, sentiment patterns, and specific actionable insights that structured ratings cannot capture. Instead of knowing that a training program received a 4.2/5.0 satisfaction score, L&D teams can understand that participants found the content relevant but felt the pace was too fast during the technical sections, and that first-time learners had significantly different reactions than experienced practitioners.
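To make the contrast with aggregate scores concrete, here is a deliberately toy keyword tagger standing in for the kind of theme extraction the text attributes to AI. Real systems use language models rather than keyword lists; the themes and keywords here are assumptions chosen only to show the shape of the output: per-comment themes instead of a single average rating.

```python
# Toy illustration only -- not how production AI analysis works.
# Themes and their trigger keywords are invented for this sketch.
THEMES = {
    "pacing": ["too fast", "rushed", "slow"],
    "relevance": ["relevant", "useful", "applies to my job"],
}

def tag_themes(comment):
    """Return the list of themes whose keywords appear in the comment."""
    text = comment.lower()
    return [theme for theme, keys in THEMES.items()
            if any(k in text for k in keys)]

tag_themes("Very relevant, but the technical sections felt rushed")
# -> ['pacing', 'relevance']
```

Even this crude version illustrates the payoff: a 4.2/5.0 average collapses all comments into one number, while theme tagging preserves what participants actually said.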
At Level 2, AI enables adaptive assessment that adjusts difficulty based on learner responses, provides more precise measurement of knowledge and skill levels, and identifies specific knowledge gaps rather than just pass/fail outcomes. AI can also analyze qualitative assessment responses — written explanations, scenario analyses, and project submissions — that traditional automated scoring cannot evaluate.
This is where AI makes the biggest difference. Level 3 has historically been the bottleneck because behavioral observation at scale requires enormous human effort. AI can analyze qualitative data from multiple sources — manager feedback, peer observations, self-reflections, and open-ended survey responses — to identify behavioral patterns and transfer indicators automatically. When combined with persistent participant IDs, this analysis can be correlated directly with training program data to understand which program elements are driving behavioral change.
At Level 4, AI enables continuous correlation analysis between training activities and business outcomes. Rather than waiting for an annual evaluation study, organizations can monitor leading indicators of Level 4 impact in real time, identify early signals when programs are not delivering expected results, and adjust program design based on data rather than intuition.
The infrastructure requirement for AI-powered evaluation is the same as for traditional evaluation — just more critical: connected, clean, longitudinal data with persistent participant identifiers. AI amplifies the value of good data architecture. It cannot compensate for fragmented, disconnected data systems.
The Kirkpatrick Model is the world's most widely used framework for evaluating training effectiveness. Developed by Donald Kirkpatrick in the 1950s, it breaks training evaluation into four progressive levels — Reaction (did participants like the training?), Learning (did they learn the intended knowledge and skills?), Behavior (did they apply what they learned on the job?), and Results (did the training produce the desired business outcomes?). The model is used across corporate, government, military, and nonprofit sectors globally.
Level 1 (Reaction) measures participant satisfaction and perceived relevance. Level 2 (Learning) measures knowledge and skill acquisition through pre-test and post-test assessments. Level 3 (Behavior) measures whether participants apply new skills on the job, typically assessed 30–90 days after training. Level 4 (Results) measures the business impact of training, including metrics like productivity improvements, cost reductions, retention rates, and revenue growth.
According to ATD research, approximately 90% of organizations evaluate Level 1 and 83% evaluate Level 2, but only 35% consistently measure Level 4 business results. The primary barriers are disconnected data systems that cannot link training records to behavioral and performance data, the time and cost of manual follow-up data collection, difficulty isolating training's contribution from other factors that influence business outcomes, and lack of persistent participant identifiers that connect data across systems and time periods.
The New World Kirkpatrick Model, developed by Jim and Wendy Kirkpatrick, introduces several updates. It emphasizes planning evaluation from Level 4 backward (start with desired results, not participant reactions). It introduces Return on Expectations (ROE) as a more practical alternative to traditional ROI calculations. It acknowledges the performance environment as a critical factor in training transfer. And it positions evaluation as a continuous improvement process rather than a one-time post-training report.
Jack Phillips extended the Kirkpatrick framework by adding a fifth level focused on financial Return on Investment. While the Kirkpatrick Model's Level 4 measures business results, Phillips' Level 5 converts those results into monetary values and compares them to program costs. The Kirkpatrick approach uses Return on Expectations (ROE) and Contributive ROI (cROI) as alternatives that acknowledge training is one contributor to results, not the sole cause.
The recommended initial measurement window is 30–90 days after training, with follow-up assessments at 6 and 12 months. Measuring too early (within the first week) captures enthusiasm rather than sustained behavioral change. Measuring too late (beyond 12 months) makes it difficult to attribute changes to the training program and misses the window for corrective intervention.
The framework also extends beyond training. While originally designed for training evaluation, the Kirkpatrick framework has been applied to change management initiatives, coaching programs, leadership development, onboarding processes, and organizational development interventions. The 2026 update to the model by Vanessa Milara Alzate explicitly expanded its application beyond learning and development to enterprise performance intelligence more broadly.
Workforce development programs benefit from the full four-level approach: Level 1 assesses participant experience with training quality and relevance; Level 2 measures skill acquisition through pre/post assessments; Level 3 tracks employment outcomes, job application of skills, and employer satisfaction; and Level 4 measures program-wide outcomes like employment rates, wage increases, and career advancement. The key challenge is tracking participants longitudinally across program stages and into post-program employment, which requires persistent unique participant identifiers.
At minimum, you need: survey tools for Level 1 and follow-up data collection, assessment platforms for Level 2, observation or feedback tools for Level 3, and access to business performance data for Level 4. The critical requirement — and the one most organizations lack — is a data infrastructure layer that connects participant records across all four levels with persistent identifiers. Platforms like Sopact Sense provide this connected architecture, enabling longitudinal tracking from enrollment through outcome measurement.
The Kirkpatrick Model remains the global standard for training evaluation. Its longevity — over seven decades — reflects its fundamental soundness. The model itself is not outdated; the implementation infrastructure is what has historically been inadequate. Modern AI-native data platforms make the higher levels of the model operationally feasible for the first time, increasing the model's practical relevance rather than diminishing it.



