
Training Evaluation Survey Questions by Kirkpatrick Level

Training evaluation survey questions for every Kirkpatrick level. Pre and post examples, behavior-anchored prompts, and the question architecture funders accept.

Updated May 29, 2026
Question formats

Six question formats, side by side

A training evaluation question takes one of six formats. The format determines what the question can measure, when it should run, and the decision it can feed. The four-level pathway above tells you which level a question targets. This grid tells you which format to ask it in. Most evaluation surveys use only two of the six formats and miss what the other four would have caught.

FORMAT 01

Likert scale

L1 reaction

A 1-to-5 or 1-to-7 ordered scale capturing agreement, relevance, clarity, or confidence. The most-used format in training evaluation. Best paired with an open-ended item asking what produced the rating.

When to use: End-of-session reaction items, Level 1 instrument design, and pre/post confidence measures where the construct is attitudinal. The scale stays locked at 1-5 or 1-7 across every wave.

Three examples
  • How relevant was today's content to a case you are working on?
  • How well did the pace match the depth of the material?
  • How confident do you feel applying today's protocol in the next week?

Common mistake: Using a 1-5 at Pre and a 1-7 at Post. Scale drift turns the delta into a measurement artifact, not a measurement of change.
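
Scale drift is cheap to catch before a wave launches. The sketch below is a minimal check, assuming each wave's scale is recorded as a (minimum, maximum) pair. The `waves` dictionary and the function name are illustrative and do not come from any particular survey tool.

```python
from typing import Dict, Tuple

# Hypothetical wave metadata: wave name -> (scale minimum, scale maximum).
Scale = Tuple[int, int]


def find_scale_drift(waves: Dict[str, Scale]) -> Dict[str, Scale]:
    """Return the waves whose scale differs from the first wave's scale."""
    if not waves:
        return {}
    baseline = next(iter(waves.values()))  # first wave defines the locked scale
    return {name: scale for name, scale in waves.items() if scale != baseline}


locked = {"pre": (1, 5), "post": (1, 5), "day_30": (1, 5)}
drifted = {"pre": (1, 5), "post": (1, 7)}

print(find_scale_drift(locked))   # {}
print(find_scale_drift(drifted))  # {'post': (1, 7)}
```

An empty result means every wave shares the first wave's scale, so a Pre-to-Post delta compares like with like.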

FORMAT 02

Open-ended

Pairs with Likert

A free-text prompt placed immediately after a Likert rating, asking what produced the score. The reasoning behind every number lives here. Without it, ratings produce averages no one can interpret.

When to use: Every Likert item gets a paired open-end. Also: Pre-program barrier prompts, Post-program application stories, Level 3 barrier surfacing. AI extraction codes themes without a manual analyst.

Three examples
  • What one moment from today's session was clearest?
  • Describe a real case you expect to encounter where this protocol applies.
  • What barriers, if any, prevented you from applying the trained protocol?

Common mistake: The Orphan Open-end: collected, exported to CSV, never coded. Themes need rubric coding (manual or AI-assisted), not a wall of unread text.

FORMAT 03

Scenario item

L2 learning

A one-paragraph case followed by an action question, scored against a published rubric. The format that produces a real Level 2 measure. Tests applied understanding rather than memorized recall.

When to use: Paired Pre and Post knowledge measurement. Same scenario at Pre, same scenario at Post, same rubric. The delta per person is the Level 2 score. Use 4 to 8 scenarios covering the program's learning objectives.

Three examples
  • A participant arrives at intake disclosing unsafe housing. What is the first action you would take and why?
  • A case raises a child-welfare concern. What is your mandated-reporter obligation and timeline?
  • A case requires referral. Name the two systems you would consult to identify the right partner agency.

Common mistake: Replacing scenario items with recall items ("list the four steps"). Recall measures memorization. Scenario measures understanding. Level 2 belongs to scenario.
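
The per-person delta is simple to compute once Pre and Post scores share a participant ID. This is a minimal sketch, assuming rubric scores out of 4 stored in two dictionaries keyed by a hypothetical persistent ID. The IDs and scores are made up for illustration. Participants missing either wave are left out rather than imputed.

```python
from statistics import mean
from typing import Dict

# Hypothetical rubric scores (out of 4) keyed by a persistent participant ID.
pre_scores: Dict[str, int] = {"p01": 2, "p02": 1, "p03": 3}
post_scores: Dict[str, int] = {"p01": 4, "p02": 3, "p04": 4}


def paired_deltas(pre: Dict[str, int], post: Dict[str, int]) -> Dict[str, int]:
    """Post minus Pre, only for participants present in both waves."""
    return {pid: post[pid] - pre[pid] for pid in pre if pid in post}


deltas = paired_deltas(pre_scores, post_scores)
print(deltas)                       # {'p01': 2, 'p02': 2}
print(mean(list(deltas.values())))  # 2
```

Pairing on the ID is the design choice that matters here: p03 and p04 each answered only one wave, so neither contributes a delta, and the cohort average is not distorted by comparing different people.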

FORMAT 04

Anchored count

L3 behavior

A behavioral question anchored to a specific timeframe and a specific application moment: "In the past thirty days, how many times did you use [protocol] from training?" Produces comparable counts across respondents.

When to use: Level 3 behavior measurement at 30, 60, and 90 days post-training. Always paired with the application moment named in the end-of-training reaction question. Best paired with manager observation when the program has manager visibility.

Three examples
  • In the past thirty days, how many times did you use the housing-disclosure protocol from training?
  • In the past thirty days, how many cases involved a substance-use disclosure, and on how many did you apply the trained next-step sequence?
  • In the past thirty days, did any case meet the mandated-reporter threshold? If yes, how many?

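Common mistake: Self-rated frequency scales ("how often do you use the protocol?" on 1-5). A respondent's personality, not only their behavior, shapes the rating, so the same number means different things for different people. Counts do not have this drift; use counts where comparability matters.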
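
Two-part anchored counts (how many eligible cases, and on how many the protocol was applied) reduce to a rate that compares cleanly across respondents. The sketch below assumes each response is stored as a pair of counts. The field names and figures are hypothetical.

```python
from typing import List, NamedTuple, Optional


class AnchoredCount(NamedTuple):
    participant_id: str  # hypothetical persistent ID
    eligible_cases: int  # cases in the past 30 days where the protocol applied
    applied_cases: int   # cases where the respondent reports using it


def application_rate(row: AnchoredCount) -> Optional[float]:
    """Share of eligible cases where the protocol was applied; None if none were eligible."""
    if row.eligible_cases == 0:
        return None
    return row.applied_cases / row.eligible_cases


def pooled_rate(rows: List[AnchoredCount]) -> Optional[float]:
    """Cohort-level rate: total applied cases over total eligible cases."""
    eligible = sum(r.eligible_cases for r in rows)
    applied = sum(r.applied_cases for r in rows)
    return applied / eligible if eligible else None


responses = [
    AnchoredCount("p01", 8, 6),
    AnchoredCount("p02", 0, 0),
    AnchoredCount("p03", 4, 1),
]
print([application_rate(r) for r in responses])  # [0.75, None, 0.25]
print(pooled_rate(responses))                    # 7 / 12, about 0.58
```

Returning `None` for a respondent with no eligible cases keeps "had no opportunity" separate from "had opportunities and did not apply", which a zero would blur.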

FORMAT 05

Rubric-scored

L2 / L3

A free-form response (written or observed) scored against a multi-point rubric by an instructor, assessor, or trained observer. Common in clinical, safety, and credential training. The rubric is the instrument; calibrated rubrics produce reliable scores across observers.

When to use: Practical demos, observation of skills in controlled settings, scoring scenario-item responses, credential assessments. The rubric is locked across cohorts so year-over-year comparison holds.

Three examples
  • Score the participant's intake interview against the 4-point protocol-fidelity rubric.
  • Observe the participant lead a 5-minute case briefing; score against the 6-criterion communication rubric.
  • Score the participant's written care plan against the 8-item documentation rubric.

Common mistake: One observer per rubric. Single-rater rubrics drift over time and across raters. Use 2 trained observers for high-stakes assessments and check inter-rater agreement before averaging.
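
One common way to check agreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This page does not prescribe a statistic, so kappa is an assumption here, and the scores below are invented. Note that unweighted kappa treats rubric points as unordered categories; a weighted variant may suit ordinal rubrics better.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same responses."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    n = len(rater_a)
    # Observed agreement: share of responses given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores from two observers on a 4-point rubric, eight participants.
observer_1 = [4, 3, 2, 4, 1, 3, 2, 4]
observer_2 = [4, 3, 2, 3, 1, 3, 2, 4]

kappa = cohens_kappa(observer_1, observer_2)
print(round(kappa, 3))  # 0.83
```

Here the observers agree on 7 of 8 responses. What counts as acceptable agreement is a program decision to set before scoring begins, not after the numbers come in.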

FORMAT 06

Tied operational metric

L4 results

Not a survey question at all. A metric definition, a date range, and a comparison cohort, pulled from an operational system that existed before training. The defensible Level 4 instrument.

When to use: Level 4 results measurement at 90 days to 12 months. Workforce: placement rate, retention. Clinical: case-resolution time, patient outcomes. Sales: conversion rate, prescription lift. Defined before training begins so attribution is auditable.

Three examples
  • Cohort certification pass rate vs prior cohort, pulled from certifying-body records, 90 days post-program.
  • Job placement rate at 6 months post-program, pulled from program management system.
  • Compliance violations per 100 cases handled, pulled from compliance audit, quarterly.

Common mistake: Survey-proxy substitution: asking "rate your team's performance" instead of pulling the team's actual performance metric. Survey proxies cannot be audited; tied metrics can.
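
A tied metric is a definition plus two periods of counts, so the arithmetic is short. The sketch below works through a "violations per 100 cases" comparison. The counts are invented for illustration, and the class and field names are hypothetical rather than taken from any compliance system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodCounts:
    label: str
    violations: int
    cases_handled: int

    def per_100(self) -> float:
        """Violations per 100 cases handled in this period."""
        if self.cases_handled == 0:
            raise ValueError(f"No cases handled in {self.label}.")
        return 100 * self.violations / self.cases_handled


# Illustrative figures only: one quarter before training, one quarter after.
baseline = PeriodCounts("pre-training quarter", violations=18, cases_handled=600)
followup = PeriodCounts("post-training quarter", violations=9, cases_handled=450)

change = followup.per_100() - baseline.per_100()
print(baseline.per_100(), followup.per_100(), change)  # 3.0 2.0 -1.0
```

Normalizing by cases handled matters because the two quarters have different volumes; comparing the raw counts (18 vs 9) would overstate the drop.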

More program contexts

Five additional training contexts, 30 more questions

The workforce worked example above covers case-management certification. The same question architecture transfers to other training contexts with different scenarios and different operational metrics at Level 4. Below: six questions each for clinical training, pharma sales enablement, leadership and management development, compliance training, and technical or software training. Persistent participant identity, paired pre/post on Level 2, anchored Level 3 counts, and tied Level 4 metrics carry across every context.
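
The level-to-format pairing from the six-format grid can be checked mechanically when a question bank is stored as structured records. This is a minimal sketch under that assumption. The `QuestionItem` fields and the `X·99` item are hypothetical; the allowed pairings are read off the grid above.

```python
from dataclasses import dataclass
from typing import List

# Level-to-format pairing as described in the six-format grid.
ALLOWED_FORMATS = {
    "L1": {"Likert", "Open-ended"},
    "L2": {"Scenario", "Rubric-scored"},
    "L3": {"Anchored count", "Open-ended", "Rubric-scored"},
    "L4": {"Tied operational metric"},
}


@dataclass(frozen=True)
class QuestionItem:
    item_id: str   # e.g. "C1·02"
    level: str     # "L1" through "L4"
    fmt: str       # one of the six formats
    decision: str  # the decision the item feeds


def mismatched(items: List[QuestionItem]) -> List[str]:
    """IDs of items whose format is not one the grid pairs with their level."""
    return [i.item_id for i in items if i.fmt not in ALLOWED_FORMATS.get(i.level, set())]


bank = [
    QuestionItem("C1·02", "L1", "Open-ended", "facilitator content adjustment"),
    QuestionItem("C1·03", "L2", "Scenario", "protocol-fidelity reinforcement"),
    QuestionItem("C1·04", "L3", "Anchored count", "skills-lab refresher scheduling"),
    QuestionItem("C1·06", "L4", "Tied operational metric",
                 "protocol continuation and unit-by-unit coverage"),
    QuestionItem("X·99", "L4", "Likert", "hypothetical survey-proxy item"),
]
print(mismatched(bank))  # ['X·99']
```

The flagged item is the survey-proxy substitution in miniature: a Likert rating standing in where a tied operational metric belongs.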

CONTEXT 01 · 6 questions

Clinical & healthcare training

Hospital-based protocol training; nurses, residents, allied staff; 60-day cohort

C1·01 L1 REACTION "On a 1 to 5 scale, how relevant was today's bedside-rounding protocol training to a case you saw this week?"
Format: Likert · Decision: mid-cohort relevance review

C1·02 L1 OPEN "What one moment from today's simulation lab on the SEPSIS protocol was clearest?"
Format: Open-ended · Decision: facilitator content adjustment

C1·03 L2 LEARNING "A patient on a busy ward shows early signs of sepsis. Walk through the first three actions in the SEPSIS protocol, in order."
Format: Scenario, rubric-scored · Asked Pre and Post · Decision: protocol-fidelity reinforcement

C1·04 L3 BEHAVIOR "In the past 60 days, how many times have you used the new sterile-technique protocol on procedures where it was indicated?"
Format: Anchored count · 60-day post · Manager observation paired · Decision: skills-lab refresher scheduling

C1·05 L3 BARRIERS "What barriers, if any, prevented you from applying the new bedside-rounding protocol during the past 30 days?"
Format: Open-ended · 30-day post · Decision: workflow-barrier review

C1·06 L4 RESULTS Hospital-acquired-infection rate by unit, quarterly, vs trailing 4-quarter average.
Format: Tied operational metric · Quarterly · Decision: protocol continuation and unit-by-unit coverage

CONTEXT 02 · 6 questions

Pharma sales enablement

Product launch training; field medical reps; 4-week intensive plus 90-day follow-up

C2·01 L1 REACTION "On a 1 to 5 scale, how relevant was today's launch-product training to a customer conversation you had this week?"
Format: Likert · Decision: relevance-by-region adjustment

C2·02 L1 OPEN "What one moment from today's role-play was clearest about the new compliance boundary?"
Format: Open-ended · Decision: compliance-language coaching priorities

C2·03 L2 LEARNING "A KOL raises a question about off-label use of [product]. Walk through your compliant response, step by step."
Format: Scenario, rubric-scored against compliance criteria · Asked Pre and Post · Decision: compliance module depth

C2·04 L3 BEHAVIOR "In the past 30 days, how many customer conversations included the launch product's mechanism-of-action discussion?"
Format: Anchored count · 30-day post · Paired with CRM call notes · Decision: talking-point reinforcement

C2·05 L4 RESULTS Prescription lift for [product] in trained-rep territories vs control territories, quarterly.
Format: Tied operational metric · Quarterly · Decision: training continuation and rep coaching

C2·06 L4 COMPLIANCE Compliance-review flags per 100 detail calls, monthly, post-training vs pre-training baseline.
Format: Tied operational metric · Monthly · Decision: refresher cadence and compliance-coaching depth

CONTEXT 03 · 6 questions

Leadership & management development

New-manager cohort; 12-week program; 60 first-time managers with 3-7 direct reports each

C3·01 L1 REACTION "On a 1 to 5 scale, how relevant was today's feedback-conversation framework to a recent management situation?"
Format: Likert · Decision: framework reinforcement priority

C3·02 L2 LEARNING "A direct report misses two consecutive 1-on-1s without notice. Describe how you would open the next conversation, step by step."
Format: Scenario, rubric-scored · Asked Pre and Post · Decision: difficult-conversation module depth

C3·03 L3 BEHAVIOR "In the past 60 days, how many 1-on-1s have you held with each direct report? List by report."
Format: Anchored count · 60-day post · Decision: cadence coaching

C3·04 L3 APPLICATION "Walk through a recent feedback conversation where you applied the SBI model. What worked, what surprised you?"
Format: Open-ended · 60-day post · Decision: case-study material for next cohort

C3·05 L4 RESULTS Engagement-survey eNPS score for direct reports of trained managers vs untrained managers, 6 months post.
Format: Tied operational metric · 6 months · Decision: program continuation

C3·06 L4 RESULTS Voluntary attrition rate within direct reports of trained managers vs untrained, 12 months post.
Format: Tied operational metric · 12 months · Decision: program continuation and partner-team coverage

CONTEXT 04 · 6 questions

Compliance & regulatory training

GDPR data-handling refresher; 240 customer-service staff; 2-week cohort + 90-day follow-up

C4·01 L1 REACTION "On a 1 to 5 scale, how relevant was today's training on the new GDPR data-handling protocol to your daily workflow?"
Format: Likert · Decision: relevance-by-role adjustment

C4·02 L2 LEARNING "A customer requests deletion of their personal data. Walk through the steps you would take, in order, including timeframes."
Format: Scenario, rubric-scored against regulatory criteria · Asked Pre and Post · Decision: process-step reinforcement

C4·03 L3 BEHAVIOR "In the past 90 days, how many data-deletion requests have you processed? On how many did you meet the 30-day timeline?"
Format: Anchored count + count · 90-day post · Decision: process-bottleneck review

C4·04 L3 BARRIERS "In the past 90 days, have you received any data-handling request you were unsure how to process? If yes, how many?"
Format: Anchored count + open-ended · 90-day post · Decision: gap-coverage refresher

C4·05 L4 RESULTS Compliance-audit findings per 100 transactions, quarterly.
Format: Tied operational metric · Quarterly · Decision: refresher cadence

C4·06 L4 RESULTS Time-to-acknowledge for data-subject requests, monthly, post-training vs baseline.
Format: Tied operational metric · Monthly · Decision: process automation priorities

CONTEXT 05 · 6 questions

Technical & software training

Engineering org rolling out a new CI/CD pipeline; 120 engineers across 12 teams; 6-week training

C5·01 L1 REACTION "On a 1 to 5 scale, how relevant was today's introduction to the new CI/CD pipeline to your current sprint?"
Format: Likert · Decision: relevance-by-team adjustment

C5·02 L2 LEARNING "A failing build blocks deployment 30 minutes before a release window. Walk through your diagnostic steps using the new pipeline tools."
Format: Scenario, rubric-scored · Asked Pre and Post · Decision: troubleshooting-module depth

C5·03 L3 BEHAVIOR "In the past 30 days, how many pull requests have you merged using the new branch-protection workflow?"
Format: Anchored count · 30-day post · Paired with Git logs · Decision: adoption coaching

C5·04 L3 APPLICATION "Describe a moment in the past 30 days where the new pipeline tooling saved you time. What did you try, what worked?"
Format: Open-ended · 30-day post · Decision: case-study evidence for skeptical teams

C5·05 L4 RESULTS Mean time to recovery (MTTR) by team, monthly, post-training vs pre-training baseline.
Format: Tied operational metric · Monthly · Decision: training continuation and reinforcement priorities

C5·06 L4 RESULTS Deployment frequency by trained team vs control team, weekly.
Format: Tied operational metric · Weekly · Decision: rollout pace across remaining teams

Sample answers

What good training feedback answers look like

Most pages on training evaluation show the questions. This one shows what strong participant responses look like at each Kirkpatrick level and contrasts them with weak responses to the same question. The difference between a strong and weak response is rarely effort; it is structure. The analyst-action column below each pair shows what the program manager actually does with each kind of response. Programs that train participants in how to respond, not only what to respond to, raise the quality of their entire evidence base.

Level 1 · Reaction

End-of-session open-ended responses

Question

"What one moment from today's session was clearest?"

Strong response

The walkthrough of the housing-disclosure scenario in hour two clicked because I worked a similar case last week and felt unsure mid-conversation. Hearing the specific phrasing for opening the consent question made it concrete.

Why it works. Names a specific moment (hour two), connects to a real case the participant remembers, identifies what the participant will now do differently (specific phrasing for consent).

Weak response

Great session, learned a lot.

Why it does not. Pleasant. Generic. No specific moment, no application, no signal for the facilitator. The rating attached to this is uninterpretable.

What the analyst does: The strong response feeds two queues: facilitator-feedback (hour-two content earned its place) and case-study material (the participant's housing case becomes anonymized teaching material for the next cohort). The weak response gets coded as "no actionable content" and adds nothing to either queue.

Question

"What one moment from today's session was least clear?"

Strong response

The documentation requirements for mandated-reporter cases in hour three. The slide said "within 24 hours" but the example walked through what looked like a 72-hour window. I left unsure which applies and would want a sentence clarifying the conditions for each.

Why it works. Points to the contradiction between slide and example, identifies the precise ambiguity, and proposes a specific fix that takes the facilitator under a minute to implement.

Weak response

The documentation part was confusing.

Why it does not. Confirms there is a problem but provides nothing the curriculum designer can act on. Translates to: "look at the documentation module" without saying which part of which slide.

What the analyst does: The strong response feeds the curriculum revision queue with a precisely scoped item (clarify 24-hour vs 72-hour conditions, hour three slide). The weak response generates a tag ("documentation module needs review") that the curriculum designer has to investigate manually before they can revise anything.

Level 2 · Learning

Pre and Post scenario responses

Question (asked Pre and Post on the same participant)

"A participant arrives at intake disclosing unsafe housing. What is the first action you would take and why?"

Strong response (Post)

First action: I would verify the participant's immediate safety with a non-leading question, then ask consent to discuss housing-specific support. Per protocol, I would not file a report until I have their consent unless safety triggers are present (children, immediate danger). I would document the disclosure with permission. The reason: forcing intervention without consent breaks trust and reduces follow-through.

Why it works. Names the specific protocol steps in order, identifies the exception conditions (safety triggers), explains the reasoning. Rubric scores 4 of 4 on protocol-fidelity criteria.

Weak response (Post)

I would help them find housing and report the unsafe situation to the right people.

Why it does not. Compassionate but not faithful to the protocol. Skips the consent step, conflates safety triggers with default behavior, does not identify the documentation requirement. Rubric scores 1 of 4.

What the analyst does: The Pre-to-Post delta on rubric score is the Level 2 measure. Marcus moved from a 2 of 4 at Pre to a 4 of 4 at Post on this scenario; that is +2 points. The strong response also feeds the curriculum library as an exemplar; the weak response triggers a check on whether the consent-step content was reinforced enough during week 3.

Level 3 · Behavior

30-day follow-up application narratives

Question (asked 30 days post-program)

"Walk through a case in the past 30 days where you applied the housing-disclosure protocol. What did you try, what worked, what surprised you?"

Strong response

I used it four times in the past 30 days. The case that stood out: a same-day intake where housing came up in minute three. I followed the consent step (asked permission to discuss). What surprised me: the participant disclosed two more housing-instability events I would have missed without the open-ended phrasing from week 3. The case took 18 minutes longer than my pre-training baseline, but I caught two issues I would have closed out unaware of.

Why it works. Includes the count (4 times), names the specific case, identifies what was new in behavior (consent step, open-ended phrasing), surfaces a trade-off (time cost), and reports a positive surprise that confirms protocol value.

Weak response

Used it a few times. Worked fine.

Why it does not. No count, no case, no application detail. Tells the program manager that something happened but provides no signal about what or whether the protocol worked the way it was meant to.

What the analyst does: The strong response feeds three queues: case-study material (anonymized), barrier-review (the 18-minute time cost is a workflow input), and L3-evidence count (four applications in 30 days). The weak response is logged as "applied" without specificity, which counts for adoption but does not count for evidence.

Question (Pre-to-Post confidence reasoning at Post wave)

"At Pre you rated your confidence speaking up in cross-functional meetings as 4 out of 10. What is your rating now, and what produced the change?"

Strong response

7 out of 10. At Pre I had only led one such meeting in the past year and felt out of my depth on how to open. After the course, the framework for opening (state purpose, name three goals, ask for adjustments before content) is now clear. I have led three meetings since then. The remaining gap is handling pushback mid-meeting, which I am still working on.

Why it works. Confirms the rating, anchors the change to specific learned content (opening framework), gives behavioral evidence (3 meetings led), and names the remaining gap honestly. The Pre and Post numbers connect to the underlying mechanism.

Weak response

7 out of 10. I feel more confident now.

Why it does not. The rating moved but the reasoning is empty. We do not learn what worked, what did not, or where the participant still has gaps. The +3 delta is a number without a mechanism behind it.

What the analyst does: The strong response feeds two reports: the cohort effectiveness narrative (specific content that produced change) and the next-cohort design (pushback handling is the gap to address in week 7). The weak response counts toward the +3 average delta but cannot support the narrative report.

The pattern: Strong responses share four properties: a specific anchor (moment, case, count), a named mechanism (what content, what step), a behavioral consequence (what the participant did), and an honest gap (what is still unresolved). Programs that surface these four properties in the question wording (instead of leaving them implicit) raise response quality by roughly half. For example, replace "what did you learn" with "describe one moment in the past 30 days where you applied something from this program", which anchors the response to a specific moment and forces application detail.

Q.18

What is a training questionnaire?

A training questionnaire is the instrument used to collect data for training evaluation. It contains the specific question items, the scales, the open-ended prompts, and the metadata that lets responses be analyzed. A complete training questionnaire spans all four Kirkpatrick levels and runs as a set of instruments rather than a single form: an end-of-session reaction questionnaire, a Pre/Post knowledge questionnaire, a 30-day behavior follow-up questionnaire, and a tied-metric Level 4 indicator definition. Most software called "training questionnaire tools" only produces the first one.

Q.19

What is a Kirkpatrick model questionnaire?

A Kirkpatrick model questionnaire is a training questionnaire organized by the four-level model: reaction, learning, behavior, results. Each level has its own format and cadence. Reaction items run at end-of-session (Likert plus paired open-ended). Learning items run at intake and end-of-program as paired scenarios. Behavior items run 30 to 90 days post-program as anchored counts. Results indicators run 90 days to 12 months post-program as tied operational metrics. A real Kirkpatrick questionnaire is therefore a set of four to five instruments, not one survey labeled "Kirkpatrick questionnaire." The 30 questions in the worked example above show one full set.

Q.20

What are workshop evaluation questions?

Workshop evaluation questions are training evaluation questions adapted for shorter, often single-session formats. The Kirkpatrick level structure applies the same way as for multi-week training, but the cadence compresses: reaction items run at end of workshop, learning items use Pre and end-of-workshop pairing on a 4-to-6 scenario rubric, behavior items run 30 days post if the workshop content covers an applicable skill, and results items run only when the workshop ties to a tracked operational metric. The six-format question taxonomy above transfers directly. Use Likert and Open-ended for reaction, Scenario for learning, Anchored count for behavior. Workshops shorter than 90 minutes typically cannot support Level 3 behavior measurement; the content is too brief to create enough application opportunity.

Q.21

What are lessons-learned survey questions?

Lessons-learned questions are retrospective evaluation items asked at end-of-program that capture what worked, what did not, and what would change for a future cohort. They are distinct from Kirkpatrick reaction items because the unit of analysis is the program, not the session. Common formats: "What is the one thing this program would have benefited from doing differently?" (open-ended), "Which week had the highest application value for you and why?" (open-ended with implicit ranking), and "If you were redesigning week three, what would you change?" (open-ended targeted at a specific module). They feed program-design queues, not facilitator-feedback queues. Pair every lessons-learned item with a participant identity tag so responses can be filtered by role, site, or cohort segment.

Q.22

What does a strong post-training feedback answer look like?

A strong post-training feedback answer has four properties: a specific anchor, a named mechanism, a behavioral consequence, and an honest gap. The anchor names a specific session, module, or case ("the housing-disclosure scenario in hour two"). The mechanism names what content produced the change ("the specific phrasing for opening the consent question"). The behavioral consequence reports what the participant did differently ("I applied this in three cases since training"). The gap names what is still unresolved ("I still struggle with pushback mid-meeting"). Programs that ask questions worded to surface these four properties (rather than leaving them implicit) raise response quality measurably; the Sample Answers section above contrasts strong and weak responses side-by-side for each Kirkpatrick level.