
Training Evaluation Survey Questions by Kirkpatrick Level

Training evaluation survey questions for every Kirkpatrick level. Pre and post examples, behavior-anchored prompts, and the question architecture funders accept.

Updated June 19, 2026

Build a training evaluation survey in any GenAI tool — mapped to every Kirkpatrick level

Most training surveys stop at reaction: a 4.3 out of 5 that answers a question no funder or sponsor is asking. This page is a set of prompts you paste into Claude, ChatGPT, or Gemini to draft questions across all four Kirkpatrick levels, match them pre to post, and turn the results into evidence. A prompt writes the questions in a minute. What makes them count is the same person answering twice, linked by one ID.

You describe

·Your course outline + audience
·The learning objectives
·A survey you already run

One learner record

Every question tied to a level and an ID
L1 Reaction · L2 Learning · L3 Behavior · L4 Results · Participant ID · Pre / Post

Your team gets

·Pre/post delta per person
·90-day behavior on schedule
·A Level 4 metric, audit-ready

·4: Kirkpatrick levels covered
·8: prompts for any GenAI tool
·Pre→Post: matched by one participant ID
·5%→95%: questions to real evidence

The short answer

What training evaluation survey questions are

Two definitions to keep straight before you write a single item — then the four levels every strong evaluation covers.

What are training evaluation survey questions?

Training evaluation survey questions are the items used to measure whether a training program worked, not just whether people liked it. A complete set runs across four Kirkpatrick levels: reaction at session end, learning as paired pre/post scenarios, behavior at 30 to 90 days, and results as an operational metric tied to the program. Most surveys ask only the first level and call it an evaluation.

What are the four Kirkpatrick levels?

Reaction, Learning, Behavior, and Results. Reaction asks how the experience landed; Learning measures what changed in knowledge or skill, baseline to exit; Behavior tracks what people actually did differently on the job; Results ties the program to an operational outcome a sponsor cares about. Each level uses a different question format and a different cadence.

L1 Reaction · L2 Learning · L3 Behavior · L4 Results

Build it live

Eight prompts that build a four-level evaluation, one level at a time

A question bank gives you items to pick from. These give you a working session. The order matters — each prompt builds on the answer before it.

Paste each prompt into Claude, ChatGPT, or Gemini, in order. Fill the [brackets] with your own program. Keep the same chat open so the tool remembers the survey it just wrote.
01
Level 1 + Level 2
Draft reaction and learning items from a course description

Start from what you teach. A sentence about the course is enough for the tool to draft both levels.

Create a course evaluation survey with Likert items and open-ended questions, mapped to Kirkpatrick Level 1 (reaction) and Level 2 (learning). My course is [course name and topic] for [audience], running [length]. For Level 1, use 1-to-5 Likert items on relevance, clarity, and confidence, each paired with one open-ended question asking what produced that rating. For Level 2, write three to four short scenario questions that test applied understanding, not recall.
02
Pre / Post
Turn the learning items into a matched pre/post pair

The delta is only real if the same person answers the same item twice on a locked scale. This sets that up.

Turn the Level 2 scenarios into a matched pre/post pair. Use the exact same scenarios and the same rubric at intake and at end-of-program, and keep every Likert scale locked at 1-to-5 across both waves. Explain how I'd score the per-person delta, so the change is a real measure rather than a scale artifact.
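
As a rough illustration of what "locked scale" buys you, here is a minimal Python sketch of per-person delta scoring. The 0-to-4 rubric range, the participant IDs, and the scores are all hypothetical; the only point is that a score from a different scale is rejected instead of silently inflating the delta.

```python
# Assumption: each Level 2 scenario is scored on one rubric, 0 to 4,
# and the same rubric is used at intake and at end-of-program.
RUBRIC_SCALE = (0, 4)

def score_delta(pre, post, scale=RUBRIC_SCALE):
    """Return post - pre, refusing any score that is off the locked scale."""
    low, high = scale
    for label, score in (("pre", pre), ("post", post)):
        if not low <= score <= high:
            raise ValueError(
                f"{label} score {score} is outside the locked {low}-{high} scale"
            )
    return post - pre

# Hypothetical matched pairs: participant ID -> (pre score, post score).
matched_pairs = {"p001": (2, 4), "p002": (3, 3), "p003": (1, 2)}

deltas = {pid: score_delta(pre, post) for pid, (pre, post) in matched_pairs.items()}
print(deltas)  # {'p001': 2, 'p002': 0, 'p003': 1}
```

If one wave had been collected on a wider scale, a call such as `score_delta(2, 7)` would raise a `ValueError` rather than report a five-point gain that is only a scale artifact.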
03
Level 3
Add a behavior follow-up at 30, 60, and 90 days

Behavior is what people did on the job. Anchored counts beat self-rated frequency scales, which tend to reflect how people see themselves more than what they actually did.

Add a Level 3 behavior follow-up to run 30, 60, and 90 days after training. Use anchored-count questions tied to a specific application moment — "in the past 30 days, how many times did you use [skill] where it applied" — not self-rated frequency scales. Pair each with one open-ended question surfacing the barriers that stopped people from applying it.
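
The follow-up cadence and the anchored-count wording can be sketched in a few lines of Python. This is only an illustration: the training end date and the skill name are made up, and the function names are hypothetical helpers, not part of any survey tool.

```python
from datetime import date, timedelta

FOLLOW_UP_DAYS = (30, 60, 90)  # the three waves named in the prompt above

def follow_up_dates(training_end, offsets=FOLLOW_UP_DAYS):
    """Map each follow-up offset (in days) to the date that wave should go out."""
    return {days: training_end + timedelta(days=days) for days in offsets}

def anchored_count_question(skill, window_days=30):
    """Anchored-count wording: a number of uses in a fixed window, not a rating."""
    return (f"In the past {window_days} days, how many times did you use "
            f"{skill} where it applied?")

schedule = follow_up_dates(date(2026, 3, 2))  # hypothetical training end date
for days, send_on in schedule.items():
    print(days, send_on.isoformat())
# 30 2026-04-01
# 60 2026-05-01
# 90 2026-05-31

question = anchored_count_question("the objection-handling framework")  # hypothetical skill
print(question)
```

Fixing the dates at enrollment, rather than deciding them after the session, is what keeps the three waves comparable across cohorts.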
04
Level 4
Define the results metric — before training starts

Level 4 is not a survey question. It is an operational metric with a date range and a comparison cohort, defined now so attribution holds up later.

Help me define the Level 4 results measure. It is not a survey question — it is an operational metric that existed before training, with a date range and a comparison cohort. For my program, the outcome that matters is [describe it]. Propose one or two tied metrics I could pull, the source system for each, and the cadence, defined now so the attribution is auditable later.
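
One way to make "defined now" concrete is to write the metric down as a small record before training starts. The Python sketch below is an assumption-heavy illustration: the metric name, source system, dates, cohort description, and numbers are all hypothetical, and subtracting the comparison cohort's change from the trained cohort's change is one simple reading of "comparison cohort", not a method the text prescribes.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ResultsMetric:
    """A Level 4 measure written down before training, so it can be audited later."""
    name: str
    source_system: str
    cadence: str
    baseline_start: date
    baseline_end: date
    followup_start: date
    followup_end: date
    comparison_cohort: str

    def __post_init__(self):
        # The baseline window must close before the follow-up window opens.
        if not (self.baseline_start <= self.baseline_end
                < self.followup_start <= self.followup_end):
            raise ValueError("baseline window must end before the follow-up window starts")

def change_vs_comparison(trained_before, trained_after,
                         comparison_before, comparison_after):
    """Change in the trained cohort minus change in the comparison cohort."""
    return (trained_after - trained_before) - (comparison_after - comparison_before)

metric = ResultsMetric(
    name="first-call resolution rate (%)",     # hypothetical metric
    source_system="support ticketing export",  # hypothetical source
    cadence="monthly",
    baseline_start=date(2026, 1, 1),
    baseline_end=date(2026, 3, 31),
    followup_start=date(2026, 5, 1),
    followup_end=date(2026, 7, 31),
    comparison_cohort="agents hired in the same quarter, not yet trained",
)

net_change = change_vs_comparison(62, 71, 61, 64)  # hypothetical percentages
print(net_change)  # 6
```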
05
Open-ends
Pair every rating with a reason, and make it codeable

A number with no reasoning behind it produces an average no one can interpret. This adds the why, and a rubric to theme it.

For every Likert item in my survey, write the paired open-ended question that asks what produced the score, worded to get a specific anchor rather than "it was good." Then give me a short coding rubric — four to six themes — I could use to tag the open-ended answers consistently across the whole cohort.
06
Quality check
Catch the things that quietly break an evaluation

Leading questions, double-barreled items, scale drift, and outcomes with no instrument behind them. This finds and fixes them.

Review my whole survey for the things that quietly break training evaluation: leading or double-barreled questions, vague items that can't be acted on, scale drift between pre and post, and outcomes I'm claiming with no instrument behind them. Rewrite each weak item and tell me in one line what was wrong with the original.
07
Analysis
Turn responses into a pre/post read, not a wall of text

Collection was never the bottleneck — analysis was. This computes the delta and themes the open-ends in one pass.

I've collected responses. Here is the data: [paste or describe it]. For each Level 2 scenario, compute the pre-to-post delta per participant and flag anyone who didn't move. Theme the open-ended answers using the rubric, and tell me which themes show up most among the people whose scores didn't improve.
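
If you would rather check the tool's arithmetic yourself, the same read can be sketched in plain Python. The rows below are invented, and the sketch assumes the open-ended answers have already been tagged with themes from your coding rubric.

```python
from collections import Counter

# Hypothetical linked responses: one row per participant, themes already coded.
responses = [
    {"participant_id": "p001", "pre": 2, "post": 4, "themes": ["manager support"]},
    {"participant_id": "p002", "pre": 3, "post": 3,
     "themes": ["no time to practice", "tool access"]},
    {"participant_id": "p003", "pre": 2, "post": 2, "themes": ["no time to practice"]},
    {"participant_id": "p004", "pre": 1, "post": 3, "themes": ["tool access"]},
]

# A non-mover is anyone whose post score did not exceed their pre score.
non_movers = [r for r in responses if r["post"] - r["pre"] <= 0]
non_mover_ids = [r["participant_id"] for r in non_movers]

# Tally the coded themes among the non-movers only.
theme_counts = Counter(theme for r in non_movers for theme in r["themes"])

print(non_mover_ids)               # ['p002', 'p003']
print(theme_counts.most_common())  # [('no time to practice', 2), ('tool access', 1)]
```

Reading the themes only among the people who did not move is what turns the open-ends into a diagnosis instead of a general summary.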
08
Adapt
Rewrite the bank for your training context

The four-level architecture transfers; the scenarios and the Level 4 metric do not. This swaps them for your setting.

Rewrite this question bank for [sales enablement / clinical training / a nonprofit workforce program / compliance training]. Keep the four-level structure, but swap in scenarios, application moments, and a Level 4 metric that fit that setting. Flag any level that doesn't apply to a short, single-session format.

The honest part

Where the prompt stops and the data system begins

A GenAI tool can write every question across all four levels — that is the easy 5%. The other 95% is what links one person's answers across waves so the numbers mean something.

What the prompt gives you — the 5%

·A four-level question set drafted from a course description
·Pre/post pairing and a locked scale, written correctly
·A coding rubric for the open-ended answers
·The analysis steps, once you have the data

What a data system gives you — the 95%

+A participant ID assigned at enrollment, inherited by every wave
+The same person's pre and post scored as one delta, automatically
+The 30/60/90-day follow-up that actually goes out
+Open-ends themed on arrival, sitting next to the scores on one record

Pre · intake

L2 scenario, rubric: 2 / 4
Confidence (1–10): 4
Applied on the job

One ID links them: participant_0431

Post · exit + 90 days

L2 scenario, rubric: 4 / 4
Confidence (1–10): 7
Applied, past 30 days

Learning Δ +2 · Confidence Δ +3 · Behavior confirmed

Without the shared ID, the same participant is two anonymous rows in two separate exports — and the delta cannot be computed at all. The ID is the difference between a 4.3 average and proof that this person changed.
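
To see why the ID matters, here is a minimal Python sketch that joins two separate exports on a shared key. The `participant_0431` values come from the example above; the second participant, the field names, and the `link_waves` helper are hypothetical.

```python
pre_export = [
    {"participant_id": "participant_0431", "rubric": 2, "confidence": 4},
    {"participant_id": "participant_0518", "rubric": 3, "confidence": 6},  # hypothetical row
]
post_export = [
    {"participant_id": "participant_0431", "rubric": 4, "confidence": 7},
]

def link_waves(pre_rows, post_rows, key="participant_id"):
    """Join two exports on the shared ID; return linked deltas and unmatched IDs."""
    pre_by_id = {row[key]: row for row in pre_rows}
    post_by_id = {row[key]: row for row in post_rows}
    linked = {}
    for pid in sorted(pre_by_id.keys() & post_by_id.keys()):
        pre, post = pre_by_id[pid], post_by_id[pid]
        linked[pid] = {
            "learning_delta": post["rubric"] - pre["rubric"],
            "confidence_delta": post["confidence"] - pre["confidence"],
        }
    # IDs present in only one wave cannot produce a delta; report them instead of dropping them.
    unmatched = sorted(pre_by_id.keys() ^ post_by_id.keys())
    return linked, unmatched

linked, unmatched = link_waves(pre_export, post_export)
print(linked["participant_0431"])  # {'learning_delta': 2, 'confidence_delta': 3}
print(unmatched)                   # ['participant_0518']
```

Strip the `participant_id` field from either export and the join has nothing to match on, which is the situation the paragraph above describes.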

A prompt writes the questions. A record links the answers. Five hundred post-surveys, of which you read ten, prove nothing. The same person answering at intake and again ninety days later, held together by one ID, with the reasons themed beside the scores: that is the evidence a sponsor accepts.

Your next evaluation

Stop running satisfaction surveys. Start measuring change.

A prompt drafts the four-level survey in a minute. Sopact Sense is where those questions become a connected learner record — pre to post to behavior to results, one ID holding it together.

  • A persistent participant ID assigned at enrollment, inherited by every instrument
  • Pre/post scored as one delta per person — no spreadsheet matching, no dropped records
  • Open-ended answers themed on arrival and analyzed beside the scores