
Primary Data: AI-Era Collection, Analysis & Examples Guide

What primary data is, how to collect it, and how to analyze it when AI enters the workflow. Persistent IDs, locked codebooks, and the red-flag automation pattern.

Stage 1 · Collect primary
Stage 2 · Structure with persistent IDs
Stage 3 · Join with secondary
Stage 4 · Query via Claude Code + MCP
Stage 5 · Act on the signal

A practical guide to primary data in 2026

Primary data on its own answers what happened to your participants. Joined with real public secondary data, it answers whether the program is reaching who it should. This guide walks the architecture, then shows the full worked example: Sopact Sense as the primary layer, the City of Chicago Data Portal and US Census as secondary, and Claude Code as the unified query layer that joins them via MCP.

Reading time: 16 minutes  ·  Updated May 14, 2026  ·  Part of the stakeholder intelligence series

Definitions, without the textbook

What primary data is, in one paragraph

Primary data is information you collect directly for your current research question, rather than reusing data someone else collected. Surveys, interviews, observations, focus groups, assessments, and program records all qualify. The defining property is purpose-fit: the instrument was designed to answer your question, the participants were selected for your study, and the format matches your analysis plan.

Four sources of primary data · in the order an evidence-based report uses them

Source · 01

People answering questions you wrote

Surveys, questionnaires, structured assessments. The largest-N category. Carries closed-ended items and short-form open-ended responses.

Source · 02

People answering in their own words

Interviews and focus groups. Lower N, higher depth. Carries mechanism and lived experience.

Source · 03

Behavior you record directly

Observations, attendance logs, milestone tracking, sensor or usage data. Independent of self-report bias.

Source · 04

Documents created during the program

Case notes, intake forms, uploaded artifacts (pitch decks, transcripts, audits). The text the program produces about itself.

What separates strong primary data from a stack of forms

Five properties carry a primary dataset from operational artifact to evidence. Persistent identity: the same participant is recognizable across intake, mid-program, and follow-up, even if their name, email, or language preferences change. Without it, longitudinal analysis becomes approximation.

Aligned definitions: every form, cohort, and fund uses the same dictionary. Getting skills training, capacity building, and professional development to roll up to one outcome category requires the dictionary to say they do. Without alignment, cross-cohort comparison breaks at the merge step.

Paired quant + qual: every closed-ended item has an open-ended probe on the same record. Pairing happens at the source, not at the end of analysis. Correlation then becomes a query against one table.
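A concrete picture of what "paired at the source" means, as a minimal sketch. The field names here are illustrative, not Sopact's actual schema:

# Illustrative record shape (not Sopact's schema): the closed-ended score
# and its open-ended probe sit on the same row, keyed by the persistent
# participant ID, so quant-qual correlation is one query on one table.
record = {
    "participant_id": "P-00412",
    "wave": "midpoint",
    "confidence_score": 4,                    # closed-ended, 1-5 scale
    "confidence_why": "I led a client demo",  # paired open-ended probe
}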

Documented sampling: who was eligible, who was reached, who responded, who dropped out, and what differs between them. A "convenience sample of trainees" is not a sampling frame. Documentation is what lets a reader judge generalizability.

Audit trail: every field traces back to the instrument, the participant consent terms, and the collection wave. Audit-ready primary data is the precondition for the AI workflow that follows.

How AI changes primary data analysis

Two failures Gen AI makes with primary data on its own

Gen AI changes the cost of qualitative coding, the speed of pattern detection, and the economics of dashboards. It does not change the need for purpose-fit collection. Run Gen AI directly against a collection of CSV exports and two failures appear immediately: numbers that look plausible but do not reconcile, and theme labels that drift between baseline and endline. Both failures share a root cause: no persistent layer underneath, so every session starts from zero.

Failure 01 · Numeric hallucination

on large quantitative primary data

What goes wrong

LLMs do approximate numerical reasoning, not exact computation. On 80 rows the placement rate reconciles to the source. On 8,000 rows the answer drifts by 3–8%.

The number looks plausible. The team ships the report. The funder asks where a specific cohort went and the trail goes cold.

The fix

The LLM calls out to a structured query against a system of record. It does not compute the total itself. Sopact Sense provides that system of record via MCP, so Claude Code can pull exact counts instead of estimating them.
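A minimal sketch of the pattern, assuming a generic query client. The method names mirror the worked example later in this guide and are illustrative, not a published API:

def placement_rate(client, program: str, cohort_year: int) -> float:
    # Exact arithmetic happens here, in code, against the system of record.
    # The LLM only decides to call this tool and narrate the result.
    rows = client.query(
        table="participants",
        filter={"program": program, "cohort_year": cohort_year},
        columns=["placed_90d"],
    )
    placed = sum(1 for r in rows if r["placed_90d"] == "yes")
    return placed / len(rows) if rows else 0.0  # reconciles at 80 rows and at 8,000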

Failure 02 · Session amnesia

on longitudinal qualitative primary data

What goes wrong

Gen AI codes one transcript well. Run baseline in March, midpoint in July, and endline in November, and three sessions produce three slightly different codebooks. Theme labels drift. Segment definitions get re-derived from scratch.

The endline report tries to compare three waves and a full week of reconciliation work appears before the analysis can begin.

The fix

A locked codebook applied by the same model across the full dataset, not per session. The codebook lives in the persistent layer; AI reads from it instead of regenerating it. Baseline remains comparable to endline by construction.
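A sketch of what "locked" means in practice. The theme labels and the generic llm client below are illustrative assumptions, not Sopact's internals:

# Assumed example: the codebook is fixed once and stored in the persistent
# layer; every wave is coded against the same labels.
LOCKED_CODEBOOK = {
    "confidence_gain": "participant describes increased self-efficacy",
    "scheduling_barrier": "conflict between program hours and work or care",
    "peer_support": "learning attributed to cohort relationships",
}

def code_transcript(llm, transcript: str, wave: str) -> str:
    # Same labels for baseline, midpoint, and endline: the codebook is
    # read from the persistent layer, never re-derived per session.
    themes = "\n".join(f"- {label}: {rule}" for label, rule in LOCKED_CODEBOOK.items())
    prompt = (
        f"Code this {wave} transcript using ONLY these themes "
        f"(do not invent new ones):\n{themes}\n\nTranscript:\n{transcript}"
    )
    return llm.complete(prompt)  # hypothetical client call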

Both failures resolve with the same architecture: a persistent data layer that holds identity, dictionary, codebook, and rubric state across sessions. The next section shows what that layer looks like and what AI then does on top of it.

Run primary data collection that holds up under MCP query

Persistent IDs across waves, locked codebook for longitudinal qualitative work, paired quant + qual on the same record, and the MCP interface that lets Claude Code join your participant data with public secondary sources in one prompt.

See how Sopact Sense works →

The architecture that makes primary data work in the AI era

The persistent layer for primary data, and what Claude Code reads from it

Primary data needs two layers to be useful: a persistent layer that holds the structures that have to stay stable across cycles, and an analytical layer that does the language and pattern work. Sopact Sense is the persistent layer. Claude Code, BI tools, and notebooks are the analytical layer. The two communicate over MCP, which means any structured query against the primary data is one prompt away, and any join with secondary data is the same prompt with more sources named.

Sopact's persistent primary layer

What the platform owns

  • Persistent participant IDs. Assigned at intake, carried across every survey, interview, and cycle.
  • Data dictionary. Form-level labels map to one outcome category across cohorts.
  • Locked codebook. The same theme labels apply across baseline, midpoint, and endline.
  • Deterministic AI scoring. Same rubric input, same output, auditable across runs.
  • Framework rollups. Theory of Change, IRIS+, Logic Model categories applied to every record.
  • MCP interface. Every field queryable from Claude Code, BI tools, and notebooks.

Claude Code + the secondary stack

What the analytical layer owns

  • Public data via MCP. Census, BLS, HUD, city portals, World Bank.
  • Cross-source joins. Sopact primary data joined with secondary baselines in one query.
  • Ad-hoc dashboards. Board-meeting one-offs in minutes, not days.
  • Workflow automation. Route signals to Slack, Asana, email at the right moment.
  • Custom modeling. Regression, segmentation, predictive risk scoring on the unified data.
  • Per-role personalized views. Program officer, board, and finance each get their own surface.

70–80% of standard analytics runs inside Sopact. The remaining 20–30%, the custom work, runs in Claude Code against the same data layer.

Worked example · primary + secondary unified via MCP

Is the TechBridge Chicago program reaching the highest-need community areas?

The question is equity, and answering it requires four data sources joined in one query. A workforce program tracks 80 participants in Sopact Sense. Two public datasets carry the neighborhood-level context. A fourth dataset, HUD's USPS crosswalk, bridges between zip code and community area. Claude Code, via MCP, queries all four and returns the answer the program coordinator needs to plan the next cohort's outreach.

The program · TechBridge Chicago

illustrative composite · 2 cohorts · 80 participants · 16-week digital skills training

The participants

80 young adults, ages 18–24, recruited from Chicago community-based organizations across 14 of Chicago's 77 community areas. Two cohorts: 2026-Q1 and 2026-Q2.

The data collected

Per-participant attendance, weekly reflections, pre and post skills assessments, 90-day placement outcomes with wage data. All tied to a persistent participant ID and a zip code.

The question

Is the program reaching participants from the lowest-income community areas? Are outcomes equitable across neighborhoods? Where should outreach focus for cohort 2027?

The four data sources, and what each one carries

One question · four sources · joined via persistent IDs and shared geographic dimensions

Source · 01 · PRIMARY

Sopact Sense

80 participant records · persistent IDs · attendance, skills lift, placement, wage, zip code

Source · 02 · SECONDARY

Chicago Data Portal

Per-capita income by community area · published by City of Chicago · Socrata API · no key required for public data

Source · 03 · SECONDARY

US Census Bureau

ACS 5-year estimates · median household income by census tract · variable B19013_001E · Census API

Source · 04 · CROSSWALK

HUD USPS Crosswalk

ZIP → census tract → community area · HUD User API · bearer token

Layer · 05 · QUERY

Claude Code (MCP)

Three MCP servers configured · one unified prompt · result computed in seconds, audit-logged
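For orientation, a project-level .mcp.json for Claude Code might look roughly like the sketch below. The mcpServers key is Claude Code's documented config shape; the package names, commands, and env vars are placeholders to verify against each server's README (the "sopact" entry in particular is assumed, not a published package):

{
  "mcpServers": {
    "sopact": {
      "command": "npx",
      "args": ["-y", "sopact-mcp-server"]
    },
    "opengov": {
      "command": "npx",
      "args": ["-y", "opengov-mcp-server"],
      "env": { "DATA_PORTAL_URL": "https://data.cityofchicago.org" }
    },
    "census": {
      "command": "npx",
      "args": ["-y", "us-census-bureau-data-api-mcp"]
    }
  }
}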

What primary data looks like inside Sopact Sense

A simplified sample of the participant table. In production this carries far more columns (consent flags, instrument versions, framework tags, narrative themes, sentiment scores), but these are the fields the equity analysis pulls.

participant_id   cohort    zip_code   attended_pct   skills_lift   placed_90d   wage_90d
P-00412          2026-Q1   60624      94%            +1.8          yes          $19.20
P-00413          2026-Q1   60617      87%            +1.2          yes          $20.40
P-00414          2026-Q1   60624      71%            +0.6          no           n/a
P-00415          2026-Q1   60651      96%            +2.1          yes          $18.90
P-00416          2026-Q1   60619      82%            +1.5          yes          $21.10
… 75 more rows · full table accessible via MCP query

What the Chicago Data Portal carries (Source 02)

Real public dataset on data.cityofchicago.org: Per Capita Income by Community Area. Socrata-powered, no API key required for public reads. Accessed via the OpenGov MCP server (github.com/srobbin/opengov-mcp-server), which exposes Socrata's SoQL query language as MCP tools.

community_area   community_area_name   per_capita_income   income_quartile
27               East Garfield Park    $13,840             Q1 (lowest)
26               West Garfield Park    $14,210             Q1 (lowest)
67               West Englewood        $13,180             Q1 (lowest)
40               Washington Park       $14,990             Q1 (lowest)
32               Loop                  $94,830             Q4 (highest)
8                Near North Side       $108,420            Q4 (highest)
… 71 more community areas · figures illustrative, refresh from live Socrata source at query time
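The same data is reachable without MCP. A direct Socrata read looks like the sketch below; the dataset ID is a placeholder to look up on data.cityofchicago.org, and the field names vary by dataset:

import requests

# Direct SoQL read against the Chicago portal. Public datasets need no
# API key; unauthenticated reads are just rate-limited. Dataset ID and
# field names below are placeholders -- confirm them on the portal.
url = "https://data.cityofchicago.org/resource/<dataset-id>.json"
params = {"$select": "community_area_name, per_capita_income"}
rows = requests.get(url, params=params).json()
print(rows[:3])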

What the US Census API carries (Source 03)

The American Community Survey (ACS) 5-year estimates, published by the US Census Bureau. Variable B19013_001E is median household income, available at census tract granularity. Accessed via the official Census MCP server (github.com/uscensusbureau/us-census-bureau-data-api-mcp), maintained by the Census Bureau itself.

state   county   tract    median_household_income (B19013_001E)   vintage
17      031      270500   $28,420                                  ACS 2023 5-yr
17      031      260400   $29,180                                  ACS 2023 5-yr
17      031      671100   $26,950                                  ACS 2023 5-yr
17      031      080100   $142,840                                 ACS 2023 5-yr
… Cook County, IL has 1,331 tracts · query filters to ~80 with participants
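For reference, the equivalent raw Census API request. The endpoint and variable are real; light unauthenticated use works without a key, though heavy use needs one:

import requests

# ACS 5-year endpoint: B19013_001E (median household income) for every
# tract in Cook County, IL (state 17, county 031).
url = "https://api.census.gov/data/2023/acs/acs5"
params = {
    "get": "NAME,B19013_001E",
    "for": "tract:*",
    "in": "state:17 county:031",
}
rows = requests.get(url, params=params).json()
print(rows[0])  # header row: ['NAME', 'B19013_001E', 'state', 'county', 'tract']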

The crosswalk that joins ZIP to tract to community area (Source 04)

HUD's USPS Crosswalk is the published mapping from ZIP code to census tract to county subdivision and beyond. It is the bridge that lets the participant's ZIP (in Sopact) join the Census's tract (in ACS) and the City of Chicago's community area (in the Data Portal). Accessed via HUD User API, free bearer token, 60 requests per minute.

zip     tract         community_area   community_area_name   res_ratio
60624   17031270500   27               East Garfield Park    0.61
60624   17031281200   27               East Garfield Park    0.39
60617   17031460400   49               Roseland              0.42
60651   17031261000   23               Humboldt Park         0.58
… one ZIP often spans multiple tracts and community areas · res_ratio weights the join
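One way to handle the multi-area ZIPs, as a sketch: assign each participant to the community area holding the largest residential share of their ZIP. This is a simplification (res_ratio can also weight outcomes fractionally), and the field names mirror the sample rows above:

def assign_community_area(zip_code: str, crosswalk: list[dict]) -> int:
    # Sum res_ratio per community area for this ZIP, then take the max.
    weights: dict[int, float] = {}
    for row in crosswalk:
        if row["zip"] == zip_code:
            ca = row["community_area"]
            weights[ca] = weights.get(ca, 0.0) + row["res_ratio"]
    return max(weights, key=weights.get)

crosswalk = [
    {"zip": "60624", "community_area": 27, "res_ratio": 0.61},
    {"zip": "60624", "community_area": 27, "res_ratio": 0.39},
    {"zip": "60617", "community_area": 49, "res_ratio": 0.42},
]
assert assign_community_area("60624", crosswalk) == 27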

Four sources, each with its own schema, its own update cadence, and its own access method. The next section shows how Claude Code joins them with one prompt.

The unified query · primary + secondary · one prompt

How Claude Code joins all four sources in one MCP-enabled prompt

One natural-language prompt becomes four MCP tool calls and one cross-source aggregation. Claude Code reads the MCP servers, plans the join, executes the queries in sequence, and returns a disaggregated answer the program coordinator can act on. The whole exchange takes seconds, replacing what used to be a three-week reconciliation project against three exported CSVs.

Step 1 · The coordinator's prompt to Claude Code

Claude Code · natural-language prompt from the program coordinator

# Conversational prompt · no code required from the user
For TechBridge Chicago cohort 2026, I want to evaluate equity:
- Are we reaching the lowest-income community areas?
- What is the placement rate by neighborhood income quartile?
- Where should we target outreach for cohort 2027?
Use Sopact for participant data, the Chicago Data Portal for community area
income, and the HUD crosswalk to map participant ZIP codes to community areas.

Step 2 · Claude plans the joins and calls the MCP servers

What Claude Code does under the hood · four MCP tool calls in sequence

# 1. Pull primary data from Sopact via MCP
sopact.query(
    table="participants",
    filter={"program": "techbridge-chicago", "cohort_year": 2026},
    columns=["participant_id", "zip_code", "placed_90d", "wage_90d", "skills_lift"]
)

# 2. Map ZIP → community area via HUD USPS Crosswalk
hud.crosswalk(
    from_geo="zip",
    to_geo="tract",
    zips=["60624", "60617", "60651", "60619", "…"]
)

# 3. Pull Chicago per-capita income via OpenGov MCP (Socrata)
opengov.query(
    domain="data.cityofchicago.org",
    dataset="per-capita-income-by-community-area",
    soql="SELECT community_area, per_capita_income"
)

# 4. Pull ACS median household income via Census MCP
census.acs5(
    year=2023,
    variables=["B19013_001E"],
    geography={"state": "17", "county": "031", "tract": "*"}
)

# 5. Join: participant → tract → community area → income → quartile
# 6. Aggregate: placement_rate, wage_avg, by income quartile
Claude: "I queried four MCP sources. Here is the equity analysis for TechBridge Chicago cohort 2026 (n = 80, distributed across 14 community areas)."

Step 3 · The result, disaggregated by community-area income quartile

Income quartile             Per-capita income range   Participants reached   Placement rate (90d)   Average wage    Skills lift (pre→post)
Q1 (lowest income)          $13K–$22K                 24 (30%)               79%                    $19.40 / hr     +1.8 points
Q2                          $22K–$38K                 22 (28%)               73%                    $20.10 / hr     +1.5 points
Q3                          $38K–$60K                 18 (22%)               71%                    $21.30 / hr     +1.3 points
Q4 (highest income)         $60K–$108K                16 (20%)               68%                    $22.50 / hr     +1.1 points
Chicago citywide baseline   all youth 18–24           n/a                    65% (secondary)        $17.80 / hr     n/a

Illustrative figures · figures refresh against live MCP sources at query time · disaggregation by community-area income quartile.
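A minimal sketch of steps 5 and 6 from the plan above, assuming the MCP pulls have landed in dataframes with the column names shown in the sample tables (the rows here are a tiny illustrative subset):

import pandas as pd

participants = pd.DataFrame({
    "participant_id": ["P-00412", "P-00413", "P-00414", "P-00415"],
    "community_area": [27, 49, 27, 23],   # from the ZIP → area crosswalk
    "placed_90d": ["yes", "yes", "no", "yes"],
    "wage_90d": [19.20, 20.40, None, 18.90],
})
area_income = pd.DataFrame({
    "community_area": [23, 27, 49],
    "income_quartile": ["Q1", "Q1", "Q2"],  # from the Chicago portal join
})

joined = participants.merge(area_income, on="community_area")
summary = joined.groupby("income_quartile").agg(
    participants=("participant_id", "count"),
    placement_rate=("placed_90d", lambda s: (s == "yes").mean()),
    avg_wage=("wage_90d", "mean"),
)
print(summary)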

Step 4 · The insight Claude returns

The program is most effective where it is most needed. Placement rate in the lowest-income quartile is 79%, 14 percentage points above the Chicago citywide baseline of 65%. The skills lift is also largest in that quartile (+1.8 points). The wage gradient runs in the other direction: higher-income participants are placed into higher-wage roles on average, suggesting employer network effects that warrant separate investigation.

Source: Claude Code output · run_id: r-2026-05-14-103 · 4 MCP sources · audit log attached

Step 5 · What happens with the answer

The unified query is one prompt away, which means the coordinator can ask the next question immediately. "Which Q1 community areas are we under-represented in?" Claude joins the same four sources, filters to Q1 areas, computes participants per area as a share of 16-to-24-year-old population, and returns a list of three community areas with high need and low representation.

The coordinator routes the list to the recruiting team via Slack. The Asana task "Cohort 2027 outreach plan" is created with the three community areas and the partner CBO names attached. The unified output is not a dashboard the team has to open; it is an operational input delivered inside the tools the team already uses.

The same persistent layer that supports this analysis supports the next twelve like it: funder reports, board summaries, accreditation evidence packages, equity audits. Each one is a different prompt against the same primary data and the same secondary sources, joined on the same dimensions, with audit trails preserved.

Primary data collection methods

Five methods, ranked by what each one does best

The five method families cover most applied primary data work. Method choice follows the research question and the stage of the program, not the researcher's training. Surveys for scale, interviews for depth, observations for unbiased behavior, focus groups for shared meaning, assessments for skill measurement. Each method has a different cost, a different N, and a different fit with the questions a foundation needs to answer.

Method Best for Typical N Cost per response AI value-add Where it falls short
Surveys + questionnaires Scale comparison across many participants 100–10,000+ Low ($0–$5) High · automated coding, sentiment, pattern detection Surface depth; self-report bias
Interviews Depth on individual perspectives and mechanism 10–100 High ($50–$200) High · transcription, theme coding, rubric scoring Generalizability without quant pairing
Focus groups Group dynamics, shared meaning-making 6–60 (in groups of 6–8) Medium ($30–$100) Medium · transcript coding; speaker attribution harder Groupthink; minority views suppressed
Observations Behavior in context, independent of self-report 5–50 sites/sessions High (analyst time) Medium · attendance auto-logging, milestone tracking Observer effect; ethical permission
Skills assessments Pre/post change measurement against rubric 20–500 Medium ($10–$50) Very high · deterministic rubric scoring at scale Test fatigue; rubric drift across waves

★ Typical N depends on program scale. The AI value-add column is what changed after 2024: deterministic rubric scoring made mixed-methods work tractable at scale for the first time.

The combination that works for most foundation programs

For a workforce, education, or training program of 50 to 500 participants, the combination that consistently produces usable evidence is: survey + skills assessment as the closed-ended core, one open-ended item per closed-ended scale, and 5 to 15 interviews per cohort for depth on specific outcome stories. Observations are added for programs with significant in-person components where behavior in context matters.

Focus groups are useful in design phase (understanding stakeholder priorities) but less useful for outcome measurement because the group dynamic obscures individual attribution.

For deeper material on each method, see the data collection methods and qualitative interview guides.

Trade-offs that drive the design choice

Advantages and disadvantages of primary data, in trade-offs

Primary data has three structural advantages and three real costs. The advantages: purpose-fit, current, controllable. The costs: time, sample size, operational complexity. The fixes: automation in collection, triangulation with secondary data, deterministic instrument design that locks across waves. The trade-offs decide whether primary is right for a given question; they rarely eliminate the need for it entirely.

Advantage 01

Purpose-fit by construction

The instrument matches the question. No reuse compromises. Every item earns its place by feeding a specific decision. Variables, scales, and timing align with the analytical plan.

Disadvantage 01

Higher cost in time and operations

Fielding, response management, identifier tracking, cleaning. Primary collection takes weeks where secondary access takes hours. Automation in collection and AI in analysis are the cost levers.

Advantage 02

Current to the moment

The data reflects the present participants, not last year's population. For programs that change cohort to cohort or run multiple waves a year, current primary data outperforms two-year-old secondary baselines.

Disadvantage 02

Bounded sample size

Sample size is bounded by your operational reach. A foundation program reaches hundreds, not the thousands needed for high-power subgroup analysis. The fix is triangulating with secondary data at the population level for context.

Advantage 03

Full control over consent and ethics

You set the sampling frame, the consent terms, the retention period, and the disclosure scope. Participant trust is built into the design rather than inherited from someone else's collection rules.

Disadvantage 03

Quality depends on instrument design

Instrument design is harder than it looks. Poorly worded items, missing pairs, unmapped categories, and drift between waves all degrade analysis. The fix is designing the instrument once, deterministically, and locking it across the cohort.

For the head-to-head decision with secondary data, see the primary vs secondary data guide. For when secondary data is the right starting point, see secondary data analysis.

Frequently asked questions

Common questions about primary data

What is primary data?

Primary data is information you collect directly for your current research question, rather than reusing data that someone else collected. Surveys, interviews, observations, focus groups, assessments, and program records all qualify. The defining characteristic is purpose-fit: the instrument was designed to answer the question, the participants were selected for the study, and the format matches the analysis plan. Primary data sits opposite secondary data, which is reused information collected for some other purpose.

What are the main methods of primary data collection?

Five method families cover most applied work. Surveys and questionnaires for standardized comparison across many participants. Interviews (structured, semi-structured, unstructured) for depth on individual perspectives. Focus groups for group dynamics and shared meaning-making. Observations (participant and non-participant) for behavior in context. Experiments and assessments for cause-and-effect or skill measurement. Method choice follows the research question, the stage of the program, and the consent and access available.

What are examples of primary data?

A pre-program skills assessment scored on a five-point rubric. An end-of-cohort survey with paired Likert and open-ended items. Interview transcripts coded against a locked codebook. Attendance logs from a workforce training program. Wage and placement records collected ninety days after completion. A focus group transcript on barriers to participation. Each one was collected directly for the study and attaches to a specific participant record.

What are the advantages of primary data?

Primary data has three structural advantages. It is purpose-fit: the instrument matches the question, with no reuse compromises. It is current: the data reflects the present participants, not last year's population. It is controllable: you set the sampling frame, the consent terms, the variable definitions, and the data dictionary. The disadvantage is cost, both in time and in operational complexity. Primary data is expensive precisely because it answers a specific question that secondary data cannot.

What are the disadvantages of primary data?

Three real costs. Collection takes time, which delays analysis relative to secondary data that already exists. Sample size is bounded by your operational reach, often hundreds rather than thousands. Data quality depends on instrument design, which is harder than it looks. The fix for the first cost is automation in collection and analysis. The fix for the second is to triangulate with secondary data on the same question. The fix for the third is to design the instrument once, deterministically, and lock it across waves.

What are the sources of primary data?

Sources are not abstract: they are the people and instruments that produce the data. Program participants completing surveys and assessments. Beneficiaries giving interviews. Staff members logging activities. Funders providing self-reported portfolio data. Observation of program activities. Documents created during the program (case notes, attendance, milestones). Every source needs informed consent, a persistent participant ID, and a defined retention period.

How do you analyze primary data?

The workflow has five stages. Clean the data, addressing non-response explicitly. Compute descriptive statistics with both center and spread. Run pre-planned inferential tests, adjusting for multiple comparisons. Pair every quantitative finding with the open-ended response on the same record. Detect cross-signal patterns across cohorts and waves. The last stage is what AI in the workflow adds: patterns that span attendance, sentiment, narrative length, and rubric scores are intractable in spreadsheets and tractable in a persistent layer.

What is the difference between primary and secondary data?

Primary data is collected directly for the current question. Secondary data is collected by someone else for a different purpose and reused. Primary data tells you what your participants did; secondary data tells you what would likely have happened anyway given the broader population. Strong impact evaluation combines both: primary minus secondary equals attributable effect. The architecture matters too: primary data lives at the participant level, while secondary data is usually aggregated and joins on geography, demographics, or time period.

How do you make primary data actionable in real time?

A chart in a dashboard is not action. Action requires three conditions: a decision is on the table, the right person sees the signal within their normal workflow, and the path from signal to action is short. The pattern that produces action: the data layer detects a multi-signal flag, an AI layer drafts the personalized response, and the operational tool (Slack, Asana, email) delivers it to the human who can act, within hours. The dashboard does not produce the action; the operational tool does.
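A sketch of the routing half of that pattern, with illustrative thresholds and a standard Slack incoming webhook (the webhook URL is a placeholder, and the flag logic is an assumed example, not a prescribed rule):

import requests

def is_red_flag(p: dict) -> bool:
    # Multi-signal flag: two weak signals together, not one alone.
    return p["attended_pct"] < 0.75 and p["skills_lift"] < 1.0

participant = {"participant_id": "P-00414", "attended_pct": 0.71, "skills_lift": 0.6}

if is_red_flag(participant):
    text = (f"Red flag: {participant['participant_id']} has low attendance "
            f"({participant['attended_pct']:.0%}) and a low skills lift.")
    # Deliver into the workflow the coach already uses, within hours.
    requests.post("https://hooks.slack.com/services/<placeholder>", json={"text": text})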

What is a persistent data layer for primary data?

A persistent data layer holds the structures that need to stay stable across cycles: participant IDs, instrument versions, the data dictionary, the codebook for qualitative themes, the rubric for AI scoring, and the cohort tags. AI tools then read from that layer instead of regenerating those structures each session. Without it, cross-wave comparison drifts because every analysis session re-derives categories. With it, the longitudinal join is automatic.

How does AI change primary data analysis?

AI changes three things and leaves the rest unchanged. It changes the cost of qualitative coding, which collapses from hours per transcript to minutes. It changes the speed of pattern detection across signals, which makes real-time intervention practical. It changes the economics of dashboards, which shift from a handful of standing reports to many disposable ones built on the same data layer. What AI does not change is the need for purpose-fit collection, locked codebooks, and persistent identity. AI without the data layer underneath produces plausible-looking outputs that do not reconcile.

The full series

Get the complete stakeholder intelligence guide

The persistent layer pattern applied to grant management, training programs, impact portfolios, and nonprofit operations. The MCP integration walked through in depth, with worked examples across multiple program types.

Read the stakeholder intelligence guide →

Ready when you are

Make your primary data work with the rest of the world.

The persistent layer. The locked codebook for longitudinal work. The MCP interface that joins your participant data with Census, HUD, BLS, and city open data in one prompt. Configured once, run for every funder report, every board meeting, every equity audit.