
Primary vs Secondary Data: When to Use Which (and Combine)

Primary vs secondary data: definitions, side-by-side matrix, and when to combine both. Worked example: workforce program outcomes vs BLS regional baseline producing a +10.4 pp attributable effect.

Updated May 14, 2026
Step 1 · State the question
Step 2 · Pick primary, secondary, or both
Step 3 · Align dimensions at design
Step 4 · Join on shared geography or time
Step 5 · Compute attributable effect

The decision and the combination

Primary data answers what happened to your participants. Secondary data answers what would likely have happened anyway. The strongest evaluations use both, joined at shared dimensions, with attributable effect as the headline finding. This guide walks the head-to-head decision and the hybrid pattern.

Reading time: 12 minutes  ·  Updated May 14, 2026  ·  Part of the stakeholder intelligence series

Definitions, in one paragraph each

What primary and secondary data each are

Primary data is collected directly for the current question. Secondary data is collected by someone else for a different purpose and reused. The defining difference is who collected the data and why. Primary fits your question by construction; secondary fits approximately, traded for speed and scale. Strong analysis uses both: primary to characterize the participants, secondary to characterize the counterfactual, the join to produce attributable effect.

Side by side · the five dimensions that distinguish them

Dimension | Primary | Secondary
Who collected | You, for this project | Government, vendor, or prior researcher
For what purpose | Your specific research question | A different question, often unrelated
Variables match | Exact, by construction | Approximate, at best
Time and cost | Weeks of fielding, higher cost | Hours of access, lower cost
Identifiers | Participant-level, persistent | Aggregated by geography or demographics

The architectural difference that matters in practice

Primary data lives at the participant level with full provenance back to the instrument and consent terms. Secondary data lives at the aggregate level: a county, a census tract, a year, an age band, an occupation code. The two cannot join on identity. They join on shared dimensions.

This architectural difference decides the analytical design: the primary dataset must include the dimensions the secondary data is aggregated on. ZIP code, occupation code, year, gender. Without those bridge dimensions in primary, the secondary baseline cannot attach to the participant outcomes.
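The bridge requirement can be sketched with a toy merge. Everything here is hypothetical for illustration: the column names (`zip_code`, `year`, `placed_90d`, `regional_placement_rate`) and the rates are invented, and the sketch assumes pandas rather than any particular Sopact export format:

```python
import pandas as pd

# Hypothetical primary records: one row per participant, with the
# bridge dimensions (zip_code, year) captured at intake.
primary = pd.DataFrame({
    "participant_id": ["p01", "p02", "p03"],
    "zip_code": ["60601", "60601", "60614"],
    "year": [2026, 2026, 2026],
    "placed_90d": [1, 0, 1],
})

# Hypothetical secondary reference table: aggregated by ZIP and year,
# with no participant identity anywhere.
baseline = pd.DataFrame({
    "zip_code": ["60601", "60614"],
    "year": [2026, 2026],
    "regional_placement_rate": [0.66, 0.71],
})

# The join runs on shared dimensions, never on identity.
joined = primary.merge(baseline, on=["zip_code", "year"],
                       how="left", indicator=True)

# Any row left without a baseline signals a missing or invalid bridge
# value; here every row carries valid bridge fields, so none are left.
unmatched = joined[joined["_merge"] != "both"]
print(len(unmatched))
```

Dropping `zip_code` from the intake form would leave nothing to merge on, which is the design-time failure the paragraph above describes.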

The decision matrix

When to use primary, when to use secondary, when to combine

The choice is not aesthetic. Each question type points to a different starting source. Participant-specific questions need primary. Population context needs secondary. Causal questions (did the program produce effects above background trend) need both, joined on shared dimensions. The matrix below maps the common evaluation question types to the source combination that answers them.

Question type | Primary alone | Secondary alone | Both, joined | Why this combination
"What happened to our participants?" | ✓ Strong | ✗ Not possible | – | Outcome description needs participant identity.
"How did outcomes change pre to post?" | ✓ Strong | ✗ Not possible | – | Longitudinal change needs persistent IDs across waves.
"What's the regional baseline?" | ✗ Out of scope | ✓ Strong | – | Population context only available from secondary.
"How are similar programs trending?" | ~ Limited to your data | ✓ Strong | – | Sector benchmarks come from secondary research.
"Did our program beat the baseline?" | ✗ No counterfactual | ✗ No participants | ✓ Required | Attributable effect = outcome − counterfactual.
"Are we reaching the highest-need areas?" | ~ Self-defined | ~ Generic | ✓ Required | Need = secondary; reach = primary.
"Why did the score change?" | ✓ Strong with qual | ✗ Not possible | – | Mechanism is participant-specific.

★ The joined column wins on every causal and equity question. Primary-only and secondary-only fit narrower question types.

For more detail on collecting primary data, see the primary data guide. For sources, validation, and integration patterns on the secondary side, see secondary data analysis.

Run primary and secondary together, joined automatically

Persistent participant IDs on the primary side, bridge dimensions (state, ZIP, occupation code, year) captured at intake, and an MCP interface that lets Claude Code or BI tools query both layers in one prompt.

See how Sopact Sense works →

Worked example · attributable effect via BLS counterfactual

Did the workforce program beat the regional baseline?

The question is causal, and only the combination of primary and secondary answers it. A 76% placement rate at 90 days sounds strong. Whether it actually represents program effect depends on what the regional baseline did over the same period. BLS data provides the regional baseline at the geography and occupation codes that match the primary participants. The subtraction is attributable effect.

The setup

A workforce program runs across three states (California, Illinois, Texas) with 219 participants in the 2026 cohort. The program trains entry-level technical occupations. The board wants to know whether participants placed above what would have happened anyway given regional labor market conditions. Primary data lives in Sopact Sense; secondary data is the BLS LAUS (Local Area Unemployment Statistics) series for the same period.

The Claude Code prompt

Claude Code · joining Sopact primary with BLS secondary via MCP

# 1. Pull primary cohort outcomes from Sopact
sopact.query(
    table="outcomes",
    filter={"cohort": "2026", "program_type": "workforce"},
    columns=["participant_id", "state", "occupation_code",
             "placement_90d", "wage_90d"],
)

# 2. Pull regional employment baseline from BLS for the same period
bls.fetch(
    series="LAUS",
    states=["CA", "IL", "TX"],
    period="2026-Q1",
    occupation_codes=["15-1252", "43-4051", "49-2098"],
)

# 3. Join on state + occupation_code, compute lift by state
join(sopact, bls, on=["state", "occupation_code"]) \
    .aggregate(placement_lift="placement_90d - regional_baseline")
Claude response:

California placement_lift: +14.2 pp (program 82%, regional baseline 67.8%)
Illinois placement_lift: +9.6 pp (program 76%, regional baseline 66.4%)
Texas placement_lift: +6.1 pp (program 71%, regional baseline 64.9%)
Composite attributable effect across 3 states: +10.4 pp
n = 219 program participants; regional sample n = 4.2M from BLS

Pseudocode for illustration. Actual MCP calls follow each server's documented schema.

219 · Primary records. Per-participant 90-day placement, wage, occupation code, and state from Sopact.

4.2M · Secondary baseline. BLS LAUS regional employment data covering the same period and occupation codes.

+10.4 pp · Attributable effect. The lift above what would have happened anyway given regional labor conditions.

Same dataset, two analyses

Primary only · "We placed 76% at 90 days."

A real outcome. Not impact. The reader has no way to know whether this rate is above, below, or at the regional baseline. The funder asks "compared to what?" and the team has no answer.

Primary + secondary joined · "We placed 10.4 pp above regional baseline."

Attributable effect. The same primary data, joined with BLS secondary on state and occupation. The funder asks "compared to what?" and the team points to the regional sample of 4.2 million workers.

Architecture for the hybrid

What sits underneath a primary + secondary analysis

The hybrid pattern is not two analyses stapled together. It is one analysis against one data layer that holds both primary records and secondary references. Primary lives at the participant level with persistent IDs. Secondary lives in reference tables aggregated by geography, demographics, or time. The join logic is defined once and runs whenever the analysis runs.
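One way to sketch "join logic defined once" is a database view over both layers: every downstream analysis queries the view instead of re-specifying the join. The schema, table names, and figures below are invented for illustration; a real deployment would sit behind the MCP interface rather than a local SQLite file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: primary outcomes at participant grain,
# secondary baseline at (state, occupation) grain.
cur.executescript("""
CREATE TABLE outcomes (
    participant_id TEXT, state TEXT, occupation_code TEXT,
    placement_90d REAL
);
CREATE TABLE bls_baseline (
    state TEXT, occupation_code TEXT, regional_rate REAL
);
-- The join logic lives in one view, so every analysis reuses it.
CREATE VIEW attributable AS
SELECT o.state,
       o.occupation_code,
       AVG(o.placement_90d) - b.regional_rate AS placement_lift
FROM outcomes o
JOIN bls_baseline b
  ON o.state = b.state AND o.occupation_code = b.occupation_code
GROUP BY o.state, o.occupation_code, b.regional_rate;
""")

cur.executemany("INSERT INTO outcomes VALUES (?,?,?,?)",
                [("p01", "IL", "15-1252", 1.0),
                 ("p02", "IL", "15-1252", 1.0),
                 ("p03", "IL", "15-1252", 0.0)])
cur.execute("INSERT INTO bls_baseline VALUES ('IL', '15-1252', 0.60)")

# Whenever the analysis runs, the same join runs with it.
lift = cur.execute("SELECT placement_lift FROM attributable").fetchone()[0]
print(round(lift, 4))
```

The view is the "defined once" part: refreshing either table changes the answer without anyone restating the join.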

Sopact: primary at participant level

Persistent identity + bridge dimensions

  • Per-participant rows. Every record carries a persistent ID and consent provenance.
  • Bridge dimensions at intake. State, county, ZIP, occupation code, year, demographics.
  • Quant + qual paired. Likert plus narrative on the same record.
  • Locked codebook. Theme labels stable across waves.
  • MCP interface. Structured query exposed to Claude Code and BI tools.
  • Audit log. Every query, join, and aggregation traceable.
join · shared dimensions

Secondary: reference tables, aggregated

Public sources via MCP or API

  • Census ACS at tract level via official Census MCP server.
  • BLS LAUS / QCEW at county level via BLS API.
  • HUD Fair Market Rents, CHAS, USPS crosswalk via HUD User API.
  • City portals (Chicago, NYC, SF, LA) via OpenGov MCP / Socrata.
  • World Bank indicators at country level via Open Data API.
  • Documented vintage. Source, period, and update cadence per table.

Both layers live in one analytical workspace. The join is automatic when bridge dimensions are present on the primary side.

The join pattern

How primary and secondary actually join

Secondary data has no participant IDs, so the join cannot be on identity. It runs on shared dimensions: geography, occupation, demographics, time. The primary dataset must collect those dimensions at intake. Without them, the secondary baseline cannot attach to the participant outcomes. The fix is design-time: every primary form captures the bridge fields the secondary join will need.
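The aggregate-then-join pattern reads as two steps: roll primary up to the grain the secondary already sits at, then merge and subtract. A minimal sketch with invented column names, rates, and participants, assuming pandas:

```python
import pandas as pd

# Hypothetical participant-level primary data with bridge dimensions.
primary = pd.DataFrame({
    "participant_id": ["p01", "p02", "p03", "p04"],
    "state": ["CA", "CA", "TX", "TX"],
    "occupation_code": ["15-1252"] * 4,
    "placed_90d": [1, 1, 1, 0],
})

# Hypothetical secondary baseline, already at (state, occupation) grain.
baseline = pd.DataFrame({
    "state": ["CA", "TX"],
    "occupation_code": ["15-1252", "15-1252"],
    "regional_rate": [0.70, 0.65],
})

# 1. Aggregate primary up to the secondary's grain.
program = (primary
           .groupby(["state", "occupation_code"], as_index=False)
           .agg(program_rate=("placed_90d", "mean")))

# 2. Join on the shared dimensions and subtract: attributable effect.
lift = program.merge(baseline, on=["state", "occupation_code"])
lift["placement_lift"] = lift["program_rate"] - lift["regional_rate"]
print(lift[["state", "placement_lift"]])
```

The subgroup rows are the point: a single pooled rate would hide that one state sits above its baseline and another below it.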

Six bridge dimensions · capture in primary, join with secondary

  • Geography: state, county, ZIP, tract, community area
  • Occupation: SOC code (6-digit) or category
  • Time period: year, quarter, month at outcome
  • Age band: in census-aligned brackets
  • Gender: with consent
  • Sector / industry: NAICS code or category

What each secondary source is already aggregated at (primary aggregates up to match):

  • Census / ACS: all six available at tract
  • BLS LAUS: state, county, occupation
  • BLS QCEW: county, NAICS, quarter
  • HUD: ZIP, tract, county, MSA
  • City portals: community area, ward, district
  • World Bank: country, year

For the deeper worked example of all four sources joined via MCP in one prompt, see the primary data hub's TechBridge Chicago equity analysis. Same pattern, different geography and question.

Where the hybrid analysis breaks

Four common mistakes when combining primary and secondary

Each mistake produces a comparison that is technically computable and substantively wrong. Mismatched baselines, time-shifted comparisons, methodology drift, and hidden sourcing are the most frequent failures in foundation evaluation reports. Each one degrades the credibility of the headline finding. Validation before reuse is the prevention.

Mistake 01 · Mismatched baseline

geographic or demographic scope mismatch

How it goes wrong

Program participants concentrate in three urban counties. The secondary baseline used is the national average across all 3,143 counties. The national rate masks the urban context that actually drives the comparison.

The fix

Pull the secondary baseline at the same geographic resolution as primary aggregation. Three counties on the primary side, three-county baseline on the secondary side. ACS and BLS both support this granularity.
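The fix can be sanity-checked with toy numbers. The counties, rates, and labor force sizes below are invented; the point is only that a baseline over all counties and a labor-force-weighted baseline over the program counties can differ enough to move the headline:

```python
# Hypothetical county-level baselines: county -> (placement_rate, labor_force)
county_baseline = {
    "Cook":    (0.64, 2_500_000),
    "DuPage":  (0.68, 480_000),
    "Will":    (0.66, 350_000),
    "Rural A": (0.55, 40_000),   # not a program county
}

program_counties = ["Cook", "DuPage", "Will"]

# Wrong: an unweighted baseline over every county,
# including ones the program never touches.
all_rates = [rate for rate, _ in county_baseline.values()]
naive_baseline = sum(all_rates) / len(all_rates)

# Right: a labor-force-weighted baseline over the program counties only.
rows = [county_baseline[c] for c in program_counties]
matched_baseline = (sum(rate * n for rate, n in rows)
                    / sum(n for _, n in rows))

print(round(naive_baseline, 3), round(matched_baseline, 3))
```

With these toy numbers the naive baseline is lower than the matched one, which would overstate the program's lift.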

Mistake 02 · Time-shifted comparison

primary period vs secondary period misaligned

How it goes wrong

Primary outcomes are 2026 Q1. Secondary baseline is 2022 ACS 5-year, the most recent publicly available at analysis time. Four years of labor market change sits between them, invalidating the comparison.

The fix

Match on time, even at the cost of accepting older primary data or more recent higher-frequency secondary. BLS monthly LAUS is current; ACS 5-year is two years lagged. Choose the secondary source whose period aligns with the primary.
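A small vintage check makes the time rule mechanical. The source names and coverage windows below are illustrative assumptions, not a real catalog:

```python
# Hypothetical vintages: source -> (coverage_start_year, coverage_end_year)
secondary_vintages = {
    "ACS 5-year 2022": (2018, 2022),
    "BLS LAUS monthly": (2026, 2026),
}

primary_period = 2026  # year the primary outcomes were observed

def aligned(source: str, year: int, tolerance: int = 0) -> bool:
    """True when the primary outcome year falls inside (or within
    `tolerance` years of) the secondary source's coverage window."""
    start, end = secondary_vintages[source]
    return start - tolerance <= year <= end + tolerance

print(aligned("ACS 5-year 2022", primary_period))   # four-year gap: reject
print(aligned("BLS LAUS monthly", primary_period))  # same period: accept
```

Running the check before the join, rather than footnoting the gap afterwards, is the design-time version of "match on time."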

Mistake 03 · Methodology drift

primary definitions vs secondary definitions

How it goes wrong

Primary defines "employed" as paid placement at 90 days. BLS defines "employed" as worked one hour for pay in the reference week. The same word means different things, and the apparent comparison hides a definitional gap.

The fix

Document the definition for every joined variable in both sources. If they diverge, either align primary collection to the secondary definition or explicitly note the gap in the published report.
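The documentation step can be enforced with a simple codebook check. The structure below is a sketch; the "employed" definitions paraphrase the mistake described above, and the variable set is illustrative:

```python
# Hypothetical codebook: the same variable name, defined once per source.
codebook = {
    "employed": {
        "primary":   "paid placement held at 90 days post-program",
        "secondary": "worked at least one hour for pay in the reference week",
    },
    "state": {
        "primary":   "state of residence at intake",
        "secondary": "state of residence at intake",
    },
}

# Variables whose definitions diverge get flagged before the join is
# published, not discovered afterwards.
divergent = [var for var, defs in codebook.items()
             if defs["primary"] != defs["secondary"]]
print(divergent)
```

Any flagged variable forces the choice the paragraph names: align the primary instrument to the secondary definition, or note the gap explicitly in the report.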

Mistake 04 · Hidden source in the report

reader cannot tell primary from secondary

How it goes wrong

The final report mixes primary numbers and secondary numbers in the same chart or paragraph without naming the source per figure. A reader cannot tell which figure is the program's own data and which is the baseline.

The fix

Tag every number with its source, time window, and known limitation. Use visually distinct styling (color, weight, footnote) to make primary versus secondary obvious at a glance. Methodology appendix names each source explicitly.

Examples across program types

What primary + secondary looks like in four common evaluation domains

The pattern generalizes across program types. Workforce, housing, education, and health each have well-documented secondary sources that pair with primary collection. The table below names the primary instrument, the secondary source, the join dimension, and the kind of question each combination answers.

Domain | Primary instrument | Secondary source | Join dimension | Question answered
Workforce training | Skills assessment, 90-day placement survey | BLS LAUS, QCEW | State, county, occupation code | Did placement beat the regional baseline?
Housing stability | Tenant intake, 6-month sustainment follow-up | HUD Fair Market Rents, CHAS | ZIP, tract, MSA | Are rent burdens reducing relative to area median?
Education / STEM | Pre/post rubric, retention tracking | Census ACS, IPEDS | Tract, age band, race / ethnicity | Is the program reaching underserved demographics?
Health equity | Patient outcome surveys, biometric assessments | NHANES, CDC WONDER, county health rankings | County, age band, demographic | Are outcomes improving against county baseline?
Urban development | Participant intake, neighborhood survey | City open data portals, Census ACS | ZIP, tract, community area | Is the program reaching highest-need neighborhoods?

★ Each secondary source is publicly accessible. Census, BLS, HUD, and CDC sources all have official APIs. The city portals on the OpenGov MCP cover most US urban evaluation work.

For the urban development case worked end-to-end with all four sources joined via MCP, see the TechBridge Chicago example on the primary data hub.

Frequently asked questions

Common questions about combining primary and secondary data

What is the difference between primary and secondary data?

Primary data is collected directly for the current research question. Secondary data is collected by someone else for a different purpose and reused. Primary tells you what your participants did. Secondary tells you what the broader population did during the same period. The architectural difference matters too: primary data lives at the participant level with full identity; secondary data is usually aggregated and joins on geography, demographics, or time.

When should you use primary data?

Use primary data when the question is participant-specific, the variables you need do not exist in any reusable source, or the comparison requires longitudinal tracking of the same people across time. Funder-required outcome metrics, program-specific skills assessments, and cohort retention analysis all need primary collection. The cost is time and operational complexity; the payoff is purpose-fit data that answers your exact question.

When should you use secondary data?

Use secondary data when a credible source already covers the population you care about and the variables match your question. Regional employment statistics, demographic baselines, sector benchmarks, and published impact studies are all candidates. The cost is lower (someone else paid for collection), but the data is rarely a perfect fit. Validate the methodology, the period of coverage, and the unit of analysis before reusing.

When should you combine primary and secondary data?

Combine them when the question is causal: did the program produce effects above what would have happened anyway. Primary data alone shows outcomes; secondary data alone shows the baseline; the combination produces attributable effect. In the workforce example, program participants placed at 76% at 90 days, 10.4 percentage points above the regional baseline. Neither dataset reveals this in isolation.

How do you join primary and secondary data?

Secondary data has no participant IDs, so the join cannot be on identity. It joins on shared dimensions: state, region, occupation code, age band, gender, year. The primary dataset aggregates to those same dimensions, and the join becomes a SQL operation. Disaggregation by subgroup matters: a national average obscures state-level variation that the join can reveal.

What are examples of primary data?

Surveys you conducted, interviews you recorded, focus groups you facilitated, assessments you administered, observations you logged, program records you maintained. The common feature: collected directly for the current question, attached to a specific participant or session, with full provenance back to the instrument and sampling frame.

What are examples of secondary data?

BLS labor force statistics, census tables, IPUMS microdata, NHANES health data, World Bank development indicators, published peer-reviewed studies, sector benchmarks from industry associations. Each one was collected for some other purpose and is reused for the current question. Useful when the methodology is documented and the population overlaps with yours.

What is attributable effect in impact analysis?

Attributable effect is outcome minus counterfactual. The outcome is what happened to your participants; the counterfactual is what would have happened to comparable people without the program. Primary data provides the outcome; secondary data provides the counterfactual. The subtraction reveals the program's contribution beyond background trend. In a strong design, the counterfactual is drawn from a population matched on geography, demographics, and time period.
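The subtraction itself is one line; wrapping it in a function mostly serves to pin the units. The rates below are illustrative, in the spirit of the worked example, not figures from any real dataset:

```python
def attributable_effect_pp(outcome_rate: float,
                           counterfactual_rate: float) -> float:
    """Attributable effect in percentage points:
    outcome minus counterfactual, both given as fractions (0.0 to 1.0)."""
    return round((outcome_rate - counterfactual_rate) * 100, 1)

# Illustrative: program outcome 76%, matched regional baseline 65.6%.
print(attributable_effect_pp(0.76, 0.656))
```

The strength of the number rests entirely on how well the counterfactual is matched on geography, demographics, and time, per the definition above.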

Can AI tools combine primary and secondary data?

AI tools like Claude Code can perform the join and the analysis, but only when both sources are queryable. Sopact's primary data is exposed via MCP, allowing Claude to pull participant outcomes and join them with BLS, census, or other secondary data in one query. Without the persistent-layer interface, the AI has no reliable way to pull primary data; without the public APIs, no way to pull secondary. The combination of both makes the cross-source analysis tractable.

What are common mistakes when combining primary and secondary data?

Three frequent errors. Using a mismatched baseline (national average when participants concentrate in three states). Comparing across different time periods (2024 program data against 2022 baseline). Ignoring methodology differences (survey-based primary against administrative-record secondary). Each mistake produces a counterfactual that is technically computable and substantively wrong. Validation before reuse is the prevention.

The full series

Get the complete stakeholder intelligence guide

The hybrid pattern applied to grant management, training programs, impact portfolios, and nonprofit operations. The MCP integration walked through in depth, with worked examples across multiple evaluation domains.

Read the stakeholder intelligence guide →

Ready when you are

Run primary and secondary in one analysis, every time.

Bridge dimensions captured at intake. Secondary reference tables alongside primary records. The MCP interface that joins Sopact participant data with BLS, Census, HUD, and city portal data in one prompt. Attributable effect in the headline, not on page 12.