What is the difference between primary and secondary data?
Primary data is collected directly for the current research question. Secondary data is collected by someone else for a different purpose and reused.
Primary tells you what your participants did. Secondary tells you what the broader population did during the same period.
The architectural difference matters too: primary data lives at the participant level with full identity; secondary data is usually aggregated and joins on geography, demographics, or time.
When should you use primary data?
Use primary data when the question is participant-specific, the variables you need do not exist in any reusable source, or the comparison requires longitudinal tracking of the same people across time.
Funder-required outcome metrics, program-specific skills assessments, and cohort retention analysis all need primary collection.
The cost is time and operational complexity; the payoff is purpose-fit data that answers your exact question.
When should you use secondary data?
Use secondary data when a credible source already covers the population you care about and the variables match your question.
Regional employment statistics, demographic baselines, sector benchmarks, and published impact studies are all candidates.
The cost is lower (someone else paid for collection), but the data is rarely a perfect fit.
Validate the methodology, the period of coverage, and the unit of analysis before reusing it.
When should you combine primary and secondary data?
Combine them when the question is causal: did the program produce effects above what would have happened anyway?
Primary data alone shows outcomes; secondary data alone shows the baseline; only the combination produces an attributable effect.
A workforce example: program participants were placed at a 78% rate at 90 days, the regional baseline was 67%, so the attributable lift is 11 percentage points.
Neither dataset reveals this in isolation.
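The arithmetic behind the example is a single subtraction once both rates sit on the same dimensions. A minimal sketch, using the illustrative figures from the example above (variable names are hypothetical):

```python
# Attributable lift = primary outcome rate minus secondary baseline rate.
# Figures come from the workforce example; names are illustrative.
program_placement_rate = 0.78   # primary data: participants placed at 90 days
regional_baseline_rate = 0.67   # secondary data: regional placement rate
attributable_lift = program_placement_rate - regional_baseline_rate
print(f"Attributable lift: {attributable_lift:.0%}")  # 11 percentage points
```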
How do you join primary and secondary data?
Secondary data has no participant IDs, so the join cannot be on identity.
It joins on shared dimensions: state, region, occupation code, age band, gender, year.
The primary dataset aggregates to those same dimensions, and the join becomes a SQL operation.
Disaggregation by subgroup matters: a national average obscures state-level variation that the join can reveal.
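The aggregate-then-join step can be sketched in a few lines of SQL. This is a minimal illustration using SQLite in memory; the table names, columns, and figures are all hypothetical, not a real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Primary data: participant-level records (hypothetical schema).
cur.execute("CREATE TABLE participants (id INTEGER, state TEXT, year INTEGER, placed INTEGER)")
cur.executemany("INSERT INTO participants VALUES (?, ?, ?, ?)", [
    (1, "OH", 2024, 1), (2, "OH", 2024, 1), (3, "OH", 2024, 0),
    (4, "TX", 2024, 1), (5, "TX", 2024, 0),
])

# Secondary data: no participant IDs, just rates keyed on shared dimensions.
cur.execute("CREATE TABLE baseline (state TEXT, year INTEGER, employment_rate REAL)")
cur.executemany("INSERT INTO baseline VALUES (?, ?, ?)", [
    ("OH", 2024, 0.62), ("TX", 2024, 0.70),
])

# Aggregate primary to the shared dimensions, then join on them.
cur.execute("""
    SELECT p.state, p.year,
           AVG(p.placed)                     AS program_rate,
           b.employment_rate                 AS baseline_rate,
           AVG(p.placed) - b.employment_rate AS lift
    FROM participants p
    JOIN baseline b ON b.state = p.state AND b.year = p.year
    GROUP BY p.state, p.year
""")
for row in cur.fetchall():
    print(row)
```

Because the join keys are geography and time rather than identity, the same query disaggregates naturally: each state-year group gets its own baseline and its own lift.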
What are examples of primary data?
Surveys you conducted, interviews you recorded, focus groups you facilitated, assessments you administered, observations you logged, program records you maintained.
The common feature: collected directly for the current question, attached to a specific participant or session, with full provenance back to the instrument and sampling frame.
What are examples of secondary data?
BLS labor force statistics, census tables, IPUMS microdata, NHANES health data, World Bank development indicators, published peer-reviewed studies, sector benchmarks from industry associations.
Each one was collected for some other purpose and is reused for the current question.
Each is useful when its methodology is documented and its population overlaps with yours.
What is attributable effect in impact analysis?
Attributable effect is outcome minus counterfactual.
The outcome is what happened to your participants; the counterfactual is what would have happened to comparable people without the program.
Primary data provides the outcome; secondary data provides the counterfactual.
The subtraction reveals the program's contribution beyond background trend.
In a strong design, the counterfactual is drawn from a population matched on geography, demographics, and time period.
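The outcome-minus-counterfactual subtraction only makes sense when the counterfactual is matched on the same dimensions. A sketch of that guard, with hypothetical names and illustrative figures:

```python
# Attributable effect = outcome - counterfactual, where the counterfactual
# must come from secondary data matched on the same dimensions.
# All names and figures below are hypothetical.

def attributable_effect(outcomes, baseline, key):
    """Subtract the matched baseline rate from a group's outcome rate.

    outcomes: dict mapping a dimension key, e.g. (state, year), to an outcome rate
    baseline: dict mapping the same keys to counterfactual rates
    key: the dimension key to evaluate
    """
    if key not in baseline:
        raise KeyError(f"No matched counterfactual for {key}")
    return outcomes[key] - baseline[key]

outcomes = {("OH", 2024): 0.78}   # primary: participant placement rate
baseline = {("OH", 2024): 0.67}   # secondary: matched regional rate
print(attributable_effect(outcomes, baseline, ("OH", 2024)))  # ~0.11
```

Raising on a missing key, rather than falling back to a national average, is the point: an unmatched counterfactual should fail loudly instead of silently producing a wrong lift.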
Can AI tools combine primary and secondary data?
AI tools like Claude Code can perform the join and the analysis, but only when both sources are queryable.
Sopact's primary data is exposed via MCP, allowing Claude to pull participant outcomes and join them with BLS, census, or other secondary data in one query.
Without the persistent-layer interface, the AI has no reliable way to pull primary data; without the public APIs, it has no way to pull secondary data.
The combination of both makes the cross-source analysis tractable.
What are common mistakes when combining primary and secondary data?
Three errors recur: using a mismatched baseline (a national average when participants concentrate in three states); comparing across different time periods (2024 program data against a 2022 baseline); and ignoring methodology differences (survey-based primary data against administrative-record secondary data).
Each mistake produces a counterfactual that is technically computable but substantively wrong.
Validating the baseline before reuse is the prevention.
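The three checks lend themselves to automation before any join runs. A sketch under assumed metadata fields; the field names, thresholds, and figures are illustrative, not a real interface:

```python
# Pre-join validation of a secondary baseline against primary data.
# Metadata fields and values are illustrative assumptions.

def validate_baseline(primary_meta, secondary_meta):
    """Return a list of problems; an empty list means the baseline is usable."""
    problems = []
    # 1. Geographic match: the baseline must cover where participants are.
    missing = set(primary_meta["states"]) - set(secondary_meta["states"])
    if missing:
        problems.append(f"Baseline missing participant states: {sorted(missing)}")
    # 2. Time-period match: compare the same years.
    if primary_meta["year"] != secondary_meta["year"]:
        problems.append(
            f"Period mismatch: {primary_meta['year']} vs {secondary_meta['year']}")
    # 3. Methodology match: e.g. survey vs administrative records.
    if primary_meta["method"] != secondary_meta["method"]:
        problems.append(
            f"Method mismatch: {primary_meta['method']} vs {secondary_meta['method']}")
    return problems

primary = {"states": {"OH", "TX", "GA"}, "year": 2024, "method": "survey"}
national = {"states": {"OH", "TX", "GA"}, "year": 2022, "method": "survey"}
for problem in validate_baseline(primary, national):
    print(problem)  # flags the 2024-vs-2022 period mismatch
```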