Secondary Data: When to Use External Sources for Faster Insights
Secondary data offers speed and scale primary collection can't match. Learn when to use external sources, where to find them, and how to validate quality.
A program officer at a community foundation opens a spreadsheet she downloaded from her state's labor department. The dataset shows employment outcomes by county, by year, by industry — clean, well-documented, publicly available, and free. She plans to use it to evaluate whether the workforce programs her foundation funds are producing outcomes above or below the regional baseline. Three weeks later, her evaluation team tells her the comparison isn't defensible. The state counts anyone working 20+ hours as "employed"; the foundation's grantees count only full-time, benefits-eligible placements. The state data includes workers under 24; the foundation's cohorts don't. Two datasets, two different definitions of the same word — and the mismatch was invisible in the column headers.
This is The Provenance Gap: the contextual information about how data was collected, defined, and bounded that does not travel with the data when it changes hands. Every dataset you reuse carries invisible assumptions you cannot see until they break your analysis. This article explains what secondary data is, how its types and sources differ, where examples show up in nonprofit evaluation work, and how to tell when secondary data is the right tool — versus when it is masking the need for primary collection.
Last updated: April 2026
Secondary Data · Foundational Guide
Reused data carries invisible assumptions
Secondary data is information someone else collected for a different purpose. The file is clean. The context that made it trustworthy almost never travels with it — and that's where most analyses break.
The Provenance Gap: the contextual information about how data was collected, defined, and bounded that does not travel with the data when it changes hands. Every reused dataset carries invisible assumptions you cannot see until they break your analysis.
14.8k
Monthly searches for "secondary data"
18–36 mo
Typical publication lag on government datasets
5
Major source categories, each with a different Provenance Gap
0
Secondary datasets that attribute outcomes to your program alone
What is secondary data?
Secondary data is information that was collected, compiled, or recorded by someone other than the researcher using it, for a purpose other than the current research question. Government statistics, academic databases, administrative records from other organizations, published survey datasets, and your own organization's historical records all qualify. Unlike primary data — which you design and collect yourself — secondary data arrives with a fixed definition, a fixed sampling frame, and a fixed collection moment that you cannot change.
The practical consequence: you can only decide whether the data as it exists fits the question you are trying to answer. You cannot redefine "employed" to mean full-time. You cannot add a demographic field that was never collected. You cannot re-interview respondents from three years ago to ask the question you actually care about now. Tools like SurveyMonkey and Qualtrics export clean files that look identical whether the underlying questions were answered by your stakeholders last month or by a different population three years ago — the format hides the provenance.
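This constraint can be enforced mechanically rather than remembered. Below is a minimal sketch (Python, with hypothetical source names and definitions, not a real API) of the kind of definition check that would have surfaced the foundation's two meanings of "employed" before the analysis began, not three weeks in.

```python
# A minimal, hypothetical sketch: record each source's operational definition
# and refuse to compare columns whose definitions differ. Source names and
# definition strings are illustrative, not drawn from any real codebook.

OPERATIONAL_DEFINITIONS = {
    ("state_labor_dept", "employed"): "any work of 20+ hours/week, ages 16+",
    ("grantee_reports", "employed"): "full-time, benefits-eligible placement, ages 24+",
}

def assert_comparable(source_a: str, source_b: str, variable: str) -> None:
    """Raise before analysis if two sources define the same variable differently."""
    def_a = OPERATIONAL_DEFINITIONS[(source_a, variable)]
    def_b = OPERATIONAL_DEFINITIONS[(source_b, variable)]
    if def_a != def_b:
        raise ValueError(
            f"'{variable}' is not comparable across sources:\n"
            f"  {source_a}: {def_a}\n"
            f"  {source_b}: {def_b}"
        )

# Raises ValueError, which is the point: the mismatch surfaces before the
# analysis starts, not after commitments have been made to funders.
assert_comparable("state_labor_dept", "grantee_reports", "employed")
```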
Secondary data falls into two broad categories along the most important axis: who collected it. Internal secondary data comes from your own organization's historical records: past intake forms, case management files, donor CRM exports, program activity logs, previous evaluation surveys. External secondary data comes from outside sources: government statistics, academic research datasets, industry reports, and other organizations' published findings. The internal/external distinction matters because The Provenance Gap presents differently in each case: internal data is archaeologically hard to reconstruct; external data is evaluatively hard to fit.
A second axis separates quantitative secondary data (census tabulations, administrative counts, survey aggregations, financial records) from qualitative secondary data (interview transcripts, policy documents, case studies, archived meeting minutes). The analytical tools differ dramatically — thematic coding versus statistical regression — but the provenance risk is the same. Both require auditing before they enter an analysis.
A third practical axis separates published secondary data (peer-reviewed datasets, government reports, industry publications) from unpublished secondary data (internal records, gray literature, administrative files shared on request). Published sources typically have better methodological documentation; unpublished sources often have better contextual fit to a specific local question.
Sources of secondary data
The five main sources of secondary data each carry a different Provenance Gap signature:
Government and public agencies — census bureaus, labor departments, health ministries, court records, tax authorities. Methodologically rigorous, statistically representative, typically free. The main gap is temporal: publication lag of 18–36 months means the data describing 2024 program outcomes often does not arrive until 2026.
Academic and research institutions — published datasets, longitudinal cohort studies (Framingham Heart, Add Health, PSID), dissertation archives. Well-documented with codebooks and technical reports. The main gap is population fit: an academic sample drawn for a national study rarely matches the specific population your program serves.
Industry and trade bodies — sector associations, industry benchmarking studies, trade publication surveys. Current and contextually relevant to a specific field. The main gap is methodological transparency: sampling methods and response rates are frequently undisclosed.
Internal organizational records — your own past intake data, donor CRM exports, program activity logs, historical impact reports. Contextually aligned to your work. The main gap is definitional drift across time — the intake form from 2021 measured different variables than the 2024 form, and the 2024 form's "mandatory" fields were "optional" when the 2022 cohort filled them in.
Commercial data providers — market research firms (Nielsen, Gartner, Forrester), syndicated consumer panels, third-party behavioral datasets. Current and often granular. The main gap is cost and transparency: syndicated panels rarely disclose sampling methodology in full and single reports run from a few hundred to tens of thousands of dollars.
Internal vs external secondary data
The internal/external split is the most consequential distinction in secondary data strategy because it determines what kind of audit work you need to do before using the dataset.
Internal secondary data looks safer because your organization collected it. But the program director who designed the 2021 intake form left two years ago. The definition of "program completion" shifted when the grant scope changed. The field you want to analyze was added to the form mid-cohort. Internal data requires archaeological work — rebuilding the collection context from whoever is still around, whatever documentation survived, and any version history the system retained. Most nonprofit organizations discover, mid-analysis, that their own historical records have The Provenance Gap baked into them just as severely as any external source.
External secondary data is collected by a party outside your organization entirely. The methodological transparency is often better than internal records — peer-reviewed datasets come with codebooks; your 2021 intake form does not. But the definitional alignment with your specific question is almost never exact. External data requires evaluative work — assessing fit between the data as designed and the question you are trying to answer, and documenting every mismatch explicitly so downstream readers understand the limits.
The practical rule: internal secondary data requires archaeology; external secondary data requires evaluation. Both require audit before analysis. Neither is safer by default.
Examples of secondary data
Concrete examples from nonprofit and foundation evaluation work:
Labor market outcomes — State unemployment insurance records, Bureau of Labor Statistics county-level employment data, O*NET occupational profiles
Education attainment — NCES IPEDS datasets, state education agency graduation files, National Student Clearinghouse research reports
Health outcomes — CDC BRFSS survey data, state hospital discharge databases, Medicare/Medicaid claims aggregates, County Health Rankings
Demographic baselines — US Census ACS 5-year estimates, American Housing Survey, decennial Census tabulations
Philanthropic landscape — Candid 990 data, Giving USA annual reports, Foundation Center grant records
Economic context — Federal Reserve district reports, Bureau of Economic Analysis regional data, Treasury opportunity zone designations
Your own historical intake — Last three cohorts of program enrollment forms, pre/post survey exports from 2020–2024, past logic model frameworks, legacy Salesforce records
Each of these examples has legitimate uses. None of them can substitute for primary data collection when the question requires causal attribution to your program's specific intervention.
Six principles
How to work with secondary data without paying the Provenance Gap
Foundations, programs, and research teams reuse external and internal datasets every day. These six principles decide whether the reuse holds up under scrutiny.
01
Principle 01
Audit before analysis
Every dataset enters analysis only after six questions are answered in writing: who, why, when, who was in the frame, how variables are defined, and what is missing from the documentation.
Skipping the audit is how foundations end up comparing two versions of "employed" that mean different things.
02
Principle 02
Separate internal from external
Internal secondary data requires archaeological work to reconstruct context. External secondary data requires evaluative work to assess fit. The audit steps differ; both are non-negotiable.
Internal data feels safer — which is exactly why its Provenance Gap is the one most often missed.
03
Principle 03
Use it as context, not evidence
Secondary data is the outer ring around primary evidence — a baseline, a benchmark, a comparison frame. It cannot carry causal attribution to your program because it did not observe your participants.
If your evaluation rests on secondary aggregates alone, the attribution claim will not survive a funder's first question.
04
Principle 04
Design primary for what's missing
Once the audit exposes the gap, design primary collection for the three variables secondary data cannot give you: participant-level attribution, qualitative narrative of change, and longitudinal tracking of your actual cohort.
The audit without the complementary primary design leaves the most important question unanswered.
05
Principle 05
Name the publication lag
Government datasets typically trail real events by 18–36 months. Comparing 2024 program outcomes to 2022 baseline data is a temporal misalignment that must be disclosed, not glossed.
Readers who catch a silent temporal mismatch will stop trusting the rest of the analysis.
06
Principle 06
Document every compromise
Write the limitations section before the findings section. Every definitional mismatch, every temporal gap, every sampling-frame difference should be stated plainly in the methods — not buried in an appendix.
A transparent limitations section is the strongest signal of analytical discipline. Funders read it first.
These six principles work together. Audit, separate, contextualize, complement, time-align, and document.
Before any secondary dataset enters an analysis, answer six questions in writing:
Who collected this and why? Purpose alignment — the original collection purpose must be compatible with your current use, not identical to it.
When was it collected? Temporal alignment — publication date is not collection date; check both.
Who was included in the sampling frame? Population alignment — what does the dataset represent, and does that match who you need to describe?
How are the key variables defined? Definitional alignment — employment, completion, success, engagement all have many operational definitions.
What's missing from the documentation? Gap identification — every undocumented decision is a hidden assumption you will inherit.
What would I need to add with primary collection to answer my actual question? Complementary design — the most important question, and the one most often skipped.
Most organizations skip step 6. That is where The Provenance Gap does its damage — not because secondary data was used, but because the primary data that should have accompanied it was never designed.
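One lightweight way to make "answered in writing" literal is to store the audit as a structured record alongside the dataset itself. A minimal sketch, assuming nothing beyond the Python standard library; the field values are illustrative, not real BLS documentation:

```python
from dataclasses import dataclass

@dataclass
class ProvenanceAudit:
    """One record per secondary dataset, answering the six questions above."""
    dataset: str
    collector_and_purpose: str   # 1. Who collected this and why?
    collection_period: str       # 2. When was it collected (not just published)?
    sampling_frame: str          # 3. Who was included in the frame?
    key_definitions: dict        # 4. How are the key variables defined?
    documentation_gaps: list    # 5. What is missing from the documentation?
    primary_complement: str      # 6. What must primary collection add?

audit = ProvenanceAudit(
    dataset="BLS county employment, 2022 vintage",
    collector_and_purpose="Bureau of Labor Statistics; regional labor statistics",
    collection_period="2022 (published 2024)",
    sampling_frame="Covered workers ages 16+, statewide",
    key_definitions={"employed": "any work of 20+ hours/week"},
    documentation_gaps=["no benefits-eligibility flag", "no program linkage"],
    primary_complement="participant-level placements tracked with persistent IDs",
)
```

If question 6 cannot be filled in, the audit has found the gap that matters most.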
Nonprofit archetypes
Three program shapes, same Provenance Gap
Every nonprofit reuses data — government baselines, prior-cohort intake, partner records. The gap shows up in the same place across very different program models.
A workforce nonprofit runs three distinct programs — job readiness, apprenticeship placement, and business incubation. The evaluation team wants to compare outcomes across programs against a regional baseline. The Provenance Gap shows up in two places at once: their own historical intake records drift in definitions year over year, and the BLS data they pull for context uses occupation categories that don't match their own taxonomy.
01
Source
BLS county data
industry-coded, 24-mo lag
02
Internal
Three intake forms
definitions shifted yearly
03
Evaluation
Cross-program report
comparability breaks here
Traditional stack
Secondary as foundation
BLS industry categories forced onto program taxonomy
Each program's historical intake coded differently — no unifying schema
24-month BLS lag silently compared to 2024 program data
Analyst spends weeks reconciling definitions after collection
With Sopact Sense
Primary designed for the gap
Single participant schema across all three programs
Persistent ID chain from intake → exit → follow-up
BLS used explicitly as context ring, not attribution
Intelligent Column themes qualitative responses as they arrive
A national nonprofit funds 17 implementing partners across 9 states. Each partner collects intake data with their own forms, their own case management systems, and their own definitions of "active participant." The Provenance Gap is multiplied by 17: the headquarters team is effectively analyzing 17 different secondary datasets when they try to aggregate across the network.
01
Partners
17 partner exports
each with its own schema
02
HQ
Reconciliation month
manual ID matching, coding
03
Board
Quarterly report
always 4–6 weeks stale
Traditional stack
17 Provenance Gaps, stacked
Each partner's secondary data has its own collection context
HQ team spends a month reconciling before analysis can start
"Active participant" means something different at every site
Board report is always one quarter behind reality
With Sopact Sense
One schema across the network
HQ defines instruments once; partners collect into the same schema
Persistent IDs carry across every partner site
No reconciliation month — data is comparable by design
HQ sees live roll-ups, partners see their own cohort
A youth-development nonprofit runs one flagship program — 18 months, four cohorts a year, 120 participants per cohort. The evaluation team wants to report on two-year outcomes against regional youth benchmarks. The Provenance Gap is temporal: the most recent public benchmark is already 2–3 years old by the time their 2024 cohort hits the 24-month mark, and their own 2022 intake records use an earlier version of the outcome questions.
01
Benchmark
Public dataset
lagged 2–3 years
02
Own records
2022 intake data
earlier question version
03
Report
Two-year outcomes
time and definitions drift
Traditional stack
Hoping the drift is invisible
2024 cohort compared to 2022-vintage benchmarks without disclosure
Follow-up reuses intake records with mismatched question wording
Cohort IDs lost when the case manager changed systems mid-program
Two-year claim rests on secondary comparisons that never aligned
With Sopact Sense
Primary design holds the line
Same instrument version used across cohorts — comparable by construction
Persistent participant IDs across 24 months, no ID loss on staff change
Secondary benchmarks used only as explicit context, lag named in prose
Longitudinal analysis available the day the 24-month response arrives
The Provenance Gap does not depend on your program shape. It shows up wherever inherited data carries definitional, temporal, or sampling assumptions you did not make. The response is the same in all three cases: design primary for what secondary cannot give.
Secondary data belongs as the outer ring around primary evidence, not as the evidence itself.
Correct use: "Our program's 12-month employment retention rate is 74%; the regional baseline from BLS data is 58%. Our cohort outperformed the regional baseline by 16 points." The secondary data provides comparison context. The primary data provides the attributable outcome.
Incorrect use: "We reviewed state employment data and saw improvement in counties where we operate; therefore our program contributed to the improvement." No causal attribution is possible without primary measurement of your actual participants. Secondary aggregate data cannot tell you what happened to the individuals your program served.
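In code terms, the correct pattern is a one-line subtraction that stays honest about what each number can claim. A trivial sketch using the figures from the example above:

```python
program_retention = 0.74   # primary: measured on your own participants
regional_baseline = 0.58   # secondary: BLS aggregate, context ring only

delta_points = round((program_retention - regional_baseline) * 100)
print(f"Cohort outperformed the regional baseline by {delta_points} points.")

# What this comparison cannot support: "the region improved because of us."
# Attribution needs participant-level primary measurement, not the baseline.
```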
Design primary collection for the evidence secondary cannot give
Once you have audited your secondary sources and identified the gap, design primary collection specifically for the variables secondary data cannot provide. For most nonprofit programs, that means three things:
Participant-level outcome attribution — did the person who went through your program change on the outcome you care about? Secondary aggregate data cannot answer this.
Qualitative narrative of change — what did participants say changed for them, in their own words? No external dataset contains this.
Longitudinal tracking of your actual cohort — did the change hold at 6 months, 12 months, 24 months? Only participant-level collection with persistent IDs supports this, and it is the foundation of any defensible longitudinal study.
This is where platforms like Sopact Sense fit — as the origin for primary data designed specifically for the evidence secondary data cannot provide. Persistent stakeholder IDs assigned at first contact mean the same participant at intake, at exit, and at 12-month follow-up is the same row forever. No ID reconciliation, no "which Sarah is this?" mid-analysis, no lost linkage when a staff member leaves.
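A minimal sketch of what that linkage looks like downstream, using pandas; the column names, IDs, and values are hypothetical. With one stable participant_id, intake, exit, and follow-up join without any reconciliation step:

```python
import pandas as pd

# Three touchpoints, collected months apart, keyed on one persistent ID.
intake = pd.DataFrame({"participant_id": ["p001", "p002"],
                       "confidence_pre": [2, 3]})
exit_survey = pd.DataFrame({"participant_id": ["p001", "p002"],
                            "confidence_post": [4, 4]})
followup_12mo = pd.DataFrame({"participant_id": ["p001"],
                              "employed_12mo": [True]})

# One key, three touchpoints: no ID matching by name, no lost linkage
# when a staff member leaves or a system changes mid-program.
longitudinal = (intake
                .merge(exit_survey, on="participant_id", how="left")
                .merge(followup_12mo, on="participant_id", how="left"))
print(longitudinal)
```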
Dimensional comparison
Where secondary data falls short — and what primary adds
Twelve dimensions across collection, analysis, and reporting. Secondary data earns its place on most of them. Attribution is not one of them.
Risk 01
Definition drift
Variables with the same name mean different things across sources — even within your own historical records.
Usually discovered mid-analysis, after commitments to funders.
Risk 02
Sampling-frame gap
The population the dataset represents does not match the population your program serves, and the mismatch is rarely flagged.
Most visible when generalizing to your cohort.
Risk 03
Temporal misalignment
Government publication lag means your 2024 outcomes are compared to 2022 baselines, often without disclosure.
Silent by default until a reader catches it.
Risk 04
Attribution ceiling
Aggregate secondary data cannot tie outcomes to the specific individuals in your program — only primary collection can.
The ceiling that ends most secondary-only evaluations.
The comparison
Secondary data alone vs. primary collection designed for the gap
Section 01 · Collection design

| Dimension | Secondary data alone | Primary collection with Sopact Sense |
| --- | --- | --- |
| Definitional control: who decides what a variable means | None. Definitions locked by the original collector; inherited as-is. | Full. You define every variable at design; schema enforced at collection. |
| Sampling frame: who the data describes | Fixed. Frame set by the source; overlap with your cohort unknown. | Your cohort. Participants you serve, by design, 1:1 with the intervention. |
| Temporal alignment: when the data describes | Lagged 18–36 mo. Government and academic sources publish years after collection. | Real-time. Data available for analysis the moment it enters the system. |
| Persistent IDs: linking one person across touchpoints | Rarely. Aggregated datasets, even panel studies, lose participant identity for external users. | Built in. IDs assigned at first contact keep the same participant linked across intake, exit, and follow-up. |
Secondary data audited. Primary collection designed. That combination is what holds up under a funder's first pointed question. Sopact Sense is the origin layer that makes the primary half viable.
Avoid the three common misuses of secondary data
Three misuses show up repeatedly in nonprofit evaluation:
Substituting secondary for primary because it is cheaper. Budget pressure pushes evaluators to skip primary collection. The result is an evaluation with no attributable outcome measurement, only context. Funders increasingly recognize this pattern and ask directly for participant-level evidence.
Ignoring the publication lag. Government data is often 18–36 months old by the time it is usable. If your program ran in 2024 and the most recent comparable BLS data is from 2022, your comparison is already misaligned on time. Name the lag explicitly in reporting so readers understand what is being compared; a minimal lag check is sketched after this list.
Treating internal records as ground truth. Your own past data carries the same Provenance Gap risks as external data — sometimes worse, because internal users assume internal data is authoritative. Run the same six-question audit on your 2022 intake records that you would run on BLS.
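Naming the lag can be a habit rather than a judgment call. A minimal sketch, assuming illustrative dates and the 18-month threshold cited above:

```python
from datetime import date

def lag_in_months(collected: date, compared_to: date) -> int:
    """Months between the period the baseline describes and the period your outcomes occurred."""
    return (compared_to.year - collected.year) * 12 + (compared_to.month - collected.month)

baseline_collected = date(2022, 6, 1)  # period the BLS baseline describes
program_period = date(2024, 6, 1)      # period your outcomes occurred

lag = lag_in_months(baseline_collected, program_period)
if lag >= 18:  # low end of the typical government publication lag
    print(f"Disclose in the methods: baseline trails outcomes by {lag} months.")
```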
Done well, secondary data is a force multiplier for impact measurement strategy. Done poorly, it is a defensibility risk that only becomes visible when a funder asks a pointed question.
Masterclass
Primary vs secondary data — when to use each, and how to combine both
Secondary data is information originally collected by someone other than the current researcher, for a purpose other than the current research question. Common examples include census statistics, academic study data, industry reports, and an organization's own historical records. Unlike primary data, secondary data has a fixed definition and sampling frame that you cannot change.
What is the secondary data definition?
The secondary data definition is data that was collected, compiled, or recorded by a party other than the current researcher, for a purpose other than the current use. The defining characteristic is that all collection decisions — what was measured, who was included, when, and how — were made by someone else and cannot be modified by the person now reusing the data.
What are the types of secondary data?
Secondary data falls into two primary types: internal (your organization's historical records) and external (government, academic, industry, or commercial sources). A secondary classification separates quantitative secondary data (numerical aggregations, administrative counts, survey tabulations) from qualitative secondary data (transcripts, policy documents, case studies, archived meeting minutes).
What are the sources of secondary data?
The main sources of secondary data are government and public agencies, academic and research institutions, industry and trade bodies, internal organizational records, and commercial data providers. Each source carries a different Provenance Gap risk — government data has long publication lags, commercial data lacks methodological transparency, internal records suffer from definitional drift across time.
What is external secondary data?
External secondary data is data collected by a party outside your organization and made available for reuse. Examples include census tabulations, government labor statistics, academic longitudinal studies like Framingham Heart or Add Health, industry benchmarking reports, and commercial market research panels. External data typically has better methodological documentation than internal records but rarely matches your exact population frame.
What is internal secondary data?
Internal secondary data is historical data your own organization collected for a previous purpose, now being reused for a new question. Examples include past intake forms, donor CRM records, program activity logs, and prior evaluation surveys. Internal data feels safer because your organization owned the collection, but it typically carries definitional drift as forms, staff, and program scopes changed over years.
What are examples of secondary data?
Examples of secondary data include US Census ACS estimates, BLS county-level employment data, NCES education statistics, CDC BRFSS health survey data, Candid 990 nonprofit filings, academic longitudinal studies like Framingham or Add Health, and your own organization's past intake records. Each has legitimate uses, and each carries its own Provenance Gap that must be audited before use.
What is secondary data collection?
Secondary data collection is the systematic process of sourcing, auditing, and preparing existing data for reuse. Unlike primary data collection — which involves designing instruments and collecting new responses — secondary data collection focuses on provenance auditing, source evaluation, documentation review, and assessing fit between the data as it exists and the current research question.
What is the difference between primary and secondary data?
Primary data is collected by the researcher specifically for the current research question; secondary data is reused from a prior collection effort by someone else. Primary data has definitional control and temporal alignment; secondary data has cost, scale, and availability advantages but inherits all the collection decisions of whoever originally compiled it. See the dedicated primary vs secondary data comparison for a full breakdown.
What is The Provenance Gap?
The Provenance Gap is the contextual information about how data was collected, defined, and bounded that does not travel with the data when it changes hands. It is the invisible tax on every reused dataset — definitions drift, sampling frames do not align, collection moments do not match your question. The Provenance Gap is why secondary data should provide context, not carry attribution.
What are the advantages and disadvantages of secondary data?
Advantages of secondary data: lower cost, faster availability, larger samples than most organizations can collect themselves, and baseline context for comparison. Disadvantages: fixed definitions you cannot change, publication lag that misaligns time periods, unknown or poorly documented sampling decisions, and inability to attribute outcomes to your specific program intervention without paired primary measurement.
When should nonprofits use secondary data?
Nonprofits should use secondary data for regional baselines, environmental scanning, comparative context, and literature reviews — any use where aggregate context complements primary evidence. Nonprofits should not rely on secondary data alone for participant-level attribution, longitudinal outcome tracking with their actual cohort, or qualitative narratives of change from the people they directly serve.
How much does secondary data cost?
Government and most academic secondary data is free or low-cost. Commercial datasets from providers like Nielsen, Gartner, or syndicated consumer panels range from a few hundred to tens of thousands of dollars per report. The real cost of secondary data, however, is the time spent on provenance audit and the risk of building analysis on data that does not actually fit the question being asked.
Next step
Close the Provenance Gap at the source
Secondary data works for context. Attribution, longitudinal tracking, and participant narrative require primary collection designed from the start. Sopact Sense is the origin layer — built for the evidence secondary cannot give.
Persistent participant IDs assigned at first contact
Definitions, disaggregation, and schema fixed before collection starts
Intelligent Column themes qualitative responses as they arrive
Secondary data stays in its rightful role: context, not foundation