Abstract
The CenSoc datasets link individual-level 1940 Census records to Social Security death records using deterministic record linkage algorithms. In this technical report, we describe our record linkage methodology and assess the accuracy and representativeness of the CenSoc Version 2.1 matches. The main takeaways of this report are:
1. The CenSoc-DMF and CenSoc-Numident datasets consist of individuals who are broadly representative of the general population but skewed slightly towards higher socioeconomic status (e.g., 35.2% of individuals in CenSoc-DMF vs. 32.5% of individuals in the general population completed high school). Black people are underrepresented in both datasets, comprising 9.6% of the general population but only 4.8% of CenSoc-DMF and 6.5% of CenSoc-Numident. However, the Black samples are broadly representative of the general Black population. Non-representativeness has the potential to bias estimates if the outcome of interest is heterogeneous across the under- or over-represented population subgroups. To account for this, researchers can stratify by covariates such as race and education in their analysis.
2. The overall mortality-adjusted match rate for the CenSoc-DMF is 30% (18% for our set of conservative matches), while the overall mortality-adjusted match rate for CenSoc-Numident is approximately 30% for men (22% conservative) and 32% for women (24% conservative). The match rate for CenSoc-Numident is lower for earlier birth cohorts (1895-1915) because of higher rates of missingness of birthplace, a required matching field.
3. For both datasets, restricting to conservative matches reduces sample size but increases the quality of the matches. The conservative matches are comparably representative of the general population but contain fewer false matches than the standard matches. False matches introduce measurement error, which attenuates estimates within a regression framework. We generally recommend that researchers restrict to conservative matches to avoid this attenuation bias.
4. For analyses of multiple birth cohorts, we recommend including birth cohort fixed effects. Birth cohort fixed effects control for the fact that each birth cohort is observed over a different window of ages at death, and for the potential sample composition bias introduced by differential match rates across birth cohorts in CenSoc-Numident.
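
The recommendations in takeaways 3 and 4 can be illustrated with a minimal sketch. It is not the report's own code, and it runs on synthetic data rather than CenSoc records. The variable names (byear, educ_yrs, death_age, conservative) and all numeric parameters are hypothetical and chosen only for illustration. The sketch restricts to a conservative-match flag and then estimates a regression slope with birth cohort fixed effects by demeaning within each cohort.

```python
import numpy as np


def cohort_fe_slope(x, y, cohort):
    """OLS slope of y on x with cohort fixed effects (within estimator)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Map each cohort label to an integer index 0..K-1.
    _, idx = np.unique(cohort, return_inverse=True)
    counts = np.bincount(idx)
    # Subtract the cohort mean from x and y (equivalent to cohort dummies).
    x_dm = x - (np.bincount(idx, weights=x) / counts)[idx]
    y_dm = y - (np.bincount(idx, weights=y) / counts)[idx]
    return float(x_dm @ y_dm / (x_dm @ x_dm))


def naive_slope(x, y):
    """OLS slope of y on x with a single intercept and no cohort controls."""
    x_dm = x - x.mean()
    return float(x_dm @ (y - y.mean()) / (x_dm @ x_dm))


# Synthetic data only; every parameter below is an illustrative assumption.
rng = np.random.default_rng(0)
n = 20_000
byear = rng.integers(1900, 1921, size=n)                 # hypothetical birth year
educ_yrs = rng.normal(9, 3, size=n) + 0.1 * (byear - 1900)
conservative = rng.random(n) < 0.6                       # hypothetical match flag

# Later cohorts are observed dying at younger ages (a shifted observation
# window), and the assumed true effect of education is 0.2 years per year.
death_age = 90 - 0.8 * (byear - 1900) + 0.2 * educ_yrs + rng.normal(0, 5, size=n)

keep = conservative                                      # takeaway 3: restrict sample
slope_fe = cohort_fe_slope(educ_yrs[keep], death_age[keep], byear[keep])
slope_naive = naive_slope(educ_yrs[keep], death_age[keep])

print(f"with cohort fixed effects: {slope_fe:.3f}")
print(f"without cohort controls:   {slope_naive:.3f}")
```

In this simulated setup the fixed-effects estimate should land near the assumed 0.2, while the slope without cohort controls is pulled downward because later cohorts have both more schooling and a younger observed age at death. In practice a regression package with categorical birth-year terms would serve the same purpose.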