The CenSoc team is pleased to announce the release of CenSoc Version 2.0, which links the 1940 Census to Social Security mortality records. This version uses an improved matching method, the ABE method developed by Abramitzky, Boustan, and Eriksson (2012, 2014, 2017, 2020). We implement both a standard and a conservative variant of this method, so that researchers can test the robustness of their results across different samples, in line with best practices. Version 2.0 also includes several expanded and improved variables.

For a complete list of Version 2.0 revisions to the CenSoc datasets, please see the revision history page.

New Approach to Record Linking (ABE)

The overarching goal of CenSoc Version 2.0 was to increase the quality of matches. The key difference between CenSoc Version 1.0 and Version 2.0 is the use of the ABE fully automated linkage method. For Version 1.0, we used an exact match on first name, last name, and date of birth (and birthplace for Numident). For Version 2.0, we used the more sophisticated ABE-exact algorithm, which standardizes names (e.g., Bill to William) and then performs exact matching on names (and birthplace) while allowing for some flexibility on year of birth. This reduces the number of false matches and increases the total number of matches. We implement two variants:

  • Standard: Exact match on first name, last name, and place of birth (CenSoc-Numident only), with ±2 years of flexibility on year of birth
  • Conservative: The subset of standard matches whose names are unique within ±2 years of birth, both within and between datasets
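
The two variants above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CenSoc production code: the record fields (`id`, `first`, `last`, `byear`), the tiny nickname table, and all function names are hypothetical, and birthplace is left out for brevity. The sketch also keeps a link only when each record selects the other, which follows the ABE approach as we understand it from the published descriptions.

```python
from collections import defaultdict

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}

def name_key(rec):
    """Standardized (first, last) name pair, e.g. Bill -> william."""
    first = "".join(ch for ch in rec["first"].lower() if ch.isalpha())
    last = "".join(ch for ch in rec["last"].lower() if ch.isalpha())
    return (NICKNAMES.get(first, first), last)

def index_by_name(records):
    idx = defaultdict(list)
    for rec in records:
        idx[name_key(rec)].append(rec)
    return idx

def one_way(source, target, window, conservative):
    """Map each source id to at most one target id."""
    src_idx, tgt_idx = index_by_name(source), index_by_name(target)
    links = {}
    for rec in source:
        key, year = name_key(rec), rec["byear"]
        # Standard: unique on name + birth year within its own file.
        # Conservative: unique on name within +/- window years.
        radius = window if conservative else 0
        own = [r for r in src_idx[key] if abs(r["byear"] - year) <= radius]
        if len(own) != 1:
            continue
        if conservative:
            near = [r for r in tgt_idx[key] if abs(r["byear"] - year) <= window]
            if len(near) != 1:
                continue
        # Try an exact birth year first, then widen one year at a time.
        for gap in range(window + 1):
            cands = [r for r in tgt_idx[key] if abs(r["byear"] - year) == gap]
            if len(cands) == 1:
                links[rec["id"]] = cands[0]["id"]
            if cands:
                break  # unique match found, or ambiguous at this gap
    return links

def abe_link(census, deaths, window=2, conservative=False):
    """Keep only links on which both directions agree."""
    fwd = one_way(census, deaths, window, conservative)
    bwd = one_way(deaths, census, window, conservative)
    return sorted((c, d) for c, d in fwd.items() if bwd.get(d) == c)

# Toy records, invented for illustration only.
census = [
    {"id": "c1", "first": "Bill", "last": "Moore", "byear": 1910},
    {"id": "c2", "first": "John", "last": "Smith", "byear": 1905},
    {"id": "c3", "first": "John", "last": "Smith", "byear": 1906},
    {"id": "c4", "first": "Anna", "last": "Lee", "byear": 1912},
]
deaths = [
    {"id": "d1", "first": "William", "last": "Moore", "byear": 1911},
    {"id": "d2", "first": "John", "last": "Smith", "byear": 1905},
    {"id": "d3", "first": "Anna", "last": "Lee", "byear": 1915},
]

print(abe_link(census, deaths))
print(abe_link(census, deaths, conservative=True))
```

On these toy records the standard variant keeps two links (c1 to d1 and c2 to d2), while the conservative variant keeps only c1 to d1, because two John Smiths born within two years of each other appear in the census file.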

For a more detailed description of the ABE fully-automated approach, see the helpful resources maintained on the Historical Record Linking page. For more information on our specific implementation, please see the CenSoc methods protocol.

Mortality Analyses

The size and richness of the CenSoc project enable researchers to conduct “high resolution” mortality studies, furthering our understanding of mortality determinants and disparities. To what extent are research findings robust across CenSoc datasets constructed with different record linkage methods?

We carry out basic mortality analyses using three different samples based on different linking methods and compare estimated OLS regression coefficients across the three samples:

  1. Our Version 1.0 exact match (“Exact”)
  2. Our Version 2.0 ABE-exact standard approach (“ABE-standard”)
  3. Our Version 2.0 ABE-exact conservative approach (“ABE-conservative”)

The figure below shows the association between wage-income decile in 1940 and longevity for men in the CenSoc-DMF. Estimated coefficients for the ABE-conservative and Version 1.0 “Exact” samples are nearly identical. The ABE-standard sample shows some evidence of attenuation bias: in this framework, false matches tend to bias the OLS coefficient estimates toward zero, understating the magnitude of any effect. The attenuation is most pronounced for the highest and lowest wage deciles.
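
To see why false matches pull coefficients toward zero, consider a small simulation. It is only a sketch under simplifying assumptions that are ours, not a description of the CenSoc data: a single linear slope stands in for the decile coefficients, a fixed share of records (`false_rate`) is falsely matched, and a falsely matched record receives the outcome of a randomly chosen person, unrelated to its own covariate. Under those assumptions the expected slope shrinks by roughly the factor (1 - false_rate). All parameter values are invented for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a one-variable OLS regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def simulate(n=50000, beta=0.5, false_rate=0.2, seed=1):
    """Return (slope with correct links, slope with some false links)."""
    rng = random.Random(seed)
    x = [rng.randint(1, 10) for _ in range(n)]            # e.g. wage decile
    y_true = [70 + beta * xi + rng.gauss(0, 5) for xi in x]  # e.g. age at death
    y_linked = list(y_true)
    for i in range(n):
        if rng.random() < false_rate:
            # False match: the outcome comes from a random other record.
            y_linked[i] = y_true[rng.randrange(n)]
    return ols_slope(x, y_true), ols_slope(x, y_linked)

true_slope, linked_slope = simulate()
print(f"correct links: {true_slope:.3f}, with false links: {linked_slope:.3f}")
```

With a 20 percent false-match share, the slope estimated on the contaminated sample should come out close to 80 percent of the slope estimated on correctly linked records.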

The table below shows the education gradient for DMF birth cohorts of 1906-1915 for the three different samples.

Our in-house explorations across a range of mortality analyses have found the estimated coefficients for the ABE-conservative and Version 1.0 “Exact” samples to be very similar. This is promising for researchers using CenSoc Version 1.0, as results will be largely consistent between Version 1.0 and Version 2.0 (ABE-conservative sample).

The ABE-standard sample gives researchers additional flexibility. For some analyses of smaller population subgroups, the benefit of the larger size of the ABE-standard sample may outweigh the disadvantage of the attenuation bias introduced by false matches.

Resolved April Birthday Issue

Allowing for flexibility on year of birth resolves an issue in CenSoc Version 1.0: people born in April were systematically linked at a lower rate than those born in other months. (When we imputed birth year from the age at census, we assumed the census was taken on census day, April 1, 1940. However, enumeration was occasionally delayed, resulting in an incorrectly imputed birth year.)
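
The arithmetic behind the April issue can be illustrated with a short Python sketch. It assumes, for illustration only, that a delayed enumerator recorded the person's age on the day of the visit rather than on census day; the birth dates, the visit date, and the function names are hypothetical.

```python
from datetime import date

CENSUS_DAY = date(1940, 4, 1)

def age_on(birth, when):
    """Completed years of age on a given date."""
    had_birthday = (when.month, when.day) >= (birth.month, birth.day)
    return when.year - birth.year - (0 if had_birthday else 1)

def age_shift(birth, enumerated):
    """Difference between age at the visit and age on census day."""
    return age_on(birth, enumerated) - age_on(birth, CENSUS_DAY)

late_visit = date(1940, 4, 20)  # hypothetical delayed enumeration
for birth in (date(1910, 4, 10), date(1910, 7, 10)):
    print(birth, age_on(birth, CENSUS_DAY), age_on(birth, late_visit),
          age_shift(birth, late_visit))
```

A person born on April 10, 1910 is 29 on census day but 30 at the April 20 visit, so a birth year imputed under the April 1 assumption is off by one year. A person born in July is 29 on both dates. An exact match on birth year drops the first case, while a ±2 year window tolerates it.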

The left panel in the figure below shows the extent of the issue in CenSoc-DMF Version 1.0, and the right panel shows the resolution of the issue in CenSoc-DMF Version 2.0.

Story by Casey Breen (caseybreen@berkeley.edu). All errors are his alone.