The CenSoc Project is a public resource. We have endeavored to make it easy for new users. If you have a question or comment, please do not hesitate to contact us at censoc@berkeley.edu.
How do I get started?
The easiest way to learn about CenSoc is to read through one of the vignettes we have written in R that walk through an example of how to use CenSoc data to analyze disparities in mortality.
You can download the full CenSoc data set quickly and easily. But in order to make use of the 1940 census variables, you will need to download a large data set from IPUMS, which will take several hours.
Instead, we recommend downloading the demo dataset that already has several linked variables (wage income, race, and home ownership) for a 1% sample. This is what we use for the vignettes.
You can also look at the full list of 1940 census variables at the IPUMS 1940 website. Make sure to look at both harmonized and unharmonized variables in order to make sure you don’t miss anything.
How is working with matched administrative data different from working with a sample survey?
Administrative data is big, with record numbers in the many millions, not thousands. It can take a bit of practice to work with these large data sets. But the payoff can be substantial, because precise estimates can be made even for relatively small groups, allowing high resolution studies of mortality with the CenSoc data.
Administrative data does not represent a scientific random sample, but rather the totality of what we were able to obtain and make available. In the case of the Death Master File (DMF) and the Numident NARA records, it is not precisely clear which records were included. However, the total counts of deaths by age and year indicate high coverage: more than 90 percent of deaths over age 65 for the periods for which we release data. Weights are available to adjust for under-reporting of deaths.
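As a rough illustration of how coverage weights of this kind work, the sketch below computes an inverse-coverage weight for each demographic cell. The counts, the cell definition (sex and year of death), and the function name are all assumptions made for the example; the weights distributed with CenSoc are built by the project team and may be constructed differently.

```python
# Hypothetical death counts by (sex, year of death) cell.
# All numbers are invented for illustration only.
linked_deaths = {("F", 1990): 400, ("M", 1990): 500}
reference_deaths = {("F", 1990): 1000, ("M", 1990): 1000}


def coverage_weights(linked, reference):
    """Return reference deaths divided by linked deaths for each cell."""
    weights = {}
    for cell, n_linked in linked.items():
        if n_linked > 0:  # cells with no linked deaths cannot be weighted
            weights[cell] = reference[cell] / n_linked
    return weights


weights = coverage_weights(linked_deaths, reference_deaths)
```

Applying each weight to the linked deaths in its cell reproduces the reference total for that cell, which is the sense in which the weights adjust for under-reporting.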
What interesting variables are available in the 1940 Census?
The 1940 census included a number of individual-level variables for the first time, including wage income, homeownership, and educational attainment. Coming at the end of the Great Depression, it also included items on participation in the New Deal.
Researchers can also construct small geographic area variables, e.g., average wage income or racial composition, which can be used to measure the socio-economic environment in which people lived.
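A minimal sketch of building such an area-level variable follows. The area identifiers, wage values, and function name are hypothetical; real analyses would use the geographic and income variables from the IPUMS 1940 census file.

```python
from collections import defaultdict

# Hypothetical person records: (area identifier, wage income in dollars).
records = [("county_A", 1200), ("county_A", 800), ("county_B", 500)]


def mean_by_area(rows):
    """Average the second field of each row within each area."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for area, wage in rows:
        totals[area] += wage
        counts[area] += 1
    return {area: totals[area] / counts[area] for area in totals}


area_mean_wage = mean_by_area(records)

# Attach the area average to every person as a contextual variable.
with_context = [(area, wage, area_mean_wage[area]) for area, wage in records]
```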
What interesting variables are available from the Social Security death records?
The DMF includes dates of birth and death. The Numident additionally contains ZIP code of residence at the time of death, birthplace, and race, as well as parental names.
What is the easiest way to measure mortality with CenSoc data?
For the oldest cohorts, the extinct cohort method can be used to measure age-specific mortality directly at old ages.
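The idea behind the extinct cohort method can be sketched in a few lines: once every member of a birth cohort has died, the number alive at age x equals the sum of deaths at age x and above, so the probability of dying at age x is deaths at x divided by that sum. The death counts and function name below are invented, and the sketch assumes the cohort is fully extinct and that all of its deaths are observed.

```python
def extinct_cohort_qx(deaths_by_age):
    """Return the probability of dying at each age for one extinct cohort.

    deaths_by_age maps age -> number of deaths at that age.
    """
    qx = {}
    survivors = 0
    # Work backwards from the oldest age, accumulating deaths to
    # reconstruct the number alive at the start of each age.
    for age in sorted(deaths_by_age, reverse=True):
        survivors += deaths_by_age[age]
        qx[age] = deaths_by_age[age] / survivors if survivors else float("nan")
    return qx


# Hypothetical deaths for a single birth cohort at the oldest ages.
deaths = {95: 50, 96: 30, 97: 15, 98: 5}
qx = extinct_cohort_qx(deaths)
```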
Standard multivariate regression is useful for analyzing the contours and covariates of group differences in mortality. However, the truncated ages we observe, which differ by cohort, can lead to biased results. We recommend including birth-year fixed effects in all models, and estimating an adjustment factor to translate regression model estimates into effect sizes measured in years of life expectancy at age 65. For further reading, see Breen and Goldstein (2022, pp. 121-123).
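To make the role of birth-year fixed effects concrete, here is a minimal sketch for a single covariate. With one covariate, including a fixed effect for each birth year is equivalent to demeaning both the covariate and the outcome within birth year before fitting the slope. The data and function name are hypothetical, and the sketch does not compute the truncation adjustment factor mentioned above.

```python
from collections import defaultdict


def within_slope(rows):
    """OLS slope of age at death on x with birth-year fixed effects.

    rows is an iterable of (birth_year, x, age_at_death).
    """
    groups = defaultdict(list)
    for birth_year, x, y in rows:
        groups[birth_year].append((x, y))
    sxy = 0.0
    sxx = 0.0
    for obs in groups.values():
        x_mean = sum(x for x, _ in obs) / len(obs)
        y_mean = sum(y for _, y in obs) / len(obs)
        # Only variation within a birth year contributes to the estimate.
        for x, y in obs:
            sxy += (x - x_mean) * (y - y_mean)
            sxx += (x - x_mean) ** 2
    return sxy / sxx


# Hypothetical data: x is a 0/1 indicator, y is age at death.
rows = [(1910, 0, 78.0), (1910, 1, 80.0), (1915, 0, 73.0), (1915, 1, 75.0)]
slope = within_slope(rows)
```

In this toy data the later cohort dies younger only because its observation window closes earlier, and the within-cohort comparison is unaffected by that shift.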
We have also developed a maximum likelihood estimation method for truncated data. This method is described in Goldstein et al., 2023, and can be implemented with the R package gompertztrunc.
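The general shape of a truncated likelihood can be sketched as follows. This is not the gompertztrunc implementation: the parameterization (hazard a * exp(b * x) with age measured from zero), the parameter values, the ages, and the single shared observation window are all assumptions made for illustration. Each density is divided by the probability of dying inside the window, which is what accounts for deaths outside the window being unobserved.

```python
import math


def gompertz_survival(x, a, b):
    """Probability of surviving to age x under hazard a * exp(b * x)."""
    return math.exp(-(a / b) * (math.exp(b * x) - 1.0))


def gompertz_density(x, a, b):
    """Density of age at death: hazard times survival."""
    return a * math.exp(b * x) * gompertz_survival(x, a, b)


def truncated_loglik(ages, lower, upper, a, b):
    """Log-likelihood of ages at death observed only within [lower, upper]."""
    # Probability that a death falls inside the observation window.
    mass = gompertz_survival(lower, a, b) - gompertz_survival(upper, a, b)
    return sum(math.log(gompertz_density(x, a, b)) - math.log(mass) for x in ages)


# Hypothetical ages at death, all inside the window [65, 100].
ages_at_death = [70.5, 78.2, 83.0, 88.7]
ll = truncated_loglik(ages_at_death, lower=65.0, upper=100.0, a=5e-5, b=0.1)
```

In practice the window differs by birth cohort, so each record would carry its own lower and upper bounds, and the parameters would be chosen by a numerical optimizer to maximize this quantity.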
For comparing time trends from one cohort to another, more sophisticated approaches are needed. Bayesian methods have been developed by Alexander.
How do CenSoc v3.0, v2.0, and v1.0 differ?
CenSoc v1.0, our first release of the data, employed an exact matching strategy. This means that records in the 1940 Census and Social Security mortality data were matched exactly on name (without any standardization or cleaning) and year of birth.
For CenSoc v2.0, we implemented a different record linking method, the ABE method developed by Abramitzky, Boustan, and Eriksson (2012, 2014, 2017, 2020). It increased the number of linked records by 30% (to 7.4 million) in CenSoc-DMF and 16% in CenSoc-Numident (to 7.9 million). Our goal was to produce more and better matches for the research community.
The ABE algorithm standardizes names (e.g., Bill to William) and then performs an exact match on names and birthplace, while allowing for some flexibility on year of birth. We implemented two variants of the ABE algorithm. First, we employed the ABE-standard algorithm to produce the full set of matched records. Next, we used the ABE-conservative algorithm to identify the subset of ABE-standard matches meeting stricter criteria for declaring a match. A flag variable reports the algorithm used to establish a match.
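The matching logic described above can be sketched roughly as follows. This is a simplified, one-directional illustration and not the ABE or CenSoc code: the nickname table, field names, two-year band, and function names are all assumptions. A census record is linked only when exactly one death record shares its standardized name and birthplace at the smallest birth-year gap where any candidate appears; ambiguous cases are dropped.

```python
# Tiny illustrative nickname table; real applications use much larger ones.
NICKNAMES = {"bill": "william", "bob": "robert"}


def standardize(name):
    """Lowercase a first name and map known nicknames to a full form."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)


def match_key(rec):
    """Fields that must agree exactly after standardization."""
    return (standardize(rec["first"]), rec["last"].strip().lower(), rec["birthplace"])


def link(census, deaths, max_gap=2):
    """Link records on exact key, widening the birth-year gap step by step."""
    links = []
    for c in census:
        for gap in range(max_gap + 1):
            candidates = [
                d for d in deaths
                if match_key(d) == match_key(c)
                and abs(d["birth_year"] - c["birth_year"]) == gap
            ]
            if len(candidates) == 1:
                links.append((c["id"], candidates[0]["id"]))
                break
            if len(candidates) > 1:
                break  # ambiguous at this gap: discard the record
    return links


census = [
    {"id": "c1", "first": "Bill", "last": "Smith", "birthplace": "Ohio", "birth_year": 1900},
    {"id": "c2", "first": "John", "last": "Doe", "birthplace": "Texas", "birth_year": 1910},
]
deaths = [
    {"id": "d1", "first": "William", "last": "Smith", "birthplace": "Ohio", "birth_year": 1901},
    {"id": "d2", "first": "John", "last": "Doe", "birthplace": "Texas", "birth_year": 1911},
    {"id": "d3", "first": "John", "last": "Doe", "birthplace": "Texas", "birth_year": 1909},
]
links = link(census, deaths)
```

Here the first census record links to the death record one birth year away, while the second is discarded because two candidates are equally close.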
In CenSoc v3.0 we introduce more advanced statistical weights. Using CDC mortality data, we weight to race and birthplace alongside age, year, and sex, allowing better adjustment for disparities in mortality coverage among demographic groups. Version 3.0 datasets use the same ABE matching method as version 2.0. However, we limit published links to those established with the ABE-conservative algorithm, which are less likely to contain false matches.
You can check out CenSoc’s revision history here. For a more detailed description of the ABE method, see the helpful resources maintained on the Historical Record Linking page.
Can I still download previous versions of CenSoc?
While older versions of CenSoc data (versions 1.0, 2.0, and 2.1) are no longer available to download, we can provide them on request. Please email censoc@berkeley.edu if you are interested in obtaining a previous data release.
How do I cite CenSoc data?
Joshua R. Goldstein, Monica Alexander, Casey Breen, Andrea Miranda González, Felipe Menares, Maria Osborne, Mallika Snyder, Ugur Yildirim. CenSoc Mortality File: Version 3.0. Berkeley: University of California, 2023.
Please also adhere to IPUMS-USA citation guidelines when using 1940 Census data.
Does the CenSoc project have any open positions?
Yes. We have openings for postdocs and graduate students, particularly those with statistical and programming experience who are interested in studying mortality disparities.
How can I contribute?
Be a user, ask us questions, and give us feedback. Cite the data in your publications and presentations. Spread the word. We put a lot of work into making this public resource available, and we want people to put it to good use. We welcome you to contact us at censoc@berkeley.edu if you have questions or want to discuss your work.
What can’t I do with CenSoc data? What is it bad for?
CenSoc data has two chief limitations. First, because we don’t observe survivors, the usual methods of mortality analysis, such as life tables and survival analysis, are not directly usable. Instead, alternative methods are needed that are suitable for studying truncated cohort distributions of deaths. Second, CenSoc data observes individuals at a single moment in time, the 1940 census, and has no information on the period between 1940 and the time of death.
What other resources are available for population-level analysis of mortality in the United States?
We highly recommend the Human Mortality Database, a joint project of the Berkeley Demography Department and the Max Planck Institute for Demographic Research.
The National Center for Health Statistics makes aggregate mortality data available through its WONDER system.
The individual-level data set we modeled CenSoc on is the National Longitudinal Mortality Study (NLMS), which links the Current Population Survey since the 1970s to mortality records. It is a rich resource.
Is CenSoc available for genealogists?
There are no restrictions on the use of the CenSoc data, but it is not particularly useful for genealogists. See the “severe limitations” on the usefulness of IPUMS data for genealogists.
The BUNMD individual records contain identifying information. However, the data files are very large (several GB) and require statistical software (or very good computer skills) to work with. We believe most or all of the individual records are already indexed at ancestry.com.
How do I access identifiable information for individuals?
CenSoc is built with publicly released records from the 1940 Census and the Social Security Administration. The census records include individual names and street addresses. The SSA records include names and Social Security Numbers.
The publicly released CenSoc data sets (CenSoc-DMF and CenSoc-Numident) include a unique identifier for linking to the public release of the IPUMS 100% 1940 census, but they do not include names, street addresses, or Social Security numbers.
We are happy to share this identifying information with researchers interested in these variables, validating or extending our record linkage methods, or linking CenSoc to other data sources. Researchers will need access to a Full Count Census Repository. Please contact us for details.
What is missing from the current CenSoc releases and what can we expect from future data releases?
The CenSoc team plans to link mortality records to the 1950 Census when the full count data become available sometime in 2024. We also plan to publish siblingships identified in the BUNMD and CenSoc-Numident in the near future.