The CenSoc datasets link the complete-count 1940 U.S. Census to Social Security mortality records. The unprecedented size and richness of CenSoc allow investigators to make new discoveries about geographic, racial, and class-based disparities in mortality in the United States. To work with a full CenSoc dataset, researchers will need to download a CenSoc file, download the 1940 Census file from IPUMS, and merge the two files on the HISTID variable.
The figure below gives an overview of how researchers merge the CenSoc file and the 1940 Census file on the unique identifier variable HISTID to create their final dataset for analysis.
Note: The 1940 Census file is large (10+ GB) — we recommend having an appropriate workflow for handling large datasets in R before getting started.
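One way to keep memory use manageable is to read only the columns an analysis needs. The sketch below assumes readr 2.0 or later (for the `col_select` argument) and uses a small temporary CSV as a stand-in for the real Census extract; the file contents and column choices are illustrative only.

```r
library(readr)

## toy stand-in for a large census extract (hypothetical values)
toy_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    HISTID = c("A1", "B2", "C3"),
    RACE   = c(1, 2, 1),
    AGE    = c(34, 51, 27)
  ),
  toy_path
)

## read only the columns needed, and keep HISTID as a character string
census_small <- read_csv(
  toy_path,
  col_select = c(HISTID, RACE),
  col_types  = cols(HISTID = col_character(), .default = col_double())
)
```

For the real extract, replace `toy_path` with the path to the downloaded file and list the variables your analysis uses.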
Download the CenSoc-DMF or CenSoc-Numident file from: https://censoc.berkeley.edu/data/
Whether the CenSoc-DMF or CenSoc-Numident file is a better choice for your analysis will depend on the research question. See the data page for more information.
Because the CenSoc datasets link the 1940 Census to mortality records, the matching 1940 Census file must be obtained from IPUMS.
IPUMS provides integrated census and survey data from across the world free of charge to the broader research community. To access the IPUMS-USA data collection, you first need to register.
Once you have an account, proceed to https://usa.ipums.org/usa/ and, under ‘CREATE YOUR CUSTOM DATA SET’, select ‘GET DATA’.
Select the 1940 Full Count Census:
This returns you to the variable selection page.
All extracts include HISTID by default. This is the variable used to link the census file to the CenSoc file.
Choose variables for your analysis. For example, to include RACE, hover over the ‘PERSON’ tab and select ‘RACE, ETHNICITY, AND NATIVITY’.
The IPUMS “select cases” feature allows users to conditionally choose which cases (person records) to include in an extract. This can be helpful if you are only interested in a subset of the Census. For example, if you are working with the CenSoc-DMF file, which includes only men, it makes sense to restrict your extract to men only.
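If an extract was created without this restriction, the same subset can be taken in R after download. A minimal sketch, assuming the extract includes the IPUMS SEX variable with its usual coding (1 = male, 2 = female; confirm against the codebook that comes with your extract). The toy data are made up.

```r
library(dplyr)

## toy stand-in for a census extract (hypothetical values)
census <- tibble(
  HISTID = c("A1", "B2", "C3", "D4"),
  SEX    = c(1, 2, 1, 2)
)

## keep men only (assumes SEX == 1 codes male)
census_men <- census %>%
  filter(SEX == 1)
```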
To work with IPUMS data in R, it is usually easiest to download the data as a CSV file. To do this, on the EXTRACT REQUEST page, next to ‘DATA FORMAT’, click Change, select ‘Comma delimited (.csv)’ and submit.
You can work with other formats in R as well, but CSV is generally the easiest. The only downside is that variable values are stored as numeric codes without labels. The ipumsr package helps assign variable labels, value labels, and more.
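For a handful of variables, labels can also be attached by hand. The sketch below assumes the IPUMS general RACE codes 1 = White and 2 = Black/African American; the codebook for your extract lists the full set and should be treated as the authority. The toy data are hypothetical.

```r
library(dplyr)

## toy stand-in for a census extract (hypothetical values)
census <- tibble(
  HISTID = c("A1", "B2", "C3"),
  RACE   = c(1, 2, 1)
)

## attach labels to the numeric codes; codes not listed in `levels` become NA
census_labelled <- census %>%
  mutate(
    RACE_LABEL = factor(
      RACE,
      levels = c(1, 2),
      labels = c("White", "Black/African American")
    )
  )
```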
Once you are satisfied with your dataset, click the ‘SUBMIT EXTRACT’ button. Because it is full count data, you will need to agree to special usage terms. Click OK to extract the dataset.
Given the size of the file, processing may take several hours. Once the file is ready, you will receive an email from IPUMS with a link to download the resulting dataset. The IPUMS datasets will be compressed in a .zip file, so you will have to extract the file after the download. For more information on IPUMS extracts, please see IPUMS-USA.
After downloading the 1940 Census and CenSoc files, the files must be merged before analysis. The HISTID variable, available in both the CenSoc and Census files, can be used to merge the two datasets.
Sample R code:
library(tidyverse)
## read in censoc file
censoc <- read_csv('path/to/censoc/file.csv')
## read in census file
census <- read_csv('path/to/census/file.csv')
## Join the census files by HISTID
merged_analysis_file <- censoc %>%
  inner_join(census, by = "HISTID")
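After a merge like this, it can be worth checking that HISTID is unique in each file and that every CenSoc record found a census match. The following is a sketch on toy data, not part of the original workflow; the `death_age` column name and all values are hypothetical.

```r
library(dplyr)

## toy stand-ins for the CenSoc and census files (hypothetical values)
censoc <- tibble(HISTID = c("A1", "B2"), death_age = c(72, 80))
census <- tibble(HISTID = c("A1", "B2", "C3"), RACE = c(1, 2, 1))

## HISTID should identify exactly one person in each file
stopifnot(!anyDuplicated(censoc$HISTID), !anyDuplicated(census$HISTID))

merged <- inner_join(censoc, census, by = "HISTID")

## CenSoc records with no census match (ideally zero rows)
unmatched <- anti_join(censoc, census, by = "HISTID")

n_merged    <- nrow(merged)
n_unmatched <- nrow(unmatched)
```

If `n_unmatched` is greater than zero, a likely cause is that the census extract was restricted (for example with “select cases”) in a way that excluded some CenSoc records.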