Getting Started with CenSoc

CenSoc Project

2020-07-17

CenSoc Roadmap

  1. Download CenSoc File
  2. Download 1940 census data from IPUMS
  3. Merge the CenSoc and census files on the HISTID variable

Note: The 1940 census file is large (10+ GB) — we recommend having an appropriate workflow for handling large datasets in R before getting started.
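
One possible way to keep memory use down, once the files are downloaded, is to read only the columns needed for the analysis. The sketch below assumes the data.table package is installed (it is not part of the tidyverse code used later in this guide) and uses a tiny made-up CSV as a stand-in for the real extract; the column names other than HISTID are illustrative.

```r
library(data.table)

# Hypothetical stand-in for a large census extract: a tiny CSV written to a
# temporary file so the example can run anywhere.
census_path <- tempfile(fileext = ".csv")
writeLines(c(
  "HISTID,SEX,RACE,AGE",
  "A1,1,1,34",
  "B2,2,1,29",
  "C3,1,2,41"
), census_path)

# Read only the columns needed; columns left out of `select` are not loaded.
# HISTID is forced to character so the identifier is never parsed as a number.
census_small <- fread(
  census_path,
  select = c("HISTID", "SEX", "AGE"),
  colClasses = list(character = "HISTID")
)

names(census_small)
nrow(census_small)
```

On the real extract, the same call would point at the downloaded file instead of the temporary one.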

Downloading CenSoc File

Download the CenSoc-DMF or CenSoc-Numident file from: https://censoc-download.demog.berkeley.edu

Whether the CenSoc-DMF or CenSoc-Numident file is a better choice for your analysis will depend on the research question. See the data page for more information.

Downloading 1940 Full-Count Census from IPUMS

The CenSoc datasets link records from the 1940 Census to mortality records. To use census variables in your analysis, download the 1940 full-count census from IPUMS.

IPUMS provides integrated census and survey data from across the world free of charge to the broader research community. To access the IPUMS-USA data collection, you first need to register.

Create IPUMS extract

Once you have an account, go to https://usa.ipums.org/usa/ and under ‘CREATE YOUR CUSTOM DATA SET’ click the ‘GET DATA’ button.

Select data sample

Select the 1940 Full Count Census:

  • Click the ‘SELECT SAMPLES’ button. This will take you to a page with all possible census and ACS data available.
  • Uncheck the ‘Default sample from each year’ box
  • Click the ‘USA FULL COUNT’ tab
  • Check the 1940 100% box
  • Click the ‘SUBMIT SAMPLE SELECTIONS’ button

This will take you back to the variable selection page.

Select Variables for Analysis

All extracts include HISTID by default. HISTID is the variable used to link the census file to the CenSoc file.

Choose variables for your analysis. For example, you could select RACE, which is under PERSONAL → RACE, ETHNICITY, AND NATIVITY.

Select Cases

The IPUMS “select cases” feature allows users to conditionally choose which cases (person records) to include in an extract. This can be helpful if you are only interested in a subset of the Census. For example, if you are working with the CenSoc-DMF file, which includes only men, it makes sense to restrict your extract to men only.
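
If an extract was created without this restriction, the same subset can be taken locally after download. A minimal sketch, assuming the extract includes the SEX variable with the IPUMS coding 1 = male and 2 = female; the rows below are made-up placeholder data, not real census records.

```r
library(dplyr)

# Made-up placeholder rows standing in for a census extract
census <- tibble(
  HISTID = c("A1", "B2", "C3"),
  SEX    = c(1, 2, 1),
  AGE    = c(34, 29, 41)
)

# Keep men only (SEX == 1), mirroring a "select cases" restriction on SEX
census_men <- census %>%
  filter(SEX == 1)

census_men
```

Restricting cases at extract time is still preferable where possible, since it reduces the size of the download itself.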

Downloading the IPUMS extract

To work with IPUMS data in R, it is usually easiest to download the data as a CSV file. To do this, on the EXTRACT REQUEST page, next to ‘DATA FORMAT’, click Change, select ‘Comma delimited (.csv)’ and click the submit button.

You can work with other formats in R as well, but CSV is generally the easiest. The main downside is that values in a CSV extract are stored as numeric codes without labels. The ipumsr package helps assign variable labels, value labels, and more.
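
To illustrate what working with numeric codes looks like, here is a small sketch that attaches labels by hand using base R. It assumes the IPUMS coding of SEX (1 = male, 2 = female); the input vector is made up. The ipumsr package can do this for all variables at once from the extract's codebook, so this manual approach is only a fallback.

```r
# Made-up vector of SEX codes as they would appear in a CSV extract
sex_codes <- c(1, 2, 1, 2, 2)

# Convert the numeric codes to a labeled factor
sex_labeled <- factor(
  sex_codes,
  levels = c(1, 2),
  labels = c("Male", "Female")
)

table(sex_labeled)
```

Codes for other variables should be taken from the IPUMS codebook for the extract rather than guessed.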

Once you are happy with your dataset, click the ‘SUBMIT EXTRACT’ button. Because it is full count data, you will need to agree to special usage terms. Click OK to extract the dataset.

Given the size of the file, the processing may take several hours. Once the file is ready, you will receive an email from IPUMS with a link to download the resulting dataset. As all IPUMS datasets come compressed, the data needs to be uncompressed before it can be used. For more information on IPUMS extracts, please see IPUMS-USA.
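
As an alternative to uncompressing the file on disk first, R can read a gzip-compressed CSV through a connection. The sketch below uses base R only and builds a tiny compressed file in a temporary location so it is runnable; the file contents are made-up placeholders, and the actual name of an IPUMS download will differ.

```r
# Create a small gzip-compressed CSV as a stand-in for a downloaded extract
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(c("HISTID,SEX", "A1,1", "B2,2"), con)
close(con)

# Read the compressed file through a gzfile() connection; HISTID is kept as
# character so the identifier is not converted to a number
census <- read.csv(
  gzfile(gz_path),
  colClasses = c(HISTID = "character")
)

str(census)
```

For a file of 10+ GB, reading through a compressed connection can be slow, so uncompressing once and reading the plain CSV may still be the more practical choice.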

Merging CenSoc and Census

After downloading the 1940 Census and CenSoc files, the files must be merged before analysis. The HISTID variable — available in both CenSoc and Census files — can be used to merge the two datasets.

Sample R code:

library(tidyverse)

## read in censoc file
censoc <- read_csv('path/to/censoc/file.csv')

## read in census file
census <- read_csv('path/to/census/file.csv')

## join the CenSoc and census files by HISTID, keeping only records found in both
merged_df <- censoc %>%
  inner_join(census, by = "HISTID")
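
Before analysis, it can be worth checking that the merge behaved as expected. The following is a minimal sketch of such checks on made-up data: the HISTID values and the columns byear, dyear, and RACE are placeholders chosen for illustration, and the toy tables stand in for the real files.

```r
library(dplyr)

# Made-up placeholder data standing in for the CenSoc and census files
censoc <- tibble(
  HISTID = c("A1", "B2", "C3"),
  byear  = c(1906, 1911, 1899),
  dyear  = c(1988, 1995, 1979)
)
census <- tibble(
  HISTID = c("A1", "B2", "D4"),
  RACE   = c(1, 2, 1)
)

# HISTID should identify each person once per file; duplicates would
# multiply rows in the join
stopifnot(!anyDuplicated(censoc$HISTID), !anyDuplicated(census$HISTID))

# Same join as above
merged_df <- censoc %>%
  inner_join(census, by = "HISTID")

# CenSoc records with no matching census record
unmatched <- censoc %>%
  anti_join(census, by = "HISTID")

# Share of CenSoc records that found a census match
match_rate <- nrow(merged_df) / nrow(censoc)
match_rate
```

A low match rate often points to an extract that excluded relevant cases (for example, through a "select cases" restriction) or to HISTID being read with the wrong type.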