ipumsr Workflow for Big CenSoc Data

Summary: This vignette presents a workflow for analyzing CenSoc data using the ipumsr package, which provides an easy way to import IPUMS data and its associated metadata into R. It explains why CenSoc users may want to use the ipumsr package and outlines its basic functionality.
Before getting started with the vignette, you’ll need to:
Install the following packages if necessary (use the install.packages() function): tidyverse and ipumsr
Download the CenSoc-Numident or CenSoc-DMF file
Download the 1940 full count census and accompanying DDI from IPUMS (instructions below)
The original R notebook (.Rmd file) for this vignette can be downloaded here.
ipumsr package

IPUMS provides integrated census and survey data from across the world free of charge to the broader research community. IPUMS is a terrific resource for the social science community: they clean and harmonize data and offer an interactive extract system that allows users to select only the samples and variables relevant for their research question. To access the IPUMS-USA data collection, you first need to register.
The ipumsr package, created by Greg Freedman Ellis at the Institute for Social Research and Data Innovation, reads in IPUMS data and its associated metadata, such as variable descriptions and value labels, into R. It also has some helpful functions for working with big IPUMS datasets. In the context of CenSoc, it’s particularly helpful for assigning meaningful text labels to numeric codes (e.g., for the SEX variable, 1 = “Male”, 2 = “Female”).
For a more thorough coverage of the ipumsr package, please visit the ipumsr website.
Once you have an account, visit https://usa.ipums.org/usa/ to make a data extract. Select ‘GET DATA’, listed under ‘CREATE YOUR CUSTOM DATA SET.’
Make an extract of the 1940 Full Count Census.
Now you’re ready to select variables!
All extracts will, by default, include HISTID, the variable used to link the census file to the CenSoc file.
Choose variables for your analysis. Note that variables are available at both the person and household level. To select the person-level variable RACE, navigate to PERSON → RACE, ETHNICITY, AND NATIVITY. To select household-level variable RENT, navigate to HOUSEHOLD → ECONOMIC CHARACTERISTICS.
The IPUMS “select cases” feature allows users to conditionally choose which records to include in an extract. If you are only interested in a subset of the Census, this is a great way to reduce the size of your extract! For example, if you are working with the CenSoc-DMF file, which includes only men, it would make sense to restrict your cases to men.
To change the data format, visit the ‘EXTRACT REQUEST’ page and click ‘Change,’ which is located next to ‘DATA FORMAT.’ Then select ‘Comma delimited (.csv)’ or ‘Fixed-width text (.dat)’ and click the submit button. Once you are happy with your dataset, click the ‘SUBMIT EXTRACT’ button. Agree to the full-count usage terms, and click ‘OK’ to extract the dataset.
Given the size of the extract, it may take a few hours before it’s available for download. Once the file is ready, you will receive an email from IPUMS with a link to download the resulting dataset. For more information on IPUMS extracts, please see IPUMS-USA.
To read in IPUMS data with the ipumsr package, you’ll need to download both the data file (.csv or .dat) and the accompanying DDI file (.xml).
To work with the full CenSoc files, you’ll first need to (i) read in the 1940 census, (ii) read in the CenSoc file, and (iii) merge them on the HISTID variable. Note that we also have pre-linked CenSoc “demo” files. Please see this vignette for more information.
One challenge of working with the full CenSoc files is that you must work with large files. For example:
If both datasets fit into your computer or server’s memory, merging the datasets on the HISTID variable is straightforward.
## Library Packages
library(tidyverse) ## functions for data manipulation and visualization
library(ipumsr) ## reads in IPUMS data and associated metadata

## read in censoc file
censoc <- read_csv("/path/to/censoc/censoc_numident_v1.csv")

## read in census file with ipumsr package
census <- read_ipums_micro(data_file = '/path/to/ipums/usa_00036.csv.gz',
                           ddi = '/path/to/ipums/usa_00036.xml')

## join the censoc and census files on HISTID
censoc_numident_linked <- inner_join(censoc, census, by = "HISTID")

A memory-conscious solution is to break the census dataset up and work in chunks. The ipumsr package has the read_ipums_micro_chunked() function, which reads in a set number of observations at a time (chunks). The code below uses that function to (i) read in a chunk of the 1940 census, (ii) merge that chunk to the CenSoc file on the HISTID variable, and (iii) repeat for every chunk and then combine all the merged chunks. Then you’ll be ready to get started on your analysis.
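To see the logic of the chunked merge without the full files, here is a minimal sketch on toy data. It is only an illustration: it uses base R's split() in place of read_ipums_micro_chunked(), and the IDs, the AGE and death_age values, and the chunk size are all made up.

```r
library(dplyr)

## Toy stand-ins for the real files (all values are made up)
census <- tibble(HISTID = sprintf("ID%02d", 1:10), AGE = 21:30)
censoc <- tibble(HISTID = c("ID02", "ID05", "ID09", "ID99"),
                 death_age = c(70, 81, 65, 90))

## (i) break the census into chunks of at most 4 rows
chunk_size <- 4
chunk_id <- ceiling(seq_len(nrow(census)) / chunk_size)
chunks <- split(census, chunk_id)

## (ii) merge each chunk to the censoc file on HISTID
merged_chunks <- lapply(chunks, function(chunk) {
  inner_join(chunk, censoc, by = "HISTID")
})

## (iii) combine all the merged chunks
linked <- bind_rows(merged_chunks)

## the chunked result has the same matched records as a single join
direct <- inner_join(census, censoc, by = "HISTID")
identical(linked$HISTID, direct$HISTID)
```

Only the rows that match a CenSoc record are kept from each chunk, which is why the combined result stays much smaller than the census file itself.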
Depending on how many variables you’ve included in your 1940 census extract, this approach may take a while. For reference, the code below took approximately 50 minutes on a 2018 MacBook Pro with 16GB of installed memory (RAM). The good news is that you only need to do this once. See the instructions below on how to save your merged data file.
Note: This approach requires 10GB+ of available RAM.
## Read in censoc file
censoc <- read_csv("/path/to/censoc/censoc_numident_v1.csv")

## Set path to IPUMS data file (.csv or .dat file)
ipums_data <- '/path/to/census/usa_00033.csv'

## Set path to IPUMS DDI file
ipums_ddi <- '/path/to/census/ddi/usa_00033.xml'

## Read in data in chunks and merge each chunk with censoc
censoc_numident_linked <- read_ipums_micro_chunked(
  ddi = ipums_ddi,
  data_file = ipums_data,
  callback = IpumsDataFrameCallback$new(function(x, pos) {
    inner_join(x, censoc, by = "HISTID")
  }),
  chunk_size = 500000
)
#> Use of data from IPUMS USA is subject to conditions including that users should
#> cite the data appropriately. Use command `ipums_conditions()` for more details.
#> |======================================================== | 90% 16963 MB

There are a few other strategies for handling datasets too large for memory in R:
If you import and link your CenSoc file to the 1940 Census using a different approach, you can still use the ipumsr package to assign meaningful value labels to the linked file. Just use the ipums_collect() function.
ipumsr package functionality

There are several helpful functions in the ipumsr package to work with the metadata. In this vignette, we’ll only cover a few.
The ipums_view() function will display a webpage with variable descriptions and value labels in the RStudio viewer. This is an easy way to learn more about the 1940 census variables.
The ipumsr package imports the associated metadata, such as variable labels, value labels, and more from the IPUMS extract. Particularly helpful are value labels, which translate the numeric IPUMS codes into meaningful text strings (e.g., the SEX variable has value labels 1 = “Male”, 2 = “Female”).
The ipumsr package stores labelled values using the labelled class from the haven package. The main way to create a factor variable from these labels is the as_factor() function (note: the base R as.factor() function will not work).
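As a minimal sketch of how this works, the toy vector below mimics a labelled SEX variable using the codes mentioned above. The vector itself is made up, and it is built directly with the haven package rather than read from an IPUMS extract.

```r
library(haven)

## Toy labelled vector: numeric codes with attached value labels
sex <- labelled(c(1, 2, 2, 1), labels = c(Male = 1, Female = 2))

## as_factor() uses the value labels as the factor levels
sex_factor <- as_factor(sex)
levels(sex_factor)
as.character(sex_factor)
```

The numeric codes remain in the original vector, so you can keep both versions side by side.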
For example, if you’re interested in birthplace, you can convert the numeric BPL into a meaningful text string variable using the as_factor() function:
## Look at value labels for birthplace
ipums_val_labels(censoc_numident_linked$BPL)
## A tibble: 163 x 2
#     val lbl
#   <dbl> <chr>
# 1     1 Alabama
# 2     2 Alaska
# 3     4 Arizona
# 4     5 Arkansas
# 5     6 California
# 6     8 Colorado

## create a new variable with text labels for birthplace
censoc_numident_linked$BPL_string <- as_factor(censoc_numident_linked$BPL)

## alternative method to create the same variable (tidyverse style)
censoc_numident_linked <- censoc_numident_linked %>%
  mutate(BPL_string = as_factor(BPL))

## print out a few rows
censoc_numident_linked %>%
  select(HISTID, BPL, BPL_string) %>%
  sample_n(5)

| HISTID | BPL | BPL_string |
|---|---|---|
| A25606D5-4E07-460D-9DF4-F23C89440249 | 6 | California |
| DB43E7FC-9077-4A20-890A-65CB8F7E51C9 | 17 | Illinois |
| DD0ED6E5-2058-4FBC-A830-D38FC6425DA5 | 17 | Illinois |
| 5AB5B1E4-D332-49DB-A8C9-1F6E4F459771 | 39 | Ohio |
| FEA7EC29-BDF7-41FD-9955-5442D56D8DC2 | 26 | Michigan |
We now can use the easily interpretable BPL_string variable for our analysis.
In some cases, it may be useful to keep the original numeric codes. For example, if you want to restrict to persons born in North America, it’s easier to do so with the original BPL codes, as the codes are ordered in a meaningful way—you just restrict to BPL codes below 200. It’s more difficult, however, if you have to specify each individual state and country in North America.
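A minimal sketch of that restriction, on a made-up three-row data set, might look like the following. The IDs and the label for code 200 are placeholders, and zap_labels() from the haven package is used here only to make the comparison on the plain numeric codes explicit.

```r
library(dplyr)
library(haven)

## Toy linked file (IDs and the label for code 200 are placeholders)
toy <- tibble(
  HISTID = c("id-1", "id-2", "id-3"),
  BPL = labelled(c(6, 17, 200),
                 labels = c(California = 6, Illinois = 17,
                            "Other (toy label)" = 200))
)

## Restrict on the numeric codes first, then add the text labels
born_below_200 <- toy %>%
  filter(zap_labels(BPL) < 200) %>%
  mutate(BPL_string = as_factor(BPL))

born_below_200
```

A single comparison on the codes replaces what would otherwise be a long list of state and country names.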
To save your data file and its associated metadata, you need to write it out as an R object. The metadata cannot be stored in a .CSV file, so if you want to keep it, you’ll need to save the entire object to a file using the saveRDS() function.
## save censoc data file with metadata
saveRDS(object = censoc_numident_linked, file = "/path/to/data/censoc_numident_linked.rds")

## read in censoc data file with metadata
censoc_numident_linked <- readRDS("/path/to/data/censoc_numident_linked.rds")

Note: Another option is to convert all the relevant variables from numeric to meaningful factor variables at the beginning of your analysis, and then write out the .CSV file. You’ll lose the metadata, but you can always reassign it with the ipums_collect() function.
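To check the difference between the two storage formats yourself, here is a minimal sketch on a toy data set. The data are made up and are written to temporary files with base R functions, so nothing here touches your real CenSoc files.

```r
library(dplyr)
library(haven)

## Toy data with a labelled SEX column (values are made up)
toy <- tibble(
  HISTID = c("id-1", "id-2"),
  SEX = labelled(c(1, 2), labels = c(Male = 1, Female = 2))
)

## Write the same data to temporary .rds and .csv files
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

## Read both back and check whether the value labels survived
from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)
is.labelled(from_rds$SEX)
is.labelled(from_csv$SEX)
```

The .rds copy comes back as a labelled column, while the .csv copy comes back as plain codes.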
The ipumsr package offers a convenient workflow for analyzing large CenSoc datasets. While there are many ways to work with CenSoc data in R, the ipumsr package is the most efficient way to handle IPUMS value labels, variable descriptions, and more.
The ipumsr package is currently maintained by Derek Burk and has a user-support forum.