ipumsr Workflow for Big CenSoc Data
Summary: This vignette presents a workflow for analyzing CenSoc data using the ipumsr package, an easy way to import IPUMS data and its associated metadata into R. The goal of this vignette is to explain why CenSoc users may want to use the ipumsr package and to outline its basic functionality.
Before getting started with the vignette, you’ll need to:
Install the following packages if necessary (use the install.packages() function):
tidyverse
ipumsr
Download the CenSoc-Numident or CenSoc-DMF file
Download the 1940 full count census and accompanying DDI from IPUMS (instructions below)
The original R notebook (.Rmd file) for this vignette can be downloaded here.
ipumsr package
IPUMS provides integrated census and survey data from across the world free of charge to the broader research community. IPUMS is a terrific resource for the social science community: they clean and harmonize data and offer an interactive extract system that allows users to select only the samples and variables relevant for their research question. To access the IPUMS-USA data collection, you first need to register.
The ipumsr package, created by Greg Freedman Ellis at the Institute for Social Research and Data Innovation, reads in IPUMS data and its associated metadata, such as variable descriptions and value labels, into R. It also has some helpful functions for working with big IPUMS datasets. In the context of CenSoc, it’s particularly helpful for assigning meaningful text labels to numeric codes (e.g., for the SEX variable, 1 = “Male”, 2 = “Female”).
For more thorough coverage of the ipumsr package, please visit the ipumsr website.
Once you have an account, visit https://usa.ipums.org/usa/ and select ‘GET DATA’, listed under ‘CREATE YOUR CUSTOM DATA SET.’
Select the 1940 Full Count Census:
This will take you back to the variable selection page.
All extracts will, by default, include HISTID, the variable used to link the census file to the CenSoc file.
Choose variables for your analysis. For example, you could select RACE, which is under PERSONAL & RACE, ETHNICITY, AND NATIVITY.
The IPUMS “select cases” feature allows users to conditionally choose which records to include in an extract. If you are only interested in a subset of the Census, this is a great way to reduce the size of your extract! For example, if you are working with the CenSoc-DMF file, which includes only men, it would make sense to restrict your cases to men-only.
To choose the data format, visit the ‘EXTRACT REQUEST’ page and click ‘Change,’ which is located next to ‘DATA FORMAT.’ Then, select ‘Comma delimited (.csv)’ or ‘Fixed-width text (.dat)’ and click the submit button. Once you are happy with your dataset, click the ‘SUBMIT EXTRACT’ button. Agree to the full-count usage terms, and click ‘OK’ to extract the dataset.
Given the size of the extract, it may take a few hours before it’s available for download. Once the file is ready, you will receive an email from IPUMS with a link to download the resulting dataset. For more information on IPUMS extracts, please see IPUMS-USA.
To read in IPUMS data with the ipumsr package, you’ll need to download both the data file (.csv or .dat) and the accompanying DDI file (.xml).
To work with the full CenSoc files, you’ll first need to (i) read in the 1940 census, (ii) read in the CenSoc File, and (iii) merge them on the HISTID variable. Note that we also have pre-linked CenSoc “demo” files — please see this vignette for more information.
One challenge of working with the full CenSoc files is that you must work with large files. For example:
If both datasets fit into your computer or server’s memory, merging the datasets on the HISTID variable is straightforward.
## Library Packages
library(tidyverse) ## functions for data manipulation and visualization
library(ipumsr) ## reads in IPUMS data and associated metadata
## read in censoc file
censoc <- read_csv("~/path/to/censoc/censoc_numident_v2.1.csv")
## read in census file with ipumsr package
census <- read_ipums_micro(data_file = '~/path/to/ipums/usa_00036.csv.gz',
                           ddi = '~/path/to/ipums/usa_00036.xml')
## join the censoc and census files by HISTID
censoc_numident_linked <- inner_join(censoc, census, by = "HISTID")
A memory-conscious solution is to break the census dataset up and work in chunks. The ipumsr package has the read_ipums_micro_chunked() function to read in a set number of observations at a time (chunks). The code below uses that function to (i) read in a chunk of the 1940 census, (ii) merge that chunk to the CenSoc file on the HISTID variable, and (iii) repeat for every chunk and then combine all the merged chunks. Then you’ll be ready to get started on your analysis.
Depending on how many variables you’ve included in your 1940 census extract, this approach may take a while. For reference, the code below took approximately 50 minutes on a 2018 base MacBook Pro with 16 GB of memory. The good news is that you only need to do this once; see the instructions below on how to save your merged data file.
## read in censoc file
censoc <- read_csv("~/path/to/censoc/censoc_numident_v2.1.csv")
## set path to IPUMS data file (.csv or .dat file)
ipums_data <- '~/path/to/census/usa_00033.csv'
## set path to IPUMS DDI file
ipums_ddi <- '~/path/to/census/ddi/usa_00033.xml'
## Read in data in chunks and merge each chunk with censoc
## This is slow -- it takes ~an hour.
censoc_numident_linked <- read_ipums_micro_chunked(
  ddi = ipums_ddi,
  data_file = ipums_data,
  callback = IpumsDataFrameCallback$new(function(x, pos) {
    ## keep only the census records in this chunk that match a CenSoc record
    inner_join(x, censoc, by = "HISTID")
  }),
  chunk_size = 500000
)
#> Use of data from IPUMS USA is subject to conditions including that users should
#> cite the data appropriately. Use command `ipums_conditions()` for more details.
#> |======================================================== | 90% 16963 MB
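To make the chunk-and-merge logic concrete, here is a minimal base R sketch of the same three steps on toy data. It does not use ipumsr: the data frames, the birth-year column byear, and the chunk size of 4 are made up for illustration, and merge() stands in for inner_join().

```r
## Toy stand-ins for the real files (all values are made up for illustration)
census <- data.frame(
  HISTID = sprintf("ID%02d", 1:10),
  BPL = c(1, 6, 17, 39, 26, 6, 17, 1, 39, 26),
  stringsAsFactors = FALSE
)
censoc <- data.frame(
  HISTID = c("ID02", "ID05", "ID09"),
  byear = c(1910, 1915, 1920),   ## hypothetical CenSoc variable
  stringsAsFactors = FALSE
)

chunk_size <- 4  ## tiny chunk size so the toy data splits into three chunks

## (i) assign each census row to a chunk: rows 1-4, 5-8, and 9-10
chunk_id <- ceiling(seq_len(nrow(census)) / chunk_size)

## (ii) merge each chunk with censoc; merge() keeps only matching HISTIDs by default
merged_chunks <- lapply(split(census, chunk_id), function(chunk) {
  merge(chunk, censoc, by = "HISTID")
})

## (iii) combine the merged chunks into one linked data frame
linked <- do.call(rbind, merged_chunks)
rownames(linked) <- NULL
linked
```

Because each chunk is reduced to its matched records before the chunks are combined, only one chunk of the census plus the matched rows needs to be held at a time, and the result is the same as merging the full census in one step.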
There are a few other strategies for handling datasets too large for memory in R:
If you import and link your CenSoc file to the 1940 Census using a different approach, you can still use the ipumsr package to assign meaningful value labels to the linked file: just use the ipums_collect() function.
## read in IPUMS DDI for IPUMS extract
ddi_extract <- read_ipums_ddi("/path/to/ddi/usa_00033.xml")
## assign metadata using the ipumsr package
censoc_numident_linked <- ipums_collect(data = censoc_numident_linked,
                                        ddi = ddi_extract,
                                        var_attrs = c("val_labels", "var_label", "var_desc"))
ipumsr package functionality
There are several helpful functions in the ipumsr package to work with the metadata. In this vignette, we’ll only cover a few.
The ipums_view() function will display a webpage with variable descriptions and value labels in the RStudio viewer. This is an easy way to learn more about the 1940 census variables.
ipums_view(censoc_numident_linked)
The ipumsr package imports the associated metadata, such as variable labels, value labels, and more from the IPUMS extract. Particularly helpful are value labels, which translate the numeric IPUMS code into meaningful text strings (e.g., the SEX variable has value labels: 1 = “Male”, 2 = “Female”).
The ipumsr package stores labelled values using the labelled class from the haven package. The main way to create a factor variable from these labels is the as_factor() function (note: the base R as.factor() function will not work).
For example, if you’re interested in birthplace, you can convert the numeric BPL into a meaningful text string variable using the as_factor() function:
## Look at value labels for birthplace
ipums_val_labels(censoc_numident_linked$BPL)
## A tibble: 163 x 2
# val lbl
# <dbl> <chr>
# 1 1 Alabama
# 2 2 Alaska
# 3 4 Arizona
# 4 5 Arkansas
# 5 6 California
# 6 8 Colorado
## create a new string variable for birthplace
censoc_numident_linked$BPL_string = as_factor(censoc_numident_linked$BPL)
## alternative method to create new string variable for birthplace (tidyverse style)
censoc_numident_linked <- censoc_numident_linked %>%
  mutate(BPL_string = as_factor(BPL))
## look at a few rows
censoc_numident_linked %>%
  select(HISTID, BPL, BPL_string) %>% ## print out a few rows
  sample_n(5)
HISTID | BPL | BPL_string |
---|---|---|
A25606D5-4E07-460D-9DF4-F23C89440249 | 6 | California |
DB43E7FC-9077-4A20-890A-65CB8F7E51C9 | 17 | Illinois |
DD0ED6E5-2058-4FBC-A830-D38FC6425DA5 | 17 | Illinois |
5AB5B1E4-D332-49DB-A8C9-1F6E4F459771 | 39 | Ohio |
FEA7EC29-BDF7-41FD-9955-5442D56D8DC2 | 26 | Michigan |
We can now use the easily interpretable BPL_string variable for our analysis.
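The difference between a plain factor conversion and a label-aware one can be illustrated without any packages. The sketch below is a base R analogue of the idea, not the haven implementation: it assumes the value labels are available as a named vector (here called bpl_labels, a name chosen for this example) and uses the four birthplace codes from the table above.

```r
## numeric birthplace codes, as they appear in the BPL column
bpl <- c(6, 17, 17, 39, 26)

## value labels as a named vector (label = code); a stand-in for the labels
## that ipumsr attaches to a labelled variable
bpl_labels <- c(California = 6, Illinois = 17, Michigan = 26, Ohio = 39)

## base as.factor() only sees the numbers, so the levels are the codes themselves
levels(as.factor(bpl))

## label-aware conversion: map each code to its text label
bpl_string <- factor(bpl, levels = bpl_labels, labels = names(bpl_labels))
as.character(bpl_string)
```

The first conversion yields levels such as "6" and "17", while the second yields state names, which is the behavior you want when tabulating or plotting by birthplace.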
In some cases, it may be useful to keep the original numeric codes. For example, if you want to restrict to persons born in North America, it’s easier to do so with the original BPL codes, as the codes are ordered in a meaningful way—you just restrict to BPL codes below 200. It’s more difficult, however, if you have to specify each individual state and country in North America.
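As a minimal sketch of that restriction, the base R snippet below filters a toy data frame on the numeric code. The rows are made up: 6 and 17 come from the table above, while 150 and 453 are illustrative codes on either side of the 200 cutoff, so check the BPL codebook before relying on specific values.

```r
## toy linked file (placeholder IDs and illustrative BPL codes)
linked <- data.frame(
  HISTID = c("a", "b", "c", "d"),
  BPL = c(6, 17, 150, 453),
  stringsAsFactors = FALSE
)

## keep persons with a birthplace code below 200
north_america <- linked[linked$BPL < 200, ]
north_america
```

Doing the same with the text labels would require listing every state and country by name.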
To save your data file and its associated metadata, you need to write it out as an object. The metadata cannot be stored in a .CSV file, so if you want to save the metadata, you’ll need to save the entire object to a file using the saveRDS() function.
## save censoc data file with metadata
saveRDS(object = censoc_numident_linked, file = "/path/to/data/censoc_numident_linked.rds")
## read in censoc data file with metadata
censoc_numident_linked <- readRDS("/path/to/data/censoc_numident_linked.rds")
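To see why the .rds format matters, here is a small base R sketch that attaches a label attribute to a toy column and round-trips it through both formats. The data frame and the "labels" attribute are stand-ins chosen for this example (they mimic, but are not produced by, ipumsr), and the files are written to temporary paths.

```r
## toy data frame with a value-label attribute on BPL (made-up rows)
df <- data.frame(HISTID = c("a", "b"), BPL = c(6, 17), stringsAsFactors = FALSE)
attr(df$BPL, "labels") <- c(California = 6, Illinois = 17)

## temporary files for the round trip
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

## write the same object in both formats
saveRDS(df, rds_path)
write.csv(df, csv_path, row.names = FALSE)

## read both back in
from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$BPL, "labels")  ## the labels survive the .rds round trip
attr(from_csv$BPL, "labels")  ## NULL: the .csv keeps only the values
```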
Note: Another option is to convert all the relevant variables from numeric to meaningful factor variables at the beginning of your analysis, and then write out the .CSV file. You’ll lose the metadata, but you can always reassign it with the ipums_collect() function.
The ipumsr package offers a convenient workflow for analyzing large CenSoc datasets. While there are many ways to work with CenSoc data in R, the ipumsr package is the most efficient way to handle IPUMS value labels, variable descriptions, and more.
The ipumsr package is currently maintained by Derek Burk and has a user-support forum.