Summary: The 1940 Census — the “Cen” in CenSoc — was the first to ask detailed questions on income. In this vignette, we look at the characteristics of wage and salary data from the Census. In the latter half, we also model the relationship between wage income and age at death. This is, of course, well-trod territory; the positive relationship between income and longevity is well known and often attributed to the protective effects of income on health. Although it is typically based on cumulative (lifetime) earnings, the association might appear in single-year censal data as well.

For the vignette, we’ll use the CenSoc–Numident demo dataset, which can be downloaded from the CenSoc downloads request page. The dataset merges content from two data files: (1) a 1% sample of the 132 million respondents to the 1940 Census via IPUMS-USA from the University of Minnesota Population Center, and (2) anonymized death records from 1988 to 2005 of persons with Social Security numbers contained in Numident, the Social Security Administration’s database of SSN applications. (Learn more about the demo data and other CenSoc data products here.)

 

Part A: Getting started

We’ll begin by reading in the demo data and loading a few R library packages to help us along the way. In particular, we’ll make use of the collection of packages known as the tidyverse, popular for data workflows in R.

# as a friendly reminder to R neophytes, be sure to set your working directory to the folder containing the demo file you downloaded

# use setwd("dir") to set a different working directory, where "dir" is an absolute file path enclosed in double quotes. Or point-and-click your way to a new directory from the RStudio menu: Session > Set Working Directory > Choose Directory...



# load libraries

library(tidyverse)   # for data manipulation and processing

library(ggplot2)     # for creating neatly formatted figures

library(stargazer)   # for creating neatly formatted tables

library(knitr)       # for creating tidy reports

library(readr)       # for reading in .csv data as a tidyverse-compliant tibble



# load data (adjust the filepath as necessary)

# demo <- read_csv("~/Google Drive/censoc_numident_demo_v1.csv", col_names = TRUE, col_types = NULL, na = c("", "NA" ,"N/A"))  # on a Mac

 demo <- read_csv("C:/Users/Serge/Google Drive/censoc_numident_demo_v1.csv", col_names = TRUE, col_types = NULL, na = c("", "NA" ,"N/A"))  # on a PC

The demo dataset contains nearly 63,000 records on 30 variables. Eight variables are from the Numident file, 21 are from the IPUMS-USA file, and both files are joined on the person-record identifying histid variable. We’re interested in the incwage variable, which reports gross wages or salary earned by a person working as someone else’s employee during the 12 months ending December 31, 1939. To keep things simple, we’ll leave out all but a select few variables whose use will be made clear as we go along.

df <- demo %>%

  select(death_age,  # age at death in yrs (Soc)

         incnonwg,   # nonwage income in 1939 (Cen)

         incwage,    # wage/salary income in 1939 (Cen)

         byear,      # birth year (Soc)

         weight,     # sample weight (Soc)

         perwt,      # person weight (Cen)

         sex,        # sex (Cen)

         age,        # age (Cen)

         )

 

Part B: Missing values and special codes in incwage

Our next step is to look at data missingness and variable coding in incwage. For this task, the data codebook is your friend, and we owe its existence to the IPUMS folks at the Minnesota Population Center who produced comprehensive, detailed codebook content for each variable in the 1940 Census.

The coding for incwage tells us about two special codes. “999998” represents missing data and “999999” indicates the non-applicability of the incwage variable. Note that these are distinct conditions, since a missing value is likely due to human error while an NA value is used when a logical expression evaluates to FALSE. In incwage, a “999998” code would apply if a person forgot to answer the wage income question on his/her Census form. But “999999” would apply if the same person was only 13 years old because individuals had to be 14 or older to be in the Census’ wage-earning universe (i.e., “13” equals “14 or older” = FALSE).

Let’s count up total instances of each code:

sum(df$incwage == 999998)

#> [1] 0

sum(df$incwage == 999999)

#> [1] 20276

We have no missing values and 20,276 records for whom wage income was not applicable. The absence of missing values is a welcome indication of data completeness. As for the many thousands of NA records, representing one-third of the demo dataset, could it be they all were 13 years old or less on April 1, 1940?

sum(df$age < 14)

#> [1] 20037

Not quite, but close. We still have 239 records to account for to get to 20,276. In what ways might a couple hundred individuals be legitimately excluded from observation on incwage, other than being underage? Turning to the codebook, we see our answer might have something to do with the incwage universe not including “institutional inmates,” — i.e., incarcerated persons. For our example, let’s assume this is the explanation and remove these cases from the sample.

df <- df %>%

  filter(incwage != 999999)

There is also a third special code in incwage to consider: “5001”. This is a topcode, or placeholder value for incomes exceeding the high end of a continuously measured income range. (In the 1940 Census, the high end was $5,000 — equivalent to about $93,000 today.) By suppressing the actual incomes of the very highest earners, topcodes protect the confidentiality of individuals for whom the disclosure of their outlier earnings might allow them to be personally identified. Their use, however, distorts the income distribution, since a topcode is essentially equivalent to missing data; “5001” is no more indicative of an actual income value than “999998” is.

Should you keep topcoded records or discard them? That depends. For instance, how many topcoded records are there? Does the income distribution change significantly if those records are dropped? Do we have reason to believe top-earning outliers were considerably different from individuals whose wages were at or near $5,000?

Let’s take these one at a time. First, the command below tells us we have just two topcode instances in our sample of 42,623 individuals:

sum(df$incwage == 5001)  # counts the total number of times we have a "5001" code

#> [1] 2

We’re probably safe in assuming that the removal of these two cases should make only a negligible difference, but to be sure, we’ll compare the five-number summaries and standard deviations of the income distribution with and without the two records:

# five-number summary and std dev, topcoded records retained

summary(df$incwage)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 

#>     0.0     0.0     0.0   247.1   312.0  5001.0

sd(df$incwage)

#> [1] 470.7404



# five-number summary and std dev, topcoded records dropped

summary(df$incwage[df$incwage != 5001])

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 

#>     0.0     0.0     0.0   246.9   312.0  5000.0

sd(df$incwage[df$incwage != 5001])

#> [1] 469.6236

We confirm that the change to the sample’s income distribution is almost vanishingly small, so let’s discard the topcoded records. Sample size will be unaffected for all intents and purposes and we’ll avoid problems, however minor, caused by our treating “5001” as literally $5,001 in income when it might have been much higher.

df <- df %>%

  filter(incwage != 5001)

 

Part C: Understanding zero values in incwage

Our next area of focus is the zeros in incwage. We’ll see why this is important after calling a histogram of wage incomes:

ggplot(df, aes(x = incwage)) +

  geom_histogram(bins = 100) +

  labs(title = "Distribution of wage incomes in 1939")

That is a lot of zeros. Almost any income distribution you’ll encounter will be right-skewed, but this is extreme. Of course, not every observation in that first bin is exactly $0 — even at 100 bins, covering the distance from $0 to $5,000 means each bin must span $50 (well, $50.01, technically, but our wage data is expressed in integers). But we should still expect a lot of zero-earners here. Earlier, we saw the median incwage value was 0, so we know at least half of our sample earned no wages in 1939. To find out exactly how many, we’ll recall a now-familiar command:

sum(df$incwage == 0)

#> [1] 26661

That’s nearly 63% of our sample claiming no wages earned during any part of 1939. We know the Depression-era economy was weak, but this weak?

Let’s think about this for a moment. On one hand, we can take 0 at face value — the codebook tells us a zero in incwage factually means zero dollars earned in 1939. If we were to disregard everything else we know about our sample, the histogram would be an accurate portrayal of the wage-earning situation. On the other hand, we know wage-earning is an activity affected by socio-cultural context, and if we want wage income to proxy unobserved indicators like socioeconomic position or access to resources — as has long been the norm in the social sciences — we ignore this context at our peril. Without context, we’d have no reason to interpret zero wages as anything but a universally unfavorable or undesired outcome for every zero-earner. That context is what we need to understand why our blanket assumption might be wrong.

For example, it wasn’t until the 1960s that married women began to work outside the home in large numbers. How many of our 26,661 zero-earners might have been stay-at-home wives or mothers, under no normative expectation to work for wages because their husbands were the breadwinners? Or what about the labor force contingent that was self-employed in 1939 — farmers, shopkeepers, white-collar professionals with their own practices, etc.? Their earnings would have been coded 0 in incwage by default. We can also imagine scenarios where individuals earned no wages for reasons unrelated to the state of the economy or job market; e.g., they spent all of 1939 in school, idled by injury, or removed from the labor force for personal or family reasons.

Because zero wages may not always reflect adversity (e.g., financial hardship, the bottom rung of the socioeconomic ladder), how do we disentangle what zero wages really means? For incwage to be useful, we want it “un-confounded.” We want $0 to mean roughly the same thing for everyone in our sample. But rather than assuming that will be the case, we’ll make it so (to the extent feasible), using things we know about the people in our sample to exclude individuals who don’t fit into a specific wage-earning schema. And if we predicate our schema on the notion that everyone in our sample was expected to work in 1939 and all available work was employer-based, that means we should restrict our sample to males, if only because females did not face the same pressure — or have the same opportunity — as males to earn wages in the prewar era.

df <- df %>%

  filter(sex == "Male")

Next, we’ll want to keep only those males whose principal or only means of earning a living was as someone else’s employee. Variables such as the nonwage income indicator incnonwg or the categorical occupation variable occ may be helpful, and in a later vignette we’ll deep-dive both of these. But for now, it’s sufficient to know incnonwg as the equivalent of a dummy variable indicating whether $50 or more of a person’s income in 1939 was from a nonwage (i.e., self-employed) source.

table(df$incnonwg)

#> 

#>          $50+ nonwage, nonsalary income Less than $50 nonwage, nonsalary income 

#>                                    2769                                   16727

We can use that dichotomous coding to filter out the self-employed if we assume that $50 of nonwage income accurately identifies the threshold between earning a living from wages and earning a living from one’s own enterprise. We may then conclude that 2,769 males in our sample do not fit our schema and should be removed.

df <- df %>%

  filter(incnonwg != "$50+ nonwage, nonsalary income")

That still leaves us with 7,846 males with zero wage income in 1939:

head(table(df$incwage == 0))

#> 

#> FALSE  TRUE 

#>  8881  7846

Now things get a little tricky again. All we know about these folks is that they earned no wages in 1939 and might have earned at most $49 working for themselves that year. Yet it’s possible many or even most of them earned nothing at all from any source. For the sake of our schema, how do we decide whether these men were self-employed workers who had a bad year, wage earners who had a very bad year, or individuals who intentionally left the labor market?

The answer is, we can’t. Not with what we presently know about our sample. And this raises a couple of questions when looking at the low-earning end of any income distribution: What is the income threshold below which a given wage-earning schema breaks down? At what point do we risk confounding our analyses because we’ve included too many observations at the low end who do not fit our schema?

In situations like this, part of our job is to make a judgment call and defend our reasoning. Perhaps we might argue that the age range of our sample is too wide at the young end — we include 14- and 15-year-olds, for instance, because they’re eligible to earn wages, but maybe very few of them actually did.

Let’s see if that was in fact the case:

sum(df$age %in% 14:15)

#> [1] 2727



df %>%

   filter(age %in% 14:15) %>%

   group_by(incwage) %>%

   tally() %>%

   spread(incwage, n)

#> # A tibble: 1 x 65

#>     `0`   `1`   `2`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `12`  `13`  `14`

#>   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>

#> 1  2597     2     1     1     5     2     3     2     1     3     4     1     1

#> # ... with 52 more variables: `15` <int>, `16` <int>, `20` <int>, `21` <int>,

#> #   `22` <int>, `24` <int>, `25` <int>, `27` <int>, `29` <int>, `30` <int>,

#> #   `34` <int>, `35` <int>, `36` <int>, `40` <int>, `45` <int>, `48` <int>,

#> #   `50` <int>, `55` <int>, `60` <int>, `66` <int>, `70` <int>, `72` <int>,

#> #   `75` <int>, `78` <int>, `80` <int>, `81` <int>, `84` <int>, `85` <int>,

#> #   `90` <int>, `100` <int>, `120` <int>, `140` <int>, `150` <int>,

#> #   `156` <int>, `180` <int>, `185` <int>, `200` <int>, `210` <int>,

#> #   `240` <int>, `250` <int>, `252` <int>, `260` <int>, `300` <int>,

#> #   `350` <int>, `365` <int>, `520` <int>, `600` <int>, `700` <int>,

#> #   `800` <int>, `900` <int>, `1200` <int>, `2700` <int>

Out of 2,727 males aged 14 or 15 years on Census Day 1940, 2,597 of them (95%!) earned no wages in 1939. It seems a safe bet that the large majority of these males do not fit our schema and should be dropped. In fact, let’s agree to restrict the analytic sample to ages 18 and older, so as to capture young men at an age when they’d be seen by society as adults and expected to work. Let’s also drop all zero-wagers from our sample, based on lack of clarity about their fit with our schema.

df <- df %>%

  filter(age > 17) %>%

  filter(incwage != 0)

Having made those calls, the next question is what to do about those who earned unusually low wages — say, between $1 and $49 — in 1939. One would not expect gainfully employed workers to earn so little in a year, unless they were maybe contending with a decade-long economic slump. In other words, low values in this particular income distribution may be rather anomalous and their interpretive meaning incongruous with what we’d normatively assume about workers with very low earnings — in which case, their presence in the data may have a confounding effect that we’d want to remove.

Using incnonwg as our guide, we’ll assume our schema becomes valid at wage earnings of $50 or more and filter accordingly:

df <- df %>%

  filter(incwage > 49)

This returns our final analytic sample of 8,120 wage-earning males. Let’s now update the incwage histogram and five-number summary:

ggplot(df, aes(x = incwage)) +  

  geom_histogram(bins = 100) +

  labs(title = "Distribution of wage incomes in 1939")  # histogram of income data



summary(df$incwage)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 

#>    50.0   325.0   700.0   792.8  1092.0  5000.0
With all the zeros removed, the y-axis has been rescaled, allowing us to see the shape of the distribution more clearly. Wages fall into a much more conventional-looking curve, with a long right tail covering half of the income range. (Note the clumping of income values at regular intervals, indicative of income heaping, i.e., many individuals likely rounded their reported incomes to the nearest hundred.) And our summary tells us that mean wage earnings were $793 (about $14,750 in current dollars); appropriately, but not excessively, higher than the $700 median. We have a reasonable-looking distribution.

 

Part D: A linear regression example

At the top of the vignette, I said our goal was to test the relationship between a person’s wage earnings in 1939 and their age at death. To that end, we’ll use ordinary least squares (OLS) regression, where the natural log of incwage is used to “explain” some portion of the variance in death_age. We’ll employ the lm command in R’s basic statistics package and, because our data come from a sample, we’ll adjust for sample design with our two sample weight variables: perwt from IPUMS-USA, and weight from Numident. Additionally, we’ll control for byear, i.e., we’ll partial out the effects relating to the fact that our sample was born across a wide span of birth years.

You may be wondering why we’re about to work with logged income values instead of original units (dollars), which would be easier to interpret. The logarithmic transformation of income data is actually quite commonly done; one advantage is that it attenuates the effect of extremely high values on the computation of regression coefficients. More pertinent, however, is the theoretical idea that the marginal effect of income on longevity is probably found with proportional and not absolute changes in income. The log model estimates the constant effect of an equal proportional change (e.g., +10%) across the income range, whereas the unlogged model tells us the constant effect of an additive change (e.g. +$100). Neither model is perfect, but economic theory, at least, suggests that the proportional model might be more appropriate, considering the marginal value of $100 is higher for someone who earns less than someone who earns more.

Let’s run our regression model:

inc.ols <- lm(death_age ~ log(incwage) + as.factor(byear),  # we force R, on the fly, to apply a log transformation to incwage and treat byear as a factor variable to isolate the effect of each birth year

               data = df,

               weight = weight * perwt)  # we multiply these variables together to blend their weights



stargazer(inc.ols, title = "Model results", type = "text", df = FALSE, digits = 3, colnames = TRUE)

#> 

#> Model results

#> ================================================

#>                          Dependent variable:    

#>                      ---------------------------

#>                               death_age         

#> ------------------------------------------------

#> log(incwage)                  0.193***          

#>                                (0.072)          

#>                                                 

#> as.factor(byear)1896            0.039           

#>                                (1.110)          

#>                                                 

#> as.factor(byear)1897           -0.941           

#>                                (1.074)          

#>                                                 

#> as.factor(byear)1898           -1.232           

#>                                (0.982)          

#>                                                 

#> as.factor(byear)1899          -3.046***         

#>                                (0.973)          

#>                                                 

#> as.factor(byear)1900          -2.598***         

#>                                (0.909)          

#>                                                 

#> as.factor(byear)1901          -3.188***         

#>                                (0.972)          

#>                                                 

#> as.factor(byear)1902          -2.568***         

#>                                (0.914)          

#>                                                 

#> as.factor(byear)1903          -3.986***         

#>                                (0.901)          

#>                                                 

#> as.factor(byear)1904          -3.979***         

#>                                (0.895)          

#>                                                 

#> as.factor(byear)1905          -6.508***         

#>                                (0.884)          

#>                                                 

#> as.factor(byear)1906          -5.269***         

#>                                (0.870)          

#>                                                 

#> as.factor(byear)1907          -7.297***         

#>                                (0.854)          

#>                                                 

#> as.factor(byear)1908          -8.456***         

#>                                (0.859)          

#>                                                 

#> as.factor(byear)1909          -8.654***         

#>                                (0.855)          

#>                                                 

#> as.factor(byear)1910          -9.220***         

#>                                (0.852)          

#>                                                 

#> as.factor(byear)1911         -10.301***         

#>                                (0.853)          

#>                                                 

#> as.factor(byear)1912         -10.848***         

#>                                (0.845)          

#>                                                 

#> as.factor(byear)1913         -11.495***         

#>                                (0.844)          

#>                                                 

#> as.factor(byear)1914         -12.455***         

#>                                (0.842)          

#>                                                 

#> as.factor(byear)1915         -12.960***         

#>                                (0.842)          

#>                                                 

#> as.factor(byear)1916         -13.831***         

#>                                (0.842)          

#>                                                 

#> as.factor(byear)1917         -14.633***         

#>                                (0.843)          

#>                                                 

#> as.factor(byear)1918         -15.717***         

#>                                (0.846)          

#>                                                 

#> as.factor(byear)1919         -16.470***         

#>                                (0.849)          

#>                                                 

#> as.factor(byear)1920         -16.875***         

#>                                (0.854)          

#>                                                 

#> as.factor(byear)1921         -17.713***         

#>                                (0.872)          

#>                                                 

#> as.factor(byear)1922         -17.813***         

#>                                (1.045)          

#>                                                 

#> Constant                      92.405***         

#>                                (0.959)          

#>                                                 

#> ------------------------------------------------

#> Observations                    8,110           

#> R2                              0.472           

#> Adjusted R2                     0.470           

#> Residual Std. Error            117.163          

#> F Statistic                  257.776***         

#> ================================================

#> Note:                *p<0.1; **p<0.05; ***p<0.01

We find that the expected relationship holds up: wage income had a positive and statistically significant effect on age at death. The model as a whole even manages to explain nearly half of the variance in death_age, as shown by the R-squared statistic, although we should avoid giving too much credit to incwage, for reasons we will see below.

But first, let’s interpret the log-transformed incwage coefficient (0.193). One way to think of it is that dividing this coefficient by 100 gives us the number of units death_age changes for each percent increase in the log of incwage. In other words, each 0.01 increase in log(incwage) extends longevity by 0.00193 years, on average. Such tiny amounts can be rather meaningless in the context of human lifespan (e.g., 0.00193 years is only about 17 hours), so it makes sense to scale it up so that, for instance, a 10% increase is equivalent to 170 hours, or one week of additional lifespan. But this only works on small increases; as gains in incwage get bigger, say, 20% or more, the less proportional the change in lifespan becomes. (This is because the log of incwage is on an exponential scale while death_age is on a linear scale, and the two diverge.)

Here, a better (and formula-driven) approach is to take note of the following equation, \[log(x) - log(y) = log(x/y),\] where x and y are income amounts expressed in dollars. Let’s look at the interquartile range of our log-transformed incwage variable to see how this works, where log(x) is the 3rd quartile and log(y) is the 1st quartile:

summary(log(df$incwage))

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 

#>   3.912   5.784   6.551   6.362   6.996   8.517

Exponentiating 6.996 gives us $1,092. Exponentiating 5.784 gives us $325. If we divide 1092 by 325, we get 3.36. The natural log of 3.36 is 1.212, which is also what we get if we subtract 5.784 (log(y)) from 6.996 (log(x)). This bears out the equation above. Between these two values, the distance covered is about equal to 5.784 * 1.212. Put another way, in log terms the 3rd quartile is 21.2% larger than the 1st quartile. If we follow the first approach, we’d multiply 21.2 by the coefficient 0.00193 to get the number of years of additional longevity, which is equal to 0.0409 yrs, or about 15 days. But above 20%, we obtain more accurate results by multiplying our coefficient by log(x/y). There, the result becomes a little smaller at 0.0371 yrs, or 13.5 days.

Let’s turn next to our control variable byear. Coefficients for all birth years except 1896–98 are significant and show an obvious pattern: the more recent the birth year, the larger the negative coefficient. And now look at the constant: the expected age at death is 92.4 years. A large part of the reason for these two (related) outcomes is that controlling for birth year assigns persons born in 1895 as the reference group (since 1895 is the smallest value in the byear vector) and these individuals would have been the oldest cohort, on average, when observed for deaths between 1988 and 2005. This more accurately makes 92.4 the average age at death for this earliest born cohort than the average for the entire sample. The fact that the calendar year window when we can observe deaths is fixed means each newer birth cohort reached 1988 at a younger age. For those born in 1895, we can only observe deaths at age 93 and up (1988 – 1895 = 93). But for the 1900 cohort, we can observe deaths from age 88 and up (and so on). It’s the visibility of ever-younger ages that pulls down the average age at death for each later-born cohort.

Does this situation represent a danger for our model or the way we interpret its results? To some extent, yes, because we do not know how much of the longevity effect is owed to the censoring of deaths at younger ages prior to 1988 rather than the period effects for each birth year. For example, men born in 1922 show an average age at death 17.8 years lower than the reference group, but that doesn’t mean the more recently a male was born, the fewer years he could expect to live. (It should actually be the opposite, considering widespread advances in public health and sanitation in the early 20th century tended to keep more people alive for longer.) What’s actually happening is that the average age at death of the 1922 cohort reflects the deaths we saw between ages 66 and 83 (1988 – 1922 = 66; 2005 – 1922 = 83), and this is being compared to our reference group of the oldest old, none of whom we observed dying until age 93. (This censoring situation will be the subject of a separate vignette, given its importance.)

Having said that, the dramatic difference in coefficient size between log-transformed incwage and byear is reassuring. For most wage-earners, income varies from year to year and there is little reason to expect the income earned in 1939 to be any stronger an indicator of lifetime earnings (the more reliable indicator in the earnings–longevity relationship) than any other year. In other words, it would have been unusual to see a large effect size. But the fact there was a small and highly significant effect indicates how easily the earnings–longevity relationship may be discerned even from a single year of data.

 

Part E: Wrap-up

Wage income can be tremendously useful, being one of the relatively few continuously coded variables for an essential “input” of modern human existence (i.e., money). Before you can make good use of income data, though, you will want to make that data consistent. To that end, it helps to understand the following:

If you’re willing to take a little time for each of these items, your income data will be ready for any kind of modeling or inferences you want to make.