Use data to discover patterns ("social facts" in Durkheim's terms),
and the social mechanisms that bring them about.
Intuition
Application 1: Drinking + Driving = Death
Application 2: Does Electing Women Reduce Corruption?
Application 3: Ethnic discrimination in Access to healthcare and far-right mayors.
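$$
\begin{aligned}
E[Y_{1i} \mid D = 1] - E[Y_{0i} \mid D = 0] &= E[Y_{0i} + \kappa \mid D = 1] - E[Y_{0i} \mid D = 0] \\
&= \kappa + \underbrace{E[Y_{0i} \mid D = 1] - E[Y_{0i} \mid D = 0]}_{0 \text{ (if randomization has worked)}} \\
&= \underbrace{\kappa}_{\text{The average causal effect}}
\end{aligned}
$$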
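The decomposition above can be checked numerically. The following is a minimal sketch on simulated data, assuming a constant treatment effect; the names `kappa`, `y0`, `d` and all the numbers are made up for illustration. It shows that the difference in observed means equals the effect plus the selection-bias term, and that the selection-bias term is close to zero when treatment is assigned at random.

```r
set.seed(42)

n     <- 10000
kappa <- 2                          # hypothetical constant treatment effect

y0 <- rnorm(n, mean = 10, sd = 3)   # potential outcome without treatment
d  <- rbinom(n, 1, 0.5)             # treatment assigned by coin flip
y  <- y0 + kappa * d                # observed outcome: Y1 = Y0 + kappa if treated

# Left-hand side of the equation: difference in observed group means
diff_means <- mean(y[d == 1]) - mean(y[d == 0])

# Selection bias: difference in untreated potential outcomes between groups
selection_bias <- mean(y0[d == 1]) - mean(y0[d == 0])

# The decomposition holds by construction; randomization makes the bias small
c(diff_means = diff_means, kappa = kappa, selection_bias = selection_bias)
```

In real data `y0` is never observed for the treated group, which is why the selection-bias term has to be argued away by design rather than computed.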
Society has all kinds of thresholds where people are treated differently above and below the cut point:
What is the causal effect of
making alcohol consumption legal
on mortality?
Let's compare those who are legally too young to drink (<21) to those who are old enough to drink (≥21).
pacman::p_load(
  tidyverse, # Data manipulation
  ggplot2,   # beautiful figures
  estimatr,  # OLS with robust SE
  texreg,    # regression tables with nice layout
  rdrobust,  # Non-parametric regression
  RDDtools   # easy RDD fitting
)

# Get the Minimum legal drinking age data!
data("mlda", package = "masteringmetrics")
mlda <- mlda %>% drop_na()
mlda # print the tibble.
# # A tibble: 48 × 19
#    agecell   all allfitted internal internalfitted external externalfitted alcohol alcoholfitted homicide
#      <dbl> <dbl>     <dbl>    <dbl>          <dbl>    <dbl>          <dbl>   <dbl>         <dbl>    <dbl>
#  1    19.1  92.8      91.7     16.6           16.7     76.2           75.0   0.639         0.794     16.3
#  2    19.2  95.1      91.9     18.3           16.9     76.8           75.0   0.677         0.838     16.9
#  3    19.2  92.1      92.0     18.9           17.1     73.2           75.0   0.866         0.878     15.2
#  4    19.3  88.4      92.2     16.1           17.3     72.3           74.9   0.867         0.915     16.7
#  5    19.4  88.7      92.3     17.4           17.4     71.3           74.9   1.02          0.949     14.9
#  6    19.5  90.2      92.5     17.9           17.6     72.3           74.9   1.17          0.981     15.6
#  7    19.6  96.2      92.6     16.4           17.8     79.8           74.8   0.870         1.01      16.3
#  8    19.6  89.6      92.7     16.0           17.9     73.6           74.8   1.10          1.03      15.8
#  9    19.7  93.4      92.8     17.4           18.1     75.9           74.7   1.17          1.06      16.8
# 10    19.8  90.9      92.9     18.3           18.2     72.6           74.6   0.948         1.08      16.6
# # ℹ 38 more rows
# # ℹ 9 more variables: homicidefitted <dbl>, suicide <dbl>, suicidefitted <dbl>, mva <dbl>, mvafitted <dbl>,
# #   drugs <dbl>, drugsfitted <dbl>, externalother <dbl>, externalotherfitted <dbl>
mlda <- mlda %>%
  mutate(
    # Define those who are allowed to drink
    over21 = case_when(
      agecell >= 21 ~ "Yes",
      TRUE ~ "No"
    )
  )

# agecell is measured in years (values run from about 19 to 23)
ggplot(data = mlda, aes(y = all, x = agecell, color = over21)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue")) +
  labs(y = "Nr of deaths among US Americans \n aged 20-22 (1997-2003)",
       x = "Age in years") +
  guides(color = "none")
So now let's zoom in on the data right around the threshold.
mlda <- mlda %>%
  mutate(
    # Flag the observations just below ("low") and just above ("high") the threshold
    close = case_when(
      agecell >= 20.7 & agecell < 21 ~ "low",
      agecell >= 21 & agecell < 21.3 ~ "high",
      TRUE ~ "No"
    )
  )

ggplot(data = mlda, aes(y = all, x = agecell, color = close)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values = c("blue", "red", "black")) +
  labs(y = "Nr of deaths among US Americans \n aged 20-22 (1997-2003)",
       x = "Age in years") +
  guides(color = "none")
The simplest approach is to use linear regression and add a binary variable (0/1), D, that indicates whether observation i is at or above the threshold (D = 1) or below it (D = 0) on the running variable of interest, a. Note that in the equations on the right, y is our dependent variable (e.g. the death rate), α is the intercept, and ρ is the jump in y at the threshold, which is the effect we want to estimate.
A simple regression model:
$$y_i = \alpha + \rho D_i + \gamma a_i + e_i$$
We might also consider adding polynomial terms to help the model if we are not convinced that the relationship is linear. For instance, we could add a squared term (a "second order" polynomial):
$$y_i = \alpha + \rho D_i + \gamma_1 a_i + \gamma_2 a_i^2 + e_i$$
We can even add a polynomial with a power of three, or more! For instance, a "third order" polynomial:
$$y_i = \alpha + \rho D_i + \gamma_1 a_i + \gamma_2 a_i^2 + \gamma_3 a_i^3 + e_i$$
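The three models above are ordinary regressions, so they can also be fitted with base R's `lm()`. The following is a rough sketch on simulated data, not the drinking-age data: the true jump `rho`, the curvature, and the noise level are all invented for illustration, and the running variable is centred at the threshold purely for convenience.

```r
set.seed(1)

n   <- 2000
a   <- runif(n, 19, 23)          # simulated running variable (age in years)
D   <- as.numeric(a >= 21)       # 1 if at or above the threshold, 0 otherwise
a_c <- a - 21                    # running variable centred at the threshold
rho <- 8                         # hypothetical true jump at the threshold

# Simulated outcome with a mildly curved trend plus a jump of rho at age 21
y <- 90 + rho * D - 1 * a_c - 0.8 * a_c^2 + rnorm(n, sd = 2.5)

# First-, second- and third-order polynomial specifications
fit_linear    <- lm(y ~ D + a_c)
fit_quadratic <- lm(y ~ D + a_c + I(a_c^2))
fit_cubic     <- lm(y ~ D + a_c + I(a_c^2) + I(a_c^3))

# Estimated jump (the coefficient on D) under each specification
rho_hat <- c(
  linear    = coef(fit_linear)[["D"]],
  quadratic = coef(fit_quadratic)[["D"]],
  cubic     = coef(fit_cubic)[["D"]]
)
rho_hat
```

Comparing the coefficient on `D` across specifications is a quick way to see how sensitive the estimated jump is to the assumed shape of the trend.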
Using the R package RDDtools, we can implement all of these models rather easily. The main challenge is just to install it. See Exercise 1 on that point.
First, creating a plot of the discontinuity with RDDtools
# Tell the package that it is an RDD dataset
# y = the Y variable of interest, in this case mlda$all
# x = the X variable of interest, in this case mlda$agecell
# cutpoint = the point of discontinuity in the data, in this case 21
drinking_rdd_data <- RDDdata(
  y = mlda$all,
  x = mlda$agecell,
  cutpoint = 21
)
# Note that this renames the variables to y and x. Do not be alarmed.
# And do not change the variable names back, or you will get error messages
# It's helpful to change the axis labels since it changes the variable names
# and 'x' and 'y' aren't very informative
plot(
  drinking_rdd_data,
  xlab = "Age in years\n", # A little extra white space
  ylab = "Nr of deaths among US Americans aged 20-22 (1997-2003)"
)
Next, creating a simple linear model without any polynomial terms
# Order refers to the polynomial. Polynomials of order 1 are just normal variables
# without any exponents.
# Slope can be set to "same" if we want both sides of the threshold to have the same
# slope. It can also be set to "separate" if we want to add an interaction term so as
# to allow the slopes on each side to differ from one another, as described in the textbook.
rdd_linear <- RDDreg_lm(
  RDDobject = drinking_rdd_data,
  order = 1,
  slope = "same"
)
summary(rdd_linear)
#
# Call:
# lm(formula = y ~ ., data = dat_step1, weights = weights)
#
# Residuals:
#    Min     1Q Median     3Q    Max
# -5.056 -1.848  0.115  1.491  5.804
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   91.841      0.805  114.08  < 2e-16 ***
# D              7.663      1.440    5.32  3.1e-06 ***
# x             -0.975      0.632   -1.54     0.13
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.49 on 45 degrees of freedom
# Multiple R-squared:  0.595, Adjusted R-squared:  0.577
# F-statistic: 33 on 2 and 45 DF,  p-value: 1.51e-09
# It's helpful to change the axis labels since it changes the variable names
# and 'x' and 'y' aren't very informative
plot(
  rdd_linear,
  xlab = "Age in years\n", # A little extra white space
  ylab = "Nr of deaths among US Americans aged 20-22 (1997-2003)"
)
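The code comments describe a "separate" slope option that adds an interaction term. As a rough illustration of that idea in base R (a sketch on simulated data, not the code RDDtools runs internally), interacting D with the centred running variable lets the trend differ on each side of the threshold. The data-generating numbers and the names `a_c` and `fit_separate` are assumptions made for this example.

```r
set.seed(3)

n   <- 2000
a   <- runif(n, 19, 23)        # simulated running variable (age in years)
a_c <- a - 21                  # centred at the threshold, so D measures the jump at 21
D   <- as.numeric(a >= 21)

# Simulated outcome: slope of -1 below the threshold, +0.5 above it, jump of 8
y <- 90 + 8 * D - 1 * a_c + 1.5 * D * a_c + rnorm(n, sd = 2.5)

# D * a_c expands to D + a_c + D:a_c, i.e. a separate slope on each side
fit_separate <- lm(y ~ D * a_c)
coef(fit_separate)
```

Centring matters here: with an interaction in the model, the coefficient on `D` is the gap between the two fitted lines at `a_c = 0`, which is the threshold only if the running variable has been centred.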
Finally, adding in some polynomial terms (power of 2)
# Order refers to the polynomial. Polynomials of order 1 are just normal variables
# without any exponents.
# Slope can be set to "same" if we want both sides of the threshold to have the same
# slope. It can also be set to "separate" if we want to add an interaction term so as
# to allow the slopes on each side to differ from one another, as described in the textbook.
rdd_quadratic <- RDDreg_lm(
  RDDobject = drinking_rdd_data,
  order = 2,
  slope = "same"
)
summary(rdd_quadratic)
#
# Call:
# lm(formula = y ~ ., data = dat_step1, weights = weights)
#
# Residuals:
#   Min    1Q Median    3Q   Max
# -4.45 -1.75   0.19  1.16  5.11
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   92.903      0.837  110.99  < 2e-16 ***
# D              7.663      1.339    5.72  8.7e-07 ***
# x             -0.975      0.588   -1.66   0.1046
# `x^2`         -0.819      0.289   -2.84   0.0069 **
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.32 on 44 degrees of freedom
# Multiple R-squared:  0.657, Adjusted R-squared:  0.634
# F-statistic: 28.1 on 3 and 44 DF,  p-value: 2.61e-10
# It's helpful to change the axis labels since it changes the variable names
# and 'x' and 'y' aren't very informative
plot(
  rdd_quadratic,
  xlab = "Age in years\n", # A little extra white space
  ylab = "Nr of deaths among US Americans aged 20-22 (1997-2003)"
)
So far we have used all the data points, relying on polynomials to home in on the real effect. Keeping all the data gives us more statistical power. But there is an alternative.
Parametric Regression Discontinuity: Model the entire dataset, taking into account potential non-linearities in the data
Nonparametric Regression Discontinuity: Only use data that is within a certain distance of the threshold
If we draw the green box too small (too close to the threshold), we will lose statistical power and won't be able to draw meaningful conclusions. If we draw the green box too big, we get more observations, but we lose the strength of the comparison. We call the width of the box the "bandwidth".
That is, we want to estimate this: $$y_i = \alpha + \rho D_i + \gamma a_i + e_i$$
But we only want to estimate it in a sample where: $$\text{threshold} - \text{bandwidth} \le a_i \le \text{threshold} + \text{bandwidth}$$
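To make the sample restriction concrete, here is a minimal sketch that applies it by hand to simulated data and then fits the same linear model with `lm()`. The bandwidth of 1 is picked arbitrarily for illustration, and every observation inside the window gets equal weight; a packaged local linear estimator with a data-driven bandwidth may weight observations by distance to the threshold, so this is not a replica of that procedure.

```r
set.seed(7)

n <- 5000
a <- runif(n, 19, 23)                    # simulated running variable (age in years)
D <- as.numeric(a >= 21)                 # at or above the threshold
y <- 90 + 8 * D - 1 * (a - 21) + rnorm(n, sd = 2.5)   # hypothetical jump of 8
sim <- data.frame(y = y, D = D, a = a)

threshold <- 21
bandwidth <- 1                           # hypothetical, hand-picked bandwidth

# Keep only observations with threshold - bandwidth <= a <= threshold + bandwidth
local_sample <- subset(sim, a >= threshold - bandwidth & a <= threshold + bandwidth)

# Same linear model as before, estimated on the restricted sample only
fit_local <- lm(y ~ D + a, data = local_sample)

nrow(local_sample)        # fewer observations than the full sample
coef(fit_local)[["D"]]    # estimated jump within the window
```

Rerunning the sketch with a smaller `bandwidth` shows the trade-off described above: fewer observations and a noisier estimate, but a comparison between more similar units.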
Thankfully, finding the right bandwidth is automated with a function in RDDtools. So the process is relatively simple for us here.
# Weird-looking function name, but it figures out the optimal bandwidth given your data
bandwidth <- RDDbw_IK(drinking_rdd_data)
reg_nonparam <- RDDreg_np(
  RDDobject = drinking_rdd_data,
  bw = bandwidth,
  slope = "same"
)
coef_nonparam <- summary(reg_nonparam)$coefMat[1] # point estimate for D
se_nonparam <- summary(reg_nonparam)$coefMat[2]   # its standard error
summary(reg_nonparam)
# ### RDD regression: nonparametric local linear###
#   Bandwidth:  1.56
#   Number of obs: 38 (left: 19, right: 19)
#
#   Weighted Residuals:
#     Min     1Q Median     3Q    Max
#  -7.633 -2.290 -0.309  0.784  4.221
#
#   Coefficient:
#   Estimate Std. Error z value Pr(>|z|)
# D     9.19       1.74    5.29  1.2e-07 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#   Local R squared: 0.709
From Pereira & Fernandez-Vazquez. (2022). Does electing women reduce corruption? Legislative Studies Quarterly
In 2007 Spain introduced a rule that, in towns with a population larger than 5000 people, parties had to reserve at least 40% of the places on their candidate lists for women. The authors use this population threshold as a regression discontinuity to look into the well-known correlation between women in government and corruption.
There's an effect! But why?
From Biegert, Kühhirt, & Van Lancker. (2022). They Can’t All Be Stars: The Matthew Effect, Cumulative Status Bias, and Status Persistence in NBA All-Star Elections. American Sociological Review
Sociologists have argued for 80 years that small early advantages between equals can create large inequalities. Here they study this using unique data from professional basketball.
Does winning the fan vote in time t increase a person's chances of winning the fan vote again in time t+1 and beyond?
There's an effect! But why?
Sociologist Robert Merton argued that this happens because we don't have enough available space to reward every deserving person (in his case, it was scientists...think the Nobel prize). We have to make decisions between people who are more or less equal, and rewarding one person at the threshold but not the other creates an actual inequality where there didn't use to be any.
Maybe the players who won the first time are just better? Nope. Every measure of performance is insignificant, while the RDD estimate remains significant.
I can't show you the evidence, because it isn't published yet! But I saw your professor present this work in Leipzig.
Newly arrived residents of Italian regions that just barely elect far-right governments are more likely to experience delays and difficulties getting their healthcare access set up, compared to those who live in regions that just barely did not elect far-right governments.