Multiple Regression & Fundamentals of Causal Inference

class: center, middle, inverse, title-slide

.title[
# Multiple Regression & <br><br> Fundamentals of Causal Inference
]
.subtitle[
## 4 OLS from assumptions to visualization
]
.author[
### Merlin Schaeffer<br> Department of Sociology
]
.date[
### 2025-09-24
]

---

# Goal of empirical sociology

.font130[.center[Use data to .alert[discover patterns], <br> and the social mechanisms that bring them about.]]

<img src="https://researchleap.com/wp-content/uploads/2021/12/Population-data.jpeg" width="70%" style="display: block; margin: auto;" />
---
class: inverse
# Today's schedule

1. **Today's research question**: Colonialism and poverty
  
2. **OLS assumptions**
  + No outliers
  + Linearity
  
3. **Categorical predictors**
  + Dummy coding
  
4. **Binary outcomes**
  + Linear probability model (LPM)
  
5. **Visualizing regression**
  + Coefficient plots
  + Model predictions
  
---
class: inverse
# Colonial legacy

.push-left[
<img src="https://cdn.cfr.org/sites/default/files/styles/open_graph_article/public/image/2020/01/France-Macron-Pau-Summit-G5-Sahel.jpg?h=63648819" width="100%" style="display: block; margin: auto;" />
]

.push-right[
<img src="https://en.natmus.dk/typo3temp/assets/images/csm_neky-holbech-16x9_b72c084303_2836164db1.jpg" width="80%" style="display: block; margin: auto;" />

<img src="https://ichef.bbci.co.uk/news/976/cpsprodpb/6E2E/production/_113460282_mediaitem113460280.jpg" width="80%" style="display: block; margin: auto;" />
]

---
class: inverse
# Today's research questions

.left-column[
.font130[Is poverty lower in countries that have been independent longer?]

.font130[How do different colonial legacies compare to one another?]
]

.right-column[
.font130[.center[Colonial empires 1945]]
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a9/Colonization_1945.png" width="100%" style="display: block; margin: auto;" />
.center[*Source*: [.white[Wikipedia]](https://en.wikipedia.org/wiki/File:World_1914_empires_colonies_territory.PNG)]]

---
# Preparations

.panelset[
.panel[.panel-name[Packages for today's session]

``` r
pacman::p_load(
  tidyverse, # Data manipulation,
  ggplot2, # beautiful figures,
  kableExtra, # table formatting,
  vdemdata, # democracy datasets used in the scholarly literature,
  wbstats, # World Bank data: a tremendous source of global socio-economic data,
  estimatr, # OLS with robust SE,
  modelsummary) # regression tables with a nice layout.
```
]
.panel[.panel-name[Get WB data]

``` r
(Dat <- wb_data("SI.POV.DDAY", # Download poverty data: <$2.15 per day,
                start_date = 2000, end_date = 2025) %>%
   rename(poverty = SI.POV.DDAY) %>% # rename poverty variable,
   select(country, date, poverty) %>% # Keep only 3 variables
   drop_na(poverty) %>% group_by(country) %>% # Group by country,
   filter(date == max(date)) %>% # Keep the most recent data per country.
   mutate(date = as.numeric(date)) %>% ungroup())
# # A tibble: 168 × 3
#    country     date poverty
#    <chr>      <dbl>   <dbl>
#  1 Albania     2020     0.3
#  2 Algeria     2011     0  
#  3 Angola      2018    39.3
#  4 Argentina   2023     1.2
#  5 Armenia     2023     1.9
#  6 Australia   2018     0.5
#  7 Austria     2022     0.6
#  8 Azerbaijan  2005     0  
#  9 Bangladesh  2022     8  
# 10 Barbados    2016     1.7
# # ℹ 158 more rows
```
]]

---
# Colonial legacy

.panelset[
.panel[.panel-name[A study]

.push-left[
<img src="./img/SocialForces.png" width="100%" style="display: block; margin: auto;" />
]

.push-right[
<img src="./img/Colonial.png" width="100%" style="display: block; margin: auto;" />
.center[.backgrnote[*Source*: Lange and Dawson (2009)]]
]
]

.panel[.panel-name[Its data]

.push-left[
<img src="./img/Colonial2.png" width="100%" style="display: block; margin: auto;" />
]

.push-right[
<img src="./img/Colonial3.png" width="100%" style="display: block; margin: auto;" />
.center[.backgrnote[*Source*: Lange and Dawson (2009)]]
]]

.panel[.panel-name[Coding of colonizer]
.font90[

.panel[.panel-name[Plot Colonizer]
<img src="4-OLS-Wisdoms_files/figure-html/col-powers-1.png" width="100%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Coding of independence]
.font60[

``` r
Dat <- Dat %>% 
  mutate(
    years_indep = case_when(
      is.na(colonizer) ~ as.numeric(NA),
      country == "United States" ~ date - 1783, country == "Haiti" ~ date - 1804, 
      country == "Paraguay" ~ date - 1811, country == "Chile" ~ date - 1818,
      str_detect(country, "Argentina|Bolivia|Colombia") ~ date - 1819,
      str_detect(country, "Costa Rica|Dominican Republic|Mexico|Nicaragua|Panama|El Salvador|Guatemala|Honduras|Venezuela") ~ date - 1821,
      str_detect(country, "Brazil|Ecuador") ~ date - 1822,
      country == "Peru" ~ date - 1824, country == "Uruguay" ~ date - 1828,
      country == "Cuba" ~ date - 1899, country == "Australia" ~ date - 1901,
      country == "New Zealand" ~ date - 1907, country == "South Africa" ~ date - 1910,
      country == "Egypt" ~ date - 1922, country == "Iraq" ~ date - 1932,
      str_detect(country, "Korea|Taiwan|Vietnam") ~ date - 1945,
      str_detect(country, "Lebanon|Philippines|Syria") ~ date - 1946,
      str_detect(country, "Bangladesh|Pakistan|India|Liberia") ~ date - 1947,
      str_detect(country, "Myanmar|Israel|Jordan|Sri Lanka") ~ date - 1948,
      country == "Indonesia" ~ date - 1949, country == "Libya" ~ date - 1951,
      str_detect(country, "Cambodia|Lao") ~ date - 1954,
      str_detect(country, "Morocco|Sudan|Tunisia") ~ date - 1956,
      str_detect(country, "Malaysia|Ghana") ~ date - 1957,
      country == "Guinea" ~ date - 1958, country == "Singapore" ~ date - 1959,
      str_detect(country, "Benin|Burkina Faso|Central African Republic|Chad|Congo, Dem. Rep.|Congo, Rep.|Cote d'Ivoire|Mali|Mauritania|Niger|Nigeria|Senegal|Gabon|Somalia|Togo") ~ date - 1960,
      str_detect(country, "Kuwait|Sierra Leone|Tanzania") ~ date - 1961,
      str_detect(country, "Algeria|Burundi|Rwanda|Jamaica|Trinidad/ Tobago|Uganda") ~ date - 1962,
      country == "Kenya" ~ date - 1963,
      str_detect(country, "Malawi|Zambia") ~ date - 1964,
      str_detect(country, "Gambia|Zimbabwe") ~ date - 1965,
      str_detect(country, "Botswana|Lesotho|Guyana") ~ date - 1966,
      country == "Canada" ~ date - 1867,
      str_detect(country, "Yemen") ~ date - 1967,
      str_detect(country, "Mauritius|Swaziland") ~ date - 1968,
      country == "Fiji" ~ date - 1970,
      str_detect(country, "Bahrain|Oman|Qatar|United Arab Emirates") ~ date - 1971,
      country == "Guinea-Bissau" ~ date - 1974,
      str_detect(country, "Angola|Mozambique|Papua New Guinea") ~ date - 1975,
      country == "Djibouti" ~ date - 1977, 
      country == "Namibia" ~ date - 1990))
```
]]

.panel[.panel-name[Plot indep.]
<img src="4-OLS-Wisdoms_files/figure-html/pov-indep-1.png" width="100%" style="display: block; margin: auto;" />
]]

---
# Poverty and years of independence

.left-column[.font90[

``` r
# Estimate OLS regression
ols <- lm_robust(
  poverty ~ years_indep, 
  data = Dat)
# Regression table
modelsummary(
  list("Poverty" = ols), stars = TRUE,
  gof_map = c("nobs", "r.squared"), 
  output = "kableExtra")
```

<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Poverty </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 32.136* </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (14.035) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> years_indep </td>
   <td style="text-align:center;"> −0.039 </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1.5px">  </td>
   <td style="text-align:center;box-shadow: 0px 1.5px"> (0.229) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 51 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.001 </td>
  </tr>
</tbody>
<tfoot><tr><td style="padding: 0; " colspan="100%">
<sup></sup> + p &lt; 0.1, * p &lt; 0.05, ** p &lt; 0.01, *** p &lt; 0.001</td></tr></tfoot>
</table>

]]

.right-column[
<img src="4-OLS-Wisdoms_files/figure-html/indep-world-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: inverse middle center
# OLS assumptions

---
class: clear
# OLS Assumptions

.left-column[
.content-box-blue[
1. **No outliers**
2. **Linearity**
3. Homoscedasticity; don't worry: `lm_robust()` reports robust standard errors.
4. Independent observations.

`$\rightarrow$` Scatter plots!
]]

.right-column[
<iframe src='https://seeing-theory.brown.edu/regression-analysis/index.html#section1' width='700' height='580' frameborder='0' scrolling='yes'></iframe>
]

---
# Outlier

.left-column[
- **Gray dotted line:** OLS fit

- **Z-standardized residuals:** Distance from the regression line in standard deviations; examples: 
  + No. 62 Haiti (poverty rate 40.4%).
  + No. 32 Congo, Dem. Rep. (poverty rate 85.3%).
  + No. 69 Ireland (poverty rate 0.2%).

- **Leverage:** Potential influence on the regression; high when `$x_i$` is far from `$\bar{x}$`.

- **Cook's D**: Scaled sum of squared changes in all fitted values `$\hat{y}$` if case `$i$` were removed
  + Thresholds: gray dashed lines!
]

.right-column[

``` r
# Re-estimate model using lm(),
lm(poverty ~ years_indep, data = Dat) %>%
* plot(., which = 5) # The best outlier plot.
```

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-15-1.png" width="75%" style="display: block; margin: auto;" />
]
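
The three quantities behind this plot can also be computed directly. The following is a minimal sketch on simulated data, not the lecture's `Dat`: the data frame `sim` and the planted outlier in row 31 are assumptions made purely for illustration.

``` r
# Simulated data (hypothetical; stands in for the lecture's `Dat`)
set.seed(42)
x <- c(runif(30, 20, 80), 220)         # case 31 lies far from mean(x): high leverage
y <- 60 - 0.5 * x + rnorm(31, sd = 5)
y[31] <- 40                            # ... and far off the linear trend
sim <- data.frame(x, y)

fit <- lm(y ~ x, data = sim)

# The quantities shown in plot(fit, which = 5)
influence_stats <- data.frame(
  std_resid = rstandard(fit),      # standardized residuals
  leverage  = hatvalues(fit),      # leverage (diagonal of the hat matrix)
  cooks_d   = cooks.distance(fit)  # Cook's D
)

# Cases with the largest Cook's D first
head(influence_stats[order(-influence_stats$cooks_d), ], 3)
```

Sorting by `cooks_d` puts the planted case at the top, which is the same case that `plot(fit, which = 5)` would push towards the Cook's D contours.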

---
# Linearity

.left-column[
- **Gray dotted line:** OLS fit

- **Fitted values:** `$\hat{Y}$`

- **Red line:** Smoothed relationship between residuals and fitted values

- **Ideal:** Red line matches gray dotted line

- **Our case:** Linearity assumption clearly violated
]

.right-column[

``` r
# Re-estimate model using lm(),
lm(poverty ~ years_indep, data = Dat) %>%
* plot(., which = 1) # The best linearity plot.
```

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-16-1.png" width="75%" style="display: block; margin: auto;" />
]
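
To see what the red line is made of, the plot can be rebuilt by hand. This is a rough sketch on simulated data with a deliberately curved relationship (an assumption for illustration); `lowess()` only approximates the smoother that `plot(., which = 1)` draws.

``` r
# Simulated data with a truly curved relationship (hypothetical)
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100)

fit <- lm(y ~ x)  # a straight line is the wrong functional form here

# Smooth the residuals along the fitted values
smooth <- lowess(fitted(fit), resid(fit))

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = "dotted", col = "gray")  # where the smooth should lie
lines(smooth, col = "red")                   # U-shape: linearity violated
```

A smooth that stays close to the dotted zero line would support linearity; the U-shape here indicates that the straight line misses curvature.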

---
layout: true
# Removing the outlier

.left-column[

``` r
Dat <- Dat %>%
* filter(country != "Haiti")
```
]

---

.right-column[

``` r
# Re-estimate model using lm(),
lm(poverty ~ years_indep, data = Dat) %>%
* plot(., which = 5) # The best outlier plot.
```

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-18-1.png" width="75%" style="display: block; margin: auto;" />
]

---

.right-column[

``` r
# Re-estimate model using lm(),
lm(poverty ~ years_indep, data = Dat) %>%
* plot(., which = 1) # The best linearity plot.
```

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-19-1.png" width="75%" style="display: block; margin: auto;" />
]

---
layout: false
# Poverty and years of independence

.left-column[.font90[

<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Poverty </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 63.190*** </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (16.305) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> years_indep </td>
   <td style="text-align:center;"> −0.575* </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1.5px">  </td>
   <td style="text-align:center;box-shadow: 0px 1.5px"> (0.260) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.047 </td>
  </tr>
</tbody>
<tfoot><tr><td style="padding: 0; " colspan="100%">
<sup></sup> + p &lt; 0.1, * p &lt; 0.05, ** p &lt; 0.01, *** p &lt; 0.001</td></tr></tfoot>
</table>

]]

.right-column[
<img src="4-OLS-Wisdoms_files/figure-html/indep-world2-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: inverse middle center
# Categorical predictors

---
# Categorical predictors

.panelset[
.panel[.panel-name[Scatter plot]

.left-column[
.content-box-green[
1. What `$\hat{Y}$` value corresponds to British-colonized countries?

2. What `$\hat{Y}$` value corresponds to Belgian-colonized countries?

3. How does this difference in `$\hat{Y}$` values relate to `$\hat{\beta}$`?
]]
.right-column[
<img src="4-OLS-Wisdoms_files/figure-html/categorical-1.png" width="100%" style="display: block; margin: auto;" />
]]

.panel[.panel-name[Dummy coding]

.push-left[
`$$x=
  \begin{cases}
    1, & \text{if condition is met} \\
    0, & \text{otherwise}
  \end{cases}$$`

Country                               | Britain | France | ...
--------------------------------------|---------|--------|----
Kenya                                 | 1       | 0      | 0
India                                 | 1       | 0      | 0
...                                   | 1       | 0      | 0
Cambodia                              | 0       | 1      | 0
Algeria                               | 0       | 1      | 0
...                                   | 0       | 1      | 0
Reference <br> .backgrnote[(Belgium)] | 0       | 0      | 0
]

.push-right[
<img src="4-OLS-Wisdoms_files/figure-html/categorical2-1.png" width="100%" style="display: block; margin: auto;" />
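
The dummy columns can be inspected with base R's `model.matrix()`. This is a small sketch on a toy data frame; the object `toy` and the Belgian example row (Rwanda) are assumptions added for illustration.

``` r
# Toy data (hypothetical): five countries and their colonizer
toy <- data.frame(
  country   = c("Kenya", "India", "Cambodia", "Algeria", "Rwanda"),
  colonizer = c("Britain", "Britain", "France", "France", "Belgium")
)

# Design matrix that OLS would use: one 0/1 column per category,
# except the reference category (Belgium, first in alphabetical order)
X <- model.matrix(~ colonizer, data = toy)

cbind(toy, X)
```

The Belgian row has zeros in both dummy columns, which is what makes Belgium the reference.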
]]

.panel[.panel-name[How it's done in R]
.push-left[.font70[

``` r
# R recognizes categorical variables automatically,
# if they are factor or character vectors.
ols_2 <- lm_robust(poverty ~ colonizer, data = Dat)
# Regression table
modelsummary(list("Poverty" = ols_2), stars = TRUE,
  # Rename for a better-looking table
  coef_rename = c("colonizerFrance" = "France", 
                  "colonizerBritain" = "Britain"),
  gof_map = c("nobs", "r.squared"), output = "kableExtra")
```
]

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-22-1.png" width="75%" style="display: block; margin: auto;" />
]
.push-right[
<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Poverty </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 74.433*** </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (6.208) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> France </td>
   <td style="text-align:center;"> −45.668*** </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (7.968) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Britain </td>
   <td style="text-align:center;"> −50.609*** </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1.5px">  </td>
   <td style="text-align:center;box-shadow: 0px 1.5px"> (7.843) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 55 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.181 </td>
  </tr>
</tbody>
<tfoot><tr><td style="padding: 0; " colspan="100%">
<sup></sup> + p &lt; 0.1, * p &lt; 0.05, ** p &lt; 0.01, *** p &lt; 0.001</td></tr></tfoot>
</table>

]]

.panel[.panel-name[Interpretation]
.push-left[.font90[
<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Poverty </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 74.433*** </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (6.208) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> France </td>
   <td style="text-align:center;"> −45.668*** </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (7.968) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Britain </td>
   <td style="text-align:center;"> −50.609*** </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1.5px">  </td>
   <td style="text-align:center;box-shadow: 0px 1.5px"> (7.843) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 55 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.181 </td>
  </tr>
</tbody>
<tfoot><tr><td style="padding: 0; " colspan="100%">
<sup></sup> + p &lt; 0.1, * p &lt; 0.05, ** p &lt; 0.01, *** p &lt; 0.001</td></tr></tfoot>
</table>

]]
.push-right[.font90[

$$
`\begin{aligned}
\operatorname{\widehat{poverty}} &= 74.43 - 45.67(\operatorname{colonizer}_{\operatorname{France}})\ - \\
&\quad 50.61(\operatorname{colonizer}_{\operatorname{Britain}})
\end{aligned}`
$$

- Average poverty in Belgian-colonized countries: 74.43% 
  + (When France = Britain = 0)

- French-colonized countries compared to Belgian-colonized ones: 45.67 percentage points lower poverty on average
  + Average poverty in French-colonized countries: 74.43% − 45.67% = 28.76%
  
- British-colonized countries compared to Belgian-colonized ones: 50.61 percentage points lower poverty on average
  + Average poverty in British-colonized countries: 74.43% − 50.61% = 23.82%
]]]]
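
That the intercept plus a dummy coefficient reproduces a group mean can be checked numerically. The sketch below uses simulated poverty rates (the data frame `toy` and its numbers are invented for illustration) and plain `lm()`, since robust standard errors do not change the coefficients.

``` r
# Simulated poverty rates for three colonizer groups (hypothetical numbers)
set.seed(7)
toy <- data.frame(
  colonizer = rep(c("Belgium", "Britain", "France"), times = c(3, 10, 8)),
  poverty   = c(rnorm(3, 74, 5), rnorm(10, 24, 10), rnorm(8, 29, 10))
)

fit <- lm(poverty ~ colonizer, data = toy)
b <- coef(fit)

# Plain group averages
group_means <- tapply(toy$poverty, toy$colonizer, mean)

# Group means next to their reconstruction from the coefficients
data.frame(
  group      = names(group_means),
  group_mean = as.numeric(group_means),
  from_coefs = c(b[["(Intercept)"]],
                 b[["(Intercept)"]] + b[["colonizerBritain"]],
                 b[["(Intercept)"]] + b[["colonizerFrance"]])
)
```

The two numeric columns agree: with a single categorical predictor, OLS simply recovers the group averages.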

---
class: inverse middle center
# Break

---
class: middle clear

.left-column[
<img src="https://cdn.dribbble.com/users/10549/screenshots/9890798/media/f38f0e4d71d9763c7533641d2418b35b.png?compress=1&resize=1200x900&vertical=top" width="100%" style="display: block; margin: auto;" />
]

.right-column[
<br>
]

---
class: inverse middle center
# Break

---
class: inverse middle center
# Visualizing regression models

---
class: clear
# (1) Coefficient plots

.panelset[
.panel[.panel-name[Preparation]

``` r
(plotdata <- lm_robust(poverty ~ colonizer, data = Dat) %>%
*  tidy() %>% # Turn results into a tibble,
   mutate( # Rename variables for the plot.
     term = case_when(
       term == "colonizerFrance" ~ "France",
       term == "colonizerBritain" ~ "Britain",
       term == "(Intercept)" ~ "Intercept \n (Belgium)")) %>%
   filter(term != "Intercept \n (Belgium)"))
#      term estimate std.error statistic  p.value conf.low conf.high df outcome
# 1  France    -45.7      7.97     -5.73 5.09e-07    -61.7     -29.7 52 poverty
# 2 Britain    -50.6      7.84     -6.45 3.68e-08    -66.3     -34.9 52 poverty
```
]

.panel[.panel-name[Plotting]

``` r
ggplot(data = plotdata, 
       aes(y = estimate,
           # Order by effect size
           x = reorder(term, estimate))) +
  # Reference line
  geom_hline(yintercept = 0, color = "red", lty = "dashed") + 
  # Point with confidence interval,
* geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
* coord_flip() + # Flip Y- & X-Axis,
  labs(title = "OLS regression results",
       x = "Countries colonized by:",
       y = "Average difference in poverty rate compared to countries colonized by Belgium") +
  theme_minimal()
```
]
.panel[.panel-name[Plot]
.push-left[

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]
.push-right[

<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Poverty </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 74.433*** </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (6.208) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> France </td>
   <td style="text-align:center;"> −45.668*** </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (7.968) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Britain </td>
   <td style="text-align:center;"> −50.609*** </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1.5px">  </td>
   <td style="text-align:center;box-shadow: 0px 1.5px"> (7.843) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 55 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.181 </td>
  </tr>
</tbody>
<tfoot><tr><td style="padding: 0; " colspan="100%">
<sup></sup> + p &lt; 0.1, * p &lt; 0.05, ** p &lt; 0.01, *** p &lt; 0.001</td></tr></tfoot>
</table>

]]]
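
The whiskers in the coefficient plot are the `conf.low` and `conf.high` columns from `tidy()`. As a worked check, they can be recovered from the estimates, standard errors, and degrees of freedom in the regression table, assuming the usual t-based 95% interval. The object names below are illustrative.

``` r
# Estimates, standard errors, and df copied from the regression table
est <- c(France = -45.668, Britain = -50.609)
se  <- c(France = 7.968,   Britain = 7.843)
df  <- 52

# Critical t-value for a two-sided 95% confidence interval
t_crit <- qt(0.975, df)

ci <- cbind(
  estimate  = est,
  conf.low  = est - t_crit * se,
  conf.high = est + t_crit * se
)
round(ci, 1)
```

Rounded to one decimal, these bounds match the `tidy()` output on the Preparation panel.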

---
# (2) Model predictions

.panelset[
.panel[.panel-name[OLS model]
.push-left[

``` r
(ols <- lm_robust(poverty ~ years_indep, data = Dat))
#             Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
# (Intercept)   63.190      16.30    3.88 0.000322     30.4  95.9732 48
# years_indep   -0.575       0.26   -2.21 0.031745     -1.1  -0.0524 48
```
]
.push-right[
<img src="4-OLS-Wisdoms_files/figure-html/Coefplot2-1.png" width="100%" style="display: block; margin: auto;" />
]]
.panel[.panel-name[Predictions]
.push-left[.font90[
**Step 1**: Create synthetic (i.e., fictional)  `$x$` data with theoretically relevant values.

``` r
(fict_dat <- tibble( # Create a new tibble named 'fict_dat'
  # Generate a sequence from 1 to 500. 
  # This represents years of independence
  years_indep = 1:500))
# # A tibble: 500 × 1
#    years_indep
#          <int>
#  1           1
#  2           2
#  3           3
#  4           4
#  5           5
#  6           6
#  7           7
#  8           8
#  9           9
# 10          10
# # ℹ 490 more rows
```
]]
.push-right[.font90[
**Step 2**: Predict `$\hat{y}$` from OLS model for our synthetic data.
]
.font80[

``` r
(fict_dat <- predict( # Generates predictions
  object = ols, # Use the previously fitted OLS model
  newdata = fict_dat, # Apply the model to our synthetic data
  # Calculate 95% confidence intervals and fitted values
  interval = "confidence", level = 0.95)$fit %>% 
   as_tibble() %>% # Convert results to a tibble (data frame)
   # Combine original synthetic data with predictions
   # (. represents the piped prediction results)
   bind_cols(fict_dat, .))
# # A tibble: 500 × 4
#    years_indep   fit   lwr   upr
#          <int> <dbl> <dbl> <dbl>
#  1           1  62.6  30.3  94.9
#  2           2  62.0  30.3  93.8
#  3           3  61.5  30.2  92.7
#  4           4  60.9  30.1  91.6
#  5           5  60.3  30.1  90.5
#  6           6  59.7  30.0  89.5
#  7           7  59.2  29.9  88.4
#  8           8  58.6  29.9  87.3
#  9           9  58.0  29.8  86.2
# 10          10  57.4  29.7  85.1
# # ℹ 490 more rows
```
]]]
.panel[.panel-name[Visualization]
.push-left[.font80[

``` r
# Plots years of independence on the x-axis and 
# predicted poverty on the y-axis
ggplot(data = fict_dat, aes(y = fit, x = years_indep)) +
  # Add vertical reference lines at 34 and 236 years
  geom_vline(xintercept = c(34, 236),
             color = "red", lty = "dashed") +
  # Add shaded area for 95% confidence interval
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.5) +
  # Add the main prediction line (OLS regression line)
  geom_line() +
  # Set labels for the plot
  labs(
    title = "Prediction based on OLS regression",
    x = "Years since independence",
    y = "Predicted average of extreme poverty") +
  # Use a minimal theme for clean appearance
  theme_minimal()
```
]]
.push-right[
<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" />
]]]
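
The same two steps work with base R's `lm()`. The sketch below runs on simulated data (`sim` and its coefficients are invented for illustration). Note one difference from the slides: for an `lm()` fit, `predict()` returns a matrix directly, so no `$fit` is needed.

``` r
# Simulated data (hypothetical; not the lecture's `Dat`)
set.seed(3)
sim <- data.frame(years_indep = sample(30:240, 50, replace = TRUE))
sim$poverty <- 60 - 0.2 * sim$years_indep + rnorm(50, sd = 10)

fit <- lm(poverty ~ years_indep, data = sim)

# Step 1: synthetic x values of interest
new_x <- data.frame(years_indep = c(50, 100, 200))

# Step 2: predictions with 95% confidence intervals
pred <- predict(fit, newdata = new_x,
                interval = "confidence", level = 0.95)

cbind(new_x, pred)  # columns: years_indep, fit, lwr, upr
```

Each `fit` value is just the intercept plus the slope times `years_indep`; `lwr` and `upr` bound the predicted average, not individual countries.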

---
# Learning goal achieved!

.left-column[
.font130[Is poverty lower in countries that have been independent longer?]

.font130[How do different colonial legacies compare to one another?]
]

.right-column[
<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-34-1.png" width="60%" style="display: block; margin: auto;" />

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-35-1.png" width="60%" style="display: block; margin: auto;" />
]

---
class: inverse middle center
# Break

---
class: middle clear

.right-column[
<br>
]

---
class: inverse
# Today's general lessons

1. Outliers can significantly impact OLS regression results. Cook's D helps identify these outliers.

2. OLS assumes a linear relationship between continuous predictors and outcomes. Verify this assumption for continuous predictors; it does not apply to dummy-coded categorical predictors.

3. Categorical predictors in regression are typically dummy coded, showing average outcome differences between each category and a reference group.

4. R automatically dummy codes categorical variables in OLS regression, using the first category (the first factor level; alphabetical for character vectors) as the reference.

5. Coefficient plots are standard for visualizing OLS regression results.

6. For continuous predictors, visualizing model predictions with synthetic data points is valuable.
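
Which category serves as the reference (lesson 4) can be inspected and changed with base R. This is a short sketch; the vector `colonizer` is a made-up example, and `relevel()` is one possible way to switch the reference, not something used in the slides.

``` r
# A small factor (hypothetical example values)
colonizer <- factor(c("Britain", "France", "Belgium", "Britain"))

# Levels are sorted alphabetically; the first one is the reference
levels(colonizer)

# Make Britain the reference category instead
colonizer_gb <- relevel(colonizer, ref = "Britain")
levels(colonizer_gb)
```

After releveling, the coefficients of a regression on `colonizer_gb` would express differences relative to British-colonized countries.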

---
# Today's (important) functions

- Post `lm()` diagnostics:
  + `plot(model, which = 5)`: Identify outliers
  + `plot(model, which = 1)`: Check linearity assumption

- Useful functions:
  + `tidy()`: Convert OLS results to tibble
  + `predict()`: Apply model to synthetic data to obtain `$\hat{Y}$`

---
# References

.font80[
Lange, M. and A. Dawson (2009). "Dividing and Ruling the World? A Statistical Test of the Effects of
Colonialism on Postcolonial Civil Violence". In: _Social Forces_, pp. 785-817.
]

---
class: inverse middle center
# Binary outcomes

---
class: clear
# LPM versus Generalized Linear Models (GLM)

.left-column[
**Linear Probability Model (LPM)**:

- Uses OLS to predict binary outcomes <br> (0 = "No" / 1 = "Yes")

- Predicts `$\hat{\text{P}}(y_i = 1|x_{i})$`

- Controversial issues:
 + Violates linearity assumption
 + Can predict probabilities < 0 or > 1
]

.right-column[
<img src="http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1523361626/linear_vs_logistic_regression_h8voek.jpg" width="100%" style="display: block; margin: auto;" />
.center[.backgrnote[*Source*: [datacamp: Logistic Regression in R Tutorial](https://www.datacamp.com/community/tutorials/logistic-regression-R)]]
]
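
The second controversial issue is easy to reproduce. The following sketch fits an LPM to simulated data whose true probabilities follow a steep logistic curve (an assumption chosen so that the problem shows up clearly).

``` r
# Simulated binary outcome (hypothetical data-generating process)
set.seed(11)
x <- runif(200, -3, 3)
y <- rbinom(200, size = 1, prob = plogis(2 * x))

# Linear probability model: plain OLS on the 0/1 outcome
lpm <- lm(y ~ x)
p_hat <- fitted(lpm)  # predicted probabilities

range(p_hat)                 # extends below 0 and above 1
mean(p_hat < 0 | p_hat > 1)  # share of impossible predictions
```

The average of the predicted probabilities still equals the share of ones in `y`; the out-of-range predictions occur at extreme values of `x`.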

---
# Logistic regression .font70[GLM for binary outcomes]

.push-left[
`$$\begin{align*}
\text{logit}\big(\text{P}(y_{i} = 1|x_{i})\big) &= \alpha + \beta x_{i}, \\
\text{P}(y_{i} = 1|x_{i}) &= \text{logit}^{-1}(\alpha + \beta x_{i}) \\
& = \text{logistic}(\alpha + \beta x_{i}) \\
&= \frac{1}{1+e^{-(\alpha + \beta x_{i})}}.
\end{align*}$$`

.font90[
| `$\alpha + \beta x$`  | `$e^{(\alpha + \beta x)}$` | `$\frac{1}{e^{(\alpha + \beta x)}} \color{gray}{= e^{-(\alpha + \beta x)}}$` | `$\frac{1}{1 + e^{(\alpha + \beta x)}}$` | `$\frac{1}{1 +e^{-(\alpha + \beta x)}}$` | 
|--------------------:|-------------------------:|-------------------------------------------------------------:|---------------------------------------:|---------------------------------------:|
|        -2           | 0.135                    | 7.389                                                        | 0.881                                  | 0.119                                  | 
|        -1           | 0.368                    | 2.718                                                        | 0.731                                  | 0.269                                  | 
|         0           | 1                        | 1                                                            | 0.5                                    | 0.5                                    | 
|         1           | 2.718                    | 0.368                                                        | 0.269                                  | 0.731                                  | 
|         2           | 7.389                    | 0.135                                                        | 0.119                                  | 0.881                                  |

]]

.push-right[
We need an inverse 'link function' `$g^{-1}$` that maps the results of the linear model to the [0, 1] range. The sigmoid-shaped logistic function serves this purpose; its inverse, the link function `$g$`, is the logit (log odds).

<img src="4-OLS-Wisdoms_files/figure-html/Logit-fun-1.png" width="80%" style="display: block; margin: auto;" />
]
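
The table of values can be reproduced in a few lines of base R. In this sketch `eta` stands for the linear predictor `$\alpha + \beta x$` and the column names are illustrative; `plogis()` and `qlogis()` are R's built-in logistic and logit functions.

``` r
# Values of the linear predictor alpha + beta * x
eta <- -2:2

tab <- data.frame(
  eta          = eta,
  exp_eta      = exp(eta),            # e^(eta)
  exp_neg_eta  = exp(-eta),           # 1 / e^(eta)
  one_minus_p  = 1 / (1 + exp(eta)),  # 1 / (1 + e^(eta))
  p            = plogis(eta)          # 1 / (1 + e^(-eta)), the logistic function
)
round(tab, 3)

# The logit undoes the logistic function
qlogis(tab$p)
```

The last two columns sum to one in every row, and applying `qlogis()` to `p` returns the original values of `eta`.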

---
class: middle center

GLMs (e.g., logistic regression) have their own issues (Breen, Karlson, and Holm, 2018).

.alert[We use Linear Probability Models for binary outcomes.]

For categorical outcomes, use 0/1 coding.