class: center, middle, inverse, title-slide .title[ # Multiple Regression &
Fundamentals of Causal Inference ] .subtitle[ ## 4 OLS from assumptions to visualization ] .author[ ### Merlin Schaeffer
Department of Sociology ] .date[ ### 2025-09-24 ] --- # Goal of empirical sociology .font130[.center[Use data to .alert[discover patterns], <br> and the social mechanisms that bring them about.]] <img src="https://researchleap.com/wp-content/uploads/2021/12/Population-data.jpeg" width="70%" style="display: block; margin: auto;" /> --- class: inverse # Today's schedule 1. **Today's research question**: Colonialism and poverty 2. **OLS assumptions** + No outliers + Linearity 3. **Categorical predictors** + Dummy coding 4. **Binary outcomes** + Linear probability model (LPM) 5. **Visualizing regression** + Coefficient plots + Model predictions --- class: inverse # Colonial legacy .push-left[ <img src="https://cdn.cfr.org/sites/default/files/styles/open_graph_article/public/image/2020/01/France-Macron-Pau-Summit-G5-Sahel.jpg?h=63648819" width="100%" style="display: block; margin: auto;" /> ] -- .push-right[ <img src="https://en.natmus.dk/typo3temp/assets/images/csm_neky-holbech-16x9_b72c084303_2836164db1.jpg" width="80%" style="display: block; margin: auto;" /> <img src="https://ichef.bbci.co.uk/news/976/cpsprodpb/6E2E/production/_113460282_mediaitem113460280.jpg" width="80%" style="display: block; margin: auto;" /> ] --- class: inverse # Today's research questions .left-column[ .font130[Is poverty lower in countries that have been independent longer?] .font130[How do different colonial legacies compare to one another?] 
] .right-column[ .font130[.center[Colonial empires 1945]] <img src="https://upload.wikimedia.org/wikipedia/commons/a/a9/Colonization_1945.png" width="100%" style="display: block; margin: auto;" /> .center[*Source*: [.white[Wikipedia]](https://en.wikipedia.org/wiki/File:World_1914_empires_colonies_territory.PNG)]] --- # Preparations .panelset[ .panel[.panel-name[Packages for today's session] ``` r pacman::p_load( tidyverse, # Data manipulation, ggplot2, # beautiful figures, kableExtra, # for table formatting, vdemdata, # download democracy datasets used in the scholarly literature. wbstats, # download data from Worldbank. Tremendous source of global socio-economic data. estimatr, # OLS with robust SE, modelsummary) # regression tables with nice layout, ``` ] .panel[.panel-name[Get WB data] ``` r (Dat <- wb_data("SI.POV.DDAY", # Download poverty data: <$2.15 per day, start_date = 2000, end_date = 2025) %>% rename(poverty = SI.POV.DDAY) %>% # rename poverty variable, select(country, date, poverty) %>% # Keep only 3 variables drop_na(poverty) %>% group_by(country) %>% # Group by country, filter(date == max(date)) %>% # Keep the most recent data per country. 
mutate(date = as.numeric(date)) %>% ungroup()) # # A tibble: 168 × 3 # country date poverty # <chr> <dbl> <dbl> # 1 Albania 2020 0.3 # 2 Algeria 2011 0 # 3 Angola 2018 39.3 # 4 Argentina 2023 1.2 # 5 Armenia 2023 1.9 # 6 Australia 2018 0.5 # 7 Austria 2022 0.6 # 8 Azerbaijan 2005 0 # 9 Bangladesh 2022 8 # 10 Barbados 2016 1.7 # # ℹ 158 more rows ``` ]] --- # Colonial legacy .panelset[ .panel[.panel-name[A study] .push-left[ <img src="./img/SocialForces.png" width="100%" style="display: block; margin: auto;" /> ] .push-right[ <img src="./img/Colonial.png" width="100%" style="display: block; margin: auto;" /> .center[.backgrnote[*Source*: Lange and Dawson (2009)]] ] ] .panel[.panel-name[Its data] .push-left[ <img src="./img/Colonial2.png" width="100%" style="display: block; margin: auto;" /> ] .push-right[ <img src="./img/Colonial3.png" width="100%" style="display: block; margin: auto;" /> .center[.backgrnote[*Source*: Lange and Dawson (2009)]] ]] .panel[.panel-name[Coding of colonizer] .font90[ ``` r Dat <- Dat %>% mutate( colonizer = case_when( str_detect(country, "Algeria|Benin|Burkina Faso|Cambodia|Central African Republic|Chad") | str_detect(country, "Djibouti|Gabon|Guinea|Laos|Haiti|Lebanon|Madagascar|Mali|Congo, Rep.|Cote D'Ivoire") | str_detect(country, "Mauritania|Niger|Senegal|Syria|Togo|Tunisia|Viet Nam") ~ "France", # France str_detect(country, "Bahrain|Bangladesh|Botswana|Cyprus|Egypt|Fiji") | str_detect(country, "Gambia|Ghana|Guyana|India|Iraq|Jamaica|Jordan|Kenya|Kuwait") | str_detect(country, "Lesotho|Malawi|Malaysia|Mauritius|Myanmar|Nigeria|Oman") | str_detect(country, "Pakistan|Qatar|Sierra Leone|Singapore") | str_detect(country, "Sri Lanka|Sudan|Swaziland|Tanzania|Trinidad/Tobago|Uganda") | str_detect(country, "United Arab Emirates|Yemen|Zambia|Zimbabwe") ~ "Britain", # Britain str_detect(country, "Burundi|Congo, Dem. 
Rep.|Rwanda") ~ "Belgium")) # Belgium # Australia, South Africa, Canada, Israel, New Zealand, United States; left out as settler societies # str_detect(country, "Angola|Brazil|Mozambique|Guinea-Bissau") ~ "Portugal", # Portugal # str_detect(country, "Argentina|Bolivia|Chile|Colombia|Costa Rica|Cuba|Dominican Republic|Ecuador") | # str_detect(country, "El Salvador|Guatemala|Honduras|Mexico|Nicaragua|Panama|Paraguay|Peru") | # str_detect(country, "Uruguay|Venezuela, RB") ~ "Spain", # Spain # str_detect(country, "Liberia|Philippines") ~ "USA", # USA # str_detect(country, "Libya|Somalia") ~ "Italy", # Italy # str_detect(country, "Indonesia") ~ "Holland", # Holland # str_detect(country, "Namibia") ~ "South Africa", # South Africa # str_detect(country, "Korea|Taiwan") ~ "Japan")) # Japan ``` ]] .panel[.panel-name[Plot Colonizer] <img src="4-OLS-Wisdoms_files/figure-html/col-powers-1.png" width="100%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Coding of independence] .font60[ ``` r Dat <- Dat %>% mutate( years_indep = case_when( is.na(colonizer) ~ as.numeric(NA), country == "United States" ~ date - 1783, country == "Haiti" ~ date - 1804, country == "Paraguay" ~ date - 1811, country == "Chile" ~ date - 1818, str_detect(country, "Argentina|Bolivia|Colombia") ~ date - 1819, str_detect(country, "Costa Rica|Dominican Republic|Mexico|Nicaragua|Panama|El Salvador|Guatemala|Honduras|Venezuela") ~ date - 1821, str_detect(country, "Brazil|Ecuador") ~ date - 1822, country == "Peru" ~ date - 1824, country == "Uruguay" ~ date - 1828, country == "Cuba" ~ date - 1899, country == "Australia" ~ date - 1901, country == "New Zealand" ~ date - 1907, country == "South Africa" ~ date - 1910, country == "Egypt" ~ date - 1922, country == "Iraq" ~ date - 1932, str_detect(country, "Korea|Taiwan|Vietnam") ~ date - 1945, str_detect(country, "Lebanon|Philippines|Syria") ~ date - 1946, str_detect(country, "Bangladesh|Pakistan|India|Liberia") ~ date - 1947, str_detect(country, 
"Myanmar|Israel|Jordan|Sri Lanka") ~ date - 1948,
      country == "Indonesia" ~ date - 1949,
      country == "Libya" ~ date - 1951,
      str_detect(country, "Cambodia|Laos") ~ date - 1954,
      str_detect(country, "Morocco|Sudan|Tunisia") ~ date - 1956,
      str_detect(country, "Malaysia|Ghana") ~ date - 1957,
      country == "Guinea" ~ date - 1958,
      country == "Singapore" ~ date - 1959,
      str_detect(country, "Benin|Burkina Faso|Central African Republic|Chad|Congo, Dem. Rep.|Congo, Rep.|Cote D'Ivoire|Mali|Mauritania|Niger|Nigeria|Senegal|Gabon|Somalia|Togo") ~ date - 1960,
      str_detect(country, "Kuwait|Sierra Leone|Tanzania") ~ date - 1961,
      str_detect(country, "Algeria|Burundi|Rwanda|Jamaica|Trinidad/Tobago|Uganda") ~ date - 1962,
      country == "Kenya" ~ date - 1963,
      str_detect(country, "Malawi|Zambia") ~ date - 1964,
      str_detect(country, "Gambia|Zimbabwe") ~ date - 1965,
      str_detect(country, "Botswana|Lesotho|Guyana") ~ date - 1966,
      str_detect(country, "Canada|Yemen") ~ date - 1967,
      str_detect(country, "Mauritius|Swaziland") ~ date - 1968,
      country == "Fiji" ~ date - 1970,
      str_detect(country, "Bahrain|Oman|Qatar|United Arab Emirates") ~ date - 1971,
      country == "Guinea-Bissau" ~ date - 1974,
      str_detect(country, "Angola|Mozambique|Papua New Guinea") ~ date - 1975,
      country == "Djibouti" ~ date - 1977,
      country == "Namibia" ~ date - 1990))
```
]]
.panel[.panel-name[Plot indep.]
<img src="4-OLS-Wisdoms_files/figure-html/pov-indep-1.png" width="100%" style="display: block; margin: auto;" /> ]] --- # Poverty and years of independence .left-column[.font90[ ``` r # Estimate OLS regression ols <- lm_robust( poverty ~ years_indep, data = Dat) # Regression table modelsummary( list("Poverty" = ols), stars = TRUE, gof_map = c("nobs", "r.squared"), output = "kableExtra") ``` <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Poverty </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 32.136* </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (14.035) </td> </tr> <tr> <td style="text-align:left;"> years_indep </td> <td style="text-align:center;"> −0.039 </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.229) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 51 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.001 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ]] .right-column[ <img src="4-OLS-Wisdoms_files/figure-html/indep-world-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse middle center # OLS assumptions --- class: clear # OLS Assumptions .left-column[ .content-box-blue[ 1. **No outliers** 2. **Linearity** 3. Homoscedasticity; don't worry: `lm_robust()`. 4. Independent observations. `\(\rightarrow\)` Scatter plots! 
]] .right-column[ <iframe src='https://seeing-theory.brown.edu/regression-analysis/index.html#section1' width='700' height='580' frameborder='0' scrolling='yes'></iframe> ] --- # Outlier .left-column[ - **Gray dotted line:** OLS fit - **Z-standardized residuals:** Distance from regression line; Examples: + Nr. 62 Haiti (poverty rate 40.4%). + Nr. 32 Congo, Dem. Rep. (poverty rate 85.3%). + Nr. 69 Ireland (poverty rate 0.2%). - **Leverage:** High influence on regression; `\(x_i\)` far from `\(\bar{x}\)`. - **Cook's D**: Change in `\(\sum{\hat{y}}\)` (in std. residuals) if case `\(i\)` was removed + Thresholds: gray dashed lines! ] .right-column[ ``` r # Re-estimate model using lm(), lm(poverty ~ years_indep, data = Dat) %>% * plot(., which = 5) # The best outlier plot. ``` <img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-15-1.png" width="75%" style="display: block; margin: auto;" /> ] --- # Linearity .left-column[ - **Gray dotted line:** OLS fit - **Fitted values:** `\(\hat{Y}\)` - **Red line:** Smoothed relationship between residuals and fitted values - **Ideal:** Red line matches gray dotted line - **Our case:** Linearity assumption clearly violated ] .right-column[ ``` r # Re-estimate model using lm(), lm(poverty ~ years_indep, data = Dat) %>% * plot(., which = 1) # The best linearity plot. ``` <img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-16-1.png" width="75%" style="display: block; margin: auto;" /> ] --- layout: true # Removing the outlier .left-column[ ``` r Dat <- Dat %>% * filter(country != "Haiti") ``` ] --- .right-column[ ``` r # Re-estimate model using lm(), lm(poverty ~ years_indep, data = Dat) %>% * plot(., which = 5) # The best outlier plot. ``` <img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-18-1.png" width="75%" style="display: block; margin: auto;" /> ] --- .right-column[ ``` r # Re-estimate model using lm(), lm(poverty ~ years_indep, data = Dat) %>% * plot(., which = 1) # The best linearity plot. 
```
<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-19-1.png" width="75%" style="display: block; margin: auto;" />
]

---
layout: false
# Poverty and years of independence

.left-column[.font90[
``` r
# Estimate OLS regression
ols <- lm_robust(
  poverty ~ years_indep,
  data = Dat)

# Regression table
modelsummary(
  list("Poverty" = ols),
  stars = TRUE,
  gof_map = c("nobs", "r.squared"),
  output = "kableExtra")
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Poverty </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 63.190*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (16.305) </td> </tr> <tr> <td style="text-align:left;"> years_indep </td> <td style="text-align:center;"> −0.575* </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.260) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 50 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.047 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>
]]

.right-column[
<img src="4-OLS-Wisdoms_files/figure-html/indep-world2-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: inverse middle center
# Categorical predictors

---
# Categorical predictors

.panelset[
.panel[.panel-name[Scatter plot]
.left-column[
.content-box-green[
1. What `\(\hat{Y}\)` value corresponds to British-colonized countries?
2. What `\(\hat{Y}\)` value corresponds to Belgian-colonized countries?
3. How does this difference in `\(\hat{Y}\)` values relate to `\(\hat{\beta}\)`?
]]
.right-column[
<img src="4-OLS-Wisdoms_files/figure-html/categorical-1.png" width="100%" style="display: block; margin: auto;" />
]]
.panel[.panel-name[Dummy coding]
.push-left[
`$$x= \begin{cases} 1, & \text{if condition is met} \\ 0, & \text{otherwise} \end{cases}$$`

Country                          | Britain | France | ...
---------------------------------|----|----|----
Kenya                            | 1 | 0 | 0
India                            | 1 | 0 | 0
...                              | 1 | 0 | 0
Cambodia                         | 0 | 1 | 0
Algeria                          | 0 | 1 | 0
...                              | 0 | 1 | 0
Reference <br> .backgrnote[(Belgium)] | 0 | 0 | 0

]
.push-right[
<img src="4-OLS-Wisdoms_files/figure-html/categorical2-1.png" width="100%" style="display: block; margin: auto;" />
]]
.panel[.panel-name[How it's done in R]
.push-left[.font70[
``` r
# R recognizes categorical variables automatically,
# if they are factor or character vectors.
ols_2 <- lm_robust(poverty ~ colonizer, data = Dat)

# Regression table
modelsummary(list("Poverty" = ols_2), stars = TRUE,
             # Rename for a better-looking table
             coef_rename = c("colonizerFrance" = "France",
                             "colonizerBritain" = "Britain"),
             gof_map = c("nobs", "r.squared"),
             output = "kableExtra")
```
]
<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-22-1.png" width="75%" style="display: block; margin: auto;" />
]
.push-right[
<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Poverty </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 74.433*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (6.208) </td> </tr> <tr> <td style="text-align:left;"> France </td> <td style="text-align:center;"> −45.668*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (7.968) </td> </tr> <tr> <td style="text-align:left;"> Britain </td> <td style="text-align:center;"> −50.609*** </td> </tr> <tr> <td
style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (7.843) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 55 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.181 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ]] .panel[.panel-name[Interpretation] .push-left[.font90[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Poverty </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 74.433*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (6.208) </td> </tr> <tr> <td style="text-align:left;"> France </td> <td style="text-align:center;"> −45.668*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (7.968) </td> </tr> <tr> <td style="text-align:left;"> Britain </td> <td style="text-align:center;"> −50.609*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (7.843) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. 
</td> <td style="text-align:center;"> 55 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.181 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>
]]
.push-right[.font90[
$$
`\begin{aligned} \operatorname{\widehat{poverty}} &= 74.43 - 45.67(\operatorname{colonizer}_{\operatorname{France}})\ - \\ &\quad 50.61(\operatorname{colonizer}_{\operatorname{Britain}}) \end{aligned}`
$$

- Average poverty in Belgian-colonized countries: 74.43%
  + (When France = Britain = 0)
- French-colonized countries compared to Belgian: 45.67 percentage points lower poverty on average
  + Average poverty in French-colonized countries: 74.43 - 45.67 = 28.76%
- British-colonized countries compared to Belgian: 50.61 percentage points lower poverty on average
  + Average poverty in British-colonized countries: 74.43 - 50.61 = 23.82%
]]]]

---
class: inverse middle center
# Break

<iframe src='https://panel.letstimeit.com/instant-timer/15-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe>

---
class: middle clear

.left-column[
<img src="https://cdn.dribbble.com/users/10549/screenshots/9890798/media/f38f0e4d71d9763c7533641d2418b35b.png?compress=1&resize=1200x900&vertical=top" width="100%" style="display: block; margin: auto;" />

<iframe src='https://panel.letstimeit.com/instant-timer/20-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe>
]

.right-column[
<br>
<iframe src='exercise1.html' width='1000' height='600' frameborder='0' scrolling='yes'></iframe>
]

---
class: inverse middle center
# Break

<iframe src='https://panel.letstimeit.com/instant-timer/10-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe>

---
class: inverse middle center
# Visualizing regression models

---
class: clear
# (1) Coefficient plots

.panelset[
.panel[.panel-name[Preparation]
``` r
(plotdata <-
  lm_robust(poverty ~ colonizer, data = Dat) %>%
*   tidy() %>% # Turn results into a tibble,
    mutate( # Rename variables for the plot.
      term = case_when(
        term == "colonizerFrance" ~ "France",
        term == "colonizerBritain" ~ "Britain",
        term == "(Intercept)" ~ "Intercept \n (Belgium)")) %>%
    filter(term != "Intercept \n (Belgium)"))
#      term estimate std.error statistic  p.value conf.low conf.high df outcome
# 1  France    -45.7      7.97     -5.73 5.09e-07    -61.7     -29.7 52 poverty
# 2 Britain    -50.6      7.84     -6.45 3.68e-08    -66.3     -34.9 52 poverty
```
]
.panel[.panel-name[Plotting]
``` r
ggplot(data = plotdata, aes(y = estimate,
                            # Order by effect size
                            x = reorder(term, estimate))) +
  # Reference line
  geom_hline(yintercept = 0, color = "red", lty = "dashed") +
  # Point with confidence interval,
* geom_pointrange(aes(min = conf.low, max = conf.high)) +
* coord_flip() + # Flip Y- & X-Axis,
  labs(title = "OLS regression results",
       x = "Countries colonized by:",
       y = "Average difference in poverty rate compared to countries colonized by Belgium") +
  theme_minimal()
```
]
.panel[.panel-name[Plot]
.push-left[
<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]
.push-right[
<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Poverty </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 74.433*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (6.208) </td> </tr> <tr> <td style="text-align:left;"> France </td> <td style="text-align:center;"> −45.668*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (7.968) </td> </tr> <tr> <td style="text-align:left;"> Britain </td> <td style="text-align:center;"> −50.609*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px
1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (7.843) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 55 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.181 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ]]] --- # (2) Model predictions .panelset[ .panel[.panel-name[OLS model] .push-left[ ``` r (ols <- lm_robust(poverty ~ years_indep, data = Dat)) # Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF # (Intercept) 63.190 16.30 3.88 0.000322 30.4 95.9732 48 # years_indep -0.575 0.26 -2.21 0.031745 -1.1 -0.0524 48 ``` ] .push-right[ <img src="4-OLS-Wisdoms_files/figure-html/Coefplot2-1.png" width="100%" style="display: block; margin: auto;" /> ]] .panel[.panel-name[Predictions] .push-left[.font90[ **Step 1**: Create synthetic (i.e., fictional) `\(x\)` data with theoretically relevant values. ``` r (fict_dat <- tibble( # Create a new tibble named 'fict_dat' # Generate a sequence from 1 to 500. # This represents years of independence years_indep = 1:500)) # # A tibble: 500 × 1 # years_indep # <int> # 1 1 # 2 2 # 3 3 # 4 4 # 5 5 # 6 6 # 7 7 # 8 8 # 9 9 # 10 10 # # ℹ 490 more rows ``` ]] .push-right[.font90[ **Step 2**: Predict `\(\hat{y}\)` from OLS model for our synthetic data. ] .font80[ ``` r (fict_dat <- predict( # Generates predictions object = ols, # Use the previously fitted OLS model newdata = fict_dat, # Apply the model to our synthetic data # Calculate 95% confidence intervals and fitted values interval = "confidence", level = 0.95)$fit %>% as_tibble() %>% # Convert results to a tibble (data frame) # Combine original synthetic data with predictions # (. 
represents the piped prediction results) bind_cols(fict_dat, .)) # # A tibble: 500 × 4 # years_indep fit lwr upr # <int> <dbl> <dbl> <dbl> # 1 1 62.6 30.3 94.9 # 2 2 62.0 30.3 93.8 # 3 3 61.5 30.2 92.7 # 4 4 60.9 30.1 91.6 # 5 5 60.3 30.1 90.5 # 6 6 59.7 30.0 89.5 # 7 7 59.2 29.9 88.4 # 8 8 58.6 29.9 87.3 # 9 9 58.0 29.8 86.2 # 10 10 57.4 29.7 85.1 # # ℹ 490 more rows ``` ]]] .panel[.panel-name[Visualization] .push-left[.font80[ ``` r # Plots years of independence on the x-axis and # predicted poverty on the y-axis ggplot(data = fict_dat, aes(y = fit, x = years_indep)) + # Add vertical reference lines at 34 and 236 years geom_vline(xintercept = c(34, 236), color = "red", lty = "dashed") + # Add shaded area for 95% confidence interval geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.5) + # Add the main prediction line (OLS regression line) geom_line() + # Set labels for the plot labs( title = "Prediction based on OLS regression", x = "Years since independence", y = "Predicted average of extreme poverty") + # Use a minimal theme for clean appearance theme_minimal() ``` ]] .push-right[ <img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" /> ]]] --- # Learning goal achieved! .left-column[ .font130[Is poverty lower in countries that have been independent longer?] .font130[How do different colonial legacies compare to one another?] 
]

.right-column[
<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-34-1.png" width="60%" style="display: block; margin: auto;" />

<img src="4-OLS-Wisdoms_files/figure-html/unnamed-chunk-35-1.png" width="60%" style="display: block; margin: auto;" />
]

---
class: inverse middle center
# Break

<iframe src='https://panel.letstimeit.com/instant-timer/15-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe>

---
class: middle clear

.left-column[
<img src="https://cdn.dribbble.com/users/10549/screenshots/9890798/media/f38f0e4d71d9763c7533641d2418b35b.png?compress=1&resize=1200x900&vertical=top" width="100%" style="display: block; margin: auto;" />

<iframe src='https://panel.letstimeit.com/instant-timer/20-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe>
]

.right-column[
<br>
<iframe src='exercise2.html' width='1000' height='600' frameborder='0' scrolling='yes'></iframe>
]

---
class: inverse
# Today's general lessons

1. Outliers can substantially change OLS regression results. Cook's D helps identify such influential cases.
2. OLS assumes a linear relationship between continuous predictors and the outcome. Verify this assumption for continuous predictors; it does not apply to dummy-coded categorical predictors.
3. Categorical predictors in regression are typically dummy coded, showing average outcome differences between each category and a reference group.
4. R automatically dummy codes categorical variables in OLS regression, using the first category (the first factor level; alphabetical for character vectors) as the reference.
5. Coefficient plots are standard for visualizing OLS regression results.
6. For continuous predictors, visualizing model predictions with synthetic data points is valuable.

---
# Today's (important) functions

- Post `lm()` diagnostics:
  + `plot(model, which = 5)`: Identify outliers
  + `plot(model, which = 1)`: Check linearity assumption
- Useful functions:
  + `tidy()`: Convert OLS results to tibble
  + `predict()`: Apply model to synthetic data to obtain `\(\hat{Y}\)`

---
# References

.font80[
Lange, M. and A.
Dawson (2009). "Dividing and Ruling the World? A Statistical Test of the Effects of Colonialism on Postcolonial Civil Violence". In: _Social Forces_, pp. 785-817.
]

---
class: inverse middle center
# Binary outcomes

---
class: clear
# LPM versus Generalized Linear Models (GLM)

.left-column[
**Linear Probability Model (LPM)**:

- Uses OLS to predict binary outcomes <br> (0 = "No" / 1 = "Yes")
- Predicts `\(\hat{\text{P}}(y_i = 1|x_{i})\)`
- Controversial issues:
  + Violates linearity assumption
  + Can predict probabilities < 0 or > 1
]

.right-column[
<img src="http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1523361626/linear_vs_logistic_regression_h8voek.jpg" width="100%" style="display: block; margin: auto;" />

.center[.backgrnote[*Source*: [datacamp: Logistic Regression in R Tutorial](https://www.datacamp.com/community/tutorials/logistic-regression-R)]]
]

---
# Logistic regression .font70[GLM for binary outcomes]

.push-left[
`$$\begin{align*} \text{logit}(\text{P}(y = 1|x)) &= \alpha + \beta x, \\ \text{P}(y = 1|x) &= \text{logit}^{-1}(\alpha + \beta x) \\ & = \text{logistic}(\alpha + \beta x) \\ &= \frac{1}{1+e^{-(\alpha + \beta x)}}. \end{align*}$$`

.font90[

| `\(\alpha + \beta x\)` | `\(e^{(\alpha + \beta x)}\)` | `\(\frac{1}{e^{(\alpha + \beta x)}} \color{gray}{= e^{-(\alpha + \beta x)}}\)` | `\(\frac{1}{1 + e^{(\alpha + \beta x)}}\)` | `\(\frac{1}{1 +e^{-(\alpha + \beta x)}}\)` |
|--------------------:|-------------------------:|-------------------------------------------------------------:|---------------------------------------:|---------------------------------------:|
| -2 | 0.135 | 7.389 | 0.881 | 0.119 |
| -1 | 0.368 | 2.718 | 0.731 | 0.269 |
| 0 | 1 | 1 | 0.5 | 0.5 |
| 1 | 2.718 | 0.368 | 0.269 | 0.731 |
| 2 | 7.389 | 0.135 | 0.119 | 0.881 |

]]

.push-right[
We need a function `\(g^{-1}\)`, the inverse of the 'link function' `\(g\)`, that maps the linear model results to the [0, 1] range. The sigmoid-shaped logistic function serves this purpose; its inverse, the logit (log odds), is the link function.
<img src="4-OLS-Wisdoms_files/figure-html/Logit-fun-1.png" width="80%" style="display: block; margin: auto;" /> ] --- class: middle center GLMs (e.g., logistic regression) have their own issues (Breen, Karlson, and Holm, 2018). .alert[We use Linear Probability Models for binary outcomes.] For categorical outcomes, use 0/1 coding.
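
---
# LPM in practice: a minimal sketch

.font90[
The recommendation above can be illustrated with a small sketch that fits an LPM and a logistic regression to the same binary outcome and compares their predicted probabilities. Everything here is an assumption made up for illustration: the data are simulated (not the course data), and the names `sim_dat`, `lpm`, `logit`, and `grid`, the seed, and the coefficients -2 and 0.4 are hypothetical. Base `lm()` keeps the example free of dependencies; because an LPM is heteroscedastic by construction, `lm_robust()` would be the natural choice in practice.
]

``` r
# Simulated data, purely illustrative (NOT the course data).
set.seed(42)
n <- 200
x <- runif(n, min = 0, max = 10)
# Assumption for this sketch: the true probability follows a logistic curve.
p <- plogis(-2 + 0.4 * x)
y <- rbinom(n, size = 1, prob = p)
sim_dat <- data.frame(x = x, y = y)

# LPM: plain OLS on the 0/1 outcome.
lpm <- lm(y ~ x, data = sim_dat)

# Logistic regression (GLM with logit link) for comparison.
logit <- glm(y ~ x, data = sim_dat, family = binomial(link = "logit"))

# Predicted probabilities for synthetic x values, as in the prediction slides.
grid <- data.frame(x = 0:10)
grid$lpm_hat   <- predict(lpm, newdata = grid)
grid$logit_hat <- predict(logit, newdata = grid, type = "response")
grid
```

The LPM slope reads directly as the average change in `\(\text{P}(y = 1)\)` per unit of `\(x\)`. The two sets of predictions tend to agree in the middle of the `\(x\)` range and to diverge toward the extremes, where the LPM can leave the [0, 1] interval.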