class: center, middle, inverse, title-slide .title[ # Multiple Regression &
Fundamentals of Causal Inference ] .subtitle[ ##
2. Recap: Correlation & OLS Regression ] .author[ ### Merlin Schaeffer
Department of Sociology ] .date[ ### 2025-09-10 ] --- # Goal of empirical sociology .font130[.center[Use data to .alert[discover patterns], <br> and the social mechanisms that bring them about.]] <img src="https://researchleap.com/wp-content/uploads/2021/12/Population-data.jpeg" width="70%" style="display: block; margin: auto;" /> --- class: inverse # Today's schedule 1. **Today's research question**: Socialism, citizenship rights, and poverty. + Application Programming Interfaces (API) + World Bank API + Democracy data API + Join different data sources 2. **Recap** 2.1 *Scatter plots* 2.2 *Correlation* + Z-standardization + `\(r_{y,x}\)` 2.3 *Bivariate OLS regression* + OLS estimation + Causal versus descriptive interpretation --- class: clear # Remember? .font70[Civic and political Citizenship rights across the world] .right-column[ [Freedom House World Map 2021](https://freedomhouse.org/explore-the-map?type=fiw&year=2020) <img src="./img/FreedomHouse.png" width="100%" style="display: block; margin: auto;" /> ] .left-column[ One may criticize:<br> *Aren't socialist countries better at providing* **social** *citizenship rights, like affordable housing, healthcare, work, and minimum quality of life?* ] --- class: inverse # Today's research question .center[.font140[ **Is there a freedom/equality trade-off?** ] .font110[ In other words:<br> **Are socialist countries good at reducing poverty**,<br> potentially at the cost of offering less freedom? ]] <br> .push-left[ <img src="https://miro.medium.com/max/1280/1*8Y_EPw2a67TRRos3b24YlA.jpeg" width="90%" style="display: block; margin: auto;" /> ] .push-right[ <img src="https://chineseposters.net/sites/default/files/2020-06/pc-1968-l-005.jpg" width="85%" style="display: block; margin: auto;" /> ] --- # Preparations .panelset[ .panel[.panel-name[Packages for today's session] ``` r pacman::p_load( tidyverse, # Data manipulation, ggplot2, # beautiful figures, * wbstats, # download data from Worldbank. 
Tremendous source of global socio-economic data. * vdemdata, # Use Varieties of Democracy (V-Dem) data, estimatr, # OLS with robust SE, modelsummary, # regression tables with nice layout, countrycode) # Easy recodings of country names. ``` ]] --- class: clear # (1) Varieties of Democracy Data .left-column[ We focus on: **Equality before the law & individual freedom:** impartial public administration, transparent laws with predictable enforcement, access to justice for men/women, property rights for men/women, freedom from torture, freedom from political killings, freedom from forced labor for men/women, freedom of religion, freedom of foreign movement, and freedom of domestic movement for men/women. ] .right-column[ .panelset[ .panel[.panel-name[The data] <iframe src='https://en.wikipedia.org/wiki/V-Dem_Institute' width='1200' height='480' frameborder='0' scrolling='yes'></iframe> ] .panel[.panel-name[Get the data] ``` r (Dat_vdem <- vdem %>% # The data are part of the package we loaded earlier as_tibble() %>% # turn into a tibble select(country_name, country_text_id, year, v2xcl_rol) %>% rename(equal_liberty = v2xcl_rol, # rename liberty variable country = country_name)) # rename country variable # # A tibble: 27,913 × 4 # country country_text_id year equal_liberty # <chr> <chr> <dbl> <dbl> # 1 Mexico MEX 1789 0.204 # 2 Mexico MEX 1790 0.204 # 3 Mexico MEX 1791 0.204 # 4 Mexico MEX 1792 0.204 # 5 Mexico MEX 1793 0.204 # 6 Mexico MEX 1794 0.204 # 7 Mexico MEX 1795 0.204 # 8 Mexico MEX 1796 0.204 # 9 Mexico MEX 1797 0.204 # 10 Mexico MEX 1798 0.204 # # ℹ 27,903 more rows ``` ] .panel[.panel-name[Denmark & Germany since 1789] <img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(data = Dat_vdem %>% filter(country == "Denmark" | country == "Germany"), # Make coordinate system for data from Denmark and Germany, aes(y = equal_liberty, x = year, # Y- and X-axis of plot, color = country)) + 
geom_line() + # plot data as lines, one per country, labs(y = "Index of equality before the law \n and individual liberty", x = "", caption = "Source: V-Dem", color = "") + # Axis labels, theme_minimal() + # Simple background layout, theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position="bottom") # Write axis labels in a 60 degree angle, legend at the bottom. ``` ] .panel[.panel-name[Across the world '24] <img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(data = Dat_vdem %>% filter(year == 2024), # Make coordinate system for data from 2024, aes(y = equal_liberty, # Y- and X-axis of plot, x = reorder(country, equal_liberty))) + geom_bar(stat = "identity") + # plot data as is in a bar chart, labs(y = "Index of equality before the law \n and individual liberty in 2024", x = "", caption = "Source: V-Dem") + # Axis labels, theme_minimal() + # Simple background layout, theme(axis.text.x = element_text(angle = 60, hjust = 1)) # Write country names in a 60 degree angle. ``` ]]] --- class: clear # Application Programming Interfaces (API) <img src="https://media.geeksforgeeks.org/wp-content/uploads/20230216170349/What-is-an-API.png" width="90%" style="display: block; margin: auto;" /> --- class: clear # (2) World Bank Data .font70[Poverty across the world] .panelset[ .panel[.panel-name[Searching WB archive] - With `wbstats::wb_search()`, you can search the WB archive for any keyword! 
Here I use "poverty": ``` r *(wb_poverty_archive <- wb_search("Poverty")) # Search WB data bank for "Poverty" # # A tibble: 702 × 3 # indicator_id indicator indicator_desc # <chr> <chr> <chr> # 1 1.0.HCount.1.90usd Poverty Headcount ($1.90 a day) The poverty headcount index measures the proportio… # 2 1.0.HCount.2.5usd Poverty Headcount ($2.50 a day) The poverty headcount index measures the proportio… # 3 1.0.HCount.Mid10to50 Middle Class ($10-50 a day) Headcount The poverty headcount index measures the proportio… # 4 1.0.HCount.Ofcl Official Moderate Poverty Rate-National The poverty headcount index measures the proportio… # 5 1.0.HCount.Poor4uds Poverty Headcount ($4 a day) The poverty headcount index measures the proportio… # 6 1.0.HCount.Vul4to10 Vulnerable ($4-10 a day) Headcount The poverty headcount index measures the proportio… # 7 1.0.PGap.1.90usd Poverty Gap ($1.90 a day) The poverty gap captures the mean aggregate income… # 8 1.0.PGap.2.5usd Poverty Gap ($2.50 a day) The poverty gap captures the mean aggregate income… # 9 1.0.PGap.Poor4uds Poverty Gap ($4 a day) The poverty gap captures the mean aggregate income… # 10 1.0.PSev.1.90usd Poverty Severity ($1.90 a day) The poverty severity index combines information on… # # ℹ 692 more rows ``` ] .panel[.panel-name[Use WB API] ``` r *(Dat_poverty <- wb_data("SI.POV.DDAY", # Download poverty data: <$2.15 per day, * start_date = 1972, end_date = 2025) %>% rename(poverty = SI.POV.DDAY, # rename poverty variable, year = date, # rename year variable, country_text_id = iso3c) %>% # rename country abbreviation, select(country_text_id, year, country, poverty) %>% # Keep only these 4 variables, drop_na(poverty) %>% # Drop cases with missing data, group_by(country) %>% filter(year == max(year)) %>% ungroup()) # Keep the most recent poverty statistic per country. 
# # A tibble: 171 × 4 # country_text_id year country poverty # <chr> <dbl> <chr> <dbl> # 1 ALB 2020 Albania 0.3 # 2 DZA 2011 Algeria 0 # 3 AGO 2018 Angola 39.3 # 4 ARG 2023 Argentina 1.2 # 5 ARM 2023 Armenia 1.9 # 6 AUS 2018 Australia 0.5 # 7 AUT 2022 Austria 0.6 # 8 AZE 2005 Azerbaijan 0 # 9 BGD 2022 Bangladesh 8 # 10 BRB 2016 Barbados 1.7 # # ℹ 161 more rows ``` ] .panel[.panel-name[Purchasing power parity (PPP)] <img src="./img/PPP2.png" width="100%" style="display: block; margin: auto;" /> .push-left[ <img src="./img/PPP.png" width="75%" style="display: block; margin: auto;" /> ] .push-right[ <br> .content-box-red[ $1 buys in the US what kr. 6.5 buys in Denmark. `\(\rightarrow\)` US$2.15 `\(\approx\)` kr. 14 per day. `\(\rightarrow\)` Less than `\((30\text{ days}\times\text{kr. }14 \approx)\ \text{kr. }420\)` to get by per month. ]]] .panel[.panel-name[Poverty across the world] <img src="2-Corr-n-Reg_files/figure-html/poverty-world-1.png" width="100%" style="display: block; margin: auto;" /> ]] --- class: inverse # .push-left[ <br> <br> <br> <br> OK great, now I have two tibbles. But how can I combine them? ] .push-right[ <img src="https://powietrze.malopolska.pl/wp-content/uploads/2020/10/q.jpg" width="96%" style="display: block; margin: auto;" /> ] --- # Relational data If you work with multiple tibbles, you work with relational data .alert[if they have one or more variable(s) in common]. Our tibbles are related because both contain *countries* observed in various *years*. <br>The combination of country + year is the .alert[key] that allows us to relate both tibbles. 
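A minimal sketch of what a key does, using two made-up toy tibbles (`toy_vdem`, `toy_poverty`, and all their values are invented for illustration, not real V-Dem or World Bank data): a valid key identifies each row only once, and only key combinations found in both tibbles can be matched.

```r
library(dplyr)

# Toy stand-ins for the two tibbles (hypothetical values).
toy_vdem <- tibble(
  country_text_id = c("DNK", "DNK", "DEU"),
  year            = c(2019, 2020, 2020),
  equal_liberty   = c(0.97, 0.97, 0.96)
)
toy_poverty <- tibble(
  country_text_id = c("DNK", "DEU", "FRA"),
  year            = c(2020, 2020, 2020),
  poverty         = c(0.2, 0.1, 0.1)
)

# A key is valid if no country-year combination occurs more than once.
dup_vdem    <- toy_vdem    %>% count(country_text_id, year) %>% filter(n > 1)
dup_poverty <- toy_poverty %>% count(country_text_id, year) %>% filter(n > 1)
key_is_unique <- nrow(dup_vdem) == 0 & nrow(dup_poverty) == 0

# Matching on the key keeps only country-years present in both tibbles.
toy_joined <- inner_join(toy_poverty, toy_vdem,
                         by = c("country_text_id", "year"))
toy_joined
```

In this toy example `FRA` has no V-Dem row and `DNK` in 2019 has no poverty row, so only the two country-years that appear in both tibbles should survive the match.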
.push-left[ ``` r (Dat_vdem <- Dat_vdem %>% select(-country)) # Drop the country variable # # A tibble: 27,913 × 3 # country_text_id year equal_liberty # <chr> <dbl> <dbl> # 1 MEX 1789 0.204 # 2 MEX 1790 0.204 # 3 MEX 1791 0.204 # 4 MEX 1792 0.204 # 5 MEX 1793 0.204 # 6 MEX 1794 0.204 # 7 MEX 1795 0.204 # 8 MEX 1796 0.204 # 9 MEX 1797 0.204 # 10 MEX 1798 0.204 # # ℹ 27,903 more rows ``` ] .push-right[ ``` r Dat_poverty # # A tibble: 171 × 4 # country_text_id year country poverty # <chr> <dbl> <chr> <dbl> # 1 ALB 2020 Albania 0.3 # 2 DZA 2011 Algeria 0 # 3 AGO 2018 Angola 39.3 # 4 ARG 2023 Argentina 1.2 # 5 ARM 2023 Armenia 1.9 # 6 AUS 2018 Australia 0.5 # 7 AUT 2022 Austria 0.6 # 8 AZE 2005 Azerbaijan 0 # 9 BGD 2022 Bangladesh 8 # 10 BRB 2016 Barbados 1.7 # # ℹ 161 more rows ``` ] --- # **Join** .font60[Four types] .push-left[ <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/inner-join.gif" width="90%" style="display: block; margin: auto;" /> .center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]] ] -- .push-right[ <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/left-join.gif" width="90%" style="display: block; margin: auto;" /> .center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]] ] --- # **Join** .font60[Four types] .push-left[ <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/right-join.gif" width="90%" style="display: block; margin: auto;" /> .center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]] ] -- .push-right[ <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/full-join.gif" width="90%" style="display: block; margin: auto;" /> .center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]] ] --- # Inner join .font70[Poverty and citizenship rights] ``` r *(Dat <- inner_join(Dat_poverty, 
Dat_vdem, by = c("country_text_id", "year"))) # # A tibble: 161 × 5 # country_text_id year country poverty equal_liberty # <chr> <dbl> <chr> <dbl> <dbl> # 1 ALB 2020 Albania 0.3 0.912 # 2 DZA 2011 Algeria 0 0.564 # 3 AGO 2018 Angola 39.3 0.564 # 4 ARG 2023 Argentina 1.2 0.869 # 5 ARM 2023 Armenia 1.9 0.845 # 6 AUS 2018 Australia 0.5 0.967 # 7 AUT 2022 Austria 0.6 0.934 # 8 AZE 2005 Azerbaijan 0 0.381 # 9 BGD 2022 Bangladesh 8 0.353 # 10 BRB 2016 Barbados 1.7 0.925 # # ℹ 151 more rows ``` --- # (3) Socialism .font70[Construct our own index ...] .panelset[ .panel[.panel-name[Socialist countries] .left-column[ - Wikipedia has tables on self-declared socialist countries. - I suggest a simple index: + Years socialist minus years since not socialist anymore. + Min. 5 years given any socialist history. ] .right-column[ <iframe src='https://en.wikipedia.org/wiki/List_of_socialist_states' width='1200' height='480' frameborder='0' scrolling='yes'></iframe> ]] .panel[.panel-name[Coding the index] .font70[ ``` r Dat <- Dat %>% mutate( socialist = case_when( # Years socialist minus years since not socialist anymore, country == "China" ~ year - 1949, country == "Viet Nam" ~ year - 1945, country == "Algeria" ~ year - 1962, str_detect(country,"Portugal|Bangladesh") ~ year - 1972, country == "Guinea-Bissau" ~ year - 1973, country == "India" ~ year - 1976, country == "Nicaragua" ~ year - 1979, country == "Sri Lanka" ~ year - 1978, country == "Tanzania" ~ year - 1964, country == "Albania" ~ (1992 - 1944) - (year - 1992), str_detect(country, "Angola|Cabo Verde|Madagascar") ~ (1992 - 1975) - (year - 1992), str_detect(country,"Belarus|Bulgaria") ~ (1990 - 1946) - (year - 1990), str_detect(country, "Benin|Mozambique") ~ (1990 - 1975) - (year - 1990), country == "Chad" ~ (1975 - 1962) - (year - 1975), country == "Congo, Rep." 
~ (1992 - 1970) - (year - 1992), country == "Czech Republic" ~ (1990 - 1948) - (year - 1990), country == "Djibouti" ~ (1992 - 1981) - (year - 1992), country == "Ethiopia" ~ (1991 - 1974) - (year - 1991), country == "Ghana" ~ (1966 - 1960) - (year - 1966), country == "Guinea" ~ (1984 - 1958) - (year - 1984), country == "Hungary" ~ (1989 - 1949) - (year - 1989), country == "Iraq" ~ (2005 - 1958) - (year - 2005), country == "Mali" ~ (1991 - 1960) - (year - 1991), country == "Mauritania" ~ (1978 - 1961) - (year - 1978), country == "Mongolia" ~ (1992 - 1924) - (year - 1992), country == "Myanmar" ~ (1988 - 1962) - (year - 1988), country == "Poland" ~ (1989 - 1945) - (year - 1989), country == "Romania" ~ (1989 - 1947) - (year - 1989), country == "Russian Federation" ~ (1991 - 1922) - (year - 1991), country == "Seychelles" ~ (1991 - 1977) - (year - 1991), country == "Senegal" ~ (1981 - 1960) - (year - 1981), country == "Sierra Leone" ~ (1991 - 1978) - (year - 1991), country == "Somalia" ~ (1991 - 1969) - (year - 1991), country == "Sudan" ~ (1985 - 1969) - (year - 1985), country == "Syria" ~ (2012 - 1963) - (year - 2012), country == "Tunisia" ~ (1988 - 1964) - (year - 1988), country == "Ukraine" ~ (1991 - 1919) - (year - 1991), country == "Yemen, Rep." ~ (1991 - 1967) - (year - 1991), country == "Zambia" ~ (1991 - 1973) - (year - 1991), str_detect(country,"Slovenia|Croatia|Serbia|Montenegro|Bosnia and Herzegovina|North Macedonia|Kosovo") ~ (1992 - 1943) - (year - 1992), TRUE ~ 0), socialist = case_when( # Min. 5 years given any socialist history, socialist != 0 & socialist < 5 ~ 5, TRUE ~ socialist)) %>% drop_na() # Drop countries with missing values. 
``` ]] .panel[.panel-name[Resulting data] ``` r Dat # # A tibble: 161 × 6 # country_text_id year country poverty equal_liberty socialist # <chr> <dbl> <chr> <dbl> <dbl> <dbl> # 1 ALB 2020 Albania 0.3 0.912 20 # 2 DZA 2011 Algeria 0 0.564 49 # 3 AGO 2018 Angola 39.3 0.564 5 # 4 ARG 2023 Argentina 1.2 0.869 0 # 5 ARM 2023 Armenia 1.9 0.845 0 # 6 AUS 2018 Australia 0.5 0.967 0 # 7 AUT 2022 Austria 0.6 0.934 0 # 8 AZE 2005 Azerbaijan 0 0.381 0 # 9 BGD 2022 Bangladesh 8 0.353 50 # 10 BRB 2016 Barbados 1.7 0.925 0 # # ℹ 151 more rows ``` ] .panel[.panel-name[Socialist history across the world] <img src="2-Corr-n-Reg_files/figure-html/socialism-world-1.png" width="100%" style="display: block; margin: auto;" /> ]] --- class: inverse middle center # Break <iframe src='https://panel.letstimeit.com/instant-timer/15-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe> --- class: inverse middle center # Scatter plots --- # Visual inspection .left-column[ .content-box-blue[ .center[**4 questions for scatter plots**] 1. What is the *direction* of the relationship? 2. What *form* does the relation have? 3. How much *spread* is in the data? 4. Are there any *outliers*? ]] .right-column[ <img src="2-Corr-n-Reg_files/figure-html/socialism-corr1-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse middle center # Correlation --- class: clear # Z-standardization .font60[Give two variables a comparable unit] .panelset[ .panel[.panel-name[What is it?] .push-left[ `$$z(x) = \frac{x - \bar{x}}{\text{SD}(x)}$$` **We subtract the mean:** Values above 0 are above average, values below 0 are below average. **We divide by the standard deviation:** Our variable now has standard deviations as its unit.<br><br> `\(\rightarrow\)` Intuitive understanding: How common vis-à-vis extreme is a case? 
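To see that nothing mysterious happens here, one can z-standardize by hand and compare the result with `scale()`. A small sketch with made-up values (the vector `x` is hypothetical, not the lecture data):

```r
# Hypothetical example values, invented for illustration.
x <- c(0, 5, 20, 49, 50)

# Subtract the mean, then divide by the standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() performs the same two steps but returns a one-column matrix,
# so we convert the result to a plain numeric vector.
z_scale <- as.numeric(scale(x))

# Both versions should agree up to floating-point precision.
same_result <- isTRUE(all.equal(z_manual, z_scale))
same_result
```

After standardization the variable should have a mean of (numerically) zero and a standard deviation of one, whatever its original unit was.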
] .push-right[ <img src="https://www.native-instruments.com/fileadmin/userlib/images/7727639_4467.normal-light.png" width="100%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[R Code] ``` r (Dat <- Dat %>% mutate( # Z-Standardize variables. * z_socialist = scale(socialist) %>% as.numeric(), * z_poverty = scale(poverty) %>% as.numeric())) # # A tibble: 161 × 8 # country_text_id year country poverty equal_liberty socialist z_socialist z_poverty # <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> # 1 ALB 2020 Albania 0.3 0.912 20 0.961 -0.672 # 2 DZA 2011 Algeria 0 0.564 49 2.98 -0.686 # 3 AGO 2018 Angola 39.3 0.564 5 -0.0835 1.14 # 4 ARG 2023 Argentina 1.2 0.869 0 -0.432 -0.630 # 5 ARM 2023 Armenia 1.9 0.845 0 -0.432 -0.598 # 6 AUS 2018 Australia 0.5 0.967 0 -0.432 -0.663 # 7 AUT 2022 Austria 0.6 0.934 0 -0.432 -0.658 # 8 AZE 2005 Azerbaijan 0 0.381 0 -0.432 -0.686 # 9 BGD 2022 Bangladesh 8 0.353 50 3.05 -0.315 # 10 BRB 2016 Barbados 1.7 0.925 0 -0.432 -0.607 # # ℹ 151 more rows ``` ] .panel[.panel-name[Illustration] <img src="img/Correlation.png" width="40%" style="display: block; margin: auto;" /> .backgrnote[.center[ *Source*: Veaux, Velleman, and Bock (2021, p.199) ]]] .panel[.panel-name[Figure] <img src="2-Corr-n-Reg_files/figure-html/socialism-corr2-1.png" width="70%" style="display: block; margin: auto;" /> ]] --- class: inverse # .push-left[ <img src="https://thumbs.dreamstime.com/b/charakter-d-der-eine-lupe-h%C3%A4lt-und-ein-questio-kontrolliert-99243756.jpg" width="70%" style="display: block; margin: auto;" /> ] .push-right[ <br> <br> <br> <br> OK but eye-balling is hardly enough to count as scientific evidence, is it? ] --- # The correlation coefficient: `\(r_{y,x}\)` .panelset[ .panel[.panel-name[What is it?] 
.push-left[ <img src="img/Correlation.png" width="80%" style="display: block; margin: auto;" /> .backgrnote[.center[ *Source*: Veaux, Velleman, and Bock (2021, p.199) ]]] .push-right[ .content-box-blue[ .center[**A precise statistic** <br> in three steps] `$$r_{y,x} = \frac{\sum^{n}_{i=1}z_{y,i} \cdot z_{x,i}}{n-1}$$` 1. `\(\color{orange}{z_{y,i} \cdot z_{x,i}}\)`: positive for green points, zero for blue ones, and negative for red ones. Larger products contribute more to the association. 2. `\(\color{orange}{\sum^{n}_{i=1}z_{y,i} \cdot z_{x,i}}\)`: The general trend. 3. `\(\color{orange}{\frac{\sum^{n}_{i=1}z_{y,i} \cdot z_{x,i}}{n-1}}\)`: We divide by `\(n - 1\)`; the resulting `\(r\)` varies between -1 and 1. ]]] .panel[.panel-name[Poverty & Socialism] .left-column[ .content-box-green[.center[ How do we<br>interpret this result? ]]] .right-column[ ``` r Dat %>% # Use our data, select(poverty, socialist) %>% # Select vars for analysis, * cor() # Estimate correlation. # poverty socialist # poverty 1.0000 -0.0968 # socialist -0.0968 1.0000 ``` ]]] --- class: middle clear .left-column[ <img src="https://www.laserfiche.com/wp-content/uploads/2014/10/femalecoder.jpg" width="80%" style="display: block; margin: auto;" /> <iframe src='https://www.online-timer.net/' width='400' height='385' frameborder='0' scrolling='yes'></iframe> ] .right-column[ <br> <iframe src='exercise1.html' width='1000' height='600' frameborder='0' scrolling='yes'></iframe> ] --- class: inverse middle center # Break <iframe src='https://panel.letstimeit.com/instant-timer/10-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe> --- class: inverse middle center # OLS regression --- # Correlation = linear trend .right-column[ <img src="2-Corr-n-Reg_files/figure-html/socialism-ols-1.png" width="100%" style="display: block; margin: auto;" /> ] .left-column[ **How can we directly calculate that trend line?** <br><br> Then we could state how much of a reduction in poverty we would expect for each additional year of socialism. 
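A small sketch of why correlation and linear trend are two sides of the same coin, using invented toy values (`x` and `y` are hypothetical, not the lecture data). It checks that the formula for `\(r_{y,x}\)` reproduces `cor()`, and that the slope of a line fitted to z-standardized variables equals `\(r_{y,x}\)`. I use base R's `lm()` here only to keep the sketch free of extra packages.

```r
# Hypothetical toy data, invented for illustration.
x <- c(0, 0, 5, 20, 49, 50)
y <- c(12, 3, 39, 0.3, 0, 8)
n <- length(x)

# Z-standardize both variables by hand.
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Correlation: sum of the z-score products, divided by n - 1.
r_manual <- sum(z_y * z_x) / (n - 1)
r_matches_cor <- isTRUE(all.equal(r_manual, cor(y, x)))

# Slope of the trend line through the z-standardized variables.
slope_z <- unname(coef(lm(z_y ~ z_x))["z_x"])
slope_equals_r <- isTRUE(all.equal(slope_z, r_manual))

c(r_matches_cor = r_matches_cor, slope_equals_r = slope_equals_r)
```

Both checks should come out `TRUE`: in standard-deviation units, the trend line's slope is the correlation coefficient.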
] --- # Models .left-column[ **Model**: A reduced representation of reality. <br><br>A model should capture the answer to our research question. .backgrnote[ Models should not be driven by a few singular cases, like in this example.] ] .right-column[ <img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Linear models .left-column[ .content-box-blue[ .center[**Linear models**<br>defined by two parameters] `\(\color{orange}{\alpha}\)` .alert[constant/intercept]: The value of y at which the line intercepts the Y-axis `\((\hat{Y}|X=0)\)`. `\(\color{orange}{\beta}\)` .alert[slope]: How much `\(\hat{Y}\)` changes if `\(X\)` increases by one unit. ]] .right-column[ <img src="img/LinearModel.png" width="100%" style="display: block; margin: auto;" /> ] --- # Regressing linear models from data .panelset[ .panel[.panel-name[Residuals, e] .left-column[ **Residuals**: `\(e_{i} =y_{i} - \hat{y}_{i}\)`<br> the differences between the actual data and what the model predicts. `\(e_{\text{Denmark}} = 0.2\% - 15.7\%=-15.5\%\)` ] .right-column[ <img src="2-Corr-n-Reg_files/figure-html/residuals-1.png" width="100%" style="display: block; margin: auto;" /> ]] .panel[.panel-name[Minimize 1] .left-column[ - **_Best_ fitting line**: `$$\begin{align*} \min \text{RSS} &= \min \sum_{i=1}^{n} e_{i}^{2} \\ &= \min \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \\ &= \min \sum_{i=1}^{n} (y_{i} - (\color{orange}{\alpha} + \color{orange}{\beta} x_{i}))^{2} \end{align*}$$` ] .right-column[ <img src="2-Corr-n-Reg_files/figure-html/min_resid-1.png" width="100%" style="display: block; margin: auto;" /> ]] .panel[.panel-name[... 
2] .left-column[ - **_Best_ fitting line**: `$$\begin{align*} \min \text{RSS} &= \min \sum_{i=1}^{n} e_{i}^{2} \\ &= \min \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \\ &= \min \sum_{i=1}^{n} (y_{i} - (\color{orange}{\alpha} + \color{orange}{\beta} x_{i}))^{2} \end{align*}$$` ] .right-column[ <img src="https://i.redd.it/gyw14y0tvak21.gif" width="50%" style="display: block; margin: auto;" /> .backgrnote[.center[ *Source*: [aftersox on Reddit](https://www.reddit.com/r/dataisbeautiful/comments/axl1jm/oc_ordinary_least_squares_ols_finding_the_line/) ]]]] .panel[.panel-name[R2 model fit] .left-column[ How much smaller are the residuals from our model (blue line), compared to simply using the average `\(\bar{y}\)` (orange line)? `$$\text{TSS}=\sum_{i=1}^{n}(y_i-\bar{y})^2$$` `$$\text{RSS}=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$` `$$R^2=\frac{\text{TSS} - \text{RSS}}{\text{TSS}}$$` ] .right-column[ <img src="2-Corr-n-Reg_files/figure-html/R2-1.png" width="100%" style="display: block; margin: auto;" /> ]] .panel[.panel-name[Regression using R] .right-column[ ``` r ols <- lm_robust(data = Dat, formula = poverty ~ socialist) zols <- lm_robust(data = Dat, formula = z_poverty ~ z_socialist) modelsummary(list("OLS" = ols, "Std. OLS" = zols), # Nicely-formatted table, statistic = NULL, # Don't report stat. inference (yet), gof_map = c("nobs", "r.squared")) # Only 2 model-fit stats. ``` <img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-35-1.png" width="70%" style="display: block; margin: auto;" /> ] .left-column[ <br> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> OLS </th> <th style="text-align:center;"> Std. 
OLS </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 15.682 </td> <td style="text-align:center;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> socialist </td> <td style="text-align:center;"> −0.145 </td> <td style="text-align:center;"> </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> z_socialist </td> <td style="text-align:center;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> −0.097 </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 161 </td> <td style="text-align:center;"> 161 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.009 </td> <td style="text-align:center;"> 0.009 </td> </tr> </tbody> </table> ]] .panel[.panel-name[Interpretation] .push-left[ <img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" /> ] .push-right[ $$ \operatorname{\widehat{poverty}} = 15.68 - 0.15(\operatorname{socialist}) $$ `\(\rightarrow\)` the best-fitting line, that is, the one that minimizes `\(\sum_{i=1}^{n} e_{i}^{2}\)`. Among countries without a socialist past, poverty is on average 15.68%. + `\((\hat{y}|\text{Socialism = 0}) = 15.68\%\)`. With every year of socialism, the average level of poverty is 0.15 percentage points lower. This model accounts for 0.009 × 100% = 0.9% more of the variance of poverty across the world than the average `\(\bar{y} = 14.78\%\)`. ]]] --- # Two types of interpretation .left-column[ .center[**1. Causal**] With every additional year of socialism, poverty is expected to decline by 0.15 percentage points. Thus, if China and Vietnam stay socialist, poverty will further decline. .alert[Beware, this interpretation only holds under some conditions.] I will teach you how to estimate regressions that have a causal interpretation later this semester! ] -- .right-column[ .center[ **2. 
Descriptive: conditional means `\(\bar{y}|x\)`** ] With every year of socialism, the average level of poverty is 0.15 percentage points lower. Here regression is a (linear) model that describes the average of the outcome for different values of the predictor. <img src="https://isem-cueb-ztian.github.io/Intro-Econometrics-2017/handouts/lecture_notes/lecture_6/figure/fig-4-4.png" width="75%" style="display: block; margin: auto;" /> .backgrnote[.center[ *Source*: [Zheng Tian](https://isem-cueb-ztian.github.io/Intro-Econometrics-2017/handouts/lecture_notes/lecture_6/lecture_6.html#org39dfbe6) ]] ] --- class: inverse middle center # Break <iframe src='https://panel.letstimeit.com/instant-timer/15-minute' width='600' height='400' frameborder='0' scrolling='yes'></iframe> --- class: middle clear .left-column[ <img src="https://cdn.dribbble.com/users/10549/screenshots/9916149/media/a9dbfea8e23e5b8e23db142528c3bc9f.png?compress=1&resize=1200x900&vertical=top" width="100%" style="display: block; margin: auto;" /> <img src="2-Corr-n-Reg_files/figure-html/pov-citiz-corr2-1.png" width="90%" style="display: block; margin: auto;" /> <img src="2-Corr-n-Reg_files/figure-html/citiz-sicial-corr-1.png" width="90%" style="display: block; margin: auto;" /> ] .right-column[ <br> <iframe src='exercise2.html' width='1000' height='600' frameborder='0' scrolling='yes'></iframe> ] --- class: inverse # Today's general lessons 1. R provides convenient access to a wide range of interesting data through APIs, allowing for easy downloading. 2. When datasets share common variables that uniquely identify cases, you can join them together, enabling fascinating analyses and excellent term papers! 3. It's always beneficial to create a scatter plot to visualize the relationship between the variables you wish to correlate. 4. Z-standardization aids interpretation and provides a common unit for different variables. 5. 
The correlation coefficient is a simple statistic that measures the direction and strength of the linear association between two variables. 6. Bivariate OLS regression, being a linear model, expresses an outcome variable as a linear function of a predictor. 7. The slope, denoted by β, indicates how average levels of the predicted variable (ŷ) change with a unit increase in the predictor (x). 8. OLS determines the linear model that best fits the data. 9. It is generally recommended not to interpret regression in causal terms, except under specific conditions. --- class: inverse # Today's (important) functions 1. `cor()`: Estimate correlation coefficient. 2. `estimatr::lm_robust()`: Estimate linear OLS regression (with robust standard errors, which matters when using weights). 3. `plot(model_object)`: Check regression assumptions. 4. `inner_join()`, `left_join()`, `right_join()`, and `full_join()` allow you to join/merge different tibbles that have common observations and a key that identifies them. 5. `modelsummary()`: Create nicely-formatted (html, Word, ASCII, or Latex) tables of (one or several) regression models. 6. `scale()` z-standardizes variables. But it returns a matrix rather than a vector. Therefore it makes sense to always code `scale(x) %>% as.numeric()` to ensure you get a numeric vector out of it. --- # References .font80[ De Veaux, R. D., P. F. Velleman, and D. E. Bock (2021). _Stats: Data and Models, Global Edition_. Pearson Higher Ed. ]