Multiple Regression & Fundamentals of Causal Inference

class: center, middle, inverse, title-slide

.title[
# Multiple Regression & <br><br> Fundamentals of Causal Inference
]
.subtitle[
## <br>2. Recap: Correlation & OLS Regression
]
.author[
### Merlin Schaeffer<br> Department of Sociology
]
.date[
### 2024-09-11
]

---

# Goal of empirical sociology

.font130[.center[Use data to .alert[discover patterns], <br> and the social mechanisms that bring them about.]]

---
class: inverse
# Today's schedule

1. **Today's research question**: Socialism, citizenship rights, and poverty.
  + Application Programming Interfaces (API)
  + World Bank API
  + Democracy data API
  + Join different data sources
  
2. **Recap**
  
  2.1 *Scatter plots*
  
  2.2 *Correlation*
    + Z-standardization
    + `$r_{y,x}$`
  
  2.3 *Bivariate OLS regression*
    + OLS estimation
    + Causal versus descriptive interpretation

---
class: clear
# Remember? .font70[Civic and political citizenship rights across the world]

.right-column[
[Freedom House World Map 2021](https://freedomhouse.org/explore-the-map?type=fiw&year=2020)

<img src="./img/FreedomHouse.png" width="100%" style="display: block; margin: auto;" />
]

.left-column[
One may criticize:<br> *Aren't socialist countries better at providing* **social** *citizenship rights, like affordable housing, healthcare, work, and minimum quality of life?*
]

---
class: inverse
# Today's research question

.center[.font140[
**Is there a freedom/equality trade-off?**
]
.font110[
In other words:<br>
**Are socialist countries good at reducing poverty**,<br> potentially at the cost of offering less freedom?
]]

<br>
.push-left[

<img src="https://miro.medium.com/max/1280/1*8Y_EPw2a67TRRos3b24YlA.jpeg" width="90%" style="display: block; margin: auto;" />
]

.push-right[
<img src="https://chineseposters.net/sites/default/files/2020-06/pc-1968-l-005.jpg" width="85%" style="display: block; margin: auto;" />
]

---
# Preparations

.panelset[
.panel[.panel-name[Packages for today's session]

``` r
pacman::p_load(
  tidyverse, # Data manipulation,
  ggplot2, # beautiful figures,
* wbstats, # Download data from the World Bank, a tremendous source of global socio-economic data,
* democracyData, # Use democracy data APIs,
  estimatr, # OLS with robust SE,
  modelsummary, # regression tables with nice layout,
  countrycode) # Easy recodings of country names.
```
]]

---
class: clear
# Application Programming Interfaces (API)

---
class: clear
# (1) Freedom House Data .font70[Civic and political citizenship rights]

.panelset[
.panel[.panel-name[The data]

.left-column[
Since 1972, Freedom House has coded civil and political citizenship rights around the world on a scale from 0 (no citizenship rights) to 12 (strong citizenship rights).
]

.right-column[
<iframe src='https://en.wikipedia.org/wiki/Freedom_House' width='1200' height='480' frameborder='0' scrolling='yes'></iframe>
]]
.panel[.panel-name[Use API to get the data]

``` r
*(Dat_citi_rights <- download_fh(verbose = FALSE) %>% # Use API to download FH data for all countries since 1972,
   rename(country = fh_country, # rename country ID,
          citizen_rights = fh_total_reversed, # rename Citizenship rights indicator,
          date = year) %>% # rename year variable,
   mutate(country = case_when( 
     country == "Vietnam" ~ "Viet Nam", # Rename Vietnam
     TRUE ~ country)) %>% # Leave all others as they are
   select(country, date, citizen_rights)) # Keep only these 3 variables.
# # A tibble: 9,045 × 3
#    country      date citizen_rights
#    <chr>       <dbl>          <dbl>
#  1 Afghanistan  1972              5
#  2 Afghanistan  1973              1
#  3 Afghanistan  1974              1
#  4 Afghanistan  1975              1
#  5 Afghanistan  1976              1
#  6 Afghanistan  1977              2
#  7 Afghanistan  1978              0
#  8 Afghanistan  1979              0
#  9 Afghanistan  1980              0
# 10 Afghanistan  1982              0
# # ℹ 9,035 more rows
```
]

.panel[.panel-name[Citizenship rights across the world '22]
<img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Plot code]

``` r
ggplot(data = Dat_citi_rights %>% filter(date == 2022), # Make coordinate system for data from 2022,
       aes(y = citizen_rights, # Y- and X-axis of plot,
           x = reorder(country, citizen_rights))) +
  geom_bar(stat = "identity") + # plot data as is in a bar chart,
  labs(y = "Citizenship rights", x = "", caption = "Source: Freedom House data for 2022") + # Axis labels and caption,
  theme_minimal() + # Simple background layout,
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) # Write country names in a 60 degree angle.
```
]]

---
class: clear
# (2) World Bank Data .font70[Poverty across the world]

.panelset[
.panel[.panel-name[Searching WB archive]
- With `wbstats::wb_search()`, you can search the WB archive for any keyword. Here I use "Poverty":

``` r
*(wb_poverty_archive <- wb_search("Poverty")) # Search WB data bank for "Poverty"
# # A tibble: 702 × 3
#    indicator_id         indicator                               indicator_desc                                     
#    <chr>                <chr>                                   <chr>                                              
#  1 1.0.HCount.1.90usd   Poverty Headcount ($1.90 a day)         The poverty headcount index measures the proportio…
#  2 1.0.HCount.2.5usd    Poverty Headcount ($2.50 a day)         The poverty headcount index measures the proportio…
#  3 1.0.HCount.Mid10to50 Middle Class ($10-50 a day) Headcount   The poverty headcount index measures the proportio…
#  4 1.0.HCount.Ofcl      Official Moderate Poverty Rate-National The poverty headcount index measures the proportio…
#  5 1.0.HCount.Poor4uds  Poverty Headcount ($4 a day)            The poverty headcount index measures the proportio…
#  6 1.0.HCount.Vul4to10  Vulnerable ($4-10 a day) Headcount      The poverty headcount index measures the proportio…
#  7 1.0.PGap.1.90usd     Poverty Gap ($1.90 a day)               The poverty gap captures the mean aggregate income…
#  8 1.0.PGap.2.5usd      Poverty Gap ($2.50 a day)               The poverty gap captures the mean aggregate income…
#  9 1.0.PGap.Poor4uds    Poverty Gap ($4 a day)                  The poverty gap captures the mean aggregate income…
# 10 1.0.PSev.1.90usd     Poverty Severity ($1.90 a day)          The poverty severity index combines information on…
# # ℹ 692 more rows
```
]
.panel[.panel-name[Use WB API]

``` r
*(Dat_poverty <- wb_data("SI.POV.DDAY", # Download poverty data: <$2.15 per day,
*                       start_date = 1972, end_date = 2024) %>%
   rename(poverty = SI.POV.DDAY) %>% # rename poverty variable,
   select(country, date, poverty) %>% # Keep only 3 variables
   drop_na(poverty) %>% # Drop cases with missing data,
   group_by(country) %>% # Group by country,
   filter(date == max(date)) %>% ungroup()) # Keep the most recent poverty statistic per country.
# # A tibble: 168 × 3
#    country     date poverty
#    <chr>      <dbl>   <dbl>
#  1 Albania     2020     0  
#  2 Algeria     2011     0.5
#  3 Angola      2018    31.1
#  4 Argentina   2022     0.6
#  5 Armenia     2022     0.8
#  6 Australia   2018     0.5
#  7 Austria     2021     0.5
#  8 Azerbaijan  2005     0  
#  9 Bangladesh  2022     5  
# 10 Belarus     2020     0  
# # ℹ 158 more rows
```
]
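.panel[.panel-name[Most recent year: toy example]

The `group_by()` plus `filter(date == max(date))` step is easier to follow on a small example. The sketch below uses a made-up tibble, `toy`, whose countries and numbers are illustrative assumptions and not World Bank data.

``` r
library(dplyr)

# Made-up data: two countries observed in different years.
toy <- tibble(
  country = c("A", "A", "A", "B", "B"),
  date    = c(2010, 2015, 2020, 2012, 2018),
  poverty = c(12.0, 9.5, 7.1, 3.2, 2.8))

latest <- toy %>%
  group_by(country) %>%         # Within each country,
  filter(date == max(date)) %>% # keep the row with the latest date,
  ungroup()                     # then drop the grouping again.

latest
```

Because missing poverty values are dropped before grouping, `max(date)` refers to the latest year with an actual poverty estimate, which is why the years differ across countries.
]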
.panel[.panel-name[Purchasing power parity (PPP)]
<img src="./img/PPP2.png" width="100%" style="display: block; margin: auto;" />
.push-left[
<img src="./img/PPP.png" width="75%" style="display: block; margin: auto;" />
]
.push-right[
<br>
.content-box-red[
$1 buys in the US what Kr. 6.5 buys in Denmark.

`$\rightarrow$` US$2.15 = Kr. 14 per day.

`$\rightarrow$` Less than `$30\text{Days}\times\text{Kr. }14 \approx \text{Kr. }420$` to get by per month.
]]]

.panel[.panel-name[Poverty across the world]
<img src="2-Corr-n-Reg_files/figure-html/poverty-world-1.png" width="100%" style="display: block; margin: auto;" />
]]

---
class: inverse
#

.push-left[
<br>
<br>
<br>
<br>
OK great, now I have two tibbles.

But how can I combine them?
]

.push-right[
<img src="https://powietrze.malopolska.pl/wp-content/uploads/2020/10/q.jpg" width="96%" style="display: block; margin: auto;" />
]

---
# Relational data

If you work with multiple tibbles, you work with relational data .alert[if they have one or more variable(s) in common].

Our tibbles are related, because both contain *countries* at various *dates*. <br>The combination of country+date is the .alert[key] that allows us to relate both tibbles.

.push-left[

``` r
Dat_citi_rights
# # A tibble: 9,045 × 3
#    country      date citizen_rights
#    <chr>       <dbl>          <dbl>
#  1 Afghanistan  1972              5
#  2 Afghanistan  1973              1
#  3 Afghanistan  1974              1
#  4 Afghanistan  1975              1
#  5 Afghanistan  1976              1
#  6 Afghanistan  1977              2
#  7 Afghanistan  1978              0
#  8 Afghanistan  1979              0
#  9 Afghanistan  1980              0
# 10 Afghanistan  1982              0
# # ℹ 9,035 more rows
```
]

.push-right[

``` r
Dat_poverty
# # A tibble: 168 × 3
#    country     date poverty
#    <chr>      <dbl>   <dbl>
#  1 Albania     2020     0  
#  2 Algeria     2011     0.5
#  3 Angola      2018    31.1
#  4 Argentina   2022     0.6
#  5 Armenia     2022     0.8
#  6 Australia   2018     0.5
#  7 Austria     2021     0.5
#  8 Azerbaijan  2005     0  
#  9 Bangladesh  2022     5  
# 10 Belarus     2020     0  
# # ℹ 158 more rows
```
]

---
# **Join** .font60[Four types]

.push-left[
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/inner-join.gif" width="90%" style="display: block; margin: auto;" />
.center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]]
]

.push-right[
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/left-join.gif" width="90%" style="display: block; margin: auto;" />
.center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]]
]

---
# **Join** .font60[Four types]

.push-left[
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/right-join.gif" width="90%" style="display: block; margin: auto;" />
.center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]]
]

.push-right[
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/full-join.gif" width="90%" style="display: block; margin: auto;" />
.center[.backgrnote[*Source*: [Tidy Animated Verbs](https://github.com/gadenbuie/tidyexplain)]]
]

---
# Inner join .font70[Poverty and citizenship rights]

``` r
*(Dat <- inner_join(Dat_poverty, Dat_citi_rights, by = c("country", "date")))
# # A tibble: 149 × 4
#    country     date poverty citizen_rights
#    <chr>      <dbl>   <dbl>          <dbl>
#  1 Albania     2020     0                8
#  2 Algeria     2011     0.5              3
#  3 Angola      2018    31.1              3
#  4 Argentina   2022     0.6             10
#  5 Armenia     2022     0.8              6
#  6 Australia   2018     0.5             12
#  7 Austria     2021     0.5             12
#  8 Azerbaijan  2005     0                3
#  9 Bangladesh  2022     5                4
# 10 Belarus     2020     0                1
# # ℹ 139 more rows
```
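
---
# Joins on toy data .font70[What gets dropped?]

The inner join returns fewer rows than `Dat_poverty` has, because only country+date combinations found in both tibbles survive. A minimal sketch on made-up tibbles (`pov` and `rights` are illustrative assumptions) shows how the join types differ and how `anti_join()` lists the rows that find no match.

``` r
library(dplyr)

# Made-up data: only A-2020 appears in both tibbles.
pov <- tibble(country = c("A", "B", "C"),
              date = c(2020, 2019, 2021),
              poverty = c(1.5, 20.3, 4.0))
rights <- tibble(country = c("A", "B", "D"),
                 date = c(2020, 2018, 2021),
                 citizen_rights = c(10, 3, 7))

key <- c("country", "date")
inner <- inner_join(pov, rights, by = key) # Only rows matched in both tibbles,
left  <- left_join(pov, rights, by = key)  # all rows of pov, NA where no match,
full  <- full_join(pov, rights, by = key)  # all rows of both tibbles,
lost  <- anti_join(pov, rights, by = key)  # rows of pov without a match.

c(inner = nrow(inner), left = nrow(left), full = nrow(full), lost = nrow(lost))
```

Country B shows why the key matters: it is in both tibbles, but in different years, so it does not match on country+date.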

---
# (3) Socialism .font70[Construct our own index ...]

.panelset[
.panel[.panel-name[Socialist countries]
.left-column[
- Wikipedia has tables on self-declared socialist countries.

- I suggest a simple index:
  + Years under socialism minus years since socialism ended.
  + Min. 5 years given any socialist history.
]

.right-column[
<iframe src='https://en.wikipedia.org/wiki/List_of_socialist_states' width='1200' height='480' frameborder='0' scrolling='yes'></iframe>
]]

.panel[.panel-name[Coding the index]
.font70[

``` r
Dat <- Dat %>% mutate(
  socialist = case_when( # Years socialist minus years since not socialist anymore,
    country == "China" ~ date - 1949,
    country == "Viet Nam" ~ date - 1945,
    country == "Algeria" ~ date - 1962,
    str_detect(country,"Portugal|Bangladesh") ~ date - 1972,
    country == "Guinea-Bissau" ~ date - 1973, country == "India" ~ date - 1976, 
    country == "Nicaragua" ~ date - 1979, country == "Sri Lanka" ~ date - 1978,
    country == "Tanzania" ~ date - 1964, country == "Albania" ~ (1992 - 1944) - (date - 1992),
    str_detect(country, "Angola|Cabo Verde|Madagascar") ~ (1992 - 1975) - (date - 1992),
    str_detect(country,"Belarus|Bulgaria") ~ (1990 - 1946) - (date - 1990),
    str_detect(country, "Benin|Mozambique") ~ (1990 - 1975) - (date - 1990),
    country == "Chad" ~ (1975 - 1962) - (date - 1975), country == "Congo, Rep." ~ (1992 - 1970) - (date - 1992),
    country == "Czech Republic" ~ (1990 - 1948) - (date - 1990), country == "Djibouti" ~ (1992 - 1981) - (date - 1992),
    country == "Ethiopia" ~ (1991 - 1974) - (date - 1991), country == "Ghana" ~ (1966 - 1960) - (date - 1966),
    country == "Guinea" ~ (1984 - 1958) - (date - 1984), country == "Hungary" ~ (1989 - 1949) - (date - 1989),
    country == "Iraq" ~ (2005 - 1958) - (date - 2005), country == "Mali" ~ (1991 - 1960) - (date - 1991),
    country == "Mauritania" ~ (1978 - 1961) - (date - 1978), country == "Mongolia" ~ (1992 - 1924) - (date - 1992),
    country == "Myanmar" ~ (1988 - 1962) - (date - 1988), country == "Poland" ~ (1989 - 1945) - (date - 1989),
    country == "Romania" ~ (1989 - 1947) - (date - 1989), country == "Russian Federation" ~ (1991 - 1922) - (date - 1991),
    country == "Seychelles" ~ (1991 - 1977) - (date - 1991), country == "Senegal" ~ (1981 - 1960) - (date - 1981),
    country == "Sierra Leone" ~ (1991 - 1978) - (date - 1991), country == "Somalia" ~ (1991 - 1969) - (date - 1991),
    country == "Sudan" ~ (1985 - 1969) - (date - 1985), country == "Syria" ~ (2012 - 1963) - (date - 2012),
    country == "Tunisia" ~ (1988 - 1964) - (date - 1988), country == "Ukraine" ~ (1991 - 1919) - (date - 1991),
    country == "Yemen, Rep." ~ (1991 - 1967) - (date - 1991), country == "Zambia" ~ (1991 - 1973) - (date - 1991),
    str_detect(country,"Slovenia|Croatia|Serbia|Montenegro|Bosnia and Herzegovina|North Macedonia|Kosovo") ~ (1992 - 1943) - (date - 1992),
    TRUE ~ 0),
  socialist = case_when( # Min. 5 years given any socialist history,
    socialist != 0 & socialist < 5 ~ 5,
    TRUE ~ socialist)) %>% drop_na() # Drop countries with missing values.
```
]]
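.panel[.panel-name[Worked example]

To see what the index does for single countries, here is a minimal sketch. The helper `socialist_index()` is hypothetical (it is not part of the code in the previous panel); it only mirrors the two `case_when()` steps for one country at a time, using start and end years taken from that code.

``` r
# Hypothetical helper that mirrors the two case_when() steps.
socialist_index <- function(date, start, end = NA) {
  raw <- ifelse(is.na(end),
                date - start,                 # Still socialist: years since start,
                (end - start) - (date - end)) # else: years socialist minus years since.
  ifelse(raw != 0 & raw < 5, 5, raw)          # Min. 5 years given any socialist history.
}

socialist_index(date = 2020, start = 1944, end = 1992) # Albania: 48 - 28 = 20
socialist_index(date = 2018, start = 1975, end = 1992) # Angola: 17 - 26 = -9, set to 5
socialist_index(date = 2011, start = 1962)             # Algeria: 2011 - 1962 = 49
```
]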
.panel[.panel-name[Resulting data]

``` r
Dat
# # A tibble: 149 × 5
#    country     date poverty citizen_rights socialist
#    <chr>      <dbl>   <dbl>          <dbl>     <dbl>
#  1 Albania     2020     0                8        20
#  2 Algeria     2011     0.5              3        49
#  3 Angola      2018    31.1              3         5
#  4 Argentina   2022     0.6             10         0
#  5 Armenia     2022     0.8              6         0
#  6 Australia   2018     0.5             12         0
#  7 Austria     2021     0.5             12         0
#  8 Azerbaijan  2005     0                3         0
#  9 Bangladesh  2022     5                4        50
# 10 Belarus     2020     0                1        14
# # ℹ 139 more rows
```
]
.panel[.panel-name[Socialist history across the world]
<img src="2-Corr-n-Reg_files/figure-html/socialism-world-1.png" width="100%" style="display: block; margin: auto;" />
]]

---
class: inverse middle center
# Break

---
class: inverse middle center
# Scatter plots

---
# Visual inspection

.left-column[
.content-box-blue[
.center[**4 questions for scatter plots**]
1. What is the *direction* of the 
relationship?

2. What *form* does the relation 
have?

3. How much *spread* is in the 
data?

4. Are there any *outliers*?
]]
.right-column[
<img src="2-Corr-n-Reg_files/figure-html/socialism-corr1-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: inverse middle center
# Correlation

---
class: clear
# Z-standardization .font60[Give two variables a comparable unit]

.panelset[
.panel[.panel-name[What is it?]

.push-left[
`$$z(x) = \frac{x - \bar{x}}{\text{SD}(x)}$$`
**We subtract the mean:** Values above 0 are above average, values below 0 are below average.

**We divide by the standard deviation:** Our variable now has standard deviations as its unit.<br><br> `$\rightarrow$` Intuitive understanding: How common vis-à-vis extreme is a case?
]

.push-right[
<img src="https://www.native-instruments.com/fileadmin/userlib/images/7727639_4467.normal-light.png" width="100%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[R Code]

``` r
(Dat <- Dat %>%
   mutate( # Z-Standardize variables.
*    z_socialist = scale(socialist) %>% as.numeric(),
*    z_poverty = scale(poverty) %>% as.numeric()))
# # A tibble: 149 × 7
#    country     date poverty citizen_rights socialist z_socialist z_poverty
#    <chr>      <dbl>   <dbl>          <dbl>     <dbl>       <dbl>     <dbl>
#  1 Albania     2020     0                8        20       0.916    -0.620
#  2 Algeria     2011     0.5              3        49       2.88     -0.592
#  3 Angola      2018    31.1              3         5      -0.102     1.11 
#  4 Argentina   2022     0.6             10         0      -0.441    -0.587
#  5 Armenia     2022     0.8              6         0      -0.441    -0.575
#  6 Australia   2018     0.5             12         0      -0.441    -0.592
#  7 Austria     2021     0.5             12         0      -0.441    -0.592
#  8 Azerbaijan  2005     0                3         0      -0.441    -0.620
#  9 Bangladesh  2022     5                4        50       2.95     -0.342
# 10 Belarus     2020     0                1        14       0.509    -0.620
# # ℹ 139 more rows
```
]
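.panel[.panel-name[By hand]

The formula can also be applied directly. The sketch below uses only the ten poverty values printed in the output, so its z-scores differ from the full-sample ones; it is meant to show that the manual calculation and `scale()` agree.

``` r
# Poverty rates of the first ten rows shown in the output (not the full sample).
poverty <- c(0, 0.5, 31.1, 0.6, 0.8, 0.5, 0.5, 0, 5, 0)

z_manual <- (poverty - mean(poverty)) / sd(poverty) # Subtract mean, divide by SD.
z_scale  <- as.numeric(scale(poverty))              # Same thing via scale().

all.equal(z_manual, z_scale)              # Both approaches agree.
is.matrix(scale(poverty))                 # scale() returns a matrix, hence as.numeric().
round(c(mean(z_manual), sd(z_manual)), 3) # Mean 0 and SD 1 after standardization.
```
]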
.panel[.panel-name[Illustration]
<img src="img/Correlation.png" width="40%" style="display: block; margin: auto;" />
.backgrnote[.center[
*Source*: Veaux, Velleman, and Bock (2021, p.199)
]]]
.panel[.panel-name[Figure]
<img src="2-Corr-n-Reg_files/figure-html/socialism-corr2-1.png" width="70%" style="display: block; margin: auto;" />
]]

---
class: inverse
#

.push-left[
<img src="https://thumbs.dreamstime.com/b/charakter-d-der-eine-lupe-h%C3%A4lt-und-ein-questio-kontrolliert-99243756.jpg" width="70%" style="display: block; margin: auto;" />
]

.push-right[
<br>
<br>
<br>
<br>
OK but eye-balling is hardly enough to count as scientific evidence, is it?
]

---
# The correlation coefficient: `$r_{y,x}$`

.panelset[
.panel[.panel-name[What is it?]
.push-left[
<img src="img/Correlation.png" width="80%" style="display: block; margin: auto;" />
.backgrnote[.center[
*Source*: Veaux, Velleman, and Bock (2021, p.199)
]]]

.push-right[
.content-box-blue[
.center[**A precise statistic** <br> in three steps]

`$$r_{y,x} = \frac{\sum^{n}_{i=1}z_y*z_x}{n-1}$$`

1. `$\color{orange}{z_y*z_x}$`: positive for green points, zero for blue ones, and negative for red ones. Larger products contribute more to the association.

2. `$\color{orange}{\sum^{n}_{i=1}z_y*z_x}$`: The general trend.

3. `$\color{orange}{\frac{\sum^{n}_{i=1}z_y*z_x}{n-1}}$`: We divide by `$n - 1$`; the resulting `$r$` varies between -1 and 1.
]]]
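.panel[.panel-name[By hand]

The three steps translate directly into R. This sketch uses the ten rows printed earlier rather than all 149 countries, so the resulting value is not the one from the full sample; the point is only that the manual calculation reproduces `cor()`.

``` r
# Values of the first ten rows shown earlier (not the full sample of 149).
poverty   <- c(0, 0.5, 31.1, 0.6, 0.8, 0.5, 0.5, 0, 5, 0)
socialist <- c(20, 49, 5, 0, 0, 0, 0, 0, 50, 14)

z_y <- as.numeric(scale(poverty))
z_x <- as.numeric(scale(socialist))

products <- z_y * z_x                              # Step 1: products of z-scores,
r_manual <- sum(products) / (length(products) - 1) # steps 2 and 3: sum, divide by n - 1.

all.equal(r_manual, cor(poverty, socialist))       # Matches cor().
```
]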
.panel[.panel-name[Poverty & Socialism]
.left-column[
.content-box-green[.center[
How do we<br>interpret this result?
]]]
.right-column[

``` r
Dat %>% # Use our data,
  select(poverty, socialist) %>% # Select vars for analysis,
* cor() # Estimate correlation.
#           poverty socialist
# poverty     1.000    -0.089
# socialist  -0.089     1.000
```
]]]

---
class: middle clear

.left-column[
<img src="https://www.laserfiche.com/wp-content/uploads/2014/10/femalecoder.jpg" width="80%" style="display: block; margin: auto;" />
]

.right-column[
<br>
]

---
class: inverse middle center
# Break

---
class: inverse middle center
# OLS regression

---
# Correlation = linear trend

.right-column[
<img src="2-Corr-n-Reg_files/figure-html/socialism-ols-1.png" width="100%" style="display: block; margin: auto;" />
]

.left-column[
**How can we directly calculate that trend line?** <br><br> Then we could state how much lower we expect poverty to be for each additional year of socialism.
]

---
# Models

.left-column[
**Model**: A reduced representation of reality. <br><br>A model should capture the answer to our research question. .backgrnote[
Models should not be driven by a few singular cases, as in this example.]
]

.right-column[
<img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# Linear models

.left-column[
.content-box-blue[
.center[**Linear models**<br>defined by two parameters]

`$\color{orange}{\alpha}$` .alert[constant/intercept]: The value of y at which the line intercepts the Y-axis `$(\hat{Y}|X=0)$`.
  
`$\color{orange}{\beta}$` .alert[slope]: How much `$\hat{Y}$` changes if `$X$` increases by one unit.
]]

.right-column[
<img src="img/LinearModel.png" width="100%" style="display: block; margin: auto;" />
]

---
# Regressing linear models from data

.panelset[
.panel[.panel-name[Residuals, e]

.left-column[
**Residuals**: `$e_{i} = y_{i} - \hat{y}_{i}$`<br>
The differences between the actual data and what the model predicts.

`$e_{\text{Denmark}} = 0.2\% - 11.9\%=-11.7\%$`
]
.right-column[
<img src="2-Corr-n-Reg_files/figure-html/residuals-1.png" width="100%" style="display: block; margin: auto;" />
]]

.panel[.panel-name[Minimize 1]

.left-column[
- **_Best_ fitting line**:
`$$\begin{align*}
      \min \text{RSS} &= \min \sum_{i=1}^{n} e_{i}^{2} \\
      &= \min \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \\
      &= \min \sum_{i=1}^{n} (y_{i} - (\color{orange}{\alpha} + \color{orange}{\beta} x_{i}))^{2}
    \end{align*}$$`
]

.right-column[
<img src="2-Corr-n-Reg_files/figure-html/min_resid-1.png" width="100%" style="display: block; margin: auto;" />
]]

.panel[.panel-name[... 2]
.left-column[
- **_Best_ fitting line**:
`$$\begin{align*}
      \min \text{RSS} &= \min \sum_{i=1}^{n} e_{i}^{2} \\
      &= \min \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \\
      &= \min \sum_{i=1}^{n} (y_{i} - (\color{orange}{\alpha} + \color{orange}{\beta} x_{i}))^{2}
    \end{align*}$$`
]

.right-column[
<img src="https://i.redd.it/gyw14y0tvak21.gif" width="50%" style="display: block; margin: auto;" />
.backgrnote[.center[
*Source*: [aftersox on Reddit](https://www.reddit.com/r/dataisbeautiful/comments/axl1jm/oc_ordinary_least_squares_ols_finding_the_line/)
]]]]
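.panel[.panel-name[By hand]

For one predictor, the minimization has a closed-form solution: `$\beta = \text{Cov}(x,y)/\text{Var}(x)$` and `$\alpha = \bar{y} - \beta\bar{x}$`. This is a standard textbook result that the slides do not derive. The sketch below checks it on made-up toy data (the vectors `x` and `y` and the helper `rss()` are illustrative assumptions).

``` r
# Made-up toy data for illustration.
x <- c(0, 5, 10, 20, 40, 50)
y <- c(12, 14, 9, 11, 6, 8)

# Residual sum of squares for any candidate line.
rss <- function(alpha, beta) sum((y - (alpha + beta * x))^2)

beta  <- cov(x, y) / var(x)       # Slope that minimizes the RSS,
alpha <- mean(y) - beta * mean(x) # and the matching intercept.

fit <- lm(y ~ x)                  # R's built-in OLS for comparison.
all.equal(unname(coef(fit)), c(alpha, beta))

rss(alpha, beta) < rss(alpha, beta + 0.05) # Tilting the line increases the RSS,
rss(alpha, beta) < rss(alpha + 1, beta)    # and so does shifting it.
```
]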

.panel[.panel-name[R2 model fit]
.left-column[
How much smaller are the residuals from our model (blue line), compared to simply using the average `$\bar{y}$` (orange line)?
`$$\text{TSS}=\sum_{i=1}^{n}(y_i-\bar{y})^2$$`
`$$\text{RSS}=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$`
`$$R^2=\frac{\text{TSS} - \text{RSS}}{\text{TSS}}$$`
]
.right-column[
<img src="2-Corr-n-Reg_files/figure-html/R2-1.png" width="100%" style="display: block; margin: auto;" />
]]
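.panel[.panel-name[R2 by hand]

The three formulas can be checked step by step. This is a minimal sketch on made-up toy data (`x` and `y` are illustrative assumptions), using base R's `lm()` for the fitted values.

``` r
# Made-up toy data for illustration.
x <- c(0, 5, 10, 20, 40, 50)
y <- c(12, 14, 9, 11, 6, 8)
fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)     # Squared residuals around the average,
rss <- sum((y - fitted(fit))^2) # squared residuals around the regression line.
r2  <- (tss - rss) / tss

all.equal(r2, summary(fit)$r.squared) # Matches the R2 that R reports.
all.equal(r2, cor(x, y)^2)            # In bivariate OLS, R2 is the squared correlation.
```

The second check also holds for our data: `$(-0.089)^2 \approx 0.008$`.
]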

.panel[.panel-name[Regression using R]
.right-column[

``` r
ols <- lm_robust(data = Dat, formula = poverty ~ socialist)
zols <- lm_robust(data = Dat, formula = z_poverty ~ z_socialist)

modelsummary(list("OLS" = ols, "Std. OLS" = zols), # Nicely-formatted table,
             statistic = NULL, # Don't report stat. inference (yet),
             gof_map = c("nobs", "r.squared")) # Only 2 model-fit stats.
```

<img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-34-1.png" width="70%" style="display: block; margin: auto;" />
]

.left-column[
<br> 
<table class="table" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> OLS </th>
   <th style="text-align:center;"> Std. OLS </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 11.860 </td>
   <td style="text-align:center;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> socialist </td>
   <td style="text-align:center;"> −0.109 </td>
   <td style="text-align:center;">  </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1.5px"> z_socialist </td>
   <td style="text-align:center;box-shadow: 0px 1.5px">  </td>
   <td style="text-align:center;box-shadow: 0px 1.5px"> −0.089 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 149 </td>
   <td style="text-align:center;"> 149 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.008 </td>
   <td style="text-align:center;"> 0.008 </td>
  </tr>
</tbody>
</table>

]]
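.panel[.panel-name[Std. slope = r]

In the table, the standardized slope (−0.089) equals the correlation coefficient. The sketch below illustrates why on made-up toy data (`x` and `y` are illustrative assumptions). It uses base R's `lm()` instead of `lm_robust()`; both estimate the same OLS coefficients and differ only in the standard errors.

``` r
# Made-up toy data for illustration.
x <- c(0, 5, 10, 20, 40, 50)
y <- c(12, 14, 9, 11, 6, 8)

z_x <- as.numeric(scale(x))
z_y <- as.numeric(scale(y))

ols  <- lm(y ~ x)     # Slope in original units,
zols <- lm(z_y ~ z_x) # slope in standard deviations.

b_raw <- unname(coef(ols)["x"])
b_std <- unname(coef(zols)["z_x"])

all.equal(b_std, cor(x, y))              # Standardized slope equals r.
all.equal(b_std, b_raw * sd(x) / sd(y))  # It is the raw slope rescaled by SD(x)/SD(y).
```
]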

.panel[.panel-name[Interpretation]
.push-left[
<img src="2-Corr-n-Reg_files/figure-html/unnamed-chunk-35-1.png" width="100%" style="display: block; margin: auto;" />
]

.push-right[

$$
\operatorname{\widehat{poverty}} = 11.86 - 0.11(\operatorname{socialist})
$$
 `$\rightarrow$` the best-fitting line, that is, the line that minimizes `$\sum_{i=1}^{n} e_{i}^{2}$`.

Among countries without a socialist past, poverty is on average 11.86%.
  + `$(\hat{y}|\text{Socialism = 0}) = 11.86\%$`.
  
With every year of socialism, the average level of poverty is 0.11 percentage points lower.

This model accounts for 0.008*100% = 0.8% more of the variance of poverty across the world than the average `$\bar{y} = 11.15\%$` does.
]]]
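
---
# Using the fitted line .font70[Predictions by hand]

The equation can be used to calculate the predicted poverty rate for any length of socialist history. A minimal sketch, plugging in the coefficients reported in the regression table; the values in `years` are chosen for illustration only.

``` r
# Coefficients as reported in the regression table.
alpha <- 11.860
beta  <- -0.109

years <- c(0, 10, 30, 50)         # Illustrative values of socialist history.
predicted <- alpha + beta * years # Predicted poverty rate in %.

data.frame(years, predicted)
```

A country with 50 years of socialist history is thus predicted to have a poverty rate of about 6.4%, roughly 5.5 percentage points below a country without any socialist history.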

---
# Two types of interpretation

.left-column[
.center[**1. Causal**]

With every additional year of socialism, poverty is expected to decline by 0.11 percentage points. Thus, if China and Vietnam stay socialist, poverty will further decline.

.alert[Beware, this interpretation only holds under some conditions.]

I will teach you how to estimate regressions that have a causal interpretation later this semester!
]

.right-column[
.center[
**2. Descriptive: conditional means `$\bar{y}|x$`**
]

With every year of socialism, the average level of poverty is 0.11 percentage points lower.

Here regression is a (linear) model that describes the average of the outcome for different values of the predictor.

<img src="https://isem-cueb-ztian.github.io/Intro-Econometrics-2017/handouts/lecture_notes/lecture_6/figure/fig-4-4.png" width="75%" style="display: block; margin: auto;" />
.backgrnote[.center[
*Source*: [Zheng Tian](https://isem-cueb-ztian.github.io/Intro-Econometrics-2017/handouts/lecture_notes/lecture_6/lecture_6.html#org39dfbe6)
]]

]
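
---
# Conditional means .font70[A toy check]

The descriptive reading is easiest to verify with a binary predictor, where OLS reproduces the two group averages exactly. With a continuous predictor such as years of socialism, the line is a linear approximation of the conditional means. The data below are made up for illustration.

``` r
# Made-up data: a binary predictor (0 = no socialist history, 1 = socialist history).
x <- c(0, 0, 0, 1, 1, 1)
y <- c(10, 12, 14, 5, 9, 7)

fit <- lm(y ~ x)
mean_0 <- mean(y[x == 0]) # Average outcome where x = 0,
mean_1 <- mean(y[x == 1]) # average outcome where x = 1.

# Intercept = mean of group 0; slope = difference between the group means.
all.equal(unname(coef(fit)), c(mean_0, mean_1 - mean_0))
```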

---
class: inverse middle center
# Break

---
class: middle clear

.left-column[
<img src="https://cdn.dribbble.com/users/10549/screenshots/9916149/media/a9dbfea8e23e5b8e23db142528c3bc9f.png?compress=1&resize=1200x900&vertical=top" width="100%" style="display: block; margin: auto;" />

<img src="2-Corr-n-Reg_files/figure-html/citiz-sicial-corr-1.png" width="90%" style="display: block; margin: auto;" />
]

.right-column[
<br>
]

---
class: inverse
# Today's general lessons

1. R provides convenient access to a wide range of interesting data through APIs, allowing for easy downloading.
2. When datasets share common variables that uniquely identify cases, you can join them together, enabling fascinating analyses and excellent term papers!
3. It's always beneficial to create a scatter plot to visualize the relationship between the variables you wish to correlate.
4. Z-standardization aids interpretation and provides a common unit for different variables.
5. The correlation coefficient is a simple statistic that measures the strength of association between two variables.
6. Bivariate OLS regression, being a linear model, expresses an outcome variable as a linear function of a predictor.
7. The slope, denoted by β, indicates how average levels of the predicted variable (ŷ) change with a unit increase in the predictor (x).
8. OLS determines the linear model that best fits the data.
9. It is generally recommended not to interpret regression in causal terms, unless specific conditions are met.

---
class: inverse
# Today's (important) functions

1. `cor()`: Estimate correlation coefficient.
2. `estimatr::lm_robust()`: Estimate linear OLS regression (with robust standard errors, which matters when using weights).
3. `plot(model_object)` to test regression assumptions.
4. `inner_join()`, `left_join()`, `right_join()`, and `full_join()` allow you to join/merge different tibbles that have common observations and a key that identifies them.
5. `modelsummary()`: Create nicely-formatted (html, Word, ASCII, or Latex) tables of (one or several) regression models.
6. `scale()` z-standardizes variables. But it returns a one-column matrix rather than a vector. Therefore it makes sense to always code `scale(x) %>% as.numeric()` to ensure you get a numeric vector out of it.

---
# References

.font80[
De Veaux, R. D., P. F. Velleman, and D. E. Bock (2021). _Stats: Data and Models, Global Edition_. Pearson Higher Ed.
]