Code
library(tidyverse)
library(lubridate)
theme_set(theme_light())
library(OECD)
library(countrycode)
library(janitor)
library(ggrepel)
Mitsuo Shiota
April 27, 2023
June 14, 2023
The headline of Asahi Shinbun article on April 27, 2023 (Japanese) caught my eye, which is “5 trillion increase in public expenditure for family is estimated to increase population by 0.9 to 1.8 million persons in 2060.” This article let me know the estimate on page 3 titled “Possible utilization of EBPM to analyze policy effects” in the Economic and Fiscal Advisory Council material presented by 4 private-sector members on April 26, 2023 (Japanese).
It shows the figure below, and warns us that correlation is not causality in the footnote. Title is “Correlation between public expenditure for family and fertility rate in OECD countries.” X-axis is percent ratio of public expenditure for family to GDP in 2017, and Y-axis is fertility rate in 2021. Red point is Japan. I find 34 points in the figure.
Then, they make a causal statement, “1 percent of GDP (5 trillion yen) increase in public expenditure for family leads to 0.05 to 0.1 increase in fertility rate. This estimate is based on 30 OECD countries data since 2010”.
I have got confused. Do they think this figure, which does not show confidence interval, suggests correlation? And, even if it does, do they think they can morph correlation into causality by more samples from multiple years? So I have decided to get data from OECD.stat.
# OECD 38 countries
oecd_countries <- c("ISR", "LUX", "NLD", "GRC", "ISL", "LVA", "SVN", "FRA", "POL", "TUR", "ESP", "BEL", "MEX", "IRL", "CHE", "FIN", "CHL", "CAN", "KOR", "NZL", "AUS", "HUN", "USA", "GBR", "AUT", "SWE", "DNK", "SVK", "CZE", "COL", "JPN", "NOR","CRI", "LTU", "DEU", "EST", "PRT", "ITA")
# "OAVG" OECD average
family <- get_dataset(
"FAMILY",
filter = list(c(oecd_countries, "OAVG"), "TOTAL", c("FAM1", "FAM3")),
start_time = 1990, end_time = 2021
) |>
clean_names() |>
mutate(
obs_value = parse_number(obs_value),
time = parse_number(time)
)
# "OECD" OECD average
family_expenditure <- get_dataset(
"SOCX_AGG",
filter = list("10", "5", "0", "0", "PCT_GDP", c(oecd_countries, "OECD")),
start_time = 1980, end_time = 2021
) |>
clean_names() |>
mutate(
obs_value = if_else(obs_value == "NaN", NA, obs_value),
obs_value = parse_number(obs_value),
time = parse_number(time)
)
# fertility rate vs public family expenditure; point
fer_vs_exp <- family |>
filter(ind == "FAM1", time == 2021) |>
select(cou, fertility = obs_value) |>
mutate(cou = if_else(cou == "OAVG", "OECD", cou)) |>
left_join(family_expenditure |>
filter(time == 2017) |>
select(cou = country, fam_exp = obs_value),
by = "cou") |>
mutate(
country = if_else(cou == "OECD",
"OECD average",
countrycode(cou, "iso3c", "country.name"))
)
fer_vs_exp2 <- fer_vs_exp |>
filter(cou != "OECD")
I plot the replicated chart below with confidence interval. There are 38 points, as this chart includes all 38 OECD countries. It is uncertain whether the slope is positive or negative.
fer_vs_exp2 |>
ggplot(aes(fam_exp, fertility)) +
geom_point(aes(color = cou == "JPN"), show.legend = FALSE) +
geom_smooth(method = "lm", formula = y ~ x) +
geom_text_repel(aes(color = cou == "JPN", label = country),
show.legend = FALSE) +
scale_color_manual(values = c("gray50", "red")) +
coord_cartesian(xlim = c(0, 4), ylim = c(0, 3)) +
labs(x = "Ratio of public expenditure for family to GDP\n(percent, 2017)",
y = "Fertility rate\n(percent, 2021)",
title = "No clear correlation between public expenditure for family\nand fertility rate in one year",
caption = "Source: OECD.stat")
It is difficult to say there is a positive or negative correlation.
Call:
lm(formula = fertility ~ fam_exp, data = fer_vs_exp2)
Residuals:
Min 1Q Median 3Q Max
-0.70589 -0.13334 -0.02066 0.09695 1.41083
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.44241 0.14142 10.200 3.65e-12 ***
fam_exp 0.06662 0.06362 1.047 0.302
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3104 on 36 degrees of freedom
Multiple R-squared: 0.02956, Adjusted R-squared: 0.002604
F-statistic: 1.097 on 1 and 36 DF, p-value: 0.302
I turn to multiple years data from 2010 to 2021 for fertility rate, and 2006 to 2017 for public expenditure for family, as I guess they assume 4 year ago public expenditure for family effects fertility rate this year. Confidence interval is now much narrower, as the sample size is larger.
family_expenditure_4yr_ago <- family_expenditure |>
select(cou = country, time, fam_exp = obs_value) |>
mutate(time = time + 4)
fer_vs_exp_multi <- family |>
filter(ind == "FAM1") |>
select(cou, time, fertility = obs_value) |>
mutate(cou = if_else(cou == "OAVG", "OECD", cou)) |>
left_join(family_expenditure_4yr_ago,
by = c("cou", "time")) |>
mutate(
country = if_else(cou == "OECD",
"OECD average",
countrycode(cou, "iso3c", "country.name"))
)
fer_vs_exp_multi2 <- fer_vs_exp_multi |>
filter(cou != "OECD", time >= 2010, time <= 2021)
fer_vs_exp_multi2 |>
ggplot(aes(fam_exp, fertility)) +
geom_path(aes(color = country)) +
geom_smooth(method = "lm", formula = y ~ x) +
coord_cartesian(xlim = c(0, 4), ylim = c(0, 3)) +
labs(x = "Ratio of public expenditure for family to GDP\n(percent, 4 years ago)",
y = "Fertility rate\n(percent, current year)",
color = NULL,
title = "Apparent positive correlation between public expenditure for family and\nfertility rate from 2010-2021 data. Is it real?",
subtitle = "Each colored line is a country.",
caption = "Source: OECD.stat") +
theme(legend.position = "none")
The slope is estimated as 0.047, which is close to their assertion of 0.05 to 0.1 based on 30 countries. P-value is less than 0.01.
Call:
lm(formula = fertility ~ fam_exp, data = fer_vs_exp_multi2)
Residuals:
Min 1Q Median 3Q Max
-0.79849 -0.20412 -0.05127 0.12143 1.46186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.55612 0.03751 41.48 < 2e-16 ***
fam_exp 0.04748 0.01649 2.88 0.00416 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3288 on 445 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.0183, Adjusted R-squared: 0.0161
F-statistic: 8.296 on 1 and 445 DF, p-value: 0.004164
Should I agree with them? Absolutely not. Each colored line does not show a positive slope. The positive slope of the fitted line comes mainly from cross-country differences. And there is no positive correlation between public spending for family and fertility rate either in Japan or OECD average in the figure below.
oecd_jpn <- fer_vs_exp_multi |>
filter(cou %in% c("OECD", "JPN")) |>
filter(!is.na(fam_exp), !is.na(fertility))
oecd_jpn |>
ggplot(aes(fam_exp, fertility)) +
geom_path(aes(color = country)) +
geom_label_repel(aes(label = time),
data = oecd_jpn |>
filter(time %in% c(1990, 2021))) +
coord_cartesian(xlim = c(0, 4), ylim = c(0, 3)) +
labs(x = "Ratio of public expenditure for family to GDP\n(percent, 4 years ago)",
y = "Fertility rate\n(percent, current year)",
color = NULL,
title = "No positive correlation over years between public spending for\nfamily and fertility rate either in Japan or OECD average",
caption = "Source: OECD.stat") +
theme(legend.position = "top")
Still they can argue fertility rate would have fallen deeper without an increase in public expenditure for family. This counterfactual argument should be based on some assumed causal mechanism. Increasing the number of samples does not morph correlation into causality, anyway. Do not try to increase the number of samples in a weird way.
---
title: Will higher public spending for family lead to higher fertility rate?
author: Mitsuo Shiota
date: '2023-04-27'
date-modified: '2023-06-14'
categories:
- economics
- R
- Media
knitr:
opts_chunk:
out.width: '70%'
---
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(lubridate)
theme_set(theme_light())
library(OECD)
library(countrycode)
library(janitor)
library(ggrepel)
```
## Assertions in the Economic and Fiscal Advisory Council material
The headline of [Asahi Shinbun article on April 27, 2023 (Japanese)](https://www.asahi.com/articles/ASR4V72G8R4VULFA01D.html) caught my eye, which is "5 trillion increase in public expenditure for family is estimated to increase population by 0.9 to 1.8 million persons in 2060." This article let me know the estimate on page 3 titled "Possible utilization of EBPM to analyze policy effects" in [the Economic and Fiscal Advisory Council material presented by 4 private-sector members on April 26, 2023 (Japanese)](https://www5.cao.go.jp/keizai-shimon/kaigi/minutes/2023/0426/shiryo_01_2.pdf).
It shows the figure below, and warns us that correlation is not causality in the footnote. Title is "Correlation between public expenditure for family and fertility rate in OECD countries." X-axis is percent ratio of public expenditure for family to GDP in 2017, and Y-axis is fertility rate in 2021. Red point is Japan. I find 34 points in the figure.
![Quoted figure titled correlation between public expenditure for family and fertility rate](images/Screenshot%20from%202023-04-28%2009-05-19.png){#fig-screenshot fig-alt="Quoted figure titled correlation between public expenditure for family and fertility rate" fig-align="center" width="70%"}
Then, they make a causal statement, "1 percent of GDP (5 trillion yen) increase in public expenditure for family leads to 0.05 to 0.1 increase in fertility rate. This estimate is based on 30 OECD countries data since 2010".
I have got confused. Do they think this figure, which does not show confidence interval, suggests correlation? And, even if it does, do they think they can morph correlation into causality by more samples from multiple years? So I have decided to get data from OECD.stat.
## Replicate the quoted figure
```{r}
#| label: get_data
#| warning: false
# OECD 38 countries
oecd_countries <- c("ISR", "LUX", "NLD", "GRC", "ISL", "LVA", "SVN", "FRA", "POL", "TUR", "ESP", "BEL", "MEX", "IRL", "CHE", "FIN", "CHL", "CAN", "KOR", "NZL", "AUS", "HUN", "USA", "GBR", "AUT", "SWE", "DNK", "SVK", "CZE", "COL", "JPN", "NOR","CRI", "LTU", "DEU", "EST", "PRT", "ITA")
# "OAVG" OECD average
family <- get_dataset(
"FAMILY",
filter = list(c(oecd_countries, "OAVG"), "TOTAL", c("FAM1", "FAM3")),
start_time = 1990, end_time = 2021
) |>
clean_names() |>
mutate(
obs_value = parse_number(obs_value),
time = parse_number(time)
)
# "OECD" OECD average
family_expenditure <- get_dataset(
"SOCX_AGG",
filter = list("10", "5", "0", "0", "PCT_GDP", c(oecd_countries, "OECD")),
start_time = 1980, end_time = 2021
) |>
clean_names() |>
mutate(
obs_value = if_else(obs_value == "NaN", NA, obs_value),
obs_value = parse_number(obs_value),
time = parse_number(time)
)
# fertility rate vs public family expenditure; point
fer_vs_exp <- family |>
filter(ind == "FAM1", time == 2021) |>
select(cou, fertility = obs_value) |>
mutate(cou = if_else(cou == "OAVG", "OECD", cou)) |>
left_join(family_expenditure |>
filter(time == 2017) |>
select(cou = country, fam_exp = obs_value),
by = "cou") |>
mutate(
country = if_else(cou == "OECD",
"OECD average",
countrycode(cou, "iso3c", "country.name"))
)
fer_vs_exp2 <- fer_vs_exp |>
filter(cou != "OECD")
```
I plot the replicated chart below with confidence interval. There are 38 points, as this chart includes all 38 OECD countries. It is uncertain whether the slope is positive or negative.
```{r}
#| label: fig-fer_vs_exp2
#| fig-cap: No clear correlation in one year
#| fig-align: center
#| fig-asp: 1
#| warning: false
fer_vs_exp2 |>
ggplot(aes(fam_exp, fertility)) +
geom_point(aes(color = cou == "JPN"), show.legend = FALSE) +
geom_smooth(method = "lm", formula = y ~ x) +
geom_text_repel(aes(color = cou == "JPN", label = country),
show.legend = FALSE) +
scale_color_manual(values = c("gray50", "red")) +
coord_cartesian(xlim = c(0, 4), ylim = c(0, 3)) +
labs(x = "Ratio of public expenditure for family to GDP\n(percent, 2017)",
y = "Fertility rate\n(percent, 2021)",
title = "No clear correlation between public expenditure for family\nand fertility rate in one year",
caption = "Source: OECD.stat")
```
It is difficult to say there is a positive or negative correlation.
```{r}
#| label: fer_exp_model
fer_exp_model <- lm(fertility ~ fam_exp, data = fer_vs_exp2)
summary(fer_exp_model)
```
## Narrower confidence interval from multiple years data?
I turn to multiple years data from 2010 to 2021 for fertility rate, and 2006 to 2017 for public expenditure for family, as I guess they assume 4 year ago public expenditure for family effects fertility rate this year. Confidence interval is now much narrower, as the sample size is larger.
```{r}
#| label: fig-fer_vs_exp_multi
#| fig-cap: Apparent positive correlation in 2010-2021
#| fig-align: center
#| fig-asp: 1
#| warning: false
family_expenditure_4yr_ago <- family_expenditure |>
select(cou = country, time, fam_exp = obs_value) |>
mutate(time = time + 4)
fer_vs_exp_multi <- family |>
filter(ind == "FAM1") |>
select(cou, time, fertility = obs_value) |>
mutate(cou = if_else(cou == "OAVG", "OECD", cou)) |>
left_join(family_expenditure_4yr_ago,
by = c("cou", "time")) |>
mutate(
country = if_else(cou == "OECD",
"OECD average",
countrycode(cou, "iso3c", "country.name"))
)
fer_vs_exp_multi2 <- fer_vs_exp_multi |>
filter(cou != "OECD", time >= 2010, time <= 2021)
fer_vs_exp_multi2 |>
ggplot(aes(fam_exp, fertility)) +
geom_path(aes(color = country)) +
geom_smooth(method = "lm", formula = y ~ x) +
coord_cartesian(xlim = c(0, 4), ylim = c(0, 3)) +
labs(x = "Ratio of public expenditure for family to GDP\n(percent, 4 years ago)",
y = "Fertility rate\n(percent, current year)",
color = NULL,
title = "Apparent positive correlation between public expenditure for family and\nfertility rate from 2010-2021 data. Is it real?",
subtitle = "Each colored line is a country.",
caption = "Source: OECD.stat") +
theme(legend.position = "none")
```
```{r}
#| label: fer_exp_multi_model
fer_exp_multi_model <- lm(fertility ~ fam_exp, data = fer_vs_exp_multi2)
```
The slope is estimated as `r coef(fer_exp_multi_model)[2] |> round(3)`, which is close to their assertion of 0.05 to 0.1 based on 30 countries. P-value is less than 0.01.
```{r}
#| label: fer_exp_multi_model_summary
summary(fer_exp_multi_model)
```
## It's a bad idea to treat cross-country multiple-year data as independent observations
Should I agree with them? Absolutely not. Each colored line does not show a positive slope. The positive slope of the fitted line comes mainly from cross-country differences. And there is no positive correlation between public spending for family and fertility rate either in Japan or OECD average in the figure below.
```{r}
#| label: fig-oecd_jpn
#| fig-cap: 1990-2021 in Japan and OECD average
#| fig-align: center
#| fig-asp: 1
oecd_jpn <- fer_vs_exp_multi |>
filter(cou %in% c("OECD", "JPN")) |>
filter(!is.na(fam_exp), !is.na(fertility))
oecd_jpn |>
ggplot(aes(fam_exp, fertility)) +
geom_path(aes(color = country)) +
geom_label_repel(aes(label = time),
data = oecd_jpn |>
filter(time %in% c(1990, 2021))) +
coord_cartesian(xlim = c(0, 4), ylim = c(0, 3)) +
labs(x = "Ratio of public expenditure for family to GDP\n(percent, 4 years ago)",
y = "Fertility rate\n(percent, current year)",
color = NULL,
title = "No positive correlation over years between public spending for\nfamily and fertility rate either in Japan or OECD average",
caption = "Source: OECD.stat") +
theme(legend.position = "top")
```
## Causality does not come from data
Still they can argue fertility rate would have fallen deeper without an increase in public expenditure for family. This counterfactual argument should be based on some assumed causal mechanism. Increasing the number of samples does not morph correlation into causality, anyway. Do not try to increase the number of samples in a weird way.