Introduction
这是Tidy Finance with R的学习记录,使用的R软件版本为4.2.1(2022-06-23),下载源码1可点击页面最顶端标题右侧”Code > Download Rmd”。
对内容有疑问或建议的小伙伴,欢迎在这里提交Issues或者Pull requests,帮助完善,在下先行谢过了。🤝🤝
1 Introduction to Tidy Finance
说到Tidy(整洁),首先加载两个tidy开头的包:
library(tidyverse)
library(tidyquant)
这里补充一下管道操作符%>%
的知识,其作用是将前一步生成的结果作为下一步函数的第一个参数2:
x %>% f()
等同于f(x)
x %>% f(y)
等同于f(x, y)
x %>% f() %>% g()
等同于g(f(x))
f(x) %>% g(y) %>% t(z)
等同于t(g(f(x), y), z)
是不是有点像”然后”的意思?它可以让写代码、读代码都安装从上到下、从左到右的顺序一步一步地进行,且不会生成很多的中间变量,大部分时候可以从数据到结果”一管到底”,大大增加数据分析的流畅性。快捷键为Ctrl + Shift + M
。在4.1以上的R版本中,自带了原生的管道:|>
,它的主要作用于%>%
大致相同,不过%>%
功能更丰富。
揣着这个效率神器,我们出发!🚀
1.1 单个股票分析
读取苹果公司股票价格数据,tq_get()
函数默认使用雅虎数据库,因此在国内需要连接外网后使用3。
price <- tq_get("AAPL", # Apple Stock
get = "stock.price",
from = "2000-01-01",
to = "2022-06-30")
price
#> # A tibble: 5,659 × 8
#> symbol date open high low close volume adjusted
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AAPL 2000-01-03 0.936 1.00 0.908 0.999 535796800 0.855
#> 2 AAPL 2000-01-04 0.967 0.988 0.903 0.915 512377600 0.782
#> 3 AAPL 2000-01-05 0.926 0.987 0.920 0.929 778321600 0.794
#> 4 AAPL 2000-01-06 0.948 0.955 0.848 0.848 767972800 0.725
#> 5 AAPL 2000-01-07 0.862 0.902 0.853 0.888 460734400 0.760
#> 6 AAPL 2000-01-10 0.911 0.913 0.846 0.873 505064000 0.746
#> 7 AAPL 2000-01-11 0.857 0.887 0.808 0.828 441548800 0.708
#> 8 AAPL 2000-01-12 0.848 0.853 0.772 0.778 976068800 0.666
#> 9 AAPL 2000-01-13 0.844 0.882 0.826 0.864 1032684800 0.739
#> 10 AAPL 2000-01-14 0.893 0.913 0.887 0.897 390376000 0.767
#> # … with 5,649 more rows
#> # ℹ Use `print(n = ...)` to see more rows
简单的可视化,呈现价格走势:
price %>%
ggplot(aes(date, adjusted)) +
geom_line() +
labs(x = NULL, y = NULL,
title = "AAPL stock prices",
subtitle = "Prices in USD, adjusted for dividend payments and stock splits")
根据以下公式,用调整后价格计算日收益率(Daily Return)。
\[\begin{equation} \text{daily return} = \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1 \end{equation}\]计算前需先将数据根据日期排序,由于2000-01-03为第一天,因此计算收益率时产生缺失值(NA),可以直接剔除。
returns <- price %>%
arrange(date) %>%
mutate(ret = adjusted / lag(adjusted) - 1) %>%
select(symbol, date, ret) %>%
drop_na(ret)
returns
#> # A tibble: 5,658 × 3
#> symbol date ret
#> <chr> <date> <dbl>
#> 1 AAPL 2000-01-04 -0.0843
#> 2 AAPL 2000-01-05 0.0146
#> 3 AAPL 2000-01-06 -0.0865
#> 4 AAPL 2000-01-07 0.0474
#> 5 AAPL 2000-01-10 -0.0176
#> 6 AAPL 2000-01-11 -0.0512
#> 7 AAPL 2000-01-12 -0.0600
#> 8 AAPL 2000-01-13 0.110
#> 9 AAPL 2000-01-14 0.0381
#> 10 AAPL 2000-01-18 0.0348
#> # … with 5,648 more rows
#> # ℹ Use `print(n = ...)` to see more rows
对日收益率进行汇总统计:
summary(returns$ret)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.519 -0.010 0.001 0.001 0.013 0.139
可以看到最高日收益率为13.905%,最低日收益达到-51.869%,而平均日收益率为0.123%,略高于0%。
绘制日收益率柱状图,更直观地展示数据。图中的红色竖线代表日收益率的5%分位数(-0.037)4。
quantile_05 <- quantile(returns$ret, .05)
returns %>%
ggplot(aes(ret)) +
geom_histogram(bins = 100) +
geom_vline(aes(xintercept = quantile_05),
color = "red", linetype = "dashed") +
scale_x_continuous(labels = scales::percent) +
labs(x = NULL, y = NULL,
title = "Distribution of daily AAPL return",
subtitle = "The dotted vertical line indicates the historical 5% quantile")
按年分组后,对日收益率进行分组统计:
returns %>%
mutate(ret = ret * 100) %>%
group_by(year = year(date)) %>%
summarise(across(ret,
list(
daily_mean = mean,
daily_sd = sd,
daily_min = min,
daily_max = max
),
.names = "{.fn}"
)) %>%
print(n = Inf)
#> # A tibble: 23 × 5
#> year daily_mean daily_sd daily_min daily_max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 -0.346 5.49 -51.9 13.7
#> 2 2001 0.233 3.93 -17.2 12.9
#> 3 2002 -0.121 3.05 -15.0 8.46
#> 4 2003 0.186 2.34 -8.14 11.3
#> 5 2004 0.470 2.55 -5.58 13.2
#> 6 2005 0.349 2.45 -9.21 9.12
#> 7 2006 0.0949 2.43 -6.33 11.8
#> 8 2007 0.366 2.38 -7.02 10.5
#> 9 2008 -0.265 3.67 -17.9 13.9
#> 10 2009 0.382 2.14 -5.02 6.76
#> 11 2010 0.183 1.69 -4.96 7.69
#> 12 2011 0.104 1.65 -5.59 5.89
#> 13 2012 0.130 1.86 -6.44 8.87
#> 14 2013 0.0472 1.80 -12.4 5.14
#> 15 2014 0.145 1.36 -7.99 8.20
#> 16 2015 0.00199 1.68 -6.12 5.74
#> 17 2016 0.0575 1.47 -6.57 6.50
#> 18 2017 0.164 1.11 -3.88 6.10
#> 19 2018 -0.00573 1.81 -6.63 7.04
#> 20 2019 0.266 1.65 -9.96 6.83
#> 21 2020 0.281 2.94 -12.9 12.0
#> 22 2021 0.131 1.58 -4.17 5.39
#> 23 2022 -0.170 2.26 -5.64 6.98
1.2 多个股票分析
基于整洁数据(Wickham 2014)的特点,可以很方便地将单个股票的分析拓展到多个股票。接下来分析目前道琼斯工业指数中所有的成分股:
ticker <- tq_index("DOW") # 获取成分股信息
ticker
#> # A tibble: 30 × 8
#> symbol company ident…¹ sedol weight sector share…² local…³
#> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 UNH UnitedHealth Group Incorp… 91324P… 2917… 0.110 Healt… 5706213 USD
#> 2 GS Goldman Sachs Group Inc. 38141G… 2407… 0.0667 Finan… 5706213 USD
#> 3 HD Home Depot Inc. 437076… 2434… 0.0607 Consu… 5706213 USD
#> 4 MSFT Microsoft Corporation 594918… 2588… 0.0560 Infor… 5706213 USD
#> 5 MCD McDonald's Corporation 580135… 2550… 0.0534 Consu… 5706213 USD
#> 6 AMGN Amgen Inc. 031162… 2023… 0.0506 Healt… 5706213 USD
#> 7 V Visa Inc. Class A 92826C… B2PZ… 0.0428 Infor… 5706213 USD
#> 8 HON Honeywell International I… 438516… 2020… 0.0386 Indus… 5706213 USD
#> 9 CAT Caterpillar Inc. 149123… 2180… 0.0381 Indus… 5706213 USD
#> 10 CRM Salesforce Inc. 79466L… 2310… 0.0367 Infor… 5706213 USD
#> # … with 20 more rows, and abbreviated variable names ¹identifier,
#> # ²shares_held, ³local_currency
#> # ℹ Use `print(n = ...)` to see more rows
获取所有成分股在2000-01-01到2022-06-30的股票信息:
index_prices <- tq_get(ticker,
get = "stock.prices",
from = "2000-01-01",
to = "2022-06-30")
index_prices
#> # A tibble: 161,753 × 15
#> symbol company ident…¹ sedol weight sector share…² local…³ date open
#> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <date> <dbl>
#> 1 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-03 6.64
#> 2 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-04 6.67
#> 3 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-05 6.64
#> 4 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-06 6.62
#> 5 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-07 7.19
#> 6 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-10 7.73
#> 7 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-11 7.38
#> 8 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-12 7.48
#> 9 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-13 7.52
#> 10 UNH UnitedHe… 91324P… 2917… 0.110 Healt… 5706213 USD 2000-01-14 7.75
#> # … with 161,743 more rows, 5 more variables: high <dbl>, low <dbl>,
#> # close <dbl>, volume <dbl>, adjusted <dbl>, and abbreviated variable names
#> # ¹identifier, ²shares_held, ³local_currency
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
共获得了来自30家公司的161753行数据,对调整价格进行可视化,查看总体走势:
index_prices %>%
ggplot(aes(date, adjusted, color = symbol)) +
geom_line() +
labs(x = NULL, y = NULL, color = NULL,
title = "DOW index stock prices",
subtitle = "Prices in USD, adjusted for dividend payments and stock splits") +
theme(legend.position = "none")
对30家公司的日收益率进行汇总统计:
all_returns <- index_prices %>%
group_by(symbol) %>%
mutate(ret = adjusted / lag(adjusted) - 1) %>%
select(symbol, date, ret) %>%
drop_na(ret) %>%
ungroup() # 解除分组
all_returns %>%
mutate(ret = ret * 100) %>%
group_by(symbol) %>%
summarise(across(
ret,
list(
daily_mean = mean,
daily_sd = sd,
daily_min = min,
daily_max = max
),
.names = "{.fn}"
))
#> # A tibble: 30 × 5
#> symbol daily_mean daily_sd daily_min daily_max
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 AAPL 0.123 2.52 -51.9 13.9
#> 2 AMGN 0.0483 1.98 -13.4 15.1
#> 3 AXP 0.0514 2.30 -17.6 21.9
#> 4 BA 0.0544 2.23 -23.8 24.3
#> 5 CAT 0.0670 2.04 -14.5 14.7
#> 6 CRM 0.117 2.69 -27.1 26.0
#> 7 CSCO 0.0300 2.39 -16.2 24.4
#> 8 CVX 0.0523 1.76 -22.1 22.7
#> 9 DIS 0.0437 1.93 -18.4 16.0
#> 10 DOW 0.0625 2.69 -21.7 20.9
#> # … with 20 more rows
#> # ℹ Use `print(n = ...)` to see more rows
到此为止,就可以下载和处理任意股票组合的价格数据了。比如用ticker <- tq_index("SP500")
修改ticker
为标普500的成分股信息,就可以分析标普500中股票价格数据。因为有很多数据需要下载,这个操作比较耗时,这里不演示。
1.3 其他数据汇总形式
道琼斯总交易量:
volume <- index_prices %>%
mutate(volume_usd = volume * close / 1e9) %>%
group_by(date) %>%
summarise(volume = sum(volume_usd))
volume %>%
ggplot(aes(date, volume)) +
geom_line() +
labs(x = NULL, y = NULL,
title = "Aggregate daily trading volume in billion USD")
可利用45°线比较第\(t\)天与第\(t-1\)天的交易量:
volume %>%
ggplot(aes(lag(volume), volume)) +
geom_point() +
geom_abline(aes(intercept = 0, slope = 1),
linetype = "dotted", color = "red",
size = .8) +
labs(x = "Previous day aggregate trading volume (billion USD)",
y = "Aggregate trading volume (billion USD)",
title = "Trading volume on DOW Index versus previous day volume")
纯纯目测可以发现:交易量大的日子后面往往也是交易量大的日子。
我看45°线
将股票价格记作\(p\),这里的\(x\)轴为\(p_{t-1}\),\(y\)轴为\(p_t\),在一阶自回归模型中,\(\beta_1\)(线性回归系数)相比45°线可以显示更多信息。后面作者讲深入应该会涉及。
1.4 投资组合问题
最佳投资组合追求高回报、低风险:
The standard framework for optimal portfolio selection considers investors that prefer higher future returns but dislike future return volatility (defined as the square root of the return variance): the mean-variance investor.
# 清洗掉行数较少的企业的所有股票
index_prices <- index_prices %>%
group_by(symbol) %>%
mutate(n = n()) %>%
ungroup() %>%
filter(n == max(n)) %>%
select(-n)
# 计算每个企业每月的回报率
returns <- index_prices %>%
mutate(month = floor_date(date, "month")) %>%
group_by(symbol, month) %>%
summarise(price = last(adjusted), .groups = "drop_last") %>%
mutate(ret = price / lag(price) - 1) %>%
drop_na(ret) %>%
select(-price)
将收益率转为\((T \times N)\)的矩阵,计算样本回报率均值矩阵\(\hat{\mu}\)和样本协方差矩阵\(\hat{\Sigma}\): \[\hat\mu = \frac{1}{T}\sum\limits_{t=1}^T r_t\] \[\hat\Sigma = \frac{1}{T-1}\sum\limits_{t=1}^T (r_t - \hat\mu)(r_t - \hat\mu)'\]
return_matrix <- returns %>%
pivot_wider(names_from = symbol,
values_from = ret) %>%
select(-month)
mu <- colMeans(return_matrix)
Sigma <- cov(return_matrix)
难点警告
下面的内容理解还不透彻,先记录。
接下来计算最小方差的投资组合权重\(\omega_\text{mvp}\)和它的预期收益\(\omega_\text{mvp}'\mu\)和波动率\(\sqrt{\omega_\text{mvp}'\Sigma\omega_\text{mvp}}\)。\(\omega_\text{mvp}\)是以下问题的解: \[\omega_\text{mvp} = \arg\min w'\Sigma w \\ \text{ s.t. } \sum\limits_{i=1}^Nw_i = 1\]
N <- ncol(return_matrix)
iota <- rep(1, N)
mvp_weights <- solve(Sigma) %*% iota # solve用于求逆
mvp_weights <- mvp_weights / sum(mvp_weights)
tibble(
expected_ret = t(mvp_weights) %*% mu,
volatility = sqrt(t(mvp_weights) %*% Sigma %*% mvp_weights)
)
#> # A tibble: 1 × 2
#> expected_ret[,1] volatility[,1]
#> <dbl> <dbl>
#> 1 0.00801 0.0314
尝试寻找一个可以获得三倍于最小方差组合的预期收益的投资组合权重:
mu_bar <- 3 * t(mvp_weights) %*% mu
C <- as.numeric(t(iota) %*% solve(Sigma) %*% iota)
D <- as.numeric(t(iota) %*% solve(Sigma) %*% mu)
E <- as.numeric(t(mu) %*% solve(Sigma) %*% mu)
lambda_tilde <- as.numeric(2 * (mu_bar - D / C) / (E - D^2 / C))
efp_weights <- mvp_weights +
lambda_tilde / 2 * (solve(Sigma) %*% mu - D * mvp_weights)
1.5 有效边界
只要有两个有效的投资组合,就可以通过它们权重的线性组合得到有效边界。有效边界描述了在每个风险水平下可以实现的最高预期收益。
c <- seq(from = -.4, to = 1.9, by = .01)
res <- tibble(
c = c,
mu = NA,
sd = NA
)
for (i in seq_along(c)) {
w <- (1 - c[i]) * mvp_weights + (c[i]) * efp_weights
res$mu[i] <- 12 * 100 * t(w) %*% mu
res$sd[i] <- 12 * sqrt(100) * sqrt(t(w) %*% Sigma %*% w)
}
res %>%
ggplot(aes(sd, mu)) +
geom_point() +
geom_point(
data = res %>% filter(c %in% c(0, 1)),
size = 4
) +
geom_point(
data = tibble(
mu = 12 * 100 * mu,
sd = 12 * 10 * sqrt(diag(Sigma))
),
aes(sd, mu), size = 1, color = "blue"
) +
labs(
x = "Annualized standard deviation (in percent)",
y = "Annualized expected return (in percent)",
title = "Efficient frontier for Dow Jones constituents",
subtitle = "Thick dots indicate the location of the minimum variance and efficient tangency portfolio"
)
2 访问和处理金融数据
2.1 Fama-French data
library(lubridate)
library(scales)
library(frenchdata)
定义获取和储存的金融数据的起止日期,方便未来更新:
start_date <- ymd("1960-01-01")
end_date <- ymd("2020-12-31")
frenchdata
包提供了从Prof. Kenneth French金融数据库下载和读取数据集的功能。
# 月度数据
factors_ff_monthly_raw <- download_french_data("Fama/French 3 Factors")
factors_ff_monthly <- factors_ff_monthly_raw$subsets$data[[1]] %>%
transmute(
month = floor_date(ymd(str_c(date, "01")), "month"),
rf = as.numeric(RF) / 100,
mkt_excess = as.numeric(`Mkt-RF`) / 100,
smb = as.numeric(SMB) / 100,
hml = as.numeric(HML) / 100
) %>%
filter(month >= start_date & month <= end_date)
# 每日数据
factors_ff_daily_raw <- download_french_data("Fama/French 3 Factors [Daily]")
factors_ff_daily <- factors_ff_daily_raw$subsets$data[[1]] |>
transmute(
date = ymd(date),
rf = as.numeric(RF) / 100,
mkt_excess = as.numeric(`Mkt-RF`) / 100,
smb = as.numeric(SMB) / 100,
hml = as.numeric(HML) / 100
) |>
filter(date >= start_date & date <= end_date)
industries_ff_monthly_raw <- download_french_data("10 Industry Portfolios")
industries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>
mutate(month = floor_date(ymd(str_c(date, "01")), "month")) |>
mutate(across(where(is.numeric), ~ . / 100)) |>
select(month, everything(), -date) |>
filter(month >= start_date & month <= end_date)
可以用get_french_data_list()
查看Kenneth French主页上的投资组合汇报时间序列数据。
2.2 q-facrors
factors_q_monthly_link <-
"http://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2020.csv"
factors_q_monthly <- read_csv(factors_q_monthly_link) |>
mutate(month = ymd(str_c(year, month, "01", sep = "-"))) |>
select(-R_F, -R_MKT, -year) |>
rename_with(~ str_remove(., "R_")) |>
rename_with(~ str_to_lower(.)) |>
mutate(across(-month, ~ . / 100)) |>
filter(month >= start_date & month <= end_date)
#> Rows: 648 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (8): year, month, R_F, R_MKT, R_ME, R_IA, R_ROE, R_EG
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2.3 宏观经济预测因素
library(readxl)
library(googledrive)
googledrive
包用于连接谷歌硬盘。通常需要认证后才能用R与谷歌硬盘交互,而读取公共链接存储的数据,无需认证。
drive_deauth()
macro_predictors_link <-
"https://drive.google.com/file/d/1ACbhdnIy0VbCWgsnXkjcddiV8HF4feWv/view"
drive_download(
macro_predictors_link,
path = "data/macro_predictors.xlsx"
)
#> File downloaded:
#> • 'PredictorData2020.xlsx' <id: 1ACbhdnIy0VbCWgsnXkjcddiV8HF4feWv>
#> Saved locally as:
#> • 'data/macro_predictors.xlsx'
数据被放置在了当前工作目录下的data文件夹5。使用readxl
包进行读取:
macro_predictors <- read_xlsx("data/macro_predictors.xlsx",
sheet = "Monthly"
) %>%
mutate(month = ym(yyyymm)) %>%
filter(month >= start_date & month <= end_date) %>%
mutate(across(where(is.character), as.numeric)) %>%
mutate(
IndexDiv = Index + D12,
logret = log(IndexDiv) - log(lag(IndexDiv)),
Rfree = log(Rfree + 1),
rp_div = lead(logret - Rfree, 1), # Future excess market return
dp = log(D12) - log(Index), # Dividend Price ratio
dy = log(D12) - log(lag(Index)), # Dividend yield
ep = log(E12) - log(Index), # Earnings price ratio
de = log(D12) - log(E12), # Dividend payout ratio
tms = lty - tbl, # Term spread
dfy = BAA - AAA # Default yield spread
) %>%
select(month, rp_div, dp, dy, ep, de, svar,
bm = `b/m`, ntis, tbl, lty, ltr,
tms, dfy, infl
) %>%
drop_na()
读取数据至内存后,可以删除下载下来的文件:
file.remove("data/macro_predictors.xlsx")
#> [1] TRUE
2.4 其他宏观经济数据
利用tidyquant
包从圣路易斯联邦储备银行提供的联邦储备经济数据(FRED)中下载数据,例如下载CPI数据:
cpi_monthly <- tq_get("CPIAUCNS",
get = "economic.data",
from = start_date,
to = end_date)